As Dan mentioned on Monday, Tracker uses an architecture in which a large number of polling requests are serviced by web servers communicating with an in-memory cache, and over the past few weeks we’ve hit a number of resource limits on our cache servers. Yesterday we found a small change that had a big impact on this resource load.
After moving the cache to dedicated servers earlier in the week, we continued to encounter network-related resource problems. One of the limits was on the total number of open network connections, which was reaching over 100,000 during peak usage. We reasoned that this large number was due to a new connection being created for every cache lookup. We also knew (from discussing WebSockets, among other things) that for small messages the overhead of opening and closing a connection is much larger than the message itself. Once we started looking for a way to make these connections persistent, the answer wasn’t hard to find.
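To make the difference concrete, here is a minimal sketch in Python. It is not Tracker's code or its cache client. It is a toy line-based "cache" server that counts the connections it accepts, so the two lookup styles can be compared side by side. All of the names in it (`ToyCacheServer`, `PersistentCacheClient`, `count_connections`, and so on) are hypothetical and exist only for illustration.

```python
import socket
import socketserver
import threading


class CountingHandler(socketserver.StreamRequestHandler):
    """Toy cache protocol: for each line 'key', reply 'VALUE:key'."""

    def handle(self):
        # handle() runs once per accepted connection, so count it here.
        with self.server.lock:
            self.server.connections_accepted += 1
        # Keep answering on the same connection until the client disconnects.
        for line in self.rfile:
            key = line.strip().decode()
            self.wfile.write(f"VALUE:{key}\n".encode())


class ToyCacheServer(socketserver.ThreadingTCPServer):
    """Hypothetical stand-in for a cache server; not a real cache."""

    allow_reuse_address = True
    daemon_threads = True

    def __init__(self, address):
        super().__init__(address, CountingHandler)
        self.connections_accepted = 0
        self.lock = threading.Lock()


def lookup_with_new_connection(address, key):
    """Open a connection, do one lookup, close it again."""
    with socket.create_connection(address) as sock, sock.makefile("rb") as reader:
        sock.sendall(f"{key}\n".encode())
        return reader.readline().strip().decode()


class PersistentCacheClient:
    """Open one connection up front and reuse it for every lookup."""

    def __init__(self, address):
        self._sock = socket.create_connection(address)
        self._reader = self._sock.makefile("rb")

    def lookup(self, key):
        self._sock.sendall(f"{key}\n".encode())
        return self._reader.readline().strip().decode()

    def close(self):
        self._reader.close()
        self._sock.close()


def count_connections(lookups, persistent):
    """Run `lookups` cache lookups and return how many connections the server saw."""
    server = ToyCacheServer(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    address = server.server_address
    try:
        if persistent:
            client = PersistentCacheClient(address)
            try:
                for i in range(lookups):
                    assert client.lookup(f"key{i}") == f"VALUE:key{i}"
            finally:
                client.close()
        else:
            for i in range(lookups):
                assert lookup_with_new_connection(address, f"key{i}") == f"VALUE:key{i}"
        return server.connections_accepted
    finally:
        server.shutdown()
        server.server_close()
```

In this sketch, `count_connections(50, persistent=False)` makes the server accept one connection per lookup, while `count_connections(50, persistent=True)` serves all fifty lookups over a single connection. A real web server would more likely keep a small pool of such connections per process rather than exactly one, but the effect on the total connection count is the same in kind.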
We rolled out this change yesterday; today network connections to the cache peaked at around 3,000. The reduction in network traffic between the web and cache servers should keep us well clear of the kinds of resource limits we’ve recently encountered. We’re still very interested in moving to a WebSockets architecture in the future; in the meantime, this change stabilizes our environment and gives us some room to grow.