Pivotal Tracker is a critical tool for thousands of teams, and providing a stable, reliable service is vitally important to us. Uptime is typically in the proverbial “three nines” (99.95% for the first half of the year), but in the last few weeks we’ve experienced a number of unexpected outages, the worst of which was yesterday, lasting almost two hours.
This was disruptive to many of you, and we’d like to apologize.
Outage details
These recent outages followed a similar pattern, and as is typical with catastrophic events, they were due to a number of factors conspiring.
The triggering event was that requests to our externally hosted WordPress site started timing out or getting blocked (for different reasons), causing a backup at our load balancers. These shared load balancers operate with a high number of open connections, due to thousands of polling requests per second from the Tracker web application, and the backups pushed them over a threshold and into a bad state.
Unfortunately, the new beta version of the Tracker web application, which the majority of our users are now on, did not handle server outages well, and clients on the beta continued to send a barrage of requests to the load balancers, making it difficult for them to recover.
Steps we’re taking
Here’s what we’re doing to make sure this never happens again. Some of these things were already in progress, but are being accelerated after yesterday’s outage.
First, we’ve made sure that the beta web clients immediately back off their polling requests when a server outage is detected, making it easier for overwhelmed load balancers to recover. We’ve also temporarily increased the interval between polling requests. Both of these changes were made yesterday afternoon.
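To illustrate the idea, here is a minimal sketch of client-side polling backoff. The intervals, function names, and doubling strategy are illustrative assumptions, not Tracker’s actual client code.

```javascript
// Hypothetical polling backoff sketch. Assumes a 5-second base interval
// and a 5-minute cap; neither value is from Tracker's real client.
const BASE_INTERVAL_MS = 5000;   // normal polling interval
const MAX_INTERVAL_MS = 300000;  // cap on the backed-off interval

function nextPollDelay(consecutiveFailures) {
  // Double the delay for each consecutive failed poll, up to the cap,
  // so an overwhelmed server sees rapidly thinning traffic.
  const delay = BASE_INTERVAL_MS * Math.pow(2, consecutiveFailures);
  return Math.min(delay, MAX_INTERVAL_MS);
}

// A poll loop that backs off while the server looks unhealthy and
// resets to the base interval as soon as a poll succeeds.
async function pollLoop(fetchChanges) {
  let failures = 0;
  for (;;) {
    try {
      await fetchChanges();
      failures = 0;            // healthy response: back to normal cadence
    } catch (err) {
      failures += 1;           // outage detected: widen the interval
    }
    await new Promise((resolve) => setTimeout(resolve, nextPollDelay(failures)));
  }
}
```

The key property is that clients stop hammering the load balancers the moment requests start failing, instead of contributing to the very backlog that is keeping the service down.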
Over the next few weeks, we are:
- Moving from shared load balancers to multiple pairs of dedicated load balancers, giving us more resources including a larger pool of available connections.
- Moving the externally hosted WordPress site (for our marketing pages) to our application production environment for more stability and control.
- Segregating polling traffic from other types of requests.
- Making our polling architecture more robust, which will also improve the user experience around going off-line.
- Analyzing the entire production environment for other opportunities to improve stability and performance.
On a longer-term front, we’re starting work to transition Tracker’s polling architecture (for keeping clients up to date with project changes) to server “push,” most likely via server-sent events, with fallback to polling only when necessary. Not only will this result in a less chatty (and more stable) overall architecture, but changes to stories will propagate instantly, and it will open the door to a new class of “live” features in Tracker.
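The push-with-polling-fallback idea can be sketched with the standard EventSource API for server-sent events. The endpoint path and function names below are hypothetical, and the real design would involve much more (reconnection policy, authentication, per-project streams).

```javascript
// Hypothetical sketch only: the '/projects/changes/stream' endpoint and
// these function names are illustrative, not Tracker's actual design.
function supportsServerSentEvents(globalObj) {
  return typeof globalObj.EventSource === 'function';
}

function connect(globalObj, onChange, startPolling) {
  if (supportsServerSentEvents(globalObj)) {
    // Push: the server streams project changes as they happen,
    // so clients no longer need to ask "anything new?" every few seconds.
    const source = new globalObj.EventSource('/projects/changes/stream');
    source.onmessage = (event) => onChange(JSON.parse(event.data));
    // If the stream cannot be maintained, drop back to polling.
    source.onerror = () => {
      source.close();
      startPolling();
    };
    return 'push';
  }
  // Fallback: poll only when push is unavailable.
  startPolling();
  return 'polling';
}
```

Because each client holds one long-lived connection instead of issuing a steady stream of polls, the aggregate request rate at the load balancers drops sharply, which is what makes the architecture both quieter and more stable.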
Stay tuned for updates on progress, and again, please accept our apologies for these recent outages. Please follow us on Twitter for all the latest updates, and to monitor Tracker system status, visit our status page.