On Wednesday, December 20, 2017, we experienced an outage that affected customer Hubs, Flipbooks, our back-end application, and our APIs. Today, we want to share details of the incident and what we're doing to help prevent similar incidents in the future.
On Wednesday, December 20, Uberflip's engineering team released a new version of Uberflip AI to production. This was a major release that included significant architectural changes.
During the release, we needed to migrate a large dataset containing historical recommendation data between two database clusters. After testing in a staging environment, we concluded that the migration could be performed safely; however, the team made two mistakes during the release to production:
- We migrated the large dataset after the changes were released in production, instead of immediately before. This meant our new version of Uberflip AI was already requesting the dataset before it was migrated.
- We performed the data migration against a secondary node in the database cluster, instead of the primary, so the migration took much longer than expected.
These two mistakes resulted in the dataset being locked for approximately 15 minutes.
As visitors navigated Hubs with Uberflip AI enabled, the recommendation requests being made in the background were blocked, waiting for the dataset to be unlocked.
Between 4:28 PM EST and 4:32 PM EST, internal monitoring notified us that there was an abnormally high number of blocked requests to all web servers.
Between 4:28 PM EST and 5:01 PM EST, our team worked to cancel the data migration and unlock the dataset.
At 4:41 PM EST, the number of blocked requests exceeded the maximum number of requests that could be handled by our infrastructure, and we began rejecting subsequent requests.
Between 4:41 PM EST and 5:01 PM EST, the team performed rolling restarts of our web servers to clear the old, blocked requests. However, because the dataset was still locked, new requests continued to arrive and block, and the restarted web servers soon began rejecting requests again.
At 5:00 PM EST, the data migration was cancelled and the dataset was unlocked.
At 5:01 PM EST, external monitoring notified us that the platform was fully available.
Impact on Customers
Between 4:41 PM EST and 5:01 PM EST, customer Hubs, Flipbooks, our back-end application, and our APIs were either seriously degraded or unavailable. Visitors would have received either a “Connection Refused” message from their web browser or a “Service Unavailable” message from our infrastructure.
Uberflip's engineering team is reviewing its release process to understand how the mistakes that led to this outage were made and how they can be prevented in future releases.
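To illustrate the kind of safeguard such a review could produce, here is a minimal Python sketch of a pre-release check covering the two mistakes described above. It is an illustration only, not our actual release tooling: the `ReleasePlan` type, its fields, and the `preflight_errors` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReleasePlan:
    # Hypothetical fields that model only the two conditions from this incident.
    migration_target_role: str  # "primary" or "secondary"
    migration_completed: bool   # has the dataset finished migrating?


def preflight_errors(plan: ReleasePlan) -> List[str]:
    """Return the reasons a release should not proceed (empty list means go)."""
    errors = []
    if plan.migration_target_role != "primary":
        errors.append("migration must run against the primary node")
    if not plan.migration_completed:
        errors.append("dataset must be migrated before the new version is released")
    return errors
```

A release script built along these lines would stop whenever `preflight_errors` returns a non-empty list, so the new version could not go live while the dataset was still being migrated.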
The team is also reviewing load balancer, web server, and application configurations to ensure all layers of our infrastructure are configured to reject blocked requests in a reasonable amount of time.
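As a rough sketch of what rejecting a blocked request in a reasonable amount of time can look like at the application layer, the Python example below waits for a locked resource only up to a deadline and then gives up. It is a simplified model under stated assumptions, not our production code: `dataset_lock` is a hypothetical in-process stand-in for the locked dataset, and `fetch_recommendations` and its placeholder return values are invented for illustration.

```python
import threading

# Hypothetical stand-in for the dataset that was locked during the migration.
dataset_lock = threading.Lock()


def fetch_recommendations(timeout_s: float):
    """Return recommendations, or None if the dataset stays locked past the deadline."""
    # Wait for the lock, but only for timeout_s seconds.
    if not dataset_lock.acquire(timeout=timeout_s):
        # Give up quickly so this request does not hold a web server slot.
        return None
    try:
        return ["item-1", "item-2"]  # placeholder data for the example
    finally:
        dataset_lock.release()
```

With a bounded wait like this, a page could render without recommendations while the dataset is unavailable, rather than leaving the request blocked until the server runs out of capacity.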
Our Commitment to You
To those affected, we sincerely apologize that your service was degraded or unavailable yesterday, and we take full responsibility for what happened. Your satisfaction is what everyone at Uberflip works towards every day. As outlined above, we’re committed to ensuring that this does not happen again.
As always, your Account Executive and Customer Service team are available to help in any way. Likewise, please feel free to contact us directly if you have any questions or needs.