This past weekend, we encountered two incidents that made our platform unavailable to some of our customers. During these incidents, customer Hubs, customer Flipbooks, our APIs, and our management UI were either severely degraded or unavailable. Today, we want to share details of both incidents and what we’re doing to help prevent similar incidents in the future.
About Our Infrastructure
Uberflip currently serves customers from two data centers in downtown Toronto. Our infrastructure is designed for redundancy and can handle the failure of a cable, NIC, PSU, machine, rack, or entire data center without the platform becoming unavailable. Each data center has two connections to the Internet, and our infrastructure handles the failure of one connection by automatically failing over to the other. However, both connections are through the same Internet service provider (ISP), Beanfield Metroconnect.
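To illustrate why connection-level failover alone does not protect against a problem inside a shared ISP, here is a minimal sketch. It is not our production failover logic; the `Uplink` type, the `pick_uplink` function, and the names `uplink-a`, `uplink-b`, and `isp-1` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class Uplink:
    name: str             # hypothetical label for one Internet connection
    isp: str              # the provider that carries this connection
    link_up: bool = True  # whether the connection itself is working


def pick_uplink(uplinks, failed_isps=frozenset()):
    """Return the first uplink whose link is up and whose ISP is healthy, else None."""
    for uplink in uplinks:
        if uplink.link_up and uplink.isp not in failed_isps:
            return uplink
    return None


# Hypothetical data center: two connections, both through the same ISP.
uplinks = [Uplink("uplink-a", "isp-1"), Uplink("uplink-b", "isp-1")]

# One connection fails: traffic fails over to the other connection.
uplinks[0].link_up = False
after_link_failure = pick_uplink(uplinks)

# A fault inside the shared ISP: neither connection is usable.
after_isp_failure = pick_uplink(uplinks, failed_isps={"isp-1"})
```

In this simplified model, the first failure still leaves a usable connection, while the second leaves none, because both connections depend on the same provider.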
Incident #1 Summary
On October 14 at 9:48pm EDT, we first received notice via our monitoring service that the Uberflip platform was unavailable from specific regional locations. We concluded the problem was with our ISP (Beanfield Metroconnect), and not our own hardware, software, or configuration.
On October 15 at 12:13am EDT, our monitoring services stopped notifying us that the platform was unavailable. We confirmed that Hubs and Flipbooks were available.
Incident #2 Summary
On October 15 at 3:13pm EDT, our monitoring services started notifying us that the platform was again unavailable from certain locations.
While working with Beanfield, we learned that the outage was caused by a bug in their Cisco core routers, and that they were working with Cisco to fix it.
On October 15 at 11:44pm EDT, we confirmed that the Uberflip platform (Hubs and Flipbooks) was available.
Today, Beanfield support notified us that they had put a workaround in place for the underlying bug and are working to fully resolve it.
Impact to Customers
We’ve compared our web traffic during the outages with web traffic for the four previous weeks to understand the impact to our customers. We’ve concluded that web traffic decreased by approximately one third during the outages.
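The comparison described above can be sketched as a simple calculation: average the traffic for the same time window in the four previous weeks, then measure how far traffic during the outage fell below that baseline. The request counts below are hypothetical placeholders chosen for illustration, not our actual traffic figures, and `traffic_drop` is a name made up for this example.

```python
from statistics import mean


def traffic_drop(baseline_counts, outage_count):
    """Return the fractional decrease of outage traffic relative to the baseline mean."""
    baseline = mean(baseline_counts)
    if baseline <= 0:
        raise ValueError("baseline traffic must be positive")
    return (baseline - outage_count) / baseline


# Hypothetical request counts for the same window in each of the four previous weeks.
previous_weeks = [120_000, 118_000, 122_000, 120_000]
# Hypothetical request count for that window during an outage.
during_outage = 80_000

drop = traffic_drop(previous_weeks, during_outage)
print(f"Traffic decreased by {drop:.0%}")
```

Averaging over several weeks smooths out ordinary week-to-week variation, so the decrease is measured against typical traffic rather than a single, possibly unusual, week.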
Nevertheless, we saw successful requests from countries around the world during the outages. The outages appear to have been regional or related to network topology. For example, our monitoring nodes in Canada were completely unable to access our platform during the outages.
What We’re Doing to Prevent Similar Incidents
- Fix the single point of failure. Each of our data centers requires multiple connections to the Internet through independent ISPs. We were already planning to add ISPs when these incidents occurred; we will now review whether we can accelerate that plan without risking additional outages.
- Improve our status reporting. Normally, if our service were degraded in some way, we would publish status updates through our management UI. During these incidents, our management UI was unavailable, so we had no way to indicate that we were working on the issue. Going forward, we will publish status updates using a third-party reporting service (such as statuspage.io).
Our Commitment to You
To those affected, we sincerely apologize that your service was degraded or unavailable this past weekend, and we take full responsibility for what happened. Your satisfaction is what everyone at Uberflip works towards every day. As outlined above, we’re committed to ensuring that this does not happen again.
As always, your Account Executive and Customer Service team are available to help in any way. Likewise, please feel free to contact us directly if you have any questions or needs.