Post Mortem: AppSheet Service Outage: 06/17/2019

pravse · 06-19-2019 03:32 PM

The AppSheet service was very slow and largely unresponsive for approximately 2 hours, from 8 AM to 10 AM PST on Monday, June 17th. The root cause was an overload of our North American servers, triggered by an unintentional infinite loop that a customer app generated through the REST API. The customer made an erroneous change to their app just before 8 AM PST, and this set off the overload.

We maintain a monitoring service that alerts some of our key engineers and managers whenever server availability drops. Both Brian Sabino and I received email and SMS alerts within a couple of minutes of the incident beginning. We started an emergency phone call to diagnose and solve the problem. The call lasted 2 hours as we tried to determine the root cause and work out a resolution. The root cause was difficult to determine because the CPU utilization pattern on each server was atypical: each server would show a CPU spike for a few minutes and then show low CPU for a few minutes.

AppSheet has many servers hosting traffic in North America. When a server has very high CPU for a few minutes, AppSheet has a safety mechanism that takes the server out of rotation and restarts it. This usually takes a few minutes and is generally a good stability mechanism, because it prevents any individual runaway request from overwhelming a server. In this case, however, the mechanism worked to our detriment. When a server was taken out of rotation, the load simply transferred to another server and overloaded it as well, so the servers took turns getting overloaded and restarting. In hindsight, this was the cause of the confusing CPU pattern.
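To make the "taking turns" pattern concrete, here is a toy simulation. It is only an illustrative sketch, not AppSheet's actual routing or restart logic: the function name `simulate`, the tick-based timing, and the assumption that the runaway load sticks to one server until that server is pulled from rotation are all invented for the example.

```python
def simulate(num_servers=3, hot_limit=2, restart_ticks=3, ticks=9):
    """Return one string per tick; each character is one server's state.

    H = handling the runaway load (high CPU)
    . = in rotation with low CPU
    R = out of rotation, restarting
    """
    hot = [0] * num_servers          # consecutive hot ticks per server
    restarting = [0] * num_servers   # ticks left before a server rejoins
    victim = None                    # server currently absorbing the load
    last = -1                        # most recent server taken out of rotation
    trace = []
    for _ in range(ticks):
        if victim is None:
            # Hypothetical routing: the load moves to the next server in rotation.
            for k in range(1, num_servers + 1):
                candidate = (last + k) % num_servers
                if restarting[candidate] == 0:
                    victim = candidate
                    break
        trace.append("".join(
            "R" if restarting[i] else ("H" if i == victim else ".")
            for i in range(num_servers)
        ))
        # Restarts in progress move one tick closer to completion.
        for i in range(num_servers):
            if restarting[i] > 0:
                restarting[i] -= 1
        # Safety mechanism: a server that stays hot too long is restarted.
        if victim is not None:
            hot[victim] += 1
            if hot[victim] >= hot_limit:
                hot[victim] = 0
                restarting[victim] = restart_ticks
                last = victim
                victim = None
    return trace


if __name__ == "__main__":
    for tick, row in enumerate(simulate()):
        print(tick, row)
```

Reading a single column of the output gives one server's view: a short spike, a quiet stretch while it restarts, then another spike once the load comes back around. The total load never goes away, which is why restarting servers did not help.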

We went through a checklist of possible causes: database connectivity, memory cache connectivity, a Microsoft Azure outage (our servers are hosted there), a network availability issue, and so on. We tried running the service on our local developer machines, where it ran without a flaw. We checked our Europe-based servers and they were running fine. We investigated various monitoring traces captured by Microsoft Azure to see if there were any hints. After pursuing a couple of red herrings, we finally found in our error analytics that a large number of errors had been recorded from the same customer app, and this helped us understand the root cause. Our servers were being swamped by thousands of REST API calls from the same app.
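The step that finally exposed the problem, noticing that one app dominated the recorded errors, can be sketched as a simple aggregation. This is a hypothetical illustration: the record shape (a dict with an `app_id` key), the function name `top_error_sources`, and the 50% threshold are assumptions, not a description of AppSheet's error analytics.

```python
from collections import Counter


def top_error_sources(errors, share_threshold=0.5):
    """Return (app_id, count) pairs for apps producing an outsized share of errors.

    `errors` is an iterable of dicts with an "app_id" key (assumed shape).
    """
    counts = Counter(record["app_id"] for record in errors)
    total = sum(counts.values())
    return [
        (app_id, count)
        for app_id, count in counts.most_common()
        if total and count / total >= share_threshold
    ]


if __name__ == "__main__":
    sample = [{"app_id": "app-a"}] * 8 + [{"app_id": "app-b"}, {"app_id": "app-c"}]
    print(top_error_sources(sample))
```

Running a check like this early in an incident, before working through infrastructure causes, may shorten the time spent on red herrings.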

We blocked the specific app, but there were still so many queued-up requests that we were not convinced this would quiesce the system. So we temporarily disabled the entire REST API across our whole platform. This is what got the service working again. A couple of hours later, once we had communicated with the customer and understood the root cause better, we re-enabled the REST API, now with throttles in place to prevent this issue from recurring.
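The two mitigations described above, blocking one app and then switching off the REST API entirely, amount to an admission check in front of the API. The sketch below is hypothetical: the class name `ApiGate`, its fields, and the HTTP status codes it returns are assumptions chosen for illustration, not AppSheet's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ApiGate:
    """Admission check applied before a REST API request is processed."""

    api_enabled: bool = True                        # platform-wide kill switch
    blocked_apps: set = field(default_factory=set)  # per-app block list

    def admit(self, app_id):
        """Return an (HTTP status, reason) pair for an incoming request."""
        if not self.api_enabled:
            return (503, "REST API temporarily disabled")
        if app_id in self.blocked_apps:
            return (429, "app blocked")
        return (200, "ok")


if __name__ == "__main__":
    gate = ApiGate()
    gate.blocked_apps.add("runaway-app")
    print(gate.admit("runaway-app"))   # rejected by the per-app block
    gate.api_enabled = False
    print(gate.admit("healthy-app"))   # rejected by the global switch
```

A gate like this only rejects new requests. Requests already queued still have to drain, which is consistent with why blocking the single app did not look sufficient on its own.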

On Tuesday, June 18th, our engineers discussed how we could have prevented this issue. Running a multi-tenant cloud service can be challenging, especially when the service hosts thousands of apps, each with its own custom logic. The service still needs to run with stability and ensure independence across users and accounts. We have implemented internal mechanisms to track and “rate-limit” the use of any particular feature of AppSheet, for example the number of calls to the REST API made by any particular account. So far, we have not enforced rate-limiting, but this incident underscores that we really have to do so in order to protect platform stability.
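For readers unfamiliar with per-account rate limiting, one common way to do it is a token bucket per account. The sketch below is a generic illustration under stated assumptions: the class names, the `rate` and `burst` parameters, and the choice of a token bucket are not taken from AppSheet's internal mechanism, which is not described here.

```python
import time


class TokenBucket:
    """Allows `rate` calls per second on average, with bursts up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class AccountRateLimiter:
    """Keeps one bucket per account so one tenant cannot exhaust capacity for all."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets = {}

    def allow(self, account_id):
        bucket = self.buckets.get(account_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[account_id] = bucket
        return bucket.allow()


if __name__ == "__main__":
    limiter = AccountRateLimiter(rate=1.0, burst=3)
    print([limiter.allow("acct-1") for _ in range(5)])
    print(limiter.allow("acct-2"))
```

Because each account has its own bucket, a looping app exhausts only its own allowance while calls from other accounts continue to be admitted.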

I want to emphasize that this outage was not the result of any kind of malicious intent nor was it the fault of this customer in any way. Our system needs to ensure that such infinite loops cannot occur. We did not have a mechanism to prevent that, and the outage is therefore our fault.

I sincerely apologize for the disruption caused. It is our job to ensure that an outage like this one doesn’t recur. Our detailed work in the next few days and weeks will focus on mechanisms to detect and prevent events similar to this one. At the very least, we will learn from this mistake and avoid repeating it.

Mike_A

Thanks for the transparency @praveen. It’s always difficult when “stuff” happens, but I really appreciate how hard the AppSheet team works to quickly resolve issues and, more importantly, to make improvements based on what it learns. Certainly not a fun set of events, but the way you and your team take accountability is very refreshing. Thanks for all the efforts to quickly solve tough problems like this and keep the less “sexy” side of the operation running.

Rod

This is exactly what I like about you guys: the transparency and the effort behind it. Great job!