AppSheet outage update

(Gil Littman) #1

Seems like the issue is now solved. We’re still investigating what happened exactly.
Praveen will post his conclusions soon.

We apologize for any inconvenience caused by this issue.

1 Like
(Praveen Seshadri (AppSheet)) #2

Here is an update on today’s outage.

We have one customer who gets a daily load of 2000 users, all of whom sync around 4PM PST. Normally, these syncs take just a few seconds each, but about 10 days ago, due to a change in their data size and logic, the syncs started taking about 60 seconds. This caused a problem for our servers, naturally, because these users were swamping the system.

To counter this, we asked the customer to tune their app and data set, which they did, reducing the sync time to 2 seconds. We also put in a throttle mechanism to ensure that no more than N (roughly 10) syncs from the same app can be served at the same time. The goal of the throttle mechanism was to ensure fairness, so that our server resources would remain available to all other customers. AppSheet has many actual servers that any individual browser or device may connect to. To provide a single throttle across all of them, we recorded information in our central cache server, which is accessible by all the individual AppSheet servers. So even if 2000 sync requests come in, they would check with the cache, and 1990 of them would be stalled while 10 go through. Each of those would rapidly finish and another 10 would go through, and so on. This worked fine all of last week.
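
A minimal sketch of the kind of throttle described above, not AppSheet's actual code. It assumes the central cache offers atomic increment and decrement on a per-app counter; the `FakeCentralCache` class, the key format, and the function names are hypothetical stand-ins for illustration.

```python
import threading

MAX_CONCURRENT_SYNCS = 10  # "N (roughly 10)" from the post


class FakeCentralCache:
    """In-memory stand-in for the shared cache server (hypothetical API)."""

    def __init__(self):
        self._counts = {}
        self._lock = threading.Lock()

    def incr(self, key):
        # Atomically add 1 and return the new value.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1
            return self._counts[key]

    def decr(self, key):
        # Atomically subtract 1 and return the new value.
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) - 1
            return self._counts[key]


def try_start_sync(cache, app_id, limit=MAX_CONCURRENT_SYNCS):
    """Return True if this sync may run now, False if it must wait."""
    key = f"active_syncs:{app_id}"  # assumed key format
    if cache.incr(key) > limit:
        cache.decr(key)  # over the limit: give the slot back
        return False
    return True


def finish_sync(cache, app_id):
    """Release the slot held by a completed sync."""
    cache.decr(f"active_syncs:{app_id}")


cache = FakeCentralCache()
results = [try_start_sync(cache, "customer_app") for _ in range(2000)]
admitted = sum(results)
stalled = len(results) - admitted
print(admitted, stalled)  # 10 1990

finish_sync(cache, "customer_app")  # one sync completes
print(try_start_sync(cache, "customer_app"))  # True
```

Note that every stalled request still costs the cache at least one round trip per attempt, so the throttle protects the sync servers but moves the waiting load onto the cache.
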
However, it now appears that the sync times for the app are back above 1 minute. We have contacted the customer to find out why. This had a severe negative effect on the system. Unable to get access, the 1990 stalled requests start repeating their request to the central cache. This puts load on the cache, which becomes slow to respond; the slow responses cause requests to time out and retry, and the retries make a bad situation worse. At about 4PM PST today, our cache server suddenly went from 5% load to 100% load and effectively stopped responding to all requests. This in turn stopped all AppSheet services: the central cache is used to improve the efficiency of many things, so it is always checked first before going to a slower database lookup.
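
A rough back-of-the-envelope calculation shows why the sync time matters so much under the throttle. It uses the figures quoted above (2000 users, 10 slots, 2 seconds versus roughly 60 seconds per sync). The one-second re-check interval for stalled requests is an assumption made for illustration, and the model ignores the timeouts and retries that made the real situation worse.

```python
import math

USERS = 2000  # users syncing at about the same time (from the post)
SLOTS = 10    # concurrent syncs allowed per app (from the post)


def drain_seconds(users, slots, sync_seconds):
    """Time for every user to finish when `slots` syncs run at once."""
    waves = math.ceil(users / slots)
    return waves * sync_seconds


def cache_checks_per_second(waiting, retry_interval_seconds):
    """Approximate cache requests/sec from stalled clients re-checking."""
    return waiting / retry_interval_seconds


fast = drain_seconds(USERS, SLOTS, 2)   # tuned app: 2 s per sync
slow = drain_seconds(USERS, SLOTS, 60)  # regressed app: about 60 s per sync
print(fast, slow)  # 400 12000

# ASSUMPTION: each stalled request re-checks the cache once per second.
print(cache_checks_per_second(USERS - SLOTS, 1.0))  # 1990.0
```

At 2 seconds per sync the queue drains in under 7 minutes; at 60 seconds it takes more than 3 hours, and the stalled requests keep hitting the cache for that whole period.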

It took a while and a lot of frenetic exploration of theories to home in on the problem. We then deployed a new cache server, redeployed all our servers to use the new cache server, and deployed new code that increases the timeouts. We will put in some more mitigations overnight as well.

It has been a trying time for all of you these last few days, as you deal with some of these instabilities. Some of these have been Google’s doing, but some, like today’s episode, are the fault of our platform. People are now running large-scale, compute- and data-intensive workloads on AppSheet, and we are experiencing growing pains as we deal with the scale challenges. I want to apologize to all of you. You deserve uninterrupted service. We will continue to work very hard at anticipating and avoiding these issues going forward.

5 Likes
(Pratik Parmar) #3

Thank You @praveen and AppSheet Team.

Should I be concerned at the end-user level about the above message (AppSheet’s sync waits), or should I just wait for the new cache system to go into place?

I trust AppSheet completely.
Thank you.

(Tammi Canelli) #4

We are in the process of developing a public app that will use the same SQL database as a private app (for employees). The number of users on the public side is unknown at this time; the app will be promoted for pre-registration for evacuation centers. What do we need to do to avoid crashing the system? Also, I saw another post about public apps being $50/mo. Does this apply to business/corporate plans?

(Praveen Seshadri (AppSheet)) #5

Hi Tammi, regarding pricing, please channel that discussion via Drew. The public app per-app pricing you reference is meant for self-serve customers rather than business plans.

With regard to crashing the system, our job is to ensure that nobody can crash the system, whatever load they put on it. This happens by throttling individual accounts/apps to make sure no one app can overwhelm the system. What happened last week was a failure in that throttling mechanism, i.e. the throttling mechanism itself got into a bad state.

That said, if you are using the same SQL database for your private and your public apps, it is possible that your SQL database might be overloaded. It depends on the app design. So that is something we can talk through.

(Tammi Canelli) #6

Thank you. You’ll see another post re the public app and security concerns. Definitely could use some guidance on how best to set this up. I’m meeting with our IT security next week. Perhaps I can coordinate through Drew to schedule a meeting based on their concerns?

(Praveen Seshadri (AppSheet)) #7

Tammi, yes, it is best to discuss via Drew.

For the broader audience, “public apps” have no security mechanisms. So it is rarely the case that public apps and private/secure apps should share the same data sets/tables.

That said, I’m assuming, Tammi, that you were just going to use the same database server as a backend for these two different data sets.