Here’s a link to our official incident status and postmortem, where we will be following up with more details over the coming days. However, just to keep everybody on this thread up to date, I want to summarize a few things.
Technical
First of all, regarding the technical nature of this incident: it was the result of a database-level upgrade that we had planned for some time and had executed successfully in staging, but that hit an error when we ran it in production.
The upgrade involved moving databases from one set of compute resources to another, more dedicated set, so we could continue to scale and keep providing you all with a great level of service. Obviously, in attempting to do so, we ended up providing you with a worse one.
More technical details will follow on the incident link I posted, as we have a lot to sift through internally, and that will take some time.
Communication
In terms of our general communication, Glide was dark for far too long about the impact of this issue, and that is unacceptable.
At the very least, when we have an incident, we should be able to tell you and the broader community of customers what the nature of the incident is and what progress we are making toward resolving it.
That was a very clear miss in this incident: there were roughly 10 hours of application impact and very little customer-facing communication.
This reveals several large gaps in our process, which we will also look to fix as part of our analysis and postmortem. Stay tuned for details on that as well.
Next
By the middle of this week, we intend to have a full technical postmortem as well as specific changes to our incident response process.
We will post those updates here and on the official incident status linked above.
Thank you all for your understanding and your feedback.