October 14th availability incident analysis

Good morning, Glide Community. I know that for many of you, your business or the client businesses you serve are off to a chaotic start due to the availability incident that occurred overnight (Pacific Time).

Our on-call engineers responded to the incident and were able to help our systems recover after an extended period of more than 4 hours.

We are currently aware that the following functionality experienced interruptions:

  • Glide Big Tables, SQL data sources, BigQuery
  • Syncing with external data sources (Google Sheets, Airtable, etc.)
  • Integrations (Glide AI, PDFMonkey, DocsAutomator, etc.)
  • Call API

An extended outage affecting this much of the platform is very unfortunate, and this is exactly the type of negative impact we always work to avoid.

Please accept our apologies for these interruptions! We know that you count on Glide to be able to run your business, and we will do everything we can to ensure strong platform stability and reliability.

Today I am working with our platform engineering team to conduct an incident postmortem review and root cause analysis in order to document the facts around what led to this incident.

Our goal is to learn why the changes we made to our production infrastructure resulted in this negative impact, when testing those same changes in our (pre-production) staging environment did not surface the same failures.

Once we are able to fully understand what unexpected condition was encountered in our production systems, we will update this post with additional context. I expect that to be today or tomorrow.

15 Likes

I just want to note that, at least with respect to Big Tables, they started failing at 11:00 UTC+8 and were restored shortly before 21:00 UTC+8, so the outage was closer to 10 hours.

2 Likes

Thank you for acknowledging this and working to remedy it!

My one point of feedback is that the Glide status page did not reflect the severity of what was going on, even though an existing issue was escalated. The incident was labeled as “minor,” with “performance issues” affecting certain integrations and external sources. That did not match the actual severity, and there was no email from Glide about the issue outside of those generated by the status page, so the status page was really our only point of reference.

I only knew the severity of the issue because the owner of one of the integrations reached out to his user base to say there were major issues at Glide.

Downtimes happen, but I think it is important to make sure messaging is as proactive as possible.

5 Likes

Our outage started at 10:00 PM ET and resolved around 8:45 AM ET, so also roughly 10 hours.

Thank you for the data points! I initially used our status page as my point of reference, but we’ll get more detail into the write-up.

1 Like

Thank you for the feedback. We will review whether we were as transparent as possible in reflecting severity, and where/why we might have been missing information.

1 Like

Here’s a link to our official incident status and postmortem, where we will be following up with more details over the coming days. However, just to keep everybody on this thread up to date, I want to summarize a few things.

Technical

First, regarding the technical nature of this incident: it was the result of a database-level upgrade that we had planned for some time and had executed successfully in staging, but that hit an error when we performed it in production.

The upgrade involved moving databases from one set of compute resources to another, more dedicated, set of resources so we could continue to scale and provide you all with a great level of service. Obviously, in attempting to do so, we ended up providing you with a worse service.
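To give a rough sense of what a move like this involves (purely as an illustration, not our exact procedure), data is typically replicated onto the new cluster and traffic is only cut over once the replica has caught up. Below is a minimal sketch of that kind of pre-cutover lag check; the host names, database names, and lag threshold are all placeholders.

```python
# Illustrative pre-cutover check (not Glide's actual migration tooling):
# compare the primary's current WAL position with the replica's replay
# position and refuse the cutover while the lag is too large.

import psycopg2

MAX_LAG_BYTES = 16 * 1024 * 1024  # assumed acceptable lag before cutting over


def replication_lag_bytes(primary_dsn: str, replica_dsn: str) -> int:
    with psycopg2.connect(primary_dsn) as primary, \
         psycopg2.connect(replica_dsn) as replica:
        with primary.cursor() as cur:
            # Current write position on the old (primary) cluster.
            cur.execute("SELECT pg_current_wal_lsn()")
            primary_lsn = cur.fetchone()[0]
        with replica.cursor() as cur:
            # pg_wal_lsn_diff() gives the byte distance between two LSNs.
            cur.execute(
                "SELECT pg_wal_lsn_diff(%s::pg_lsn, pg_last_wal_replay_lsn())",
                (primary_lsn,),
            )
            return int(cur.fetchone()[0])


if __name__ == "__main__":
    lag = replication_lag_bytes(
        "host=old-cluster dbname=appdb user=monitor",  # placeholder DSN
        "host=new-cluster dbname=appdb user=monitor",  # placeholder DSN
    )
    print(f"replica lag: {lag} bytes")
    if lag > MAX_LAG_BYTES:
        raise SystemExit("replica not caught up; do not cut over yet")
```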

More technical details will follow on the incident link I posted, as we have a lot to sift through internally, and that will take some time.

Communication

However, in terms of our general communication, Glide was dark for way too long regarding the impact of the issue, and that is unacceptable.

At the very least, when we are having incidents, we should be able to communicate to you and to a broader community of customers what the nature of the incident is and our progress as we attempt to resolve it.

That was a very clear miss in this incident, as there was about 10 hours of application impact and very little customer-facing communication.

This reveals several large gaps in our process, and closing them will also be part of our analysis and postmortem. Stay tuned for details on that as well.

Next

By the middle of this week, we intend to have a full technical postmortem as well as specific changes to our incident response process.

We will post those updates here and on the official incident status linked above.

Thank you all for your understanding and your feedback.

14 Likes

Thanks for your patience, everybody.

FYI, we just posted our postmortem review summary here, which goes into more technical detail on the sequence of events that caused the outage, as well as the corrective measures we intend to take.

Happy to answer any questions you have here as well.

Our apologies again for the impact to your business, and thank you for your understanding.

2 Likes

Not sure if I understand everything, but it appears a checkpoint failback was corrupt due to a WAL write inconsistency. Is Glide investing in tools to ensure that HA data migration (which I expect to occur again and again due to continued rapid growth) is seamless even during unexpected but known Postgres (and other DB) issues, like the WAL write inconsistency?

Thank you for the write-up.

Hey Matt,

Thanks for the question.

So the checkpoint failure was due to a WAL write failure… which was in turn due to a Postgres OOM error on the new cluster. So quite a series of dependencies resulted in this situation. However, at our scale, such complexities should not be considered edge cases, which leads us to the bigger-picture solution here…

We mentioned that we have been undertaking significant improvements to our data infrastructure for several months now. The major effort we are planning is a migration away from self-provisioned, self-managed Postgres clusters to more robust offerings from our platform provider (where backups, migrations, expansion, and scaling are “just” features exposed on a dashboard somewhere).

While we know there is no silver bullet when it comes to managing infrastructure, this direction was chosen with the intent of eliminating a whole class of infrastructure concerns, like the ones we experienced during this incident.
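As one concrete (and again purely illustrative) example of the kind of guardrail that catches this class of failure early: a probe can watch whether checkpoints keep completing, since the last-checkpoint timestamp stops advancing once WAL writes start failing. The sketch below is not our actual tooling; the DSN, threshold, and function name are placeholders.

```python
# Illustrative checkpoint-freshness probe (not Glide's actual monitoring):
# alert if the last completed checkpoint is older than a threshold, which is
# one symptom of the OOM -> WAL write failure -> checkpoint failure chain.

import psycopg2

STALENESS_LIMIT_SECONDS = 15 * 60  # assumed alerting threshold


def check_checkpoint_freshness(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_control_checkpoint() reports the time of the last completed
            # checkpoint; if checkpoints stop completing, this stops advancing.
            cur.execute(
                "SELECT extract(epoch FROM now() - checkpoint_time) "
                "FROM pg_control_checkpoint()"
            )
            age_seconds = float(cur.fetchone()[0])
    if age_seconds > STALENESS_LIMIT_SECONDS:
        raise RuntimeError(
            f"last checkpoint completed {age_seconds:.0f}s ago; "
            "the WAL/checkpoint pipeline may be stalled"
        )


if __name__ == "__main__":
    # Placeholder DSN; point this at the cluster being monitored.
    check_checkpoint_freshness("host=db-cluster dbname=appdb user=monitor")
```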

Hope that gives you some confidence in our future direction and data stewardship overall.

4 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.