Here’s a link to our official incident status and postmortem, where we will be following up with more details over the coming days. However, just to keep everybody on this thread up to date, I want to summarize a few things.
Technical
First of all, regarding the technical nature of this incident: it was the result of a database-level upgrade that we had planned for some time and had executed successfully in staging, but that hit an error when we ran it in production.
The upgrade involved moving databases from one set of compute resources to another, more dedicated set, so we could continue to scale and keep providing you all with a great level of service. Obviously, in attempting to do so, we ended up providing you with a worse one.
More technical details will follow on the incident link I posted, as we have a lot to sift through internally, and that will take some time.
Communication
In terms of our general communication, Glide was dark for far too long about the impact of this issue, and that is unacceptable.
At the very least, when we have an incident, we should be able to tell you and the broader community of customers what the nature of the incident is and what progress we are making toward resolving it.
That was a very clear miss in this incident: there were roughly 10 hours of application impact and very little customer-facing communication.
This reveals several large gaps in our process, which we will also look to fix as part of our analysis and postmortem. Stay tuned for details on that as well.
Next
By the middle of this week, we intend to have a full technical postmortem as well as specific changes to our incident response process.
We will post those updates here and on the official incident status linked above.
Thank you all for your understanding and your feedback.