Addressing Glide’s recent Airtable and Excel outage

On October 12, Glide experienced an increased rate of failure to sync data with Airtable for a very small portion of our user base. With Airtable’s recently announced pricing changes, there have been major changes to how products like Glide can use Airtable’s API. New quotas and rate limits introduced by Airtable have made syncing data to Airtable less reliable.

In working to minimize service disruption, given new Airtable API limits, our team shipped changes to our data sources sync engine, which led to data consistency issues and data privacy issues for some of these customers.

While these instances are rare, we take ensuring the reliability and security of Glide for our customers very seriously. Thus, we engaged the full resources of our engineering team to take immediate corrective action to restore affected data sets from backup to prevent data loss. We also engaged our support team to reach out to all affected customers to confirm a full recovery of their business operations.

In the interest of transparency, we wanted to share additional details about the outage and convey our learnings and plans to improve.

How to identify if your account was affected

We have reached out to all of the customers we identified as being affected by this outage.

To determine if your account was impacted, please review the following context:

The issue only affected users with data sources in Airtable or Excel. If you are using Glide Tables, Google Sheets, or BigQuery, your account was not impacted.

If you have connected your Glide app to Airtable or Excel, check the Data Editor:

  • If you see your Airtable or Excel tables with the Glide tables icon, it means that your app was affected. Changes to your data are being written to these tables but are not being synced back to your external data source.
  • If you see your Airtable or Excel tables with the appropriate Airtable or Excel icons, your app was not affected. Your data should be syncing to your external data sources as normal.

In general, you can check our Glide Status page if you are ever concerned that your Glide account may be experiencing an issue.

The data source issues we encountered

On August 24, Airtable announced pricing changes and new Airtable API rate limits.

Following this announcement, our engineering team kicked off a project on August 31 to evolve our data source synchronization logic and mitigate any potential service disruption. Our technical approach involved creating a checkpointing system for data synchronization. This approach allows a partially completed sync to be picked up where it left off when API rate limiting or other slowing factors cause it to run for too long. Our goal was to ensure Glide apps with Airtable data sources were able to sync with Airtable in a timely fashion.
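To make the checkpointing idea concrete, here is a minimal sketch in Python. It is not Glide's implementation: the names `Checkpoint`, `RateLimited`, `run_sync`, and the page-based `fetch_page` callback are all assumptions made for illustration. The sketch only shows the general pattern of recording progress so that an interrupted sync resumes from the last completed page instead of starting over.

```python
from dataclasses import dataclass, field


class RateLimited(Exception):
    """Hypothetical error raised by the fetcher when the upstream API refuses a request."""


@dataclass
class Checkpoint:
    # Index of the next page to fetch; 0 means the sync has not started yet.
    next_page: int = 0
    # Rows collected so far, across every attempt.
    rows: list = field(default_factory=list)
    # True once every page has been fetched.
    done: bool = False


def run_sync(fetch_page, total_pages, checkpoint, max_pages_per_run):
    """Advance a sync from its checkpoint.

    Stops early when the per-run page budget is used up or when the
    (hypothetical) fetcher reports rate limiting. Progress already made
    stays in the checkpoint, so the next call resumes from there.
    """
    fetched = 0
    while checkpoint.next_page < total_pages and fetched < max_pages_per_run:
        try:
            page_rows = fetch_page(checkpoint.next_page)
        except RateLimited:
            break  # leave the checkpoint untouched; a later run retries this page
        checkpoint.rows.extend(page_rows)
        checkpoint.next_page += 1
        fetched += 1
    checkpoint.done = checkpoint.next_page >= total_pages
    return checkpoint
```

In this sketch a scheduler would call `run_sync` repeatedly with the same `Checkpoint` until `done` is true. A real system would also need to persist the checkpoint between runs and handle source data that changes mid-sync, which this example leaves out.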

On October 12, Airtable began enforcing its new API rate limits, causing a partial outage for Glide apps with Airtable data sources. At this time, we were in the final stages of testing out the checkpointing improvements we had been working on for data synchronization. We fixed the last known bug that had been identified during testing and proceeded to roll out the improvements.

That night, we began receiving reports on the community forum and via customer support indicating that “Airtable tables are disconnected and appear to be Glide tables”.

On October 13, engineers investigated these reports and discovered a new bug that would cause Airtable and Excel data sources to eject under certain rare conditions. “Ejection” is when a table is disconnected from its data source (in this case, Airtable) and is turned into a standard Glide Table. This halts the syncing of that table with its primary data source.

The process of ejecting a table in Glide was not designed to be a two-way process. Reversing it requires significant manual intervention. This leaves user applications in a state where data is being written to a Glide Table and safely saved but no longer syncing to its primary data source. This can be both confusing to the user and cause data consistency issues for their app.

Once the bug was discovered, we were able to quickly track down its cause. We had feature flagging in place, which let us identify and turn off the part of the code that was causing the problem.

The impact of those issues on user data

During an 18-hour time window, 84 users who were using Airtable and/or Excel were affected by the data source ejection issue. Those users may have experienced some amount of data inconsistency between the time their tables were ejected and when we were able to manually reconnect them.

Although we developed a way to repair affected table connections, we cannot guarantee 100% recovery of all affected data. So far, however, we have not found any instances of data loss.

Out of an abundance of caution, we decided to leave both tables in place: the original Airtable table and the inadvertently created “duplicate” Glide Table. This allows users to manually assess which data may not have been synced back to Airtable.

In most cases, we have automatically resolved the data consistency issues for the user.

What we learned and how we plan to improve

Our team here at Glide takes platform reliability and security for users extremely seriously. After every outage, we conduct a review to understand what went wrong and to decide how to improve to avoid related issues in the future. The following is a summary of our findings from this outage:

  • Features or improvements to the Glide platform that we intend to release normally go through a rigorous progressive rollout process in order to catch bugs early. This typically means that we first release changes to a small percentage of customers while monitoring telemetry and error logs.

    During this incident, we initially enabled the data sync checkpointing improvements for 100% of customers. This led to a larger number of customers being affected, amplifying the intensity and communication volume around the incident. We learned that we must incorporate progressive rollout requirements into our launch checklists so that we do not forget to do this, even when under time pressure during an outage. Since the outage, we have developed the ability to progressively roll out changes to our data source sync engine.

  • During this incident, an unrelated problem with database replication in one of our database clusters blocked our ability to deploy Airtable sync fixes in a timely manner. We learned that multiple issues unfolding in parallel can increase confusion and slow resolution times. We have since investigated and discovered the root cause of this database replication issue and plan to further improve our automated deployment processes to handle this scenario.

  • While restoring connections between Airtable and Glide, we introduced an unintentional data authorization issue that affected Row Owners settings for a very small number of Glide apps (fewer than 14). This issue was responsibly and privately disclosed to us, and we patched the vulnerability immediately. Customers affected by this niche issue have already been notified directly. We learned that our default sharing settings for Row Owners need to improve, and we have committed to changing this sensitive area of our codebase so that modifying these settings requires a more explicit declaration.

  • Gliders in our amazing community first spotted and reported the issue, which was later correlated to unintentional table ejection. Though we requested that our support team begin monitoring for similar reports, it was not immediately apparent to us that these problems were correlated with the changes to the data source sync engine. Engineering should have more swiftly resourced a deeper investigation into these reports. We learned that decision fatigue from working through the incident slowed our decision-making timeline. We are currently investigating how automated processes powered by AI can help us aggregate signals from various sources (the community forum, support channels, telemetry) in order to elevate the severity of reported issues so that they can be prioritized by engineering.
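
As an illustration of the progressive rollout described in the first finding above, the following Python sketch gates a change behind a percentage-based feature flag. It is only a sketch under assumptions: the function names, the flag name, and the use of a SHA-256 hash to assign customers to buckets are illustrative choices, not a description of Glide's flagging system.

```python
import hashlib


def rollout_bucket(flag_name, customer_id):
    """Map a (flag, customer) pair to a stable bucket in the range 0..99.

    Hashing keeps the assignment deterministic, so the same customer
    lands in the same bucket every time the flag is evaluated.
    """
    key = f"{flag_name}:{customer_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, customer_id, percent):
    """Return True if the flag is on for this customer at the given rollout percentage."""
    return rollout_bucket(flag_name, customer_id) < percent


# Hypothetical usage: start at 5% and raise the percentage while watching
# telemetry and error logs. Setting it to 0 turns the change off for everyone.
if is_enabled("sync-checkpointing", "customer-123", percent=5):
    pass  # run the new code path
```

Because each customer's bucket is fixed, raising the percentage only adds customers to the enabled group and never removes those already in it, which keeps behavior stable for early recipients as the rollout widens.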

We are committed to improving the performance of Airtable sync and plan to integrate with the Airtable Webhooks API by the end of the year to sync your Glide app data with Airtable even faster.

We apologize for the disruptions from this outage and appreciate your reports to date and your patience as we work to improve.

Thank you for trusting Glide with your business.

Follow our status page for real-time updates on status changes.

I thought that, despite the severity of the issue at the time, the entire situation was handled very well.

I think comms via the status page but also the feedback from support was great.

Looking forward to seeing more updates!
