Service Outage

Incident Report for Roam

Postmortem

Summary of Impact

Roam was completely unavailable from 11:02 ET to 11:24 ET on May 19, 2023. Chat and Calendar functionality was not restored until 11:44 ET.

Cause

A change meant to improve our ability to debug system issues caused performance problems in our backend systems under certain usage patterns. Those problems then cascaded to other parts of our backend, leading to a complete outage. The same issue was part of the cause of the outage on May 16th. A code fix was made on the night of May 18th, but it was not deployed before this incident.

Remediation Plan

  1. The root cause fix was deployed during the incident.
  2. We have instituted a more formal SRE process and dedicated senior staff to consistent production monitoring and early issue identification. We believe this would have caught the signs of this problem before it became an incident.
  3. We are improving our deployment process to make it clearer which changes have been deployed and to ensure important fixes are deployed in a timely manner.
Posted May 24, 2023 - 17:19 EDT

Resolved

This incident is resolved. We will post a public postmortem by end of day on Monday, May 22nd, 2023.
Posted May 19, 2023 - 12:52 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted May 19, 2023 - 11:44 EDT

Identified

Meetings are working again. Chat and calendar functionality should be restored shortly.
Posted May 19, 2023 - 11:33 EDT

Update

We are continuing to investigate this issue.
Posted May 19, 2023 - 11:10 EDT

Investigating

We're currently investigating a service outage.
Posted May 19, 2023 - 11:10 EDT