MongoDB Atlas Serverless: Connection errors in GCP us-central1

Incident Report for MongoDB Cloud

Postmortem

Summary

For approximately 12 hours, between 2023-03-28 02:08 UTC and 2023-03-28 14:19 UTC, some MongoDB Atlas Serverless customers in the GCP region us-central1 experienced a complete outage. All connections to affected serverless instances failed during this period.

Root Cause and Remediation

Around 2023-03-28 02:08 UTC, a single hardware cluster in our GCP us-central1 serverless deployment experienced a spike in load. This triggered a series of load balancing steps within our infrastructure to migrate load to other healthy hardware in the region and maintain quality of service (QoS). The load balancing process eventually failed, which triggered a rare memory allocation bug in our connection-handling proxies. As a result, all serverless instances allocated to this hardware cluster experienced downtime.

Our internal alerting system detected the outage, but the alerts were not configured to page the appropriate engineering team, so the issue was not noticed immediately. At the beginning of the following business day, our engineers began investigating and remediating the problem. About 12 hours after the issue began, it was resolved and service was restored to the impacted serverless instances.

Prevention and Follow-Up

We have identified the memory allocation bugs that caused the outage and have prepared patches to address the root cause. We have also scheduled further work to close the alerting gap that allowed the outage to go unnoticed for much longer than it should have.
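
As an illustration of the kind of check that can catch an alerting gap like the one described above, here is a minimal sketch in Python. It is not MongoDB's tooling: the `AlertRule` structure, the severity labels, the channel names, and the example rule names are all hypothetical, chosen only to show the idea of auditing that every critical alert routes to a channel that pages an on-call engineer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertRule:
    """Hypothetical alert rule: a name, a severity, and the channels it notifies."""
    name: str
    severity: str
    channels: List[str] = field(default_factory=list)


# Assumption: the set of channels that actually page an on-call engineer.
# The channel name is illustrative only.
PAGING_CHANNELS = {"pager"}


def unpaged_critical_alerts(rules):
    """Return the names of critical rules that notify no paging channel."""
    return [
        rule.name
        for rule in rules
        if rule.severity == "critical"
        and not PAGING_CHANNELS.intersection(rule.channels)
    ]


# Illustrative rules only; they do not describe any real configuration.
rules = [
    AlertRule("proxy_connection_failures", "critical", ["email"]),
    AlertRule("cluster_load_spike", "critical", ["pager", "email"]),
    AlertRule("disk_usage_warning", "warning", ["email"]),
]

gaps = unpaged_critical_alerts(rules)
print(gaps)
```

Run as part of a deployment or configuration review, a check along these lines would flag any critical alert that fires without paging anyone, so the gap is found before an incident rather than during one.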

Posted Apr 28, 2023 - 20:20 UTC

Resolved

From approximately 2:12 UTC until 14:20 UTC today, Atlas Serverless instances in GCP region us-central1 may have experienced connection errors.

Posted Mar 28, 2023 - 02:00 UTC