MongoDB Atlas: Cluster operation delays and UI timeouts

Incident Report for MongoDB Cloud

Postmortem

Summary of Atlas Outage on June 4th, 2025

This document describes an Atlas control plane service disruption that occurred on June 4th, 2025.

Issue Summary

Between 17:05 GMT and 22:54 GMT on June 4th, the MongoDB Atlas control plane experienced a service disruption due to a DNS misconfiguration. Customers were unable to make configuration changes across a range of MongoDB services, including Atlas databases, App Services and Device Sync, Atlas Data Federation, Stream Processing, Backup/Restore, and Atlas Search. The core data plane remained operational, and customer workloads continued uninterrupted. However, during the disruption, customer clusters could not be managed via the UI, Admin API, or the auto-scaling system. Similarly, customers were unable to change network configuration, modify projects, or add or remove database users during this time. The Atlas Web UI was unavailable for a portion of this outage.

This incident was the result of a DNS configuration change that impacted communication among Atlas's internal metadata servers. These servers employ recursive DNS resolution that relies on name servers on the public Internet. An authorized operator executed a planned update to a DNS nameserver record that was believed to be unused. However, that belief was based on an incorrect internal configuration source. The resulting loss of communication between our metadata servers in turn disrupted most operations against the Atlas control plane.
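As an illustration of the failure mode, and not of MongoDB's actual tooling, the sketch below uses the dnspython library to check whether a nameserver record is still referenced by a zone's live delegation before removing it, rather than trusting an internal configuration source. The zone and nameserver names are hypothetical.

    # Sketch: verify against live DNS that a nameserver is truly unused
    # before deleting its record. Names below are hypothetical.
    import dns.resolver

    ZONE = "example-internal-zone.com"        # hypothetical zone
    CANDIDATE_NS = "ns-old.example-dns.com."  # record believed unused

    # Ask the live DNS hierarchy which nameservers the zone delegates to.
    answer = dns.resolver.resolve(ZONE, "NS")
    live_nameservers = {rr.target.to_text() for rr in answer}

    if CANDIDATE_NS in live_nameservers:
        raise SystemExit(
            f"{CANDIDATE_NS} is still in the live delegation for {ZONE}; "
            "refusing to remove it."
        )
    print(f"{CANDIDATE_NS} is not referenced by {ZONE}'s live NS set.")

A check of this shape would have caught the mismatch between the internal configuration source and the records actually in use.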

The operator detected the misconfiguration within minutes, and we immediately rolled back the offending change. However, our recovery was delayed for several reasons. The affected top-level DNS records have a Time To Live (TTL) of two days, so rolling back the misconfiguration did not resolve the problem: recursive resolvers continued serving the stale record from cache until the TTL expired or the cache was purged. We attempted multiple mitigations, including flushing local DNS caches and redirecting to an alternate resolver. After these mitigations proved unsuccessful, we asked our upstream DNS provider to flush the offending DNS records from its long-term cache, which fixed the immediate connectivity problem. Partial recovery was immediate; it took roughly another 60 minutes for all services to resume normal operations as they worked through queued work.
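To make the caching behavior concrete, here is a minimal sketch, assuming the dnspython library and an arbitrary public recursive resolver, of inspecting the remaining TTL on a cached NS record; the queried name is a placeholder.

    # Sketch: a resolver answers from cache until the remaining TTL
    # reaches zero, so a rollback at the authoritative source does not
    # propagate until then. Illustrative only; names are placeholders.
    import dns.resolver

    resolver = dns.resolver.Resolver()
    resolver.nameservers = ["8.8.8.8"]  # query one public recursive resolver

    answer = resolver.resolve("example.com", "NS")
    # answer.rrset.ttl is the *remaining* TTL at this resolver. With a
    # 2-day (172,800 s) original TTL, a stale cached record can outlive
    # a rollback for up to two days unless the cache is flushed.
    print(f"Remaining TTL at this resolver: {answer.rrset.ttl} seconds")
    for rr in answer:
        print(" NS:", rr.target.to_text())

This is why flushing caches, first locally and then at the upstream provider, was the step that actually restored connectivity.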

We are taking a set of corrective actions based on this event. First, we will modify our operational tooling to enforce additional safety checks, especially for changes that modify a top-level DNS record. Second, we are enhancing our existing internal review process for DNS configuration changes. These reviews will include additional testing of such changes in a controlled environment, along with gating mechanisms to reduce blast radius.
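The following is a hedged sketch of the kind of pre-change gate described above; the DnsChange structure, its field names, and the specific checks are all illustrative assumptions, not MongoDB's internal tooling.

    # Sketch: refuse high-blast-radius DNS changes that skipped review
    # steps. All names and fields here are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class DnsChange:
        zone: str
        record_type: str
        verified_in_staging: bool
        override_approved: bool

    def gate(change: DnsChange) -> None:
        """Block risky record types unless staged and approved."""
        high_risk = change.record_type in {"NS", "SOA"}
        if high_risk and not change.verified_in_staging:
            raise PermissionError(
                f"{change.record_type} change to {change.zone} must be "
                "tested in a controlled environment first."
            )
        if high_risk and not change.override_approved:
            raise PermissionError(
                f"{change.record_type} change to {change.zone} requires "
                "explicit second-party approval."
            )

    # Example: this change is rejected before reaching production DNS.
    try:
        gate(DnsChange(zone="example.com", record_type="NS",
                       verified_in_staging=False, override_approved=False))
    except PermissionError as err:
        print("Change blocked:", err)

The design intent is that the tooling, not the operator's belief about whether a record is in use, becomes the final arbiter of whether a high-risk change proceeds.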

Conclusion

We apologize for the impact of this event on our customers. We are aware that this outage affected our customers' operations. MongoDB's highest priorities are security, durability, availability, and performance. We are committed to learning from this event and to updating our internal processes to prevent similar scenarios in the future.

Posted Jun 13, 2025 - 21:16 UTC

Resolved

This incident has been resolved. We have completed all remediations, and systems have returned to normal.
Posted Jun 04, 2025 - 21:28 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jun 04, 2025 - 20:28 UTC

Update

We have identified that cloud.mongodb.com and all subpages, including API endpoints, are down. Affected users may see timeouts, blank pages, or error pages from Cloud. This affects Cloud Manager as well.

Additionally, Atlas Triggers, App Services, and Stream Processing are inactive at this time.

Search index management and changes to Search Nodes are not being processed at this time. Existing search indexes are unaffected.

Existing cluster health is unaffected.
Posted Jun 04, 2025 - 20:09 UTC

Update

We have identified that cloud.mongodb.com and all subpages, including API endpoints, are down. Affected users may see timeouts, blank pages, or error pages from Cloud. This affects Cloud Manager as well.

Additionally, Atlas Triggers, App Services, and Stream Processing are inactive at this time.

Existing cluster health is unaffected.
Posted Jun 04, 2025 - 20:06 UTC

Update

We have identified that cloud.mongodb.com and all subpages, including API endpoints, are down. Affected users may see timeouts, blank pages, or error pages from Cloud. This affects Cloud Manager as well.

Existing cluster health is unaffected.
Posted Jun 04, 2025 - 19:44 UTC

Update

MongoDB Atlas cluster operations are delayed and the web UI is still experiencing delays and failures. Cloud Manager and the admin API for both Atlas and Cloud Manager may also be delayed.

Existing cluster health is unaffected.
Posted Jun 04, 2025 - 19:02 UTC

Update

MongoDB Atlas cluster operations are delayed and the web UI is still experiencing delays and failures.

Existing cluster health is unaffected.
Posted Jun 04, 2025 - 18:16 UTC

Identified

We have identified an issue with MongoDB Atlas cluster operations and the web UI. Affected users may see timeouts when accessing the control panel or delays in creating or modifying clusters.

Existing cluster health is unaffected.
Posted Jun 04, 2025 - 17:30 UTC
This incident affected: MongoDB Cloud, MongoDB Atlas App Services and Device Sync, MongoDB Atlas Data Federation and Online Archive, and MongoDB Atlas Stream Processing.