Atlas operations and Cloud Manager metrics collection delayed

Incident Report for MongoDB Cloud

Postmortem

On 1/9/24, a new version of the "Atlas Proxy" was released. One of the Atlas Proxy's responsibilities is syncing cluster configurations to the Atlas Shared and Atlas Serverless tiers. The new version updated this syncing protocol but contained a bug that, under certain circumstances, caused it to sync cluster configuration far too frequently.

On 1/17/24 at 08:00 UTC, this bug was triggered across the fleet, immediately overloading both the Atlas Control Plane application servers that handle requests from the Atlas Data Plane and the database clusters for those application servers.

At 13:30 UTC, the Atlas Proxies were identified as the root cause of the increased load, and they were throttled. Throttling stopped Shared and Serverless tier cluster changes from propagating; the intention was to take load off the system so that Atlas Dedicated operations could proceed as normal. Unfortunately, the backlog of Atlas Dedicated changes then overloaded numerous other internal services.

The Atlas Proxy throttling had an additional consequence. Ongoing OS security patching caused some servers to reboot during the incident, and after rebooting, the Proxies on those servers could not initialize correctly because they were throttled. This left a set of Shared and Serverless tier customers unable to connect to their clusters.

Over the next few hours, internal systems continued to be repaired. The specific bug in the Atlas Proxy was identified, and a plan to roll it back was established. By 17:00 UTC, Atlas Dedicated user-driven changes were being processed with only small delays. By 23:30 UTC, all internal systems were operating normally, the Atlas Proxy had been rolled back and unthrottled, and there were no longer any delays in processing Atlas cluster changes.

Posted Feb 08, 2024 - 18:54 UTC

Resolved

This incident has been resolved.

Posted Jan 18, 2024 - 14:08 UTC

Update

We are continuing to monitor for any further issues.

Posted Jan 18, 2024 - 00:05 UTC

Monitoring

The system has recovered and we are continuing to monitor.

Posted Jan 18, 2024 - 00:04 UTC

Update

The system is largely recovered, but we are monitoring closely. Issues experienced by some customers connecting to Atlas Shared Tier or Atlas Serverless clusters were resolved around 23:00 UTC.

Connections to Atlas Dedicated clusters and performance of Atlas Dedicated clusters are not and have not been impacted.

Posted Jan 17, 2024 - 23:56 UTC

Update

The system is starting to recover.
- Issues connecting to Atlas Shared Tier and Atlas Serverless clusters should be resolved.
- Metrics for some Atlas Serverless clusters may still be delayed.
- Creation of new Atlas clusters is experiencing a delay due to a delay in DNS propagation.
- User-initiated requests for changes to Atlas clusters are processing in a timely manner, with the exception of changes that require new DNS entries (e.g. adding a new node).

Connections to Atlas Dedicated clusters and performance of Atlas Dedicated clusters are not and have not been impacted.

Posted Jan 17, 2024 - 23:20 UTC

Update

We are continuing to investigate this issue.

Posted Jan 17, 2024 - 23:13 UTC

Update

We are continuing to see delays in metrics for Atlas Shared Tier and Atlas Serverless clusters. Additionally, all modifications to Atlas clusters are delayed, including creation of new clusters/instances, IP access list changes, and user changes. We have also identified an issue with connections to some Atlas Shared Tier and Atlas Serverless clusters.

Posted Jan 17, 2024 - 20:46 UTC

Update

We are continuing to see delays in metrics for Atlas Shared Tier and Atlas Serverless clusters. Additionally, all modifications to Atlas Shared Tier clusters and Atlas Serverless instances are delayed, including creation of new clusters/instances, IP access list changes, and user changes. For Atlas Dedicated clusters, user-requested changes (modifying the Network Access list, adding an additional node in a replica set, adding a new Database User, etc.) are being processed in a timely manner. However, scheduled actions such as snapshots are experiencing delays. The delay for scheduled actions is currently decreasing.

Posted Jan 17, 2024 - 17:20 UTC

Update

Metrics for Atlas Shared Tier and Atlas Serverless clusters are still experiencing delays. User-requested changes to Atlas clusters (modifying the Network Access list, adding an additional node in a replica set, adding a new Database User, etc.) are being processed in a timely manner. However, scheduled actions such as snapshots are experiencing delays.

Posted Jan 17, 2024 - 15:02 UTC

Update

We have partially remediated the issue. However, metrics for Atlas Shared Tier and Atlas Serverless clusters are still experiencing delays. Changes to Atlas clusters (including IP Access List modifications, topology and tier changes, or the creation of new clusters) may still be experiencing some delays.

Posted Jan 17, 2024 - 13:33 UTC

Update

We are continuing to investigate this issue. No root cause or remediation has been reached at this time.

Posted Jan 17, 2024 - 12:07 UTC

Update

We are continuing to investigate this issue. No root cause or remediation has been reached at this time.

Posted Jan 17, 2024 - 11:03 UTC

Update

Atlas clusters may show as Degraded due to the delayed metrics data; however, this is a false positive and there is no impact to the clusters themselves. There may be a delay in processing requested changes to Atlas clusters.

Posted Jan 17, 2024 - 09:07 UTC

Investigating

We are investigating an issue delaying the processing of metrics points for Atlas and Cloud Manager.

Posted Jan 17, 2024 - 08:33 UTC

This incident affected: MongoDB Cloud.