On 1/9/24, a new version of the "Atlas Proxy" was released. One of the Atlas Proxies's responsibilities is syncing cluster configurations to the Atlas Shared and Atlas Serverless tiers. The new version updated this syncing protocol, but contained a bug where under certain circumstances it would sync cluster configuration much too frequently.
On 1/17/24 at 08:00 UTC, this bug was triggered across the fleet, immediately overloading both the Atlas Control Plane application servers that handle requests from the Atlas Data Plane and the database clusters for those application servers.
At 13:30 UTC, the Atlas Proxies were identified as the root cause of the increased load, and they were throttled. This stopped Shared/Serverless tier cluster changes from propagating with the intention of taking load off the system so Atlas Dedicated operations could proceed as normal. Unfortunately, the backlog of Atlas Dedicated changes overloaded numerous other internal services.
An additional consequence of the Atlas Proxy throttling was that ongoing OS security patching caused some servers to reboot during the incident, and after the reboot, the Proxies were not able to initialize correctly due to being throttled. This left a set of Shared and Serverless Tier customers unable to connect to their clusters.
Over the next few hours, internal systems continued to be repaired. The specific bug in the Atlas Proxy was identified, and a plan to roll it back was established. By 17:00 UTC, Atlas Dedicated user driven changes were being processed with only small delays. By 23:30 UTC, all internal systems were operating normally, the Atlas Proxy was rolled back/unthrottled, and there were no longer any delays in processing Atlas cluster changes.