Issue with Azure clusters with KeyVault enabled

Incident Report for MongoDB Cloud

Postmortem

Executive Summary

  • Incident Date/Time: August 11, 2025, 14:05 - 16:19 UTC
  • Duration: 2 hours 14 minutes
  • Impact: Multiple Atlas services experienced disruptions due to IP access restrictions following a control plane infrastructure change
  • Root Cause: The addition of new NAT gateway IP addresses triggered network access failures for services with IP allowlisting
  • Status: Resolved

What Happened

On August 11, 2025, at 14:05 UTC, we implemented a planned infrastructure change to add IP addresses to our control plane NAT gateways. Although we announced this change in advance on June 30, 2025, we acknowledge that our communication and the resulting customer preparations were not sufficient to prevent service disruptions for customers with IP access restrictions. Upon detecting the customer impact, we initiated a rollback of the infrastructure change to restore service and prevent further disruptions. The rollback was completed by 16:19 UTC, returning all affected services to their previous operational state.

Impact Assessment

  • Affected Services: Atlas Clusters with BYOK encryption, Atlas login/signup, App Services, MongoDB Charts
  • Geographic Scope: Global
  • Customer Impact:

    • Customers with Bring Your Own Key (BYOK) encryption and IP restrictions may have experienced cluster shutdowns of varying durations
    • Customers experienced intermittent login/signup failures for 12 minutes (14:09-14:21 UTC)
    • App Services and Charts users experienced partial service failures
  • Peak Impact Period: 14:05-16:19 UTC

Root Cause Analysis

The addition of new IP addresses to our control plane caused network access failures for customers who had configured IP allowlists that didn't include the new addresses. (Please note that these allowlists are not the same as the cluster IP allowlist, which you use to control which clients can connect to your cluster.)
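
As an illustration of this failure mode, the following minimal Python sketch checks whether a set of source IPs is fully covered by an allowlist. The function name and all addresses are hypothetical; the addresses come from the RFC 5737 documentation ranges and are not real Atlas control plane IPs.

```python
import ipaddress


def uncovered_sources(allowlist, source_ips):
    """Return the source IPs that no allowlist entry covers.

    Allowlist entries may be single addresses or CIDR ranges.
    """
    networks = [ipaddress.ip_network(entry, strict=False) for entry in allowlist]
    missing = []
    for ip in source_ips:
        addr = ipaddress.ip_address(ip)
        if not any(addr in net for net in networks):
            missing.append(ip)
    return missing


# Hypothetical values for illustration only.
allowlist = ["192.0.2.0/28", "198.51.100.7"]
old_sources = ["192.0.2.4", "198.51.100.7"]
new_sources = old_sources + ["203.0.113.10"]  # an address added later

print(uncovered_sources(allowlist, old_sources))
print(uncovered_sources(allowlist, new_sources))
```

Running a check like this against the published control plane IP list before a change takes effect would flag any address that a firewall rule set does not yet admit.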

The primary issues were:

  1. BYOK Encryption Validation: Our key validation process (running every 15 minutes) failed on operations from the new IPs. Due to a flaw in our error handling logic, the system incorrectly interpreted these network failures as an intentional revocation of access to the encryption key and automatically shut down affected clusters.
  2. Identity Provider: The new IP addresses weren't allowlisted in our identity provider, resulting in degraded registrations and logins until those IPs were allowed.
  3. Service Authentication: App Services experienced partial service failures because requests to the Atlas API originating from Triggers' new IP addresses were blocked by an outdated internal IP allowlist.
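
The error handling flaw in item 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual Atlas validation code: the exception types, the KeyStatus enum, and both functions are hypothetical, and this report does not describe how the two failure modes are told apart in practice. The point is only that an inconclusive check (the key provider could not be reached) is kept separate from an explicit revocation, and only the latter leads to a shutdown.

```python
from enum import Enum


class KeyStatus(Enum):
    VALID = "valid"
    REVOKED = "revoked"
    UNKNOWN = "unknown"  # the check was inconclusive


class KeyAccessDenied(Exception):
    """Hypothetical: the key provider explicitly refused access to the key."""


class KeyProviderUnreachable(Exception):
    """Hypothetical: the request was blocked or never reached the key provider."""


def classify_validation(check_key):
    """Run one validation attempt and classify its outcome."""
    try:
        check_key()
    except KeyAccessDenied:
        return KeyStatus.REVOKED
    except KeyProviderUnreachable:
        # A network-level failure says nothing about the customer's intent.
        return KeyStatus.UNKNOWN
    return KeyStatus.VALID


def should_shut_down(status):
    """Shut down only on an explicit revocation, never on an inconclusive check."""
    return status is KeyStatus.REVOKED


def blocked_by_firewall():
    raise KeyProviderUnreachable("request from a non-allowlisted IP was rejected")


def key_revoked():
    raise KeyAccessDenied("access to the key was withdrawn")


for check in (lambda: None, blocked_by_firewall, key_revoked):
    status = classify_validation(check)
    print(status.name, should_shut_down(status))
```

In a design like this, an UNKNOWN result would typically trigger a retry or an alert rather than a destructive action.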

Prevention

Immediate fixes (Already implemented):

  • Rolled back the NAT gateway change
  • Patched KeyVault validation logic to prevent erroneous cluster shutdowns

Next steps:

  • We will share a rollout plan via our status page as planned maintenance, with two weeks' notice

Conclusion

We acknowledge the significant disruption this incident caused and its impact on your applications and business operations. We are committed to preventing similar issues in the future. Although we communicated the upcoming IP changes in advance, we take full responsibility for the conditions that led to these failures. We are implementing the improvements outlined in this postmortem and will continue to invest in more resilient infrastructure change processes to ensure the reliability and stability of our services.

Posted Aug 15, 2025 - 17:50 UTC

Resolved

This incident has been resolved.
Posted Aug 11, 2025 - 17:22 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Aug 11, 2025 - 17:22 UTC

Update

Infrastructure rollback has completed and affected Azure clusters are recovering. The previously suggested firewall update is no longer necessary.
Posted Aug 11, 2025 - 16:26 UTC

Update

We are rolling back an infrastructure change, which should allow Azure clusters with KeyVault to recover. Affected customers can restore access more quickly by updating their KeyVault firewall with the latest Atlas control plane IP addresses: https://www.mongodb.com/docs/atlas/setup-cluster-security/#allow-access-to-or-from-the-service-control-plane
Posted Aug 11, 2025 - 16:09 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Aug 11, 2025 - 15:58 UTC

Investigating

We are currently investigating an issue affecting Azure clusters with KeyVault enabled.
Posted Aug 11, 2025 - 15:53 UTC
This incident affected: MongoDB Cloud.