Microsoft has revealed that Thursday’s global outage was caused by a code flaw that left the Azure DNS service overwhelmed and unable to respond to DNS queries.
At around 5:21 pm Eastern Time on Thursday, Microsoft suffered a global outage that prevented users from accessing or logging in to many services, including Xbox Live, Microsoft Office, SharePoint Online, Microsoft Intune, Dynamics 365, Microsoft Teams, Skype, Exchange Online, OneDrive, Yammer, Power BI, Power Apps, OneNote, Microsoft Hosted Desktop, and Microsoft Streams.
The service is so widespread in Microsoft’s …
Microsoft eventually resolved the outage at approximately 6:30 pm Eastern Time, although some of these services took longer to return to normal operation.
At the time, Microsoft stated that the outage was caused by DNS issues, but did not provide further information.
Azure DNS service becomes overloaded
Last night, Microsoft released a Root Cause Analysis (RCA) for this week’s outage and explained that it was caused by an overload of its Azure DNS service.
Microsoft’s Azure DNS is a global network of redundant name servers that provides high availability and fast DNS services.
According to Microsoft, the Azure DNS service began receiving an “extraordinary surge” of DNS queries from all over the world for certain domains hosted on Azure. Although Microsoft did not explain the reason for this abnormal surge, it may have been a DDoS attack targeting those domains.
Microsoft pointed out that its DNS service can usually absorb a large number of requests through DNS caching and traffic shaping. However, a code flaw prevented its DNS edge cache from working properly.
“Azure DNS servers experienced an abnormal surge in DNS queries from around the world, targeting a set of domains hosted on Azure. Normally, Azure’s caching layers and traffic shaping would mitigate this surge. In this incident, a specific sequence of events exposed a code flaw in our DNS service that reduced the efficiency of our DNS edge cache.”
“As our DNS service became overloaded, DNS clients began to retry their requests frequently, which increased the workload of the DNS service. Since client retries are considered legitimate DNS traffic, our spike mitigation systems did not drop this traffic. The increase in traffic reduced the availability of our DNS service.”
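The feedback loop described in the RCA can be illustrated with a toy model. This is only a sketch under stated assumptions, not Microsoft's implementation: the function name `simulate`, the capacity figure, the cache hit ratios, and the retry limit are all hypothetical. It assumes that the edge cache answers a fixed fraction of new queries, that the backend answers at most `capacity` queries per interval, and that every unanswered query is retried in the next interval until the client gives up.

```python
def simulate(incoming, capacity, hit_ratio, max_retries, steps):
    """Return the backend load per interval for a toy DNS service.

    incoming    -- new client queries arriving each interval (hypothetical)
    capacity    -- queries the backend can answer per interval (hypothetical)
    hit_ratio   -- fraction of new queries answered by the edge cache (0..1)
    max_retries -- how many times a client retries an unanswered query
    steps       -- number of intervals to simulate
    """
    # retrying[k] holds queries that have already failed k + 1 times
    # and will be sent again in the next interval.
    retrying = [0.0] * max_retries
    loads = []
    for _ in range(steps):
        fresh = incoming * (1.0 - hit_ratio)  # cache misses reach the backend
        attempts = [fresh] + retrying         # first tries plus all retries
        load = sum(attempts)
        loads.append(load)
        # Fraction of backend queries that go unanswered in this interval.
        fail_fraction = max(0.0, 1.0 - capacity / load) if load > 0 else 0.0
        # Failed queries move to the next retry cohort; the last cohort gives up.
        retrying = [a * fail_fraction for a in attempts[:max_retries]]
    return loads


# Healthy cache: misses stay below capacity, so no retries build up.
healthy = simulate(incoming=1000, capacity=300, hit_ratio=0.75,
                   max_retries=3, steps=6)
# Degraded cache: misses exceed capacity and retries pile onto the backend.
degraded = simulate(incoming=1000, capacity=300, hit_ratio=0.5,
                    max_retries=3, steps=6)
print("healthy: ", [round(x) for x in healthy])
print("degraded:", [round(x) for x in degraded])
```

In this model, the number of new client queries is the same in both runs. Only the drop in cache efficiency pushes the backend past its capacity, after which retries keep raising the load even though no new clients have arrived.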
Since almost all Microsoft domains are resolved through Azure DNS, once the DNS service became overloaded it was no longer possible to resolve host names on these domains or access the associated services.
For example, the xboxlive.com domain uses the following Azure DNS name servers to resolve host names on that domain.
NS1-205.AZURE-DNS.COM
NS2-205.AZURE-DNS.NET
NS3-205.AZURE-DNS.ORG
NS4-205.AZURE-DNS.INFO
Because xboxlive.com is hosted on Azure DNS and the service was unavailable, users could no longer log in to Xbox Live.
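The dependency described above can be checked mechanically: if every name server for a domain belongs to the same DNS provider, an outage at that provider leaves nothing else to answer for the domain. The following is a minimal sketch that works on an already known list of name servers and performs no live DNS lookup. The helper name `all_on_azure_dns` is hypothetical, and the suffix list is an assumption derived only from the four name servers listed above.

```python
# Suffixes taken from the name servers listed in the article (assumption:
# these four zones are treated as "Azure DNS" for this illustration).
AZURE_DNS_SUFFIXES = (
    ".azure-dns.com",
    ".azure-dns.net",
    ".azure-dns.org",
    ".azure-dns.info",
)


def all_on_azure_dns(name_servers):
    """Return True if the list is non-empty and every host name in it
    ends with one of the Azure DNS suffixes above."""
    # Normalize: trim whitespace, drop a trailing root dot, lowercase.
    hosts = [ns.strip().rstrip(".").lower() for ns in name_servers]
    return bool(hosts) and all(h.endswith(AZURE_DNS_SUFFIXES) for h in hosts)


xboxlive_ns = [
    "NS1-205.AZURE-DNS.COM",
    "NS2-205.AZURE-DNS.NET",
    "NS3-205.AZURE-DNS.ORG",
    "NS4-205.AZURE-DNS.INFO",
]
print(all_on_azure_dns(xboxlive_ns))
```

A domain that also listed a name server from another provider would fail this check, which is the situation in which resolution could continue during an outage at a single provider.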
To prevent such interruptions in the future, Microsoft stated that it is fixing the code flaw in Azure DNS so that the DNS cache can adequately handle a large number of requests. The company also plans to improve its monitoring and mitigation of abnormal traffic.
BleepingComputer has contacted Microsoft to learn more about this abnormal surge, but has not yet received a response.