Wednesday, September 28, 2022

Microsoft explains one Azure authentication outage as another one happens


In a stroke of bad timing that would be comical if it were not so annoying, Microsoft's multifactor authentication (MFA) system, used for Azure, Office 365, and Dynamics, has gone down for the second time this month, just hours after the company published its findings on a 14-hour outage on November 19.

During that first outage, the Azure Active Directory Multifactor Authentication Services went offline just before 05:00 UTC and remained nonfunctional until just before 19:00 UTC. The servers initially affected were those serving the Europe and Middle East region and the Asia-Pacific region; as these regions woke up and tried to authenticate, the servers overloaded and went down. Microsoft tried to redirect some authentication attempts to US servers, but this overloaded those servers too.
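To see why redirecting traffic can spread an overload rather than relieve it, consider a rough sketch with made-up numbers. The region names, capacities, and request rates below are illustrative assumptions, not figures from Microsoft, and `redirect_overflow` is a hypothetical helper rather than anything in Azure.

```python
def redirect_overflow(load, capacity, fallback):
    """Move each region's excess demand onto the fallback region.

    All figures are hypothetical requests per second, not Microsoft data.
    """
    result = dict(load)
    for region, demand in load.items():
        if region == fallback:
            continue
        # Anything above the region's capacity is sent to the fallback.
        excess = max(0, demand - capacity[region])
        result[region] = demand - excess
        result[fallback] += excess
    return result


# Hypothetical capacities and morning demand, in requests per second.
capacity = {"EMEA": 100, "APAC": 100, "US": 100}
load = {"EMEA": 140, "APAC": 130, "US": 60}

after = redirect_overflow(load, capacity, fallback="US")
overloaded = [region for region in after if after[region] > capacity[region]]

print(after)       # {'EMEA': 100, 'APAC': 100, 'US': 130}
print(overloaded)  # ['US']
```

In this toy model the US region starts comfortably under its limit, yet ends up above it once it absorbs the excess from the other two regions.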

The company's subsequent analysis has shown that three individual bugs came together to cause the problems. On November 19, a code change that had been progressively deployed over the previous six days provoked a cascade of failures. Above a certain traffic level, the new code caused a significant increase in latency between front-end servers and cache servers. This in turn revealed a race condition in back-end servers, causing them to reset front-end servers over and over. That then revealed a third issue: back-end servers would create more and more processes, eventually starving themselves of resources and becoming unresponsive.
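The link between higher latency and process exhaustion can be illustrated with a simplified model. It assumes each in-flight request holds one process until it completes, which is an assumption made for illustration and not a description of Microsoft's code; `simulate_backend` and all of its parameters are hypothetical, and the race condition is not modeled.

```python
def simulate_backend(ticks, arrivals_per_tick, latency_ticks, max_processes):
    """Toy model: each request holds one process for `latency_ticks` ticks.

    Returns the first tick at which the number of live processes exceeds
    `max_processes`, or None if the backend keeps up for the whole run.
    """
    finishing = {}  # tick -> number of processes that complete at that tick
    live = 0
    for tick in range(ticks):
        # Release processes whose requests complete at this tick.
        live -= finishing.pop(tick, 0)
        # Start one process per newly arrived request.
        live += arrivals_per_tick
        done_at = tick + latency_ticks
        finishing[done_at] = finishing.get(done_at, 0) + arrivals_per_tick
        if live > max_processes:
            return tick
    return None


# Same traffic and process limit; only the per-request latency differs.
print(simulate_backend(100, 10, 2, 50))  # None: settles at 20 live processes
print(simulate_backend(100, 10, 8, 50))  # 5: the limit is exceeded at tick 5
```

With traffic held constant, quadrupling the latency pushes the steady-state process count from 20 to 80, past the hypothetical limit of 50.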

Today's problems are still under investigation. The MFA servers have been timing out since 14:25 UTC, causing login attempts to fail when MFA is in use. Currently, the company believes that the resolution of an earlier DNS error has produced a barrage of authentication attempts, essentially flooding the MFA system with more requests than it can handle.
