A power outage in the data centre led to downtime at mailbox.org

2022-04-20

The reconstruction of an outage

By Peer Heinlein, mailbox.org CEO

On Tuesday afternoon, a short circuit in a 10 kV line in the Tiergarten area of Berlin caused a power outage that also damaged the emergency power system of a large data center we use there. As a result, the data center itself lost power for a prolonged period and tens of thousands of servers went down. There were widespread Internet disruptions in Berlin and also a prolonged outage at mailbox.org.

mailbox.org deliberately uses two data centers in parallel so that it can compensate for outages of this kind. Nevertheless, there were operational disruptions. The cause is not easy to explain, because there was not "one big classic problem". Rather, a chain of several small problems and unfortunate circumstances led to the outage in both data centers.

The non-technical short version

No data was lost.

A power outage in Berlin and the failure of an emergency power system led to a widespread power outage at a data center we use. Internet service in Berlin was disrupted over a large area in the afternoon and evening hours, and tens of thousands of servers lost power. Due to various complications that have since been analyzed and clarified, our second data center was not able to take over operations without disruption as expected. After the power supply was restored, it took our team another two and a half hours until all our services were available again. Considering the severity and scope of the outage, this may not be an unusual amount of time. However, we have analyzed why our replacement data center could not take over the service without interruption and have remedied the causes - the details for interested technicians :-) follow below in this article.

The long and technical version

For the sake of transparency, we will try to reconstruct the sequence of events here. It cannot be avoided that this will be a technical description, as the technical interrelationships cannot be explained in any other way. In the following, some facts are shortened and simplified and some technical details of our infrastructure cannot be published for security reasons.

mailbox.org and Heinlein Hosting operate two physically completely separate server locations in Berlin. For this purpose we rent space for our own technology and infrastructure from two different data center providers. Like a "shopping mall", these providers operate the building, the air conditioning, and the power systems, while we are responsible for what happens in our share of the space: servers, data traffic, and so on.

At both locations, we operate virtualization systems that are physically and also largely logically independent of each other, as well as large hard disk storage on which our services run. Both virtualization clusters are connected via a common control unit. Should this fail, the clusters can continue to run autonomously, but restarts or other work may then not be possible.

If one site or virtualization cluster fails, the second site should ideally continue to function without interruption, or at least be able to take over operation within a reasonable time after minor adaptation or conversion work.

1. The power outage in Berlin

A power failure in the power grid happens every now and then. Data centers are equipped against this with battery buffers and large emergency power systems and usually even have two separate power circuits for parallel internal supply.

After all, the one thing that must not happen during a public power outage is for the outage to reach the inside of the data center with its tens of thousands of servers.

What happened: At one of our two sites, the provider's emergency power system failed and tens of thousands of servers from a wide range of providers and companies went offline. Our hardware, too. This "shouldn't" happen, but "can" happen and "has" happened. We, too, are eagerly awaiting the root cause analysis and report from the data center provider. According to initial feedback, the short-circuit of a 10,000-volt line in the Berlin city grid also blew through into the data center in such a way that the battery system of the emergency power supply also suffered damage.

The consequences were felt by numerous Internet services: some suffered large-scale disruptions lasting for hours until late at night, the Berlin Internet exchange BCIX was partially affected for a time, and in the end even a daily newspaper could no longer produce a print edition.

It is rare for not just individual devices but the entire IT infrastructure at a site to go offline, and because of the sheer volume involved, it takes a lot of effort to restore everything.

But to minimize this, we operate our servers at several sites - and our second site had of course not been affected by the power outage and failure of the emergency power systems. Nevertheless, there was a noticeable disruption, even for users. Why?

2. Unexpected disruptions in two data centres

After a major alarm and an "all hands on deck," our team found a very unclear and contradictory picture with numerous malfunctions at the second site as well: Numerous systems were running, yet there were impairments and failures that could not be explained at first.

It quickly became apparent that even in the second location, which was not actually affected, the virtualization cluster with hundreds of our systems was "up and running", but could no longer be properly controlled and addressed, and some servers were no longer working reliably. To speed up troubleshooting, we called in external experts around 4 p.m. who specialize in the VMware virtualization software we use there, but even with their combined efforts it was difficult to explain the symptoms and problems.

In the end, the cause turned out to be a malfunction in our Domain Name System, i.e. the system for resolving server hostnames to IP addresses, that was still ongoing at that point. For this purpose, we operate several so-called DNS resolvers, which are distributed across both data centers so that the total failure of one site can be absorbed here as well.

Two of our three DNS resolvers were affected by the outage - but this was not a problem for our servers at first and did not show up as a failure, because the third DNS resolver continued to work without any problems and only one working system was needed. Hundreds of systems therefore continued to function as planned for the time being.
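The behaviour described above, where a client simply moves on to the next resolver in its list, can be illustrated with a minimal sketch. It is not our production code: the resolvers are simulated as plain Python functions, and the names `resolve_with_fallback`, `broken`, and `working` as well as the address `192.0.2.10` are hypothetical.

```python
from typing import Callable, Optional, Sequence

# A simulated resolver: takes a hostname, returns an IP address or raises.
Resolver = Callable[[str], Optional[str]]


def resolve_with_fallback(resolvers: Sequence[Resolver], hostname: str) -> Optional[str]:
    """Ask each configured resolver in order and return the first answer."""
    for resolver in resolvers:
        try:
            answer = resolver(hostname)
        except OSError:
            # Timeout or network error: try the next resolver in the list.
            continue
        if answer is not None:
            return answer
    return None  # every configured resolver failed


def broken(hostname: str) -> Optional[str]:
    """Hypothetical resolver that never answers."""
    raise TimeoutError("no answer from resolver")


def working(hostname: str) -> Optional[str]:
    """Hypothetical resolver that always answers with a documentation address."""
    return "192.0.2.10"


if __name__ == "__main__":
    # Two of three resolvers are down, yet the lookup still succeeds.
    print(resolve_with_fallback([broken, broken, working], "mail.example"))
```

As long as at least one working resolver is in the list, the caller never notices the other failures, which is also why such failures can go unnoticed.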

3. Undetected malfunctions of a DNS resolver and VMware specifics

Unlike other (Linux) server systems, however, our VMware virtualisation environment can only use two different DNS servers at the same time instead of the usual three. We take care to configure resolvers from each location. Nevertheless, DNS resolvers at both locations were disturbed, and VMware happened to use exactly and exclusively the two failed systems.
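Why a two-resolver configuration is more fragile than a three-resolver one can be shown with a small enumeration. The following is only a sketch under assumptions: the resolver names `r1` to `r3` and the function `risky_failure_sets` are hypothetical and do not reflect our real setup.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def risky_failure_sets(
    client_resolvers: Iterable[str],
    all_resolvers: Iterable[str],
    max_failures: int,
) -> List[Tuple[str, ...]]:
    """List every combination of up to max_failures failed resolvers
    that leaves the client without a single working resolver."""
    client = set(client_resolvers)
    risky = []
    for count in range(1, max_failures + 1):
        for failed in combinations(sorted(all_resolvers), count):
            # The client is cut off if all of its resolvers are among the failed ones.
            if client <= set(failed):
                risky.append(failed)
    return risky


if __name__ == "__main__":
    resolvers = ["r1", "r2", "r3"]  # hypothetical names
    # A client with all three resolvers survives any double failure.
    print(risky_failure_sets(["r1", "r2", "r3"], resolvers, max_failures=2))
    # A client limited to two resolvers is cut off by one specific double failure.
    print(risky_failure_sets(["r1", "r2"], resolvers, max_failures=2))
```

A check of this kind could be run against the real resolver assignments to find clients that a double failure would cut off completely.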

As a result, the virtualisation servers involved could no longer "see" each other properly at all locations and could no longer communicate properly internally, so that the loss of control of the virtualisation cluster occurred at both locations at the same time.

For us, this circumstance, which is very banal in itself, was difficult to recognise, because all other systems could still be operated and queried without any problems thanks to the third DNS resolver, which was still running. In addition, the two failed DNS resolvers appeared intact from the outside and we assumed for a long time that they were functioning normally, so they were not the focus of the analysis.
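A resolver that looks intact from the outside but no longer answers queries can only be caught by a check that sends a real DNS query to that specific resolver. The following minimal sketch uses only the Python standard library and builds the query by hand in the RFC 1035 wire format. It is an illustration, not our monitoring code: the function names, the fixed query ID, and the two-second timeout are assumptions.

```python
import socket
import struct


def build_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: each label prefixed by its length, ended by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def answer_is_usable(response: bytes, query_id: int = 0x1234) -> bool:
    """True only for a matching, error-free response with at least one answer."""
    if len(response) < 12:
        return False
    rid, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    is_response = bool(flags & 0x8000)  # QR bit set
    rcode = flags & 0x000F              # 0 means "no error"
    return rid == query_id and is_response and rcode == 0 and ancount > 0


def resolver_answers(resolver_ip: str, hostname: str, timeout: float = 2.0) -> bool:
    """Send one query over UDP to a specific resolver and judge the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (resolver_ip, 53))
            response, _ = sock.recvfrom(512)
        except OSError:
            # Timeout or network error counts as "resolver not working".
            return False
    return answer_is_usable(response)
```

The point of the design is that a resolver counts as healthy only if it returns a usable answer; a reachable host or an open port is not enough.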

4. Looking for the cause in the wrong place

Our team, together with the external experts, therefore spent (too) long looking for a problem within the virtualisation solution and only belatedly realised that the symptoms were caused by the failed DNS resolvers, which looked faultless from the outside.

After we restarted one of the two affected DNS resolvers, all problems in the cluster resolved abruptly. We regained control and were able to begin restarting failed and disturbed servers around 5:30 p.m., so that all services were available again at both locations relatively quickly, around 6 p.m. Our team was then busy until after midnight restoring numerous other systems, while mailbox.org was already up and running again.

Our status page "https://status.mailbox.org" is automatically controlled by our monitoring system so that our admin team can concentrate fully on error analysis in the event of disruptions. Due to the widespread outage, one monitoring system was also affected, so the status display continued to show "green" for the first hour before we manually set the fault. This caused confusion for some users - but we ask for your understanding that in this situation we first took care of the root cause analysis.
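One possible safeguard against a status page that stays "green" because the monitoring itself has gone silent is to treat old reports as "unknown". The following is only a sketch of that idea, not a description of our status page: the function `displayed_status`, the status strings, and the five-minute limit are assumptions.

```python
from datetime import datetime, timedelta, timezone


def displayed_status(
    last_report_ok: bool,
    last_report_at: datetime,
    now: datetime,
    max_age: timedelta = timedelta(minutes=5),  # assumed freshness limit
) -> str:
    """Decide what a status page should show for one service."""
    # A report that is too old says nothing about the current state.
    if now - last_report_at > max_age:
        return "unknown"
    return "green" if last_report_ok else "red"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # The last report was fine, but it is half an hour old.
    print(displayed_status(True, now - timedelta(minutes=30), now))
```

With such a rule, a failed monitoring system leads to an "unknown" display instead of a misleading "green".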

The conclusion

A power outage in the data centre must not happen, and we are looking forward to the final report from our provider. However, we are prepared for this case ourselves: we could have continued operating from the second location even during a prolonged outage, and in fact the second location should have continued operating without interruption.
The fact that the virtualisation clusters at both locations used the two failed DNS resolvers, and were therefore affected logically rather than physically, was very unlucky, but must not happen either. We will therefore carry out technical modifications in several places so that this problem can no longer occur, and certainly can no longer occur undetected. Moreover, we now know how to interpret the resulting (very crazy) symptoms.
But apart from that, none of this would have been a problem if there had not also been a small network-related peculiarity due to maintenance work. This peculiarity is in itself known, trivial, and harmless, but in the end it contributed to the failure of the DNS resolver at the uninvolved site. It was one link in a cascade of very small, foreseeable, and planned faults, and the sum of these individually harmless faults led to the escalation.
The major malfunction was the combined result of about half a dozen individually harmless small things. Only because all of these problems occurred together could the overall failure happen; had any one of them been absent, it would not have occurred.
Our incident team of about 20 members came together within minutes and cooperated well; the support we had arranged from external specialists also worked.
Data and e-mails were not lost at any time.
After the power outage at around 3 pm, our services were available again around 6 pm.
The causes are known, analysed, and understood, and can be ruled out in the future by taking them into account and making the corresponding modifications.