The reconstruction of an outage
By Peer Heinlein, mailbox.org CEO
On Tuesday afternoon, a short circuit in a 10 kV line in Berlin's Tiergarten district caused a power outage that also damaged the emergency power system of a large data center we use there. As a result, the data center itself lost power for an extended period, and tens of thousands of servers went down. Internet service was disrupted across large parts of Berlin, and mailbox.org suffered a prolonged outage.
mailbox.org deliberately runs two data centers in parallel so that it can compensate for outages of exactly this kind. Nevertheless, our operations were disrupted. The cause is not easy to explain, because there was no single, classic failure. Instead, a chain of several small problems and unfortunate circumstances led to the outage in both data centers.
The non-technical short version
No data was lost at any time.
A power outage in Berlin, combined with the failure of an emergency power system, cut power across a data center we use. Internet service in Berlin was disrupted over a wide area during the afternoon and evening, and tens of thousands of servers lost power. Because of several complications, which we have since analyzed and resolved, our second data center could not take over operations seamlessly as intended. After power was restored, it took our team another two and a half hours to bring all our services back online. Given the severity and scope of the outage, this is perhaps not an unusually long time. Nevertheless, we have analyzed in technical detail why our backup data center could not take over without interruption, and we have fixed the causes. For interested technicians :-), we explain the details below.
The long and technical version
For the sake of transparency, we will try to reconstruct the sequence of events here. This description is unavoidably technical, because the interdependencies involved cannot be explained any other way. Some facts below are shortened and simplified, and for security reasons we cannot publish certain technical details of our infrastructure.
mailbox.org and Heinlein Hosting operate two physically separate server locations in Berlin. To do this, we house our own technology and infrastructure with two different data center providers. Much like a shopping mall, these providers operate the building, the air conditioning, and the power systems, while we are responsible for everything within our share of the space: servers, network traffic, and so on.
At both locations, we operate virtualization systems that are physically and also largely logically independent of each other, as well as large hard disk storage o