Server offline
Incident Report for DentalRules
Postmortem

Statement from the web hosting company:

On 17 January, a technician from one of our suppliers was carrying out routine hardware maintenance on a server in the Equinix data center. At 12:10, both power feeds to equipment in that rack, including two storage switches, were interrupted.

The cause appears to be a combination of circumstances in which both power feeds on the ATS PDU were interrupted. An ATS (Automatic Transfer Switch) switches between the A and B feeds when one power feed fails. Because both feeds were lost, both storage switches connected in that rack failed. As soon as we suspected that a power outage was behind the problem, we sent engineers to the data center to assess and repair the situation on site.

Because a redundant part of the storage network failed, SAN storage became unavailable for a large share of our customers in the Equinix data center, which caused virtual servers to crash.

After power was restored by an on-site Cyso engineer at 12:59, we began the process of rebooting the affected servers and checking for file system errors as needed due to the unexpected SAN storage failure. By 14:35, 80% of the systems experiencing problems were back online. At 16:15 this had risen to 95% and at 19:30 the very last error messages had been resolved.

Since we have our infrastructure connected to two different power feeds in the data center, we were caught off guard by this equipment failure. We are going to check the cabling in the data center to find out why this went wrong and take measures to prevent this in the future. In addition, until further notice, we will only allow external suppliers access to our data centers under the supervision of one of our own engineers. Finally, we will investigate whether it is possible to increase the robustness of servers in the event of an unexpected loss of SAN storage.

Posted Jan 21, 2022 - 22:03 CET

Resolved
This incident has been resolved.
Posted Jan 17, 2022 - 14:56 CET
Identified
There was an outage at Cyso this morning. They are working through it everywhere this afternoon. They had no idea that it had made the data center unreachable. They will resolve this quickly.
Posted Jan 17, 2022 - 14:40 CET
This incident affected: Portal.