-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
== What happened ==
On Wednesday February 24th, at 6pm UTC, the DC hosting some of the CentOS equipment used for various roles suffered multiple power outages. The facility was completely dark for just under 2 hrs, and we were able to start recovering services by 8pm UTC. By midnight we had most services restored, and by 2:00AM UTC on Feb 25th we had all services restored.
That meant that the machines in those racks were running on batteries (the UPS in the racks) but eventually went down in an uncontrolled way due to a lack of communication with that UPS.
Subsequently, on Monday March 14th, we suffered another power outage in the racks, this time due to an overload on the rack power circuits.
== Services that were impacted ==
- severity critical : the mirrorlist.centos.org node for IPv6 went down (while multiple mirrorlist.centos.org nodes for IPv4 were still online). That means that machines with only IPv6 connectivity couldn't get yum to retrieve the list of nearest mirrors.
- severity medium : our main build services queue management services were down; note: this did not impact our ability to build, test and deliver updates.
- severity medium : www.centos.org and www.centos.org/forums weren't reachable through IPv6 : at the moment, those services are natively reachable through IPv4, but proxied through nodes in that DC for IPv6 users. Most tested browsers fell back to IPv4 during that period.
- severity medium : CentOS DevCloud (https://wiki.centos.org/DevCloud) : CentOS Developers weren't able to instantiate new CentOS test VMs for their work, and also weren't able to reach the existing ones.
- severity low : several small public-facing services like http://planet.centos.org , http://seven.centos.org (not critical, and could be restored quickly to other VMs elsewhere)
- severity low : the server leading the armv7hl builds for the Plague build farm was also offline, meaning no armhfp builds during that timeframe (but no updates were to be built, so the issue was mitigated)
== Follow-up actions and notes ==
Over the years, the baseline recovery model we've used and tried to enforce is one of 'restore in place': take a downtime hit if needed, and ensure we have service continuity for the user-facing components (the mirrorlist service, the CentOS update and content distribution services). For other resources, like the main website, we ensure there are good backups available in multiple places, usable to restore services should there be a need. This model has worked well for us over the years, and we've had very few, if any, service outages that had a user impact. The restore in place/restore outside HA approach also meant we were able to better utilise the exclusively sponsored machines we rely on.
However, as the project grows, with a lot more infrastructure being consolidated into a few locations for non-CDN services, our exposure to service downtime has dramatically increased. It's clear that we need to expand the scope of where we back up to, how we back up, how we anticipate failure, and our ability to restore services in a timely manner should there be facility outages. In the coming weeks, we are going to undertake a deep dive into our infrastructure design and delivery. We will first try to come up with a consolidated set of risks we need to manage, and then work towards reducing those risks, spreading availability as needed.
Our backend storage platform for the DevCloud, and persistent storage for other nodes in the facility, is run from a distributed, replicated Gluster setup. In spite of the sudden loss of power, in a production environment with hundreds of running VMs and dozens of running data jobs, we were able to trivially recover our entire data set with minimal data loss. Some of the running VMs inside the DevCloud did see local filesystem issues, but we don't think that was a backing storage issue. This event has dramatically increased our confidence in the Gluster technology stack, and we will certainly be looking at extending deployments of it internally.
== Comments about hosting facility ==
Their status post about this: http://status.uk2.net/2016/02/24/london-power-outage/
We have multiple racks at this facility, and have a long-standing relationship with them going back to late Summer 2012. Over this period we have had a near-perfect uptime record for our equipment there. Above all, we have been consistently impressed with the speed and knowledge of the support we've received at the DC. In many cases, how the facility reacts to an outage defines the real service value, and in this case we can only commend the fantastic support we had through the outage hours. We do however feel there could be better monitoring and reporting of some of the facility's information, and we will be working with them to improve in those regards.
Fabian Arrotin and Karanbir Singh
The CentOS Project