[CentOS-announce] Notice of Service Outage and followup LON1/UK Facility

Wed Mar 30 10:25:19 UTC 2016
Karanbir Singh <kbsingh at centos.org>


== What happened ==

On Wednesday, February 24th, at 6pm UTC, the DC hosting some of
the CentOS equipment used for various roles suffered multiple
power outages. The facility was completely dark for just under
2 hrs, and we were able to start recovering services by 8pm UTC. By
midnight we had most services restored, and by 2:00AM UTC on Feb 25th
we had all services restored.

During the outage, the machines in those racks ran on batteries
(UPS in the racks) but finally went down in an uncontrolled way due to
lack of communication with that UPS.
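
The timeline above can be checked with a little date arithmetic. The
following is only an illustrative sketch: the timestamps are the ones
quoted in this report, the variable and function names are made up for
the example, and 8pm is treated as the start of recovery even though
the facility was dark for slightly less than the full two hours.

```python
from datetime import datetime, timezone

# Timestamps quoted in the report (all UTC).
power_lost = datetime(2016, 2, 24, 18, 0, tzinfo=timezone.utc)
recovery_started = datetime(2016, 2, 24, 20, 0, tzinfo=timezone.utc)
most_restored = datetime(2016, 2, 25, 0, 0, tzinfo=timezone.utc)
all_restored = datetime(2016, 2, 25, 2, 0, tzinfo=timezone.utc)


def hours_between(start, end):
    """Return the elapsed time between two datetimes, in hours."""
    return (end - start).total_seconds() / 3600


# Elapsed hours from the loss of power to each milestone.
milestones = {
    "recovery started": hours_between(power_lost, recovery_started),
    "most services restored": hours_between(power_lost, most_restored),
    "all services restored": hours_between(power_lost, all_restored),
}

for name, hours in milestones.items():
    print(f"{name}: {hours:.0f}h after power loss")
```

In other words, full restoration took roughly 8 hours from the initial
loss of power, of which about 6 hours were recovery work.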

Subsequently, on Monday March 14th, we suffered another power outage in
the racks, this time due to an overload on the rack power circuits.

== Services that were impacted ==
 - severity critical : mirrorlist.centos.org node (IPv6) went down
(while multiple mirrorlist.centos.org nodes for IPv4 were still
online). That means that machines with only IPv6 connectivity couldn't
get yum to work to retrieve the list of nearest mirrors.
 - severity medium : Our main build services queue management was
down; note: this did not impact our ability to build, test and
deliver updates.
 - severity medium : www.centos.org and www.centos.org/forums weren't
reachable through IPv6 : at the moment, those services are natively
reachable through IPv4, but proxied through nodes in that DC for IPv6
users. Most tested browsers were falling back to IPv4 during that period.
 - severity medium : CentOS DevCloud
(https://wiki.centos.org/DevCloud) : that means that CentOS Developers
weren't able to instantiate new CentOS test VMs for their work, but
also weren't able to reach the existing ones.
 - severity low : several publicly facing small services like
http://planet.centos.org , http://seven.centos.org (not critical and
could be restored quickly to other VMs elsewhere)
 - severity low : the server leading the armv7hl builds for the Plague
build farm was also offline, meaning no armhfp builds during that
timeframe (but no updates were due to be built, so the impact was
mitigated)
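
The difference between the two IPv6 items above (dual-stack clients
falling back to IPv4, IPv6-only clients having nothing to fall back to)
can be sketched as follows. This is a simplified illustration, not how
yum or any particular browser implements it: real clients typically
race the address families (Happy Eyeballs, RFC 8305) rather than trying
them one after another. The function names and the injectable
`connect` parameter are assumptions made for the example.

```python
import socket


def order_ipv6_first(addrinfos):
    """Stable sort of getaddrinfo()-style tuples, IPv6 candidates first."""
    return sorted(addrinfos, key=lambda ai: 0 if ai[0] == socket.AF_INET6 else 1)


def connect_with_fallback(addrinfos, connect):
    """Try each candidate address in turn and return the first that works.

    `connect` is any callable taking (family, socktype, proto, sockaddr)
    and raising OSError on failure. Returns (family, connection).
    """
    failures = 0
    for family, socktype, proto, _canon, sockaddr in order_ipv6_first(addrinfos):
        try:
            return family, connect(family, socktype, proto, sockaddr)
        except OSError:
            failures += 1  # this address is unreachable, try the next one
    raise OSError(f"all {failures} candidate addresses failed")


def tcp_connect(family, socktype, proto, sockaddr, timeout=5.0):
    """Open a real TCP connection to one candidate address."""
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(sockaddr)
    except OSError:
        sock.close()
        raise
    return sock
```

A dual-stack client would call something like
`connect_with_fallback(socket.getaddrinfo("www.centos.org", 80, type=socket.SOCK_STREAM), tcp_connect)`
and end up on IPv4 while the IPv6 path is down. A client whose
candidate list contains only IPv6 addresses gets the final `OSError`,
which is why the mirrorlist outage was rated critical for IPv6-only
machines.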

== Followup actions and notes ==
   Over the years, the baseline recovery model we've used and tried to
enforce is one of 'restore in place', take a downtime hit if needed -
and ensure we have service continuity for the user facing components
(the mirrorlist service, the centos update and content distribution
services). For other resources, like the main website etc, we ensure
there are good backups available in multiple places, usable to restore
services should there be a need. This model has worked well for us
over the years, and we've had very few, if any, service outages
that had a user impact. The restore in place/restore outside HA also
meant we were able to better utilise the exclusively sponsored
machines we rely on.

   However, as the project grows, with a lot more infrastructure being
consolidated into a few locations for non CDN services, our exposure
to service downtime has dramatically increased. It's clear that we need
to expand the scope of where we back up to, how we back up, how we
anticipate failure and our ability to restore services in a timely
manner should there be facility outages. In the coming weeks, we are
going to undertake a deep dive into our infrastructure design and
delivery and try to first come up with a consolidated set of risks we
need to manage, and then work towards reducing the risk,
spreading the availability as needed.

   Our backend storage platform for the DevCloud and persistent
storage for other nodes in the facility is run from a distributed,
replicated Gluster setup. In spite of the sudden loss of power, in a
production environment with hundreds of running VMs and dozens of
running data jobs, we were able to trivially recover our entire data
set with minimal data loss. Some of the running VMs inside the
DevCloud did see local filesystem issues, but we don't think that was a
backing storage issue. This event has dramatically increased our
confidence in the Gluster technology stack and we will certainly be
looking at extending deployments for it internally.

== Comments about hosting facility ==

   Their Status post about this

   We have multiple racks at this facility, and have a long standing
relationship with them going back to late summer 2012. Over this
period we have had a near perfect uptime record for our equipment
there. And above all we have been consistently impressed with the
speed and knowledge of the support we've received at the DC. In
many cases, how the facility reacts to an outage defines the real
service value - and in this case, we can only commend the fantastic
support we had through the outage hours. We do however feel there
could be better monitoring and reporting of some of the facility's
information and will be working with them to improve in those regards.

Fabian Arrotin and Karanbir Singh
The CentOS Project