[Ci-users] [apps.ci.centos.org] Service degradation 27-Sept-2018 to 28-Sept-2018

Summary:

A large subset of the application nodes in apps.ci.centos.org was
placed in an unschedulable state around 13h00 UTC on September 27th.
The nodes were rebooted and service was partially restored, but a
different failure appeared overnight: pods could schedule on the nodes,
but DNS was not functional. DNS service was restored at around 15h00 UTC
on September 28th.

Timeline:

27-Sept-2018 13h00 UTC - 28-Sept-2018 15h00 UTC

Root Cause:

A previously applied update (applied around 17-August) to selinux-policy
caused some files to be relabeled. We did not schedule a reboot at the
time, but routine restarts of the docker service later caused the nodes
to enter a degraded state.

Further file relabels then allowed the node boot process to complete,
but the nodes again came up in a degraded state.

Recovery:

Completed the rest of the pending updates, and rebooted the nodes to
clear the node-schedulable degradation. (27-Sept)

Triggered a full autorelabel and rebooted the nodes to clear the
node-boot degradation. (28-Sept)
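
For readers unfamiliar with the autorelabel step: this report does not
record the exact commands that were run, but on CentOS a full SELinux
relabel is commonly requested by creating the flag file /.autorelabel
and then rebooting. The sketch below only creates that flag file. The
function name and the root parameter are hypothetical and exist purely
so the example can be tried against a scratch directory.

```python
from pathlib import Path


def request_full_relabel(root: str = "/") -> Path:
    """Create the .autorelabel flag file under `root` and return its path.

    Assumption: with root="/", an SELinux-enabled CentOS host relabels the
    whole filesystem on its next boot. This function does not reboot the
    host; that remains a separate, deliberate step.
    """
    flag = Path(root) / ".autorelabel"
    flag.touch(exist_ok=True)  # an empty file is enough; its contents are not used here
    return flag


# Hypothetical usage on a node (requires root privileges):
#   request_full_relabel()   # then reboot the node
```

A full relabel can take a long time on nodes with many files, so it is
probably best done on a drained node.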

Preventative Measures:

- Consider rebooting the nodes more often, perhaps on a regular
  schedule to catch OS upgrade problems

- Complete the openshift-monitoring EPIC in the CI backlog, which will
  add better checks for DNS.
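
As an illustration of the kind of DNS check mentioned above, here is a
minimal sketch of a resolution probe. It is not the planned
openshift-monitoring work. The function name dns_resolves and the
injectable resolver argument are hypothetical, and the hostname in the
usage comment is only an assumption about a name that should resolve
from inside a pod.

```python
import socket
from typing import Callable


def dns_resolves(hostname: str, resolver: Callable = socket.getaddrinfo) -> bool:
    """Return True if `hostname` resolves to at least one address.

    `resolver` defaults to socket.getaddrinfo and is injectable so the
    check can be exercised without a working network.
    """
    try:
        return len(resolver(hostname, None)) > 0
    except socket.gaierror:
        # Raised by getaddrinfo when name resolution fails.
        return False


# Hypothetical usage from inside a pod:
#   if not dns_resolves("kubernetes.default.svc"):
#       print("DNS check failed on this node")
```

Running a probe like this from a pod on each node would have flagged
the overnight state, where pods scheduled normally but could not
resolve names.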

Thank you very much for your patience during this outage.

--
Brian Stinson
CentOS CI Infrastructure Team