Summary:
A large subset of the application nodes in apps.ci.centos.org entered an unschedulable state around 13h00 UTC on September 27th. The nodes were rebooted and service was partially restored, but a second failure appeared overnight: pods could be scheduled on the nodes, but DNS was not functional. DNS service was restored at around 15h00 UTC on September 28th.
Timeline:
27-Sept-2018 13h00 UTC - 28-Sept-2018 15h00 UTC
Root Cause:
An update to selinux-policy, applied around 17-August, caused some files to be relabeled. We did not schedule a reboot at the time, and later routine restarts of the docker service caused the nodes to enter a degraded state.
Further file relabels then left the nodes degraded a second time: the node boot process completed, but the nodes came up in a degraded state.
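The gap described above, a policy update applied without a following reboot, is the kind of condition a small script could flag. The following is only a sketch, not something we run today: it assumes an RPM-based node, reads the boot time from the btime line of /proc/stat, and asks rpm for the package's install time. The function names are made up for illustration.

```python
import subprocess


def parse_boot_time(stat_text: str) -> int:
    """Return the boot time (seconds since the epoch) from /proc/stat content."""
    for line in stat_text.splitlines():
        if line.startswith("btime "):
            return int(line.split()[1])
    raise ValueError("no btime line found")


def read_install_time(package: str = "selinux-policy") -> int:
    """Ask rpm when the package was last installed or updated."""
    out = subprocess.run(
        ["rpm", "-q", "--qf", "%{INSTALLTIME}\n", package],
        check=True, capture_output=True, text=True,
    ).stdout
    # If several versions are installed, keep the most recent time stamp.
    return max(int(field) for field in out.split())


def reboot_pending(install_time: int, boot_time: int) -> bool:
    """True when the package changed after the node last booted."""
    return install_time > boot_time


if __name__ == "__main__":
    with open("/proc/stat") as f:
        boot = parse_boot_time(f.read())
    print("reboot pending:", reboot_pending(read_install_time(), boot))
```

Keeping the comparison in its own function means it can be exercised without touching rpm or /proc.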
Recovery:
Completed the rest of the pending updates, and rebooted the nodes to clear the node-schedulable degradation. (27-Sept)
Triggered a full autorelabel and rebooted the nodes to clear the node-boot degradation. (28-Sept)
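For reference, the usual way to request a full relabel on a CentOS system is to create an empty /.autorelabel file before rebooting; the boot process then relabels the filesystem and removes the flag. The sketch below assumes that mechanism (the exact commands used during recovery are not recorded here), and the function name and the root parameter are hypothetical, added so the step can be tried against a scratch directory.

```python
import tempfile
from pathlib import Path


def request_full_relabel(root: str = "/") -> Path:
    """Create the .autorelabel flag file under root and return its path."""
    flag = Path(root) / ".autorelabel"
    flag.touch(exist_ok=True)
    return flag


if __name__ == "__main__":
    # Dry run against a scratch directory; on a real node root would be "/"
    # and the call would be followed by a reboot.
    with tempfile.TemporaryDirectory() as scratch:
        flag = request_full_relabel(scratch)
        print("created", flag, "exists:", flag.exists())
```

A full relabel can take a long time on nodes with many files, so the reboot that follows should be scheduled with that in mind.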
Preventative Measures:
- Consider rebooting the nodes more often, perhaps on a regular schedule, to catch OS upgrade problems.
- Complete the openshift-monitoring EPIC in the CI backlog, which will add better checks for DNS.
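As an illustration of the kind of DNS check the monitoring work could add, here is a minimal sketch. It is not the planned implementation: the function name is hypothetical, the names to probe are placeholders, and a real check would run from inside a pod on each node. It simply tries to resolve each name with the standard library and records whether the lookup succeeded.

```python
import socket
from typing import Callable, Dict, Iterable


def check_dns(names: Iterable[str],
              resolve: Callable[..., object] = socket.getaddrinfo) -> Dict[str, bool]:
    """Try to resolve each name and record whether the lookup succeeded."""
    results = {}
    for name in names:
        try:
            resolve(name, None)
            results[name] = True
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = False
    return results


if __name__ == "__main__":
    # Placeholder names: one cluster-internal, one external.
    probes = ["kubernetes.default.svc.cluster.local", "centos.org"]
    for name, ok in check_dns(probes).items():
        print(name, "OK" if ok else "FAILED")
```

The resolver is passed in as a parameter so the check can be tested without network access.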
Thank you very much for your patience during this outage.
--
Brian Stinson
CentOS CI Infrastructure Team