[Ci-users] [apps.ci.centos.org] Service degradation 27-Sept-2018 to 28-Sept-2018
bstinson at redhat.com
Fri Sep 28 16:24:33 UTC 2018
A large subset of the application nodes in apps.ci.centos.org were
placed in an unschedulable state around 13h00 UTC on September 27th.
Nodes were rebooted and service was partially restored, but a new failure
appeared overnight: pods could schedule on the nodes, but DNS was not
functional. DNS service was restored at around 15h00 UTC on September 28th.
Outage window: 27-Sept-2018 13h00 UTC - 28-Sept-2018 15h00 UTC
A previously applied update (applied around 17-August) to selinux-policy
caused some files to be relabeled. We did not, at the time, schedule a
reboot, but routine restarts of the docker service caused the nodes to
enter a degraded state.
Further file relabels allowed the node boot process to complete, but left
the nodes in a degraded state.
We completed the rest of the pending updates and rebooted the nodes to
clear the node-schedulable degradation. (27-Sept)
We then triggered a full autorelabel and rebooted the nodes to clear the
node-boot degradation. (28-Sept)
- Consider rebooting the nodes more often, perhaps on a regular
schedule to catch OS upgrade problems
- Complete the openshift-monitoring EPIC in the CI backlog, which will
add better checks for DNS.
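As a rough illustration of the kind of DNS check mentioned above, here is a minimal sketch of a resolution probe. It is not the check planned in the openshift-monitoring work; the function name check_dns and its resolver parameter are hypothetical, and it relies only on Python's standard socket.getaddrinfo.

```python
import socket


def check_dns(hostnames, resolver=socket.getaddrinfo):
    """Return a dict mapping each hostname to True if it resolves, else False.

    `resolver` is a hypothetical injection point so the probe can be
    exercised without network access; by default it performs a real lookup.
    """
    results = {}
    for name in hostnames:
        try:
            # A successful lookup returns a list of address tuples;
            # a resolution failure raises socket.gaierror.
            resolver(name, None)
            results[name] = True
        except socket.gaierror:
            results[name] = False
    return results
```

Run from inside a pod, a call such as check_dns(["kubernetes.default.svc.cluster.local"]) would report whether that name resolves. The hostname is only an assumption about what a cluster exposes, so substitute names that are known to exist in the environment being checked.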
Thank you very much for your patience during this outage.
CentOS CI Infrastructure Team