[Ci-users] Unplanned outage incident 06h37 - 08h12 UTC

Thu Jul 13 08:51:54 UTC 2017
Brian Stinson <brian at bstinson.com>

Issue Summary
=============

A misconfigured ulimit on jenkins.ci.centos.org caused Jenkins to fail
with too many open files. This also caused the root volume to fill up
because of noisy messages in the logs. Access to the Jenkins HTTP
interface was affected during this period.

Root Cause
==========

During the reconfiguration and move to a new host, the 'nofile' ulimit
for the jenkins user was reset to the default (4096). Jenkins reached
this limit before the next scheduled garbage collection.
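
For readers who want to check for the same condition on their own
hosts, here is a minimal sketch of how a process can inspect its own
'nofile' limit and its current descriptor usage. It is not the tooling
we used during the incident. The function names are made up for this
example, and it assumes a Linux host with /proc mounted.

```python
import os
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count(pid="self"):
    """Count open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = nofile_limits()
    used = open_fd_count()
    print(f"nofile soft={soft} hard={hard} currently open={used}")
```

Passing a numeric pid to open_fd_count() inspects another process
instead, provided the caller has permission to read its /proc entry.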

Recovery
========

At 08h12 we cleared out the Jenkins log to free up disk space, set the
ulimits for the jenkins user to the appropriate value, and restarted
Jenkins.

Corrective Measures
===================

The ulimit change has been added to our Ansible scripts and deployed.
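
A post-deployment check along the following lines could confirm that a
running process actually picked up the intended limit. This is only a
sketch under stated assumptions: it parses the 'Max open files' line
of /proc/<pid>/limits on Linux, the helper names are hypothetical, and
the REQUIRED value is a placeholder rather than the value we
configured.

```python
import re


def parse_max_open_files(limits_text):
    """Extract (soft, hard) from the 'Max open files' line of /proc/<pid>/limits.

    A value of 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            return tuple(
                None if value == "unlimited" else int(value)
                for value in match.groups()
            )
    raise ValueError("no 'Max open files' line found")


def limit_is_sufficient(limits_text, required):
    """Return True when the soft limit is unlimited or at least `required`."""
    soft, _hard = parse_max_open_files(limits_text)
    return soft is None or soft >= required


if __name__ == "__main__":
    REQUIRED = 8192  # placeholder value, not taken from our configuration
    # Replace "self" with the pid of the service to check that process.
    with open("/proc/self/limits") as handle:
        print(limit_is_sufficient(handle.read(), REQUIRED))
```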

Impact
======

Jobs running during the window should have completed, but may not have
reported their status back to Jenkins. SCM/GitHub jobs that would have
been triggered during that period were picked up when the Jenkins
service came back online.

Jobs with triggers through other means (messaging, HTTP POST, etc.) may
not have been launched.





We appreciate your patience during this outage, and apologize for any
inconvenience.

--
Brian Stinson
CentOS CI Infrastructure Team