[Ci-users] Unplanned outage incident 06h37 - 08h12 UTC
Brian Stinson
brian at bstinson.com
Thu Jul 13 08:51:54 UTC 2017
Issue Summary
=============
A misconfigured ulimit on jenkins.ci.centos.org caused Jenkins to fail with
"too many open files". The resulting flood of error messages in the logs also
filled up the root volume. Access to the Jenkins HTTP interface was affected
during this period.

Root Cause
==========
During the reconfiguration and move to a new host, the 'nofile' ulimits for
the jenkins user were reset to the default (4096). Jenkins reached this limit
before the next scheduled garbage collection.

Recovery
========
At 08h12 we cleared out the Jenkins log to free up disk space, set the ulimits
for the jenkins user to the appropriate value, and restarted Jenkins.

Corrective Measures
===================
The ulimit change has been added to our Ansible scripts and deployed.

Impact
======
Jobs running during the window should have completed, but may not have
reported their status back to Jenkins. SCM/GitHub jobs that would have been
triggered during that period were picked up when the Jenkins service came
back online. Jobs triggered through other means (messaging, HTTP POST, etc.)
may not have been launched.

We appreciate your patience during this outage, and apologize for any
inconvenience.

--
Brian Stinson
CentOS CI Infrastructure Team
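
As an illustration of the root cause described above, the following is a minimal sketch of how a process's open-file usage could be compared against its 'nofile' soft limit. It is not the check the CI team uses: the function names and the 20% warning threshold are hypothetical, and counting descriptors through /proc assumes a Linux host.

```python
import os
import resource


def nofile_headroom(open_fds: int, soft_limit: int) -> float:
    """Return the fraction of the 'nofile' soft limit that is still free."""
    if soft_limit == resource.RLIM_INFINITY:
        return 1.0  # no limit configured, so headroom is effectively full
    if soft_limit <= 0:
        raise ValueError("soft_limit must be positive")
    # Clamp at 0.0 in case the count momentarily exceeds the limit.
    return max(0.0, 1.0 - open_fds / soft_limit)


def current_nofile_limits() -> tuple:
    """Return (soft, hard) 'nofile' limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def count_open_fds(pid: str = "self") -> int:
    """Count open file descriptors via /proc (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


if __name__ == "__main__":
    soft, hard = current_nofile_limits()
    used = count_open_fds()
    headroom = nofile_headroom(used, soft)
    print(f"nofile soft={soft} hard={hard} open={used}")
    # Hypothetical threshold: warn when less than 20% of the limit remains.
    if headroom < 0.2:
        print("warning: close to the open-file limit")
```

With the default soft limit of 4096 mentioned in the report, a process holding 3500 descriptors would have roughly 15% headroom left and would trip the hypothetical warning.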