On 14/06/17 11:51, Karanbir Singh wrote:
On 14/06/17 08:18, Daniel Horák wrote:
Hi Brian, I see lots of slaves offline. Is it connected to yesterday's outage, or is it a different issue?
Thanks, Daniel
On 06/13/17 19:57, Brian Stinson wrote:
Hi Folks,
Jenkins was leaking file descriptors and hit a limit today at 17:00 UTC. Service was degraded for about 10 minutes and was fully restored at around 17:24.
I've increased the open-files limit for Jenkins and am working on tuning the garbage collector to mitigate this in the future.
Thanks for your patience, and apologies for any inconvenience.
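For anyone who wants to keep an eye on this before the tuning lands, here is a rough, untested sketch of how one could watch the master's file-descriptor usage against its soft limit. It assumes Linux, psutil installed, the master being found by matching "jenkins" in a command line, and enough privileges to inspect that process; none of that is part of the actual CI tooling.

    #!/usr/bin/env python3
    # Rough watchdog sketch: warn when the Jenkins master's open file
    # descriptors approach its RLIMIT_NOFILE soft limit.
    # Assumes Linux, psutil, and sufficient privileges (root or the
    # jenkins user) to inspect the process.
    import psutil

    WARN_RATIO = 0.8  # warn once 80% of the soft limit is in use

    def find_jenkins():
        # Pick the first process with "jenkins" in its command line.
        for proc in psutil.process_iter(['cmdline']):
            cmdline = ' '.join(proc.info['cmdline'] or [])
            if 'jenkins' in cmdline.lower():
                return proc
        return None

    def check(proc):
        open_fds = proc.num_fds()                        # Unix only
        soft, hard = proc.rlimit(psutil.RLIMIT_NOFILE)   # Linux only
        ratio = open_fds / soft
        status = 'WARN' if ratio >= WARN_RATIO else 'OK'
        print(f"{status}: pid {proc.pid} uses {open_fds}/{soft} fds "
              f"({ratio:.0%}), hard limit {hard}")

    if __name__ == '__main__':
        proc = find_jenkins()
        if proc is None:
            print("no jenkins process found")
        else:
            check(proc)

Dropping something like this into cron (or a monitoring check) would give advance warning instead of finding out at the hard limit.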
I noticed a lot of slaves were down and was pointed to this by a few people on chat.openshift.io and irc.freenode. On investigation, it looked like the Jenkins master had exhausted its RAM, and other jobs on the machine were hammering the CPU, with load averages up to 50.x; I had to restart the Jenkins master to bring services back.
Once Brian is online, he will likely do a more thorough investigation and get back with details.
regards
I spoke with Brian last week about a plan to move Jenkins to another node: the Jenkins master currently runs on a small VM (2 vCPUs and 4 GB of RAM), and the load average is indeed always high (currently above 20, to give an example). Let me sync with him (we already have the node that will be used as the replacement) to schedule a maintenance window for this.
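As a quick sanity check when sizing the replacement node, the pressure can be expressed as load average per CPU: a load of ~20 on a 2 vCPU VM is roughly 10 runnable tasks per core. A minimal sketch, assuming only a Unix host with Python on it:

    #!/usr/bin/env python3
    # Print the 1-minute load average per CPU as a rough measure of
    # oversubscription (e.g. load 20 on 2 vCPUs is ~10 tasks per core).
    import os

    load1, load5, load15 = os.getloadavg()
    cpus = os.cpu_count() or 1
    print(f"1-min load {load1:.2f} on {cpus} CPU(s): "
          f"{load1 / cpus:.1f} per core")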