[Ci-users] Unexpected outage 17:00 UTC Today - Service Restored
alivigni at redhat.com
Wed Jun 14 17:35:40 UTC 2017
On Wed, Jun 14, 2017 at 1:32 PM, Karanbir Singh <kbsingh at centos.org> wrote:
> On 14/06/17 15:40, Fabian Arrotin wrote:
> > On 14/06/17 11:51, Karanbir Singh wrote:
> >> On 14/06/17 08:18, Daniel Horák wrote:
> >>> Hi Brian,
> >>> I see lots of slaves offline. Is it connected to yesterday's outage
> >>> or is it a different issue?
> >>> Thanks,
> >>> Daniel
> >>> On 06/13/17 19:57, Brian Stinson wrote:
> >>>> Hi Folks,
> >>>> Jenkins was leaking file descriptors and hit a limit today at 17:00.
> >>>> Service was degraded for about 10 minutes, and service was fully
> >>>> restored at around 17:24.
> >>>> I've increased the open-files limit for jenkins and am working on
> >>>> the garbage collector to mitigate this in the future.
> >>>> Thanks for your patience, and apologies for any inconvenience.
> >> I noticed a lot of slaves were down, and was pointed to this by a few
> >> people on chat.openshift.io and irc.freenode. On investigation, it
> >> looked like the Jenkins master had exhausted RAM and other jobs on the
> >> machine were killing the CPU with loads up to 50.x; I had to restart the
> >> Jenkins master to bring services back.
> >> Once Brian is online, he will likely do a more thorough investigation and
> >> get back with details.
> >> regards
> > I spoke with Brian last week about a plan to move Jenkins to another
> > node: the Jenkins master currently runs on a small VM (2 vCPUs and 4 GB
> > of RAM), and the load average is indeed always high (currently above 20,
> > to give an example).
> > Let me sync with him (as we already have the node that will be used as
> > replacement) to schedule a maintenance window for this.
> With 20 you might have caught it just before things went south, again.
> Let's get Jenkins moved to a new host with more RAM and compute, but I
> think we might need to look at what's going south here.
> I've disabled the JMS Plugin for now; that seems to have had a huge
> impact on the system stability. Am going to leave that off till we can
> work out what the underlying issue here is.
Scott wrote that plugin and can look at what is happening. We need it
for our pipeline triggering, and it has been working fine for a while, so it
would be good to understand what the root cause is before just disabling it.
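
To help narrow down the root cause, it may be useful to record how many file
descriptors the Jenkins process holds compared to its limit, with the plugin
enabled and disabled. The following is only a rough sketch, not something
that was run on the CI master: it assumes a Linux host with /proc mounted and
permission to read the process's entries, and the function names
(parse_soft_nofile, count_open_fds, fd_usage) are made up for illustration.

```python
import os


def parse_soft_nofile(limits_text):
    """Return the soft 'Max open files' value from /proc/<pid>/limits text.

    Returns None if the line is missing or the limit is 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: Max, open, files, <soft>, <hard>, <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def count_open_fds(pid):
    """Count the entries in /proc/<pid>/fd (one entry per open descriptor)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_usage(pid):
    """Return (open_fd_count, soft_limit) for the given process ID."""
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_soft_nofile(fh.read())
    return count_open_fds(pid), limit
```

Calling fd_usage() with the Jenkins PID at regular intervals and logging the
result should show whether the count keeps climbing toward the limit, and
whether that changes when the plugin is turned off.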
> Karanbir Singh, Project Lead, The CentOS Project
> +44-207-0999389 | http://www.centos.org/ | twitter.com/CentOS
> GnuPG Key : http://www.karan.org/publickey.asc
-== @ri ==-