[Ci-users] Unexpected outage 17:00 UTC Today - Service Restored

On Wed, Jun 14, 2017 at 1:32 PM, Karanbir Singh <kbsingh at centos.org> wrote:

> On 14/06/17 15:40, Fabian Arrotin wrote:
> > On 14/06/17 11:51, Karanbir Singh wrote:
> >>
> >>
> >> On 14/06/17 08:18, Daniel Horák wrote:
> >>> Hi Brian,
> >>> I see lots of slaves offline. Is it connected to yesterday's outage,
> >>> or is it a different issue?
> >>>
> >>> Thanks,
> >>> Daniel
> >>>
> >>> On 06/13/17 19:57, Brian Stinson wrote:
> >>>> Hi Folks,
> >>>>
> >>>> Jenkins was leaking file descriptors and hit a limit today at 17:00 UTC,
> >>>> service was degraded for about 10 minutes, and service was fully
> >>>> restored at around 17:24.
> >>>>
> >>>> I've increased the open-files limit for jenkins and am working on tuning
> >>>> the garbage collector to mitigate this in the future.
> >>>>
> >>>> Thanks for your patience, and apologies for any inconvenience.
> >>>>
> >>
> >> I noticed a lot of slaves were down, and a few people pointed me to
> >> this on chat.openshift.io and irc.freenode. On investigation, it
> >> looked like the Jenkins master had exhausted RAM, and other jobs on the
> >> machine were killing the CPU with loads up to 50.x; I had to restart the
> >> Jenkins master to bring services back.
> >>
> >> Once Brian is online, he will likely do a more thorough investigation and
> >> get back with details.
> >>
> >> regards
> >>
> >
> > I spoke with Brian last week about a plan to move Jenkins to another
> > node: the Jenkins master currently runs on a small VM (2 vCPUs and 4 GB
> > of RAM), and the load average is indeed always high (above 20 right now,
> > to give an example).
> > Let me sync with him (as we already have the node that will be used as
> > the replacement) to schedule a maintenance window for this.
> >
>
> With 20 you might have caught it just before things went south, again.
> Let's get Jenkins moved to a new host with more RAM and compute, but I
> think we also need to look at what's going wrong here.
>
> I've disabled the JMS Plugin for now; that seems to have had a huge
> impact on system stability. I'm going to leave it off until we can
> work out what the underlying issue is.
>
> Regards,
>

Scott wrote that plugin and can look at what is happening. We need it
for our pipeline triggering, and it has been working fine for a while, so it
would be good to understand what the root cause is before just disabling it.
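
For anyone trying to narrow down the file descriptor leak mentioned in the original report, here is a minimal sketch of how descriptor usage could be compared against the open-files limit on a Linux host. It assumes a Linux /proc filesystem and permission to read the target process's entries. The function names are hypothetical and are not part of Jenkins or the CentOS CI tooling.

```python
import os


def count_open_fds(pid):
    """Count a process's open file descriptors by listing /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def parse_soft_nofile_limit(limits_text):
    """Extract the soft 'Max open files' limit from the text of /proc/<pid>/limits.

    Returns None if the limit is 'unlimited' or the line is not present.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields after splitting: Max, open, files, <soft>, <hard>, <units>
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage_ratio(open_fds, soft_limit):
    """Fraction of the soft limit currently in use."""
    return open_fds / soft_limit


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    # Illustration only: inspect this script's own process. For Jenkins you
    # would substitute the PID of the Jenkins master's Java process.
    pid = os.getpid()
    with open(f"/proc/{pid}/limits") as fh:
        limit = parse_soft_nofile_limit(fh.read())
    used = count_open_fds(pid)
    if limit is None:
        print(f"pid {pid}: {used} descriptors open, no soft limit")
    else:
        print(f"pid {pid}: {used}/{limit} descriptors ({fd_usage_ratio(used, limit):.1%})")
```

Sampling this periodically with and without the plugin enabled would show whether descriptor growth tracks the plugin or something else.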

>
>
>
> --
> Karanbir Singh, Project Lead, The CentOS Project
> +44-207-0999389 | http://www.centos.org/ | twitter.com/CentOS
> GnuPG Key : http://www.karan.org/publickey.asc

-- 
-== @ri ==-