[Ci-users] Unexpected outage 17:00 UTC Today - Service Restored

Wed Jun 14 17:32:00 UTC 2017
Karanbir Singh <kbsingh at centos.org>

On 14/06/17 15:40, Fabian Arrotin wrote:
> On 14/06/17 11:51, Karanbir Singh wrote:
>>
>>
>> On 14/06/17 08:18, Daniel Horák wrote:
>>> Hi Brian,
>>> I see lots of slaves offline. Is it connected to yesterday's outage
>>> or is it a different issue?
>>>
>>> Thanks,
>>> Daniel
>>>
>>> On 06/13/17 19:57, Brian Stinson wrote:
>>>> Hi Folks,
>>>>
>>>> Jenkins was leaking file descriptors and hit a limit today at 17:00 UTC.
>>>> Service was degraded for about 10 minutes and was fully restored at
>>>> around 17:24.
>>>>
>>>> I've increased the open-files limit for jenkins and am working on tuning
>>>> the garbage collector to mitigate this in the future.
>>>>
>>>> Thanks for your patience, and apologies for any inconvenience.
>>>>
>>
>> I noticed a lot of slaves were down, and was pointed to this by a few
>> people on chat.openshift.io and irc.freenode. On investigation, it
>> looked like the Jenkins master had exhausted RAM and other jobs on the
>> machine were killing the CPU with loads up to 50.x; I had to restart the
>> Jenkins master to bring services back.
>>
>> Once Brian is online, he will likely do a more thorough investigation and
>> get back with details.
>>
>> regards
>>
> 
> I spoke with Brian last week about a plan to move Jenkins to another
> node: currently the Jenkins master is running on a small VM (2 vCPUs and
> 4 GB of RAM), and the load average is indeed always high (above 20 right
> now, to give an example).
> Let me sync with him (as we already have the node that will be used as
> the replacement) to schedule a maintenance window for this.
> 

With a load of 20, you might have caught it just before things went south
again. Let's get Jenkins moved to a new host with more RAM and compute, but
I think we also need to look at what's going wrong here.

I've disabled the JMS Plugin for now; that seems to have had a huge
impact on system stability. I am going to leave it off until we can
work out what the underlying issue is.

Regards,



-- 
Karanbir Singh, Project Lead, The CentOS Project
+44-207-0999389 | http://www.centos.org/ | twitter.com/CentOS
GnuPG Key : http://www.karan.org/publickey.asc
