On 06/13/17 19:57, Brian Stinson wrote:
Hi Folks,
Jenkins was leaking file descriptors and hit its open-files limit today at 17:00 UTC; service was degraded for about 10 minutes and was fully restored at around 17:24.
I've increased the open-files limit for Jenkins and am working on tuning the garbage collector to mitigate this in the future.
Thanks for your patience, and apologies for any inconvenience.
-- Brian Stinson
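For anyone curious what those two changes look like in practice, here is a minimal sketch for an EL7-style host running the Jenkins RPM; the service name, file paths, limit value, and JVM flags below are assumptions rather than the exact settings used on the CI master:

    # Raise the open-files limit via a systemd drop-in, e.g.
    # /etc/systemd/system/jenkins.service.d/limits.conf containing:
    #   [Service]
    #   LimitNOFILE=65536
    systemctl daemon-reload
    systemctl restart jenkins

    # Garbage-collector tuning would go into JENKINS_JAVA_OPTIONS in
    # /etc/sysconfig/jenkins, for example:
    #   JENKINS_JAVA_OPTIONS="-Xms2g -Xmx2g -XX:+UseG1GC -XX:+ParallelRefProcEnabled"

    # Confirm the running master picked up the new limit
    grep 'Max open files' /proc/$(pgrep -f jenkins.war | head -n1)/limits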
On 14/06/17 08:18, Daniel Horák wrote:
Hi Brian, I see lots of slaves offline. Is this connected to yesterday's outage, or is it a different issue?
Thanks, Daniel
On 14/06/17 11:51, Karanbir Singh wrote:
I noticed a lot of slaves were down, and was pointed to this by a few people on chat.openshift.io and irc.freenode. On investigation it looked like the Jenkins master had exhausted its RAM, and other jobs on the machine were pushing the CPU load up to 50.x; I had to restart the Jenkins master to bring services back.
Once Brian is online, he will likely do a more thorough investigation and get back with details.
regards
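A short sketch of the checks that can confirm this kind of resource exhaustion before reaching for a restart; it assumes the master runs from jenkins.war under a systemd service named jenkins:

    # Find the Jenkins master process
    JPID=$(pgrep -f jenkins.war | head -n1)

    # How many file descriptors is it holding, and what is its limit?
    ls /proc/$JPID/fd | wc -l
    grep 'Max open files' /proc/$JPID/limits

    # Memory and load on the box
    free -m
    uptime

    # Last resort: restart the master
    systemctl restart jenkins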
On 14/06/17 15:40, Fabian Arrotin wrote:
I spoke with Brian last week about a plan to move Jenkins to another node: the Jenkins master is currently running on a small VM (2 vCPUs and 4 GB of RAM), and the load average is indeed always high (above 20 at the moment, to give an example). Let me sync with him (we already have the node that will be used as the replacement) to schedule a maintenance window for this.
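For context, a move like this largely comes down to copying JENKINS_HOME to the bigger node during the maintenance window; a rough sketch, assuming the default /var/lib/jenkins home and a placeholder hostname new-jenkins for the replacement node:

    # On the current master: stop Jenkins so the data is quiescent, then sync it
    systemctl stop jenkins
    rsync -aHAX --delete /var/lib/jenkins/ new-jenkins:/var/lib/jenkins/

    # On the new node: install the same Jenkins version, point it at the synced
    # JENKINS_HOME, start the service, and re-point clients/DNS
    systemctl start jenkins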
On Wed, Jun 14, 2017 at 1:32 PM, Karanbir Singh wrote:
With a load of 20 you might have caught it just before things went south again. Let's get Jenkins moved to a new host with more RAM and compute, but I think we also need to look at what is going south here.
I've disabled the JMS plugin for now; that seems to have had a huge impact on system stability. I'm going to leave it off until we can work out what the underlying issue is.
Regards,
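For reference, disabling a plugin outside the UI is usually done with a marker file next to the plugin archive, followed by a restart; the plugin filename (jms-messaging.jpi, possibly .hpi on this install) and the JENKINS_HOME path are assumptions:

    # Disable the JMS Messaging plugin with a .disabled marker, then restart
    touch /var/lib/jenkins/plugins/jms-messaging.jpi.disabled
    systemctl restart jenkins

    # To re-enable it later, remove the marker and restart again
    rm /var/lib/jenkins/plugins/jms-messaging.jpi.disabled
    systemctl restart jenkins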
On 14/06/17 18:35, Ari LiVigni wrote:
Scott wrote that plugin and can look at what is happening. We need it for our pipeline triggering, and it has been working fine for a while, so it would be good to understand the root cause before just disabling it.
On Wed, Jun 14, 2017 at 1:43 PM, Karanbir Singh wrote:
The guys are looking at a new bringup; let's get that up with the JMS stuff and diagnose it there before moving the rest of the projects over. Would that work?
It's important we keep Jenkins up for now.
On Wed, Jun 14, 2017 at 1:43 PM, Karanbir Singh kbsingh@centos.org wrote:
On 14/06/17 18:35, Ari LiVigni wrote:
I've disabled the JMS Plugin for now, that seems to have had a huge impact on the system stability. Am going to leave that off till we
can
workout what the underlaying issue here is. Regards,
Scott wrote that plugin and can look at what is happening. We need that for our pipeline triggering it has been working fine for a while so it would be good to understand what the root cause issue is before just disabling it.
the guys are looking at a new bringup, lets get that up with the JMS stuff and diagnose that before moving the rest of the projects over/. Would that work ?
Its important we keep jenkins up for now,
On Wed, Jun 14, 2017 at 1:46 PM, Ari LiVigni wrote:
I am fine with that, but what shows that the plugin is the culprit? Are there any logs or anything else that can be sent over to Scott?
Thanks,
On Wed, Jun 14, 2017 at 1:55 PM, Ari LiVigni wrote:
FYI, Scott is looking into it on his test instance to see if he can identify the issue while we wait.
-- -== @ri ==-
KB,
If it is possible, I'd like the JMS plugin turned back on. The problem seems to be the JMS Messaging plugin in combination with pipelines, so I removed the multibranch ci-pipeline job. I believe this is why we hadn't seen an issue until now: we only turned that job on yesterday.
Is there any chance we can turn that plugin on and just use it for freestyle jobs?
Thanks
-- -== @ri ==-
On Wed, Jun 14, 2017 at 12:20 PM, Karanbir Singh wrote:
Service went down again a few minutes back; I have restarted Jenkins and it's up again.
Brian is on a long-haul flight out of the US at the moment. I will try to keep an eye on things, but we're going to need him to take a look when he can.
On 14/06/17 18:05, Ari LiVigni wrote:
Hi KB,
In the future our team would like to help with Jenkins maintenance and issues. This is something I have spoken about with Brian. Let me know if this is an option you would like to pursue in the near term.
On Wed, Jun 14, 2017 at 1:28 PM, Karanbir Singh wrote:
hi Ari,
Absolutely! Let's see if we can get Brian for some time later this week, or early next week, and thrash through some options.
Regards,
+1. Scott Hebert on our team has a lot of Jenkins knowledge and has written plugins as well; I have added him to the thread.