duffy/networking hangups

Colin Walters

25 Jul 2016

10:13 p.m.

I'm seeing rather frequent issues where jobs that allocate a Duffy machine and ssh in end up losing network between my slave and the duffy machine - I can't ping even.

https://ci.centos.org/job/atomic-fedora-ws-treecompose/49/console

is an example run, but there are a number of others. Are there any available server side logs that might help debug this?
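
A minimal sketch of a reachability probe that a job could run alongside its ssh session to record when the Duffy machine stops answering. It assumes Python is available on the slave; the function name `tcp_reachable` and the default port and timeout are illustrative and are not part of Duffy or of any job in this thread.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and attempts the connect;
        # any failure (refused, unreachable, timed out) raises OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Logging the result with a timestamp every minute or so would show whether the node dropped off the network before or after the job's last output.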

Karanbir Singh

25 Jul 2016

10:40 p.m.

On 25/07/16 23:13, Colin Walters wrote:

> I'm seeing rather frequent issues where jobs that allocate a Duffy machine and ssh in end up losing network between my slave and the duffy machine - I can't ping even.
>
> https://ci.centos.org/job/atomic-fedora-ws-treecompose/49/console
>
> is an example run, but there are a number of others. Are there any available server side logs that might help debug this?

How long was this machine deployed for? The machine reaper will silently kill power to the node.

-- Karanbir Singh, Project Lead, The CentOS Project +44-207-0999389 | http://www.centos.org/ | twitter.com/CentOS GnuPG Key : http://www.karan.org/publickey.asc

Colin Walters

11:57 p.m.

On Mon, Jul 25, 2016, at 06:40 PM, Karanbir Singh wrote:

> How long was this machine deployed for? The machine reaper will silently kill power to the node.

Right, but that's 8 hours, correct? I do have a "transparent duffy reuse" tool in https://github.com/cgwalters/centos-ci-skeleton which is used a lot in my jobs, but it generally avoids retaining machines for greater than an hour.

Take https://ci.centos.org/job/atomic-fedora-ws-treecompose/50/console. If you click through to https://ci.centos.org/job/atomic-fedora-ws-duffy-allocate/6699/console, you see:

22:09:11 Assigning host: n8.pufty.ci.centos.org (SSID=jenkins-atomic-fedora-ws-treecompose-50)

Then at this point it hung and I finally aborted it:

22:32:45 Installing packages: 75%
23:53:14 Build was aborted

So that host had only been assigned for less than 30 minutes. Is there anything relevant in the Duffy logs about this?
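
The timing argument above can be checked with a few lines of arithmetic. This is a sketch only: the three timestamps are copied from the console output quoted in this message, the helper name `elapsed` is made up for illustration, and the 8-hour reaper window is the figure asked about in this thread, not a confirmed Duffy setting.

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Time between two HH:MM:SS console timestamps, allowing a midnight wrap."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    if delta < timedelta(0):
        # The end timestamp belongs to the following day.
        delta += timedelta(days=1)
    return delta


REAPER_WINDOW = timedelta(hours=8)  # assumption taken from the thread

assigned = "22:09:11"     # "Assigning host"
last_output = "22:32:45"  # "Installing packages: 75%"
aborted = "23:53:14"      # "Build was aborted"

print(elapsed(assigned, last_output))              # 0:23:34
print(elapsed(assigned, aborted))                  # 1:44:03
print(elapsed(assigned, aborted) < REAPER_WINDOW)  # True
```

Both intervals are well inside the assumed reaper window, which is consistent with the reaper not being the cause of the hang.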

Brian Stinson

26 Jul 2016

2:30 p.m.

On Jul 25 19:57, Colin Walters wrote:

> On Mon, Jul 25, 2016, at 06:40 PM, Karanbir Singh wrote:
>
> > How long was this machine deployed for? The machine reaper will silently kill power to the node.
>
> Right, but that's 8 hours, correct? I do have a "transparent duffy reuse" tool in https://github.com/cgwalters/centos-ci-skeleton which is used a lot in my jobs, but it generally avoids retaining machines for greater than an hour.
>
> Take https://ci.centos.org/job/atomic-fedora-ws-treecompose/50/console. If you click through to https://ci.centos.org/job/atomic-fedora-ws-duffy-allocate/6699/console, you see:
>
> 22:09:11 Assigning host: n8.pufty.ci.centos.org (SSID=jenkins-atomic-fedora-ws-treecompose-50)
>
> Then at this point it hung and I finally aborted it:
>
> 22:32:45 Installing packages: 75%
> 23:53:14 Build was aborted
>
> So that host had only been assigned for less than 30 minutes. Is there anything relevant in the Duffy logs about this?
>
> Ci-users mailing list
> Ci-users@centos.org
> https://lists.centos.org/mailman/listinfo/ci-users

Can we work together to start one of those jobs and see if it hangs? If so, before aborting the jenkins job we can fail out the node and jump in on a serial console to grab some logs from the machine itself.

--Brian

Karanbir Singh

2:43 p.m.

On 26/07/16 15:30, Brian Stinson wrote:

> > > How long was this machine deployed for? The machine reaper will silently kill power to the node.
> >
> > Right, but that's 8 hours, correct? I do have a "transparent duffy reuse" tool in https://github.com/cgwalters/centos-ci-skeleton which is used a lot in my jobs, but it generally avoids retaining machines for greater than an hour.
> >
> > Take https://ci.centos.org/job/atomic-fedora-ws-treecompose/50/console. If you click through to https://ci.centos.org/job/atomic-fedora-ws-duffy-allocate/6699/console, you see:
> >
> > 22:09:11 Assigning host: n8.pufty.ci.centos.org (SSID=jenkins-atomic-fedora-ws-treecompose-50)
> >
> > Then at this point it hung and I finally aborted it:
> >
> > 22:32:45 Installing packages: 75%
> > 23:53:14 Build was aborted
> >
> > So that host had only been assigned for less than 30 minutes. Is there anything relevant in the Duffy logs about this?

Nope, Duffy only powers machines up and powers them down using IPMI calls. There is nothing else in there, especially if the machine was still powered up.

-- Karanbir Singh +44-207-0999389 | http://www.karan.org/ | twitter.com/kbsingh GnuPG Key : http://www.karan.org/publickey.asc

Colin Walters

28 Jul 2016

4:18 p.m.

On Mon, Jul 25, 2016, at 06:13 PM, Colin Walters wrote:

> I'm seeing rather frequent issues where jobs that allocate a Duffy machine and ssh in end up losing network between my slave and the duffy machine - I can't ping even.

Sorry for the noise; carefully scanning the logs made the problem pretty obvious, and I just pushed:

https://github.com/cgwalters/centos-ci-skeleton/commit/d1868c509f35d34520ce9...

ci-users@lists.centos.org
