I'm seeing fairly frequent failures where jobs that allocate a Duffy machine and ssh in end up losing network connectivity between my slave and the Duffy machine; I can't even ping it.
https://ci.centos.org/job/atomic-fedora-ws-treecompose/49/console
is an example run, but there are a number of others. Are there any server-side logs available that might help debug this?
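For anyone trying to narrow down a similar hang, here is a minimal sketch of a reachability probe that could be run from the slave while a job is stuck. It only checks whether a TCP connection to the SSH port can be opened; the function name `tcp_reachable`, the default port 22, and the 5-second timeout are assumptions for illustration, not part of Duffy or the CI tooling.

```python
import socket


def tcp_reachable(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper: it does not authenticate or run anything on the
    remote side, it only tells us whether the port accepts connections.
    """
    try:
        # create_connection resolves the name and connects, raising OSError
        # (including timeouts and DNS failures) if it cannot.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling, say, `tcp_reachable("n8.pufty.ci.centos.org")` periodically from the slave would at least show when the node stops answering, which can then be lined up against the job console timestamps.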
On 25/07/16 23:13, Colin Walters wrote:
> I'm seeing rather frequent issues where jobs that allocate a Duffy machine and ssh in end up losing network between my slave and the duffy machine - I can't ping even.
> https://ci.centos.org/job/atomic-fedora-ws-treecompose/49/console
> is an example run, but there are a number of others. Are there any available server side logs that might help debug this?
How long was this machine deployed for? The machine reaper will silently kill power to the node.
--
Karanbir Singh, Project Lead, The CentOS Project
+44-207-0999389 | http://www.centos.org/ | twitter.com/CentOS
GnuPG Key : http://www.karan.org/publickey.asc
On Mon, Jul 25, 2016, at 06:40 PM, Karanbir Singh wrote:
> how long was this machine deployed for ? The machine reaper will silently kill power to the node.
Right, but that's 8 hours, correct? I do have a "transparent duffy reuse" tool in https://github.com/cgwalters/centos-ci-skeleton which is used a lot in my jobs, but it generally avoids retaining machines for longer than an hour.
Take https://ci.centos.org/job/atomic-fedora-ws-treecompose/50/console and click through to https://ci.centos.org/job/atomic-fedora-ws-duffy-allocate/6699/console
At:
    22:09:11 Assigning host: n8.pufty.ci.centos.org (SSID=jenkins-atomic-fedora-ws-treecompose-50)
Then at this point it hung and I finally aborted it:
    22:32:45 Installing packages: 75%
    23:53:14 Build was aborted
So that host had been assigned for less than 30 minutes when it hung. Is there anything relevant in the Duffy logs about this?
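To make the timing explicit, here is a small sketch that computes the gaps between the console timestamps quoted above. The helper name `elapsed` is hypothetical, and it assumes the stamps are plain HH:MM:SS values from the same run (rolling over to the next day only if the end time is earlier than the start).

```python
from datetime import datetime, timedelta


def elapsed(start: str, end: str) -> timedelta:
    """Return end - start for two HH:MM:SS console timestamps."""
    fmt = "%H:%M:%S"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    if t1 < t0:
        # Assumption: an earlier end time means the run crossed midnight.
        t1 += timedelta(days=1)
    return t1 - t0


assigned = "22:09:11"     # "Assigning host" line
last_output = "22:32:45"  # last "Installing packages" line before the hang
aborted = "23:53:14"      # "Build was aborted" line

print(elapsed(assigned, last_output))  # 0:23:34
print(elapsed(assigned, aborted))      # 1:44:03
```

So the last output arrived about 24 minutes after allocation, and even the manual abort came well under two hours in, far short of an 8 hour limit.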
On Jul 25 19:57, Colin Walters wrote:
> So that host had only been assigned for less than 30 minutes. Is there anything relevant in the Duffy logs about this?
Can we work together to start one of those jobs and see if it hangs? If so, before aborting the Jenkins job we can fail out the node and jump in on a serial console to grab some logs from the machine itself.
--Brian
On 26/07/16 15:30, Brian Stinson wrote:
> > So that host had only been assigned for less than 30 minutes. Is there anything relevant in the Duffy logs about this?
Nope, Duffy only powers machines up and down using IPMI calls. There is nothing else in there, especially if the machine was still powered up.
On Mon, Jul 25, 2016, at 06:13 PM, Colin Walters wrote:
> I'm seeing rather frequent issues where jobs that allocate a Duffy machine and ssh in end up losing network between my slave and the duffy machine - I can't ping even.
Sorry for the noise. Carefully scanning the logs made the problem pretty obvious, and I just pushed this commit:
https://github.com/cgwalters/centos-ci-skeleton/commit/d1868c509f35d34520ce9...