Hello guys,
our jobs on ci.centos.org are pending because the *devtools-ci-slave04* is offline. Can someone take a look, please? One of the affected jobs is here https://ci.centos.org/view/Devtools/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/ . Thank you!
Have a great day, Katka
On Tue, Oct 6, 2020 at 11:40 AM Katerina Foniok kkanova@redhat.com wrote:
Hello guys,
our jobs on ci.centos.org are pending because the devtools-ci-slave04 is offline. Can someone take a look, please?
fixed
One of the affected jobs is here. Thank you!
Have a great day, Katka _______________________________________________ CI-users mailing list CI-users@centos.org https://lists.centos.org/mailman/listinfo/ci-users
Thank you, `devtools-ci-slave04` is running again, but it seems that our jobs cannot get credentials from the vault now. Could this be related to the outage?
On Tue, Oct 6, 2020 at 8:43 AM Vipul Siddharth vipul@redhat.com wrote:
fixed
-- Vipul Siddharth He/His/Him Fedora | CentOS CI Infrastructure Team
So, I can see that access to Vault was disabled on purpose, so it probably isn't related to the outage. Sorry for the false alarm.
We can also see this error message in our jobs:
"msg": "Exceeded maximum allowed fail nodes limit, please release other machines to continue"
An example job is here: https://ci.centos.org/view/Devtools/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/2931/console
Thank you for taking a look, Katka
On Tue, Oct 6, 2020 at 9:04 AM Katerina Foniok kkanova@redhat.com wrote:
Thank you, the `devtools-ci-slave04` is running again but it seems that our jobs can not get credentials from the vault now. Can it be related to the outage?
On Tue, Oct 6, 2020 at 12:50 PM Katerina Foniok kkanova@redhat.com wrote:
We also can see this error message in our jobs:
"msg": "Exceeded maximum allowed fail nodes limit, please release other machines to continue"
When you mark a node as failed (usually when the job fails), the node stays around for 12 hours in case someone wants to check manually what went wrong. Keeping too many nodes in the fail state becomes a bottleneck for the Duffy pool, because those nodes can't be reprovisioned for the next round of jobs during those 12 hours. That is why we have a limit on how many nodes can be in the fail state. This message is expected: you would have seen it when calling the node/fail API, which should ideally be called when the job fails. So the actual error could be something else.
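As an illustration only, the behaviour described above can be modelled with a small Python sketch. This is a toy model and not Duffy's code: the class name `FailQuota`, the limit of 2 nodes, and the hour-based clock are assumptions made for the example. Only the 12-hour retention and the error text come from this thread.

```python
from dataclasses import dataclass, field

FAIL_RETENTION_HOURS = 12  # from the thread: failed nodes are kept for 12 hours
MAX_FAIL_NODES = 2         # hypothetical limit; the real value is not stated in the thread


class FailLimitExceeded(Exception):
    """Raised when too many nodes are already held in the fail state."""


@dataclass
class FailQuota:
    """Toy model of a per-tenant limit on nodes kept in the fail state."""
    max_fail_nodes: int = MAX_FAIL_NODES
    # node name -> hour at which the node was marked as failed
    failed_at: dict = field(default_factory=dict)

    def release_expired(self, now_hour):
        """Drop nodes whose retention window has passed and return their names."""
        expired = [name for name, marked in self.failed_at.items()
                   if now_hour - marked >= FAIL_RETENTION_HOURS]
        for name in expired:
            del self.failed_at[name]
        return expired

    def mark_failed(self, node, now_hour):
        """Hold a node for inspection, or refuse if the limit is already reached."""
        self.release_expired(now_hour)
        if len(self.failed_at) >= self.max_fail_nodes:
            raise FailLimitExceeded(
                "Exceeded maximum allowed fail nodes limit, "
                "please release other machines to continue")
        self.failed_at[node] = now_hour
```

In this model, a third `mark_failed` call within the retention window raises `FailLimitExceeded`, and the same call succeeds once an earlier node has been held for 12 hours and released.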
Ah, ok, thank you very much for clarifying!
On Tue, Oct 6, 2020 at 9:42 AM Vipul Siddharth vipul@redhat.com wrote:
We have a limit on how many can be in the fail state. This is expected and you would have seen it on calling node/fail API which should ideally be called when the job failed. So error could be something else
Hello, it seems that the `devtools-ci-slave04` is down again. Thank you, have a nice day Katka
On Tue, Oct 6, 2020 at 9:52 AM Katerina Foniok kkanova@redhat.com wrote:
Ah, ok, thank you very much for clarifying!
On 09/10/2020 13:28, Katerina Foniok wrote:
Hello, it seems that the `devtools-ci-slave04` is down again. Thank you, have a nice day Katka
Agent was restarted
Thank you :)
On Fri, Oct 9, 2020 at 1:34 PM Fabian Arrotin arrfab@centos.org wrote:
Agent was restarted
-- Fabian Arrotin The CentOS Project | https://www.centos.org gpg key: 17F3B7A1 | twitter: @arrfab