Hello guys,
our jobs on ci.centos.org are pending because *devtools-ci-slave04* is offline. Can someone take a look, please? One of the affected jobs is here: https://ci.centos.org/view/Devtools/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/ Thank you!
Have a great day, Katka
On Tue, Oct 6, 2020 at 11:40 AM Katerina Foniok kkanova@redhat.com wrote:
Hello guys,
our jobs on ci.centos.org are pending because the devtools-ci-slave04 is offline. Can someone take a look, please?
fixed
Thank you, `devtools-ci-slave04` is running again, but it seems that our jobs cannot get credentials from the vault now. Could this be related to the outage?
On Tue, Oct 6, 2020 at 8:43 AM Vipul Siddharth vipul@redhat.com wrote:
So, I can see that access to Vault was disabled on purpose, so it is probably unrelated to the outage. Sorry for the false alarm.
We can also see this error message in our jobs:
"msg": "Exceeded maximum allowed fail nodes limit, please release other machines to continue"
An example job is here: https://ci.centos.org/view/Devtools/job/devtools-rh-che-rh-che-prcheck-dev.rdu2c.fabric8.io/2931/console
Thank you for taking a look, Katka
On Tue, Oct 6, 2020 at 9:04 AM Katerina Foniok kkanova@redhat.com wrote:
On Tue, Oct 6, 2020 at 12:50 PM Katerina Foniok kkanova@redhat.com wrote:
So when you mark a node as failed (usually when the job fails), the node stays around for 12 hours in case someone wants to check manually what went wrong. Keeping too many nodes in the fail state becomes a bottleneck for the Duffy pool, as it means those nodes can't be reprovisioned for the next round of jobs (for 12 hours). We have a limit on how many nodes can be in the fail state. This is expected, and you would have seen it when calling the node/fail API, which should ideally be called when the job fails. So the error could be something else.
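For anyone reading along later, the behaviour described above can be pictured with a small toy model. This is only a sketch under assumptions, not Duffy's actual code: the class `ToyPool`, the exception `FailLimitExceeded`, and the limit value are hypothetical names and numbers made up for illustration. Only the 12-hour retention and the error text come from this thread.

```python
from datetime import datetime, timedelta

FAIL_RETENTION = timedelta(hours=12)  # retention period mentioned in this thread
MAX_FAILED_NODES = 3                  # hypothetical limit; the real value is not stated


class FailLimitExceeded(Exception):
    """Raised when too many nodes are already held in the fail state."""


class ToyPool:
    """Toy model of a node pool that caps how many nodes may be kept as failed."""

    def __init__(self, max_failed=MAX_FAILED_NODES):
        self.max_failed = max_failed
        self.failed = {}  # node name -> time it was marked as failed

    def expire(self, now):
        """Release failed nodes whose retention period has elapsed."""
        for node, marked_at in list(self.failed.items()):
            if now - marked_at >= FAIL_RETENTION:
                del self.failed[node]

    def mark_failed(self, node, now):
        """Keep a node for inspection, unless the fail limit is already reached."""
        self.expire(now)
        if len(self.failed) >= self.max_failed:
            raise FailLimitExceeded(
                "Exceeded maximum allowed fail nodes limit, "
                "please release other machines to continue"
            )
        self.failed[node] = now
```

In this model, a `mark_failed` call made while the limit is reached raises the error, and the same call succeeds again once older failed nodes have aged out after 12 hours.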
Ah, ok, thank you very much for clarifying!
On Tue, Oct 6, 2020 at 9:42 AM Vipul Siddharth vipul@redhat.com wrote:
Hello, it seems that `devtools-ci-slave04` is down again. Thank you, have a nice day, Katka
On Tue, Oct 6, 2020 at 9:52 AM Katerina Foniok kkanova@redhat.com wrote:
On 09/10/2020 13:28, Katerina Foniok wrote:
Agent was restarted