Yesterday (Saturday) evening we got Zabbix notifications that some nodes in the CI environment were unreachable. After a quick look, I discovered that the cause was an embedded network switch that went nuts, in a chassis hosting multiple nodes (including but not limited to the Jenkins node behind ci.centos.org).
I tried a remote "hardware reset" and the nodes were back online after ~10 min.
But this morning (Sunday), I saw through Zabbix that the same issue had happened again. Within the hour I did another "hardware reset", but this time even that doesn't work anymore.
So that means that we have a network switch not working anymore.
As that chassis (like almost *all* equipment in CI) *isn't* under warranty, we'll see on Monday what can be done and how to prioritize moving services elsewhere (which probably means powering down other services, depending on the priorities that are set). As you can understand, we can't give any ETA at this point.
Thanks for your understanding,
On 04/10/2020 09:00, Fabian Arrotin wrote:
[...]
I put a quick workaround in place, and Jenkins (aka ci.centos.org) is now back in normal operation.
We'll look at the other impacted services tomorrow.