New subject: [Ci-users] [unscheduled outage] hardware issue impacting CI services (including https://ci.centos.org)

4 Oct 2020


      Yesterday (Saturday) evening we got zabbix notifications that some nodes
in CI environment were unreachable. After a quick look, I discovered
that it was an embedded network switch in a chassis hosting multiple
nodes (including but not limited to jenkins node behind ci.centos.org)
that went nuts.
I tried a remote "hardware reset" and nodes were back online after ~10min.
But this morning (sunday), I see through zabbix that same issue happened
again, and in the hour after I already did the "hardware reset", but
this time, even that doesn't work anymore.
So that means that we have a network switch not working anymore.
As that chassis (like almost *all* equipment in CI) *isn't* under
warranty, we'll see on monday what can be done and how we give priority
to try to dispatch services elsewhere (and that probably means then
powering down other services , depending on priority that will be
given), but it's easy to understand that we can't even give any ETA at
this point.
Thanks for your comprehending,
-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab