[CentOS-devel] [unscheduled outage] hardware issue impacting CI services (including https://ci.centos.org)

Yesterday (Saturday) evening we got zabbix notifications that some nodes
in CI environment were unreachable. After a quick look, I discovered
that it was an embedded network switch in a chassis hosting multiple
nodes (including but not limited to jenkins node behind ci.centos.org)
that went nuts.

I tried a remote "hardware reset" and nodes were back online after ~10min.

But this morning (sunday), I see through zabbix that same issue happened
again, and in the hour after I already did the "hardware reset", but
this time, even that doesn't work anymore.

So that means that we have a network switch not working anymore.

As that chassis (like almost *all* equipment in CI) *isn't* under
warranty, we'll see on monday what can be done and how we give priority
to try to dispatch services elsewhere (and that probably means then
powering down other services , depending on priority that will be
given), but it's easy to understand that we can't even give any ETA at
this point.

Thanks for your comprehending,
-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: OpenPGP digital signature
URL: <http://lists.centos.org/pipermail/centos-devel/attachments/20201004/9385d7f2/attachment-0005.sig>