[Ci-users] [unscheduled outage] hardware issue impacting CI services (including https://ci.centos.org)

Sun Oct 4 07:00:39 UTC 2020
Fabian Arrotin <arrfab at centos.org>

Yesterday (Saturday) evening we got Zabbix notifications that some nodes
in the CI environment were unreachable. After a quick look, I found that
the cause was an embedded network switch, in a chassis hosting multiple
nodes (including but not limited to the Jenkins node behind ci.centos.org),
that had started malfunctioning.

I tried a remote "hardware reset" and the nodes were back online after ~10 min.

But this morning (Sunday), I saw through Zabbix that the same issue had
happened again. I performed the "hardware reset" again within the hour,
but this time even that no longer works.

So we now have a network switch that no longer works.

As that chassis (like almost *all* equipment in CI) *isn't* under
warranty, we'll see on Monday what can be done and how to prioritize
moving services elsewhere (which probably means powering down other
services, depending on the priorities that are set). As you'll
understand, we can't give any ETA at this point.

Thanks for your understanding,
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
