[Ci-users] [unscheduled outage] hardware issue impacting CI services (including https://ci.centos.org)

Sun Oct 4 11:18:24 UTC 2020
Fabian Arrotin <arrfab at centos.org>

On 04/10/2020 09:00, Fabian Arrotin wrote:
> Yesterday (Saturday) evening we got Zabbix notifications that some nodes
> in the CI environment were unreachable. After a quick look, I discovered
> that an embedded network switch in a chassis hosting multiple nodes
> (including but not limited to the Jenkins node behind ci.centos.org)
> had gone nuts.
> 
> I tried a remote "hardware reset" and nodes were back online after ~10min.
> 
> But this morning (Sunday), I saw through Zabbix that the same issue had
> happened again. Within the hour I did another "hardware reset", but
> this time even that doesn't work anymore.
> 
> So that means that we have a network switch not working anymore.
> 
> As that chassis (like almost *all* equipment in CI) *isn't* under
> warranty, we'll see on Monday what can be done and how we prioritize
> moving services elsewhere (which probably means powering down other
> services, depending on the priority they are given), but it's easy to
> understand that we can't give any ETA at this point.
> 
> Thanks for your understanding,
> 

I put a quick workaround in place and Jenkins (aka ci.centos.org) is now
back in action and working normally.

We'll see tomorrow about the other impacted services.

-- 
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
