CentOS CI OCP 4 cluster down for emergency maintenance

David Kirwan

8 Jun 2021

1:04 p.m.

Hi ci-users,

We're currently experiencing a storage issue on the CentOS CI OCP 4 cluster, so we'll be taking the cluster down for emergency maintenance immediately.

Apologies for the inconvenience. We'll update you as soon as we know more.

-- David Kirwan Software Engineer Community Platform Engineering @ Red Hat T: +(353) 86-8624108 IM: @dkirwan

Vipul Siddharth

9 Jun 2021

4:41 p.m.

On Tue, Jun 8, 2021 at 6:33 PM David Kirwan <dkirwan@redhat.com> wrote:

Hi ci-users,

We're currently experiencing a storage issue on the CentOS CI OCP 4 cluster, so we'll be taking the cluster down for emergency maintenance immediately.

This problem seems bigger than we had anticipated. So far, our investigation points to a low-level hardware issue that will need an on-site visit. We may have to replace the server (the logs suggest a problem with either the backplane or the controller), but we're hoping the on-site visit reveals something simple like "power cable not connected properly or low voltage".

There is no estimated resolution time yet, but we will keep the ticket [0] up to date as we learn more.

[0] https://pagure.io/centos-infra/issue/353

-- Vipul Siddharth He/His/Him Fedora and CentOS Infrastructure

David Kirwan

10 Jun 2021

2:30 p.m.

The outage for the CentOS CI OCP 4 cluster is now over. Service has been fully restored with a temporary workaround.

We had a hardware failure on the storinator node `storage02`, which provides storage services to our cluster. Logs show some issues with the backplane.

As a temporary workaround, we have migrated this storage to an older node (which is out of warranty). We'll have an on-site engineer visit the data center early next week to diagnose the problem affecting the main storinator node. At a future date, once this storinator node is repaired/replaced, we will schedule an outage to migrate our storage back to that device.

The tracking ticket [0] has been updated.

- [0] https://pagure.io/centos-infra/issue/353

CI-users mailing list CI-users@centos.org https://lists.centos.org/mailman/listinfo/ci-users

-- David Kirwan Software Engineer Community Platform Engineering @ Red Hat T: +(353) 86-8624108 IM: @dkirwan
