Hi ci-users,
We're currently experiencing an issue with our storage on the CentOS CI OCP 4 cluster. We'll be taking the cluster down for emergency maintenance immediately.
Apologies for the inconvenience, we'll keep you updated once we know more.
On Tue, Jun 8, 2021 at 6:33 PM David Kirwan dkirwan@redhat.com wrote:
Hi ci-users,
We're currently experiencing an issue with our storage on the CentOS CI OCP 4 cluster. We'll be taking the cluster down for emergency maintenance immediately.
This problem seems bigger than we had anticipated. Our investigation so far suggests a low-level hardware issue that will need an on-site visit. We may have to replace the server (the logs point to a fault in either the backplane or the controller), but we are hoping the on-site visit reveals something simpler, such as a power cable not connected properly or low voltage.
There is no estimated resolution time yet, but we will keep the ticket [0] up to date as we find out more.
[0] https://pagure.io/centos-infra/issue/353
The outage for the CentOS CI OCP 4 cluster is now over. Service has been fully restored with a temporary workaround.
We had a hardware failure on the storinator node `storage02`, which provides storage services to our cluster. Logs show some issues with the backplane.
As a temporary workaround, we have migrated this storage to an older node (which is out of warranty). We'll have an on-site engineer visit the data center early next week to diagnose the problem affecting the main storinator node. At a future date, once this storinator node is repaired/replaced, we will schedule an outage to migrate our storage back to that device.
The tracking ticket [0] has been updated.
[0] https://pagure.io/centos-infra/issue/353
On Wed, 9 Jun 2021 at 17:42, Vipul Siddharth vipul@redhat.com wrote:
This problem seems bigger than we had anticipated. Our investigation so far suggests a low-level hardware issue that will need an on-site visit. We may have to replace the server (the logs point to a fault in either the backplane or the controller), but we are hoping the on-site visit reveals something simpler, such as a power cable not connected properly or low voltage.
There is no estimated resolution time yet, but we will keep the ticket [0] up to date as we find out more.
[0] https://pagure.io/centos-infra/issue/353
Vipul Siddharth He/His/Him Fedora and CentOS Infrastructure
CI-users mailing list CI-users@centos.org https://lists.centos.org/mailman/listinfo/ci-users