Infra : scheduled hardware maintenance (Openshift/NFS)

Fabian Arrotin

17 Sep 2020

2:30 p.m.

Due to hardware maintenance that needs to take place on the NFS storage node used by OpenShift (both the "legacy" cluster and the current one, OCP), we'll have to shut down the OpenShift cluster and then proceed with the hardware maintenance on the NFS server (which itself needs to be powered down; there is no way to do this "online").

The maintenance is scheduled for Wednesday, September 30th, 12:00 pm UTC. You can convert this to local time with $(date -d '2020-09-30 12:00 UTC')
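As a minimal sketch of that conversion for zones other than your own: the helper below wraps the same `date -d` call and sets `TZ` for a single invocation. The function name `in_zone` and the two example zones are illustrative assumptions, not part of the announcement, and `-d` requires GNU date.

```shell
#!/bin/sh
# Print a timestamp in a given time zone (GNU date is needed for -d).
# $1: timestamp with an explicit zone, e.g. '2020-09-30 12:00 UTC'
# $2: target zone, either a tz database name or a POSIX TZ string
in_zone() {
    TZ="$2" date -d "$1" '+%Y-%m-%d %H:%M %Z'
}

# Example zones, chosen only for illustration.
in_zone '2020-09-30 12:00 UTC' 'Europe/Brussels'
in_zone '2020-09-30 12:00 UTC' 'America/Chicago'
```

Setting `TZ` only for the one command leaves the rest of the shell session on its usual time zone.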

The expected "downtime" is estimated at ~60 minutes: the time needed to shut down the machine, install new disks, restart the machine, and do some updates and tuning on the setup.

For more information, here are some relevant tickets that were created for the performance issues with OpenShift and NFS:

https://pagure.io/centos-infra/issue/53
https://pagure.io/centos-infra/issue/105
https://pagure.io/centos-infra/issue/85
https://pagure.io/centos-infra/issue/26

<subliminal message> PS: it's worth noting that while we'll investigate reports on the new OCP cluster, we'll probably not spend time investigating issues on the old/legacy one. Projects are supposed to migrate away from it soon, as the legacy OpenShift setup will disappear soon (see https://pagure.io/centos-infra/issue/16). </subliminal message>

Thanks for your understanding and patience.

on behalf of the CI Infra team,

--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab

Fabian Arrotin

29 Sep 2020

8:20 p.m.

On 17/09/2020 16:30, Fabian Arrotin wrote:

...

Reminder! :-)

Also, because we need time to cleanly power down all nodes, we decided to start at 11:00 am UTC. That way we'll be ready when the on-site engineer starts un-racking the storage server for the hardware maintenance and puts it back online afterwards (we have a fixed appointment for this).

I'd like to remind all projects still on the old OpenShift cluster that, despite our calls to migrate, only very few did. So we (the CentOS CI infra team) will discuss how to deal with this, but at first sight we'll just announce a date/deadline for decommissioning the old infra.

--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab

Vipul Siddharth

30 Sep 2020

1:02 p.m.

On Wed, Sep 30, 2020 at 1:50 AM Fabian Arrotin arrfab@centos.org wrote:

...

Due to some issues with the legacy cluster volume, the maintenance is taking longer than expected. We are working on it. Apologies for the inconvenience.

CI-users mailing list CI-users@centos.org https://lists.centos.org/mailman/listinfo/ci-users

--
Vipul Siddharth
He/His/Him
Fedora | CentOS CI Infrastructure Team

Brian Stinson

10:24 p.m.

Hi Folks,

Here's an update on where we are with the legacy (OKD 3.6) cluster. There were some integrity issues on the filesystem, so we decided to run a full xfs_repair, and we will sync the data to a new volume before we bring the old cluster up again.

The good news is: we have a good xfs_repair.
The bad news is: the sync is going to take a number of hours yet. I don't have a good ETA for when the legacy cluster will come up again, but it may run into Thursday morning, US time.
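For readers unfamiliar with this kind of recovery, here is a rough sketch of what a repair-then-sync sequence can look like. It is not the procedure the team actually ran: the device path, the mount points, the `run` and `repair_and_sync` helpers, and the choice of rsync as the copy tool are all assumptions made for illustration. By default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical device and mount points; none of these come from the thread.
OLD_DEV=${OLD_DEV:-/dev/mapper/vg_old-legacy}
OLD_MNT=${OLD_MNT:-/mnt/legacy-old}
NEW_MNT=${NEW_MNT:-/mnt/legacy-new}
DRY_RUN=${DRY_RUN:-1}   # 1: only print the commands, anything else: execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

repair_and_sync() {
    run umount "$OLD_MNT"                  # xfs_repair works on an unmounted filesystem
    run xfs_repair -n "$OLD_DEV"           # -n: check and report only, modify nothing
    run xfs_repair "$OLD_DEV"              # the actual repair
    run mount -o ro "$OLD_DEV" "$OLD_MNT"  # read-only, so the source cannot change mid-copy
    # Assumed copy tool: rsync preserving hard links, ACLs, xattrs and numeric owners.
    run rsync -aHAX --numeric-ids "$OLD_MNT"/ "$NEW_MNT"/
}

repair_and_sync
```

On a large volume the copy step usually dominates the elapsed time, which fits the "number of hours" estimate above.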

This is another call for folks who would like to start a migration to the fancy OCP cluster: please file a ticket at https://pagure.io/centos-infra

-- Brian Stinson brian@bstinson.com

Brian Stinson

1 Oct 2020

4:19 a.m.

The legacy cluster should be back online now.

Thank you all for your patience while we worked this through.

--Brian
