We recently discovered that the number of Duffy EC2 instances (on the AWS side) is much higher than what Duffy thinks is provisioned, so at some point it lost track of the EC2 instances actually deployed, and we need to reconcile the DB with reality.
We just need to:
- stop duffy
- delete all duffy ec2 instances
- clean up the duffy DB
- restart duffy (it will reprovision from scratch)
Once restarted, you'll be able to resume your CI jobs and request Duffy nodes again.
Maintenance is scheduled for "Wednesday, April 23rd, 11:30 am UTC". You can convert to your local time with $(date -d '2025-04-23 11:30 UTC').
We're sending this in advance so that you can either pause your builds, or add retry logic to your provisioning scripts for Duffy nodes so that they keep retrying until the service is available again.
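For example, something along these lines (the provisioning command below is only a placeholder for whatever your jobs actually call):

```
#!/bin/bash
# Retry a Duffy node request until the service is back, give up after a while.
max_attempts=30
sleep_seconds=60

for attempt in $(seq 1 "${max_attempts}"); do
    # Placeholder: replace with your real provisioning call (duffy client / API request).
    if your-duffy-provisioning-command; then
        echo "Got a Duffy node on attempt ${attempt}"
        exit 0
    fi
    echo "Duffy not available yet (attempt ${attempt}/${max_attempts}), sleeping ${sleep_seconds}s ..."
    sleep "${sleep_seconds}"
done

echo "Giving up after ${max_attempts} attempts" >&2
exit 1
```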
On 22/04/2025 14:16, Fabian Arrotin wrote:
Just to add that we'll also decommission the ppc64le architecture in the Duffy pool, due to the upcoming DC move (Duffy will then run entirely from the cloud, where it's not possible to request the ppc64le arch anyway - see https://pagure.io/centos-infra/issue/1590).
On 22/04/2025 22:55, Peter Georg wrote:
Just to clarify: does this mean that in the future there will no longer be any possibility for SIGs to run tests for ppc64le within the CentOS project, or is this only temporary? If permanent, in the case of the Kmods SIG this would mean that we would either stop providing kernels for ppc64le or ship them without performing any tests, i.e. the kernel might not even boot. Or am I missing another option?
On 23/04/2025 06:54, Fabian Arrotin wrote:
That's a very good question. Based on actual consumption I was under the impression that it wasn't used at all, but you're right: over the last 30 days the average is 0.002109 deployed C9S ppc64le nodes (with a peak of 3). The problem is that we're not sure about keeping access to the Power9 hardware, and keeping OpenNebula (which itself needs an OpenNebula controller on a dedicated host, now used only for that single ppc64le hypervisor) doesn't make sense.
So let's make that "temporary": I'll try to plumb ppc64le back in, but outside of OpenNebula, using either cloud images (contextualized through a local cloud-init ISO) or plain virt-install guests. I'll create another ticket for that, but OpenNebula definitely has to go away, and before the DC move.
In the meantime, can you try to just boot the ppc64le kernel (as it's not a fully functional test, I guess) through QEMU on a bare-metal EC2 host? I'll try to add ppc64le guest support soon, but there are multiple moving targets on my plate right now :/
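To illustrate the virt-install route mentioned above, a rough and untested sketch (the guest name, image path and the --cloud-init suboptions are assumptions to adapt, not the exact setup we'd deploy):

```
# Emulated (TCG) ppc64le guest defined through libvirt, contextualized via cloud-init
virt-install \
  --name c9s-ppc64le-test \
  --virt-type qemu \
  --arch ppc64le --machine pseries \
  --vcpus 4 --memory 4096 \
  --import \
  --disk path=/var/lib/libvirt/images/c9s-ppc64le.qcow2,format=qcow2,bus=virtio \
  --cloud-init user-data=./user-data,meta-data=./meta-data \
  --osinfo centos-stream9 \
  --network network=default,model=virtio \
  --graphics none --console pty,target_type=serial
```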
On 23/04/2025 07:20, Fabian Arrotin wrote:
... and while I think I'd be able to add ppc64le arch support to the Duffy pool, I'm not sure it would still be possible after the DC move: we're still waiting for confirmation that we'd be able to get a tunnel between the AWS VPC and the new DC VLAN, and there's no confirmation yet (I'd say it looks more and more like it will probably not be possible).
On 23/04/2025 12:10, Peter Georg wrote:
It is just a simple "does it boot" test, i.e.,
- Boot CentOS Stream
- Install new kernel to be tested
- Reboot using new kernel
- Verify new kernel booted
So using qemu might indeed be an option.
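That last verification step could be as simple as something like this (the expected version string here is just an example; in practice it would come from the installed kernel RPM):

```
#!/bin/bash
# After rebooting into the newly installed kernel, check that it is actually running.
expected_kernel="6.6.87-1.el9.ppc64le"   # example value, derive from the tested RPM
running_kernel="$(uname -r)"

if [ "${running_kernel}" = "${expected_kernel}" ]; then
    echo "OK: booted into ${running_kernel}"
else
    echo "FAIL: running ${running_kernel}, expected ${expected_kernel}" >&2
    exit 1
fi
```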
> ... and while I think I'd be able to add ppc64le arch support to the Duffy pool, I'm not sure it would still be possible after the DC move [...]

Ok. Probably the easiest is to "temporarily" disable tests for ppc64le. Once it is clear what is and isn't possible, a solution can be implemented by the Kmods SIG. Implementing a temporary solution is not worth the effort.
On 23/04/2025 23:13, Fabian Arrotin wrote:
Just to confirm that it is possible (even if slower than on a "native" ppc64le host): I grabbed an ephemeral Duffy EC2 instance (so not a bare-metal one, just a classic EC2 VM) and installed centos-release-hyperscale (as they provide qemu-* packages, including qemu-system-ppc64).
Then I ran a classic qcow2 cloud image for CentOS Stream 9 (ppc64le) on that x86_64 EC2 host and it boots fine (slowly of course, as it's emulated).
From that qemu ppc64le guest, I then tried "dnf install -y centos-release-kmods-kernel-6.6 && dnf update && systemctl reboot"
On reboot:

```
uname -a ; lscpu
Linux qemu-ppc64le.dev.centos.org 6.6.87-1.el9.ppc64le #1 SMP Fri Apr 11 07:14:53 UTC 2025 ppc64le ppc64le ppc64le GNU/Linux
Architecture:             ppc64le
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Model name:               POWER9 (architected), altivec supported
  Model:                  2.2 (pvr 004e 1202)
  Thread(s) per core:     1
  Core(s) per socket:     4
  Socket(s):              1
Virtualization features:
  Hypervisor vendor:      KVM
  Virtualization type:    para
Caches (sum of all):
  L1d:                    128 KiB (4 instances)
  L1i:                    128 KiB (4 instances)
NUMA:
  NUMA node(s):           1
  NUMA node0 CPU(s):      0-3
Vulnerabilities:
  Gather data sampling:   Not affected
  Itlb multihit:          Not affected
  L1tf:                   Mitigation; RFI Flush, L1D private per thread
  Mds:                    Not affected
  Meltdown:               Mitigation; RFI Flush, L1D private per thread
  Mmio stale data:        Not affected
  Reg file data sampling: Not affected
  Retbleed:               Not affected
  Spec rstack overflow:   Not affected
  Spec store bypass:      Mitigation; Kernel entry/exit barrier (eieio)
  Spectre v1:             Mitigation; __user pointer sanitization
  Spectre v2:             Mitigation; Software count cache flush (hardware accelerated), Software link stack flush
  Srbds:                  Not affected
  Tsx async abort:        Not affected
```
I can document the whole thing in the sig-guide (testing section) if that interests people. If we want to use the cloud image in qcow2 format, we can just "inject" cloud-init metadata through a local .iso that is found and applied automatically when the QEMU emulated guest starts.
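Roughly, the sequence could look like this (a sketch rather than the exact commands I issued: the image URL/filename, the ISO-building package and most qemu options would need double-checking):

```
# qemu-system-ppc64 comes from the Hyperscale SIG repos
sudo dnf install -y centos-release-hyperscale
sudo dnf install -y qemu-system-ppc64 xorriso   # xorrisofs builds the cloud-init seed ISO

# CentOS Stream 9 ppc64le cloud image (verify the exact filename on cloud.centos.org)
curl -LO https://cloud.centos.org/centos/9-stream/ppc64le/images/CentOS-Stream-GenericCloud-9-latest.ppc64le.qcow2

# cloud-init NoCloud seed: user-data + meta-data on a small ISO labelled "cidata"
cat > user-data <<'EOF'
#cloud-config
password: changeme
chpasswd: { expire: false }
ssh_pwauth: true
EOF
cat > meta-data <<'EOF'
instance-id: qemu-ppc64le-01
local-hostname: qemu-ppc64le.dev.centos.org
EOF
xorrisofs -o seed.iso -V cidata -J -R user-data meta-data

# Boot the emulated ppc64le guest (TCG emulation on x86_64, so expect it to be slow)
qemu-system-ppc64 \
  -machine pseries -cpu POWER9 -smp 4 -m 4096 \
  -nographic \
  -drive file=CentOS-Stream-GenericCloud-9-latest.ppc64le.qcow2,if=virtio,format=qcow2 \
  -drive file=seed.iso,if=virtio,format=raw \
  -netdev user,id=net0 -device virtio-net-pci,netdev=net0
```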
Peter Georg wrote:
It is indeed interesting for me as this is exactly the workflow the Kmods SIG would then implement to test kernels on ppc64le. In case no one else is interested and documenting on sig-guide is too much work, I'd be happy if you could just send me the list of issued commands as a starting point. Thanks!
Fabian Arrotin wrote:
Just to let you know that we're fully back in action, and with a clean state (both Duffy DB and AWS side) now.