From arrfab at centos.org  Mon Oct  3 11:37:55 2022
From: arrfab at centos.org (Fabian Arrotin)
Date: Mon, 3 Oct 2022 13:37:55 +0200
Subject: [Ci-users] CentOS CI infra changes: retiring ci.centos.org legacy jenkins
In-Reply-To: <01d98834-444f-a8b8-ddba-36b8b2020ec4@centos.org>
References: <01d98834-444f-a8b8-ddba-36b8b2020ec4@centos.org>
Message-ID: <1cbd0172-e192-5acd-016e-24244371006d@centos.org>

On 29/09/2022 13:36, Fabian Arrotin wrote:
> As a FYI message: the legacy Jenkins server (initially a shared instance
> between CI tenants), available behind https://ci.centos.org, will be
> decommissioned and powered off next week.
>
> It's not used anymore and was only kept online with a notification about
> shutting it down, pointing to the relevant ticket template to ask to be
> migrated to OpenShift (2 years ago).
>
> As we're also decommissioning other parts of the CI infra, we'll just
> delete that one and also remove the A record from public DNS (currently
> pointing to the haproxy in front).
>
> Related ticket: https://pagure.io/centos-infra/issue/931
>
> Kind Regards

ci.centos.org is now "gone" and the public A record has also been removed.

--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab


From frantisek at sumsal.cz  Thu Oct 20 18:30:23 2022
From: frantisek at sumsal.cz (František Šumšal)
Date: Thu, 20 Oct 2022 20:30:23 +0200
Subject: [Ci-users] Performance issues on the EC2 T2 nodes
Message-ID: <3e4e05d5-9568-ba80-1d47-06b998a45ba0@sumsal.cz>

Hey!

As the follow-up discussion after the migration to AWS suggested, I moved
one of our systemd CI jobs (which relies heavily on spawning QEMU VMs)
from a metal machine to a T2 VM. This took quite a while and required
tuning down the "depth" of some tests (which also means less coverage),
but in the end it works quite reliably, and with parallelization the job
takes a reasonable amount of time (~95 minutes).

However, recently (~three weeks ago) I noticed that the jobs using T2
machines (from the virt-ec2-t2-centos-8s-x86_64 pool) started having
issues - the runtime was over 4 hours (instead of the usual ~95 minutes)
and several tests kept timing out. In this case the issue fixed itself
before I had a chance to debug it further, though.

Fast forward to this week: the issue appeared again on Monday, but, again,
it fixed itself - this time while I was debugging it. I put some possibly
helpful debug checks in place and waited for the next occurrence, which
happened to be this Wednesday.

Taking a couple of affected machines aside, I checked various things,
including all possible logs, I/O measurements, etc., and in the end I
noticed that more than 50% of the vCPU time is actually being "stolen"
by the hypervisor:

# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-29-92.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)
...
Average:  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  11.65   0.00  3.17     0.32  1.72   0.30   52.92    0.00    0.00  29.92
Average:    0  12.99   0.00  2.68     0.37  1.69   0.32   59.64    0.00    0.00  22.31
Average:    1  11.00   0.00  3.14     0.34  1.95   0.34   49.16    0.00    0.00  34.06
Average:    2  11.31   0.00  2.72     0.52  1.84   0.41   47.02    0.00    0.00  36.18
Average:    3  11.75   0.00  2.97     0.45  1.96   0.29   52.45    0.00    0.00  30.11
Average:    4  10.88   0.00  3.06     0.19  1.99   0.21   46.12    0.00    0.00  37.55
Average:    5  13.19   0.00  3.30     0.24  1.11   0.22   64.23    0.00    0.00  17.71
Average:    6  11.92   0.00  3.71     0.08  1.60   0.24   58.52    0.00    0.00  23.92
Average:    7  10.22   0.00  3.82     0.32  1.63   0.32   46.71    0.00    0.00  36.99

After some tinkering I managed to reproduce it on a "clean" machine as
well, with just `stress --cpu 8`, where the results were even worse:

# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-34-82.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)

Average:  CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  19.08   0.00  0.00     0.00  0.27   0.00   80.64    0.00    0.00   0.00
Average:    0  12.75   0.00  0.00     0.00  0.22   0.00   87.03    0.00    0.00   0.00
Average:    1  19.08   0.00  0.00     0.00  0.15   0.00   80.77    0.00    0.00   0.00
Average:    2  22.55   0.00  0.00     0.00  0.24   0.00   77.21    0.00    0.00   0.00
Average:    3  25.94   0.00  0.00     0.00  0.38   0.00   73.68    0.00    0.00   0.00
Average:    4  19.17   0.00  0.00     0.00  0.30   0.00   80.53    0.00    0.00   0.00
Average:    5  21.31   0.00  0.00     0.00  0.39   0.00   78.30    0.00    0.00   0.00
Average:    6  17.37   0.00  0.00     0.00  0.30   0.00   82.33    0.00    0.00   0.00
Average:    7  20.48   0.00  0.00     0.00  0.27   0.00   79.25    0.00    0.00   0.00

This is really unfortunate. As the EC2 VMs don't support nested
virtualization, we're "at the mercy" of TCG, which is quite CPU intensive.
When this rate-limiting kicks in, the CI jobs take more than 4 hours, and
that's with timeout protections in place (meaning the CI results are
unusable) - without them they would take significantly longer and would
most likely be killed by the watchdog, which is currently 6 hours (iirc).
And since the current project queue limit is 10 parallel jobs, this makes
the CI part almost infeasible.

I originally reported it to the CentOS Infra tracker [0] but was advised
to post it here, since this behavior is, for better or worse, expected,
or at least that's how the T2 machines are advertised. Is there anything
that can be done to mitigate this? The only available solution would be
to move the job back to the metal nodes, but that's going against the
original issue (and the metal pool is quite limited anyway).

Cheers,
Frantisek

[0] https://pagure.io/centos-infra/issue/950

--
PGP Key ID: 0xFB738CE27B634E4B
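[Sketch: one way to spot this state up front is to check %steal before
starting a long run. The /proc/stat "cpu" line layout (user nice system
idle iowait irq softirq steal ...) is standard, but the 20% threshold and
the bail-out behaviour below are illustrative only, not something the CI
jobs currently do.]

    #!/bin/bash
    # Sample the aggregate CPU counters twice and compute how much of the
    # interval was stolen by the hypervisor.
    INTERVAL=${1:-10}
    THRESHOLD=${2:-20}   # %steal above which the node is considered throttled

    read -r _ u1 n1 s1 i1 w1 q1 sq1 st1 _ < /proc/stat
    sleep "$INTERVAL"
    read -r _ u2 n2 s2 i2 w2 q2 sq2 st2 _ < /proc/stat

    total1=$((u1 + n1 + s1 + i1 + w1 + q1 + sq1 + st1))
    total2=$((u2 + n2 + s2 + i2 + w2 + q2 + sq2 + st2))
    steal=$((100 * (st2 - st1) / (total2 - total1)))

    echo "steal over ${INTERVAL}s: ${steal}%"
    if [ "$steal" -ge "$THRESHOLD" ]; then
        echo "node appears CPU-throttled, bailing out early" >&2
        exit 1
    fi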
From mhofmann at redhat.com  Thu Oct 20 19:46:29 2022
From: mhofmann at redhat.com (Michael Hofmann)
Date: Thu, 20 Oct 2022 21:46:29 +0200
Subject: [Ci-users] Performance issues on the EC2 T2 nodes
In-Reply-To: <3e4e05d5-9568-ba80-1d47-06b998a45ba0@sumsal.cz>
References: <3e4e05d5-9568-ba80-1d47-06b998a45ba0@sumsal.cz>
Message-ID: <20221020194613.GA596414@black>

Hi František,

On Thu, Oct 20, 2022 at 08:30:23PM +0200, František Šumšal wrote:
> I originally reported it to the CentOS Infra tracker [0] but was advised
> to post it here, since this behavior is, for better or worse, expected,
> or at least that's how the T2 machines are advertised. Is there anything
> that can be done to mitigate this? The only available solution would be
> to move the job back to the metal nodes, but that's going against the
> original issue (and the metal pool is quite limited anyway).

T2 instances don't have to be rate-limited if you are ready to pay for
it: IIUC, the T2 Unlimited mode [1] allows you to trade money for
sustained higher CPU utilization.

Cheers,
Michael

[1] https://aws.amazon.com/blogs/aws/new-t2-unlimited-going-beyond-the-burst-with-high-performance/

> [0] https://pagure.io/centos-infra/issue/950

--
Michael Hofmann (he/him) | CKI | IRC mh21 #kernelci | GPG 0xE8E1F78D86F24DA1
Red Hat GmbH, Werner von Siemens Ring 12, D-85630 Grasbrunn
Amtsgericht Muenchen/Munich, HRB 153243
Managing Directors: Ryan Barnhart, Charles Cachera, Michael O'Neill, Amy Ross
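[Sketch: switching an existing T2 instance between the "standard" and
"unlimited" credit modes is a single EC2 API call. The instance ID below
is a placeholder and the exact option spelling should be verified against
the installed AWS CLI version; note that unlimited mode incurs extra
charges once the CPU credit balance is exhausted.]

    # show the current credit mode of an instance
    aws ec2 describe-instance-credit-specifications \
        --instance-ids i-0123456789abcdef0

    # switch it to unlimited (surplus credits are billed on top of the
    # regular instance cost)
    aws ec2 modify-instance-credit-specification \
        --instance-credit-specification "InstanceId=i-0123456789abcdef0,CpuCredits=unlimited"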
From arrfab at centos.org  Sat Oct 22 07:52:21 2022
From: arrfab at centos.org (Fabian Arrotin)
Date: Sat, 22 Oct 2022 09:52:21 +0200
Subject: [Ci-users] Performance issues on the EC2 T2 nodes
In-Reply-To: <3e4e05d5-9568-ba80-1d47-06b998a45ba0@sumsal.cz>
References: <3e4e05d5-9568-ba80-1d47-06b998a45ba0@sumsal.cz>
Message-ID: <5ae8f3c0-0596-f4dc-c00b-2bf5cfab7c52@centos.org>

On 20/10/2022 20:30, František Šumšal wrote:
> Hey!
>
> I originally reported it to the CentOS Infra tracker [0] but was advised
> to post it here, since this behavior is, for better or worse, expected,
> or at least that's how the T2 machines are advertised. Is there anything
> that can be done to mitigate this? The only available solution would be
> to move the job back to the metal nodes, but that's going against the
> original issue (and the metal pool is quite limited anyway).
>

Well, yes, and it was a known fact: the "Cloud [TM]" is about virtual
machines and (normally) not about bare-metal options. I was even just
happy that we could (ab)use a little the fact that AWS supports "metal"
options, but clearly (as you discovered) in very limited quantity and
availability.

Unfortunately that's the only thing we (or I should say "AWS", which is
sponsoring that infra) can offer.

Isn't there a possibility to switch your workflow to avoid QEMU binary
emulation? IIRC you wanted to have a VM yourself, to be able to
troubleshoot through console access in case something didn't come back
online. What about running the tests directly on the t2 instance and only
triggering another job that does that when something fails? (So only
looking at that option *when* there is something to debug.) I hope that
the systemd code is sane and so doesn't need someone to troubleshoot
issues for each commit/build/test :)

--
Fabian Arrotin
The CentOS Project | https://www.centos.org
gpg key: 17F3B7A1 | twitter: @arrfab
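[Sketch of the two-stage approach suggested above: run the suite directly
on the t2 worker and only spend the scarcer resources on a debuggable
environment when something actually failed. The run-integration-tests.sh
and request-debug-job.sh helpers are hypothetical placeholders for
whatever the tenant's own tooling provides, not existing CI scripts.]

    #!/bin/bash
    set -u

    # First pass: run everything natively on the t2 instance.
    if ./run-integration-tests.sh; then
        exit 0
    fi

    # Something failed: only now trigger a follow-up job that re-runs the
    # failing tests somewhere with console access for debugging.
    ./request-debug-job.sh --reason "integration tests failed on $(hostname)"
    exit 1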
From frantisek at sumsal.cz  Sun Oct 23 17:52:01 2022
From: frantisek at sumsal.cz (František Šumšal)
Date: Sun, 23 Oct 2022 19:52:01 +0200
Subject: [Ci-users] Performance issues on the EC2 T2 nodes
In-Reply-To: <5ae8f3c0-0596-f4dc-c00b-2bf5cfab7c52@centos.org>
References: <3e4e05d5-9568-ba80-1d47-06b998a45ba0@sumsal.cz>
 <5ae8f3c0-0596-f4dc-c00b-2bf5cfab7c52@centos.org>
Message-ID: <0fe205e1-afb6-9840-13f0-a5cd22fe2cd9@sumsal.cz>

On 10/22/22 09:52, Fabian Arrotin wrote:
> On 20/10/2022 20:30, František Šumšal wrote:
>> Hey!
>>
>> I originally reported it to the CentOS Infra tracker [0] but was advised
>> to post it here, since this behavior is, for better or worse, expected,
>> or at least that's how the T2 machines are advertised. Is there anything
>> that can be done to mitigate this? The only available solution would be
>> to move the job back to the metal nodes, but that's going against the
>> original issue (and the metal pool is quite limited anyway).
>
> Well, yes, and it was a known fact: the "Cloud [TM]" is about virtual
> machines and (normally) not about bare-metal options. I was even just
> happy that we could (ab)use a little the fact that AWS supports "metal"
> options, but clearly (as you discovered) in very limited quantity and
> availability.
>
> Unfortunately that's the only thing we (or I should say "AWS", which is
> sponsoring that infra) can offer.
>
> Isn't there a possibility to switch your workflow to avoid QEMU binary
> emulation? IIRC you wanted to have a VM yourself, to be able to
> troubleshoot through console access in case something didn't come back
> online.

Not really - troubleshooting can eventually be done locally, that's just
a "bonus". Many tests require QEMU since they use it to emulate different
device topologies - the udev test suite exercises various
multipath/(i)scsi/nvme/raid/etc. topologies with various numbers of
storage-related devices, then there are some tests for NUMA stuff, etc.
Another reason for QEMU is that we need to test the early boot stuff, the
transition/handover from the initrd to the real root, TPM2 stuff, RTC
behavior, cryptsetup stuff, etc., many of which would be nigh impossible
to do on the host machine.

So, yeah, there are multiple factors why we can't just ditch QEMU (not to
mention cases like sanitizers, where even QEMU is not enough and you need
KVM), but I guess that's just the nature of the component in question
(systemd). I suspect this won't affect many other users, and I have no
idea if there's a clear solution. Michael mentioned in the other thread
an "unlimited" mode for the T2 machines, but that is, of course, an
additional cost, which we already hike up quite "a bit" by using the
metal machines (and I'm not even sure it's feasible, not being familiar
with AWS tiers). But I guess this needs feedback from other projects - I
don't want to raise RFEs just to make one of the many CentOS CI projects
happy.

Cheers,
Frantisek

>
> _______________________________________________
> CI-users mailing list
> CI-users at centos.org
> https://lists.centos.org/mailman/listinfo/ci-users

--
PGP Key ID: 0xFB738CE27B634E4B
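[Sketch: the KVM vs. TCG cost difference discussed above is why a test
runner typically probes for /dev/kvm and only falls back to binary
emulation when it has to. -accel kvm and -accel tcg,thread=multi are
standard QEMU options; the image name and timeout values below are
placeholders, not the actual systemd CI configuration.]

    #!/bin/bash
    # Pick an accelerator for the test VM and scale the timeout to match.
    if [ -c /dev/kvm ] && [ -w /dev/kvm ]; then
        ACCEL=kvm                  # metal nodes / hosts exposing KVM
        TIMEOUT=1200
    else
        ACCEL=tcg,thread=multi     # EC2 VMs without nested virt: much slower
        TIMEOUT=4800
    fi

    timeout "$TIMEOUT" qemu-system-x86_64 \
        -accel "$ACCEL" \
        -smp "$(nproc)" -m 2G -nographic \
        -drive file=test-image.qcow2,format=qcow2,if=virtio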