Hey!
As the follow-up discussion after the migration to AWS suggested, I moved one of our systemd CI jobs (which relies heavily on spawning QEMU VMs) from a metal machine to a T2 VM. This took quite a while and required tuning down the "depth" of some tests (which also means less coverage), but in the end it works quite reliably, and with parallelization the job takes a reasonable amount of time (~95 minutes).
However, recently (~three weeks ago) I noticed that the jobs using T2 machines (from the virt-ec2-t2-centos-8s-x86_64 pool) started having issues - the runtime shot up to over 4 hours (instead of the usual ~95 minutes) and several tests kept timing out. That time the issue fixed itself before I had a chance to debug it further, though.
Fast forward to this week, the issue appeared again on Monday, but, again, it fixed itself - this time while I was debugging it. I put some possibly helpful debug checks in place and waited for the next occurrence, which happened to be this Wednesday.
I took a couple of affected machines aside and checked various things, including all possible logs, I/O measurements, etc., and in the end I noticed that more than 50% of the vCPU time is actually being "stolen" by the hypervisor:
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-29-92.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)
...
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   11.65    0.00    3.17    0.32    1.72    0.30   52.92    0.00    0.00   29.92
Average:       0   12.99    0.00    2.68    0.37    1.69    0.32   59.64    0.00    0.00   22.31
Average:       1   11.00    0.00    3.14    0.34    1.95    0.34   49.16    0.00    0.00   34.06
Average:       2   11.31    0.00    2.72    0.52    1.84    0.41   47.02    0.00    0.00   36.18
Average:       3   11.75    0.00    2.97    0.45    1.96    0.29   52.45    0.00    0.00   30.11
Average:       4   10.88    0.00    3.06    0.19    1.99    0.21   46.12    0.00    0.00   37.55
Average:       5   13.19    0.00    3.30    0.24    1.11    0.22   64.23    0.00    0.00   17.71
Average:       6   11.92    0.00    3.71    0.08    1.60    0.24   58.52    0.00    0.00   23.92
Average:       7   10.22    0.00    3.82    0.32    1.63    0.32   46.71    0.00    0.00   36.99
After some tinkering I managed to reproduce it on a "clean" machine as well, with just `stress --cpu 8`, where the results were even worse:
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-34-82.pool.ci.centos.org)  19/10/22  _x86_64_  (8 CPU)

Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   19.08    0.00    0.00    0.00    0.27    0.00   80.64    0.00    0.00    0.00
Average:       0   12.75    0.00    0.00    0.00    0.22    0.00   87.03    0.00    0.00    0.00
Average:       1   19.08    0.00    0.00    0.00    0.15    0.00   80.77    0.00    0.00    0.00
Average:       2   22.55    0.00    0.00    0.00    0.24    0.00   77.21    0.00    0.00    0.00
Average:       3   25.94    0.00    0.00    0.00    0.38    0.00   73.68    0.00    0.00    0.00
Average:       4   19.17    0.00    0.00    0.00    0.30    0.00   80.53    0.00    0.00    0.00
Average:       5   21.31    0.00    0.00    0.00    0.39    0.00   78.30    0.00    0.00    0.00
Average:       6   17.37    0.00    0.00    0.00    0.30    0.00   82.33    0.00    0.00    0.00
Average:       7   20.48    0.00    0.00    0.00    0.27    0.00   79.25    0.00    0.00    0.00
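For completeness, the credit side of this could presumably be confirmed from the AWS account itself - something along these lines (an untested sketch on my side; the instance ID is just a placeholder):

# is the instance in "standard" or "unlimited" credit mode?
aws ec2 describe-instance-credit-specifications --instance-ids i-0123456789abcdef0

# watch the CPUCreditBalance metric heading towards zero
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUCreditBalance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Average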
This is really unfortunate. As the EC2 VMs don't support nested virtualization, we're at the mercy of TCG, which is quite CPU-intensive. When this rate limiting kicks in, the CI jobs take more than 4 hours, and that's with timeout protections in place (meaning the CI results are unusable) - without them they would take significantly longer and would most likely be killed by the watchdog, which is currently set to 6 hours (IIRC). And since the current project queue limit is 10 parallel jobs, this makes the CI part almost infeasible.
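For context, the accelerator selection in the test wrapper boils down to something like the simplified sketch below (not the actual test-suite code; the image path and sizes are illustrative), and since /dev/kvm doesn't exist on the EC2 VMs, everything ends up on TCG:

# simplified sketch, not the actual test-suite logic
if [ -c /dev/kvm ] && [ -w /dev/kvm ]; then
    accel="kvm"
else
    accel="tcg"    # what we get on the EC2 VMs (no nested virt)
fi

qemu-system-x86_64 -machine accel="$accel" -smp 8 -m 2048 -nographic \
    -drive file=image.qcow2,format=qcow2,if=virtio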
I originally reported it to the CentOS Infra tracker [0] but was advised to post it here, since this behavior is, for better or worse, expected, or at least that's how the T2 machines are advertised. Is there anything that can be done to mitigate this? The only available solution would be to move the job back to the metal nodes, but that's going against the original issue (and the metal pool is quite limited anyway).
Cheers, Frantisek
[0] https://pagure.io/centos-infra/issue/950
Hi František,
On Thu, Oct 20, 2022 at 08:30:23PM +0200, František Šumšal wrote:
I originally reported it to the CentOS Infra tracker [0] but was advised to post it here, since this behavior is, for better or worse, expected, or at least that's how the T2 machines are advertised. Is there anything that can be done to mitigate this? The only available solution would be to move the job back to the metal nodes, but that's going against the original issue (and the metal pool is quite limited anyway).
T2 instances don't have to be rate-limited if you are willing to pay for it: IIUC the T2 Unlimited mode [1] allows you to trade money for higher CPU utilization.
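If I read the docs correctly, it can even be flipped on a running instance, roughly like this (the instance ID is just a placeholder):

aws ec2 modify-instance-credit-specification \
    --instance-credit-specification 'InstanceId=i-0123456789abcdef0,CpuCredits=unlimited'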
Cheers
Michael
[1] https://aws.amazon.com/blogs/aws/new-t2-unlimited-going-beyond-the-burst-wit...
On 20/10/2022 20:30, František Šumšal wrote:
Hey!
<snip>
I originally reported it to the CentOS Infra tracker [0] but was advised to post it here, since this behavior is, for better or worse, expected, or at least that's how the T2 machines are advertised. Is there anything that can be done to mitigate this? The only available solution would be to move the job back to the metal nodes, but that's going against the original issue (and the metal pool is quite limited anyway).
Well, yes, and it was a known fact: the "Cloud [TM]" is about virtual machines and (normally) not about bare-metal options. I was just happy that we could (ab)use the fact that AWS supports "metal" options a little bit, but clearly (as you discovered) only in very limited quantity and availability.
Unfortunately that's the only thing we (or I should say "AWS", which is sponsoring that infra) can offer.
Isn't there a possibility to switch your workflow to avoid QEMU binary emulation? IIRC you wanted to have a VM yourself, to be able to troubleshoot through console access in case something didn't come back online. What about running the tests directly on the t2 instance and only triggering another job that spawns a VM if that run fails? (i.e. only reaching for that option *when* there is something to debug). I hope that the systemd code is sane and so doesn't need someone to troubleshoot issues for each commit/build/test :)
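So, very roughly (the script/job names are completely made up, just to illustrate the idea):

# run the cheap part natively on the t2 instance first
if ! ./run-tests-on-host.sh; then
    # only spin up the expensive QEMU-based job when there is actually something to debug
    ./trigger-followup-job.sh systemd-qemu-debug
fi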
On 10/22/22 09:52, Fabian Arrotin wrote:
On 20/10/2022 20:30, František Šumšal wrote:
Hey!
<snip>
I originally reported it to the CentOS Infra tracker [0] but was advised to post it here, since this behavior is, for better or worse, expected, or at least that's how the T2 machines are advertised. Is there anything that can be done to mitigate this? The only available solution would be to move the job back to the metal nodes, but that's going against the original issue (and the metal pool is quite limited anyway).
Well, yes, and it was a known fact: the "Cloud [TM]" is about virtual machines and (normally) not about bare-metal options. I was just happy that we could (ab)use the fact that AWS supports "metal" options a little bit, but clearly (as you discovered) only in very limited quantity and availability.
Unfortunately that's the only thing we (or I should say "AWS", which is sponsoring that infra) can offer.
Isn't there a possibility to switch your workflow to avoid QEMU binary emulation? IIRC you wanted to have a VM yourself, to be able to troubleshoot through console access in case something didn't come back online.
Not really; troubleshooting can eventually be done locally, that's just a "bonus". Many tests require QEMU since they use it to emulate different device topologies - the udev test suite exercises various multipath/(i)scsi/nvme/raid/etc. topologies with varying numbers of storage-related devices, then there are some tests for NUMA stuff, etc. Another reason for QEMU is that we need to test the early boot stuff, the transition/handover from the initrd to the real root, TPM2 stuff, RTC behavior, cryptsetup stuff, etc., many of which would be nigh impossible to do on the host machine.
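To give a rough idea, a single (heavily simplified, hand-written for this mail, not taken from the test suite) storage topology looks something like this:

# simplified illustration only; the image paths and serials are made up
qemu-img create -f qcow2 /tmp/nvme0.img 1G
qemu-img create -f qcow2 /tmp/nvme1.img 1G

qemu-system-x86_64 -machine accel=tcg -smp 8 -m 1024 -nographic \
    -drive file=/tmp/nvme0.img,format=qcow2,if=none,id=nvme0 \
    -device nvme,drive=nvme0,serial=nvme-0 \
    -drive file=/tmp/nvme1.img,format=qcow2,if=none,id=nvme1 \
    -device nvme,drive=nvme1,serial=nvme-1

...and the real tests attach considerably more devices than that, in several different configurations.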
So, yeah, there are multiple reasons why we can't just ditch QEMU (not to mention cases like the sanitizer runs, where even QEMU is not enough and you need KVM), but I guess that's just the nature of the component in question (systemd). I suspect this won't affect many other users, and I have no idea if there's a clear solution. Michael mentioned an "unlimited" mode for the T2 machines in the other thread, but that is, of course, an additional cost on top of one we already drive up quite "a bit" by using the metal machines (and I'm not even sure whether it's feasible, not being familiar with the AWS tiers).
But I guess this needs feedback from other projects - I don't want to raise RFEs just to make one of the many CentOS CI projects happy.
Cheers, Frantisek