[Ci-users] Performance issues on the EC2 T2 nodes

Sun Oct 23 17:52:01 UTC 2022
František Šumšal <frantisek at sumsal.cz>

On 10/22/22 09:52, Fabian Arrotin wrote:
> On 20/10/2022 20:30, František Šumšal wrote:
>> Hey!
> <snip>
>>
>> I originally reported it to the CentOS Infra tracker [0] but was advised to post it here, since this behavior is, for better or worse, expected, or at least that's how the T2 machines are advertised. Is there anything that can be done to mitigate this? The only available solution would be to move the job back to the metal nodes, but that's going against the original issue (and the metal pool is quite limited anyway).
>>
> 
> Well, yes, and it was a known fact: the "Cloud [TM]" is about virtual machines and (normally) not about bare metal options. I was even just happy that we can (ab)use a little bit the fact that AWS supports "metal" options, but clearly (as you discovered) in very limited quantity and availability.
> 
> Unfortunately that's the only thing we (or I should say "AWS", which is sponsoring that infra) can offer.
> 
> Isn't there a possibility to switch your workflow to avoid trying QEMU binary emulation ?
> IIRC you wanted to have VM yourself, to be able to troubleshoot through console access, in case something wouldn't come back online.

Not really, troubleshooting can eventually be done locally, that's just a "bonus". Many tests require QEMU since they use it to emulate different device topologies: the udev test suite tests various multipath/(i)SCSI/NVMe/RAID/etc. topologies with varying numbers of storage-related devices, then there are some tests for NUMA stuff, etc. Another reason for QEMU is that we need to test the early boot stuff, transitions/handover from the initrd to the real root, TPM2 stuff, RTC behavior, cryptsetup stuff, etc., many of which would be nigh impossible to do on the host machine.

So, yeah, there are multiple reasons why we can't just ditch QEMU (not to mention cases like sanitizers, where even QEMU is not enough and you need KVM), but I guess that's just the nature of the component in question (systemd). I suspect this won't affect many other users, and I have no idea if there's a clear solution. Michael in the other thread mentioned an "unlimited" mode for the T2 machines, but that, of course, comes at an additional cost, which we have already hiked up quite "a bit" by using the metal machines (and I'm not even sure if that's feasible, not being familiar with AWS tiers).

But I guess this needs feedback from other projects; I don't want to raise RFEs just to make one of the many CentOS CI projects happy.

Cheers,
Frantisek

-- 
PGP Key ID: 0xFB738CE27B634E4B