[Ci-users] Changes to CentOS CI: reminder of Phase 1 and 2

Fri Aug 19 13:31:09 UTC 2022
František Šumšal <frantisek at sumsal.cz>

Hey,

On 8/19/22 14:23, Camila Granella wrote:
> Hello!
> 
>     I understand that the metal machines are expensive, and I'm not sure how many other projects are eventually going to migrate over to them, but I guess in the future some balance will need to be found between the cost and the available metal nodes. Is this even up for discussion, or is the size of the metal pools fixed and can't/won't be adjusted?
> 
> 
> We're looking to optimize resource usage with the recent changes to CentOS CI. From our side, the goal is to find a balance between adjusting to tenants' needs (there are adaptations we could make to have more nodes available, at the cost of increased resource consumption) and adjusting projects' workflows to use EC2.
> 
> I'd appreciate your suggestions on how to make workflows more adaptable to EC2.

The main blocker for many projects is that EC2 VMs don't support nested virtualization, which is really unfortunate, since using the EC2 metal machines is indeed a "bit" overkill in many scenarios (ours included). I spent a week playing with various approaches to work around this requirement, but failed - in our case the workaround would be running the test VMs with TCG instead of KVM, but that makes the tests flaky/unreliable in many cases, and some of them run for several hours with that change.
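To illustrate the kind of workaround I experimented with (a simplified Python sketch, not our actual CI code - the image path is made up), the idea was basically to fall back from KVM to TCG whenever /dev/kvm isn't usable:

import os
import shutil
import subprocess

def qemu_accel_args():
    # Use KVM when /dev/kvm is usable, otherwise fall back to software
    # emulation (TCG). TCG keeps the job runnable on EC2 VMs, but the
    # tests get flaky and some of them take hours.
    if os.access("/dev/kvm", os.R_OK | os.W_OK):
        return ["-enable-kvm", "-cpu", "host"]
    return ["-accel", "tcg,thread=multi", "-cpu", "max"]

def run_test_vm(image):
    # Boot a throwaway test VM; the image path is just an example.
    qemu = shutil.which("qemu-system-x86_64") or "qemu-system-x86_64"
    cmd = [qemu, "-m", "2048", "-nographic", "-snapshot",
           "-drive", f"file={image},format=qcow2"] + qemu_accel_args()
    subprocess.run(cmd, check=True)

Even with multi-threaded TCG the slowdown is what makes the longer suites impractical.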

Going through many online resources just confirms this - EC2 VMs don't support nested virt[0], which is sad, since Microsoft's Azure apparently supports it[1][2] (and, from a quick lookup, Google's Compute Engine apparently supports it as well).

I'm not really sure there's an easy solution for this (if there's one at all). I'm at least trying to spread the workload on the machine "to the limits" to utilize as much of the metal node's resources as possible, which shortens the runtime of each job quite considerably, but even that's not ideal resource-wise.
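By "spreading the workload" I mean something along these lines (again a simplified Python sketch; the suite names and the runner script are made up) - packing as many test VMs onto the metal host as its CPUs allow:

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Made-up list of test suites; each one boots its own VM.
SUITES = ["TEST-01-BASIC", "TEST-02-UNITTESTS", "TEST-24-CRYPTSETUP"]

# Give each VM a fixed number of vCPUs and size the parallelism accordingly.
VCPUS_PER_VM = 4
PARALLEL = max(1, (os.cpu_count() or 1) // VCPUS_PER_VM)

def run_suite(suite):
    # Placeholder for the real invocation of a single suite.
    return subprocess.run(["./run-one-suite.sh", suite]).returncode

with ThreadPoolExecutor(max_workers=PARALLEL) as pool:
    results = dict(zip(SUITES, pool.map(run_suite, SUITES)))

failed = [name for name, code in results.items() if code != 0]
print("failed suites:", ", ".join(failed) or "none")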

As I mentioned on IRC, maybe having Duffy change the pool size dynamically based on the demand over the past hour or so would help with the overall balance (to avoid wasting resources during "quiet periods"), but that's just an idea off the top of my head - I'm not sure how feasible it is or whether it even makes sense. A rough sketch of what I have in mind is below the links.

[0] https://repost.aws/questions/QUZppFxHs6Q3uJFCYeliAxxg/nested-virtualization-with-hyper-v-on-ec-2-instance
[1] https://ignite.readthedocs.io/en/stable/cloudprovider/#microsoft-azure
[2] https://azure.microsoft.com/cs-cz/blog/nested-virtualization-in-azure/
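
Something like this, purely hypothetical (I don't know Duffy's internals, and all the names and numbers here are invented) - recompute the target pool size from the number of requests seen in the last hour, clamped between a minimum and whatever the budget allows:

def target_pool_size(requests_last_hour, min_size=1, max_size=10,
                     avg_job_minutes=45):
    # Invented heuristic: assume each metal node serves roughly
    # 60 / avg_job_minutes jobs per hour, and shrink back towards
    # min_size outside of the busy periods.
    jobs_per_node_per_hour = max(1, 60 // avg_job_minutes)
    wanted = -(-requests_last_hour // jobs_per_node_per_hour)  # ceiling division
    return max(min_size, min(max_size, wanted))

# e.g. 10 requests over the last hour with ~45-minute jobs -> 10 nodes (the cap)
print(target_pool_size(10))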

> Also, how much is this impacting critical deliveries on your side at the moment? My goal here is to understand whether we need a more urgent solution for you before going into deeper discussions. As I understand it, we still have some bandwidth to find the best solution we can, as it could become more critical in the future. Is that assumption correct?

Indeed, I do think we still have time to discuss possible solutions.

Thank you,
Frantisek

> Thank you for reaching out about this,
> 
> 
> On Fri, Aug 19, 2022 at 8:35 AM Evgeni Golov <evgeni at redhat.com <mailto:evgeni at redhat.com>> wrote:
> 
>     Hi,
> 
>     On Fri, Aug 19, 2022 at 1:21 PM František Šumšal <frantisek at sumsal.cz <mailto:frantisek at sumsal.cz>> wrote:
> 
>      > After a couple of weeks of back and forth with the always helpful infra team I was able to migrate most of our (systemd) jobs over to the EC2 machines. As we require at least access to KVM (and the EC2 VMs, unfortunately, don't support nested virt), I had to resort to the metal machines instead of the "plain" VMs.
> 
>     We (foreman) are in the same boat: our tests spawn multiple VMs, so we
>     require KVM access (metal or nested, with the latter sadly not
>     supported by EC2).
> 
>      > After monitoring the situation for a couple of days I noticed an issue[0] which might bite us in the future if/when other projects migrate over to the metal machines as well (since several of them require at least KVM too) - Duffy currently provisions only one metal machine at a time, and returns an API error for all other API requests for the same pool in the meantime:
>      >
>      > can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'
>      >
>      > As the provisioning takes a bit, this delay might stack up quite noticeably. For example, after firing up 10 jobs (current project quota) at once, all for the metal pool, the last one got the machine after ~30 minutes - and that's only one project. If/when other projects migrate over to the metal machines as well, this might get quickly out of hand.
> 
>     Our tests run up to 8 parallel jobs, so yeah, I can totally see this
>     being a problem in the longer term for everybody.
> 
>     We're currently investigating whether we can change our scheduling and
>     run multiple jobs on one metal host (it's big enough to host more than
>     the 3 VMs one job needs), but it doesn't seem too trivial right now.
> 
>     Evgeni
> 
>     -- 
>     Beste Grüße/Kind regards,
> 
>     Evgeni Golov
>     Senior Software Engineer
>     ________________________________________________________________________
>     Red Hat GmbH, https://de.redhat.com/, Registered seat: Werner von
>     Siemens Ring 14, D-85630 Grasbrunn, Germany
>     Commercial register: Amtsgericht Muenchen/Munich, HRB 153243,
>     Managing Directors: Ryan Barnhart, Charles Cachera, Michael O'Neill, Amy Ross
> 
> 
> 
> 
> -- 
> 
> Camila Granella
> 
> Associate Manager, Software Engineering
> 
> Red Hat <https://www.redhat.com/>
> 
> 
> 

-- 
PGP Key ID: 0xFB738CE27B634E4B