Moin,
On Fri, Aug 19, 2022 at 1:21 PM František Šumšal <frantisek@sumsal.cz> wrote:
> After a couple of weeks of back and forth with the always helpful infra team, I was able to migrate most of our (systemd) jobs over to the EC2 machines. As we require at least access to KVM (and the EC2 VMs, unfortunately, don't support nested virt), I had to resort to the metal machines instead of the "plain" VMs.
We (foreman) are in the same boat: our tests spawn multiple VMs, so we require KVM access (either metal or nested, with the latter sadly not supported by EC2).
> After monitoring the situation for a couple of days, I noticed an issue[0] which might bite us in the future if/when other projects migrate over to the metal machines as well (since several of them require at least KVM too): Duffy currently provisions only one metal machine at a time, and returns an API error for all other API requests for the same pool in the meantime:
>   can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'
> As the provisioning takes a bit, this delay might stack up quite noticeably. For example, after firing up 10 jobs (the current project quota) at once, all for the metal pool, the last one got its machine after ~30 minutes - and that's only one project. If/when other projects migrate over to the metal machines as well, this might quickly get out of hand.
Our tests run up to 8 parallel jobs, so yeah, I can totally see this becoming a problem for everybody in the longer term.
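Until that's sorted out on the Duffy side, we'll probably just retry on that API error. A minimal sketch of what I have in mind, in Python; note that the base URL, the /api/v1/sessions endpoint, the payload layout, the DUFFY_URL/DUFFY_API_KEY variables and the exact error wording are my assumptions for illustration, not a verified description of the Duffy API:

    import os
    import time
    import requests

    # Assumptions (not verified against the real Duffy deployment): endpoint,
    # payload shape and error string are placeholders for illustration only.
    DUFFY_URL = os.environ.get("DUFFY_URL", "https://duffy.example.org")
    DUFFY_API_KEY = os.environ["DUFFY_API_KEY"]

    def request_metal_node(pool, quantity=1, max_wait=45 * 60, delay=60):
        """Keep asking Duffy for a session until the pool has a free node."""
        payload = {"nodes_specs": [{"pool": pool, "quantity": quantity}]}
        deadline = time.monotonic() + max_wait
        while time.monotonic() < deadline:
            resp = requests.post(
                f"{DUFFY_URL}/api/v1/sessions",
                json=payload,
                headers={"Authorization": f"Bearer {DUFFY_API_KEY}"},
                timeout=30,
            )
            if resp.ok:
                return resp.json()  # session data, including the reserved node(s)
            # "can't reserve nodes" is the error we see while another metal
            # machine is still being provisioned -- back off and try again.
            if "can't reserve nodes" in resp.text:
                time.sleep(delay)
                continue
            resp.raise_for_status()
        raise TimeoutError(f"no {pool} node became available within {max_wait}s")

That obviously only hides the serialization; the ~3 minutes per machine still add up.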
We're currently investigating whether we can change our scheduling and run multiple jobs on one metal host (it's big enough to host more than the three VMs a single job needs), but that doesn't seem trivial right now.
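One idea we're toying with is a simple slot allocator on the metal host itself: each job grabs one of N lock files before spawning its VMs and releases it when done. A minimal sketch, assuming the host can comfortably fit 4 concurrent jobs of 3 VMs each; the slot count, lock directory and the run_test_vms() helper are made-up values:

    import fcntl
    import os
    import time
    from contextlib import contextmanager

    # Assumption: the metal instance can sustain 4 concurrent jobs; adjust SLOTS
    # (and the hypothetical LOCK_DIR path) to whatever it can actually handle.
    SLOTS = 4
    LOCK_DIR = "/var/lock/ci-slots"

    @contextmanager
    def acquire_slot():
        """Block until one of the per-host slots is free, then hold it."""
        os.makedirs(LOCK_DIR, exist_ok=True)
        fds = [os.open(os.path.join(LOCK_DIR, f"slot-{i}"), os.O_CREAT | os.O_RDWR)
               for i in range(SLOTS)]
        held = None
        try:
            while held is None:
                for fd in fds:
                    try:
                        # Non-blocking exclusive lock; the first free slot wins.
                        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                        held = fd
                        break
                    except BlockingIOError:
                        continue
                else:
                    time.sleep(10)  # all slots busy, wait and try again
            yield
        finally:
            if held is not None:
                fcntl.flock(held, fcntl.LOCK_UN)
            for fd in fds:
                os.close(fd)

    # Usage: wrap the part of the job that boots the test VMs.
    # with acquire_slot():
    #     run_test_vms()  # hypothetical helper that spawns the 3 VMs and runs the tests

The harder part is teaching our CI to route several jobs to the same reserved Duffy node in the first place, which is where it stops being trivial.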
Evgeni