Hello everyone,
This is a friendly reminder of the current and upcoming status of CentOS CI changes (check [1]).
Projects that opted in to continue on CentOS CI have been migrated, and the new Duffy API is available. With that, *phase 0 has been completed*. Regarding *phase 1*, we are still working on a permanent fix for the DB Concurrency issues [2]. Also, as for our new OpenShift deployment, we have a staging environment up and running, and it should be available at the beginning of September 2022.
In October 2022 we begin *phase 2*, when we will work through the following items (these were also previously communicated in [1]):
- the legacy/compatibility API endpoint will hand over EC2 instances instead of local seamicro nodes (VMs vs bare metal)
- bare-metal options will be available through the new API only
- legacy seamicro and aarch64/ThunderX hardware will be decommissioned
- the only remaining "on-premises" option is ppc64le (local cloud)
Feel free to reach out if you have any questions or concerns.
The final deadline for decommissioning the old infrastructure (*phase 3*) is *December 2022*. We will be communicating further until then, and meanwhile, reach out to any of us in case you have any questions.
Regards,
[1] [ci-users] Changes on CentOS CI and next steps: https://lists.centos.org/pipermail/ci-users/2022-June/004547.html
[2] DB Concurrency issues: https://github.com/CentOS/duffy/issues/523
On 16/08/2022 15:58, Camila Granella wrote:
Just to add that the storage box known as https://artifacts.ci.centos.org will also move to AWS. The infra is ready, but we wanted to do the service migration sometime in September. The reason is that we have some people on PTO and switching it means some (small) code changes: the old internal box (private vlan) was accepting plain rsync with an rsync password as authentication. The replacement box will be (re)using ssh PKI, as you already have a dedicated ssh keypair, so push/pull/rsync will happen over ssh (to encrypt traffic to/from Duffy nodes to that box).
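To give a rough idea of what the change means on a Duffy node (illustrative only; the module/user names, key path and target directory below are made up and the real ones will be documented in the guide):

    # old setup (rsync daemon on the private vlan, rsync password auth):
    RSYNC_PASSWORD="$PROJECT_RSYNC_PASSWORD" rsync -av ./results/ myproject@artifacts.ci.centos.org::myproject/results/

    # new setup (rsync over ssh, reusing the project's existing Duffy ssh keypair):
    rsync -av -e "ssh -i /path/to/duffy-project-key" ./results/ myproject@artifacts.ci.centos.org:results/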
Stay tuned for when we announce the artifacts service migration; we'll reflect the new setup in the dedicated guide (https://sigs.centos.org/guide/ci/).
Hello!
After a couple of weeks of back and forth with the always helpful infra team I was able to migrate most of our (systemd) jobs over to the EC2 machines. As we require at least access to KVM (and the EC2 VMs, unfortunately, don't support nested virt), I had to resort to metal machines instead of the "plain" VMs.
After monitoring the situation for a couple of days I noticed an issue[0] which might bite us in the future if/when other projects migrate over to the metal machines as well (since several of them require at least KVM too) - Duffy currently provisions only one metal machine at a time, and returns an API error for all other API requests for the same pool in the meantime:
can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'
As the provisioning takes a bit, this delay might stack up quite noticeably. For example, after firing up 10 jobs (current project quota) at once, all for the metal pool, the last one got the machine after ~30 minutes - and that's only one project. If/when other projects migrate over to the metal machines as well, this might get quickly out of hand.
I understand that the metal machines are expensive, and I'm not sure how many other projects are eventually going to migrate over to them, but I guess in the future some balance will need to be found between the cost and the number of available metal nodes. Is this even up for discussion, or is the size of the metal pools fixed and can't/won't be adjusted?
Thank you.
Cheers, Frantisek
[0] https://pagure.io/centos-infra/issue/865#comment-811365
Moin,
On Fri, Aug 19, 2022 at 1:21 PM František Šumšal frantisek@sumsal.cz wrote:
After a couple of weeks of back and forth with the always helpful infra team I was able to migrate most of our (systemd) jobs over to the EC2 machines. As we require at least an access to KVM (and the EC2 VMs, unfortunately, don't support nested virt), I had to resort to metal machines over the "plain" VMs.
We (foreman) are in the same boat: our tests spawn multiple VMs, so we require KVM access (metal or nested, with the latter sadly not supported by EC2).
After monitoring the situation for a couple of days I noticed an issue[0] which might bite us in the future if/when other projects migrate over to the metal machines as well (since several of them require at least KVM too) - Duffy currently provisions only one metal machine at a time, and returns an API error for all other API requests for the same pool in the meantime:
can't reserve nodes: quantity=1 pool='metal-ec2-c5n-centos-8s-x86_64'
As the provisioning takes a bit, this delay might stack up quite noticeably. For example, after firing up 10 jobs (current project quota) at once, all for the metal pool, the last one got the machine after ~30 minutes - and that's only one project. If/when other projects migrate over to the metal machines as well, this might get quickly out of hand.
Our tests run up to 8 parallel jobs, so yeah, I can totally see this being a problem in the longer term for everybody.
We're currently investigating whether we can change our scheduling and run multiple jobs on one metal host (it's big enough to host more than the 3 VMs one job needs), but it doesn't seem too trivial right now.
Evgeni
Hello!
I understand that the metal machines are expensive, and I'm not sure how many other projects are eventually going to migrate over to them, but I guess in the future some balance will need to be found out between the cost and available metal nodes. Is this even up to a discussion, or the size of the metal pools is given and can't/won't be adjusted?
We're looking to optimize resource usage with the recent changes to CentOS CI. From our side, the goal is to find a balance between adjusting to tenants' needs (there are adaptations we could do to have more nodes available, with an increase in resource consumption) and adjusting projects' workflows to use EC2.
I'd appreciate your suggestions on how to make workflows more adaptable to EC2.
Also, how much is this impacting critical deliveries on your side at the moment? My goal here is to understand whether we need a more urgent solution for you before going for deeper discussions. As I understand, we still have some bandwidth to find the best solution we can, as it could become more critical in the future. Is that assumption correct?
Thank you for reaching out about this,
Hey,
On 8/19/22 14:23, Camila Granella wrote:
Hello!
We're looking to optimize resource usage with the recent changes to CentOS CI. From our side, the goal is to find a balance between adjusting to tenants' needs (there are adaptations we could do to have more nodes available with an increase in resource consumption) and adjusting projects workflows to use EC2.
I'd appreciate your suggestions on mitigating how to make workflows more adaptable to EC2.
The main blocker for many projects is that EC2 VMs don't support nested virtualization, which is really unfortunate, since using the EC2 metal machines is indeed a "bit" overkill in many scenarios (ours included). I spent a week playing with various approaches to avoid this requirement, but failed (in our case it would be running the VMs with TCG instead of KVM, but that makes the tests flaky/unreliable in many cases, and some of them run for several hours with this change).
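To illustrate what the fallback looks like in practice (a rough sketch, not our actual test runner; the image name is made up):

    # use KVM when /dev/kvm is available (metal or nested virt), otherwise fall back to TCG emulation
    if [ -c /dev/kvm ]; then
        ACCEL="-enable-kvm -cpu host"
    else
        ACCEL="-accel tcg -cpu max"   # runs everywhere, but much slower and makes timing-sensitive tests flaky
    fi
    qemu-system-x86_64 $ACCEL -m 2G -smp 2 -nographic -snapshot -drive file=test-image.qcow2,format=qcow2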
Going through many online resources just confirms this - EC2 VMs don't support nested virt[0], which is sad, since, for example, Microsoft's Azure apparently supports it[1][2] (and Google's Compute Engine apparently supports it as well from a quick lookup).
I'm not really sure if there's an easy solution for this (if any). I'm at least trying to spread the workload on the machine "to the limits" to utilize as much of the metal resources as possible, which shortens the runtime of each job quite considerably, but even that's not ideal (resource-wise).
As I mentioned on IRC, maybe having Duffy change the pool size dynamically based on the demand over the past hour or so would help with the overall balance (to avoid wasting resources in "quiet periods"), but that's just an idea off the top of my head; I'm not sure how feasible it is or if it even makes sense.
[0] https://repost.aws/questions/QUZppFxHs6Q3uJFCYeliAxxg/nested-virtualization-... [1] https://ignite.readthedocs.io/en/stable/cloudprovider/#microsoft-azure [2] https://azure.microsoft.com/cs-cz/blog/nested-virtualization-in-azure/
Also, how much is this impacting critical deliveries on your side at the moment? My goal here is to understand whether we need a more urgent solution for you before going for deeper discussions. As I understand, we still have some bandwidth to find the best solution we can, as it could become more critical in the future. Is that assumption correct?
Indeed, I do think we still have time to discuss possible solutions.
Thank you, Frantisek
On 19/08/2022 15:31, František Šumšal wrote:
Yes, it was always communicated that default EC2 instances don't support nested virt, as one requests a cloud VM, not a hypervisor :) It's just that before migrating to EC2 we saw it was possible to deploy bare-metal options on the AWS side, but at a higher cost (obviously) than traditional EC2 instances (VMs).
Can you explain why you'd need to have a hypervisor instead of VMs? I guess troubleshooting comes to mind (`virsh console` to the rescue, which isn't even possible with an EC2 instance as a VM)?
On 8/22/22 13:28, Fabian Arrotin wrote:
The systemd integration test suite builds an image for each test and then runs it with both systemd-nspawn and directly with qemu/qemu-kvm, since running systemd tests straight on the host is in many cases dangerous (and in some cases it wouldn't be feasible at all, since we need to test stuff that happens during (early) boot). Running only the systemd-nspawn part would be an option, but this way we'd lose a significant part of coverage (as with nspawn you can't test the full boot process, and some tests don't run in nspawn at all, like the systemd-udevd tests and other storage-related stuff).
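Very roughly, each test image is exercised along both paths, something like this (a simplification, not the actual harness invocation; the image name is made up):

    # container path - boots the image under systemd-nspawn
    systemd-nspawn --image=TEST-01-BASIC.img --boot
    # full-boot path - boots the same image under qemu, which really wants /dev/kvm to be usable
    qemu-system-x86_64 -enable-kvm -m 2G -smp 2 -nographic -drive file=TEST-01-BASIC.img,format=raw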
Hello,
Thank you for clarifying. Soon we will bump the number of metal machines available for provisioning on Duffy; a kind reminder to use them wisely (as you all already do), as they represent a significant cost increase. Also, please default to the use of VMs whenever possible.
Have a nice week,
On Mon, 2022-08-22 at 13:59 +0200, František Šumšal wrote:
NetworkManager needs some more power to start qemu machines, as we have tests trying all possible remote root mounts via nfs/iscsi (over bond, bridge, vlans, etc.), so we have similar requirements to dracut/systemd for at least a part of our tests. We don't need anything fancy, but we at least need to be able to execute a VM inside the testing machine to simulate the early boot (the remote filesystems are hosted directly from the machine we run the tests on). Maybe we can live with paravirt; we have to experiment a bit.
Thank you, Vladimir
Hi all,
Earlier today the infra team attempted to bump the number of metal machines available for provisioning on Duffy. However, the AWS API returned that there is currently no capacity to provision metal machines in the Availability Zone we are in (us-east-1a). For this reason, we will need to default to the use of EC2.
Let us know if you need anything from our end to support you adapting your workflows to it.
Regards,
Camila,
Do you know where (roughly) the limit for metal is? Right now (in this thread), I see three projects wanting/needing virtualization capabilities and thus metal (systemd, networkmanager, foreman). That means that, even if projects adjust their tests to run multiple things on one metal host, there might still be at least three parallel machines needed (or the projects will block each other, which would be very unfortunate).
Evgeni
On 23/08/2022 16:52, Evgeni Golov wrote:
If you look at the actual status, we already have 10 metal nodes in parallel (and we can't provision more due to AWS's lack of nodes in these zones). So yes, if each tenant can just ask for *one* metal node and adapt their workflow so that they can run their CI jobs on top of that single metal node, that would (in theory) work ... It just needs some orchestration on the tenant jobs' side, as sketched below.
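Just to sketch one (purely illustrative) way to do that orchestration on the tenant side, once a single metal node is reserved (the hostname variable, lock file and script name are made up):

    # each CI job waits (here up to 2h) for a shared lock before running its VM-based tests on the one metal node
    ssh root@"$METAL_NODE" "flock -w 7200 /var/lock/ci-metal.lock /opt/ci/run-vm-tests.sh '$JOB_ID'"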
On 23/08/2022 16:41, Camila Granella wrote:
I had a look at the number of deployed c5n.metal instances for c8s and it reached 11 nodes ... so that also means that now Duffy is trying to have 5 nodes in Ready state (it was bumped from 1 to 5 through git commit/push earlier today)
It seems we're reaching the limit of available c5n.metal physical machines in us-east-1 (we use three availability zones there, through three subnets in a dedicated Duffy VPC).
Worth knowing that Duffy catches the Ansible error and so knows that provisioning failed, and just retries every 60 seconds to provision such instance types; but looking at the logs, we clearly ask for much more than what AWS can offer. And that's also normal: AWS is about EC2 virtual machines, not (costly) bare-metal options. Also worth knowing that we added that option to let people transition their workflows, but clearly the metal option will be limited (by AWS availability, not even by us in this case) ...
For the time being, you can just put all your jobs in a queue and retry to get a node through the Duffy API, if Duffy itself was able to get some into ready state. At any point, one can see the pool status:
duffy client show-pool metal-ec2-c5n-centos-8s-x86_64
{
    "action": "get",
    "pool": {
        "name": "metal-ec2-c5n-centos-8s-x86_64",
        "fill_level": 5,
        "levels": {
            "provisioning": 0,
            "ready": 0,
            "contextualizing": 0,
            "deployed": 10,
            "deprovisioning": 0
        }
    }
}
In this case, it's showing 10 metal nodes deployed to tenants, and Duffy not able to provision more (provisioning will show a number and drop back to zero if it fails).
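As an illustration of that queue-and-retry approach (a sketch only; check the Duffy client docs for the exact request syntax, and note the client may report a failure in its JSON output rather than via the exit code):

    # keep retrying until the metal pool can hand out a node, then keep the session details
    until duffy client request-session pool=metal-ec2-c5n-centos-8s-x86_64,quantity=1 > session.json 2>/dev/null; do
        echo "metal pool exhausted, retrying in 60s"
        sleep 60
    done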
Hey!
On 8/23/22 16:41, Camila Granella wrote:
After thinking about this a bit more, I have one (quite naïve) idea - would it be possible to get in touch with the "other side" (i.e. people responsible for AWS) and ask them about the possibility of enabling nested virt for the CentOS CI pool? I have no idea what the reason behind the lack of nested virt is (I suspect it's security-related) or whether it's even possible to enable in the current EC2 infra, but having it enabled would, in the end, benefit all involved parties (especially given the infra is apparently sponsored by AWS/Amazon). As voiced by me and several other projects currently utilizing CentOS CI, there are certain workflows which can't be run on the current EC2 machines, and as much as I'd like to use them (to avoid wasting resources unnecessarily), I simply can't.
Again, this is just my late-night spitballing in hopes of finding some suitable middle ground, so if it doesn't make sense, please let me know.
Cheers, Frantisek
On 24/08/2022 00:48, František Šumšal wrote:
Well, if AWS never enabled it world-wide for all their paying customers, I doubt they'd do it for a *whole* region just for a sponsored account .. ;-) So basically I guess it's safe to answer that it's not possible (I guess that's also the explanation for why they still let you request a metal node, but in limited quantity vs the "classic" and default EC2 instances).