Performance issues on the EC2 T2 nodes - CI-users

20 Oct 2022


      Hey!
As the follow-up discussion after the migration to AWS suggested, I took one of our systemd CI jobs (which relies heavily on spawning QEMU VMs) from a metal machine to a T2 VM. This took quite a while and required tuning down the "depth" of some tests (which also means less coverage), but in the end it works quite reliably, and with parallelization the job takes a reasonable amount of time (~95 minutes).
However, recently (~three weeks ago) I noticed that the jobs using T2 machines (from the virt-ec2-t2-centos-8s-x86_64 pool) started having issues - the runtime was over 4 hours (instead of the usual ~95 minutes) and several tests kept timing out. In this case the issue fixed itself before I had a chance to debug it further, though.
Fast forward to this week, the issue appeared again on Monday, but, again, it fixed itself - this time while I was debugging it. I put some possibly helpful debug checks in place and waited for the next occurrence, which happened to be this Wednesday.
Taking a couple of affected machines aside, I checked various things, including all possible logs, I/O measurements, etc. and in the end I noticed that more than 50% of the vCPU time is actually being "stolen" by the hypervisor:
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-29-92.pool.ci.centos.org)  19/10/22    _x86_64_    (8 CPU)
...
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   11.65    0.00    3.17    0.32    1.72    0.30   52.92    0.00    0.00   29.92
Average:       0   12.99    0.00    2.68    0.37    1.69    0.32   59.64    0.00    0.00   22.31
Average:       1   11.00    0.00    3.14    0.34    1.95    0.34   49.16    0.00    0.00   34.06
Average:       2   11.31    0.00    2.72    0.52    1.84    0.41   47.02    0.00    0.00   36.18
Average:       3   11.75    0.00    2.97    0.45    1.96    0.29   52.45    0.00    0.00   30.11
Average:       4   10.88    0.00    3.06    0.19    1.99    0.21   46.12    0.00    0.00   37.55
Average:       5   13.19    0.00    3.30    0.24    1.11    0.22   64.23    0.00    0.00   17.71
Average:       6   11.92    0.00    3.71    0.08    1.60    0.24   58.52    0.00    0.00   23.92
Average:       7   10.22    0.00    3.82    0.32    1.63    0.32   46.71    0.00    0.00   36.99
After some tinkering I managed to reproduce it on "clean" machine as well, with just `stress --cpu 8`, where the results were even worse:
# mpstat -P ALL 5 7
Linux 4.18.0-408.el8.x86_64 (n27-34-82.pool.ci.centos.org)  19/10/22    _x86_64_    (8 CPU)
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all   19.08    0.00    0.00    0.00    0.27    0.00   80.64    0.00    0.00    0.00
Average:       0   12.75    0.00    0.00    0.00    0.22    0.00   87.03    0.00    0.00    0.00
Average:       1   19.08    0.00    0.00    0.00    0.15    0.00   80.77    0.00    0.00    0.00
Average:       2   22.55    0.00    0.00    0.00    0.24    0.00   77.21    0.00    0.00    0.00
Average:       3   25.94    0.00    0.00    0.00    0.38    0.00   73.68    0.00    0.00    0.00
Average:       4   19.17    0.00    0.00    0.00    0.30    0.00   80.53    0.00    0.00    0.00
Average:       5   21.31    0.00    0.00    0.00    0.39    0.00   78.30    0.00    0.00    0.00
Average:       6   17.37    0.00    0.00    0.00    0.30    0.00   82.33    0.00    0.00    0.00
Average:       7   20.48    0.00    0.00    0.00    0.27    0.00   79.25    0.00    0.00    0.00
This is really unfortunate. As the EC2 VMs don't support nested virtualization, we're "at mercy" of TCG, which is quite CPU intensive. When this rate-limiting kicks in, the CI jobs take more than 4 hours, and that's with timeout protections in place (meaning the CI results are unusable) - without them they would take significantly more and would be most likely killed by the watchdog, which is currently 6 hours (iirc). And since the current project queue limit is 10 parallel jobs, this make the CI part almost infeasible.
I originally reported it to the CentOS Infra tracker [0] but was advised to post it here, since this behavior is, for better or worse, expected, or at least that's how the T2 machines are advertised. Is there anything that can be done to mitigate this? The only available solution would be to move the job back to the metal nodes, but that's going against the original issue (and the metal pool is quite limited anyway).
Cheers,
Frantisek
[0] https://pagure.io/centos-infra/issue/950
-- 
PGP Key ID: 0xFB738CE27B634E4B