[CentOS-devel] rseq in glibc coming to CentOS Stream 9

The only way to implement sched_getcpu efficiently on AArch64 is to use
rseq and the per-thread CPU field in the rseq area.  Unlike other
architectures, there is no vDSO acceleration for sched_getcpu or getcpu
on AArch64, and there does not seem to be a special register available
that could be used for this.  (E.g., on x86-64, the kernel uses
otherwise-unused segment limits; access is not particularly fast, but at
least there is no system call involved.)  No application changes are
needed to benefit from this optimization, and even architectures with
existing vDSO acceleration benefit (slightly).
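
Since the benefit comes without application changes, the simplest way to
see it is to time plain sched_getcpu calls before and after enabling
rseq.  A minimal sketch, assuming a Linux system with glibc; the helper
name getcpu_ns_per_call is made up for illustration, and the numbers it
reports depend on the machine and kernel:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <time.h>

/* Hypothetical helper (not part of glibc): average wall-clock cost of
   one sched_getcpu call, in nanoseconds, over the given number of
   iterations.  The application code is the same whether glibc uses
   rseq, the vDSO, or a real system call underneath.  */
double getcpu_ns_per_call(long iterations)
{
    assert(iterations > 0);

    struct timespec start, end;
    volatile int sink = 0;  /* keeps the loop from being optimized away */

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        sink += sched_getcpu();
    clock_gettime(CLOCK_MONOTONIC, &end);
    (void) sink;

    double elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                        + (end.tv_nsec - start.tv_nsec);
    return elapsed_ns / iterations;
}
```

Calling getcpu_ns_per_call(1000000) from main and printing the result
gives a rough per-call figure for comparing two runs of the same binary.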

Upstream glibc will enable rseq in the upcoming 2.35 release, and while
we are not putting that glibc version into CentOS anytime soon, we have
backported the rseq acceleration for sched_getcpu to CentOS Stream 9,
glibc-2.34-19.el9 to be precise:

  glibc: Optional sched_getcpu acceleration using rseq
  <https://bugzilla.redhat.com/show_bug.cgi?id=2024347>

However, Fedora rawhide testing revealed an integration issue with CRIU:

  Implement rseq support, as required by glibc 2.35
  <https://github.com/checkpoint-restore/criu/issues/1696>

  criu: Implement rseq support
  <https://bugzilla.redhat.com/show_bug.cgi?id=2033397>

Therefore, we currently do not enable rseq acceleration by default, but
we plan to do so once CRIU fixes are available in relevant places.  For
the time being, applications need to be launched with the
GLIBC_TUNABLES=glibc.pthread.rseq=1 environment variable to benefit from
rseq-based sched_getcpu acceleration.
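
Tunables are read at process startup, so the variable has to be in the
environment before the program is executed.  For cases where a wrapper
binary is more convenient than a shell, here is a rough sketch of a
launcher.  The function names build_tunables and exec_with_rseq are
hypothetical, and the sketch assumes that appending to an existing
colon-separated GLIBC_TUNABLES value is the desired behavior:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define RSEQ_TUNABLE "glibc.pthread.rseq=1"

/* Hypothetical helper: write the GLIBC_TUNABLES value to use into buf,
   keeping any tunables that are already set (entries are separated by
   colons).  Returns 0 on success, -1 if buf is too small.  */
int build_tunables(const char *existing, char *buf, size_t size)
{
    assert(buf != NULL && size > 0);

    int n;
    if (existing == NULL || existing[0] == '\0')
        n = snprintf(buf, size, "%s", RSEQ_TUNABLE);
    else
        n = snprintf(buf, size, "%s:%s", existing, RSEQ_TUNABLE);
    return (n < 0 || (size_t) n >= size) ? -1 : 0;
}

/* Hypothetical launcher: run argv[0] with rseq acceleration requested.
   Returns only if something went wrong.  */
int exec_with_rseq(char *const argv[])
{
    char value[4096];

    if (build_tunables(getenv("GLIBC_TUNABLES"), value, sizeof value) != 0)
        return -1;
    if (setenv("GLIBC_TUNABLES", value, 1) != 0)
        return -1;
    execvp(argv[0], argv);
    return -1;  /* execvp returns only on error */
}
```

A main function would pass argv + 1 to exec_with_rseq after checking
that a command was given.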

This time, no bad seccomp interactions are expected because we leaned on
browsers a while back to fix their sandboxes.  Container engines should
not be impacted because the glibc integration treats an initial EPERM
failure of the rseq system call as an indicator that rseq is not
available.  (Only success for the first rseq system call followed by
failure for rseq system calls on newly created threads is problematic;
this part needed fixing in browser sandboxes.)
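
The distinction between the harmless and the problematic case can be
written down as a small decision function.  This is only a model of the
behavior described above, not the glibc source; the enum and function
names are invented for illustration, and the arguments stand for the
return values (0 or -1) of the rseq system call on the initial thread
and on a later thread:

```c
#include <assert.h>

/* Hypothetical names, used only to illustrate the three cases.  */
enum rseq_state {
    RSEQ_ACTIVE,        /* registration works everywhere */
    RSEQ_UNAVAILABLE,   /* initial registration failed: rseq is not used */
    RSEQ_INCONSISTENT   /* worked initially, then failed on a new thread */
};

enum rseq_state classify_rseq(int first_ret, int thread_ret)
{
    assert(first_ret == 0 || first_ret == -1);
    assert(thread_ret == 0 || thread_ret == -1);

    /* An initial failure (for example EPERM from a container seccomp
       filter) means rseq is treated as unavailable, and later threads
       are not a concern.  */
    if (first_ret != 0)
        return RSEQ_UNAVAILABLE;

    /* Success first and failure later is the case that needed fixing
       in browser sandboxes.  */
    if (thread_ret != 0)
        return RSEQ_INCONSISTENT;

    return RSEQ_ACTIVE;
}
```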

A few applications which currently use rseq will need integration with
the new glibc way of doing things if they want to keep benefiting from
rseq.  Mathieu Desnoyers is enhancing librseq; direct consumers such as
tcmalloc will need to be updated separately.  I do not know yet what the
interaction between CentOS Stream 9 glibc and these applications will
look like because we have not backported the public rseq symbols that
glibc is adding in glibc 2.35 (__rseq_offset, __rseq_size,
__rseq_flags).  The current downstream glibc default (without
GLIBC_TUNABLES=glibc.pthread.rseq=1) is not to use rseq at all, leaving
it available for application use, so we have some time to figure out the
details.

Thanks,
Florian