systemd and 'Stale file handle' errors?

James Pearson

13 May 2021

2:15 p.m.

I have a CentOS 7 system where I needed to restart chronyd - but the systemctl restart failed with the error:

systemd[1]: Starting NTP client/server...
systemd[43578]: Failed at step NAMESPACE spawning /usr/sbin/chronyd: Stale file handle
systemd[1]: chronyd.service: control process exited, code=exited status=226

Turns out there are a couple of stale NFS file handles from FUSE mounts (related to gvfsd) of subdirectories under a home directory mounted from an NFS home directory server - but the home directory for the user in this case no longer exists (the user has left)

However, I have no idea why these 'Stale file handles' prevent a service being started by systemd ?

In this case, chronyd has nothing to do with NFS mounted user home directories - so shouldn't really care ?

I have tried everything I can think of to clear these stale mounts, but with no luck

Does anyone know why systemd complains about unconnected 'Stale file handles' - and is there any way I can tell systemctl to start a service regardless of these 'errors' ?

Rebooting the host will be a last resort (the system is used by many users) - but in the meantime, I've manually started the /usr/sbin/chronyd binary directly, which runs fine

Thanks

James Pearson

Simon Matter

14 May

10:44 a.m.

We've been running large multi-user systems with desktop sessions on Red Hat based systems for decades, but it has become increasingly painful since EL6, with the introduction of systemd in EL7. It may have improved the user experience on developers' laptops, but for our use case things are worse today...

Regards, Simon

Jonathan Billings

12:47 p.m.

So, the chronyd systemd unit looks like this:

# /usr/lib/systemd/system/chronyd.service
[Unit]
Description=NTP client/server
Documentation=man:chronyd(8) man:chrony.conf(5)
After=ntpdate.service sntp.service ntpd.service
Conflicts=ntpd.service systemd-timesyncd.service
ConditionCapability=CAP_SYS_TIME

[Service]
Type=forking
PIDFile=/var/run/chrony/chronyd.pid
EnvironmentFile=-/etc/sysconfig/chronyd
ExecStart=/usr/sbin/chronyd $OPTIONS
ExecStartPost=/usr/libexec/chrony-helper update-daemon
PrivateTmp=yes
ProtectHome=yes
ProtectSystem=full

[Install]
WantedBy=multi-user.target

So, you'll notice there are "ProtectHome=yes" and "ProtectSystem=full" settings in the Service section. These set up a private mount namespace for the systemd unit, so /home, /root and /run/user are made inaccessible and empty (ProtectHome), and /usr, /boot and /etc are read-only (ProtectSystem). It does this to limit what a malicious NTP server could do to the system through bogus NTP traffic (which is a real thing that can happen). Many systemd services limit their processes this way.
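To check quickly whether a given unit asks for this kind of namespace setup, something like the following could be used. It is only a sketch: the function name list_sandbox_directives is made up for illustration, and it looks only for the three directives that appear in the chronyd unit above, although other directives can have a similar effect.

```shell
#!/bin/sh
# Print the sandboxing directives from a unit file that appear in the
# chronyd unit shown above. The function name is hypothetical, and the
# list of directives is deliberately limited to those three.
list_sandbox_directives() {
    unit_file=$1
    grep -E '^(PrivateTmp|ProtectHome|ProtectSystem)=' "$unit_file"
}
```

For example, list_sandbox_directives /usr/lib/systemd/system/chronyd.service should print the three lines at the end of the [Service] section. Note that it reads a single file, so it will not see settings changed by drop-in snippets.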

I suspect that is why you're seeing stale file handle errors: the "Failed at step NAMESPACE" message means the namespace for the service couldn't be set up, most likely because of the directories that are now stale on the system.
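A rough way to find the mount points that might be getting in the way is to walk the mount table and try to stat each entry. The sketch below assumes coreutils' timeout and stat are available; the function name find_unreachable_mounts is made up for illustration. A failing stat does not necessarily mean a stale handle (for example, FUSE mounts owned by another user are commonly reported as "Permission denied", even to root), so the output is a list of candidates, not a diagnosis.

```shell
#!/bin/sh
# Print every mount point in a mounts table that cannot be stat'ed.
# $1: mounts table to read (defaults to /proc/self/mounts).
# The function name is hypothetical.
find_unreachable_mounts() {
    mounts_file=${1:-/proc/self/mounts}
    while read -r _dev mnt _fstype _rest; do
        # The kernel writes spaces and similar characters in mount
        # points as octal escapes such as \040; decode them first.
        mnt=$(printf '%b' "$mnt")
        # timeout keeps a hung NFS server from blocking the loop.
        if ! timeout 5 stat -- "$mnt" >/dev/null 2>&1; then
            printf '%s\n' "$mnt"
        fi
    done < "$mounts_file"
}
```

Running find_unreachable_mounts as root with no argument checks the mounts visible in the current namespace.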

You can probably just do a lazy unmount (umount -l) to make them go away until you reboot. You can also disable the namespaced directories by doing a 'systemctl edit chronyd.service' and setting the options to 'off', but you'll be reducing the security of your system.

We've seen some weird stuff in the past related to this feature. For example, I couldn't unmount /home because a service with ProtectHome=read-only was running (cups), and 'fuser' and 'lsof' didn't show anything was using it. It's because the kernel namespace stuff operates as a mountpoint, so it's all kernel. Another fun issue I discovered is that we had some locally-developed services that used files in /tmp as a communication channel, and with PrivateTmp=yes set, they no longer could communicate. So it forced us to actually do the right thing and use more appropriate methods.
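One way to see which processes live in their own mount namespace (and so might be holding on to a mount that 'fuser' and 'lsof' don't attribute to anyone) is to compare each process's /proc/PID/ns/mnt link with that of PID 1. This is a sketch only: the function name list_private_mount_ns is made up for illustration, and reading other users' namespace links normally requires root.

```shell
#!/bin/sh
# Print the PIDs whose mount namespace differs from that of PID 1.
# $1: proc root to inspect (defaults to /proc).
# The function name is hypothetical.
list_private_mount_ns() {
    proc_root=${1:-/proc}
    init_ns=$(readlink "$proc_root/1/ns/mnt") || return 1
    for link in "$proc_root"/[0-9]*/ns/mnt; do
        # Processes may exit while we iterate; skip unreadable links.
        ns=$(readlink "$link" 2>/dev/null) || continue
        if [ "$ns" != "$init_ns" ]; then
            pid=${link#"$proc_root"/}
            printf '%s\n' "${pid%%/*}"
        fi
    done
}
```

The PIDs it prints can then be looked up with ps to see which services they belong to.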

It is kinda confusing but I do appreciate that I now have a lot of ways I can lock down services beyond simple UNIX permissions. systemd is a rather neat init system. My complaints with it usually are with the parts that reach outside of being an init system (I'm looking at you, systemd-logind and systemd-resolved).

-- Jonathan Billings billings@negate.org

James Pearson

5:46 p.m.

Thanks - that all makes sense - unfortunately 'umount -l' didn't work :-(

I've actually now rebooted the box - but if something like this happens again, maybe I could use a drop-in snippet in /run/systemd/system/ to turn the options off - which would then only last until the next reboot ?
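Such a runtime drop-in could be sketched roughly as follows. The function name write_runtime_override and the file name no-protect.conf are made up for illustration (systemd only cares about the .conf suffix). The thread does not establish which of the three settings actually has to be disabled, so this sketch turns all of them off, which reduces the isolation of the service accordingly.

```shell
#!/bin/sh
# Write a drop-in that switches off the namespace-related settings
# for one unit. Function and file names are hypothetical.
# $1: unit name, e.g. chronyd.service
# $2: base directory (defaults to /run/systemd/system, which does not
#     survive a reboot)
write_runtime_override() {
    unit=$1
    base=${2:-/run/systemd/system}
    dropin_dir=$base/$unit.d
    mkdir -p "$dropin_dir" || return 1
    # Assumption: all three settings are disabled, because it is not
    # known which one triggers the failure.
    cat > "$dropin_dir/no-protect.conf" <<'EOF'
[Service]
PrivateTmp=no
ProtectHome=no
ProtectSystem=no
EOF
}
```

After calling write_runtime_override chronyd.service as root, 'systemctl daemon-reload' followed by 'systemctl restart chronyd.service' should pick up the change. Where the installed systemctl supports it, 'systemctl edit --runtime chronyd.service' should create an equivalent snippet under /run.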

Thanks

James Pearson

discuss@lists.centos.org
