Hi folks,
It seems that the latest Vagrant box for centos/7 is breaking NFS mounts at boot. My Vagrant project has an NFS server node and a client node. The client has two NFS mounts in fstab. https://gitlab.rc.uab.edu/jpr/ohpc_vagrant
Using the image from the prior release (1804.02, kernel 3.10.0-862.2.3.el7.x86_64, CentOS Linux release 7.5.1804 (Core)), the mounts complete successfully. The newer releases (1811.02 and 1809.01) fail to mount the shares at boot.
I believe the differences between the older VMs (XFS on the volume manager) and the newer images (ext4 directly on the device) are changing the boot timing of services. The /var/log/messages log on the failing nodes shows that the network is still unavailable at the point where the NFS mounts are attempted. The NetworkManager wait service doesn't appear to work correctly, or doesn't wait long enough, for the network to come up first. It seems that a boot using sda1 directly with ext4 simply initializes too quickly.
Here's the top of the systemd-analyze blame output for the box that succeeds in its NFS mounts at boot (v1804.02). Note that the boot time is dominated by the NetworkManager wait service:
[vagrant@ood ~]$ systemd-analyze blame
          2.970s NetworkManager-wait-online.service
          1.511s tuned.service
          1.272s postfix.service
           594ms httpd24-httpd.service
           538ms lvm2-monitor.service
           474ms opt-ohpc-pub.mount
           471ms home.mount
           450ms auditd.service
           444ms dev-mapper-VolGroup00\x2dLogVol00.device
           380ms boot.mount
           343ms network.service
           216ms munge.service
           206ms NetworkManager.service
           177ms chronyd.service
           157ms polkit.service
           149ms sshd.service
           148ms systemd-logind.service
           134ms slurmd.service
           114ms gssproxy.service
           112ms lvm2-pvscan@8:3.service
           112ms rpc-statd.service
           112ms systemd-udev-trigger.service
           111ms rsyslog.service
           110ms rhel-readonly.service
           105ms rhel-dmesg.service
            91ms systemd-vconsole-setup.service
            84ms systemd-tmpfiles-setup-dev.service
            74ms systemd-tmpfiles-clean.service
            72ms dev-mapper-VolGroup00\x2dLogVol01.swap
            66ms kmod-static-nodes.service
            65ms rhel-domainname.service
            65ms rpc-statd-notify.service
          ...
Here's the top of the systemd-analyze blame output for the box that fails to mount NFS (v1811.02); the NetworkManager wait service barely waits at all:
[vagrant@ood ~]$ systemd-analyze blame
          1.811s dev-sda1.device
          1.737s tuned.service
          1.594s postfix.service
           785ms httpd24-httpd.service
           378ms systemd-vconsole-setup.service
           295ms slurmd.service
           291ms network.service
           286ms home.mount
           266ms auditd.service
           253ms opt-ohpc-pub.mount
           250ms NetworkManager-wait-online.service
           208ms systemd-udev-trigger.service
           200ms polkit.service
           188ms systemd-tmpfiles-setup-dev.service
           180ms sshd.service
           177ms chronyd.service
           175ms rhel-readonly.service
           146ms rhel-dmesg.service
           145ms munge.service
           144ms gssproxy.service
           141ms rpcbind.service
           139ms swapfile.swap
           112ms rhel-domainname.service
           102ms systemd-udevd.service
           101ms systemd-journald.service
            91ms rpc-statd.service
            78ms rsyslog.service
            70ms var-lib-nfs-rpc_pipefs.mount
            69ms systemd-tmpfiles-setup.service
            61ms rpc-statd-notify.service
            58ms systemd-journal-flush.service
            58ms systemd-sysctl.service
          ...
I also have the boot SVG graphs, which tell a similar story about the sequencing of service startup.
Is there a way to add a delay to the NetworkManager wait, or otherwise influence the boot configuration, to ensure the NFS shares mount correctly at boot?
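(If extending that wait turns out to be the right knob, I'm assuming a systemd drop-in for NetworkManager-wait-online would be the place to do it; the 60-second value below is just a guess, and it would need a systemctl daemon-reload afterwards:)

  # /etc/systemd/system/NetworkManager-wait-online.service.d/timeout.conf
  # untested sketch: clear the stock ExecStart, then re-add it with a longer timeout
  [Service]
  ExecStart=
  ExecStart=/usr/bin/nm-online -s -q --timeout=60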
Thanks,
John-Paul
Hi John-Paul,
I think you have a more generic problem, not something affecting just the Vagrant images.
On 2019-01-11 22:19, John-Paul Robinson wrote:
It seems that the latest Vagrant box for centos/7 is breaking NFS mounts at boot. My Vagrant project has an NFS server node and a client node. The client has two NFS mounts in fstab. https://gitlab.rc.uab.edu/jpr/ohpc_vagrant
Using the image from the prior release (1804.02, kernel 3.10.0-862.2.3.el7.x86_64, CentOS Linux release 7.5.1804 (Core)), the mounts complete successfully. The newer releases (1811.02 and 1809.01) fail to mount the shares at boot.
We'll go back to using XFS starting with 1812, but I wouldn't rely on timing to decide whether NFS is going to work or not - that's too fragile. I wasn't able to find an fstab in the Ansible playbooks your repo points to; perhaps you could provide a direct link?
Are you already using the _netdev mount option in your fstab? That should make sure these mounts are only attempted after the network is up. It should be mentioned in 'man systemd.mount'; I haven't used NFS myself, though.
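Something along these lines, with the server and paths as placeholders since I haven't seen your actual fstab:

  # hypothetical fstab entry; replace server, export and mount point with yours
  server:/export  /mnt/export  nfs  defaults,_netdev  0 0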
Best regards, Laurențiu
On Sat, Jan 12, 2019 at 4:24 PM Laurențiu Păncescu lpancescu@centosproject.org wrote:
Hi John-Paul,
I think you have a more generic problem, not something affecting just the Vagrant images.
On 2019-01-11 22:19, John-Paul Robinson wrote:
It seems that the latest Vagrant box for centos/7 is breaking NFS mounts at boot. My Vagrant project has an NFS server node and a client node. The client has two NFS mounts in fstab. https://gitlab.rc.uab.edu/jpr/ohpc_vagrant
Using the image from the prior release (1804.02, kernel 3.10.0-862.2.3.el7.x86_64, CentOS Linux release 7.5.1804 (Core)), the mounts complete successfully. The newer releases (1811.02 and 1809.01) fail to mount the shares at boot.
We'll go back to using XFS starting with 1812, but I wouldn't rely on timing to decide whether NFS is going to work or not - that's too fragile. I wasn't able to find an fstab in the Ansible playbooks your repo points to; perhaps you could provide a direct link?
Are you already using the _netdev mount option in your fstab? That should make sure these mounts are only attempted after the network is up. It should be mentioned in 'man systemd.mount'; I haven't used NFS myself, though.
Best regards, Laurențiu
NFS, CIFS, and external USB drives are among the best places to use automount instead of /etc/fstab. The behavior in case of failures, the retries of the mount, and the absence of the mount point except when the mount is successful are all more useful than hardcoded /etc/fstab workarounds. Not having the mount point exist also helps prevent accidentally writing *under* the NFS mount while it is not present and then mounting the NFS share on top of those files. That can be *nasty*: I ran into it recently with a MySQL share mounted on top of a running MySQL instance.
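A minimal sketch of a direct map on CentOS 7, with the server, export, and mount point as placeholders:

  # /etc/auto.master.d/nfs.autofs   (hypothetical map file name)
  /-    /etc/auto.nfs.direct   --timeout=300

  # /etc/auto.nfs.direct   (direct map; adjust the mount options to match your fstab)
  /mnt/share    -fstype=nfs,nfsvers=3,nodev,noatime    server:/export/share

With automount, the mount is attempted on first access and retried as needed, instead of failing once at boot.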
Hi Laurențiu,
Thanks for the comments.
I tried the newest 1812.01 image, which does go back to XFS (but without the volume manager). The behavior is the same: the NFS mounts in fstab still fail to mount at boot. I also tried the _netdev option, and that doesn't fix it either.
When I go back to 1804.02, the NFS mounts work fine at boot and don't require _netdev either (we do use nodev; I'm not sure what impact, if any, that has).
Here's the fstab from the 1804.02 image. The NFS entries are identical across the working and non-working configurations:
/dev/mapper/VolGroup00-LogVol00             /              xfs   defaults                        0 0
UUID=570897ca-e759-4c81-90cf-389da6eee4cc   /boot          xfs   defaults                        0 0
/dev/mapper/VolGroup00-LogVol01             swap           swap  defaults                        0 0
10.1.1.1:/home                              /home          nfs   nfsvers=3,nodev,nosuid,noatime  0 0
10.1.1.1:/opt/ohpc/pub                      /opt/ohpc/pub  nfs   nfsvers=3,nodev,noatime         0 0
For reference, the NFS config is added to fstab by the Ansible role that preps the Open OnDemand node:
https://github.com/jprorama/CRI_XCBC/blob/uab-dev/roles/prep_ood/tasks/main....
(Note this is in a submodule of the ohpc_vagrant repo linked in my original email.)
I'm not sure what's going on here. The only superficial difference is the volume manager config, which may be perturbing the timing enough to avoid the issue. I'm aware that relying on timing is not viable.
The main reason I see this as an issue is that I started my tests with 1804.02 where the NFS mounts do work correctly at boot. So I know it /can/ work. :) If I'd started with a later build (like some of my colleagues), I'd never have considered it a working feature. ;)
Hope this provides some insight. Happy to provide more info if it helps.
John-Paul
On 1/12/19 3:24 PM, Laurențiu Păncescu wrote:
Hi John-Paul,
I think you have a more generic problem, not something affecting just the Vagrant images.
On 2019-01-11 22:19, John-Paul Robinson wrote:
It seems that the latest Vagrant box for centos/7 is breaking NFS mounts at boot. My Vagrant project has an NFS server node and a client node. The client has two NFS mounts in fstab. https://gitlab.rc.uab.edu/jpr/ohpc_vagrant
Using the image from the prior release (1804.02, kernel 3.10.0-862.2.3.el7.x86_64, CentOS Linux release 7.5.1804 (Core)), the mounts complete successfully. The newer releases (1811.02 and 1809.01) fail to mount the shares at boot.
We'll go back to using XFS starting with 1812, but I wouldn't rely on timing to decide whether NFS is going to work or not - that's too fragile. I wasn't able to find an fstab in the Ansible playbooks your repo points to; perhaps you could provide a direct link?
Are you already using the _netdev mount option in your fstab? That should make sure these mounts are only attempted after the network is up. It should be mentioned in 'man systemd.mount'; I haven't used NFS myself, though.
Best regards, Laurențiu
NFS at boot time is very timing sensitive and error prone. The network has to be fully up (especially with DHCP-based network configurations) and RPC has to be running for NFS to work correctly. So stop relying on it in /etc/fstab, seriously. Enable autofs and use that to bring up the NFS mounts as needed, and only as needed.
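A rough sketch of getting that going on CentOS 7, assuming the stock autofs package:

  yum install -y autofs        # provides the automount daemon and /etc/auto.master
  systemctl enable autofs
  systemctl start autofs

Then put the NFS mounts in an autofs map instead of /etc/fstab.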