But if I start the automount unit and ls the mount point, the shell hangs, and a long time later (I haven't timed it; maybe an hour) I eventually get a prompt again. Control-C won't interrupt it. I can still ssh in and get another session, so it's just the process accessing the mount point that hangs.
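For reference, the unit pair looks something like this (names and paths here are placeholders, not my real config):

  # mnt-nas1.mount
  [Unit]
  Description=NAS1 share 1

  [Mount]
  What=nas1.example.com:/export/share1
  Where=/mnt/nas1
  Type=nfs4

  # mnt-nas1.automount
  [Automount]
  Where=/mnt/nas1

Starting mnt-nas1.mount directly works fine; it's starting mnt-nas1.automount and then touching /mnt/nas1 that produces the hang.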
I don't have a solution, but I wanted to point out that this same hang happened to me recently with a Myricom 10Gb card. Apparently the Myricom drivers don't support SMB connections on CentOS 7, although HTTP traffic works fine. I solved it by switching to a different NIC.
--On Friday, October 19, 2018 2:33 PM -0700 Elliott Balsley elliott@altsystems.com wrote:
> I don't have a solution, but I wanted to point out that this same hang happened to me recently with a Myricom 10Gb card. Apparently the Myricom drivers don't support SMB connections on CentOS 7, although HTTP traffic works fine. I solved it by switching to a different NIC.
The mount works fine for me. It's only the automount that hangs, and only since a few months ago.
I had it happen again today when my Let's Encrypt cert renewed and the dovecot (IMAP) server restarted. Dovecot checks all the mountpoints (in case any have mail folders on them) and hung on restart. I shelled in and ran df, and it also hung. I logged in yet another session and tried to ls the mountpoint, and that hung while tab-completing the directory name.
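If anyone else needs to poke at a box in this state: reading the mount table directly doesn't stat() the mountpoints, so it shouldn't block the way df and ls do, and timeout can bound anything that might. Roughly (a sketch, not tested against a wedged mount):

  grep nfs /proc/self/mountinfo
  timeout 5 df -t nfs4 || echo "df blocked, killed after 5s"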
Here's what I see in /var/log/messages when dovecot hangs and I manually mount the shares from another shell session. SELinux is in permissive mode.
Oct 26 09:11:39 saruman systemd: Mounting NAS1 share 1...
Oct 26 09:11:39 saruman systemd: Failed to expire automount, ignoring: No such device
Oct 26 09:11:39 saruman systemd: Mounted NAS1 share 1.
Oct 26 09:11:45 saruman kernel: INFO: task dovecot:831 blocked for more than 120 seconds.
Oct 26 09:11:45 saruman kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 26 09:11:45 saruman kernel: dovecot D ffff9994adfa3f40 0 831 1 0x00000084
Oct 26 09:11:45 saruman kernel: Call Trace:
Oct 26 09:11:45 saruman kernel: [<ffffffff85f1890c>] ? __schedule+0x41c/0xa20
Oct 26 09:11:45 saruman kernel: [<ffffffff85f18f39>] schedule+0x29/0x70
Oct 26 09:11:45 saruman kernel: [<ffffffff85f168a9>] schedule_timeout+0x239/0x2c0
Oct 26 09:11:45 saruman kernel: [<ffffffff858beb96>] ? finish_wait+0x56/0x70
Oct 26 09:11:45 saruman kernel: [<ffffffff85f16ff2>] ? mutex_lock+0x12/0x2f
Oct 26 09:11:45 saruman kernel: [<ffffffff85ab4e00>] ? autofs4_wait+0x420/0x910
Oct 26 09:11:45 saruman kernel: [<ffffffff859faf82>] ? kmem_cache_alloc+0x1c2/0x1f0
Oct 26 09:11:45 saruman kernel: [<ffffffff85f192ed>] wait_for_completion+0xfd/0x140
Oct 26 09:11:45 saruman kernel: [<ffffffff858d2010>] ? wake_up_state+0x20/0x20
Oct 26 09:11:45 saruman kernel: [<ffffffff85ab603b>] autofs4_expire_wait+0xab/0x160
Oct 26 09:11:45 saruman kernel: [<ffffffff85ab2fc0>] do_expire_wait+0x1e0/0x210
Oct 26 09:11:45 saruman kernel: [<ffffffff85ab31fe>] autofs4_d_manage+0x7e/0x1d0
Oct 26 09:11:45 saruman kernel: [<ffffffff85a2a37a>] follow_managed+0xba/0x310
Oct 26 09:11:45 saruman kernel: [<ffffffff85a2b32d>] lookup_fast+0x12d/0x230
Oct 26 09:11:45 saruman kernel: [<ffffffff85a2e0dd>] path_lookupat+0x16d/0x8b0
Oct 26 09:11:45 saruman kernel: [<ffffffff85f127ba>] ? avc_alloc_node+0x24/0x123
Oct 26 09:11:45 saruman kernel: [<ffffffff859fadf5>] ? kmem_cache_alloc+0x35/0x1f0
Oct 26 09:11:45 saruman kernel: [<ffffffff85a30aef>] ? getname_flags+0x4f/0x1a0
Oct 26 09:11:45 saruman kernel: [<ffffffff85a2e84b>] filename_lookup+0x2b/0xc0
Oct 26 09:11:45 saruman kernel: [<ffffffff85a31c87>] user_path_at_empty+0x67/0xc0
Oct 26 09:11:45 saruman kernel: [<ffffffff85927b72>] ? from_kgid_munged+0x12/0x20
Oct 26 09:11:45 saruman kernel: [<ffffffff85a251df>] ? cp_new_stat+0x14f/0x180
Oct 26 09:11:45 saruman kernel: [<ffffffff85a31cf1>] user_path_at+0x11/0x20
Oct 26 09:11:45 saruman kernel: [<ffffffff85a24cd3>] vfs_fstatat+0x63/0xc0
Oct 26 09:11:45 saruman kernel: [<ffffffff85a2523e>] SYSC_newstat+0x2e/0x60
Oct 26 09:11:45 saruman kernel: [<ffffffff859326b6>] ? __audit_syscall_exit+0x1e6/0x280
Oct 26 09:11:45 saruman kernel: [<ffffffff85a2551e>] SyS_newstat+0xe/0x10
Oct 26 09:11:45 saruman kernel: [<ffffffff85f2579b>] system_call_fastpath+0x22/0x27
Oct 26 09:11:50 saruman systemd: Unmounting NAS1 share 1...
Oct 26 09:11:50 saruman systemd: Unmounted NAS1 share 1.
Oct 26 09:12:41 saruman systemd: dovecot.service stop-final-sigterm timed out. Killing.
Oct 26 09:13:19 saruman systemd: Mounting NAS1 share 2...
Oct 26 09:13:19 saruman systemd: Failed to expire automount, ignoring: No such device
Oct 26 09:13:19 saruman systemd: dovecot.service: main process exited, code=killed, status=9/KILL
Oct 26 09:13:19 saruman systemd: Unit dovecot.service entered failed state.
Oct 26 09:13:19 saruman systemd: dovecot.service failed.
Oct 26 09:13:19 saruman systemd: Starting Dovecot IMAP/POP3 email server...
Oct 26 09:13:19 saruman systemd: Mounted NAS1 share 2.
Oct 26 09:13:19 saruman systemd: Started Dovecot IMAP/POP3 email server.
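If I'm reading the trace right, dovecot's stat() walks into the autofs mountpoint and parks in autofs4_expire_wait(), i.e. the kernel is waiting for the daemon side (systemd here) to finish an expire that never completes, which fits the "Failed to expire automount, ignoring: No such device" lines. For anyone wanting stacks from a wedged box, something like this should work (as root, on a stock C7 kernel):

  # dump stacks of all uninterruptible (D state) tasks to the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -100

  # or look at one PID's kernel stack directly (831 is the dovecot PID above)
  cat /proc/831/stack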
Kenneth Porter wrote:
> The mount works fine for me. It's only the automount that hangs, and only since a few months ago.
>
> I had it happen again today when my Let's Encrypt cert renewed and the dovecot (IMAP) server restarted. Dovecot checks all the mountpoints (in case any have mail folders on them) and hung on restart. I shelled in and ran df, and it also hung. I logged in yet another session and tried to ls the mountpoint, and that hung while tab-completing the directory name.
>
> Here's what I see in /var/log/messages when dovecot hangs and I manually mount the shares from another shell session. SELinux is in permissive mode.
> <snip>

Wait a minute: are you running IPv6? What we see is that NFSv4 preferentially goes for IPv6; if a system doesn't get its IPv6 address, or if it has one and then loses it, NFSv4 will *NOT* fall back to IPv4, but hangs.
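One way to rule that in or out is to pin the mount to IPv4, e.g. point the mount source at the server's IPv4 address (or a hosts entry that only resolves to an A record) and force the IPv4 transport. A quick manual test could look like this (address and export path made up):

  mount -t nfs4 -o proto=tcp 192.0.2.10:/export/share1 /mnt/test

Per nfs(5), proto=tcp is TCP over IPv4 and proto=tcp6 is TCP over IPv6, so if the automount only wedges when IPv6 is in play, this mount should behave.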
mark
> Wait a minute: are you running IPv6? What we see is that NFSv4 preferentially goes for IPv6; if a system doesn't get its IPv6 address, or if it has one and then loses it, NFSv4 will *NOT* fall back to IPv4, but hangs.
Nope. My router does not do IPv6. From what I've heard, the Myricom driver was never updated for C7. Myricom was acquired by some company called CSPI, and apparently they're not interested in updating the driver.
On 10/26/2018 12:25 PM, mark wrote:
> Wait a minute: are you running IPv6? What we see is that NFSv4 preferentially goes for IPv6; if a system doesn't get its IPv6 address, or if it has one and then loses it, NFSv4 will *NOT* fall back to IPv4, but hangs.
All my interfaces have a link local IPv6 address.
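(For anyone checking their own box, something like "ip -6 addr show scope link" should list them.)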
Note that only the automount hangs. The regular mount unit works fine. It also seems that once the mount is manually mounted and allowed to expire, it automounts again just fine. (I have the TimeoutIdleSec set to 10 for testing.) It's a production server so I can't easily reboot it to test the failure. I reboot it about once a month when a new kernel comes out.
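For anyone wanting to reproduce the short expiry, a drop-in on the automount unit does it (the unit name here is a stand-in for my real one), and the expire/remount cycle shows up in the journal without needing a reboot:

  # /etc/systemd/system/mnt-nas1.automount.d/timeout.conf
  [Automount]
  TimeoutIdleSec=10

  systemctl daemon-reload
  systemctl restart mnt-nas1.automount
  journalctl -f -u mnt-nas1.automount -u mnt-nas1.mount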