 
            Hello,
We have a VM (under KVM - a VPS service by our ISP) running CentOS 7.
On it we have 2 NFS mounts, one for backup and one as a live file system (where there are two user homes as well):
-----------------------------------------------------------------------------------------------------------------------
# cat /etc/fstab
/dev/mapper/centos-root / xfs defaults 0 0
UUID=7a3ae70a-8ef3-463b-8f5b-be4e2e7be894 /boot xfs defaults 0 0
/dev/mapper/centos-swap swap swap defaults 0 0
10.201.40.34:/data/col1/noc-bkups-1 /mnt/dd2500-1 nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0
10.201.40.34:/data/col1/hesperia-mount /hesperiamount nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0
-----------------------------------------------------------------------------------------------------------------------
This setup has been working fine for over a year, even under significant load, without issues.
However, yesterday the "live" NFS mount (/hesperiamount) started crashing. When booting, everything is fine, but very soon after boot we lose communication with the mount, although the remote storage system is accessible (without reporting any errors) and no network issues have occurred. We found that dmesg reports failures with call traces (2 examples):
This happens repeatedly/consistently (after several reboots) so we have been forced to replace the NFS mount with a local mount (on a new local virtual hard disk), to restore normal system operation. So the fstab has now become:
-------------------------------------------------------------------------------------------------------------------
# cat /etc/fstab
/dev/mapper/centos-root / xfs defaults 0 0
UUID=7a3ae70a-8ef3-463b-8f5b-be4e2e7be894 /boot xfs defaults 0 0
/dev/mapper/centos-swap swap swap defaults 0 0
/dev/mapper/vg2-lv1 /hesperiamount xfs defaults 0 0
10.201.40.34:/data/col1/noc-bkups-1 /mnt/dd2500-1 nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0
# 10.201.40.34:/data/col1/hesperia-mount /hesperiamount nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0
-------------------------------------------------------------------------------------------------------------------
Note that when I later manually mounted the same NFS share on the same box (in order to copy data from it using rsync), it did not crash (but in this scenario it only saw reads and no writes). The share was manually mounted with the following command:
# mount -vv -o auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 -t nfs 10.201.40.34:/data/col1/hesperia-mount /hesperiamount2
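Since the manual mount above only saw reads, one way to check whether writes are what triggers the stall is a small, time-limited write test. The following is a minimal sketch, not something that was run in this thread: `probe_write` is a hypothetical helper name, and it assumes GNU coreutils `timeout` and `dd` are available on the client.

```shell
# Hypothetical helper: try one small write on a mount point and give up
# after a timeout, so that a stalled NFS mount does not block the shell.
probe_write() {
  dir=$1
  secs=${2:-10}
  probe="$dir/.nfs_write_probe.$$"
  # Both the write and the cleanup run under the timeout; SIGKILL is used
  # because a process blocked on NFS I/O may ignore milder signals.
  if timeout -s KILL "$secs" sh -c \
      'dd if=/dev/zero of="$1" bs=4k count=1 conv=fsync 2>/dev/null && rm -f "$1"' \
      sh "$probe"; then
    echo "write ok: $dir"
  else
    echo "write failed or timed out: $dir"
    return 1
  fi
}

# Example (mount point taken from the command above):
# probe_write /hesperiamount2 10
```

A failure here only says that a 4 KiB write did not complete in time; it does not say where along the path the traffic was lost.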
Questions:
* Is this a known issue/bug?
* Have we possibly made any NFS misconfigurations (which however have not caused any errors for about a year now)?
* What could we do to prevent the error from occurring again?
Please advise.
Thanks, Nick
 
            On 02/06/2017 at 08:41, Nikolaos Milas wrote:
Questions:
- Is this a known issue/bug?
I have had the same problem since the last rpcbind package update (rpcbind-0.2.0-38.el7_3).
- Have we possibly made any NFS misconfigurations (which however have not caused any errors for about a year now)?
- What could we do to prevent the error from occurring again?
Reverting to rpcbind-0.2.0-38.el7 solves the problem for me
 
            On 2/6/2017 10:40 AM, Philippe BOURDEU d'AGUERRE wrote:
Reverting to rpcbind-0.2.0-38.el7 solves the problem for me
Thank you very much Philippe,
I notice that I have upgraded to rpcbind-0.2.0-38.el7_3.x86_64 on May 26.
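For anyone wanting to check when a package was last changed on their own box, `rpm -q --last rpcbind` prints the install time of the currently installed build. To see earlier transactions as well, the yum log can be searched. The sketch below assumes the default CentOS 7 log location `/var/log/yum.log` and its usual "Installed:" / "Updated:" line format; `pkg_history` is a hypothetical helper name.

```shell
# Hypothetical helper: list yum.log entries that installed or updated a package.
# $1 = package name, $2 = log file (assumed default: /var/log/yum.log).
pkg_history() {
  pkg=$1
  log=${2:-/var/log/yum.log}
  # Match the package name followed by a version, so that "rpcbind" does
  # not also match other packages whose names merely start the same way.
  grep -E " (Installed|Updated): ${pkg}-[0-9]" "$log"
}

# Example:
# pkg_history rpcbind
```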
Have you checked if this bug/behavior has been reported or should we file a bug report?
Nick
 
            On 2/6/2017 10:58 AM, Nikolaos Milas wrote:
Have you checked if this bug/behavior has been reported or should we file a bug report?
After a bit of search, I found the associated reports:
https://bugs.centos.org/view.php?id=13351
https://bugzilla.redhat.com/show_bug.cgi?id=1454876
No fix yet, but as a workaround it seems that at least the NFS problems are indeed solved by downgrading.
Cheers, Nick
 
            On 2/6/2017 1:46 PM, Nikolaos Milas wrote:
After a bit of search, I found the associated reports:
https://bugs.centos.org/view.php?id=13351
https://bugzilla.redhat.com/show_bug.cgi?id=1454876
No fix yet, but as a workaround it seems that at least the NFS problems are indeed solved by downgrading.
I have been working fine with CentOS 7.3, since I downgraded to rpcbind-0.2.0-38.el7.x86_64.
Today, I decided to upgrade to 7.4 (which, among several hundred updates, includes rpcbind-0.2.0-42.el7.x86_64); after that I started having similar NFS issues again: NFS communication hangs. In /var/log/messages:
-----------------------------------------------------------------------------------------
...
Sep 22 11:03:21 hesperia1 kernel: RPC: Registered named UNIX socket transport module.
Sep 22 11:03:21 hesperia1 kernel: RPC: Registered udp transport module.
Sep 22 11:03:21 hesperia1 kernel: RPC: Registered tcp transport module.
Sep 22 11:03:21 hesperia1 kernel: RPC: Registered tcp NFSv4.1 backchannel transport module.
Sep 22 11:03:21 hesperia1 systemd-udevd: starting version 219
Sep 22 11:03:21 hesperia1 systemd: Started Configure read-only root support.
Sep 22 11:03:21 hesperia1 kernel: Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
Sep 22 11:03:21 hesperia1 systemd: Mounted NFSD configuration filesystem.
...
Sep 22 11:03:27 hesperia1 systemd: Mounting /mnt/dd2500-1...
Sep 22 11:03:27 hesperia1 systemd: Starting Notify NFS peers of a restart...
Sep 22 11:03:27 hesperia1 sm-notify[948]: Version 1.3.0 starting
Sep 22 11:03:27 hesperia1 systemd: Started Notify NFS peers of a restart.
Sep 22 11:03:27 hesperia1 systemd: Started OpenSSH server daemon.
Sep 22 11:03:27 hesperia1 kernel: FS-Cache: Loaded
Sep 22 11:03:27 hesperia1 kernel: FS-Cache: Netfs 'nfs' registered for caching
Sep 22 11:03:27 hesperia1 systemd: Mounted /mnt/dd2500-1.
Sep 22 11:03:27 hesperia1 systemd: Reached target Remote File Systems.
Sep 22 11:03:27 hesperia1 systemd: Starting Remote File Systems.
...
Sep 22 11:11:16 hesperia1 kernel: nfs: server 10.201.40.34 not responding, still trying
...
Sep 22 11:20:44 hesperia1 kernel: nfs: server 10.201.40.34 not responding, still trying
...
-----------------------------------------------------------------------------------------
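The "not responding" entries in a log like the one above can be pulled out on their own, which makes it easier to see how soon after boot the stalls begin and how often they repeat. This is a minimal sketch that assumes the syslog line layout shown in the excerpt (month, day, time, host, then the kernel message); `nfs_stalls` is a hypothetical helper name.

```shell
# Hypothetical helper: print the timestamp and server address of every
# "nfs: server ... not responding" kernel message in a syslog file.
nfs_stalls() {
  # Fields 1-3 are the syslog timestamp; field 8 is the server address
  # in lines of the form "... kernel: nfs: server ADDR not responding ...".
  awk '/kernel: nfs: server .* not responding/ { print $1, $2, $3, $8 }' \
    "${1:-/var/log/messages}"
}

# Example:
# nfs_stalls /var/log/messages
```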
I tried downgrading to rpcbind-0.2.0-38.el7.x86_64 but this time it didn't help.
I mount either directly:
mount -vv -o auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 -t nfs 10.201.40.34:/data/col1/hesperia-mount /hesperiamount2
or through /etc/fstab:
10.201.40.34:/data/col1/hesperia-mount /hesperiamount2 nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0
The box may even hang during reboot, which has never happened in the past.
It needs a hard reboot (via VM admin console) to boot again.
I have confirmed the above behavior multiple times.
Please advise me on how to resolve this situation. We are very much dependent on NFS mounts.
Is it a known bug? (As far as I could search, I didn't come up with anything.)
The earlier bug report appears resolved: https://bugzilla.redhat.com/show_bug.cgi?id=1454876
Can I safely/easily revert to 7.3?
Thanks in advance, Nick
 
            On 22/9/2017 2:58 PM, Nikolaos Milas wrote:
... or through /etc/fstab:
10.201.40.34:/data/col1/hesperia-mount /hesperiamount2 nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0
Correction: the /etc/fstab nfs mount line has one more zero:
10.201.40.34:/data/col1/hesperia-mount /hesperiamount2 nfs auto,noatime,nolock,bg,nfsvers=3,intr,tcp,actimeo=1800 0 0
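As a side note, fstab(5) documents the fifth and sixth fields as defaulting to zero when absent, so the missing zero should not by itself change how the share is mounted. Entries with fewer than six fields can still be listed with a short check like the sketch below; `fstab_short_lines` is a hypothetical helper name.

```shell
# Hypothetical helper: report fstab entries that have fewer than six fields.
# Comment lines and blank lines are skipped.
fstab_short_lines() {
  awk '!/^[ \t]*(#|$)/ && NF < 6 { printf "line %d: %d fields\n", NR, NF }' \
    "${1:-/etc/fstab}"
}

# Example:
# fstab_short_lines /etc/fstab
```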
I am looking forward to your feedback.
Based on the facts and experience, it looks like a bug. After all, it occurred right after upgrade to 7.4, without any system configuration changes.
Please help! Nick
 
            On 22/9/2017 3:46 PM, Nikolaos Milas wrote:
Based on the facts and experience, it looks like a bug. After all, it occurred right after upgrade to 7.4, without any system configuration changes.
I have created bug report: https://bugs.centos.org/view.php?id=13891 for this.
Isn't there anyone else having NFS mount issues after upgrade to 7.4?
(I have found this report: https://access.redhat.com/solutions/3146191 which I think is not directly related.)
Other possible error report which could be related: https://www.reddit.com/r/ansible/comments/6tu9c4/mounting_a_nfs_share_from_a...
Please let me know if there can be a workaround or something.
Thanks, Nick
 
            On 22/9/2017 8:15 PM, Nikolaos Milas wrote:
I have created bug report: https://bugs.centos.org/view.php?id=13891 for this.
I have also created
https://bugzilla.redhat.com/show_bug.cgi?id=1494834
and I have uploaded a lot of (hopefully useful) information, but there does not seem to be any activity on the issue, nor have I had any feedback, although it has been a week since the report.
I continue to have this issue.
Nick
 
            This config is working for me, just by using an older kernel.
[root@mnemosyne ~]# uname -r
3.10.0-514.21.2.el7.x86_64
[root@mnemosyne ~]# rpm -qa | grep rpcbind
rpcbind-0.2.0-42.el7.x86_64
[root@mnemosyne ~]# rpm -qa | grep nfs
libnfsidmap-0.25-17.el7.x86_64
nfs-utils-1.3.0-0.48.el7.x86_64
Patrick
 
            On 2/10/2017 11:19 AM, Patrick Begou wrote:
This config is working for me, just by using an older kernel.
Thanks Patrick,
Unfortunately, it doesn't work for me. I tried booting with an older kernel (3.10.0-514.21.1.el7.x86_64) and/or downgrading rpcbind to rpcbind-0.2.0-38.el7.x86_64 (all three combinations: each one separately and both at the same time), but it didn't work. I have not tried to downgrade nfs-utils as well...
Note that on the same VLAN, on the same cluster, I have another VM which I have not upgraded yet (thankfully) to 7.4 and this works normally (using 7.3 and rpcbind-0.2.0-38.el7).
My findings and logs are available at: https://bugzilla.redhat.com/show_bug.cgi?id=1494834
Nick
 
            On 2/10/2017 11:46 PM, Nikolaos Milas wrote:
Unfortunately, it doesn't work for me.
Problem solved - at least in my case - by changing the NFS Export Options (of the NFS shared directory, at the data storage system) from secure to insecure. That is, I changed from:
rw,no_root_squash,no_all_squash,secure,nolog
to:
rw,no_root_squash,no_all_squash,insecure,nolog
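Some background on the option: on Linux NFS servers, exports(5) documents `secure` as requiring that requests originate from a source port below 1024, while `insecure` lifts that requirement. Whether the storage system used here interprets the option the same way is an assumption. The sketch below shows one way to see which source ports the client is actually using toward the NFS port (2049); `nfs_src_ports` is a hypothetical helper name, and it expects the column layout printed by `ss -tn`.

```shell
# Hypothetical helper: read `ss -tn` output on stdin and, for each
# established connection to port 2049, print the local source port and
# whether it falls in the reserved range (below 1024).
nfs_src_ports() {
  awk '$1 == "ESTAB" && $5 ~ /:2049$/ {
         n = split($4, a, ":")      # local address is field 4; port is the last part
         port = a[n] + 0
         printf "%d %s\n", port, (port < 1024 ? "reserved" : "non-reserved")
       }'
}

# Example:
# ss -tn | nfs_src_ports
```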
I don't know if the behavior I had described can be explained by using the "secure" option, but after I changed to "insecure" everything works fine, using the latest packages - latest kernel and latest rpms.
If anyone can provide some insight on it, that would be appreciated (since I know little about NFS).
Cheers, Nick
 
            On 4/10/2017 3:09 PM, Nikolaos Milas wrote:
Problem solved - at least in my case - by changing the NFS Export Options (of the NFS shared directory, at the data storage system) from secure to insecure.
In the end, it turned out that the issue reappeared after a couple of days.
So, it seems that this change did not actually solve the problem.
I am still trying to find a solution.
Nick
 
            On 7/10/2017 7:35 PM, Nikolaos Milas wrote:
I am still trying to find a solution.
The problem was finally traced down to a Cisco ASA bug (this firewall device lies between the connected networks); bug CSCuq80704 was resolved by an ASA software update.
NFS packets were incorrectly being dropped by ASA and were causing nfs traffic to stall. After ASA software upgrade the problem has not occurred again.
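For others chasing a similar stall, packets lost in transit may show up on the client as RPC retransmissions. The sketch below reads the client RPC counters from `/proc/net/rpc/nfs` (the same source `nfsstat -rc` reports from); it was not part of the diagnosis in this thread, and `rpc_retrans` is a hypothetical helper name.

```shell
# Hypothetical helper: print the NFS client RPC call and retransmission
# counters. The "rpc" line of /proc/net/rpc/nfs holds, in order:
# calls, retransmissions, authrefresh.
rpc_retrans() {
  awk '$1 == "rpc" { printf "calls=%d retrans=%d\n", $2, $3 }' \
    "${1:-/proc/net/rpc/nfs}"
}

# Example: run it twice a few minutes apart and compare the retrans value.
# rpc_retrans
```

A retransmission count that keeps growing while the server itself reports no errors would be consistent with something in between dropping traffic, though it does not identify the device.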
I can't tell why this only started happening lately, after many months without trouble.
Case closed.
Cheers, Nick


