Hello,
I'm back with these NFS problems. The server and the client have been updated, but the problem still occurs from time to time.
The server is:
Linux robin.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
The client is:
Linux grivola.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Both run CentOS Linux release 7.8.2003 (Core).
It seems related to an scp session: the NFS client downloads a large data set from a remote server and stores the files on its NFS file system.
On the client I see messages like these in /var/log/messages:
Aug 28 10:03:08 grivola kernel: INFO: task scp:78495 blocked for more than 120 seconds.
Aug 28 10:03:08 grivola kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 28 10:03:08 grivola kernel: scp D ffff97e37fa9acc0 0 78495 147369 0x00000084
Aug 28 10:03:08 grivola kernel: Call Trace:
Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff92785da9>] schedule+0x29/0x70
Aug 28 10:03:08 grivola kernel: [<ffffffff927838b1>] schedule_timeout+0x221/0x2d0
Aug 28 10:03:08 grivola kernel: [<ffffffffc132e7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
Aug 28 10:03:08 grivola kernel: [<ffffffffc133d850>] ? rpc_put_task+0x10/0x20 [sunrpc]
Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff9278549d>] io_schedule_timeout+0xad/0x130
Aug 28 10:03:08 grivola kernel: [<ffffffff92785538>] io_schedule+0x18/0x20
Aug 28 10:03:08 grivola kernel: [<ffffffff92783f01>] bit_wait_io+0x11/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff92783a27>] __wait_on_bit+0x67/0x90
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd741>] wait_on_page_bit+0x81/0xa0
Aug 28 10:03:08 grivola kernel: [<ffffffff920c7840>] ? wake_bit_function+0x40/0x40
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd871>] __filemap_fdatawait_range+0x111/0x190
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd904>] filemap_fdatawait_range+0x14/0x30
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd947>] filemap_fdatawait+0x27/0x30
Aug 28 10:03:08 grivola kernel: [<ffffffff921bfd1c>] filemap_write_and_wait+0x4c/0x80
Aug 28 10:03:08 grivola kernel: [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffffc09700e0>] nfs_setattr+0x1f0/0x210 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffff9226cecc>] notify_change+0x30c/0x4d0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224af05>] do_truncate+0x75/0xc0
Aug 28 10:03:08 grivola kernel: [<ffffffff92250118>] ? __sb_start_write+0x58/0x120
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b329>] do_sys_ftruncate.constprop.14+0x139/0x1a0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b3ce>] SyS_ftruncate+0xe/0x10
Aug 28 10:03:08 grivola kernel: [<ffffffff92792ed2>] system_call_fastpath+0x25/0x2a
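To make a trace like this easier to compare between incidents, here is a minimal Python sketch that pulls the function names and module tags out of the log text. The names FRAME_RE and parse_frames are illustrative assumptions, not part of any existing tool, and SAMPLE contains only four frames copied from the trace above. Frames prefixed with "?" are marked as unreliable, since the kernel prints that marker for addresses it found on the stack but could not confirm as part of the call chain.

```python
import re

# One stack frame looks like:
#   [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]
# An optional "? " marks an unreliable frame, and the trailing
# "[module]" tag only appears for code that lives in a kernel module.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def parse_frames(log_text):
    """Return one dict per stack frame found in log_text, in log order."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append(
            {
                "function": match.group(2),
                "module": match.group(3),  # None for core kernel code
                "reliable": match.group(1) is None,
            }
        )
    return frames


# Four frames copied from the trace above (works whether the log is
# one frame per line or run together on a single line).
SAMPLE = (
    "Aug 28 10:03:08 grivola kernel: [<ffffffffc132e7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc] "
    "Aug 28 10:03:08 grivola kernel: [<ffffffff921bfd1c>] filemap_write_and_wait+0x4c/0x80 "
    "Aug 28 10:03:08 grivola kernel: [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs] "
    "Aug 28 10:03:08 grivola kernel: [<ffffffff9224b3ce>] SyS_ftruncate+0xe/0x10"
)

for frame in parse_frames(SAMPLE):
    if frame["reliable"]:
        print(frame["function"], frame["module"] or "kernel")
```

Read bottom-up, the reliable frames show scp calling ftruncate, which goes through nfs_setattr and nfs_wb_all and then waits for dirty pages to be written back to the server.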
At that point the NFS server freezes. Even an ssh session or the local console (via iDRAC or a screen and keyboard physically plugged into the server) does not work.
I have no special messages on the NFS server. The freeze period ends with:
On the server:
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
and on the client:
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
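The two timestamps in these logs give a rough lower bound on how long the stall lasted. The sketch below is only an estimate under stated assumptions: that the 10:03:08 hung-task report is the first one for this stall, that the 120 seconds in that message is the hung-task timeout, and that the log year is 2020 (syslog lines carry no year). The helper name parse_syslog_time is hypothetical.

```python
from datetime import datetime, timedelta


def parse_syslog_time(stamp, year=2020):
    """Parse a syslog timestamp such as 'Aug 28 10:03:08'.

    Syslog omits the year, so it is passed in explicitly (assumed 2020).
    """
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Timestamps copied from the client log excerpts above.
hung_report = parse_syslog_time("Aug 28 10:03:08")  # "blocked for more than 120 seconds"
recovered = parse_syslog_time("Aug 28 10:20:26")    # "nfs: server ... OK"

# The task had already been blocked for at least this long when reported.
HUNG_TASK_TIMEOUT = timedelta(seconds=120)

# Latest moment at which scp can have started blocking.
latest_start = hung_report - HUNG_TASK_TIMEOUT

# Lower bound on the stall as seen from the client.
min_stall = recovered - latest_start
print("scp blocked no later than:", latest_start.time())
print("minimum stall duration:", min_stall)
```

Under those assumptions scp was stuck from 10:01:08 at the latest, so the freeze lasted at least 19 minutes 18 seconds as seen from the client.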
I do not know how to investigate this....
Patrick
On 09/07/2020 at 12:11, Patrick Bégou wrote:
Hi Orion,
No, I still have this problem. I delayed working on it because the latest updates had not yet been installed on the server and on the client. I'll work on this problem again as soon as possible.
Thanks Charles for your detailed information on how to track this problem. I'll check all these metrics.
I have several clients for this NFS server, and the problem seems to occur only from the client using NFS 4.1 on CentOS Linux release 7.7.1908 (Core). The default options used are:
rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=194.254.xx.xx,local_lock=none,addr=194.254.yy.yy
On older clients (Red Hat Enterprise Linux Server release 6.7 (Santiago)) the default options are:
rw,intr,hard,sloppy,vers=4,addr=194.254.xx.xx,clientaddr=194.254.yy.yy
The server runs CentOS 7.6.1810.
I will see if the latest updates help to solve the problem.
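To see at a glance how the two option sets differ, a small Python sketch can parse each comma-separated list and report only the keys whose values differ. The function names parse_mount_options and diff_options are illustrative assumptions; the option strings are copied from above, and the masked addr and clientaddr values are ignored because they carry no information here.

```python
def parse_mount_options(options):
    """Turn 'rw,vers=4.1,hard' into {'rw': True, 'vers': '4.1', 'hard': True}."""
    parsed = {}
    for item in options.split(","):
        key, sep, value = item.partition("=")
        # Flags without '=' (rw, hard, intr...) are stored as True.
        parsed[key] = value if sep else True
    return parsed


def diff_options(left, right, ignore=("addr", "clientaddr")):
    """Return {key: (left_value, right_value)} for every key that differs."""
    keys = sorted((set(left) | set(right)) - set(ignore))
    return {k: (left.get(k), right.get(k)) for k in keys if left.get(k) != right.get(k)}


# Option strings copied from the message above.
EL7_OPTIONS = (
    "rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,"
    "proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=194.254.xx.xx,"
    "local_lock=none,addr=194.254.yy.yy"
)
EL6_OPTIONS = "rw,intr,hard,sloppy,vers=4,addr=194.254.xx.xx,clientaddr=194.254.yy.yy"

differences = diff_options(parse_mount_options(EL7_OPTIONS), parse_mount_options(EL6_OPTIONS))
for key, (el7_value, el6_value) in differences.items():
    print(f"{key}: EL7={el7_value} EL6={el6_value}")
```

A None on one side only means the option was not in the list quoted for that client, not that the kernel is running without it. The difference that matters for this thread is vers=4.1 on the CentOS 7 client versus vers=4 on the older ones.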
Patrick
On 03/07/2020 at 00:05, Orion Poplawski wrote:
On 6/1/20 3:08 AM, Patrick Bégou wrote:
On 13/05/2020 at 02:13, Orion Poplawski wrote:
On 5/12/20 2:46 AM, Patrick Bégou wrote:
Hi,
I need some help with NFSv4 setup/tuning. I have a dedicated NFS server (2 x E5-2620, 8 cores/16 threads each, 64GB RAM, 1x10Gb Ethernet and 16x 8TB HDD) used by two servers and a small cluster (400 cores). All the servers are running CentOS 7; the cluster is running CentOS 6.
From time to time on the server I get:
kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with incorrect client ID
And the client xxx.xxx.xxx.xxx freezes with:
kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
There is a discussion about this on the Red Hat 7 support site, but it is open only to subscribers. Other searches with Google do not provide useful information.
FYI - you can get access to such info with a free RHEL developers account.
Thanks for your suggestion. As the problem is back, I've subscribed to read the full content of this discussion.
The answer was "do not use antivirus" :-(. I do not use any antivirus, as I run CentOS only.
Patrick
Just curious to see if you have had any luck resolving these issues? I'm afraid that NFS on EL 7 has become much less stable for us recently as well with lots more client access hangs.
Orion
CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos