[CentOS] CentOS7 and NFS

Fri Aug 28 09:24:19 UTC 2020
Patrick Bégou <Patrick.Begou at legi.grenoble-inp.fr>

Hello,

I'm back with these NFS problems.
Server and client have both been updated, but the problem still arises
from time to time.

Server: Linux robin.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Client: Linux grivola.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Both run CentOS Linux release 7.8.2003 (Core).

It seems related to an scp session: the NFS client downloads a large
data set from a remote server and stores the files on its NFS file system.

On the client I get messages like these in /var/log/messages:

    Aug 28 10:03:08 grivola kernel: INFO: task scp:78495 blocked for more than 120 seconds.
    Aug 28 10:03:08 grivola kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Aug 28 10:03:08 grivola kernel: scp             D ffff97e37fa9acc0     0 78495 147369 0x00000084
    Aug 28 10:03:08 grivola kernel: Call Trace:
    Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
    Aug 28 10:03:08 grivola kernel: [<ffffffff92785da9>] schedule+0x29/0x70
    Aug 28 10:03:08 grivola kernel: [<ffffffff927838b1>] schedule_timeout+0x221/0x2d0
    Aug 28 10:03:08 grivola kernel: [<ffffffffc132e7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
    Aug 28 10:03:08 grivola kernel: [<ffffffffc133d850>] ? rpc_put_task+0x10/0x20 [sunrpc]
    Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
    Aug 28 10:03:08 grivola kernel: [<ffffffff9278549d>] io_schedule_timeout+0xad/0x130
    Aug 28 10:03:08 grivola kernel: [<ffffffff92785538>] io_schedule+0x18/0x20
    Aug 28 10:03:08 grivola kernel: [<ffffffff92783f01>] bit_wait_io+0x11/0x50
    Aug 28 10:03:08 grivola kernel: [<ffffffff92783a27>] __wait_on_bit+0x67/0x90
    Aug 28 10:03:08 grivola kernel: [<ffffffff921bd741>] wait_on_page_bit+0x81/0xa0
    Aug 28 10:03:08 grivola kernel: [<ffffffff920c7840>] ? wake_bit_function+0x40/0x40
    Aug 28 10:03:08 grivola kernel: [<ffffffff921bd871>] __filemap_fdatawait_range+0x111/0x190
    Aug 28 10:03:08 grivola kernel: [<ffffffff921bd904>] filemap_fdatawait_range+0x14/0x30
    Aug 28 10:03:08 grivola kernel: [<ffffffff921bd947>] filemap_fdatawait+0x27/0x30
    Aug 28 10:03:08 grivola kernel: [<ffffffff921bfd1c>] filemap_write_and_wait+0x4c/0x80
    Aug 28 10:03:08 grivola kernel: [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]
    Aug 28 10:03:08 grivola kernel: [<ffffffffc09700e0>] nfs_setattr+0x1f0/0x210 [nfs]
    Aug 28 10:03:08 grivola kernel: [<ffffffff9226cecc>] notify_change+0x30c/0x4d0
    Aug 28 10:03:08 grivola kernel: [<ffffffff9224af05>] do_truncate+0x75/0xc0
    Aug 28 10:03:08 grivola kernel: [<ffffffff92250118>] ? __sb_start_write+0x58/0x120
    Aug 28 10:03:08 grivola kernel: [<ffffffff9224b329>] do_sys_ftruncate.constprop.14+0x139/0x1a0
    Aug 28 10:03:08 grivola kernel: [<ffffffff9224b3ce>] SyS_ftruncate+0xe/0x10
    Aug 28 10:03:08 grivola kernel: [<ffffffff92792ed2>] system_call_fastpath+0x25/0x2a
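
Read from the bottom up, the trace shows scp calling ftruncate, which goes through nfs_setattr and nfs_wb_all and then sits in filemap_write_and_wait, waiting for pages to be written back. To make that path easier to see, here is a minimal Python sketch that pulls the function names out of such a trace. The helper name call_chain, the regular expression and the shortened sample are my own illustration, not an existing tool, and the pattern assumes the CentOS 7 frame format shown above.

```python
import re

# One frame of a CentOS 7 kernel call trace, for example
# "[<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]".
# A "?" after the address marks a frame the kernel considers unreliable.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)

def call_chain(trace_text, keep_unreliable=False):
    """Return (function, module) pairs, outermost caller first."""
    # Collapse all whitespace so frames still match if a mail client
    # wrapped the log lines.
    flat = " ".join(trace_text.split())
    frames = []
    for m in FRAME.finditer(flat):
        unreliable, func, module = m.group(1), m.group(2), m.group(3)
        if unreliable and not keep_unreliable:
            continue
        frames.append((func, module or "kernel"))
    # The kernel prints the innermost frame first, so reverse the list.
    return list(reversed(frames))

# Shortened copy of the trace above (hypothetical excerpt for the demo).
sample = """
Aug 28 10:03:08 grivola kernel: [<ffffffff921bfd1c>] filemap_write_and_wait+0x4c/0x80
Aug 28 10:03:08 grivola kernel: [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffffc09700e0>] nfs_setattr+0x1f0/0x210 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffff9226cecc>] notify_change+0x30c/0x4d0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224af05>] do_truncate+0x75/0xc0
Aug 28 10:03:08 grivola kernel: [<ffffffff92250118>] ? __sb_start_write+0x58/0x120
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b3ce>] SyS_ftruncate+0xe/0x10
"""

for func, module in call_chain(sample):
    print(f"{func} ({module})")
```

Run against the full excerpt from /var/log/messages, this should list the system call first and the wait functions last, with the [nfs] and [sunrpc] frames labelled by module.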

At that point the NFS server freezes. Even an ssh session or the local
console (via iDRAC or a screen and keyboard physically plugged into the
server) does not respond.

I see no particular messages on the NFS server. The freeze period ends with:

On the server:

    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
    Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID

and on the client:

    Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
    Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
    Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
    Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
    Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
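
To put a number on the freeze, the timestamps of the hung-task report and of the "server OK" message can be subtracted. The Python sketch below does this for the two client lines quoted above. It assumes both lines belong to the same episode, and the helper name syslog_time is hypothetical. Since the hung-task warning only appears after the task has already been blocked for more than 120 seconds, the result is a lower bound on how long scp was stuck.

```python
from datetime import datetime

def syslog_time(line, year=2020):
    """Parse the 'Aug 28 10:03:08' prefix of a syslog line.

    Classic syslog timestamps carry no year, so the caller supplies it
    (2020 here, taken from the date of this mail).
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

# The two client-side lines quoted in this mail, unwrapped.
hung = ("Aug 28 10:03:08 grivola kernel: INFO: task scp:78495 "
        "blocked for more than 120 seconds.")
back = ("Aug 28 10:20:26 grivola kernel: nfs: server "
        "robin.legi.grenoble-inp.fr OK")

gap = syslog_time(back) - syslog_time(hung)
print(gap)  # 0:17:18
```

So at least 17 minutes passed between the hung-task report and the recovery, on top of the 120 seconds or more that scp had already been blocked.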


I do not know how to investigate this.

Patrick

On 9 July 2020 at 12:11, Patrick Bégou wrote:
> Hi Orion,
>
> no, I still have this problem. I have put off working on it because the
> latest updates have not yet been installed on the server and the client.
> I'll get back to this problem as soon as possible.
>
> Thanks Charles for your detailed information on how to track this
> problem. I'll check all these metrics.
>
> I have several clients for this NFS server, and the problem seems to
> occur only with the client using NFS 4.1 on CentOS Linux release 7.7.1908 (Core).
> The default options used are:
> rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=194.254.xx.xx,local_lock=none,addr=194.254.yy.yy
>
> On older clients (Red Hat Enterprise Linux Server release 6.7
> (Santiago)) default options are:
> rw,intr,hard,sloppy,vers=4,addr=194.254.xx.xx,clientaddr=194.254.yy.yy
>
> The server runs CentOS 7.6.1810.
>
> I will see whether the latest updates help solve the problem.
>
> Patrick
>
> On 3 July 2020 at 00:05, Orion Poplawski wrote:
>> On 6/1/20 3:08 AM, Patrick Bégou wrote:
>>> On 13 May 2020 at 02:13, Orion Poplawski wrote:
>>>> On 5/12/20 2:46 AM, Patrick Bégou wrote:
>>>>> Hi,
>>>>>
>>>>> I need some help with NFSv4 setup/tuning. I have a dedicated NFS
>>>>> server (2 x E5-2620, 8 cores/16 threads each, 64GB RAM, 1x 10Gb
>>>>> Ethernet and 16x 8TB HDD) used by two servers and a small cluster
>>>>> (400 cores). All the servers are running CentOS 7; the cluster is
>>>>> running CentOS 6.
>>>>>
>>>>> From time to time on the server I get:
>>>>>
>>>>>        kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with
>>>>>       incorrect client ID
>>>>>
>>>>> And the client xxx.xxx.xxx.xxx freezes with:
>>>>>
>>>>>        kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding,
>>>>>       still trying
>>>>>        kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
>>>>>        kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding,
>>>>>       still trying
>>>>>        kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
>>>>>
>>>>> There is a discussion about this on Red Hat 7 support, but it is open
>>>>> only to subscribers. Other Google searches do not turn up useful
>>>>> information.
>>>> FYI - you can get access to such info with a free RHEL developers
>>>> account.
>>>>
>>>>
>>> Thanks for your suggestion. As the problem is back, I've subscribed to
>>> read the full content of this discussion.
>>>
>>> The answer was "do not use antivirus" :-(. I do not use any antivirus,
>>> as I run CentOS only.
>>>
>>> Patrick
>>>
>> Just curious to see if you have had any luck resolving these issues?
>> I'm afraid that NFS on EL 7 has become much less stable for us
>> recently as well with lots more client access hangs.
>>
>> Orion
>>