[CentOS] CentOS7 and NFS

Sat Jul 4 00:28:01 UTC 2020
cpolish at surewest.net <cpolish at surewest.net>

On 2020-06-01 11:08, Patrick Bégou wrote:
> >> I need some help with NFSv4 setup/tuning. I have a dedicated nfs server
> >> (2 x E5-2620  8cores/16 threads each, 64GB RAM, 1x10Gb ethernet and 16x
> >> 8TB HDD) used by two servers and a small cluster (400 cores). All the
> >> servers are running CentOS 7, the cluster is running CentOS6.
> >>
> >> Time to time on the server I get:
> >>
> >>       kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with
> >>      incorrect client ID

According to Red Hat Bugzilla [1], 2015-11-19:

    "testing state ID with incorrect client ID" means the server
    thinks a TEST_STATEID op was sent for a stateid associated with
    a client different from the client associated with the session
    over which the TEST_STATEID was sent. Perhaps this could be the
    result of some confusion in the server's data structures but the
    most straightforward explanation would be just that that's
    really what the client did (perhaps as a result of a bug in
    client recovery code?)

The above explanation is applicable, but unless you're running
a rather old kernel, that /particular/ bug is not.
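
If you want to rule that out anyway, the kernel package
changelog is a quick way to check whether a given stateid fix
was backported into your running CentOS kernel, e.g.:

    $ rpm -q --changelog kernel-$(uname -r) | grep -i stateid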

My understanding from the thread to date is that you've not
yet narrowed the problem down to the NFS server, the network,
the 2 server clients, or the cluster clients. In other words,
the corrupt client ID could be tendered by either of the 2
servers or by the cluster nodes, could be corrupted in transit
over the network, or could originate on the NFS server itself.
Correct?
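
One cheap first cut: the log message names the offending
client, so tallying those addresses on the NFS server will at
least show whether the reports cluster on the 2 servers, the
cluster nodes, or everything at once:

    # grep 'incorrect client ID' /var/log/messages | \
          grep -oE 'client [0-9.]+' | sort | uniq -c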

According to my notes from a class given by Ted Ts'o on NFS,
always start by checking network health. That should be easy
using interface statistics, e.g.:

    $ ifconfig eth3
    eth3 Link encap:Ethernet  HWaddr A0:36:9F:10:A9:06  
         inet addr:10.10.1.100  Bcast:10.10.1.255  Mask:255.255.255.0
         inet6 addr: fe80::a236:9fff:fe10:a906/64 Scope:Link
         UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
         RX packets:858295756  errors:0 dropped:0 overruns:0 frame:0
                               ^^^^^^^^
         TX packets:7090386023 errors:0 dropped:0 overruns:0 carrier:0
                               ^^^^^^^^
         collisions:0 txqueuelen:1000 
         RX bytes:495026510281 (461.0 GiB)  TX bytes:10475167734024 (9.5 TiB)

     $ ip --stats link show eth3
     eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
     link/ether a0:36:9f:10:a9:06 brd ff:ff:ff:ff:ff:ff
     RX: bytes      packets    errors  dropped overrun mcast   
     495027287320   858296399  0       0       0       112282  
                               ^^^^^^
     TX: bytes      packets    errors  dropped carrier collsns 
     10475167775376 7090386249 0       0       0       0       
                               ^^^^^^

Layer 2 stats can also be checked using ethtool:

     $ sudo ethtool --statistics eth3 | egrep 'dropped|errors'
     rx_errors: 0
     tx_errors: 0
     ...

If you've got a clean, healthy network, that leaves the clients
or the server. Maybe the clients are asking for the wrong ID.
To analyze the client ID given, you could capture traffic at 
the server using, perhaps:

    # tcpdump -i eth3 -W 10 -C 10 -w nfs_capture host <client-ipaddr> and port 2049
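
(-C 10 rotates the capture file at roughly 10 MB and -W 10 caps
the rotation at ten files, so this can be left running until the
error recurs without filling a disk. Adding "and port 2049"
restricts the capture to NFS traffic.)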

Then, using tshark or wireshark, see whether the client is
sending consistent client IDs. If so, that would exonerate the
clients, leaving as suspect the NFS daemon code in the Linux
kernel.
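
As a starting point, something like the following should pull
the NFSv4 client IDs out of the first rotated capture file (the
nfs.clientid field name is from my notes and may vary across
Wireshark releases; confirm it with "tshark -G fields | grep -i
clientid" first):

    $ tshark -r nfs_capture0 -Y 'nfs.clientid' \
          -T fields -e frame.time -e ip.src -e nfs.clientid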

Another point that Mr. Ts'o made (which it sounds like you
have covered) is: don't combine an NFS server with another
application or service. I mention this only because I'm
pedantic and obsessive, or maybe obsessively pedantic.

Also worth mentioning: consider specifying no_subtree_check
in your NFS exports. And Ts'o suggested (ca. 2012) using
fs_mark (available from the EPEL repository) to exercise
your file systems.
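
For reference, both look something like the below; the export
path, subnet, and fs_mark parameters are made-up placeholders,
so substitute your own:

    /etc/exports:
        /export/data  10.10.0.0/16(rw,sync,no_subtree_check)

    # exportfs -ra
    # mkdir -p /export/data/fstest
    # fs_mark -d /export/data/fstest -n 10000 -s 65536 -t 8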

Best of luck,
-- 
Charles Polisher

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1233284