On 2020-06-01 11:08, Patrick Bégou wrote:
I need some help with NFSv4 setup/tuning. I have a dedicated NFS server (2x E5-2620, 8 cores/16 threads each, 64GB RAM, 1x 10Gb Ethernet and 16x 8TB HDD) used by two servers and a small cluster (400 cores). The two servers are running CentOS 7; the cluster is running CentOS 6.
From time to time on the server I get:
kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with incorrect client ID
According to Red Hat Bugzilla [1], 2015-11-19:
"testing state ID with incorrect client ID" means the server thinks a TEST_STATEID op was sent for a stateid associated with a client different from the client associated with the session over which the TEST_STATEID was sent. Perhaps this could be the result of some confusion in the server's data structures but the most straightforward explanation would be just that that's really what the client did (perhaps as a result of a bug in client recovery code?)
The above explanation is applicable, but unless you're running a rather old kernel, that /particular/ bug is not.
My understanding of your issue from the thread to date is you've not yet narrowed the issue to the NFS server, the network, the 2 server clients, or the cluster clients. In other words, the corrupt client ID could be tendered by either of the 2 servers, or by the cluster clients, or could be corrupted in transit over the network, or could originate on the NFS server. Correct?
According to my notes from a class given by Ted Ts'o on NFS, always start by checking network health. That should be easy using interface statistics, e.g.:
$ ifconfig eth3
eth3      Link encap:Ethernet  HWaddr A0:36:9F:10:A9:06
          inet addr:10.10.1.100  Bcast:10.10.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe10:a906/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:858295756 errors:0 dropped:0 overruns:0 frame:0
                               ^^^^^^^^
          TX packets:7090386023 errors:0 dropped:0 overruns:0 carrier:0
                                ^^^^^^^^
          collisions:0 txqueuelen:1000
          RX bytes:495026510281 (461.0 GiB)  TX bytes:10475167734024 (9.5 TiB)
$ ip --stats link show eth3
eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:10:a9:06 brd ff:ff:ff:ff:ff:ff
    RX: bytes       packets    errors  dropped overrun mcast
    495027287320    858296399  0       0       0       112282
                               ^^^^^^
    TX: bytes       packets    errors  dropped carrier collsns
    10475167775376  7090386249 0       0       0       0
                               ^^^^^^

Layer 2 stats can also be checked using ethtool:
$ sudo ethtool --statistics eth3 | egrep 'dropped|errors'
     rx_errors: 0
     tx_errors: 0
     ...
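Layer-2 counters can look clean while TCP is still retransmitting, so it may also be worth a glance at the protocol counters on both the server and a client. A rough check (the exact counter names vary between kernel versions) would be:

$ netstat -s | egrep -i 'retrans|timeout'

A steadily climbing retransmission count on either end would point back at the network even when the interface error counters read zero.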
If you've got a clean, healthy network, that leaves the clients or the server. Maybe the clients are asking for the wrong ID. To analyze the client ID given, you could capture traffic at the server using, perhaps:
# tcpdump -W 10 -C 10 -w nfs_capture host <client-ipaddr>
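If the capture grows too quickly, a narrower variant should also work; this is just a sketch, assuming NFSv4 over TCP on the standard port 2049, with <iface> standing in for the server-side interface:

# tcpdump -i <iface> -s 0 -W 10 -C 10 -w nfs_capture 'host <client-ipaddr> and tcp port 2049'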
Then, using tshark or wireshark, see if the client is sending consistent client IDs. If so, that would exonerate the clients, leaving as suspect the NFS daemon code in the Linux kernel.
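For example, something along these lines could pull the client IDs out of each capture file. This is only a sketch; it assumes your tshark's NFSv4 dissector exposes the client ID as a field named nfs.clientid (verify the exact name with: tshark -G fields | grep -i clientid):

$ tshark -r <capture-file> -Y 'nfs.clientid' -T fields -e ip.src -e nfs.clientid | sort -u

One client ID per client address is what you'd hope to see; a client address showing several different IDs (outside of reboots or lease expiry) would point back at that client.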
Another point that Mr. Ts'o made (which it sounds like you have covered) is: don't combine an NFS server with another application or service. I mention this only because I'm pedantic and obsessive, or maybe obsessively pedantic.
Also worth mentioning: consider specifying no_subtree_check in your NFS exports. And Ts'o suggested (ca. 2012) using fs_mark (available from the EPEL repository) to exercise your file systems.
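For reference, a no_subtree_check export line might look something like this in /etc/exports (the path and client network here are placeholders, not your actual values):

/export/data    10.10.1.0/24(rw,sync,no_subtree_check)

And a quick fs_mark run against the exported filesystem could be as simple as the following sketch (directory, file count and size are arbitrary; check fs_mark's help output for the options your version supports):

$ fs_mark -d /export/data/fsmark -n 10000 -s 4096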
Best luck,