On 2020-06-01 11:08, Patrick Bégou wrote:
> >> I need some help with NFSv4 setup/tuning. I have a dedicated nfs server
> >> (2 x E5-2620 8cores/16 threads each, 64GB RAM, 1x10Gb ethernet and 16x
> >> 8TB HDD) used by two servers and a small cluster (400 cores). All the
> >> servers are running CentOS 7, the cluster is running CentOS6.
> >>
> >> Time to time on the server I get:
> >>
> >> kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with
> >> incorrect client ID

According to Red Hat Bugzilla [1], 2015-11-19:

    "testing state ID with incorrect client ID" means the server
    thinks a TEST_STATEID op was sent for a stateid associated with
    a client different from the client associated with the session
    over which the TEST_STATEID was sent. Perhaps this could be the
    result of some confusion in the server's data structures but the
    most straightforward explanation would be just that that's
    really what the client did (perhaps as a result of a bug in
    client recovery code?)

The explanation above applies to your log message, but unless
you're running a rather old kernel, that /particular/ bug does not.

My understanding from the thread to date is that you've not yet
narrowed the issue down to the NFS server, the network, the 2
server clients, or the cluster clients. In other words, the
bad client ID could be sent by either of the 2 servers or by
the cluster clients, could be corrupted in transit over the
network, or could originate on the NFS server. Correct?

According to my notes from a class given by Ted Ts'o on NFS,
always start by checking network health. That should be easy
using interface statistics, e.g.:

$ ifconfig eth3
eth3      Link encap:Ethernet  HWaddr A0:36:9F:10:A9:06
          inet addr:10.10.1.100  Bcast:10.10.1.255  Mask:255.255.255.0
          inet6 addr: fe80::a236:9fff:fe10:a906/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
          RX packets:858295756 errors:0 dropped:0 overruns:0 frame:0
                               ^^^^^^^^
          TX packets:7090386023 errors:0 dropped:0 overruns:0 carrier:0
                                ^^^^^^^^
          collisions:0 txqueuelen:1000
          RX bytes:495026510281 (461.0 GiB)  TX bytes:10475167734024 (9.5 TiB)

$ ip --stats link show eth3
eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
    link/ether a0:36:9f:10:a9:06 brd ff:ff:ff:ff:ff:ff
    RX: bytes       packets     errors  dropped  overrun  mcast
    495027287320    858296399   0       0        0        112282
                                ^^^^^^
    TX: bytes       packets     errors  dropped  carrier  collsns
    10475167775376  7090386249  0       0        0        0
                                ^^^^^^

Layer 2 stats can also be checked using ethtool:

$ sudo ethtool --statistics eth3 | grep -E 'dropped|errors'
     rx_errors: 0
     tx_errors: 0
     ...
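
To run the same check on every interface at once, the kernel's
per-interface counters under /sys/class/net/<iface>/statistics can be
scanned. The following is only a sketch: the function name nic_errors
is my own invention, and it assumes that counters whose names contain
"errors" or "dropped", plus "collisions", are the ones worth flagging.
It prints one line per non-zero counter and nothing at all on a clean
host.

```shell
#!/bin/sh
# nic_errors (hypothetical helper): print every non-zero error/drop
# counter found under a sysfs-style tree.
# Usage: nic_errors [root]    (root defaults to /sys/class/net)
nic_errors() {
    root=${1:-/sys/class/net}
    for f in "$root"/*/statistics/*; do
        [ -r "$f" ] || continue
        # Assumption: only these counter names indicate trouble.
        case ${f##*/} in
            *errors*|*dropped*|collisions) ;;
            *) continue ;;
        esac
        val=$(cat "$f")
        if [ "$val" != 0 ]; then
            iface=${f%/statistics/*}    # strip "/statistics/<counter>"
            printf '%s %s %s\n' "${iface##*/}" "${f##*/}" "$val"
        fi
    done
}
```

Running "nic_errors" with no argument on the NFS server and on a few
clients gives a quick comparison; a counter that keeps growing between
two runs is more interesting than one that is merely non-zero.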

If you've got a clean, healthy network, that leaves the clients
or the server. Maybe the clients are asking for the wrong ID.
To analyze the client ID given, you could capture traffic at
the server using, perhaps:

# tcpdump -W 10 -C 10 -w nfs_capture host <client-ipaddr>

Then, using tshark or wireshark, see if the client is sending
consistent client IDs. If so, that would exonerate the clients,
leaving the NFS daemon code in the Linux kernel as the suspect.
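
One way to do that comparison without paging through packets by hand
is to have tshark print the source address and client ID of each
relevant packet, then count distinct IDs per address. This is a
sketch, not a tested recipe: clientids_per_host is a name I made up,
and I am assuming the Wireshark field is called nfs.clientid (check
with "tshark -G fields | grep -i clientid" on your version).

```shell
#!/bin/sh
# clientids_per_host (hypothetical helper): read lines of the form
# "source-IP<TAB>client-ID" on stdin and print, for each source IP,
# the number of distinct client IDs seen with it.
clientids_per_host() {
    awk -F'\t' '
        $2 != "" && !(($1, $2) in seen) { seen[$1, $2] = 1; n[$1]++ }
        END { for (ip in n) print ip, n[ip] }
    ' | sort
}

# Intended use (assumes the field name nfs.clientid, see above):
#   for f in nfs_capture*; do
#       tshark -r "$f" -Y nfs.clientid -T fields -e ip.src -e nfs.clientid
#   done | clientids_per_host
```

An address that shows more than one ID is where I would look first,
though not proof of a bug: a client ID can legitimately change after
a server restart or a lease expiry. Replies from the server will show
up under the server's own address.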

Another point that Mr. Ts'o made (which it sounds like you
have covered) is: don't combine an NFS server with another
application or service. I mention this only because I'm
pedantic and obsessive, or maybe obsessively pedantic.

Also worth mentioning: consider specifying no_subtree_check
in your NFS exports. And Ts'o suggested (ca. 2012) using
fs_mark (available from the EPEL repository) to exercise
your file systems.
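
For the exports, an entry with the option spelled out might look like
"/export/data 10.10.1.0/24(rw,sync,no_subtree_check)", where the path
and subnet are made up for illustration. As far as I recall from
exports(5), no_subtree_check has been the default since nfs-utils
1.1.0, so stating it mostly documents intent. A small sketch to list
the entries that don't mention it (the function name is hypothetical,
and it ignores backslash-continued lines):

```shell
#!/bin/sh
# missing_no_subtree_check (hypothetical helper): print export lines
# that do not mention no_subtree_check.
# Usage: missing_no_subtree_check [file]   (defaults to /etc/exports)
missing_no_subtree_check() {
    # Drop blank lines and comments, then keep lines lacking the option.
    # Assumption: each export entry fits on a single line.
    grep -Ev '^[[:space:]]*(#|$)' "${1:-/etc/exports}" |
        grep -v 'no_subtree_check'
}
```

No output means every entry already names the option. After editing
/etc/exports, "exportfs -ra" re-exports without restarting the server.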

Best luck,
--
Charles Polisher

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1233284