CentOS7 and NFS

Patrick Bégou

12 May 2020

8:46 a.m.

Hi,

I need some help with NFSv4 setup/tuning. I have a dedicated NFS server (2 x E5-2620, 8 cores/16 threads each, 64GB RAM, 1 x 10Gb Ethernet and 16 x 8TB HDD) used by two servers and a small cluster (400 cores). All the servers are running CentOS 7; the cluster is running CentOS 6.

From time to time, I get this on the server:

  kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with incorrect client ID

And the client xxx.xxx.xxx.xxx freezes with:

  kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
  kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
  kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
  kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK

There is a discussion on Red Hat 7 support about this, but it is only open to subscribers. Other Google searches do not provide useful information.

Do you have any idea how to solve these freezes?

More generally, I would be really interested in some advice/tutorials on improving NFS performance in this dedicated context. There are so many [different] things about NFS tuning available on the web that I'm a little bit lost (the opposite of the previous question). So if someone has "the tutorial"... ;-)

Thanks

Patrick

James Pearson

12 May

2:10 p.m.

How many nfsd threads are you running on the server? - current count will be in /proc/fs/nfsd/threads
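For anyone who wants to script this check, here is a minimal sketch in Python. It assumes a Linux host where the NFS server is running (otherwise the file does not exist); the helper name `read_nfsd_threads` is made up for illustration.

```python
from pathlib import Path


def read_nfsd_threads(path="/proc/fs/nfsd/threads"):
    """Return the number of running nfsd threads as an int.

    The default path is the one mentioned above; it only exists on a
    host where the NFS server is running.
    """
    return int(Path(path).read_text().strip())


if __name__ == "__main__":
    try:
        print("nfsd threads:", read_nfsd_threads())
    except OSError as exc:
        # Not an NFS server, or the nfsd filesystem is not mounted.
        print("could not read thread count:", exc)
```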

James Pearson

Patrick Bégou

8:19 p.m.

Hi James,

Thanks for your answer. I've configured 24 threads (for 16 hardware cores/32 threads on the NFS server with these processors).

But it seems there are also buffer settings to modify when increasing the number of threads... That has not been done.

The load average on the server is below 1...

Patrick

Orion Poplawski

13 May

12:13 a.m.

On 5/12/20 2:46 AM, Patrick Bégou wrote:

> There is a discussion on Red Hat 7 support about this, but it is only open to subscribers. Other Google searches do not provide useful information.

FYI - you can get access to such info with a free RHEL developers account.

--
Orion Poplawski
Manager of NWRA Technical Systems          720-772-5637
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       orion@nwra.com
Boulder, CO 80301                 https://www.nwra.com/

Simon Matter

5:32 a.m.

I'd be very careful with thread counts higher than the number of physical cores. NFS threads and so-called CPU hyper/simultaneous threads are quite different things, and it can hurt performance if not configured correctly.

Regards, Simon

Patrick Bégou

1:36 p.m.

So you suggest limiting the setup to 16 daemons? I'll try this evening.

Patrick

Patrick Bégou

15 May

7:26 a.m.

Setting 16 daemons (the number of physical cores) does not solve the problem. Moreover, I saw an (old) document from Dell on optimizing NFS server performance in an HPC context, and they suggest using... 128 daemons on a dedicated PowerEdge server. :-\

I noticed that it is always the same client showing the problem (a large fat node), so maybe I should investigate the client side more than the server side.

Patrick

Barbara Krašovec

1:32 p.m.

The number of threads has nothing to do with the number of cores on the machine. It depends on the I/O, network speed, type of workload etc. We usually start with 32 threads and increase if necessary.

You can check the statistics with:

  watch 'cat /proc/net/rpc/nfsd | grep th'

Or you can check on the client:

  nfsstat -rc
  Client rpc stats:
  calls      retrans    authrefrsh
  1326777974 0          1326645701

If you see a large number of retransmissions, you should increase the number of threads.
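To watch the retransmission count over time, the `nfsstat -rc` output can be parsed with a few lines of Python. This is only a sketch: it assumes the three-column layout shown above, the function names are illustrative, and what counts as a "large" ratio is left to the reader since no threshold is given here.

```python
def parse_nfsstat_rc(text):
    """Parse `nfsstat -rc` output and return (calls, retrans).

    Expects the layout shown above: a header line starting with
    "calls retrans" followed by a line of numbers.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    for header, values in zip(lines, lines[1:]):
        if header[:2] == ["calls", "retrans"]:
            return int(values[0]), int(values[1])
    raise ValueError("no 'calls retrans' header found")


def retrans_ratio(calls, retrans):
    """Fraction of RPC calls that had to be retransmitted."""
    return retrans / calls if calls else 0.0


if __name__ == "__main__":
    # Sample copied from the output quoted above.
    sample = """Client rpc stats:
calls      retrans    authrefrsh
1326777974 0          1326645701
"""
    calls, retrans = parse_nfsstat_rc(sample)
    print(f"calls={calls} retrans={retrans} "
          f"ratio={retrans_ratio(calls, retrans):.6f}")
```

On a real client, the text would come from running `nfsstat -rc` (for example through `subprocess.run`) instead of the hard-coded sample.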

However, your problem could also be related to the filesystem or network.

Do you have jumbo frames (if yes, you should have them on clients and server)? You might think about disabling flow control on the switch and on the network card. Are there a lot of dropped packets?

For network tuning, check http://fasterdata.es.net/host-tuning/linux/

Did you try to enable readahead (blockdev --setra) on the block device backing the filesystem?

On the client side, changing the mount options helps. The default read/write block size is quite small; increase it (rsize, wsize), and use noatime.

Cheers, Barbara


Patrick Bégou

16 May

9:41 a.m.

Hi Barbara,

Thanks for all these suggestions. Yes, jumbo frames are enabled, and there are only two 10Gb Ethernet switches between the server and the client, connected with single-mode fiber. I saw yesterday that the client showing the problem did not have the right MTU (1500 instead of 9000). I don't know why. I changed the MTU to 9000 yesterday and I'm now watching the logs to see if the problem occurs again.
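An MTU mismatch like this one is easy to miss, so it may be worth checking every client automatically. A minimal sketch in Python, assuming a Linux host (the MTU is read from sysfs); the interface name `eth0` and the function names are placeholders, not taken from this setup.

```python
from pathlib import Path


def read_mtu(iface, sysfs_root="/sys/class/net"):
    """Return the MTU of a network interface as reported by sysfs."""
    return int((Path(sysfs_root) / iface / "mtu").read_text().strip())


def mtu_matches(iface, expected=9000, sysfs_root="/sys/class/net"):
    """True if the interface MTU equals the expected value."""
    return read_mtu(iface, sysfs_root) == expected


if __name__ == "__main__":
    iface = "eth0"  # placeholder: replace with the interface used for NFS
    try:
        mtu = read_mtu(iface)
        status = "ok" if mtu == 9000 else "MISMATCH (expected 9000)"
        print(f"{iface}: mtu {mtu} {status}")
    except OSError as exc:
        print(f"could not read MTU for {iface}: {exc}")
```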

I will try to increase the number of NFS daemons in a few days, so as to check each setup change one after the other. Because of COVID-19 I'm working from home, so I have to be really careful when changing the setup of the servers.

On a cluster node I tried setting "rsize=1048576,wsize=1048576,vers=4,tcp" (I cannot set a larger value for rsize/wsize), but a comparison with a mount using the default setup does not show significant improvement. I sent 20GB to the server, or 2 x 10GB (2 concurrent processes), with dd, so as to be larger than the RAID controller cache but smaller than the server and client RAM. It was just a short test this morning.
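When comparing mount options, it can help to confirm what the client actually ended up with, since the effective values are listed in /proc/mounts. The following Python sketch parses that file; the function name and the sample line in the comment are hypothetical and only illustrate the format.

```python
def nfs_mount_options(mounts_text):
    """Map each NFS mount point to a dict of its mount options.

    `mounts_text` is the content of /proc/mounts. Options without a
    value (such as "rw") are stored with the value True.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for opt in fields[3].split(","):
            key, sep, value = opt.partition("=")
            opts[key] = value if sep else True
        result[fields[1]] = opts
    return result


if __name__ == "__main__":
    # Hypothetical example of a line this parser handles:
    # server:/export /mnt/data nfs4 rw,vers=4.1,rsize=1048576,wsize=1048576 0 0
    try:
        with open("/proc/mounts") as fh:
            mounts = nfs_mount_options(fh.read())
    except OSError as exc:
        mounts = {}
        print("could not read /proc/mounts:", exc)
    for mount_point, opts in mounts.items():
        print(mount_point, "vers=%s rsize=%s wsize=%s" % (
            opts.get("vers"), opts.get("rsize"), opts.get("wsize")))
```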

Patrick


Strahil Nikolov

3:39 p.m.

Hi, why don't you let the client negotiate the version itself? pNFS requires at least v4.1 and can bring extra performance.

P.S.: According to the man pages, 'vers' 'is an alternative to the nfsvers option. It is included for compatibility with other operating systems.' I have always used 'nfsvers' :).

Best Regards, Strahil Nikolov

Patrick Bégou

1 Jun

9:08 a.m.

On 13/05/2020 at 02:13, Orion Poplawski wrote:

> FYI - you can get access to such info with a free RHEL developers account.

Thanks for your suggestion. As the problem is back, I've signed up to read the full content of this discussion.

The answer was "do not use antivirus" :-(. I do not use an antivirus, as I run CentOS only.

Patrick

Orion Poplawski

2 Jul

10:05 p.m.

Just curious to see if you have had any luck resolving these issues? I'm afraid that NFS on EL 7 has become much less stable for us recently as well with lots more client access hangs.

Orion


cpolish＠surewest.net

4 Jul

12:28 a.m.

On 2020-06-01 11:08, Patrick Bégou wrote:

> kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with incorrect client ID

According to Red Hat Bugzilla [1], 2015-11-19:

"testing state ID with incorrect client ID" means the server thinks a TEST_STATEID op was sent for a stateid associated with a client different from the client associated with the session over which the TEST_STATEID was sent. Perhaps this could be the result of some confusion in the server's data structures but the most straightforward explanation would be just that that's really what the client did (perhaps as a result of a bug in client recovery code?)

The above explanation is applicable but unless you're running a rather old kernel that /particular/ bug is not.

My understanding of your issue from the thread to date is you've not yet narrowed the issue to the NFS server, the network, the 2 server clients, or the cluster clients. In other words, the corrupt client ID could be tendered by either of the 2 servers, or by the cluster clients, or could be corrupted in transit over the network, or could originate on the NFS server. Correct?

According to my notes from a class given by Ted Ts'o on NFS, always start by checking network health. That should be easy using interface statistics, e.g.:

    $ ifconfig eth3
    eth3      Link encap:Ethernet  HWaddr A0:36:9F:10:A9:06
              inet addr:10.10.1.100  Bcast:10.10.1.255  Mask:255.255.255.0
              inet6 addr: fe80::a236:9fff:fe10:a906/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
              RX packets:858295756 errors:0 dropped:0 overruns:0 frame:0
                                   ^^^^^^^^
              TX packets:7090386023 errors:0 dropped:0 overruns:0 carrier:0
                                    ^^^^^^^^
              collisions:0 txqueuelen:1000
              RX bytes:495026510281 (461.0 GiB)  TX bytes:10475167734024 (9.5 TiB)

    $ ip --stats link show eth3
    eth3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP qlen 1000
        link/ether a0:36:9f:10:a9:06 brd ff:ff:ff:ff:ff:ff
        RX: bytes      packets    errors  dropped overrun mcast
        495027287320   858296399  0       0       0       112282
                                  ^^^^^^
        TX: bytes      packets    errors  dropped carrier collsns
        10475167775376 7090386249 0       0       0       0
                                  ^^^^^^

Layer 2 stats can also be checked using ethtool:

    $ sudo ethtool --statistics eth3 | egrep 'dropped|errors'
    rx_errors: 0
    tx_errors: 0
    ...
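The same counters are exposed as plain files under /sys/class/net/<interface>/statistics/, so the check can be scripted across all interfaces. The following is only a minimal sketch: the function name nic_error_counters is made up for illustration, and the optional directory argument exists only so the function can be tried against a copy of the tree.

```shell
# Print "<interface> <counter> <value>" for every non-zero error, drop or
# collision counter found below a sysfs-style tree.
# $1: base directory (assumption: defaults to /sys/class/net).
nic_error_counters() {
    base=${1:-/sys/class/net}
    for f in "$base"/*/statistics/*; do
        [ -r "$f" ] || continue
        # Only look at counters whose name suggests a problem.
        case ${f##*/} in
            *errors*|*dropped*|collisions) ;;
            *) continue ;;
        esac
        value=$(cat "$f")
        if [ "$value" -ne 0 ]; then
            iface=${f%/statistics/*}
            printf '%s %s %s\n' "${iface##*/}" "${f##*/}" "$value"
        fi
    done
}
```

Running nic_error_counters with no argument on the server and on each client, and getting no output, would suggest the interfaces themselves are not reporting errors or drops. It says nothing about switches in between.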

If you've got a clean, healthy network, that leaves the clients or the server. Maybe the clients are asking for the wrong ID. To analyze the client ID given, you could capture traffic at the server using, perhaps:

# tcpdump -W 10 -C 10 -w nfs_capture host <client-ipaddr>

Then, using tshark or wireshark, see if the client is sending consistent client IDs. If so, that would exonerate the clients, leaving the NFS daemon code in the Linux kernel as the suspect.
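A rough way to do that comparison from the command line is sketched below. Treat it as an assumption-laden example, not a tested recipe: the display field name nfs.clientid should be checked against the Wireshark display filter reference for your version, <capture-file> stands for one of the files written by tcpdump, and distinct_clientids is a hypothetical helper name.

```shell
# Step 1 (needs tshark and a capture file, so it is shown as a comment only):
#   tshark -r <capture-file> -Y 'nfs.clientid' -T fields -e ip.src -e nfs.clientid
#
# Step 2: read that two-column output on stdin and report, per source
# address, how many distinct client IDs were seen.
distinct_clientids() {
    awk 'NF >= 2 && !seen[$1 SUBSEP $2]++ { n[$1]++ }
         END { for (ip in n) print ip, n[ip] }' | sort
}
```

A count above 1 for one address is not proof of a bug by itself, since a client obtains a new client ID after a reboot or an expired lease, but it tells you where to look in the capture.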

Another point that Mr. Ts'o made (which it sounds like you have covered) is, don't combine an NFS server with another application or service. I mention this only because I'm pedantic and obsessive, or maybe obsessively pedantic.

Also worth mentioning: consider specifying no_subtree_check in your NFS exports. And Ts'o suggested (ca. 2012) using fs_mark (available from the EPEL repository) to exercise your file systems.
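To see which exports currently set neither no_subtree_check nor subtree_check, something like the following sketch could be used. It assumes the simple "path client(options) ..." layout of /etc/exports and does not handle continuation lines, default-option tokens starting with "-", or paths containing spaces. The function name is made up for illustration.

```shell
# Print "<path> <client-spec>" for every client entry in an exports file
# that mentions neither no_subtree_check nor subtree_check.
# $1: exports file to inspect (for example /etc/exports).
exports_missing_subtree_opt() {
    awk '/^[ \t]*(#|$)/ { next }          # skip comments and blank lines
         { for (i = 2; i <= NF; i++)
               if ($i !~ /subtree_check/) print $1, $i }' "$1"
}
```

For example, a line such as "/srv/b 10.0.0.0/24(rw,sync)" (a hypothetical export) would be reported, while "/srv/a 10.0.0.0/24(rw,sync,no_subtree_check)" would not.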

Best luck,

-- Charles Polisher

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1233284

Patrick Bégou

9 Jul

10:11 a.m.

Hi Orion,

No, I still have this problem. I have delayed working on it because the latest updates have not yet been installed on the server and on the client. I'll work on this problem again as soon as possible.

Thanks Charles for your detailed information on how to track this problem. I'll check all these metrics.

I have several clients for this NFS server, and the problem seems to occur only on the client using NFS 4.1 under CentOS Linux release 7.7.1908 (Core). The default options used are:

rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=194.254.xx.xx,local_lock=none,addr=194.254.yy.yy

On older clients (Red Hat Enterprise Linux Server release 6.7 (Santiago)) the default options are:

rw,intr,hard,sloppy,vers=4,addr=194.254.xx.xx,clientaddr=194.254.yy.yy

The server is CentOS 7.6.1810.
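To compare what each client has actually negotiated, the effective options can be read from /proc/mounts on every machine. The sketch below is only an illustration (the function name nfs_mount_versions is made up); it prints each NFS mount point with the value of its vers= or nfsvers= option.

```shell
# List NFS mount points with their protocol version.
# $1: mounts table to read (defaults to /proc/mounts).
nfs_mount_versions() {
    awk '$3 == "nfs" || $3 == "nfs4" {
             vers = "unknown"
             n = split($4, opt, ",")
             for (i = 1; i <= n; i++)
                 if (opt[i] ~ /^(vers|nfsvers)=/) {
                     sub(/^[^=]*=/, "", opt[i])   # keep only the value
                     vers = opt[i]
                 }
             print $2, vers
         }' "${1:-/proc/mounts}"
}
```

Collecting this output from all clients makes it easy to confirm whether only the vers=4.1 mounts are the ones that hang.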

Will see if the latest updates help to solve the problem.

Patrick

On 03/07/2020 at 00:05, Orion Poplawski wrote:

On 6/1/20 3:08 AM, Patrick Bégou wrote:

On 13/05/2020 at 02:13, Orion Poplawski wrote:

On 5/12/20 2:46 AM, Patrick Bégou wrote:

Hi,

I need some help with NFSv4 setup/tuning. I have a dedicated nfs server (2 x E5-2620 8cores/16 threads each, 64GB RAM, 1x10Gb ethernet and 16x 8TB HDD) used by two servers and a small cluster (400 cores). All the servers are running CentOS 7, the cluster is running CentOS6.

From time to time on the server I get:

kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with incorrect client ID

And the client xxx.xxx.xxx.xxx freezes with:

kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK

There is a discussion on RedHat7 support about this but only open to subscribers. Other searches with google do not provide useful information.

FYI - you can get access to such info with a free RHEL developers account.

Thanks for your suggestion. As the problem is back I've subscribed to reach the full content of this discussion.

The answer was "do not use antivirus" :-(. I do not use antivirus as I am CentOS only.

Patrick

Just curious to see if you have had any luck resolving these issues? I'm afraid that NFS on EL 7 has become much less stable for us recently as well with lots more client access hangs.

Orion

Patrick Bégou

28 Aug

9:24 a.m.

Hello,

I'm back with these NFS problems. Server and client have been updated, but the problem still arises from time to time.

The server is:
Linux robin.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

The client is:
Linux grivola.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Both run CentOS Linux release 7.8.2003 (Core).

It seems related to an scp session: the NFS client downloads a large data set from a remote server and stores the files on its NFS file system.

On the client I have such messages in /var/log/messages:

Aug 28 10:03:08 grivola kernel: INFO: task scp:78495 blocked for more than 120 seconds.
Aug 28 10:03:08 grivola kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 28 10:03:08 grivola kernel: scp D ffff97e37fa9acc0 0 78495 147369 0x00000084
Aug 28 10:03:08 grivola kernel: Call Trace:
Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff92785da9>] schedule+0x29/0x70
Aug 28 10:03:08 grivola kernel: [<ffffffff927838b1>] schedule_timeout+0x221/0x2d0
Aug 28 10:03:08 grivola kernel: [<ffffffffc132e7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
Aug 28 10:03:08 grivola kernel: [<ffffffffc133d850>] ? rpc_put_task+0x10/0x20 [sunrpc]
Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff9278549d>] io_schedule_timeout+0xad/0x130
Aug 28 10:03:08 grivola kernel: [<ffffffff92785538>] io_schedule+0x18/0x20
Aug 28 10:03:08 grivola kernel: [<ffffffff92783f01>] bit_wait_io+0x11/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff92783a27>] __wait_on_bit+0x67/0x90
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd741>] wait_on_page_bit+0x81/0xa0
Aug 28 10:03:08 grivola kernel: [<ffffffff920c7840>] ? wake_bit_function+0x40/0x40
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd871>] __filemap_fdatawait_range+0x111/0x190
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd904>] filemap_fdatawait_range+0x14/0x30
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd947>] filemap_fdatawait+0x27/0x30
Aug 28 10:03:08 grivola kernel: [<ffffffff921bfd1c>] filemap_write_and_wait+0x4c/0x80
Aug 28 10:03:08 grivola kernel: [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffffc09700e0>] nfs_setattr+0x1f0/0x210 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffff9226cecc>] notify_change+0x30c/0x4d0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224af05>] do_truncate+0x75/0xc0
Aug 28 10:03:08 grivola kernel: [<ffffffff92250118>] ? __sb_start_write+0x58/0x120
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b329>] do_sys_ftruncate.constprop.14+0x139/0x1a0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b3ce>] SyS_ftruncate+0xe/0x10
Aug 28 10:03:08 grivola kernel: [<ffffffff92792ed2>] system_call_fastpath+0x25/0x2a

At this point the NFS server freezes. Even an SSH session or the local console (via iDRAC or a screen/keyboard physically plugged into the server) does not work.

I have no special messages on the NFS server. The freeze period ends with:

On the server:

Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID

and on the client:

Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK

I do not know how to investigate this....

Patrick
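One possible starting point for lining up the client and server logs is to count the relevant kernel messages per minute in copies of /var/log/messages from both machines. The sketch below assumes the traditional syslog timestamp layout shown in the excerpts above; the function name and the event labels (hung_task, not_responding, ok, bad_stateid) are made up for illustration.

```shell
# Count NFS-related kernel messages per minute.
# Arguments: one or more syslog-format files (client and server logs).
# Output: "<month> <day> <HH:MM> <event> <count>", sorted.
nfs_events_per_minute() {
    awk '{
             kind = ""
             if ($0 ~ /not responding, still trying/)                    kind = "not_responding"
             else if ($0 ~ /kernel: nfs: server .* OK/)                  kind = "ok"
             else if ($0 ~ /testing state ID with incorrect client ID/)  kind = "bad_stateid"
             else if ($0 ~ /blocked for more than/)                      kind = "hung_task"
             # $3 is HH:MM:SS; keep HH:MM so events group by minute.
             if (kind != "") count[$1 " " $2 " " substr($3, 1, 5) " " kind]++
         }
         END { for (k in count) print k, count[k] }' "$@" | sort
}
```

Called as nfs_events_per_minute client-messages server-messages (two hypothetical file names), it shows at a glance whether the hung-task reports, the "not responding" periods and the state ID messages fall in the same minutes.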


CentOS mailing list CentOS@centos.org https://lists.centos.org/mailman/listinfo/centos

discuss@lists.centos.org

14 comments

participants (7)

Barbara Krašovec
cpolish＠surewest.net
James Pearson
Orion Poplawski
Patrick Bégou
Simon Matter
Strahil Nikolov