Hi,
First of all, please CC me on any reply, as I am only subscribed to the digest.
OK. Here is the problem. Said kernel (from 4.4) seems to have problems with file locking when the system is under high, likely network-related, load. The symptoms are that things using file locking (rpm, the user-space automounter amd) fail to obtain locks, usually reporting timeout problems.
The system in question is an HP/DL380G4 with dual single-core EM64T CPUs and 8 GB of memory. The network interfaces are "tg3". It happens with both CentOS and RHEL4.
The high load can be triggered by copying three 3 GB files in parallel from an NFS server (Solaris10, NFS, TCP, 1GBit) to another NFS server (RHEL4, NFS, TCP, 100 MBit). The measured network performance is OK. During this operation the system goes to load averages around or above 10. Overall responsiveness feels good, but software doing file locking, or things like opening a new ssh connection, takes extremely long.
So, if anyone has an idea or hint, it will be highly appreciated.
Cheers Martin
------------------------------------------------------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
On Monday 27 November 2006 10:54, Martin Knoblauch wrote:
> First of all, please CC me on any reply, as I am only subscribed to the digest.
> OK. Here is the problem. Said kernel (from 4.4) seems to have problems with file locking when the system is under high, likely network-related, load. The symptoms are that things using file locking (rpm, the user-space automounter amd) fail to obtain locks, usually reporting timeout problems.
> The system in question is an HP/DL380G4 with dual single-core EM64T CPUs and 8 GB of memory. The network interfaces are "tg3". It happens with both CentOS and RHEL4.
> The high load can be triggered by copying three 3 GB files in parallel from an NFS server (Solaris10, NFS, TCP, 1GBit) to another NFS server (RHEL4, NFS, TCP, 100 MBit). The measured network performance is OK. During this operation the system goes to load averages around or above 10. Overall responsiveness feels good, but software doing file locking, or things like opening a new ssh connection, takes extremely long.
> So, if anyone has an idea or hint, it will be highly appreciated.
NFS has known problems with flock. The flock(2) man page specifically notes this. Which file locking mechanism (flock or fcntl) does your system use predominantly (that is, how do the applications that use NFS lock their files)?
NFS v4 has made some major strides towards better locking, but it's been long enough since I dealt with this that I'm not sure if it actually solves anything (although it looks like it does). You might want to try NFS v4 if possible.
--- Kevan Benson kbenson@a-1networks.com wrote:
> On Monday 27 November 2006 10:54, Martin Knoblauch wrote:
> > First of all, please CC me on any reply, as I am only subscribed to the digest.
> > OK. Here is the problem. Said kernel (from 4.4) seems to have problems with file locking when the system is under high, likely network-related, load. The symptoms are that things using file locking (rpm, the user-space automounter amd) fail to obtain locks, usually reporting timeout problems.
> > The system in question is an HP/DL380G4 with dual single-core EM64T CPUs and 8 GB of memory. The network interfaces are "tg3". It happens with both CentOS and RHEL4.
> > The high load can be triggered by copying three 3 GB files in parallel from an NFS server (Solaris10, NFS, TCP, 1GBit) to another NFS server (RHEL4, NFS, TCP, 100 MBit). The measured network performance is OK. During this operation the system goes to load averages around or above 10. Overall responsiveness feels good, but software doing file locking, or things like opening a new ssh connection, takes extremely long.
> > So, if anyone has an idea or hint, it will be highly appreciated.
Hi Kevan,
> NFS has known problems with flock. The flock(2) man page specifically notes this. Which file locking mechanism (flock or fcntl) does your system use predominantly (that is, how do the applications that use NFS lock their files)?
Well, amd uses fcntl to lock /etc/mtab. But I do not believe that the locking problem is restricted to NFS. To me it looks like all kinds of files are affected.
> NFS v4 has made some major strides towards better locking, but it's been long enough since I dealt with this that I'm not sure if it actually solves anything (although it looks like it does). You might want to try NFS v4 if possible.
Due to the environment (a very high number of potential NFS servers, all running NFSv3) V4 is not an option. And, as I said, the locking problems occur on local files. NFS is only related to the high-load condition that seems to accompany the locking problems.
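One way to confirm that lock acquisition on a local file really is slow under this load is to time it directly. The following is only a minimal sketch under stated assumptions: the function name `timed_fcntl_lock` and its `timeout` and `poll_interval` parameters are made up for illustration, and it uses Python's `fcntl.lockf`, which wraps the fcntl() record-locking calls (the mechanism amd is said to use above). It has not been run on the system described in this thread.

```python
import errno
import fcntl
import os
import tempfile
import time


def timed_fcntl_lock(path, timeout=5.0, poll_interval=0.05):
    """Try to take an exclusive fcntl-style lock on `path`.

    Returns (acquired, seconds_waited). Hypothetical helper for diagnosis only.
    """
    start = time.monotonic()
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        while True:
            try:
                # Non-blocking attempt; raises OSError if another process holds the lock.
                fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except OSError as exc:
                if exc.errno not in (errno.EACCES, errno.EAGAIN):
                    raise
                if time.monotonic() - start >= timeout:
                    return False, time.monotonic() - start
                time.sleep(poll_interval)
            else:
                # Got the lock: release it again and report how long it took.
                fcntl.lockf(fd, fcntl.LOCK_UN)
                return True, time.monotonic() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Use a scratch file on a local filesystem, not a real system file.
    with tempfile.NamedTemporaryFile() as scratch:
        acquired, waited = timed_fcntl_lock(scratch.name, timeout=2.0)
        print(f"acquired={acquired} waited={waited:.3f}s")
```

Running this once on an idle system and once during the parallel copy would show whether an uncontended lock on a local file takes noticeably longer under load.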
Cheers Martin
------------------------------------------------------ Martin Knoblauch email: k n o b i AT knobisoft DOT de www: http://www.knobisoft.de
On Monday 27 November 2006 15:42, Martin Knoblauch wrote:
> Due to the environment (a very high number of potential NFS servers, all running NFSv3) V4 is not an option. And, as I said, the locking problems occur on local files. NFS is only related to the high-load condition that seems to accompany the locking problems.
Ah, I misinterpreted what you meant. Are there other processes besides the copy of three large files that might be keeping lots of files open? Have you checked the file allocation limits in proc (/proc/sys/fs/file-nr)? I hear that with 2.6 kernels you rarely have to tweak these settings, but it might be worth looking into if you haven't already.
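For reference, /proc/sys/fs/file-nr holds three numbers: the allocated file handles, the allocated but unused handles, and the system-wide maximum. A small sketch for reading it during the copy might look like this; the function names `parse_file_nr` and `read_file_nr` are hypothetical and only meant as an illustration.

```python
import os


def parse_file_nr(text):
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,          # file handles currently allocated
        "unused": unused,                # allocated but not in use
        "max": maximum,                  # system-wide limit (fs.file-max)
        "in_use": allocated - unused,    # derived value, for convenience
    }


def read_file_nr(path="/proc/sys/fs/file-nr"):
    """Read and parse the live values (Linux only)."""
    with open(path) as handle:
        return parse_file_nr(handle.read())


if __name__ == "__main__":
    if os.path.exists("/proc/sys/fs/file-nr"):
        print(read_file_nr())
```

If the allocated count stays far below the maximum while the locks are timing out, file handle exhaustion is unlikely to be the cause.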