Reading the "waiting IOs" thread made me remember I have a similar problem that has been here for months, and I have no sulution yet.
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations. The weird thing is that the number of these operations increases over time. I have an mrtg graph (which I didn't want to attach here) showing, for example, 200 NFS ops on Monday, measured with filer-mrtg, climbing in a straight line to around 1200 within days. nfsstat -l on the filer proves beyond doubt that the load is caused by this particular machine. dstat shows me which NFS operations are causing it.
date/time     | null gatr satr look aces ...
10-09 12:22:52|    0    0    0    0    0
10-09 12:22:53|    0  525    0  602  602
10-09 12:22:54|    0 1275    0 1464 1438
10-09 12:22:55|    0    0    0    0    0
10-09 12:22:56|    0    0    0    0    0
10-09 12:22:57|    0    0    0    0    0
10-09 12:22:58|    0  238    0  270  270
10-09 12:22:59|    0 1461    0 1663 1660
10-09 12:23:00|    0  205    0  133  114
10-09 12:23:01|    0    0    0    0    0
10-09 12:23:02|    0    1    0    0    0
10-09 12:23:03|    0    0    0    0    0
10-09 12:23:04|    0 1411    0 1574 1574
10-09 12:23:05|    0  498    0  465  466
10-09 12:23:06|    0    0    0    0    0
10-09 12:23:07|    0    0    0    0    0
10-09 12:23:08|    0    0    0    0    0
10-09 12:23:09|    0 1082    0 1178 1192
10-09 12:23:10|    0  790    0  885  865
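(For anyone trying to reproduce this kind of per-operation breakdown: a rough sketch of the commands involved, assuming dstat's nfs3 ops plugin is available on the client - the plugin name can differ between dstat versions.)

  dstat --time --nfs3-ops 1     # per-second NFS v3 client op counts on the suspect machine
  nfsstat -l                    # per-client totals, run on the NetApp filer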
This behaviour is somehow tied to the Gnome desktop. I have other machines running CentOS 5.2 x86_64 (at init level 3) which don't show this behaviour. I also have CentOS 5.2 i386 machines which don't show it either. None of the other machines on the LAN show it - RHEL3 32- and 64-bit, Solaris.
What I'd need is a monitoring tool that can tie the NFS ops to process ids or applications. lsof isn't nearly as helpful here as I thought. I even copied this workstation user's files to another account, logged in and ran the same apps - and couldn't reproduce it.
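(A rough sketch of one way to get closer to per-process attribution on a stock CentOS 5 client; rpcdebug only logs kernel-level NFS requests, not the owning PID, so it mainly helps correlate timing with strace output from suspect processes.)

  rpcdebug -m nfs -s all        # enable NFS client debug logging (very noisy, goes to /var/log/messages)
  sleep 60
  rpcdebug -m nfs -c all        # switch it off again before the log fills up
  # then count the metadata syscalls a suspect process makes
  strace -c -f -e trace=stat,lstat,access,open -p <suspect-pid>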
Ideas? Essentially, this makes CentOS 64-bit undeployable in our environment.
lhecking@users.sourceforge.net wrote:
Reading the "waiting IOs" thread made me remember I have a similar problem that has been here for months, and I have no sulution yet.
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations. The weird thing is that the number of these operations increases over time. I have an mrtg graph (which I didn't want to attach here) showing e.g. 200 NFS Ops on Monday, measured with filer-mrtg, going up to, e.g. 1200 in a straight line within days. nfsstat -l on the filer proves beyond doubt that the load is caused by this particular machine. dstat shows me which NFS operations are causing it.
date/time | null gatr satr look aces ... 10-09 12:22:52| 0 0 0 0 0 10-09 12:22:53| 0 525 0 602 602 10-09 12:22:54| 0 1275 0 1464 1438 10-09 12:22:55| 0 0 0 0 0 10-09 12:22:56| 0 0 0 0 0 10-09 12:22:57| 0 0 0 0 0 10-09 12:22:58| 0 238 0 270 270 10-09 12:22:59| 0 1461 0 1663 1660 10-09 12:23:00| 0 205 0 133 114 10-09 12:23:01| 0 0 0 0 0 10-09 12:23:02| 0 1 0 0 0 10-09 12:23:03| 0 0 0 0 0 10-09 12:23:04| 0 1411 0 1574 1574 10-09 12:23:05| 0 498 0 465 466 10-09 12:23:06| 0 0 0 0 0 10-09 12:23:07| 0 0 0 0 0 10-09 12:23:08| 0 0 0 0 0 10-09 12:23:09| 0 1082 0 1178 1192 10-09 12:23:10| 0 790 0 885 865
This behaviour is somehow tied to the Gnome desktop. I have other machines running CentOS 5.2 x86_64 (at init level 3) which don't show this behaviour. I also have CentOS 5.2 i386 machines which don't show it either. None of the other machines on the lan show it - RHEL3 32 and 64bit, Solaris.
What I'd need is a monitoring tool than can tie the NFS ops to process ids or applications. lsof isn't nearly as helpful here as I thought. I even copied this workstation user's files to another account, logged in and ran the same apps - and couldn't reproduce it.
Ideas? Essentially, this makes CentOS 64bit undeployable in our environemnt.
Do you have anything running that would try to read all the files and build a search index - like beagle? There's also the nightly run of updatedb but that just reads the filenames and normally nfs mounts are excluded.
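(Quick ways to check for both on a stock CentOS 5 box; paths are the usual defaults.)

  rpm -q beagle                          # is the beagle indexer installed at all?
  grep PRUNEFS /etc/updatedb.conf        # nfs/nfs4 should normally be listed here
  ls /etc/cron.daily/ | grep -i locate   # shows the mlocate cron job that runs updatedb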
-- Les Mikesell lesmikesell@gmail.com
Do you have anything running that would try to read all the files and build a search index - like beagle? There's also the nightly run of updatedb but that just reads the filenames and normally nfs mounts are excluded.
There is no beagle package installed, and I don't know whether any other software doing this is part of a standard CentOS install. It's definitely not updatedb: mlocate.cron runs once a day in the early morning, but the load pattern we see is a continuous increase.
At Thu, 10 Sep 2009 13:56:31 +0100 CentOS mailing list centos@centos.org wrote:
Do you have anything running that would try to read all the files and build a search index - like beagle? There's also the nightly run of updatedb but that just reads the filenames and normally nfs mounts are excluded.
There is no package beagle installed, I don't know if any other software doing this is part of a standard CentOS install. Definitely not updatedb, mlocate.cron runs once a day in the early morning, but the load pattern we see is a continuous increase.
What IS running on the problem machine? Is it a web server? Someone's desktop? (If it is a personal desktop box, what is the person running -- are they running find all of the time? Or doing some 'let's update 5 zillion files now' type of task?) A database server? Something else?
What IS running on the problem machine? Is it a web server? Someone's Desktop? (If it is a personal Desktop box, what is the person running -- are they running find all of the time? Or doing some 'lets update 5 zillion files now' type of task?) A database server? Something else?
It's a plain desktop. User is running standard desktop apps like firefox, thunderbird, vmware, and EDA tools.
On 09/10/2009 04:28 PM, Lars Hecking wrote: ...
It's a plain desktop. User is running standard desktop apps like firefox, thunderbird, vmware, and EDA tools.
Which version of thunderbird? Version 3 killed our NFS server when ~/.thunderbird is accessed via NFS.
Mogens
Mogens Kjaer writes:
On 09/10/2009 04:28 PM, Lars Hecking wrote: ...
It's a plain desktop. User is running standard desktop apps like firefox, thunderbird, vmware, and EDA tools.
Which version of thunderbird? Version 3 killed our NFS server when ~/.thunderbird is accessed via NFS.
thunderbird-2.0.0.19-1.el5.centos
lhecking@users.sourceforge.net wrote:
Do you have anything running that would try to read all the files and build a search index - like beagle? There's also the nightly run of updatedb but that just reads the filenames and normally nfs mounts are excluded.
There is no package beagle installed, I don't know if any other software doing this is part of a standard CentOS install. Definitely not updatedb, mlocate.cron runs once a day in the early morning, but the load pattern we see is a continuous increase.
We had a similar issue with 100 CentOS 5 and Fedora 7 desktops mounting their $HOME directories from a CentOS 4 server. We would see a steady (perfectly linear) increase of getattr and lookup requests from the time users logged in until they shut off their machines (logging off stopped the linear growth but didn't always bring the number of requests down). Running hundreds of dstats and straces finally showed that the gamin package on each of the clients was causing all of the requests, and simply killing that single process would instantly drop the getattr requests from 200 a second down to the 3 or 4 a second it should be. That was 200 per client, so you can imagine how bad it would get! We rebuilt the gamin-0.1.9-5.rpm package and deployed it to all of the machines. We instantly saw improvement and we currently average 3 getattr requests a second. I don't know if this will help your situation, but maybe someone will benefit.
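(A rough sketch of how to check a suspect client for the same thing; gam_server is gamin's per-user helper, and killing it is harmless as a test since it gets respawned on demand.)

  pgrep -lf gam_server                  # is gamin's helper running for the logged-in user?
  # count the metadata syscalls it makes over a short window
  strace -c -e trace=stat,lstat,getdents -p $(pgrep -u "$USER" gam_server | head -1)
  pkill -u "$USER" gam_server           # kill it as a test and watch the filer graphs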
Chris
[...] Running hundreds of dstats and straces finally showed that the gamin package on each of the clients was causing all of the requests [...] We rebuilt the gamin-0.1.9-5.rpm package and deployed it to all of the machines. [...]
Intriguing. Chris, did you have this problem with all architectures? Did you apply any patches when rebuilding gamin?
On Thu, Sep 10, 2009 at 10:14 AM, Chris Murphy chris@castlebranch.com wrote:
[...] We rebuilt the gamin-0.1.9-5.rpm package and deployed it to all of the machines. We instantly saw improvement and we currently average 3 getattr requests a second. [...]
How about the gamin patch you rebuilt with?
Or at least a bugzilla entry with it...
-Ross
lhecking@users.sourceforge.net wrote:
Reading the "waiting IOs" thread made me remember I have a similar problem that has been here for months, and I have no sulution yet.
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations. The weird thing is
There was a kernel update in the 5.2/5.3 time frame that fixed an NFS client bug regarding lookups; what kernel are you running?
Have you tried running lsof on the client side to see which processes are using the files served over NFS?
nate
nate writes:
lhecking@users.sourceforge.net wrote:
Reading the "waiting IOs" thread made me remember I have a similar problem that has been here for months, and I have no sulution yet.
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations. The weird thing is
There was a kernel update in the 5.2/5.3 time frame that fixed a NFS client bug regarding lookups, what kernel are you running?
2.6.18-92.1.22.el5. I can test all newer kernels if necessary.
Have you tried running lsof on the client side to see which processes are using the files served over NFS?
Yes, but there's simply too much output to make this useful, and most of it, about 75%, is related to EDA software installed on the filer, i.e. third-party software we have no control over and whose internals we don't know.
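(One way to make the lsof output more digestible, sketched below; -N restricts lsof to NFS files, and this only shows files held open, not the getattr/lookup traffic itself, but it at least ranks which processes hold the most.)

  lsof -N | awk 'NR > 1 {print $1, $2}' | sort | uniq -c | sort -rn | head -20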
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations.
There was a kernel update in the 5.2/5.3 time frame that fixed a NFS client bug regarding lookups, what kernel are you running?
2.6.18-92.1.22.el5. I can test all newer kernels if necessary.
Please update your CentOS installation ("yum update"). It will probably fix your NFS issues, and your current system contains serious security problems.
/jens
Lars Hecking wrote:
2.6.18-92.1.22.el5. I can test all newer kernels if necessary.
Sounds like another poster may have found the root cause, but you should still upgrade anyway. I tracked down the specific update to this kernel, which is newer than what you have:
* Thu Jul 03 2008 Aristeu Rozanski <arozansk@redhat.com> [2.6.18-95.el5]
[..]
- [nfs] address nfs rewrite performance regression in RHEL5 (Eric Sandeen) [436004]
[..]
nate
nate writes:
Lars Hecking wrote:
2.6.18-92.1.22.el5. I can test all newer kernels if necessary.
Sounds like another poster may have found the root cause, but you should still upgrade anyway. I tracked down the specific update to this kernel, which is newer than what you have:
* Thu Jul 03 2008 Aristeu Rozanski <arozansk@redhat.com> [2.6.18-95.el5]
[..]
- [nfs] address nfs rewrite performance regression in RHEL5 (Eric Sandeen) [436004]
[..]
This is an enterprise-wide setup I cannot change, but I will be able to deploy a newer kernel. It'll have to wait until I return to the office in a few weeks' time, though.
John R Pierce writes:
Lars Hecking wrote:
This is an enterprise-wide setup I cannot change, but I will be able to deploy a newer kernel. It'll have to wait until I return to the office in a few weeks' time, though.
an enterprise-wide setup that can't get regular security patches?!?
Regular as in "patch as soon as available", no, but if the current kernel helps with this problem, we'll roll it out. Before moving up to CentOS 5.4 ;)
On Thu, 2009-09-10 at 17:28 +0100, Lars Hecking wrote:
John R Pierce writes:
Lars Hecking wrote:
This is an enterprise-wide setup I cannot change, but I will be able to deploy a newer kernel. It'll have to wait until I return to the office in a few weeks' time, though.
an enterprise-wide setup that can't get regular security patches?!?
Regular as in "patch as soon as available", no, but if the current kernel helps with this problem, we'll roll it out. Before moving up to CentOS 5.4 ;)
So you are moving directly from 5.2 to 5.4, skipping 5.3? Because the kernel you are running is from 5.2.
Ralph
On Thursday 10 September 2009, Lars Hecking wrote:
John R Pierce writes:
Lars Hecking wrote:
This is an enterprise-wide setup I cannot change, but I will be able to deploy a newer kernel. It'll have to wait until I return to the office in a few weeks' time, though.
an enterprise-wide setup that can't get regular security patches?!?
Regular as in "patch as soon as available", no, but if the current kernel helps with this problem, we'll roll it out.
...it will at least help with the small issue of: "any user can trivially become root"
/Peter
John R Pierce wrote:
Lars Hecking wrote:
This is an enterprise-wide setup I cannot change, but I will be able to deploy a newer kernel. It'll have to wait until I return to the office in a few weeks' time, though.
an enterprise-wide setup that can't get regular security patches?!?
Yes, that seems to miss the point of using an 'enterprise' distribution, where a lot of work goes into ensuring that updates don't cause surprises.
Lars Hecking wrote:
This is an enterprise-wide setup I cannot change, but I will be able to deploy a newer kernel. It'll have to wait until I return to the office in a few weeks' time, though.
If you get any flak for deploying a newer kernel, remind whoever gives it to you that the kernel with that particular fix is more than a year old.
I too run older software myself; the most up-to-date systems here are on CentOS 5.2 or 4.6. We still have some older RHEL 4.1 systems in place, though (from before my time here). At least we were able to retire the RHEL 3 systems that hadn't seen an update in probably 3-4 years.
I do for the most part keep the kernels more up to date though since they are pretty portable and often contain fixes I care about more (driver updates etc). Security is less of a concern in our mostly protected environment.
I do plan to address all of it, it's just not a high priority.
nate
nate wrote:
Lars Hecking wrote:
2.6.18-92.1.22.el5. I can test all newer kernels if necessary.
Sounds like another poster may have found the root cause, but you should still upgrade anyway. I tracked down the specific update to this kernel, which is newer than what you have:
Whoops, sorry, that was the wrong update; the kernel you have already has the fix I was referring to:
* Tue Feb 05 2008 Don Zickus <dzickus@redhat.com> [2.6.18-78.el5]
[..]
- [nfs] reduce number of wire RPC ops, increase perf (Peter Staubach) [321111]
[..]
https://bugzilla.redhat.com/show_bug.cgi?id=321111
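(A quick way to confirm that an installed kernel already carries a given fix, assuming that exact kernel package is still installed:)

  rpm -q --changelog kernel-2.6.18-92.1.22.el5 | grep -B 2 321111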
But it (usually) can't hurt to upgrade to something more recent anyway.
nate
lhecking@users.sourceforge.net writes: [...]
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations. [...]
Thanks for all the replies. I believe we have found the culprit.
First, updating the CentOS kernel did not help.
I am now >99% certain that the problem was caused by the XScreenSaver "Phosphor" screensaver running in one or more vnc sessions to RHEL3 machines on the CentOS5 desktop. The screensaver was customised to run a perl script in the user's account that generates random quotes. In any case, disabling this screensaver under RHEL3 appears to have solved our problem, with about 5 days' worth of monitoring data to support this.
This is definitely a weird interaction, as neither the screensaver nor its components actually run on the CentOS machine. I have not checked whether any other activities in a vnc session cause similar behaviour.
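(Roughly the kind of ~/.xscreensaver programs entry involved - the script path is purely illustrative; phosphor's -program option is what makes the screensaver run, and keep re-running, an external command.)

  phosphor -scale 2 -program '/home/user/bin/random-quote.pl'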
At Fri, 2 Oct 2009 13:11:19 +0100 CentOS mailing list centos@centos.org wrote:
lhecking@users.sourceforge.net writes: [...]
A single CentOS 5.2 x86_64 machine here is overloading our NetApp filer with excessive NFS getattr, lookup and access operations. [...]
Thanks for all the replies. I believe we have found the culprit.
First, updating the CentOS kernel did not help.
I am now >99% certain that the problem was caused by the XScreenSaver "Phosphor" screensaver running in one or more vnc sessions to RHEL3 machines on the CentOS5 desktop. The screensaver was customised to run a perl script in the user's account that generates random quotes. In any case, disabling this screensaver under RHEL3 appears to have solved our problem, with about 5 days' worth of monitoring data to support this.
This is definitely a weird interaction, as neither the screensaver nor its components actually run on the CentOS machine. I have not checked whether any other activities in a vnc session cause similar behaviour.
Where do the screensaver's data files live (e.g. where are the quotes stored)? If on the CentOS machine, then it is simply that the screensaver is making lots of NFS I/O operations.
This is definitely a weird interaction, as neither the screensaver nor its components actually run on the CentOS machine. I have not checked whether any other activities in a vnc session cause similar behaviour.
Where does the screensaver's data files (eg where are the quotes stored) live? If on the CentOS machine, then it is simply that the screensaver is making lots of NFS I/O operations.
On the filer. But this retrieval script runs on the RHEL3 box(es).
At Fri, 2 Oct 2009 16:55:51 +0100 CentOS mailing list centos@centos.org wrote:
This is definitely a weird interaction, as neither the screensaver nor its components actually run on the CentOS machine. I have not checked whether any other activities in a vnc session cause similar behaviour.
Where does the screensaver's data files (eg where are the quotes stored) live? If on the CentOS machine, then it is simply that the screensaver is making lots of NFS I/O operations.
On the filer. But this retrieval script runs on the RHEL3 box(es).
The retrieval script is/was hitting the file server to fetch the quotes. It probably was doing something dumb and not caching the data file. This file access was beating on your file server.
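(A sketch of the sort of fix being suggested: copy the quote file to local disk once and pick random lines from the local copy instead of re-reading it over NFS every cycle. Paths are hypothetical.)

  QUOTES=$HOME/quotes.txt       # lives in the NFS-mounted home directory
  CACHE=/tmp/quotes.$USER       # local copy, refreshed only when missing
  [ -s "$CACHE" ] || cp "$QUOTES" "$CACHE"
  # print one random line from the local copy
  awk 'BEGIN { srand() } { line[NR] = $0 } END { print line[int(rand() * NR) + 1] }' "$CACHE"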
Robert Heller wrote:
This is definitely a weird interaction, as neither the screensaver nor its components actually run on the CentOS machine. I have not checked whether any other activities in a vnc session cause similar behaviour.
Where does the screensaver's data files (eg where are the quotes stored) live? If on the CentOS machine, then it is simply that the screensaver is making lots of NFS I/O operations.
On the filer. But this retrieval script runs on the RHEL3 box(es).
The retrieval script is/was hitting on the file server to fetch the quotes. It probably was doing something dumb and not caching the datafile. This file access was beating on your file server.
Seems odd that caching wouldn't just happen naturally in the nfs client. Maybe it is updating the access time on each read or something that causes the activity.
At Fri, 02 Oct 2009 12:57:16 -0500 CentOS mailing list centos@centos.org wrote:
Robert Heller wrote:
This is definitely a weird interaction, as neither the screensaver nor its components actually run on the CentOS machine. I have not checked whether any other activities in a vnc session cause similar behaviour.
Where does the screensaver's data files (eg where are the quotes stored) live? If on the CentOS machine, then it is simply that the screensaver is making lots of NFS I/O operations.
On the filer. But this retrieval script runs on the RHEL3 box(es).
The retrieval script is/was hitting on the file server to fetch the quotes. It probably was doing something dumb and not caching the datafile. This file access was beating on your file server.
Seems odd that caching wouldn't just happen naturally in the nfs client.
I am not sure if it even makes sense to cache NFS files on an NFS client -- how does the client know that the file might not have changed on the server? At the very least it has to check the file mod times on the server to be sure its local cache is valid.
Maybe it is updating the access time on each read or something that causes the activity.
It is either re-reading the files or checking mod times to determine if the local cached copy is valid. Either way, lots of traffic.
Robert Heller wrote:
Seems odd that caching wouldn't just happen naturally in the nfs client.
I am not sure if it even makes sense to cache NFS files on a nfs client -- how does the client know that the file might not have changed on the server? At the very least it has to check the file mod times on the server to be sure its local cache is valid.
Pretty much all filesystems cache, and would be unusably slow if they didn't. File attribute caching on NFS should only be a few seconds, but that should be enough to avoid killing your server.
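(The attribute cache lifetime is tunable per NFS mount; a sketch with an illustrative export path - the defaults are roughly acregmin=3,acregmax=60,acdirmin=30,acdirmax=60 seconds.)

  mount -o remount,actimeo=60 filer:/vol/home /home   # sets all four attribute timeouts to 60s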
Maybe it is updating the access time on each read or something that causes the activity.
It is either re-reading the files or checking mod times to determine if the local cached copy is valid. Either way, lots of traffic.
And this was hundreds of ops/second?
Maybe it is updating the access time on each read or something that causes the activity.
It is either re-reading the files or checking mod times to determine if the local cached copy is valid. Either way, lots of traffic.
And this was hundreds of ops/second?
I need to point out that the retrieval script was running on a RHEL3 machine. The filer NFS load was registered from a CentOS5 machine, and the only connection between the two was one or more vnc sessions. I cannot explain what happened, other than the observation that the excessive NFS access from the CentOS machine stopped when we disabled that screensaver on RHEL3.