Hi list!
We have a very busy webserver hosted in a clustered environment where the document root and data live on a GFS2 partition on a fiber-attached disk array.
At busy moments, I can see in htop and nmon that a fair percentage of CPU time is spent waiting for I/O. In nmon, I can see that the busiest block device corresponds to our GFS2 partition; much of the time it shows as 100% busy, doing reads the whole time.
Now, I want to know which files are being waited for. With lsof I can get a listing of open files, but it doesn't tell me whether a file is simply open (sitting in RAM) or is actually being waited on...
What tools besides lsof, nmon, htop and atop can help me find that information?
I am on RHEL/CentOS 6.1.
Thanks
Hi Nicolas,
While this doesn't exactly answer your question, I was wondering what I/O scheduler you are using on the block device behind your GFS2 volume (note: I have not used this file system before). You can find out by issuing 'cat /sys/block/<block device>/queue/scheduler'.
By default the system uses cfq, which will show up as [cfq] when catting the scheduler as shown above. This is not the optimal scheduler for a webserver; in most cases you'd be better off with deadline or noop. Not being familiar with GFS2 myself, I did skim this article, which makes me think noop would be the better choice:
http://www.redhat.com/archives/linux-cluster/2010-June/msg00027.html
This could be why you are seeing the processes waiting on I/O.
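For example, to check it and try deadline at runtime (sdb below is just a placeholder for whatever device backs your GFS2 volume, and I haven't tested this on GFS2; the runtime change does not survive a reboot):

  cat /sys/block/sdb/queue/scheduler               # e.g.: noop anticipatory deadline [cfq]
  echo deadline > /sys/block/sdb/queue/scheduler   # as root; add elevator=deadline to the kernel boot line (or a udev rule) to make it persistent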
Chad M. Gross
In my case, /sys/block/dm-9/queue/scheduler shows "none" and /sys/block/sdb/queue/scheduler shows "noop anticipatory deadline [cfq]".
Since this is a production cluster, I do not want to make changes to it just now; I will ask RHEL support for advice before setting this.
But that seems logical.
In the meantime, I'd still like to find a tool to see which files are requested from the filesystem and which ones are being waited for...
Thanks
atop and iotop are tools that do that... when the kernel has been appropriately patched or is of an appropriate version...
I did use these utilities. While they can help show which processes are generating I/O, they don't show which files on the file system are being requested or waited for.
Basically, what I'm searching for is an equivalent of fs_usage on Mac OS X, or of tcpdump but for a block device...
Thanks,
Not sure what those do, but lsof should show what files are open, and 'strace -p process_id' would show the system calls issued by a process.
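For what it's worth, lsof can also be pointed at a mount point to list only the files open on that filesystem (the path is a placeholder):

  lsof /path/to/gfs2/mountpoint

That narrows the listing down, though it still won't tell you which of those files a process is actually blocked on.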
Thanks, that might be useful. I'll just have to find a way to strace multiple processes at once and find the useful info among that load of data...
Nicolas
Note that if what is really happening is that different processes are frequently accessing the same disk in different locations (a fairly likely scenario), the time will mostly be taken by the head seeks in between and may be hard to pin down.
I found that the -f option to strace attaches to the forked children of a parent process, so I will be using that in my debugging, in conjunction with -e to filter out only the calls I want to see...
But indeed, that might be hard to pin down. In one case, I want to see what files are opened/accessed on a GFS2 volume over a Fibre Channel link to a RAID-1 array, and the controller is supposed to be intelligent enough to distribute the read access across the two disks. In the other case, it's an SSD, so seek time should be essentially zero.
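Something along these lines should keep the output manageable (the PID and output file are placeholders; -tt adds timestamps and -T shows the time spent in each call, so slow reads stand out):

  strace -f -tt -T -e trace=open,read,write,close -p <parent pid> -o /tmp/strace.out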
Not sure how gfs2 deals with client caching, but in other scenarios it's probably easier to just throw a lot of ram in the system and let the filesystem cache do its job. You still have to deal with applications that need to fsync(), though.
Our nodes all have 12 GB of DDR3 RAM, which should be plenty. The node running the application I'm dealing with has about half of it used.
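On these boxes, free -m shows how much of that "used" figure is actually page cache (the -/+ buffers/cache line), which is the part doing the read caching:

  free -m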
Yes, but how does gfs2 deal with filesystem caching? There must be some restriction and overhead to keep it consistent across nodes.
Yes indeed. AFAIK, when reading (which is mostly our case), a node reads the data from disk and is able to cache it; when another node wants to write to that file, it must tell the other nodes to flush their cache for it. But that is only my understanding of the mechanics of glocks; I might be wrong.
I did open a ticket with RH to help find the source of the I/O contention.
Regards,
You could try systemtap: http://sourceware.org/systemtap/examples/
In your case this script could be useful: http://sourceware.org/systemtap/examples/io/iotime.stp
"The script watches each open, close, read, and write syscalls on the system. For each file the scripts observes opened it accumulates the amount of wall clock time spent in read and write operations and the number of bytes read and written. When a file is closed the script prints out a pair of lines for the file. Both lines begin with a timestamp in microseconds, the PID number, and the executable name in parentheses. The first line with the "access" keyword lists the file name, the attempted number of bytes for the read and write operations. The second line with the "iotime" keyword list the file name and the number of microseconds accumulated in the read and write syscalls."
Regards, Dennis
You could try iotop; I am told it's good at showing both files and processes under high I/O or wait.
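For example, the following shows only the processes actually doing I/O, refreshing every 5 seconds (it needs a kernel with per-task I/O accounting, which the stock RHEL 6 kernel has):

  iotop -o -d 5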
Switching to 'deadline' for a cluster file system (or any file server) is always a good idea, as CFQ is designed to give equal weight to the processes running on a system; kernel processes, remote processes and disk arrays were not factored into the equation.
-Ross