I'm trying to resolve an I/O problem on a CentOS 5.6 server. The process basically scans through Maildirs, checking for space usage and quota. Because there are a hundred-odd user folders and several tens of thousands of small files, this sends the I/O wait percentage way up. The server hits a very high load level and stops responding to other requests until the crawl is done.
I am wondering: if I add another disk and symlink the sub-directories to it, would that free up the server to respond to other requests despite the wait on that disk?
Alternatively, if I mdraid-mirror the existing disk, would md be smart enough to read from the other disk while the first is tied up with the scanning process?
On 06/09/2011 02:24 AM, Emmanuel Noobadmin wrote:
I'm trying to resolve an I/O problem on a CentOS 5.6 server. The process basically scans through Maildirs, checking for space usage and quota. Because there are a hundred-odd user folders and several tens of thousands of small files, this sends the I/O wait percentage way up. The server hits a very high load level and stops responding to other requests until the crawl is done.
I am wondering: if I add another disk and symlink the sub-directories to it, would that free up the server to respond to other requests despite the wait on that disk?
Alternatively, if I mdraid-mirror the existing disk, would md be smart enough to read from the other disk while the first is tied up with the scanning process?
You should look at running your process using 'ionice -c3 program'. That way it won't starve everything else for I/O cycles. Also, you may want to experiment with using the 'deadline' elevator instead of the default 'cfq' (see http://www.redhat.com/magazine/008jun05/features/schedulers/ and http://www.wlug.org.nz/LinuxIoScheduler). Neither of those would require you to change your hardware out. Also, setting 'noatime' in the mount options for the partition holding the files will reduce the number of required I/Os quite a lot.
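For example (the device, mount point, and program names below are only placeholders for whatever you actually use):

    # Run the crawl in the idle I/O class so other processes get served first
    ionice -c3 /usr/local/bin/maildir-scan

    # See which elevator is active for the disk, then switch it to deadline
    cat /sys/block/sda/queue/scheduler
    echo deadline > /sys/block/sda/queue/scheduler

    # Remount the filesystem holding the Maildirs without atime updates
    mount -o remount,noatime /var/spool/mail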
But yes, in general, distributing your load across more disks should improve your I/O profile.
On 6/9/11, Benjamin Franz jfranz@freerun.com wrote:
You should look at running your process using 'ionice -c3 program'. That way it won't starve everything else for I/O cycles. Also, you may want to experiment with using the 'deadline' elevator instead of the default 'cfq' (see http://www.redhat.com/magazine/008jun05/features/schedulers/ and http://www.wlug.org.nz/LinuxIoScheduler). Neither of those would require you to change your hardware out. Also, setting 'noatime' in the mount options for the partition holding the files will reduce the number of required I/Os quite a lot.
Thanks for pointing out noatime; I came across it in my reading previously but it never sank in. This experience is definitely going to make sure of that :)
The crawl process is started by another program: crond starts the program, and the program either starts the email crawl or takes other, more crucial action depending on the situation, so I'm unsure if I should run it with ionice since that could potentially cause the more crucial action to lag/slow down.
But I'll give it a try anyway over the weekend, when any negative effects will have lesser consequences, and see if it affects other things.
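One possibility might be to leave the wrapper itself alone and only demote the crawl step inside it, roughly like this (the decision logic and paths are made-up placeholders, not the real program):

    #!/bin/bash
    # Sketch of the wrapper crond runs: only the crawl branch is demoted
    # to the idle I/O class; the crucial action keeps its normal priority.
    if crawl_is_needed; then                     # placeholder for the real check
        ionice -c3 nice -n 19 /usr/local/bin/maildir-scan
    else
        /usr/local/bin/crucial-action
    fi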
But yes, in general, distributing your load across more disks should improve your I/O profile.
I'm going with noatime and ionice first to see the impact before I start playing around with the I/O scheduler. If all else fails, then I'll see about requesting the extra hard disk.
On 06/09/11 11:48, Emmanuel Noobadmin wrote:
I'm going with noatime and ionice first
Did you set noatime on the host filesystem and/or the VM filesystem?
I would think noatime on the VM would provide more benefit than on the host... shrug. Now my brain hurts. Gee, thanks. (:
On 6/10/11, Steven Tardy sjt5@its.msstate.edu wrote:
Did you set noatime on the host filesystem and/or the VM filesystem? I would think noatime on the VM would provide more benefit than on the host... shrug. Now my brain hurts. Gee, thanks. (:
I was trying it on the host first, thinking that would cut the writes roughly in half since the host wouldn't have to update the atime every time the disk files are accessed.
But now that you brought it up, I'm wondering if that would have been pointless. If the kernel considers KVM opening the disk file and holding onto it as a single access, regardless of how many subsequent reads/writes there are, then this wouldn't make a difference, would it?
On 06/09/2011 08:21 PM, Emmanuel Noobadmin wrote:
But now that you brought it up, I'm wondering if that would have been pointless. If the kernel considers KVM opening the disk file and holding onto it as a single access, regardless of how many subsequent reads/writes there are, then this wouldn't make a difference, would it?
atime and mtime are updated for *every* read and write operation, not for the open() of the file.
That aside, if you're running KVM I strongly recommend using LVM rather than file-backed VM guests. It's more work to set up, but you'll see drastically better IO performance in the guests. One system that I measured had a write speed of around 8 MB/s for sequential block output on file-backed VMs. LVM backed VMs wrote at around 56 MB/s for sequential block output.
You should *never* use file-backed VMs for production systems.
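A rough sketch of the LVM-backed setup; the volume group, LV and guest names here are examples only:

    # Carve out a logical volume for the guest
    lvcreate -L 20G -n guest1 VolGroup00

    # Point the guest's disk at the LV in its libvirt XML, using virtio
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source dev='/dev/VolGroup00/guest1'/>
      <target dev='vda' bus='virtio'/>
    </disk>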
On 6/10/11, Gordon Messmer yinyang@eburg.com wrote:
atime and mtime are updated for *every* read and write operation, not for the open() of the file.
OK. In any case, the combination of noatime and ionice on the cronjob seems to have helped: no lock-ups in the past 24 hours. But it is a Saturday here so that might just be due to light usage; keeping fingers crossed.
That aside, if you're running KVM I strongly recommend using LVM rather than file-backed VM guests. It's more work to set up, but you'll see drastically better IO performance in the guests. One system that I measured had a write speed of around 8 MB/s for sequential block output on file-backed VMs. LVM backed VMs wrote at around 56 MB/s for sequential block output.
You should *never* use file-backed VMs for production systems.
The irony of it was that I decided to go with qcow2 because I thought that would avoid the overhead of an additional LVM layer while still providing snapshot capabilities :(
Since I don't have enough spare space left on this particular system, I'll probably have to get them to agree to add an extra disk for the LVM volumes, then figure out how to migrate the VM over from the file to a raw partition/LV.
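If it comes to that, my understanding is the conversion itself boils down to something like this, with the guest shut down first (names and sizes are placeholders):

    # Create an LV at least as large as the guest's virtual disk
    lvcreate -L 20G -n guest1 VolGroup00

    # Copy the qcow2 image into the LV as a raw image, then point the
    # guest's configuration at /dev/VolGroup00/guest1 instead of the file
    qemu-img convert -O raw /var/lib/libvirt/images/guest1.qcow2 /dev/VolGroup00/guest1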
On 06/10/2011 08:17 PM, Emmanuel Noobadmin wrote:
The irony of it was that I decided to go with qcow2 because I thought that would avoid the overhead of an additional LVM layer while still providing snapshot capabilities :(
I read somewhere recently that people were complaining about LVM overhead and poor performance, but I've never seen any evidence of it. Was there something that made you think that LVM had significant overhead?
On 6/16/11, Gordon Messmer yinyang@eburg.com wrote:
I read somewhere recently that people were complaining about LVM overhead and poor performance, but I've never seen any evidence of it. Was there something that made you think that LVM had significant overhead?
Looking at some very sparse notes I made on the decision, I think what tipped the choice was that both qcow2 and LVM added overhead, but LVM's applied to the whole system, i.e. the host would have additional processing on every I/O, whereas the qcow2 overhead applied only to guest I/O. More critically, my notes record the thought that it would be easier to move a qcow2 file to another machine/disk, if necessary, than to move a partition.
On 06/15/2011 07:04 PM, Emmanuel Noobadmin wrote:
Looking at some very sparse notes I made on the decision, I think what tipped the choice was that both qcow2 and LVM added overhead, but LVM's applied to the whole system, i.e. the host would have additional processing on every I/O, whereas the qcow2 overhead applied only to guest I/O.
I think you were misinformed, or misled. LVM should not present any noticeable overhead on the host. Using "raw" files to back VMs presents a significant overhead to guests; the host performs all IO through its filesystem. Using "qcow2" files presents even more overhead (probably the most of any configuration) since there are complexities to the qcow2 file itself in addition to the host's filesystem.
More critically, my notes record the thought that it would be easier to move a qcow2 file to another machine/disk, if necessary, than to move a partition.
It shouldn't be significantly harder to copy the contents of a partition or LV. The block device is a file. You can read its contents to copy them just as easily as any other file.
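For example (device names and the target host are placeholders):

    # Stream a logical volume to another machine over ssh
    dd if=/dev/VolGroup00/guest1 bs=1M | ssh otherhost 'dd of=/dev/VolGroup00/guest1 bs=1M'

    # Or capture it to an ordinary file first
    dd if=/dev/VolGroup00/guest1 of=/backup/guest1.img bs=1M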
On 6/16/11, Gordon Messmer yinyang@eburg.com wrote:
I think you were misinformed, or misled.
That wouldn't be new for me as far as system administration is concerned :D
LVM should not present any noticeable overhead on the host. Using "raw" files to back VMs presents a significant overhead to guests; the host performs all IO through its filesystem. Using "qcow2" files presents even more overhead (probably the most of any configuration) since there are complexities to the qcow2 file itself in addition to the host's filesystem.
I was concerned about qcow2 vs raw as well, since it seemed logical that qcow2 would be slower given the added functionality. However, I found a site showing that with KVM and virtio, turning off host caching on the file (or specifying write-back instead of the default write-through) and preallocating the qcow2 image will make qcow2 about as fast as raw.
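From what I read, the combination was roughly the following; the file names are examples, I haven't benchmarked it myself, and older qemu-img builds may not support the preallocation option:

    # Preallocate the qcow2 metadata when creating the image
    qemu-img create -f qcow2 -o preallocation=metadata guest1.qcow2 20G

    # Attach it with virtio and host caching turned off (or cache=writeback)
    qemu-kvm -drive file=/var/lib/libvirt/images/guest1.qcow2,if=virtio,cache=none ...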
It shouldn't be significantly harder to copy the contents of a partition or LV. The block device is a file. You can read its contents to copy them just as easily as any other file.
Although the combination of ionice and noatime seems to have stopped things from going through the roof, I'll probably still try to convert one of them to LVM and see if that improves things even further.
On 9.6.2011 12:38, Benjamin Franz wrote:
On 06/09/2011 02:24 AM, Emmanuel Noobadmin wrote:
I'm trying to resolve an I/O problem on a CentOS 5.6 server. The process basically scans through Maildirs, checking for space usage and quota. Because there are a hundred-odd user folders and several tens of thousands of small files, this sends the I/O wait percentage way up. The server hits a very high load level and stops responding to other requests until the crawl is done.
setting 'noatime' in the mount options for the partition holding the files will reduce the number of required I/Os quite a lot.
Yes, but before doing this be sure that your software does not need atime.
On 6/10/11, Markus Falb markus.falb@fasel.at wrote:
Yes, but before doing this be sure that your software does not need atime.
For a brief moment, I had that sinking "Oh No... why didn't I see this earlier" feeling especially since I've already remounted the filesystem with noatime.
Fortunately, so far it seems that everything's still alive and working, keeping fingers crossed :D
On 6/9/2011 12:09 PM, Emmanuel Noobadmin wrote:
On 6/10/11, Markus Falb markus.falb@fasel.at wrote:
Yes, but before doing this be sure that your software does not need atime.
For a brief moment, I had that sinking "Oh No... why didn't I see this earlier" feeling especially since I've already remounted the filesystem with noatime.
Fortunately, so far it seems that everything's still alive and working, keeping fingers crossed :D
Some email software might use it to see if something has been updated since it was last read.
On 6/9/2011 1:09 PM, Emmanuel Noobadmin wrote:
On 6/10/11, Markus Falb markus.falb@fasel.at wrote:
Yes, but before doing this be sure that your software does not need atime.
For a brief moment, I had that sinking "Oh No... why didn't I see this earlier" feeling especially since I've already remounted the filesystem with noatime.
Fortunately, so far it seems that everything's still alive and working, keeping fingers crossed :D
The last access time is generally not needed, especially for Maildirs. On our setup, Postfix and Dovecot don't care. I always mount as many file systems as possible with 'noatime'.
(Our IMAP Maildir storage is a 4-disk RAID 1+0 array with a few million individual messages across a lot of accounts.)
On Thu, Jun 9, 2011 at 12:38 PM, Benjamin Franz jfranz@freerun.com wrote:
On 06/09/2011 02:24 AM, Emmanuel Noobadmin wrote:
I'm trying to resolve an I/O problem on a CentOS 5.6 server. The process basically scans through Maildirs, checking for space usage and quota. Because there are a hundred-odd user folders and several tens of thousands of small files, this sends the I/O wait percentage way up. The server hits a very high load level and stops responding to other requests until the crawl is done.
I am wondering: if I add another disk and symlink the sub-directories to it, would that free up the server to respond to other requests despite the wait on that disk?
Alternatively, if I mdraid-mirror the existing disk, would md be smart enough to read from the other disk while the first is tied up with the scanning process?
You should look at running your process using 'ionice -c3 program'. That way it won't starve everything else for I/O cycles. Also, you may want to experiment with using the 'deadline' elevator instead of the default 'cfq' (see http://www.redhat.com/magazine/008jun05/features/schedulers/ and http://www.wlug.org.nz/LinuxIoScheduler). Neither of those would require you to change your hardware out. Also, setting 'noatime' in the mount options for the partition holding the files will reduce the number of required I/Os quite a lot.
But yes, in general, distributing your load across more disks should improve your I/O profile.
-- Benjamin Franz
Can one mount the root filesystem with noatime?
--On Thursday, June 09, 2011 07:04:24 PM +0200 Rudi Ahlers Rudi@SoftDux.com wrote:
Can one mount the root filesystem with noatime?
Generally speaking, one can mount any of the filesystems with noatime. Whether or not this is a good thing depends on your use. As was previously mentioned, some software (but not a lot) depends on it. The only thing that comes to mind offhand is mail software that uses a single-file monolithic mailbox. (Cyrus IMAPd, for example, uses one file per message, so noatime doesn't affect its behavior).
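For example, on a stock CentOS 5 layout the /etc/fstab entry would look something like this (the device name depends on your install), and 'mount -o remount,noatime /' applies it without a reboot:

    /dev/VolGroup00/LogVol00   /   ext3   defaults,noatime   1 1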
With noatime, you also (obviously) lose the ability to look at access times. *Once*, in my career, that was useful for doing forensics on a cracked system.
OTOH, it can make a good performance improvement. On SSDs, it can also help extend the drive's life.
Devin
--On Thursday, June 09, 2011 12:28:28 PM -0600 Devin Reade gdr@gno.org wrote:
The only thing that comes to mind offhand is mail software that uses a single-file monolithic mailbox.
Another message reminded me that most such software is probably basing its checks off of the mtime anyway.
Devin
On 06/09/11 2:24 AM, Emmanuel Noobadmin wrote:
Alternatively, if I mdraid-mirror the existing disk, would md be smart enough to read from the other disk while the first is tied up with the scanning process?
That would be my first choice, and yes, queued read I/O could be satisfied by either mirror, hence they'd have double the read performance.
next step would be a raid 1+0 with yet more disks.
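Building the mirror itself is a one-liner; the partition names here are only an example, and converting a disk that's already in use takes extra steps (create the array degraded with one disk, copy the data over, then add the original disk):

    # Build a two-disk RAID 1 from one partition on each drive
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1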
On 6/9/2011 1:26 PM, John R Pierce wrote:
On 06/09/11 2:24 AM, Emmanuel Noobadmin wrote:
Alternatively, if I mdraid-mirror the existing disk, would md be smart enough to read from the other disk while the first is tied up with the scanning process?
That would be my first choice, and yes, queued read I/O could be satisfied by either mirror, hence they'd have double the read performance.
next step would be a raid 1+0 with yet more disks.
mdadm is good, but you'll never get double the read performance. Even on our 3-way mirrors (RAID 1, 3 active disks), we don't come close to twice the single-disk read performance.
RAID 1+0 with 4/6/8 spindles is the best way to ensure that you get better performance.
Adding RAM to the server so that you have a larger read buffer might also be an option.
On Thu, 9 Jun 2011, Emmanuel Noobadmin wrote:
I'm trying to resolve an I/O problem on a CentOS 5.6 server. The process basically scans through Maildirs, checking for space usage and quota. Because there are a hundred-odd user folders and several tens of thousands of small files, this sends the I/O wait percentage way up. The server hits a very high load level and stops responding to other requests until the crawl is done.
If the server is reduced to a crawl, it's possible that you are hitting the dirty_ratio limit due to writes and the server has entered synchronous I/O mode. As others have mentioned, setting noatime could have a significant effect, especially if there are many files and the server doesn't have much memory. You can try increasing dirty_ratio to see if it has an effect, e.g.:
# sysctl vm.dirty_ratio
# sysctl -w vm.dirty_ratio=50
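To make the change persistent across reboots, it can also go into /etc/sysctl.conf:

    # /etc/sysctl.conf
    vm.dirty_ratio = 50

    # reload the settings
    sysctl -p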
Steve