Can anyone explain how the disk elevator works and if there is any way to tweak it? I have an email server which likely has a large number of read and write requests and was wondering if there was any way to improve performance.
Matt
Reasonably decent writeup. Gives a good overview, but I'm not sure how much detail you'd like. http://www.redhat.com/magazine/008jun05/features/schedulers/
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Jim Perrin
Sent: Friday, January 05, 2007 7:24 PM
To: CentOS mailing list
Subject: Re: [CentOS] Disk Elevator
The disk elevators or io schedulers are there to minimize head seek by re-ordering and merging requests to read or write data from common areas of the disk.
There are some tweaks to improve performance, but the performance gains are minimal on a raid array (the elevators do not know the stripe size, as they were implemented with single-spindle drives in mind).
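For reference, on the 2.6 kernels the elevator itself is selected system-wide with the elevator= boot parameter (noop, as, deadline or cfq). A sketch of a grub.conf kernel line, where the kernel version and root device are only placeholders:

kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/VolGroup00/LogVol00 elevator=deadline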
The biggest performance gain you can achieve on a raid array is to make sure you format the volume aligned to your raid stripe size. For example if you have a 4 drive raid 5 and it is using 64K chunks, your stripe size will be 256K. Given a 4K filesystem block size you would then have a stride of 64 (256/4), so when you format your volume:
mke2fs -E stride=64 /dev/XXXX (other useful options: -j for ext3, -N <# of inodes> for an extended number of i-nodes, -O dir_index to speed up directory searches for large numbers of files)
By aligning the file-system to the array stripe size you can minimize short write penalties to your array which will speed up writes. By using the -O dir_index option you can speed up reads a fraction, but by minimizing the write penalties reads will gain performance anyway.
A short write penalty occurs when the data written to an array is shorter than the stripe (256K): the remaining blocks then need to be read from the stripe in order to compute a new parity for the stripe. If the OS knows the stripe size, each stripe can be cached beforehand in a read-ahead, so when a write comes it should have all the data it needs to write the full stripe to disk. It can also give hints to the page cache for combining separate io that falls in the same stripe.
-Ross
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid array is to make sure you format the volume aligned to your raid stripe size. For example if you have a 4 drive raid 5 and it is using 64K chunks, your stripe size will be 256K. Given a 4K filesystem block size you would then have a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N <# of inodes> for extended # of i-nodes, -O dir_index speeds up directory searches for large # of files) /dev/XXXX
Shouldn't the argument for the stride option be how many file system blocks there are per stripe? After all, there's no way for the OS to guess what RAID level you are using. For a 4 disk RAID5 with 64k chunks and 4k file system blocks you have only 48 file system blocks per stripe ((4-1)x64k/4k=48). So it should be -E stride=48 in this particular case. If it were a 4 disk RAID0 array, then it would be 64 (4x64k/4k=64). If it were a 4 disk RAID10 array, then it would be 32 ((4/2)*64k/4k=32). Or at least that's the way I understood it by reading the man page.
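To spell that arithmetic out (a sketch only, reusing /dev/XXXX as the placeholder device):

# stride = data (non-parity) bytes per stripe / filesystem block size
#   4 disk RAID5,  64K chunks, 4K blocks: (4-1) * 64K / 4K = 48
#   4 disk RAID0,  64K chunks, 4K blocks:     4 * 64K / 4K = 64
#   4 disk RAID10, 64K chunks, 4K blocks: (4/2) * 64K / 4K = 32
mke2fs -j -O dir_index -E stride=48 /dev/XXXX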
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Aleksandar Milivojevic
Sent: Monday, January 08, 2007 1:00 PM
To: centos@centos.org
Subject: RE: [CentOS] Disk Elevator
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid
array is to make
sure you format the volume aligned to your raid stripe
size. For example
if you have a 4 drive raid 5 and it is using 64K chunks, your stripe size will be 256K. Given a 4K filesystem block size you
would then have
a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N
<# of inodes>
for extended # of i-nodes, -O dir_index speeds up directory
searches for
large # of files) /dev/XXXX
Shouldn't the argument for stride option be how many file system blocks there is per stripe? After all, there's no way for OS to guess what RAID level you are using. For 4 disk RAID5 with 64k chunks and 4k file system blocks you have only 48 file system blocks per stripe ((4-1)x64k/4k=48). So it should be -E stride=48 in this particular case. If it was 4 disk RAID0 array, than it would be 64 (4x64k/4k=64). If it was 4 disk RAID10 array, than it would be 32 ((4/2)*64k/4k=32). Or at least that's the way I understood it by reading the man page.
You are correct: leave one of the chunks off for the parity, so for a 4 disk raid5, stride=48. I had just computed all 4 chunks as part of the stride.
BTW, that parity chunk still needs to be in memory to avoid the read on it, no? Wouldn't a stride of 64 help in that case? And if the stride leaves out the parity chunk, won't successive read-aheads cause a continuous wrap of the stripe, which will negate the effect of the stride by not having the complete stripe cached?
-Ross
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Ross S. W. Walker
Sent: Monday, January 08, 2007 1:15 PM
To: CentOS mailing list
Subject: RE: [CentOS] Disk Elevator
Let me follow up on my last post by saying that Aleksandar is absolutely correct. The stride is the # of blocks per stripe and has nothing to do with read-ahead, and thus should be calculated from the # of chunks minus parity in a stripe.
For read-ahead, you would set this through blockdev --setra X /dev/YY, and use a multiple of the # of sectors in a stripe. So for a 256K stripe, set the read-ahead to 512, 1024 or 2048, depending on whether the io is mostly random or mostly sequential (bigger for sequential, smaller for random).
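Spelled out for the 256K-stripe example (values are in 512-byte sectors, and /dev/XXXX is again a placeholder for the array's block device):

blockdev --getra /dev/XXXX       # show the current read-ahead, in sectors
blockdev --setra 512 /dev/XXXX   # one stripe (256K): leans toward random io
blockdev --setra 2048 /dev/XXXX  # four stripes (1M): leans toward sequential io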
-Ross
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid
array is to make
sure you format the volume aligned to your raid stripe
size. For example
if you have a 4 drive raid 5 and it is using 64K chunks,
your stripe
size will be 256K. Given a 4K filesystem block size you
would then have
a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N
<# of inodes>
for extended # of i-nodes, -O dir_index speeds up directory
searches for
large # of files) /dev/XXXX
Shouldn't the argument for stride option be how many file system blocks there is per stripe? After all, there's no way for OS to guess what RAID level you are using. For 4 disk RAID5 with 64k
chunks and
4k file system blocks you have only 48 file system blocks
per stripe
((4-1)x64k/4k=48). So it should be -E stride=48 in this
particular
case. If it was 4 disk RAID0 array, than it would be 64 (4x64k/4k=64). If it was 4 disk RAID10 array, than it would be 32 ((4/2)*64k/4k=32). Or at least that's the way I understood it by reading the man page.
You are correct, leave one of the chunks off for the parity, so for 4 disk raid5 stride=48. I had just computed all 4 chunks as part of the stride.
BTW that parity chunk still needs to be in memory to avoid the read on it, no? In that case wouldn't a stride of 64 help in that case? And if the stride leaves out the parity chunk then will not successive read-aheads cause a continuous wrap of the stripe which will negate the effect of the stride by not having the complete stripe cached?
For read-ahead, you would set this through blockdev --setra X /dev/YY, and use a multiple of the # of sectors in a stripe, so for a 256K stripe, set the read-ahead to 512, 1024, 2048, depending if the io is mostly random or mostly sequential (bigger for sequential, smaller for random).
To follow up on this (even if it is a little late), how is this affected by LVM use? I'm curious to know how (or if) this math changes with ext3 sitting on LVM on the raid array.
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Jim Perrin
Sent: Tuesday, January 16, 2007 9:37 AM
To: CentOS mailing list
Subject: Re: [CentOS] Disk Elevator
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid
array is to make
sure you format the volume aligned to your raid stripe
size. For example
if you have a 4 drive raid 5 and it is using 64K chunks,
your stripe
size will be 256K. Given a 4K filesystem block size you
would then have
a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N
<# of inodes>
for extended # of i-nodes, -O dir_index speeds up directory
searches for
large # of files) /dev/XXXX
Shouldn't the argument for stride option be how many file system blocks there is per stripe? After all, there's no way for OS to guess what RAID level you are using. For 4 disk RAID5 with 64k
chunks and
4k file system blocks you have only 48 file system blocks
per stripe
((4-1)x64k/4k=48). So it should be -E stride=48 in this
particular
case. If it was 4 disk RAID0 array, than it would be 64 (4x64k/4k=64). If it was 4 disk RAID10 array, than it
would be 32
((4/2)*64k/4k=32). Or at least that's the way I
understood it by
reading the man page.
You are correct, leave one of the chunks off for the
parity, so for 4
disk raid5 stride=48. I had just computed all 4 chunks as
part of the
stride.
BTW that parity chunk still needs to be in memory to
avoid the read on
it, no? In that case wouldn't a stride of 64 help in that
case? And if
the stride leaves out the parity chunk then will not successive read-aheads cause a continuous wrap of the stripe which will negate the effect of the stride by not having the complete stripe cached?
For read-ahead, you would set this through blockdev --setra
X /dev/YY,
and use a multiple of the # of sectors in a stripe, so for a 256K stripe, set the read-ahead to 512, 1024, 2048, depending if
the io is
mostly random or mostly sequential (bigger for sequential,
smaller for
random).
To follow up on this (even if it is a little late), how is this affected by LVM use? I'm curious to know how (or if) this math changes with ext3 sitting on LVM on the raid array.
Depends is the best answer. It really depends on LVM and the other block layer devices. As the io requests descend down the different layers they will enter multiple request_queues, and each request_queue will have an io scheduler assigned to it, either the system default, one of the others, or one of the block device's own, so it is hard to say. Only by testing can you know for sure. In my tests LVM is very good, with unnoticeable overhead going to hardware RAID, but if you use MD RAID then your experience might be different.
Ext3
  |
VFS
  |
Page Cache
  |
LVM request_queue (io scheduler)
  |
LVM
  |
MD request_queue (io scheduler)
  |
MD
  |
  -----------------------------
  |     |     |     |     |
 Que   Que   Que   Que   Que   (io scheduler)
  |     |     |     |     |
 sda   sdb   sdc   sdd   sde
Hope this helps clarify.
It does, and I should have specified at the outset that this was with respect to hardware raid.
On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote: ...
As the io requests descend down the different layers they will enter multiple request_queues, each request_queue will have an io scheduler assigned to it ...
I don't think that is quite correct. AFAICT only the "real" devices (such as /dev/sda) have an io-scheduler. See the difference in ls /sys/block/..:

# ls /sys/block/dm-0
dev  range  removable  size  stat

# ls /sys/block/sdc
dev  device  queue  range  removable  size  stat
As for read-ahead it's the reverse. Read-ahead has no effect (in my tests) when applied to the underlying device (such as sda) but has to be set on the lvm-device. Here are some performance numbers:
sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
# time dd if=file10G of=/dev/null bs=1M
real    0m59.465s

sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 gives:
# time dd if=file10G of=/dev/null bs=1M
real    0m24.163s
This is on an 8 disk 3ware raid6 (hardware raid) with fully updated centos-4.4 x86_64. The file dd read was 1000 MiB. 256 is the default read-ahead, and blockdev --setra was used to change it.
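For the record, those settings correspond to commands of this form (device names as in the test above):

blockdev --setra 8192 /dev/sdc    # read-ahead on the underlying 3ware device
blockdev --setra 8192 /dev/dm-0   # read-ahead on the LVM (device-mapper) device
blockdev --getra /dev/dm-0        # verify, reported in 512-byte sectors
time dd if=file10G of=/dev/null bs=1M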
/Peter
-----Original Message-----
From: Peter Kjellstrom [mailto:cap@nsc.liu.se]
Sent: Tuesday, January 16, 2007 11:55 AM
To: centos@centos.org
Cc: Ross S. W. Walker
Subject: Re: [CentOS] Disk Elevator
How a device presents itself in /proc or /sys is completely up to the device.
All block devices have a request_queue. You can look at the struct of said queue in linux/blkdev.h, and you can then look at the code in ll_rw_blk.c to see how said queue is processed.
Here is the structure anyways:
struct request_queue
{
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        struct request          *last_merge;
        elevator_t              elevator;

        /*
         * the queue request freelist, one for reads and one for writes
         */
        struct request_list     rq;

        request_fn_proc         *request_fn;
        merge_request_fn        *back_merge_fn;
        merge_request_fn        *front_merge_fn;
        merge_requests_fn       *merge_requests_fn;
        make_request_fn         *make_request_fn;
        prep_rq_fn              *prep_rq_fn;
        unplug_fn               *unplug_fn;
        merge_bvec_fn           *merge_bvec_fn;
        activity_fn             *activity_fn;
        issue_flush_fn          *issue_flush_fn;

        /*
         * Auto-unplugging state
         */
        struct timer_list       unplug_timer;
        int                     unplug_thresh;  /* After this many requests */
        unsigned long           unplug_delay;   /* After this many jiffies */
        struct work_struct      unplug_work;

        struct backing_dev_info backing_dev_info;

        /*
         * The queue owner gets to use this for whatever they like.
         * ll_rw_blk doesn't touch it.
         */
        void                    *queuedata;

        void                    *activity_data;

        /*
         * queue needs bounce pages for pages above this limit
         */
        unsigned long           bounce_pfn;
        int                     bounce_gfp;

        /*
         * various queue flags, see QUEUE_* below
         */
        unsigned long           queue_flags;

        /*
         * protects queue structures from reentrancy
         */
        spinlock_t              *queue_lock;

        /*
         * queue kobject
         */
        struct kobject          kobj;

        /*
         * queue settings
         */
        unsigned long           nr_requests;    /* Max # of requests */
        unsigned int            nr_congestion_on;
        unsigned int            nr_congestion_off;

        unsigned short          max_sectors;
        unsigned short          max_hw_sectors;
        unsigned short          max_phys_segments;
        unsigned short          max_hw_segments;
        unsigned short          hardsect_size;
        unsigned int            max_segment_size;

        unsigned long           seg_boundary_mask;
        unsigned int            dma_alignment;

        struct blk_queue_tag    *queue_tags;

        atomic_t                refcnt;

        unsigned int            in_flight;

        /*
         * sg stuff
         */
        unsigned int            sg_timeout;
        unsigned int            sg_reserved_size;
};
Every request queue needs an elevator/scheduler, otherwise as you go down the block layers you can get contention/starvation between them.
As for read-ahead it's the reverse. Read-ahead has no effect (in my tests) when applied to the underlying device (such as sda) but has to be set on the lvm-device. ...
I too see little improvement from read-ahead with sequential io, but surprisingly, and completely counter-intuitively, it seems to help with random read io, as long as the read-aheads are kept low. Set the read-ahead to your stripe size in sectors and you will be pleasantly surprised with the random read #s.
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Ross S. W. Walker
Sent: Tuesday, January 16, 2007 12:22 PM
To: Peter Kjellstrom; centos@centos.org
Subject: RE: [CentOS] Disk Elevator
Oh, and the read-aheads set by blockdev tend to be inherited by the block device driver using that device as its backing device.
When sdX is created it defaults to 256 sectors; when partitions are mapped they have 256 sectors; when MD associates with the drive or its partitions it uses the 256 read-ahead; when LVM comes on top it uses the MD 256 read-ahead. This is then passed up to the VFS routines, which use it to determine the amount of read-ahead to do. Since VFS associates only with its immediate backing device, setting the read-ahead on a lower backing device has no effect; set it on the immediate backing device, in this case LVM.
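As a sketch (the volume group and logical volume names below are placeholders), check each layer and then set the read-ahead on the device the filesystem actually sits on:

blockdev --getra /dev/sda                      # underlying disk
blockdev --getra /dev/md0                      # MD array on top of it, if any
blockdev --getra /dev/VolGroup00/LogVol00      # LVM volume holding the filesystem
blockdev --setra 512 /dev/VolGroup00/LogVol00  # set it on the immediate backing device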
Quoting "Ross S. W. Walker" rwalker@medallion.com:
BTW, that parity chunk still needs to be in memory to avoid the read on it, no? Wouldn't a stride of 64 help in that case? And if the stride leaves out the parity chunk, won't successive read-aheads cause a continuous wrap of the stripe, which will negate the effect of the stride by not having the complete stripe cached?
Hm, not really. The parity chunk is never handed over to the OS. It's internal to the hardware RAID controller. The OS doesn't know anything about it; it doesn't even know that the "disk" it is accessing is actually a RAID5 array.
Back to your example of 4 disk RAID5, 64k chunks, 4k file system blocks.
If you set stride to 48, the OS gives 3 chunks worth of data to the controller, aligned with stripes. The controller calculates parity and writes out 4 chunks (3 data, 1 parity).
If you set stride to 64, the OS gives 4 chunks worth of data to the controller. In the best case scenario the first or last three will be aligned with stripes. The controller calculates parity on 3 of them and writes out 4 chunks (3 data, 1 parity). For the remaining data chunk, it needs to read 2 chunks from the disk, calculate parity and write 2 chunks (1 data, 1 parity). In the worst case scenario the first or last 3 chunks will not be aligned with stripes. The controller reads 1 chunk, calculates parity, writes out 3 chunks (2 data, 1 parity), then does the same thing again for the remaining 2 chunks of data.
Anyhow, for large sequential reads and writes there's really not a big performance benefit (if any). The OS will tend to combine and rearrange reads and writes to be sequential, and the hardware RAID controller will do the same using its cache. I've tested this once with a good RAID controller, and bonnie++ (which benchmarks this kind of access) gave almost the same numbers with and without the stride option.
If disk access is random (read a block here, write a block there), there might be some benefit (however, the cache in the hardware RAID controller might kick in and save the day here too). It all depends on the particular RAID controller, the workload, and the amount and type (write back vs. write through) of cache on the controller.
I'd say in most cases using the stride option has very little effect if you have a large battery-backed write-back cache (and a good RAID controller, that is). If you are using software RAID, or have a small and/or write-through cache, the stride option might have some effect.
Matt wrote:
Can anyone explain how the disk elevator works and if there is any way to tweak it? ...
I assume you are talking about CentOS 3.x with the 2.4 kernel. I know it is heavily patched, but I don't think it has the complicated i/o schedulers you find in 2.6.
You use elvtune to tweak it. Basically you just define the max length the write queue can be before attention will be given to reads and likewise for the read queue. man elvtune for more information.
I do not have a 2.4 kernel box anymore but I think you can try 'elvtune -r 128 -w 128 /dev/device' and see if that helps. 'elvtune /dev/device' will show you the settings being used. If you are using ext3, you probably also want to look at tweaking /proc/sys/vm/bdflush. Look here for more information:
"[root@server ~]# elvtune /dev/hda ioctl get: Invalid argument
elvtune is only useful on older kernels; for 2.6 use IO scheduler sysfs tunables instead.."
That's what I get.
Matt
I did say:
"I assume you are talking about Centos 3.x with the 2.4 kernel"
For tweaking the io schedulers in 2.6, you need to mount the sysfs filesystem, which is under /sys on RHEL4/CentOS4. Play around with the values under /sys/block/devicename/queue/*.
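For example, on a RHEL4/CentOS4-era 2.6 kernel you would poke at entries like the following (exact files depend on the kernel build and the active scheduler, so treat this as a sketch; sda is a placeholder device):

ls /sys/block/sda/queue/                     # per-queue tunables for this device
cat /sys/block/sda/queue/nr_requests         # requests allowed in the queue
echo 256 > /sys/block/sda/queue/nr_requests
ls /sys/block/sda/queue/iosched/             # tunables of the scheduler in use
# On kernels of this vintage the scheduler itself is normally chosen at boot with elevator=.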