Can anyone explain how the disk elevator works and if there is any way to tweak it? I have an email server which likely has a large number of read and write requests and was wondering if there was any way to improve performance.
Matt
Reasonably decent writeup. Gives a good overview, but I'm not sure how much detail you'd like. http://www.redhat.com/magazine/008jun05/features/schedulers/
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Jim Perrin
Sent: Friday, January 05, 2007 7:24 PM
To: CentOS mailing list
Subject: Re: [CentOS] Disk Elevator
The disk elevators or io schedulers are there to minimize head seek by re-ordering and merging requests to read or write data from common areas of the disk.
There are some tweaks to improve performance, but the performance gains are minimal on a raid array (the elevators do not know the stripe size, as they were implemented with single-spindle drives in mind).
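For reference, on the 2.6 kernels the elevator itself is selected system-wide with the elevator= boot parameter (noop, as, deadline or cfq). A sketch of a grub.conf kernel line, where the kernel version and root device are only placeholders:

kernel /vmlinuz-2.6.9-42.ELsmp ro root=/dev/VolGroup00/LogVol00 elevator=deadline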
The biggest performance gain you can achieve on a raid array is to make sure you format the volume aligned to your raid stripe size. For example if you have a 4 drive raid 5 and it is using 64K chunks, your stripe size will be 256K. Given a 4K filesystem block size you would then have a stride of 64 (256/4), so when you format your volume:
mke2fs -E stride=64 /dev/XXXX (other useful options: -j for ext3, -N <# of inodes> for an extended number of i-nodes, -O dir_index to speed up directory searches for large numbers of files)
By aligning the file-system to the array stripe size you can minimize short write penalties to your array which will speed up writes. By using the -O dir_index option you can speed up reads a fraction, but by minimizing the write penalties reads will gain performance anyway.
A short write penalty occurs when the data written to an array is shorter than the stripe (256K): the remaining blocks then need to be read from the stripe in order to compute a new parity for the stripe. If the OS knows the stripe size, each stripe can be cached beforehand in a read-ahead, so when a write comes it should have all the data it needs to write the full stripe to disk. It can also give hints to the page cache for combining separate io that falls in the same stripe.
-Ross
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid array is to make sure you format the volume aligned to your raid stripe size. For example if you have a 4 drive raid 5 and it is using 64K chunks, your stripe size will be 256K. Given a 4K filesystem block size you would then have a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N <# of inodes> for extended # of i-nodes, -O dir_index speeds up directory searches for large # of files) /dev/XXXX
Shouldn't the argument for the stride option be how many file system blocks there are per stripe? After all, there's no way for the OS to guess what RAID level you are using. For a 4 disk RAID5 with 64k chunks and 4k file system blocks you have only 48 file system blocks per stripe ((4-1)x64k/4k=48). So it should be -E stride=48 in this particular case. If it were a 4 disk RAID0 array, then it would be 64 (4x64k/4k=64). If it were a 4 disk RAID10 array, then it would be 32 ((4/2)*64k/4k=32). Or at least that's the way I understood it by reading the man page.
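To spell that arithmetic out (a sketch only, reusing /dev/XXXX as the placeholder device):

# stride = data (non-parity) bytes per stripe / filesystem block size
#   4 disk RAID5,  64K chunks, 4K blocks: (4-1) * 64K / 4K = 48
#   4 disk RAID0,  64K chunks, 4K blocks:     4 * 64K / 4K = 64
#   4 disk RAID10, 64K chunks, 4K blocks: (4/2) * 64K / 4K = 32
mke2fs -j -O dir_index -E stride=48 /dev/XXXX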
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Aleksandar Milivojevic
Sent: Monday, January 08, 2007 1:00 PM
To: centos@centos.org
Subject: RE: [CentOS] Disk Elevator
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid
array is to make
sure you format the volume aligned to your raid stripe
size. For example
if you have a 4 drive raid 5 and it is using 64K chunks, your stripe size will be 256K. Given a 4K filesystem block size you
would then have
a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N
<# of inodes>
for extended # of i-nodes, -O dir_index speeds up directory
searches for
large # of files) /dev/XXXX
Shouldn't the argument for stride option be how many file system blocks there is per stripe? After all, there's no way for OS to guess what RAID level you are using. For 4 disk RAID5 with 64k chunks and 4k file system blocks you have only 48 file system blocks per stripe ((4-1)x64k/4k=48). So it should be -E stride=48 in this particular case. If it was 4 disk RAID0 array, than it would be 64 (4x64k/4k=64). If it was 4 disk RAID10 array, than it would be 32 ((4/2)*64k/4k=32). Or at least that's the way I understood it by reading the man page.
You are correct: leave one of the chunks off for the parity, so for a 4 disk raid5, stride=48. I had just computed all 4 chunks as part of the stride.
BTW, that parity chunk still needs to be in memory to avoid the read on it, no? Wouldn't a stride of 64 help in that case? And if the stride leaves out the parity chunk, won't successive read-aheads cause a continuous wrap of the stripe, which will negate the effect of the stride by not having the complete stripe cached?
-Ross
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Ross S. W. Walker
Sent: Monday, January 08, 2007 1:15 PM
To: CentOS mailing list
Subject: RE: [CentOS] Disk Elevator
Let me follow up on my last post by saying that Aleksandar is absolutely correct. The stride is the # of blocks per stripe and has nothing to do with read-ahead, and thus should be calculated from the # of chunks minus parity in a stripe.
For read-ahead, you would set this through blockdev --setra X /dev/YY, and use a multiple of the # of sectors in a stripe. So for a 256K stripe, set the read-ahead to 512, 1024 or 2048, depending on whether the io is mostly random or mostly sequential (bigger for sequential, smaller for random).
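Spelled out for the 256K-stripe example (values are in 512-byte sectors, and /dev/XXXX is again a placeholder for the array's block device):

blockdev --getra /dev/XXXX       # show the current read-ahead, in sectors
blockdev --setra 512 /dev/XXXX   # one stripe (256K): leans toward random io
blockdev --setra 2048 /dev/XXXX  # four stripes (1M): leans toward sequential io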
-Ross
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid
array is to make
sure you format the volume aligned to your raid stripe
size. For example
if you have a 4 drive raid 5 and it is using 64K chunks,
your stripe
size will be 256K. Given a 4K filesystem block size you
would then have
a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N
<# of inodes>
for extended # of i-nodes, -O dir_index speeds up directory
searches for
large # of files) /dev/XXXX
Shouldn't the argument for stride option be how many file system blocks there is per stripe? After all, there's no way for OS to guess what RAID level you are using. For 4 disk RAID5 with 64k
chunks and
4k file system blocks you have only 48 file system blocks
per stripe
((4-1)x64k/4k=48). So it should be -E stride=48 in this
particular
case. If it was 4 disk RAID0 array, than it would be 64 (4x64k/4k=64). If it was 4 disk RAID10 array, than it would be 32 ((4/2)*64k/4k=32). Or at least that's the way I understood it by reading the man page.
You are correct, leave one of the chunks off for the parity, so for 4 disk raid5 stride=48. I had just computed all 4 chunks as part of the stride.
BTW that parity chunk still needs to be in memory to avoid the read on it, no? In that case wouldn't a stride of 64 help in that case? And if the stride leaves out the parity chunk then will not successive read-aheads cause a continuous wrap of the stripe which will negate the effect of the stride by not having the complete stripe cached?
For read-ahead, you would set this through blockdev --setra X /dev/YY, and use a multiple of the # of sectors in a stripe, so for a 256K stripe, set the read-ahead to 512, 1024, 2048, depending if the io is mostly random or mostly sequential (bigger for sequential, smaller for random).
To follow up on this (even if it is a little late), how is this affected by LVM use? I'm curious to know how (or if) this math changes with ext3 sitting on LVM on the raid array.
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Jim Perrin
Sent: Tuesday, January 16, 2007 9:37 AM
To: CentOS mailing list
Subject: Re: [CentOS] Disk Elevator
Quoting "Ross S. W. Walker" rwalker@medallion.com:
The biggest performance gain you can achieve on a raid
array is to make
sure you format the volume aligned to your raid stripe
size. For example
if you have a 4 drive raid 5 and it is using 64K chunks,
your stripe
size will be 256K. Given a 4K filesystem block size you
would then have
a stride of 64 (256/4), so when you format your volume:
Mke2fs -E stride=64 (other needed options -j for ext3, -N
<# of inodes>
for extended # of i-nodes, -O dir_index speeds up directory
searches for
large # of files) /dev/XXXX
Shouldn't the argument for stride option be how many file system blocks there is per stripe? After all, there's no way for OS to guess what RAID level you are using. For 4 disk RAID5 with 64k
chunks and
4k file system blocks you have only 48 file system blocks
per stripe
((4-1)x64k/4k=48). So it should be -E stride=48 in this
particular
case. If it was 4 disk RAID0 array, than it would be 64 (4x64k/4k=64). If it was 4 disk RAID10 array, than it
would be 32
((4/2)*64k/4k=32). Or at least that's the way I
understood it by
reading the man page.
You are correct, leave one of the chunks off for the
parity, so for 4
disk raid5 stride=48. I had just computed all 4 chunks as
part of the
stride.
BTW that parity chunk still needs to be in memory to
avoid the read on
it, no? In that case wouldn't a stride of 64 help in that
case? And if
the stride leaves out the parity chunk then will not successive read-aheads cause a continuous wrap of the stripe which will negate the effect of the stride by not having the complete stripe cached?
For read-ahead, you would set this through blockdev --setra
X /dev/YY,
and use a multiple of the # of sectors in a stripe, so for a 256K stripe, set the read-ahead to 512, 1024, 2048, depending if
the io is
mostly random or mostly sequential (bigger for sequential,
smaller for
random).
To follow up on this (even if it is a little late), how is this affected by LVM use? I'm curious to know how (or if) this math changes with ext3 sitting on LVM on the raid array.
Depends is the best answer. It really depends on LVM and the other block layer devices. As the io requests descend down the different layers they will enter multiple request_queues, and each request_queue will have an io scheduler assigned to it, either the system default, one of the others, or one of the block device's own, so it is hard to say. Only by testing can you know for sure. In my tests LVM is very good, with unnoticeable overhead going to hardware RAID, but if you use MD RAID then your experience might be different.
Ext3
  |
VFS
  |
Page Cache
  |
LVM request_queue (io scheduler)
  |
LVM
  |
MD request_queue (io scheduler)
  |
MD
  |
  -----------------------------
  |     |     |     |     |
 Que   Que   Que   Que   Que   (io scheduler)
  |     |     |     |     |
 sda   sdb   sdc   sdd   sde
Hope this helps clarify.
It does, and I should have specified at the outset that this was with respect to hardware raid.
On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote: ...
As the io requests descend down the different layers they will enter multiple request_queues, each request_queue will have an io scheduler assigned to it ...
I don't think that is quite correct. AFAICT only the "real" devices (such as /dev/sda) have an io-scheduler. See the difference in ls /sys/block/..:

# ls /sys/block/dm-0
dev  range  removable  size  stat

# ls /sys/block/sdc
dev  device  queue  range  removable  size  stat
As for read-ahead it's the reverse. Read-ahead has no effect (in my tests) when applied to the underlying device (such as sda) but has to be set on the lvm-device. Here are some performance numbers:
sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
# time dd if=file10G of=/dev/null bs=1M
real    0m59.465s

sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 gives:
# time dd if=file10G of=/dev/null bs=1M
real    0m24.163s
This is on an 8 disk 3ware raid6 (hardware raid) with fully updated centos-4.4 x86_64. The file dd read was 1000 MiB. 256 is the default read-ahead, and blockdev --setra was used to change it.
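For the record, those settings correspond to commands of this form (device names as in the test above):

blockdev --setra 8192 /dev/sdc    # read-ahead on the underlying 3ware device
blockdev --setra 8192 /dev/dm-0   # read-ahead on the LVM (device-mapper) device
blockdev --getra /dev/dm-0        # verify, reported in 512-byte sectors
time dd if=file10G of=/dev/null bs=1M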
/Peter
-----Original Message-----
From: Peter Kjellstrom [mailto:cap@nsc.liu.se]
Sent: Tuesday, January 16, 2007 11:55 AM
To: centos@centos.org
Cc: Ross S. W. Walker
Subject: Re: [CentOS] Disk Elevator
How a device presents itself in /proc or /sys is completely up to the device.
All block devices have a request_queue. You can look at the struct of said queue in linux/blkdev.h, and you can then look at the code in ll_rw_blk.c to see how said queue is processed.
Here is the structure anyways:
struct request_queue
{
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        struct request          *last_merge;
        elevator_t              elevator;

        /*
         * the queue request freelist, one for reads and one for writes
         */
        struct request_list     rq;

        request_fn_proc         *request_fn;
        merge_request_fn        *back_merge_fn;
        merge_request_fn        *front_merge_fn;
        merge_requests_fn       *merge_requests_fn;
        make_request_fn         *make_request_fn;
        prep_rq_fn              *prep_rq_fn;
        unplug_fn               *unplug_fn;
        merge_bvec_fn           *merge_bvec_fn;
        activity_fn             *activity_fn;
        issue_flush_fn          *issue_flush_fn;

        /*
         * Auto-unplugging state
         */
        struct timer_list       unplug_timer;
        int                     unplug_thresh;  /* After this many requests */
        unsigned long           unplug_delay;   /* After this many jiffies */
        struct work_struct      unplug_work;

        struct backing_dev_info backing_dev_info;

        /*
         * The queue owner gets to use this for whatever they like.
         * ll_rw_blk doesn't touch it.
         */
        void                    *queuedata;

        void                    *activity_data;

        /*
         * queue needs bounce pages for pages above this limit
         */
        unsigned long           bounce_pfn;
        int                     bounce_gfp;

        /*
         * various queue flags, see QUEUE_* below
         */
        unsigned long           queue_flags;

        /*
         * protects queue structures from reentrancy
         */
        spinlock_t              *queue_lock;

        /*
         * queue kobject
         */
        struct kobject          kobj;

        /*
         * queue settings
         */
        unsigned long           nr_requests;    /* Max # of requests */
        unsigned int            nr_congestion_on;
        unsigned int            nr_congestion_off;

        unsigned short          max_sectors;
        unsigned short          max_hw_sectors;
        unsigned short          max_phys_segments;
        unsigned short          max_hw_segments;
        unsigned short          hardsect_size;
        unsigned int            max_segment_size;

        unsigned long           seg_boundary_mask;
        unsigned int            dma_alignment;

        struct blk_queue_tag    *queue_tags;

        atomic_t                refcnt;

        unsigned int            in_flight;

        /*
         * sg stuff
         */
        unsigned int            sg_timeout;
        unsigned int            sg_reserved_size;
};
Every request queue needs an elevator/scheduler, otherwise as you go down the block layers you can get contention/starvation between them.
As for read-ahead it's the reverse. Read-ahead has no effect (in my tests) when applied to the underlying device (such as sda) but has to be set on the lvm-device. ...
I too see little improvement from read-ahead with sequential io, but surprisingly, and completely counter-intuitively, it seems to help with random read io, as long as the read-aheads are kept low. Set the read-ahead to your stripe size in sectors and you will be pleasantly surprised with the random read #s.
-----Original Message-----
From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Ross S. W. Walker
Sent: Tuesday, January 16, 2007 12:22 PM
To: Peter Kjellstrom; centos@centos.org
Subject: RE: [CentOS] Disk Elevator
Oh, and the read-aheads set by blockdev tend to be inherited by the block device driver using that device as its backing device.
When sdX is created it defaults to 256 sectors; when partitions are mapped they have 256 sectors; when MD associates with the drive or its partitions it uses the 256 read-ahead; when LVM comes on top it uses the MD 256 read-ahead. This is then passed up to the VFS routines, which use it to determine the amount of read-ahead to do. Since VFS associates only with its immediate backing device, setting the read-ahead on a lower backing device has no effect; set it on the immediate backing device, in this case LVM.
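As a sketch (the volume group and logical volume names below are placeholders), check each layer and then set the read-ahead on the device the filesystem actually sits on:

blockdev --getra /dev/sda                      # underlying disk
blockdev --getra /dev/md0                      # MD array on top of it, if any
blockdev --getra /dev/VolGroup00/LogVol00      # LVM volume holding the filesystem
blockdev --setra 512 /dev/VolGroup00/LogVol00  # set it on the immediate backing device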
Quoting "Ross S. W. Walker" rwalker@medallion.com:
BTW, that parity chunk still needs to be in memory to avoid the read on it, no? Wouldn't a stride of 64 help in that case? And if the stride leaves out the parity chunk, won't successive read-aheads cause a continuous wrap of the stripe, which will negate the effect of the stride by not having the complete stripe cached?
Hm, not really. The parity chunk is never handed over to the OS. It's internal to the hardware RAID controller. The OS doesn't know anything about it; it doesn't even know that the "disk" it is accessing is actually a RAID5 array.
Back to your example of 4 disk RAID5, 64k chunks, 4k file system blocks.
If you set stride to 48, the OS gives 3 chunks worth of data to the controller, aligned with stripes. The controller calculates parity and writes out 4 chunks (3 data, 1 parity).
If you set stride to 64, the OS gives 4 chunks worth of data to the controller. In the best case scenario the first or last three will be aligned with stripes. The controller calculates parity on 3 of them and writes out 4 chunks (3 data, 1 parity). For the remaining data chunk, it needs to read 2 chunks from the disk, calculate parity and write 2 chunks (1 data, 1 parity). In the worst case scenario the first or last 3 chunks will not be aligned with stripes. The controller reads 1 chunk, calculates parity, writes out 3 chunks (2 data, 1 parity), then does the same thing again for the remaining 2 chunks of data.
Anyhow, for large sequential reads and writes there's really not a big performance benefit (if any). The OS will tend to combine and rearrange reads and writes to be sequential, and the hardware RAID controller will do the same using its cache. I've tested this once with a good RAID controller, and bonnie++ (which benchmarks this kind of access) gave almost the same numbers with and without the stride option.
If disk access is random (read a block here, write a block there), there might be some benefit (however, the cache in the hardware RAID controller might kick in and save the day here too). It all depends on the particular RAID controller, the workload, and the amount and type (write back vs. write through) of cache on the controller.
I'd say in most cases using the stride option has very little effect if you have a large battery-backed write-back cache (and a good RAID controller, that is). If you are using software RAID, or have a small and/or write-through cache, the stride option might have some effect.
Matt wrote:
Can anyone explain how the disk elevator works and if there is any way to tweak it? ...
I assume you are talking about CentOS 3.x with the 2.4 kernel. I know it is heavily patched, but I don't think it has the complicated i/o schedulers you find in 2.6.
You use elvtune to tweak it. Basically you just define the max length the write queue can be before attention will be given to reads and likewise for the read queue. man elvtune for more information.
I do not have a 2.4 kernel box anymore but I think you can try 'elvtune -r 128 -w 128 /dev/device' and see if that helps. 'elvtune /dev/device' will show you the settings being used. If you are using ext3, you probably also want to look at tweaking /proc/sys/vm/bdflush. Look here for more information:
"[root@server ~]# elvtune /dev/hda ioctl get: Invalid argument
elvtune is only useful on older kernels; for 2.6 use IO scheduler sysfs tunables instead.."
That's what I get.
Matt
I did say:
"I assume you are talking about Centos 3.x with the 2.4 kernel"
For tweaking the io schedulers in 2.6, you need to mount the sysfs filesystem, which is under /sys on RHEL4/CentOS4. Play around with the values under /sys/block/devicename/queue/*.
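For example, on a RHEL4/CentOS4-era 2.6 kernel you would poke at entries like the following (exact files depend on the kernel build and the active scheduler, so treat this as a sketch; sda is a placeholder device):

ls /sys/block/sda/queue/                     # per-queue tunables for this device
cat /sys/block/sda/queue/nr_requests         # requests allowed in the queue
echo 256 > /sys/block/sda/queue/nr_requests
ls /sys/block/sda/queue/iosched/             # tunables of the scheduler in use
# On kernels of this vintage the scheduler itself is normally chosen at boot with elevator=.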