[CentOS] Disk Elevator
Ross S. W. Walker
rwalker at medallion.com
Tue Jan 16 17:21:53 UTC 2007
> -----Original Message-----
> From: Peter Kjellstrom [mailto:cap at nsc.liu.se]
> Sent: Tuesday, January 16, 2007 11:55 AM
> To: centos at centos.org
> Cc: Ross S. W. Walker
> Subject: Re: [CentOS] Disk Elevator
>
> On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote:
> ...
> > > To follow up on this (even if it is a little late), how is this
> > > affected by LVM use?
> > > I'm curious to know how (or if) this math changes with ext3 sitting
> > > on LVM on the raid array.
> >
> > Depends is the best answer. It really depends on LVM and the other
> > block layer devices. As the io requests descend down the different
> > layers they will enter multiple request_queues, and each request_queue
> > will have an io scheduler assigned to it, either the system default,
> > one of the others, or one of the block device's own, so it is hard to
> > say. Only by testing can you know for sure. In my tests LVM is very
> > good with unnoticeable overhead going to hardware RAID, but if you use
> > MD RAID then your experience might be different.
>
> I don't think that is quite correct. AFAICT only the "real" devices
> (such as /dev/sda) have an io-scheduler. See the difference in ls
> /sys/block/..:
> # ls /sys/block/dm-0
> dev range removable size stat
> # ls /sys/block/sdc
> dev device queue range removable size stat
How a device presents itself in /proc or /sys is completely up to the
device.

All block devices have a request_queue. You can look at the struct for
that queue in include/linux/blkdev.h, and then look at the code in
ll_rw_blk.c to see how the queue is processed.

Here is the structure anyway:
struct request_queue
{
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        struct request          *last_merge;
        elevator_t              elevator;

        /*
         * the queue request freelist, one for reads and one for writes
         */
        struct request_list     rq;

        request_fn_proc         *request_fn;
        merge_request_fn        *back_merge_fn;
        merge_request_fn        *front_merge_fn;
        merge_requests_fn       *merge_requests_fn;
        make_request_fn         *make_request_fn;
        prep_rq_fn              *prep_rq_fn;
        unplug_fn               *unplug_fn;
        merge_bvec_fn           *merge_bvec_fn;
        activity_fn             *activity_fn;
        issue_flush_fn          *issue_flush_fn;

        /*
         * Auto-unplugging state
         */
        struct timer_list       unplug_timer;
        int                     unplug_thresh;  /* After this many requests */
        unsigned long           unplug_delay;   /* After this many jiffies */
        struct work_struct      unplug_work;

        struct backing_dev_info backing_dev_info;

        /*
         * The queue owner gets to use this for whatever they like.
         * ll_rw_blk doesn't touch it.
         */
        void                    *queuedata;
        void                    *activity_data;

        /*
         * queue needs bounce pages for pages above this limit
         */
        unsigned long           bounce_pfn;
        int                     bounce_gfp;

        /*
         * various queue flags, see QUEUE_* below
         */
        unsigned long           queue_flags;

        /*
         * protects queue structures from reentrancy
         */
        spinlock_t              *queue_lock;

        /*
         * queue kobject
         */
        struct kobject          kobj;

        /*
         * queue settings
         */
        unsigned long           nr_requests;    /* Max # of requests */
        unsigned int            nr_congestion_on;
        unsigned int            nr_congestion_off;

        unsigned short          max_sectors;
        unsigned short          max_hw_sectors;
        unsigned short          max_phys_segments;
        unsigned short          max_hw_segments;
        unsigned short          hardsect_size;
        unsigned int            max_segment_size;

        unsigned long           seg_boundary_mask;
        unsigned int            dma_alignment;

        struct blk_queue_tag    *queue_tags;

        atomic_t                refcnt;

        unsigned int            in_flight;

        /*
         * sg stuff
         */
        unsigned int            sg_timeout;
        unsigned int            sg_reserved_size;
};
Every request queue needs an elevator/scheduler; otherwise, as requests
descend through the block layers, you can get contention and starvation
between them.
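To make that concrete, here is a rough sketch (mine, not from the kernel
source, and the mydev_* names are made up) of how a request-based driver
on a 2.6 kernel sets up its queue. blk_init_queue() is what allocates
the request_queue and attaches the default io scheduler to it:

#include <linux/module.h>
#include <linux/init.h>
#include <linux/errno.h>
#include <linux/spinlock.h>
#include <linux/blkdev.h>

static spinlock_t mydev_lock = SPIN_LOCK_UNLOCKED;
static request_queue_t *mydev_queue;

/* The elevator calls this when it decides a request should be
 * dispatched to the hardware. */
static void mydev_request(request_queue_t *q)
{
        struct request *req;

        while ((req = elv_next_request(q)) != NULL) {
                /* transfer req->sector .. req->current_nr_sectors here */
                end_request(req, 1);    /* 1 == success */
        }
}

static int __init mydev_init(void)
{
        /* Allocates the queue and hooks the system default elevator
         * into q->elevator. */
        mydev_queue = blk_init_queue(mydev_request, &mydev_lock);
        if (!mydev_queue)
                return -ENOMEM;
        return 0;
}
module_init(mydev_init);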
> As for read-ahead it's the reverse. Read-ahead has no effect (in my
> tests) when applied to the underlying device (such as sda) but has to
> be set on the lvm-device. Here are some performance numbers:
I too see little improvement from read-ahead with sequential io, but,
surprisingly and completely non-intuitively, it seems to help with
random read io as long as the read-ahead is kept low. Set the
read-ahead to your stripe size in sectors and you will be pleasantly
surprised with the random read numbers.
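For example (the chunk size is made up just to show the arithmetic): on
an 8 disk RAID6 with a 64 KiB chunk there are 6 data disks per stripe,
so a full stripe is 6 x 64 KiB = 384 KiB, or 768 512-byte sectors,
which you would set on the lvm device with something like
blockdev --setra 768 /dev/dm-0.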
> sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
> # time dd if=file10G of=/dev/null bs=1M
> real 0m59.465s
>
> sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 gives:
> # time dd if=file10G of=/dev/null bs=1M
> real 0m24.163s
>
> This on an 8 disk 3ware raid6 (hardware raid) with fully updated
> centos-4.4 x86_64. The file dd read was 1000 MiB. 256 is the default
> read-ahead and blockdev --setra was used to change it.
>
> /Peter
>
>