> -----Original Message-----
> From: centos-bounces at centos.org
> [mailto:centos-bounces at centos.org] On Behalf Of Ross S. W. Walker
> Sent: Tuesday, January 16, 2007 12:22 PM
> To: Peter Kjellstrom; centos at centos.org
> Subject: RE: [CentOS] Disk Elevator
>
> > -----Original Message-----
> > From: Peter Kjellstrom [mailto:cap at nsc.liu.se]
> > Sent: Tuesday, January 16, 2007 11:55 AM
> > To: centos at centos.org
> > Cc: Ross S. W. Walker
> > Subject: Re: [CentOS] Disk Elevator
> >
> > On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote:
> > ...
> > > > To follow up on this (even if it is a little late), how is this
> > > > affected by LVM use?
> > > > I'm curious to know how (or if) this math changes with ext3
> > > > sitting on LVM on the raid array.
> > >
> > > Depends is the best answer. It really depends on LVM and the other
> > > block layer devices. As the io requests descend through the
> > > different layers they will enter multiple request_queues. Each
> > > request_queue will have an io scheduler assigned to it, either the
> > > system default, one of the others, or one of the block device's
> > > own, so it is hard to say. Only by testing can you know for sure.
> > > In my tests LVM is very good, with unnoticeable overhead going to
> > > hardware RAID, but if you use MD RAID then your experience might
> > > be different.
> >
> > I don't think that is quite correct. AFAICT only the "real" devices
> > (such as /dev/sda) have an io-scheduler. See the difference of ls
> > /sys/block/..:
> > # ls /sys/block/dm-0
> > dev  range  removable  size  stat
> > # ls /sys/block/sdc
> > dev  device  queue  range  removable  size  stat
>
> How a device presents itself in /proc or /sys is completely up to the
> device.
>
> All block devices have a request_queue. You can look at the struct of
> said queue in linux/blkdev.h, and then look at the code in ll_rw_blk.c
> to see how said queue is processed.
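To see on a given box what Peter's ls output shows, a small shell sketch
like the one below walks /sys/block and prints each device's
queue/scheduler file where one is exposed. The function name
list_schedulers and its optional root argument are made up for
illustration, and I am assuming the kernel provides a queue/scheduler
attribute at all (not every 2.6 kernel does). A missing entry only tells
you what the driver exported to sysfs, not whether a request_queue
exists.

```shell
#!/bin/sh
# list_schedulers [ROOT]
# Hypothetical helper: for each block device directory under ROOT
# (default /sys/block), print the contents of queue/scheduler if the
# file is readable, otherwise note that sysfs exposes none.
list_schedulers() {
    root="${1:-/sys/block}"
    for dev in "$root"/*; do
        [ -d "$dev" ] || continue
        name="${dev##*/}"
        if [ -r "$dev/queue/scheduler" ]; then
            printf '%s: %s\n' "$name" "$(cat "$dev/queue/scheduler")"
        else
            printf '%s: (no queue/scheduler in sysfs)\n' "$name"
        fi
    done
}
```

Calling list_schedulers with no argument reads the live /sys/block; a
different directory can be passed in to try it against a copy.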
>
> Here is the structure anyways:
>
> struct request_queue
> {
>         /*
>          * Together with queue_head for cacheline sharing
>          */
>         struct list_head        queue_head;
>         struct request          *last_merge;
>         elevator_t              elevator;
>
>         /*
>          * the queue request freelist, one for reads and one for writes
>          */
>         struct request_list     rq;
>
>         request_fn_proc         *request_fn;
>         merge_request_fn        *back_merge_fn;
>         merge_request_fn        *front_merge_fn;
>         merge_requests_fn       *merge_requests_fn;
>         make_request_fn         *make_request_fn;
>         prep_rq_fn              *prep_rq_fn;
>         unplug_fn               *unplug_fn;
>         merge_bvec_fn           *merge_bvec_fn;
>         activity_fn             *activity_fn;
>         issue_flush_fn          *issue_flush_fn;
>
>         /*
>          * Auto-unplugging state
>          */
>         struct timer_list       unplug_timer;
>         int                     unplug_thresh;  /* After this many requests */
>         unsigned long           unplug_delay;   /* After this many jiffies */
>         struct work_struct      unplug_work;
>
>         struct backing_dev_info backing_dev_info;
>
>         /*
>          * The queue owner gets to use this for whatever they like.
>          * ll_rw_blk doesn't touch it.
>          */
>         void                    *queuedata;
>
>         void                    *activity_data;
>
>         /*
>          * queue needs bounce pages for pages above this limit
>          */
>         unsigned long           bounce_pfn;
>         int                     bounce_gfp;
>
>         /*
>          * various queue flags, see QUEUE_* below
>          */
>         unsigned long           queue_flags;
>
>         /*
>          * protects queue structures from reentrancy
>          */
>         spinlock_t              *queue_lock;
>
>         /*
>          * queue kobject
>          */
>         struct kobject          kobj;
>
>         /*
>          * queue settings
>          */
>         unsigned long           nr_requests;    /* Max # of requests */
>         unsigned int            nr_congestion_on;
>         unsigned int            nr_congestion_off;
>
>         unsigned short          max_sectors;
>         unsigned short          max_hw_sectors;
>         unsigned short          max_phys_segments;
>         unsigned short          max_hw_segments;
>         unsigned short          hardsect_size;
>         unsigned int            max_segment_size;
>
>         unsigned long           seg_boundary_mask;
>         unsigned int            dma_alignment;
>
>         struct blk_queue_tag    *queue_tags;
>
>         atomic_t                refcnt;
>
>         unsigned int            in_flight;
>
>         /*
>          * sg stuff
>          */
>         unsigned int            sg_timeout;
>         unsigned int            sg_reserved_size;
> };
>
> Every request queue needs an elevator/scheduler; otherwise, as you go
> down the block layers, you can get contention/starvation between them.
>
> > As for read-ahead it's the reverse. Read-ahead has no effect (in my
> > tests) when applied to the underlying device (such as sda) but has
> > to be set on the lvm-device. Here are some performance numbers:

Oh, and the read-aheads set by blockdev tend to be inherited by the
block device driver that uses that device as its backing device. When
sdX is created it defaults to 256 sectors, and partitions mapped on it
also get 256 sectors. When MD associates with the drive or its
partitions it takes the 256 read-ahead, and when LVM comes on top it
takes MD's 256 read-ahead. This value is then passed up to the VFS
routines, which use it to determine the amount of read-ahead to do.
Since VFS associates only with its immediate backing device, setting
the read-ahead on a lower backing device has no effect. Set it on the
immediate backing device, in this case LVM.

> I too see little improvement from read-ahead with sequential io, but
> surprisingly, and completely non-intuitively, it seems to help with
> random read io, as long as the read-aheads are kept low. Set the
> read-ahead to your stripe size in sectors and you will be pleasantly
> surprised with random read #s.
>
> > sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
> > # time dd if=file10G of=/dev/null bs=1M
> > real    0m59.465s
> >
> > sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 gives:
> > # time dd if=file10G of=/dev/null bs=1M
> > real    0m24.163s
> >
> > This is on an 8 disk 3ware raid6 (hardware raid) with fully updated
> > centos-4.4 x86_64. The file dd read was 1000 MiB. 256 is the default
> > read-ahead and blockdev --setra was used to change it.
> >
> > /Peter
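
As a rough sketch of the stripe-size suggestion: blockdev --setra and
--getra work in 512-byte sectors, so a stripe size given in KiB has to
be doubled. The helper names, the 64 KiB stripe and the /dev/dm-0 device
are assumptions for illustration only, not values from the setup
discussed here. The command is printed rather than run so it can be
checked first.

```shell
#!/bin/sh
# Convert a stripe size in KiB to 512-byte sectors
# (the unit blockdev --setra expects).
kib_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Hypothetical helper: print (do not run) the blockdev command that
# would set the read-ahead of device $1 to a stripe of $2 KiB.
setra_cmd() {
    dev="$1"
    stripe_kib="$2"
    echo "blockdev --setra $(kib_to_sectors "$stripe_kib") $dev"
}

# Example with an assumed 64 KiB stripe on the LVM device:
#   setra_cmd /dev/dm-0 64
#   prints: blockdev --setra 128 /dev/dm-0
```

Run the printed command as root against the top-most device (the LVM
volume here), then confirm the value with blockdev --getra on the same
device.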