> -----Original Message-----
> From: Peter Kjellstrom [mailto:cap at nsc.liu.se]
> Sent: Tuesday, January 16, 2007 11:55 AM
> To: centos at centos.org
> Cc: Ross S. W. Walker
> Subject: Re: [CentOS] Disk Elevator
>
> On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote:
> ...
> > > To follow up on this (even if it is a little late), how is this
> > > affected by LVM use?
> > > I'm curious to know how (or if) this math changes with ext3 sitting on
> > > LVM on the raid array.
> >
> > Depends is the best answer. It really depends on LVM and the other block
> > layer devices. As the io requests descend down the different layers they
> > will enter multiple request_queues, and each request_queue will have an io
> > scheduler assigned to it, either the system default, one of the
> > others, or one of the block device's own, so it is hard to say. Only by
> > testing can you know for sure. In my tests LVM is very good, with
> > unnoticeable overhead going to hardware RAID, but if you use MD RAID
> > then your experience might be different.
>
> I don't think that is quite correct. AFAICT only the "real" devices (such
> as /dev/sda) have an io-scheduler. See the difference in ls /sys/block/..:
> # ls /sys/block/dm-0
> dev  range  removable  size  stat
> # ls /sys/block/sdc
> dev  device  queue  range  removable  size  stat

How a device presents itself in /proc or /sys is completely up to the
device. All block devices have a request_queue. You can look at the struct
for that queue in linux/blkdev.h, and then at the code in ll_rw_blk.c to
see how the queue is processed. Here is the structure anyway:

struct request_queue
{
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        struct request          *last_merge;
        elevator_t              elevator;

        /*
         * the queue request freelist, one for reads and one for writes
         */
        struct request_list     rq;

        request_fn_proc         *request_fn;
        merge_request_fn        *back_merge_fn;
        merge_request_fn        *front_merge_fn;
        merge_requests_fn       *merge_requests_fn;
        make_request_fn         *make_request_fn;
        prep_rq_fn              *prep_rq_fn;
        unplug_fn               *unplug_fn;
        merge_bvec_fn           *merge_bvec_fn;
        activity_fn             *activity_fn;
        issue_flush_fn          *issue_flush_fn;

        /*
         * Auto-unplugging state
         */
        struct timer_list       unplug_timer;
        int                     unplug_thresh;  /* After this many requests */
        unsigned long           unplug_delay;   /* After this many jiffies */
        struct work_struct      unplug_work;

        struct backing_dev_info backing_dev_info;

        /*
         * The queue owner gets to use this for whatever they like.
         * ll_rw_blk doesn't touch it.
         */
        void                    *queuedata;
        void                    *activity_data;

        /*
         * queue needs bounce pages for pages above this limit
         */
        unsigned long           bounce_pfn;
        int                     bounce_gfp;

        /*
         * various queue flags, see QUEUE_* below
         */
        unsigned long           queue_flags;

        /*
         * protects queue structures from reentrancy
         */
        spinlock_t              *queue_lock;

        /*
         * queue kobject
         */
        struct kobject          kobj;

        /*
         * queue settings
         */
        unsigned long           nr_requests;    /* Max # of requests */
        unsigned int            nr_congestion_on;
        unsigned int            nr_congestion_off;

        unsigned short          max_sectors;
        unsigned short          max_hw_sectors;
        unsigned short          max_phys_segments;
        unsigned short          max_hw_segments;
        unsigned short          hardsect_size;
        unsigned int            max_segment_size;

        unsigned long           seg_boundary_mask;
        unsigned int            dma_alignment;

        struct blk_queue_tag    *queue_tags;

        atomic_t                refcnt;

        unsigned int            in_flight;

        /*
         * sg stuff
         */
        unsigned int            sg_timeout;
        unsigned int            sg_reserved_size;
};

Every request queue needs an elevator/scheduler, otherwise, as you go down
the block layers, you can get contention/starvation between them.
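If you want to poke at this yourself, here is a quick sysfs sanity check
(a sketch only: it assumes a 2.6 kernel with sysfs mounted and reuses the
sdc/dm-0 names from the ls output above; runtime elevator switching via
queue/scheduler needs a new enough kernel, mainline 2.6.10 or so, otherwise
the elevator= boot option is the only knob):

#!/bin/sh
# Compare what sysfs exposes for a "real" scsi disk vs. a dm device.
# The active elevator is the one printed in square brackets, e.g. [cfq].
cat /sys/block/sdc/queue/scheduler

# On kernels that support it, the elevator can be switched per-queue
# at runtime (as root):
echo deadline > /sys/block/sdc/queue/scheduler

# Other per-queue tunables sit alongside it, e.g. the request count:
cat /sys/block/sdc/queue/nr_requests

# On this kernel dm-0 exposes no queue/ directory at all, which is
# exactly what the ls output above shows.
ls /sys/block/dm-0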
> As for read-ahead it's the reverse. Read-ahead has no effect (in my tests)
> when applied to the underlying device (such as sda) but has to be set on
> the lvm-device. Here are some performance numbers:

I too see little improvement from read-ahead with sequential io, but,
surprisingly and counter-intuitively, it seems to help with random read io,
as long as the read-ahead is kept low. Set the read-ahead to your stripe
size in sectors and you will be pleasantly surprised with the random read
numbers (a rough sketch of the arithmetic follows at the end of this
message).

> sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
> # time dd if=file10G of=/dev/null bs=1M
> real    0m59.465s
>
> sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 gives:
> # time dd if=file10G of=/dev/null bs=1M
> real    0m24.163s
>
> This on an 8 disk 3ware raid6 (hardware raid) with fully updated centos-4.4
> x86_64. The file dd read was 1000 MiB. 256 is the default read-ahead and
> blockdev --setra was used to change it.
>
> /Peter
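To make the stripe-size suggestion concrete, here is a rough sketch of the
arithmetic and the blockdev calls. The 64 KiB chunk size and the vg0-lv0
device name are purely assumptions for illustration (Peter never said what
the array's chunk size is), so substitute your real values:

#!/bin/sh
# Sketch: set read-ahead to one full stripe, expressed in 512-byte
# sectors, on the LVM device (where read-ahead actually takes effect
# according to the numbers quoted above).
# ASSUMED values -- adjust to match the real array and volume:
CHUNK_KB=64        # chunk (strip) size per disk, in KiB
DATA_DISKS=6       # 8-disk raid6 = 6 data disks + 2 parity
LV_DEV=/dev/mapper/vg0-lv0   # hypothetical name, use your LV's mapper node

# 1 KiB = 2 sectors, so stripe size in sectors = chunk * 2 * data disks
STRIPE_SECTORS=$((CHUNK_KB * 2 * DATA_DISKS))   # 64 * 2 * 6 = 768

blockdev --getra $LV_DEV                  # current read-ahead, in sectors
blockdev --setra $STRIPE_SECTORS $LV_DEV  # set it to one stripe
blockdev --getra $LV_DEV                  # verify

For sequential streams like the dd runs above you would still want the much
larger value Peter used; one stripe is just the starting point I would try
for a random-read workload.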