-----Original Message-----
From: Peter Kjellstrom [mailto:cap@nsc.liu.se]
Sent: Tuesday, January 16, 2007 11:55 AM
To: centos@centos.org
Cc: Ross S. W. Walker
Subject: Re: [CentOS] Disk Elevator
On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote: ...
To follow up on this (even if it is a little late), how is this affected by LVM use? I'm curious to know how (or if) this math changes with ext3 sitting on LVM on the raid array.
"It depends" is the best answer. It really depends on LVM and the other block layer devices. As the io requests descend through the different layers they will enter multiple request_queues; each request_queue will have an io scheduler assigned to it (the system default, one of the others, or the block device's own), so it is hard to say. Only by testing can you know for sure. In my tests LVM is very good, with unnoticeable overhead going to hardware RAID, but if you use MD RAID your experience might be different.
I don't think that is quite correct. AFAICT only the "real" devices (such as /dev/sda) have an io-scheduler. See the difference in ls /sys/block/..:

# ls /sys/block/dm-0
dev  range  removable  size  stat
# ls /sys/block/sdc
dev  device  queue  range  removable  size  stat
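A minimal C sketch of the check Peter is doing by eye: it reports whether a block device's sysfs directory has a `queue` subdirectory. The function name `has_queue_dir` is hypothetical, and the sysfs root is passed in as a parameter so the logic can be exercised against a scratch directory. What it reports for a dm device depends on the kernel; in the CentOS 4.4 listing, dm-0 had no `queue` directory while sdc did.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper: return 1 if <sysfs_block>/<dev>/queue exists and
 * is a directory, 0 otherwise (no such device, no queue directory, or
 * the path does not fit in the buffer). */
int has_queue_dir(const char *sysfs_block, const char *dev)
{
    char path[4096];
    struct stat st;
    int n = snprintf(path, sizeof path, "%s/%s/queue", sysfs_block, dev);

    if (n < 0 || (size_t)n >= sizeof path)
        return 0;               /* path was truncated */
    if (stat(path, &st) != 0)
        return 0;               /* nothing at that path */
    return S_ISDIR(st.st_mode) ? 1 : 0;
}
```

On a live system it would be called as `has_queue_dir("/sys/block", "dm-0")`.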
How a device presents itself in /proc or /sys is completely up to the device.
All block devices have a request_queue. You can look at the definition of that struct in linux/blkdev.h, and then at the code in ll_rw_blk.c to see how the queue is processed.
Here is the structure anyways:
struct request_queue
{
    /*
     * Together with queue_head for cacheline sharing
     */
    struct list_head        queue_head;
    struct request          *last_merge;
    elevator_t              elevator;

    /*
     * the queue request freelist, one for reads and one for writes
     */
    struct request_list     rq;

    request_fn_proc         *request_fn;
    merge_request_fn        *back_merge_fn;
    merge_request_fn        *front_merge_fn;
    merge_requests_fn       *merge_requests_fn;
    make_request_fn         *make_request_fn;
    prep_rq_fn              *prep_rq_fn;
    unplug_fn               *unplug_fn;
    merge_bvec_fn           *merge_bvec_fn;
    activity_fn             *activity_fn;
    issue_flush_fn          *issue_flush_fn;

    /*
     * Auto-unplugging state
     */
    struct timer_list       unplug_timer;
    int                     unplug_thresh;  /* After this many requests */
    unsigned long           unplug_delay;   /* After this many jiffies */
    struct work_struct      unplug_work;

    struct backing_dev_info backing_dev_info;

    /*
     * The queue owner gets to use this for whatever they like.
     * ll_rw_blk doesn't touch it.
     */
    void                    *queuedata;

    void                    *activity_data;

    /*
     * queue needs bounce pages for pages above this limit
     */
    unsigned long           bounce_pfn;
    int                     bounce_gfp;

    /*
     * various queue flags, see QUEUE_* below
     */
    unsigned long           queue_flags;

    /*
     * protects queue structures from reentrancy
     */
    spinlock_t              *queue_lock;

    /*
     * queue kobject
     */
    struct kobject          kobj;

    /*
     * queue settings
     */
    unsigned long           nr_requests;    /* Max # of requests */
    unsigned int            nr_congestion_on;
    unsigned int            nr_congestion_off;

    unsigned short          max_sectors;
    unsigned short          max_hw_sectors;
    unsigned short          max_phys_segments;
    unsigned short          max_hw_segments;
    unsigned short          hardsect_size;
    unsigned int            max_segment_size;

    unsigned long           seg_boundary_mask;
    unsigned int            dma_alignment;

    struct blk_queue_tag    *queue_tags;

    atomic_t                refcnt;

    unsigned int            in_flight;

    /*
     * sg stuff
     */
    unsigned int            sg_timeout;
    unsigned int            sg_reserved_size;
};
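The "Auto-unplugging state" fields suggest how a plugged queue gets released: after `unplug_thresh` requests or after `unplug_delay` jiffies. Plugging briefly holds requests back so the elevator has a chance to merge and sort them before they reach the driver. Below is a deliberately simplified model of that rule, not the actual code in ll_rw_blk.c; `should_unplug` and the example values in the test are hypothetical.

```c
#include <stdbool.h>

/* Simplified model (not kernel code) of the two auto-unplug triggers
 * named in struct request_queue. queued: requests waiting on the
 * plugged queue; plugged_jiffies: time since the queue was plugged. */
bool should_unplug(unsigned int queued, int unplug_thresh,
                   unsigned long plugged_jiffies, unsigned long unplug_delay)
{
    /* "After this many requests" */
    if (unplug_thresh > 0 && queued >= (unsigned int)unplug_thresh)
        return true;
    /* "After this many jiffies" (in the kernel, unplug_timer fires) */
    return plugged_jiffies >= unplug_delay;
}
```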
Every request queue needs an elevator/scheduler; otherwise, as you go down the block layers, you can get contention/starvation between them.
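On kernels that expose a per-device `/sys/block/<dev>/queue/scheduler` file (it may not exist on the 2.6.9-based CentOS 4 kernel discussed here), the file lists the available elevators with the active one in square brackets, for example `noop anticipatory deadline [cfq]`. A small sketch of extracting the active name from that text; `active_scheduler` is a hypothetical helper, not a kernel or libc API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the bracketed (active) scheduler name from
 * the text of a queue/scheduler file into out. Returns 0 on success,
 * -1 if there is no complete [name] entry or it does not fit in out. */
int active_scheduler(const char *text, char *out, size_t outlen)
{
    const char *open = strchr(text, '[');
    const char *close;
    size_t len;

    if (open == NULL)
        return -1;
    close = strchr(open + 1, ']');
    if (close == NULL)
        return -1;
    len = (size_t)(close - open - 1);
    if (len + 1 > outlen)
        return -1;              /* leave room for the terminating NUL */
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```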
As for read-ahead, it's the reverse: read-ahead has no effect (in my tests) when applied to the underlying device (such as sda), but has to be set on the LVM device. Here are some performance numbers:
I too see little improvement from read-ahead with sequential io, but, surprisingly and counter-intuitively, it seems to help with random read io, as long as the read-ahead values are kept low. Set the read-ahead to your stripe size in sectors and you will be pleasantly surprised by the random read numbers.
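Read-ahead values given to `blockdev --setra` are counted in 512-byte sectors, so "stripe size in sectors" is a unit conversion. A sketch assuming "stripe size" means the full data stripe (per-disk chunk size times the number of data disks); pass `data_disks = 1` if the per-disk chunk is meant instead. The function names and the 64 KiB chunk below are hypothetical.

```c
/* Read-ahead (blockdev --setra) is counted in 512-byte sectors. */
#define SECTOR_BYTES 512UL

/* Full data-stripe size in 512-byte sectors.
 * chunk_kib:  per-disk chunk size in KiB (hypothetical input)
 * data_disks: disks holding data in each stripe, i.e. members minus
 *             parity disks (n - 2 for RAID6). */
unsigned long stripe_sectors(unsigned long chunk_kib, unsigned long data_disks)
{
    return chunk_kib * 1024UL * data_disks / SECTOR_BYTES;
}

/* Convert a read-ahead value in sectors to KiB. */
unsigned long ra_sectors_to_kib(unsigned long sectors)
{
    return sectors * SECTOR_BYTES / 1024UL;
}
```

For example, a 64 KiB chunk across the 6 data disks of an 8-disk RAID6 gives `stripe_sectors(64, 6)` = 768 sectors, while the default read-ahead of 256 sectors is 128 KiB and 8192 sectors is 4096 KiB.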
sdc:256,dm-0:256 and sdc:8192,dm-0:256 give:
# time dd if=file10G of=/dev/null bs=1M
real    0m59.465s

sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 give:
# time dd if=file10G of=/dev/null bs=1M
real    0m24.163s
This is on an 8-disk 3ware RAID6 (hardware RAID) with a fully updated CentOS 4.4 x86_64. The file dd read was 1000 MiB. 256 is the default read-ahead, and blockdev --setra was used to change it.
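`blockdev --getra` and `--setra` are, as far as I know, thin wrappers around the `BLKRAGET` and `BLKRASET` ioctls from `<linux/fs.h>`, both counted in 512-byte sectors. A minimal Linux-only sketch of doing the same from C; the helper names are hypothetical, setting the value needs root, and following Peter's results the target would be the LVM device node rather than the underlying disk.

```c
#include <fcntl.h>
#include <linux/fs.h>      /* BLKRAGET, BLKRASET */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: read a block device's read-ahead in 512-byte
 * sectors. Returns 0 on success, -1 on any error. */
int get_readahead(const char *dev, long *sectors)
{
    int fd = open(dev, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = ioctl(fd, BLKRAGET, sectors);
    close(fd);
    return rc < 0 ? -1 : 0;
}

/* Hypothetical helper: set a block device's read-ahead in 512-byte
 * sectors. Needs CAP_SYS_ADMIN. Returns 0 on success, -1 on error. */
int set_readahead(const char *dev, unsigned long sectors)
{
    int fd = open(dev, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = ioctl(fd, BLKRASET, sectors);
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

Used as, say, `set_readahead("/dev/VolGroup/LogVol", 8192)`, where the device path is a placeholder for your own logical volume.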
/Peter