[CentOS] Disk Elevator

Tue Jan 16 17:41:26 UTC 2007
Ross S. W. Walker <rwalker at medallion.com>

> -----Original Message-----
> From: centos-bounces at centos.org 
> [mailto:centos-bounces at centos.org] On Behalf Of Ross S. W. Walker
> Sent: Tuesday, January 16, 2007 12:22 PM
> To: Peter Kjellstrom; centos at centos.org
> Subject: RE: [CentOS] Disk Elevator
> 
> > -----Original Message-----
> > From: Peter Kjellstrom [mailto:cap at nsc.liu.se] 
> > Sent: Tuesday, January 16, 2007 11:55 AM
> > To: centos at centos.org
> > Cc: Ross S. W. Walker
> > Subject: Re: [CentOS] Disk Elevator
> > 
> > On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote:
> > ...
> > > > To follow up on this (even if it is a little late), how is this
> > > > affected by LVM use?
> > > > I'm curious to know how (or if) this math changes with ext3
> > > > sitting on LVM on the raid array.
> > >
> > > "It depends" is the best answer. It really depends on LVM and the
> > > other block layer devices. As the io requests descend through the
> > > different layers they enter multiple request_queues. Each
> > > request_queue has an io scheduler assigned to it: either the
> > > system default, one of the others, or one of the block device's
> > > own, so it is hard to say. Only by testing can you know for sure.
> > > In my tests LVM is very good, with unnoticeable overhead going to
> > > hardware RAID, but if you use MD RAID then your experience might
> > > be different.
> > 
> > I don't think that is quite correct. AFAICT only the "real" devices
> > (such as /dev/sda) have an io-scheduler. See the difference in ls
> > /sys/block/..:
> >  # ls /sys/block/dm-0
> >  dev  range  removable  size  stat
> >  # ls /sys/block/sdc
> >  dev  device  queue  range  removable  size  stat
> 
> How a device presents itself in /proc or /sys is completely up to the
> device.
> 
> All block devices have a request_queue. You can look at the struct of
> said queue in linux/blkdev.h, and then at the code in ll_rw_blk.c to
> see how said queue is processed.
> 
> Here is the structure anyways:
> 
> struct request_queue
> {
>         /*
>          * Together with queue_head for cacheline sharing
>          */
>         struct list_head        queue_head;
>         struct request          *last_merge;
>         elevator_t              elevator;
> 
>         /*
>          * the queue request freelist, one for reads and one for writes
>          */
>         struct request_list     rq;
> 
>         request_fn_proc         *request_fn;
>         merge_request_fn        *back_merge_fn;
>         merge_request_fn        *front_merge_fn;
>         merge_requests_fn       *merge_requests_fn;
>         make_request_fn         *make_request_fn;
>         prep_rq_fn              *prep_rq_fn;
>         unplug_fn               *unplug_fn;
>         merge_bvec_fn           *merge_bvec_fn;
>         activity_fn             *activity_fn;
>         issue_flush_fn          *issue_flush_fn;
> 
>         /*
>          * Auto-unplugging state
>          */
>         struct timer_list       unplug_timer;
>         int                     unplug_thresh;  /* After this many requests */
>         unsigned long           unplug_delay;   /* After this many jiffies */
>         struct work_struct      unplug_work;
> 
>         struct backing_dev_info backing_dev_info;
> 
>         /*
>          * The queue owner gets to use this for whatever they like.
>          * ll_rw_blk doesn't touch it.
>          */
>         void                    *queuedata;
> 
>         void                    *activity_data;
> 
>         /*
>          * queue needs bounce pages for pages above this limit
>          */
>         unsigned long           bounce_pfn;
>         int                     bounce_gfp;
> 
>         /*
>          * various queue flags, see QUEUE_* below
>          */
>         unsigned long           queue_flags;
> 
>         /*
>          * protects queue structures from reentrancy
>          */
>         spinlock_t              *queue_lock;
> 
>         /*
>          * queue kobject
>          */
>         struct kobject kobj;
> 
>         /*
>          * queue settings
>          */
>         unsigned long           nr_requests;    /* Max # of requests */
>         unsigned int            nr_congestion_on;
>         unsigned int            nr_congestion_off;
> 
>         unsigned short          max_sectors;
>         unsigned short          max_hw_sectors;
>         unsigned short          max_phys_segments;
>         unsigned short          max_hw_segments;
>         unsigned short          hardsect_size;
>         unsigned int            max_segment_size;
> 
>         unsigned long           seg_boundary_mask;
>         unsigned int            dma_alignment;
> 
>         struct blk_queue_tag    *queue_tags;
> 
>         atomic_t                refcnt;
> 
>         unsigned int            in_flight;
> 
>         /*
>          * sg stuff
>          */
>         unsigned int            sg_timeout;
>         unsigned int            sg_reserved_size;
> };
> 
> Every request queue needs an elevator/scheduler, otherwise as you go
> down the block layers you can get contention/starvation between them.
> 
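For anyone who wants to check which elevator a queue is actually using:
where a device exposes a queue directory in sysfs, the file
/sys/block/<dev>/queue/scheduler lists the available schedulers with the
active one in square brackets. Below is a minimal Python sketch of
parsing such a line. The function names and the sample string are only
illustrative (the sample was not captured from a real machine), and a
device without a queue directory, like dm-0 in the listing above, has
nothing to read here.

```python
import re


def active_scheduler(line):
    """Return the bracketed (active) scheduler name, or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", line)
    return match.group(1) if match else None


def available_schedulers(line):
    """Return every scheduler name on the line, with the brackets stripped."""
    return [name.strip("[]") for name in line.split()]


# Illustrative sample of what queue/scheduler might contain.
sample = "noop anticipatory deadline [cfq]"
print(active_scheduler(sample))
print(available_schedulers(sample))
```

On a live box the line would come from reading the sysfs file for the
device in question rather than from a hard-coded string.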
> > As for read-ahead it's the reverse. Read-ahead has no effect (in my
> > tests) when applied to the underlying device (such as sda) but has
> > to be set on the lvm-device. Here are some performance numbers:

Oh, and the read-aheads set by blockdev tend to be inherited by the
block device driver that uses that device as its backing device.

When sdX is created it defaults to a read-ahead of 256 sectors. When
partitions are mapped they get 256 sectors. When MD associates with the
drive or its partitions it uses the 256 read-ahead, and when LVM comes
on top it uses the MD 256 read-ahead. This value is then passed up to
the VFS routines, which use it to determine the amount of read-ahead to
do. Since VFS associates only with its immediate backing device,
setting the read-ahead at a lower backing device has no effect. Set it
on the immediate backing device, in this case LVM.
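
To make the layering concrete, here is a small Python sketch of the rule
described above: only the top-most device's read-ahead is the one the
filesystem sees. It is a toy model under my own assumptions, not kernel
code. The function names and the list-of-tuples representation are made
up for illustration. The only hard fact used is that blockdev
--setra/--getra count 512-byte sectors.

```python
SECTOR_BYTES = 512  # blockdev --setra / --getra work in 512-byte sectors


def ra_sectors_to_kib(sectors):
    """Convert a read-ahead value in 512-byte sectors to KiB."""
    return sectors * SECTOR_BYTES // 1024


def effective_read_ahead(stack):
    """Return the (device, sectors) pair of the top-most layer.

    `stack` lists the layers from the lowest device to the one the
    filesystem sits on, e.g. [("sdc", 8192), ("dm-0", 256)].
    """
    if not stack:
        raise ValueError("empty device stack")
    return stack[-1]


stack = [("sdc", 8192), ("dm-0", 256)]
device, sectors = effective_read_ahead(stack)
print(device, sectors, "sectors =", ra_sectors_to_kib(sectors), "KiB")
```

With the stack shown, the 8192 set on sdc is ignored and the model
reports dm-0 at 256 sectors, i.e. 128 KiB.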
 
> I too see little improvement from read-ahead with sequential io, but
> surprisingly, and completely counter-intuitively, it seems to help
> with random read io, as long as the read-aheads are kept low. Set the
> read-ahead to your stripe size in sectors and you will be pleasantly
> surprised with random read #s.
> 
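The stripe-size-in-sectors conversion is simple arithmetic, sketched
here in Python. The 64 KiB stripe and the device path are hypothetical
values picked for illustration, and whether "stripe size" means the
per-disk chunk or the full stripe width is left to the reader's array
layout. The sketch only builds the command string and does not run it.

```python
SECTOR_BYTES = 512  # blockdev --setra takes a count of 512-byte sectors


def stripe_kib_to_sectors(stripe_kib):
    """Convert a stripe size in KiB to 512-byte sectors."""
    return stripe_kib * 1024 // SECTOR_BYTES


def setra_command(stripe_kib, device):
    """Build (but do not execute) the matching blockdev --setra command line."""
    return "blockdev --setra %d %s" % (stripe_kib_to_sectors(stripe_kib), device)


# Hypothetical example: 64 KiB stripe, made-up LVM device path.
print(setra_command(64, "/dev/mapper/vg0-lv0"))
```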
> > sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
> >  # time dd if=file10G of=/dev/null bs=1M
> >  real    0m59.465s
> > 
> > sdc:256,dm-0:8192 and sdc:8192,dm-0:8192 gives:
> >  # time dd if=file10G of=/dev/null bs=1M
> >  real    0m24.163s
> > 
> > This is on an 8-disk 3ware raid6 (hardware raid) with fully updated
> > centos-4.4 x86_64. The file dd read was 1000 MiB. 256 is the default
> > read-ahead and blockdev --setra was used to change it.
> > 
> > /Peter
> > 
> > 
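
For comparing the two dd runs quoted above, a short Python sketch that
parses the "real" lines from time and works out the speedup. The helper
names are my own. The dd command names file10G while the text says
1000 MiB, so the file size is left as a parameter of the throughput
helper rather than guessed.

```python
def parse_real(value):
    """Turn a `time` value such as "0m59.465s" into seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


def throughput_mib_s(size_mib, seconds):
    """Average throughput in MiB/s for a read of size_mib taking `seconds`."""
    return size_mib / seconds


slow = parse_real("0m59.465s")  # read-ahead of 256 on dm-0
fast = parse_real("0m24.163s")  # read-ahead of 8192 on dm-0
print("speedup: %.2fx" % (slow / fast))
print("time saved: %.3f s" % (slow - fast))
```

The speedup ratio does not depend on the file size, which is why it is
the figure computed directly here.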
