[CentOS] Disk Elevator

Tue Jan 16 17:21:53 UTC 2007

> -----Original Message-----
> From: Peter Kjellstrom [mailto:cap at nsc.liu.se] 
> Sent: Tuesday, January 16, 2007 11:55 AM
> To: centos at centos.org
> Cc: Ross S. W. Walker
> Subject: Re: [CentOS] Disk Elevator
> 
> On Tuesday 16 January 2007 16:37, Ross S. W. Walker wrote:
> ...
> > > To follow up on this (even if it is a little late), how is this
> > > affected by LVM use?
> > > I'm curious to know how (or if) this math changes with ext3 sitting on
> > > LVM on the raid array.
> >
> > Depends is the best answer. It really depends on LVM and the other block
> > layer devices. As the io requests descend down the different layers they
> > will enter multiple request_queues, each request_queue will have an io
> > scheduler assigned to it, either the system default or one of the
> > others, or one of the block device's own, so it is hard to say. Only by
> > testing can you know for sure. In my tests LVM is very good with
> > unnoticeable overhead going to hardware RAID, but if you use MD RAID
> > then your experience might be different.
> 
> I don't think that is quite correct. AFAICT only the "real" devices (such
> as /dev/sda) have an io-scheduler. See the difference of ls /sys/block/..:
>  # ls /sys/block/dm-0
>  dev  range  removable  size  stat
>  # ls /sys/block/sdc
>  dev  device  queue  range  removable  size  stat

How a device presents itself in /proc or /sys is completely up to the
device.
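
If you want to repeat the ls comparison quoted above across every block
device at once, something like the following should do it. This is only a
rough sketch: the function name list_queues is made up for illustration, and
the queue/scheduler file is an assumption about newer kernels that may not
exist on the kernel discussed here. Keep in mind it only reports what sysfs
shows, which is not the same as what the block layer holds internally.

```shell
#!/bin/sh
# list_queues is a hypothetical helper name, not an existing tool.
# Usage: list_queues [sysfs_block_dir]   (defaults to /sys/block)
list_queues() {
    root="${1:-/sys/block}"
    for dev in "$root"/*; do
        # Skip anything that is not a device directory (or an unmatched glob).
        [ -d "$dev" ] || continue
        name=${dev##*/}
        if [ -r "$dev/queue/scheduler" ]; then
            # Assumption: kernels that export this file list the schedulers here.
            printf '%s: %s\n' "$name" "$(cat "$dev/queue/scheduler")"
        elif [ -d "$dev/queue" ]; then
            printf '%s: queue directory present\n' "$name"
        else
            printf '%s: no queue directory in sysfs\n' "$name"
        fi
    done
}

# Example call against the live system:
# list_queues /sys/block
```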

All block devices have a request_queue. The struct for the queue is
defined in linux/blkdev.h, and the code in ll_rw_blk.c shows how the
queue is processed.

Here is the structure anyway:

struct request_queue
{
        /*
         * Together with queue_head for cacheline sharing
         */
        struct list_head        queue_head;
        struct request          *last_merge;
        elevator_t              elevator;

        /*
         * the queue request freelist, one for reads and one for writes
         */
        struct request_list     rq;

        request_fn_proc         *request_fn;
        merge_request_fn        *back_merge_fn;
        merge_request_fn        *front_merge_fn;
        merge_requests_fn       *merge_requests_fn;
        make_request_fn         *make_request_fn;
        prep_rq_fn              *prep_rq_fn;
        unplug_fn               *unplug_fn;
        merge_bvec_fn           *merge_bvec_fn;
        activity_fn             *activity_fn;
        issue_flush_fn          *issue_flush_fn;

        /*
         * Auto-unplugging state
         */
        struct timer_list       unplug_timer;
        int                     unplug_thresh;  /* After this many requests */
        unsigned long           unplug_delay;   /* After this many jiffies */
        struct work_struct      unplug_work;

        struct backing_dev_info backing_dev_info;

        /*
         * The queue owner gets to use this for whatever they like.
         * ll_rw_blk doesn't touch it.
         */
        void                    *queuedata;

        void                    *activity_data;

        /*
         * queue needs bounce pages for pages above this limit
         */
        unsigned long           bounce_pfn;
        int                     bounce_gfp;

        /*
         * various queue flags, see QUEUE_* below
         */
        unsigned long           queue_flags;

        /*
         * protects queue structures from reentrancy
         */
        spinlock_t              *queue_lock;

        /*
         * queue kobject
         */
        struct kobject kobj;

        /*
         * queue settings
         */
        unsigned long           nr_requests;    /* Max # of requests */
        unsigned int            nr_congestion_on;
        unsigned int            nr_congestion_off;

        unsigned short          max_sectors;
        unsigned short          max_hw_sectors;
        unsigned short          max_phys_segments;
        unsigned short          max_hw_segments;
        unsigned short          hardsect_size;
        unsigned int            max_segment_size;

        unsigned long           seg_boundary_mask;
        unsigned int            dma_alignment;

        struct blk_queue_tag    *queue_tags;

        atomic_t                refcnt;

        unsigned int            in_flight;

        /*
         * sg stuff
         */
        unsigned int            sg_timeout;
        unsigned int            sg_reserved_size;
};

Every request queue needs an elevator/scheduler; otherwise, as you go
down the block layers, you can get contention/starvation between them.

> As for read-ahead it's the reverse. Read-ahead has no effect (in my tests)
> when applied to the underlying device (such as sda) but has to be set on the
> lvm-device. Here are some performance numbers:

I too see little improvement from read-ahead with sequential io, but
surprisingly, and completely counter-intuitively, it seems to help with
random read io, as long as the read-ahead is kept low. Set the read-ahead
to your stripe size in sectors and you will be pleasantly surprised by
your random read numbers.
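
To make "stripe size in sectors" concrete, here is a small sketch of the
conversion. blockdev --setra takes its value in 512-byte sectors, so a stripe
size in KiB just needs to be doubled. The 64 KiB stripe and the /dev/dm-0
device below are made-up example values, and stripe_kib_to_sectors is a
hypothetical helper name; substitute whatever your array actually uses.

```shell
#!/bin/sh
# Convert a stripe size given in KiB to 512-byte sectors,
# the unit that `blockdev --setra` expects.
stripe_kib_to_sectors() {
    echo $(( $1 * 1024 / 512 ))
}

# Hypothetical values: a 64 KiB stripe and an LVM device at /dev/dm-0.
# stripe_kib_to_sectors 64                                    # prints 128
# blockdev --setra "$(stripe_kib_to_sectors 64)" /dev/dm-0    # needs root
# blockdev --getra /dev/dm-0                                  # check the result
```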

> sdc:256,dm-0:256 and sdc:8192,dm-0:256 gives:
>  # time dd if=file10G of=/dev/null bs=1M
>  real    0m59.465s
> 
> sdc:8192,dm-0:256 and sdc:8192,dm-0:8192 gives:
>  # time dd if=file10G of=/dev/null bs=1M
>  real    0m24.163s
> 
> This is on an 8 disk 3ware raid6 (hardware raid) with fully updated centos-4.4
> x86_64. The file dd read was 1000 MiB. 256 is the default read-ahead and
> blockdev --setra was used to change it.
> 
> /Peter
> 
> 
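
For comparing the two dd timings quoted above, the ratio of the elapsed
times is the simplest figure, since it does not depend on the file size. A
minimal sketch, with speedup as a hypothetical helper name:

```shell
#!/bin/sh
# Ratio of two elapsed times (slow / fast), printed to two decimal places.
speedup() {
    awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.2f\n", slow / fast }'
}

# The two "real" times from the quoted dd runs:
# speedup 59.465 24.163    # prints 2.46
```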
