[CentOS] Disk Elevator

Tue Jan 16 15:37:00 UTC 2007
Ross S. W. Walker <rwalker at medallion.com>

> -----Original Message-----
> From: centos-bounces at centos.org 
> [mailto:centos-bounces at centos.org] On Behalf Of Jim Perrin
> Sent: Tuesday, January 16, 2007 9:37 AM
> To: CentOS mailing list
> Subject: Re: [CentOS] Disk Elevator
> 
> > > > Quoting "Ross S. W. Walker" <rwalker at medallion.com>:
> > > >
> > > > > The biggest performance gain you can achieve on a raid array
> > > > > is to make sure you format the volume aligned to your raid
> > > > > stripe size. For example if you have a 4 drive raid 5 and it
> > > > > is using 64K chunks, your stripe size will be 256K. Given a
> > > > > 4K filesystem block size you would then have a stride of 64
> > > > > (256/4), so when you format your volume:
> > > > >
> > > > > mke2fs -E stride=64 (other needed options: -j for ext3, -N
> > > > > <# of inodes> for extended # of i-nodes, -O dir_index speeds
> > > > > up directory searches for large # of files) /dev/XXXX
> > > >
> > > > Shouldn't the argument for the stride option be how many file
> > > > system blocks there are per stripe?  After all, there's no way
> > > > for the OS to guess what RAID level you are using.  For a 4
> > > > disk RAID5 with 64k chunks and 4k file system blocks you have
> > > > only 48 file system blocks per stripe ((4-1)x64k/4k=48).  So it
> > > > should be -E stride=48 in this particular case.  If it was a 4
> > > > disk RAID0 array, then it would be 64 (4x64k/4k=64).  If it was
> > > > a 4 disk RAID10 array, then it would be 32 ((4/2)*64k/4k=32).
> > > > Or at least that's the way I understood it by reading the man
> > > > page.
> > >
> > > You are correct, leave one of the chunks off for the parity, so
> > > for a 4 disk raid5 stride=48. I had just computed all 4 chunks
> > > as part of the stride.
> > >
> > > BTW that parity chunk still needs to be in memory to avoid the
> > > read on it, no? Wouldn't a stride of 64 help in that case? And
> > > if the stride leaves out the parity chunk, won't successive
> > > read-aheads cause a continuous wrap of the stripe, which will
> > > negate the effect of the stride by not having the complete
> > > stripe cached?
> 
> > For read-ahead, you would set this through blockdev --setra X
> > /dev/YY, and use a multiple of the # of sectors in a stripe, so
> > for a 256K stripe set the read-ahead to 512, 1024, or 2048,
> > depending on whether the io is mostly random or mostly sequential
> > (bigger for sequential, smaller for random).
> 
> 
> To follow up on this (even if it is a little late), how is this
> affected by LVM use?
> I'm curious to know how (or if) this math changes with ext3 sitting on
> LVM on the raid array.
> 

"It depends" is the best answer. It really depends on LVM and the other
block layer devices. As the io requests descend through the different
layers they will enter multiple request_queues, and each request_queue
will have an io scheduler assigned to it, either the system default, one
of the others, or one of the block device's own, so it is hard to say.
Only by testing can you know for sure. In my tests LVM is very good,
with unnoticeable overhead going to hardware RAID, but if you use MD
RAID then your experience might be different.
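
If you want to see which elevator a given request_queue ended up with,
or try a different one, the 2.6 kernels expose it through sysfs. Rough
sketch only: sda is just a placeholder, and whether the LVM/MD devices
show a scheduler there depends on the kernel:

# list the available elevators, the active one is shown in brackets
cat /sys/block/sda/queue/scheduler

# switch that queue's elevator at runtime
echo deadline > /sys/block/sda/queue/scheduler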

Ext3
 |
VFS
 |
Page Cache
 |
LVM request_queue (io scheduler)
 |
LVM
 |
MD request_queue (io scheduler)
 |
MD
 |
-----------------
|   |   |   |   |
que que que que que (io scheduler)
|   |   |   |   |
sda sdb sdc sdd sde
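
To tie the numbers from earlier in the thread together, this is roughly
what the commands look like for the 4 disk RAID5 with 64K chunks and 4K
filesystem blocks (a sketch only, /dev/XXXX is the same placeholder used
above, and the stride and read-ahead values follow the thread's
calculation):

# data per stripe = (4-1) x 64K = 192K, stride = 192K / 4K = 48
mke2fs -j -O dir_index -E stride=48 /dev/XXXX

# full stripe (with parity) = 256K = 512 sectors, use a multiple of that
blockdev --setra 512 /dev/XXXX    # mostly random io
blockdev --setra 2048 /dev/XXXX   # mostly sequential io
blockdev --getra /dev/XXXX        # check what is currently set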

Hope this helps clarify.
