[CentOS] Re: mkfs.ext3 on a 9TB volume -- [Practices] Striping across intelligent RAID card

Mon Sep 12 17:30:04 UTC 2005
Bryan J. Smith <b.j.smith at ieee.org>

Francois Caen <frcaen at gmail.com> wrote:
> Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.

Any filesystem originally designed for 32-bit x86 is full of
signed 32-bit structures.  The 2^31 * 512 bytes = 1.1TB
(1TiB) limit comes from those structures counting 512-byte
sectors with a signed 32-bit value.

Ext3 has used a couple of different techniques to support
larger and larger volumes.  Depending on the hardware, kernel
(especially 2.4), etc..., there can be limits at 1, 2, 4, 8
and 16TiB.
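
Just to show where the top figure comes from (simple
arithmetic, nothing more): 32-bit block numbers times 4KiB
blocks caps out at 16TiB.

  # 2^32 blocks * 4KiB per block, expressed in TiB
  echo $(( 2**32 * 4096 / 2**40 ))TiB    # prints 16TiB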

Which is why the "common denominator" is 1.1TB (1TiB).  It
was rather infuriating to be in front of a client when I
attempted to mount a 2TB Ext3 volume over a SAN, created on
one Red Hat Enterprise Linux box, from another.  For the
"heck of it" I created a 1TB volume and tried again ... it
worked.

ReiserFS 3 has the same issue; it grew up as a PC LBA32
filesystem.  ReiserFS 4 is supposedly 64-bit clean.  Although
JFS for Linux came from OS/2 (32-bit PC) and not AIX/POWER
(true 64-bit), it was designed to be largely "64-bit clean"
too.  XFS came from IRIX/MIPS R4000+ (true 64-bit).

Both JFS and XFS would _not_ work on 32-bit Linux until
patched with complete POSIX32 Large File Support (LFS).  LFS
became standard in the x86-targeted Linux kernel 2.4 and
GLibC 2.2 (Red Hat Linux 7.x / Red Hat Enterprise Linux 2.1).
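
A quick way to sanity-check LFS on a given box and mount
point (the path here is just a placeholder) is to ask for
the file-offset width:

  getconf FILESIZEBITS /export/data    # 64 with LFS in place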

> Joshua, thanks for the reply on this.
> There's something kludgy about having to do softraid across
> 2 partitions before formatting.

RAID-0 is an _ideal_ fit for software RAID.  Striping is best
handled by the OS, which can schedule over multiple I/O
paths.  In 2x and 4x Socket 940 Opteron systems with at least
one AMD8131 (dual PCI-X channels), I put a 3Ware card on each
PCI-X channel connected to the same CPU and stripe with LVM.
The CPU interleaves writes directly over two (2) PCI-X
channels to two (2) 3Ware cards.  Ultimate I/O affinity, no
bus arbitration overhead, etc..., as well as the added
performance of striping.
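
In rough terms -- device names are just examples, each sdX
being one 3Ware unit on its own PCI-X channel, and the size
made up -- that LVM stripe looks like:

  pvcreate /dev/sda /dev/sdb
  vgcreate vg_data /dev/sda /dev/sdb
  # -i 2 stripes across both PVs, -I is stripe size in KiB
  lvcreate -i 2 -I 256 -L 2000G -n lv_data vg_data
  mkfs.ext3 /dev/vg_data/lv_data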

The only negative is if one 3Ware card dies.  But that's why
I keep a spare per N servers (typically 1 for every 4
servers, 8 cards total).

> It adds a layer of complexity and reduces reliability.

That varies.  Yes, various kernel storage approaches --
especially LVM2/Device Mapper (DM) at this point -- have
race conditions if you use more than one operation.  E.g.,
resizing and snapshots, RAID-1 (DM) atop RAID-0, etc...
But I _only_ use LVM/LVM2 with its native RAID-0 stripe, and
across two (2) 3Ware cards.

I've yet to have an issue.  But that's probably because LVM2
doesn't require DM for RAID-0.  DM is required for RAID-1,
snapshots, FRAID meta-data, etc...


Joshua Baker-LePain <jlb17 at duke.edu> wrote:
> I wouldn't call it that odd.  RH patches their kernels to a
> fair extent, both for stability and features.

Yep.  They are _very_ well trusted.  Now if they'd put that
effort into XFS too, I'd be a happy camper.

> > > mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4
> > > /dev/md0

BTW, aren't you worried about running out of inodes?

At the same time, have you benchmarked how much faster a full
fsck runs with 1 inode per 4MiB versus the standard 1 per
16-64KiB?

That would be an interesting test IMHO.
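
If you do try it, something along these lines would show
what you actually got and what a forced check costs (mount
point and device are placeholders, and the fsck run assumes
the volume is unmounted):

  df -i /mnt/bigraid                   # total vs. used inodes
  tune2fs -l /dev/md0 | grep -i inode  # ratio set at mkfs time
  time fsck.ext3 -f -n /dev/md0        # timed read-only check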

> Err, it's not a kludge and it's not a trick.  Those 2
> "disks" are hardware RAID5 arrays from 2 12 port 3ware 9500
> cards.  I like 3ware's hardware RAID, and those are the
> biggest (in terms of ports) cards 3ware makes.  
> So, I hook 12 disks up to each card, and the OS sees those
> as 2 SCSI disks.  I then do the software RAID to get 1)
> speed and 2) one partition to present to the users.  Folks
> (myself included) have been doing this for years.

I am in total agreement with you, with one exception.  I
always make 2 volumes (one System, one Data) per card.  (Yes,
I'm aware of the 9.2 firmware bug, which is why I have
largely avoided the 9500S, although 9.2.1.1 seems promising
now that it's officially released.)  So in my case, I'd have
two RAID-0 stripes.

BTW, supposedly 3Ware supports volumes across up to 4 cards.
Have you tried this?  I have not tried it myself.

> The one gotcha in this setup (other than not being able to
> boot from the big RAID5 arrays, since each is >2TiB)

Another reason to create a "System" volume and a "Data"
volume.  My "System" volume is typically 2 or 4 drives in
RAID-1 or RAID-10.  My "Data" volume is typically RAID-5, or
if I really need performance, RAID-10.

> is that the version of mdadm shipped with RHEL4 does not
> support array members bigger than 2TiB.  I had 
> to upgrade to an upstream release to get that support.

Which is why I use LVM (and now LVM2) for RAID-0.  I know
there are claims it is slower than MD (at least LVM2 is), but
I just like the management of LVM.  I guess I'm a typical
commercial UNIX weenie.
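
For contrast with the LVM stripe sketched above, the MD
route that trips the 2TiB-member limit in the stock RHEL4
mdadm would look roughly like this (placeholder device
names, chunk size just an example):

  mdadm --create /dev/md0 --level=0 --raid-devices=2 \
      --chunk=256 /dev/sda /dev/sdb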


Chris Mauritz <chrism at imntv.com> wrote:
> For what it's worth, I have also done RAID0 stripes of 2
> raid arrays to get *really* fast read/write performance
> when used for storing  uncompressed video.  Recently, when
> I was at Apple for a meeting, that was their engineer's
> preferred method for getting huge RAIDs....running
> software RAID volumes across multiple Xserve-RAID devices

Software RAID-0 at the OS level (and not some FRAID driver)
is _always_ going to be the _ultimate_ performer, because you
can span peripheral interconnects and cards.

> Perhaps I'm just extremely lucky, but I've not run into
> this magic 1TB barrier that I see bandied about here.

As I said, I've just run into it on kernel 2.4 distributions.
  
Any filesystem that grows up on a POSIX32 implementation
(especially pre-kernel 2.4 / GLibC 2.2 before LFS was
standard) is going to have signed 32-bit int structures.

I'm sure Tweedie and the gang have gotten around all of them
in kernel 2.6 by now.  But at the same time, I don't entirely
trust how they are doing it.

> Unfortunately, a lot of the documentation and FAQs are
> quite out of date which can lead to some confusion.

Yeah.  LVM2 and Device Mapper (DM) are a real PITA if you
start playing with newer developments, and race conditions
seem to be never-ending.

But when it comes to using intelligent 3Ware RAID with just
LVM2 for RAID-0, it has worked flawlessly for me on
kernel 2.6.



-- 
Bryan J. Smith                | Sent from Yahoo Mail
mailto:b.j.smith at ieee.org     |  (please excuse any
http://thebs413.blogspot.com/ |   missing headers)