[CentOS] Re: mkfs.ext3 on a 9TB volume -- [Practices] Striping
across intelligent RAID card
Bryan J. Smith
b.j.smith at ieee.org
Mon Sep 12 17:30:04 UTC 2005
Francois Caen <frcaen at gmail.com> wrote:
> Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
Any filesystem originally designed for 32-bit x86 is full of
signed 32-bit structures. The 2^31 * 512 = 1.1TB (1TiB)
limit comes from those structures combined with a 512-byte
sector size.
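A quick sanity check on that arithmetic, as a Python sketch. The variable names are illustrative only and are not taken from any kernel source:

```python
# Illustrative arithmetic only; the names below are made up for this sketch.
SECTOR_BYTES = 512          # classic 512-byte sector
SIGNED_32_SECTORS = 2**31   # count of non-negative values in a signed 32-bit index

limit_bytes = SIGNED_32_SECTORS * SECTOR_BYTES
limit_tib = limit_bytes / 2**40    # binary terabytes (TiB)
limit_tb = limit_bytes / 10**12    # decimal terabytes (TB)

print(f"{limit_bytes} bytes = {limit_tib:.0f} TiB = {limit_tb:.1f} TB")
```

The same byte count reads as 1TiB in binary units and roughly 1.1TB in decimal units, which is why both figures show up for the same limit.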
Ext3 has used a couple of different techniques to support
larger and larger volumes. Depending on the hardware, kernel
(especially 2.4), etc..., there can be limits at 1, 2, 4, 8
Which is why the "common denominator" is 1.1TB (1TiB). It
was rather infuriating in front of a client when I attempted
to mount a 2TB Ext3 volume over a SAN, created on one Red
Hat Enterprise Linux system, from another. For the "heck of
it" -- I created a 1TB volume and tried again ... it worked.
ReiserFS 3 has the same issue; it grew up as a PC LBA32
filesystem. ReiserFS 4 is supposedly 64-bit clean. Although
JFS for Linux came from OS/2 (32-bit PC) and not AIX/Power
(true 64-bit), it was designed to be largely "64-bit clean"
too. XFS came from Irix/MIPS4000+ (true 64-bit).
Both JFS and XFS would _not_ work on 32-bit Linux until
patched with complete POSIX32 Large File Support (LFS). LFS
became standard in the x86 target Linux kernel 2.4 and GLibC
2.2 (Red Hat Linux 7.x / Red Hat Enterprise Linux 2.1).
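For context on why LFS matters: without it, file offsets on a 32-bit target are signed 32-bit values, so they top out just under 2GiB. The following is only a sketch of that range limit using Python's `struct` module; the helper name `fits_signed` is hypothetical and this is not how the kernel or GLibC implements anything:

```python
import struct

MAX_OFF32 = 2**31 - 1  # largest value a signed 32-bit offset can hold

def fits_signed(value, fmt):
    """Return True if value can be packed with the given struct format."""
    try:
        struct.pack(fmt, value)
        return True
    except struct.error:
        return False

two_gib = 2**31  # first offset past the signed 32-bit range

print(fits_signed(MAX_OFF32, "<i"))  # 32-bit signed: still in range
print(fits_signed(two_gib, "<i"))    # 32-bit signed: out of range
print(fits_signed(two_gib, "<q"))    # 64-bit signed: in range
```

A 64-bit offset type has no trouble with the same value, which is the gap LFS closes for 32-bit systems.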
> Joshua, thanks for the reply on this.
> There's something kludgy about having to do softraid across
> 2 partitions before formatting.
RAID-0 is an _ideal_ software RAID. Striping is best handled
by the OS, which can schedule over multiple I/O options. In
2x and 4x S940 Opteron systems with at least one AMD8131
(dual PCI-X channels), I put a 3Ware card on each PCI-X
channel connected to the same CPU and stripe with LVM. The
CPU interlaces writes directly over two (2) PCI-X channels to
two (2) 3Ware cards. Ultimate I/O affinity, no bus
arbitration overhead, etc..., as well as the added
performance of striping.
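As a rough illustration of what striping does to the block layout, here is a minimal Python sketch of a generic RAID-0 mapping. The function name and the 16-block stripe size are assumptions for the example, not LVM's actual code or defaults:

```python
def stripe_location(logical_block, stripe_blocks, n_devices):
    """Map a logical block to (device index, block on that device) for RAID-0."""
    stripe_no, offset = divmod(logical_block, stripe_blocks)
    device = stripe_no % n_devices                      # stripes rotate across devices
    device_block = (stripe_no // n_devices) * stripe_blocks + offset
    return device, device_block

# Two devices (for example, one volume per card) and a hypothetical 16-block stripe.
layout = [stripe_location(b, 16, 2) for b in (0, 15, 16, 31, 32, 48)]
print(layout)
```

Consecutive stripes alternate between the two devices, so a long sequential transfer keeps both cards, and both PCI-X channels, busy at the same time.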
The only negative is if one 3Ware card dies. But that's why
I keep a spare per N servers (typically 1 for every 4
servers, 8 cards total).
> It adds a layer of complexity and reduces reliability.
That varies. Yes, various kernel storage approaches --
especially LVM2/Device Mapper (DM) at this point -- have
race conditions if you use more than one operation. E.g.,
resizing and snapshots, RAID-1 (DM) atop RAID-0, etc...
But I _only_ use LVM/LVM2 with its native RAID-0 stripe, and
across two (2) 3Ware cards.
I've yet to have an issue. But that's probably because LVM2
doesn't require DM for RAID-0. DM is required for RAID-1,
snapshots, FRAID meta-data, etc...
Joshua Baker-LePain <jlb17 at duke.edu> wrote:
> I wouldn't call it that odd. RH patches their kernels to a
> fair extent, both for stability and features.
Yep. They are _very_ well trusted. Now if they'd put that
into XFS too, I'd be a happy camper.
> > > mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4
> > > /dev/md0
BTW, aren't you worried about running out of inodes?
At the same time, have you benchmarked how much faster a full
fsck runs using 1 inode per 4MiB versus the standard
That would be an interesting test IMHO.
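To put rough numbers on the inode question, here is a small Python sketch. It assumes the volume is 9TiB and uses 8192 bytes per inode purely as an assumed comparison ratio; neither figure is quoted from the thread, and real mke2fs output will differ somewhat because of block-group rounding:

```python
# Assumptions for illustration: the volume is taken as 9 TiB, and
# 8192 bytes-per-inode is an assumed comparison ratio, not a quoted one.
VOLUME_BYTES = 9 * 2**40

def inode_count(volume_bytes, bytes_per_inode):
    """Approximate number of inodes allocated at a given bytes-per-inode ratio."""
    return volume_bytes // bytes_per_inode

largefile4_inodes = inode_count(VOLUME_BYTES, 4 * 2**20)  # -T largefile4: 1 inode per 4MiB
comparison_inodes = inode_count(VOLUME_BYTES, 8192)       # assumed comparison ratio

print(largefile4_inodes, comparison_inodes, comparison_inodes // largefile4_inodes)
```

Under these assumptions largefile4 leaves a little over two million inodes, which is plenty for large files but could run out on a volume full of small ones. It also means far fewer inodes for a full fsck to walk.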
> Err, it's not a kludge and it's not a trick. Those 2
> "disks" are hardware RAID5 arrays from 2 12 port 3ware 9500
> cards. I like 3ware's hardware RAID, and those are the
> biggest (in terms of ports) cards 3ware makes.
> So, I hook 12 disks up to each card, and the OS sees those
> as 2 SCSI disks. I then do the software RAID to get 1)
> speed and 2) one partition to present to the users. Folks
> (myself included) have been doing this for years.
I am in total agreement with you, with one exception. I
always make 2 volumes (one System, one Data) per card (yes,
I'm aware of the 9.2 firmware bug, which is why I have avoided
the 9500S largely, although 220.127.116.11 seems promising now that
it's officially released). So in my case, I'd have two
BTW, supposedly 3Ware supports volumes across up to 4 cards.
Have you tried this? I have not myself.
> The one gotcha in this setup (other than not being able to
> boot from the big RAID5 arrays, since each is >2TiB)
Another reason to create a "System" volume and a "Data"
volume. My "System" volume is typically 2/4 drives in
RAID-1/10. My "Data" volume is typically RAID-5, or if I
really need performance, RAID-10.
> is that the version of mdadm shipped with RHEL4 does not
> support array members bigger than 2TiB. I had
> to upgrade to an upstream release to get that support.
Which is why I use LVM (and now LVM2) for RAID-0. I know
there are claims it is slower than MD (at least LVM2), but I
just like the management of LVM. I guess I'm typical of a
commercial UNIX weenie.
Chris Mauritz <chrism at imntv.com> wrote:
> For what it's worth, I have also done RAID0 stripes of 2
> raid arrays to get *really* fast read/write performance
> when used for storing uncompressed video. Recently, when
> I was at Apple for a meeting, that was their engineer's
> preferred method for getting huge RAIDs....running
> software RAID volumes across multiple Xserve-RAID devices
Software RAID-0 at the OS level (and not some FRAID driver)
is _always_ going to be the _ultimate_ because you can span
peripheral interconnects and cards.
> Perhaps I'm just extremely lucky, but I've not run into
> this magic 1TB barrier that I see bandied about here.
As I said, I've only run into it on kernel 2.4 distributions.
Any filesystem that grows up on a POSIX32 implementation
(especially pre-kernel 2.4 / GLibC 2.2 before LFS was
standard) is going to have signed 32-bit int structures.
I'm sure Tweedie and the gang have gotten around all of them
in kernel 2.6 now. But at the same time, I don't trust how
they are doing it.
> Unfortunately, a lot of the documentation and FAQs are
> quite out of date which can lead to some confusion.
Yeah. LVM2 and Device Mapper (DM) are a real PITA if you
start playing with newer developments, and race conditions
seem to be never-ending.
But when it comes to using intelligent 3Ware RAID with just
LVM2 for RAID-0, it has worked flawlessly for me on
Bryan J. Smith | Sent from Yahoo Mail
mailto:b.j.smith at ieee.org | (please excuse any
http://thebs413.blogspot.com/ | missing headers)