Francois Caen frcaen@gmail.com wrote:
Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
Any filesystem originally designed for 32-bit x86 is full of signed 32-bit structures. The 2^31 * 512 = 1.1TB (1TiB) limit comes from those structures addressing 512-byte sectors.
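For the record, the arithmetic is just 2^31 sectors x 512 bytes/sector = 2^40 bytes = 1 TiB, which is roughly 1.1 x 10^12 bytes -- hence "1.1TB" in decimal units.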
Ext3 has used a couple of different techniques over time to support progressively larger volumes. Depending on the hardware, kernel (especially 2.4), etc..., there can be limits at 1, 2, 4, 8 and 16TiB.
Which is why the "common denominator" is 1.1TB (1TiB). It was rather infuriating in front of a client when I attempted to mount a 2TB Ext3 volume over a SAN, created on one Red Hat Enterprise Linux system, from another. For the "heck of it" I created a 1TB volume and tried again ... it worked.
ReiserFS 3 has the same issue; it grew up as a PC LBA32 filesystem. ReiserFS 4 is supposedly 64-bit clean. Although JFS for Linux came from OS/2 (32-bit PC) and not AIX/Power (true 64-bit), it was designed to be largely "64-bit clean" too. XFS came from Irix/MIPS4000+ (true 64-bit).
Both JFS and XFS would _not_ work on 32-bit Linux until patched with complete POSIX32 Large File Support (LFS). LFS became standard in the x86-targeted Linux kernel 2.4 and GLibC 2.2 (Red Hat Linux 7.x / Red Hat Enterprise Linux 2.1).
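As an aside, if you want to see what your GLibC expects for LFS-enabled userspace builds, getconf will report it. The output below is what a typical 32-bit LFS-capable GLibC prints; it may be empty on other setups:

  $ getconf LFS_CFLAGS
  -D_FILE_OFFSET_BITS=64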
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting.
RAID-0 is an _ideal_ software RAID. Striping is best handled by the OS, which can schedule over multiple I/O paths. In 2x and 4x S940 Opteron systems with at least one AMD8131 (dual PCI-X channels), I put a 3Ware card on each PCI-X channel connected to the same CPU and stripe with LVM. The CPU interleaves writes directly over two (2) PCI-X channels to two (2) 3Ware cards. Ultimate I/O affinity, no bus arbitration overhead, etc..., as well as the added performance of striping.
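For anyone who hasn't done it, the LVM side is only a few commands. A rough sketch -- the device names, size and stripe size below are placeholders, not my actual config:

  # each 3Ware unit shows up to the OS as a single SCSI disk
  pvcreate /dev/sda /dev/sdb
  vgcreate data /dev/sda /dev/sdb
  # -i 2 = stripe across both PVs, -I 64 = 64KiB stripe size
  lvcreate -i 2 -I 64 -L 2000G -n stripe0 data
  mke2fs -b 4096 -j /dev/data/stripe0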
The only negative is if one 3Ware card dies. But that's why I keep a spare per N servers (typically 1 for every 4 servers, 8 cards total).
It adds a layer of complexity and reduces reliability.
That varies. Yes, various kernel storage approaches -- especially LVM2/Device Mapper (DM) at this point -- have race conditions if you combine more than one operation. E.g., resizing and snapshots, RAID-1 (DM) atop RAID-0, etc... But I _only_ use LVM/LVM2 with its native RAID-0 stripe, and across two (2) 3Ware cards.
I've yet to have an issue. But that's probably because LVM2 doesn't require DM for RAID-0. DM is required for RAID-1, snapshots, FRAID meta-data, etc...
Joshua Baker-LePain jlb17@duke.edu wrote:
I wouldn't call it that odd. RH patches their kernels to a fair extent, both for stability and features.
Yep. They are _very_ well trusted. Now if they'd put that into XFS too, I'd be a happy camper.
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
BTW, aren't you worried about running out of inodes?
At the same time, have you benchmarked how much faster a full fsck takes using 1 inode per 4MiB versus the standard one per 16-64KiB?
That would be an interesting test IMHO.
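If someone does want to test it, the inode density is easy to read back and a read-only fsck can be timed without risk. Something like the following (device and mount point are just examples; run the fsck on an unmounted or idle filesystem):

  # inode/block counts and inode size as created
  tune2fs -l /dev/md0 | egrep -i 'inode count|block count|inode size'
  # how many inodes are actually in use
  df -i /mnt/bigraid
  # forced, no-change check, timed
  time e2fsck -f -n /dev/md0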
Err, it's not a kludge and it's not a trick. Those 2 "disks" are hardware RAID5 arrays from two 12-port 3ware 9500 cards. I like 3ware's hardware RAID, and those are the biggest (in terms of ports) cards 3ware makes. So, I hook 12 disks up to each card, and the OS sees those as 2 SCSI disks. I then do the software RAID to get 1) speed and 2) one partition to present to the users. Folks (myself included) have been doing this for years.
I am in total agreement with you, with one exception. I always make 2 volumes (one System, one Data) per card (yes, I'm aware of the 9.2 firmware bug, which is largely why I have avoided the 9500S, although 9.2.1.1 seems promising now that it's officially released). So in my case, I'd have two RAID-0 stripes.
BTW, supposedly 3Ware supports volumes across up to 4 cards. Have you tried this? I have not myself.
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB)
Another reason to create a "System" volume and a "Data" volume. My "System" volume is typically 2/4 drives in RAID-1/10. My "Data" volume is typically RAID-5, or if I really need performance, RAID-10.
is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
Which is why I use LVM (and now LVM2) for RAID-0. I know there are claims it is slower than MD (at least LVM2), but I just like the management of LVM. I guess I'm a typical commercial UNIX weenie.
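For completeness, the MD route Joshua describes -- striping the two hardware units with mdadm -- would look something like this (device names and chunk size are placeholders):

  # stripe the two 3Ware units into one md device, 64KiB chunks
  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=64 /dev/sda /dev/sdb
  # stride = chunk size / block size = 64KiB / 4KiB = 16
  mke2fs -b 4096 -j -R stride=16 /dev/md0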
Chris Mauritz chrism@imntv.com wrote:
For what it's worth, I have also done RAID0 stripes of 2 RAID arrays to get *really* fast read/write performance when used for storing uncompressed video. Recently, when I was at Apple for a meeting, that was their engineers' preferred method for getting huge RAIDs ... running software RAID volumes across multiple Xserve RAID devices.
Software RAID-0 at the OS level (and not some FRAID driver) is _always_ going to be the _ultimate_ because you can span peripheral interconnects and cards.
Perhaps I'm just extremely lucky, but I've not run into this magic 1TB barrier that I see bandied about here.
As I said, I've only run into it on kernel 2.4 distributions.
Any filesystem that grows up on a POSIX32 implementation (especially pre-kernel 2.4 / GLibC 2.2 before LFS was standard) is going to have signed 32-bit int structures.
I'm sure Tweedie and the gang have gotten around all of them in kernel 2.6 now. But at the same time, I don't entirely trust how they are doing it.
Unfortunately, a lot of the documentation and FAQs are quite out of date which can lead to some confusion.
Yeah. LVM2 and Device Mapper (DM) are a real PITA if you start playing with newer developments, and race conditions seem to be never-ending.
But when it comes to using intelligent 3Ware RAID with just LVM2 for RAID-0, it has worked flawlessly for me on kernel 2.6.