Hello,
I have:
- CentOS 4.1 x86_64
- directly-attached Infortrend 9TB array
- QLogic HBA, seen as sdb
- GPT label created in parted
I want one single 9TB ext3 partition.
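For reference, getting there was roughly the following parted sequence (a sketch from memory; the exact mkpart arguments vary with the parted version):

  # parted /dev/sdb
  (parted) mklabel gpt
  (parted) mkpart primary ext3 0 9149550
  (parted) print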
I am experiencing crazy behavior from mke2fs / mkfs.ext3 (tried both).
If I create partitions in parted up to approx. 4,100,000 MB, mkfs.ext3 works great. It lists the right number of blocks and creates a filesystem that fills the partition.
On any partition of approx. 4,200,000 MB or larger, mkfs.ext3 sees a seemingly random number of blocks, equivalent to a few hundred GB of space. It creates a healthy filesystem, but one that is way too small!!!
From dmesg to parted, I can see the system recognizes the hardware
properly, including the number of blocks. And I know the upper limit of ext3 is 16TB, not 4!
What am I doing wrong? What wall am I hitting around 4TB? HELP!!!
Thanks, Francois Caen
output of a few commands:
(parted) p
Disk geometry for /dev/sdb: 0.000-9149550.000 megabytes
Disk label type: gpt
Minor    Start            End        Filesystem  Name  Flags
1            0.017   4100000.000     ext3              <------ 4TB partition
2      4100000.000   9149549.983                       <------- 5TB partition
# mke2fs -n /dev/sdb1
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
524812288 inodes, 1049599995 blocks   <------------ GOOD!!!!
9530326 blocks (0.91%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
32032 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968, 102400000, 214990848, 512000000, 550731776, 644972544
# mke2fs -n /dev/sdb2
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
109477888 inodes, 218942971 blocks   <-------- BAD!!!!! only 830GB!!!
10947148 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
6682 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632,
        2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616,
        78675968, 102400000, 214990848
On Sat, 2005-09-10 at 19:03 -0700, Francois Caen wrote:
What am I doing wrong? What wall am I hitting around 4TB?
There are size limits _below_ the current 16TiB (17.6TB) "absolute" limit, depending on various geometry/hardware considerations. Kernel version can reduce that "absolute" limit as well.
I don't have a list because I personally never make an Ext3 filesystem greater than 1TB, period. And I try to keep them below 100GBs if I can help it.
As I mentioned before, I have some Fedora Core 3 (FC3) systems in test/limited production with XFS, which is what I use for TB-sized filesystems. So far, so good, at least on the very latest kernels. But I have not deployed XFS in heavy production since the 1.3.1 release on Red Hat Linux (RHL) 9, and the overwhelming majority of my deployments were the 1.2.x release on Red Hat Linux (RHL) 7.3.
On 9/10/05, Bryan J. Smith b.j.smith@ieee.org wrote:
I don't have a list because I personally never make an Ext3 filesystem greater than 1TB, period. And I try to keep them below 100GBs if I can help it.
Why? Performance? Stability???
Francois
On Sunday 11 September 2005 01:26, Francois Caen wrote:
On 9/10/05, Bryan J. Smith b.j.smith@ieee.org wrote:
I don't have a list because I personally never make an Ext3 filesystem greater than 1TB, period. And I try to keep them below 100GBs if I can help it.
Why? Performance? Stability???
ext2, and therefore ext3, were designed with a few-GB disks in mind... XFS and JFS were both designed with multi-TB disks in mind... Therefore it _should_ work better if you use one of those more modern filesystems... There are benchmarks and all for that - e.g. the performance of reiserfs has been dissected over and over again when it comes to handling small files.
Anyway, when running Oracle on a cooked (filesystem-backed) space I've not yet seen performance differences. If you just have a huge mirror (e.g. all the Linux distributions and more) I have not noticed a difference either... Sure, there are places where XFS and the like will work better, but I haven't seen an environment like that yet.
All that said, I too would recommend going with Reiser or XFS. I once had an ext3 filesystem that had one damaged sector in the journal... of course it fell back to ext2 behavior and the fs check took all weekend :-) XFS will handle a defective journal much better - so the chance that you ever encounter a situation where you have to do a full fs check is much lower.
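(For anyone who hits the same thing, the recovery was roughly along these lines; a sketch from memory, with a made-up device name:)

  tune2fs -O ^has_journal /dev/sdXN   # drop the damaged journal
  e2fsck -f /dev/sdXN                 # full forced check, the part that ran all weekend
  tune2fs -j /dev/sdXN                # add a fresh journal so it mounts as ext3 again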
Peter.
On Sun, 2005-09-11 at 12:20 -0400, Peter Arremann wrote:
All that said, I too would recommend going with Reiser or XFS.
I can't recommend ReiserFS because it lacks interfaces and compatibility, ones that are at the heart of most Red Hat distribution deployments. Red Hat will never support ReiserFS for this reason, and not because of some Tweedie v. Reiser debate. Tweedie has his focus because of Red Hat's focus.
SuSE continues to try to hack more and more interface/compatibility support for ReiserFS, including Extended Attributes (EAs). This is no different than when they originally did various NFS support hacks. Frankly, I wish SuSE (let alone Red Hat) would put that effort into XFS instead. But SuSE is dedicated to ReiserFS, and Red Hat seems unwilling (often relying, in many replies, on incorrect assumptions about XFS' interfaces/compatibility and about absolute limitations in Ext3) to join SGI in making Red Hat the absolute ultimate enterprise distro (IMHO) with the Ext3/XFS combination.
I once had an ext3 filesystem that had one damaged sector in the journal... of course it fell back to ext2 behavior and the fs check took all weekend :-)
But it recovered. I'll take that level of trust any day over recovery time. But that's just me.
XFS will handle a defective journal much better - so the chance that you ever encounter a situation where you have to do a full fs check is much lower.
Assuming the XFS kernel build is complete.
The only XFS kernel builds I have extensively tested and found complete are the official SGI XFS releases. I had a really horrendous experience with the kernel 2.4 backport, and I would never touch a 3rd-party rebuild.
So far the XFS kernel build in newer Fedora Core 3 seems to be usable, but I haven't put a full load on it. I don't know how it compares to the CentOS Plus kernel at all. There are a few issues with NFS though, and that has been a disappointment. All my attempts to use the XFS from the SGI cvs tree atop RHEL/FC have also resulted in additional issues.
The reason why I adopted the official SGI XFS releases back in the early/mid-2.4 kernel days was the solid NFS capability as well as POSIX EA/ACL support, plus xfsdump and the other user-space utilities (which Ext3 still lacks). If I had to deploy a serious NFS server today (1+TB), I would deploy Solaris/Opteron instead, hands down, no hesitation.
Otherwise, RHEL4/FC3 with Ext3 is fine for my typical deployments with filesystems commonly no larger than 100GB -- 1TB is an absolute maximum for myself and Ext3. I will not consider larger because 1TB is the "common denominator" of Ext3 filesystem size on _any_ kernel, _any_ hardware.
On Sun, 2005-09-11 at 11:20, Peter Arremann wrote:
All that said, I too would recommend going with Reiser or XFS. I once had an ext3 filesystem that had one damaged sector in the journal... of course it fell back to ext2 behavior and the fs check took all weekend :-)
I've had exactly the same experience with reiserfs several times, and the weird syntax of the reiserfs version of fsck means it doesn't work automatically. So not only does it take forever, it waits all weekend for you to come in and type the --rebuild-tree option to the command before even starting.
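For the record, the sequence reiserfsck wants is roughly this (device name illustrative), and nothing useful happens until someone types the second command by hand:

  reiserfsck --check /dev/sdXN          # only reports that the tree is damaged, then exits
  reiserfsck --rebuild-tree /dev/sdXN   # the step that actually runs all weekend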
On 9/11/05, Peter Arremann loony@loonybin.org wrote:
ext2, and therefore ext3, were designed with a few-GB disks in mind... XFS and JFS were both designed with multi-TB disks in mind... Therefore it _should_ work better if you use one of those more modern filesystems...
My concern with xfs, reiser or jfs is not really how good they are, but how well they are implemented/supported in CentOS.
My application is a huge backup-to-disk samba-accessed storage. Performance and fsck-caused downtime are not important to me. Integrity of the data is critical.
I need the most reliable multi-TB filesystem I can use with CentOS/RHEL.
And it's hard to choose between the better-but-less-supported xfs/reiser/... or the well-supported but not that multi-TB-friendly ext3...
Francois
On Sunday 11 September 2005 17:02, Francois Caen wrote:
My concern with xfs, reiser or jfs is not really how good they are, but how well they are implemented/supported in CentOS.
My application is a huge backup-to-disk samba-accessed storage. Performance and fsck-caused downtime are not important to me. Integrity of the data is critical.
I need the most reliable multi-TB filesystem I can use with CentOS/RHEL.
And it's hard to choose between the better-but-less-supported xfs/reiser/... or the well-supported but not that multi-TB-friendly ext3...
We usually have the same issue - and so far the answer has always been ext3, simply because it's easier to support. Luckily we haven't hit the 4TB limit so far (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html)... we always ended up making slices smaller than that for individual uses.
Peter.
On Sun, 11 Sep 2005 at 5:10pm, Peter Arremann wrote
On Sunday 11 September 2005 17:02, Francois Caen wrote:
My concern with xfs, reiser or jfs is not really how good they are, but how well they are implemented/supported in CentOS.
My application is a huge backup-to-disk samba-accessed storage. Performance and fsck-caused downtime are not important to me. Integrity of the data is critical.
I need the most reliable multi-TB filesystem I can use with CentOS/RHEL.
And it's hard to choose between the better-but-less-supported xfs/reiser/... or the well-supported but not that multi-TB-friendly ext3...
We usually have the same issue - and so far the answer has always been ext3, simply because it's easier to support. Luckily we haven't hit the 4TB limit so far (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html)... we always ended up making slices smaller than that for individual uses.
Having hit a similar issue (big FS, I wanted XFS, but needed to run centos 4), I just went ahead and stuck with ext3. My FS is 5.5TiB -- a software RAID0 across 2 3w-9xxx arrays. I had no issues formatting it and have had no issues in testing or production with it. So, it can be done. Perhaps the bugs you're hitting are in the FC driver layer?
On 9/11/05, Joshua Baker-LePain jlb17@duke.edu wrote:
Having hit a similar issue (big FS, I wanted XFS, but needed to run centos 4), I just went ahead and stuck with ext3. My FS is 5.5TiB -- a software RAID0 across 2 3w-9xxx arrays. I had no issues formatting it and have had no issues in testing or production with it. So, it can be done. Perhaps the bugs you're hitting are in the FC driver layer?
ext3 has a 4TB limit (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html), which I didn't know when I started this thread.
I found it the hard way, through testing. There are ways to force past that limit (mkpartfs ext2 in parted, then tune2fs -j), but the resulting filesystem is totally unstable.
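For completeness, the kludge I tried looked roughly like this (shown only as an illustration; the result was unstable for me):

  (parted) mkpartfs primary ext2 0 9149550
  # then, back at the shell, bolt a journal on so it mounts as ext3:
  tune2fs -j /dev/sdb1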
Joshua, how the heck did you format your 5.5TB in ext3? You 100% sure it's not mounted as ext2?
Francois
On Sun, 11 Sep 2005 at 6:41pm, Francois Caen wrote
On 9/11/05, Joshua Baker-LePain jlb17@duke.edu wrote:
Having hit a similar issue (big FS, I wanted XFS, but needed to run centos 4), I just went ahead and stuck with ext3. My FS is 5.5TiB -- a software RAID0 across 2 3w-9xxx arrays. I had no issues formatting it and have had no issues in testing or production with it. So, it can be done. Perhaps the bugs you're hitting are in the FC driver layer?
ext3 has a 4TB limit (http://batleth.sapienti-sat.org/projects/FAQs/ext3-faq.html), which I didn't know when I started this thread.
As I mentioned, I'm running centos-4, which, as we all know, is based off RHEL 4. If you go to http://www.redhat.com/software/rhel/features/, they explicitly state that they support ext3 FSs up to 8TB.
I found it the hard way, through testing. There are ways to force past that limit (mkpartfs ext2 in parted, then tune2fs -j), but the resulting filesystem is totally unstable.
Joshua, how the heck did you format your 5.5TB in ext3? You 100% sure it's not mounted as ext2?
To answer the 2nd question:

[jlb@$HOST ~]$ df -h
Filesystem            Size  Used Avail Use% Mounted on
.
.
/dev/md0              5.5T  634G  4.9T  12% /nefs

[jlb@$HOST ~]$ mount
.
.
/dev/md0 on /nefs type ext3 (rw)
As to the first, I created the FS as simply as possible. /dev/sdb and /dev/sdc both look like this:
(parted) print
Disk geometry for /dev/sdb: 0.000-2860920.000 megabytes
Disk label type: gpt
Minor    Start            End        Filesystem  Name  Flags
1            0.017   2860919.983
I then did a software RAID0 across them, and finally:
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
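The md step before that was nothing special; roughly the following, with the chunk size from memory and assuming the first partition on each array:

  mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=4096 /dev/sdb1 /dev/sdc1
  cat /proc/mdstat    # confirm the stripe came up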
On 9/12/05, Joshua Baker-LePain jlb17@duke.edu wrote:
As I mentioned, I'm running centos-4, which, as we all know, is based off RHEL 4. If you go to http://www.redhat.com/software/rhel/features/, they explicitly state that they support ext3 FSs up to 8TB.
Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
From my personal testing on CentOS 4.1, you can't go over 4TB without kludging.
I then did a software RAID0 across them, and finally:
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting. It adds a layer of complexity and reduces reliability. Is that the trick RH recommended to go up to 8TB?
On Mon, 12 Sep 2005 at 8:42am, Francois Caen wrote
On 9/12/05, Joshua Baker-LePain jlb17@duke.edu wrote:
As I mentioned, I'm running centos-4, which, as we all know, is based off RHEL 4. If you go to http://www.redhat.com/software/rhel/features/, they explicitly state that they support ext3 FSs up to 8TB.
Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
I wouldn't call it that odd. RH patches their kernels to a fair extent, both for stability and features.
From my personal testing on CentOS 4.1, you can't go over 4TB without kludging.
I then did a software RAID0 across them, and finally:
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting. It adds a layer of complexity and reduces reliability. Is that the trick RH recommended to go up to 8TB?
Err, it's not a kludge and it's not a trick. Those 2 "disks" are hardware RAID5 arrays from 2 12 port 3ware 9500 cards. I like 3ware's hardware RAID, and those are the biggest (in terms of ports) cards 3ware makes. So, I hook 12 disks up to each card, and the OS sees those as 2 SCSI disks. I then do the software RAID to get 1) speed and 2) one partition to present to the users. Folks (myself included) have been doing this for years.
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB) is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
Joshua Baker-LePain wrote:
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB) is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
I would be interested in hearing the details on that. I ended up hunting down all the mdadm stuff and recompiling everything from SRPMs.
Cheers,
On Mon, 12 Sep 2005 at 12:05pm, Chris Mauritz wrote
Joshua Baker-LePain wrote:
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB) is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
I would be interested in hearing the details on that. I ended up hunting down all the mdadm stuff and recompiling everything from SRPMs.
Which part -- the booting or the mdadm 2TiB support? The former I've talked about on this list before. There's not much to say on the latter. mdadm-1.6.0-2 (still current in RHEL4, AFAIK) says "Cannot get size of /dev/sda4: File too large", e.g., when an array member is > 2TiB. I downloaded mdadm-1.11.0.tgz from http://www.cse.unsw.edu.au/~neilb/source/mdadm/ (which was current at the time), did 'rpmbuild -tb' on it, and installed the resulting RPM, which worked.
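In other words, roughly (paths from memory):

  wget http://www.cse.unsw.edu.au/~neilb/source/mdadm/mdadm-1.11.0.tgz
  rpmbuild -tb mdadm-1.11.0.tgz                             # the tarball carries its own spec file
  rpm -Uvh /usr/src/redhat/RPMS/x86_64/mdadm-1.11.0-*.rpm   # output path approximate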
Joshua Baker-LePain wrote:
On Mon, 12 Sep 2005 at 12:05pm, Chris Mauritz wrote
Joshua Baker-LePain wrote:
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB) is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
I would be interested in hearing the details on that. I ended up hunting down all the mdadm stuff and recompiling everything from SRPMs.
Which part -- the booting or the mdadm 2TiB support? The former I've talked about on this list before. There's not much to say on the latter. mdadm-1.6.0-2 (still current in RHEL4, AFAIK) says "Cannot get size of /dev/sda4: File too large", e.g., when an array member is > 2TiB. I downloaded mdadm-1.11.0.tgz from http://www.cse.unsw.edu.au/~neilb/source/mdadm/ (which was current at the time), did 'rpmbuild -tb' on it, and installed the resulting RPM, which worked.
Heh, not the booting. 8-)
OK, so we basically did the same thing with mdadm then. I think I'm running 1.12.mumble and it seems to be up to version 2 now. I've not fiddled with it at all since the install since it was working OK and I didn't want to upset the apple cart.
Cheers,
On Mon, 12 Sep 2005 at 8:42am, Francois Caen wrote
On 9/12/05, Joshua Baker-LePain jlb17@duke.edu wrote:
As I mentioned, I'm running centos-4, which, as we all know, is based
off
RHEL 4. If you go to http://www.redhat.com/software/rhel/features/, they explicitly state that they support ext3 FSs up to 8TB.
Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
I wouldn't call it that odd. RH patches their kernels to a fair extent, both for stability and features.
From my personal testing on CentOS 4.1, you can't go over 4TB without
kludging.
I then did a software RAID0 across them, and finally:
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting. It adds a layer of complexity and reduces reliability. Is that the trick RH recommended to go up to 8TB?
Err, it's not a kludge and it's not a trick. Those 2 "disks" are hardware RAID5 arrays from 2 12 port 3ware 9500 cards. I like 3ware's hardware RAID, and those are the biggest (in terms of ports) cards 3ware makes. So, I hook 12 disks up to each card, and the OS sees those as 2 SCSI disks. I then do the software RAID to get 1) speed and 2) one partition to present to the users. Folks (myself included) have been doing this for years.
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB) is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
Just out of interest, and to complicate the matter even more, does anyone know what the upper limit of GFS is?
On Thu, Sep 15, 2005 at 12:28:01AM +1000, Nick Bryant enlightened us:
On 9/12/05, Joshua Baker-LePain jlb17@duke.edu wrote:
As I mentioned, I'm running centos-4, which, as we all know, is based
off
RHEL 4. If you go to http://www.redhat.com/software/rhel/features/, they explicitly state that they support ext3 FSs up to 8TB.
Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
I wouldn't call it that odd. RH patches their kernels to a fair extent, both for stability and features.
From my personal testing on CentOS 4.1, you can't go over 4TB without
kludging.
I then did a software RAID0 across them, and finally:
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting. It adds a layer of complexity and reduces reliability. Is that the trick RH recommended to go up to 8TB?
Err, it's not a kludge and it's not a trick. Those 2 "disks" are hardware RAID5 arrays from 2 12 port 3ware 9500 cards. I like 3ware's hardware RAID, and those are the biggest (in terms of ports) cards 3ware makes. So, I hook 12 disks up to each card, and the OS sees those as 2 SCSI disks. I then do the software RAID to get 1) speed and 2) one partition to present to the users. Folks (myself included) have been doing this for years.
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB) is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
Just out of interest, and to complicate the matter even more, does anyone know what the upper limit of GFS is?
From what I've been reading, there's an 8TB limit for all GFS file systems in a cluster.
http://www.redhat.com/docs/manuals/csgfs/browse/rh-gfs-en/s1-sysreq-fibredev...
Matt
Francois Caen wrote:
I then did a software RAID0 across them, and finally:
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting. It adds a layer of complexity and reduces reliability. Is that the trick RH recommended to go up to 8TB?
Huh? I suspect he did this not because of the OS, but because each RAID card had maxed out its number of physical ports. You don't HAVE to do that. I suspect it would also work fine if you had a dozen 500GB Hitachi drives on a 12-port 3ware card.
For what it's worth, I have also done RAID0 stripes of 2 RAID arrays to get *really* fast read/write performance when used for storing uncompressed video. Recently, when I was at Apple for a meeting, that was their engineers' preferred method for getting huge RAIDs... running software RAID volumes across multiple Xserve RAID devices.
Perhaps I'm just extremely lucky, but I've not run into this magic 1TB barrier that I see bandied about here. Heck, if you're willing to roll the dice on Hitachi drives, you can get a terabyte these days with just 2 hard disks in the array with RAID0 or 3 disks with RAID5.
Unfortunately, a lot of the documentation and FAQs are quite out of date which can lead to some confusion.
Cheers,
Chris Mauritz wrote:
<snip>
Perhaps I'm just extremely lucky, but I've not run into this magic 1TB barrier that I see bandied about here. Heck, if you're willing to roll the dice on Hitachi drives, you can get a terabyte these days with just 2 hard disks in the array with RAID0 or 3 disks with RAID5. Unfortunately, a lot of the documentation and FAQs are quite out of date which can lead to some confusion.
Grrrrrrrrrrr, one of my *long time* complaints w/ Linux (any Linux, apparently) is that nobody bothers to keep man-pages/FAQs/other documentation up to date. The cron(8) man page on my SuSE 9.2 P4 is dated 1996 (!!!). I lay this at the feet of the distro folks myself, but none of them have taken it up :-) ....
Francois Caen frcaen@gmail.com wrote:
Wow! Odd! RH says 8TB but ext3 FAQ says 4TB.
Any filesystem originally designed for 32-bit x86 is full of signed 32-bit structures. The 2^31 * 512 = 1.1TB (1TiB) limit comes from those structures using a 512-byte sector size.
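The arithmetic is easy enough to check:

  echo $(( 2**31 * 512 ))    # 1099511627776 bytes = 1TiB, i.e. ~1.1TB (needs 64-bit shell arithmetic)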
Ext3 has used a couple of different techniques to allow larger and larger support. Depending on the hardware, kernel (especially 2.4), etc..., there can be limits at 1, 2, 4, 8 and 16TiB.
Which is why the "common denominator" is 1.1TB (1TiB). It was rather enfuriating in front of a client when I attempted to mount one 2TB Ext3 volume over a SAN created by one Red Hat Enterprise Linux from another. For the "heck of it" -- I created a 1TB and tried again ... it worked.
ReiserFS 3 has the same issue; it grew up as a PC LBA32 filesystem. ReiserFS 4 is supposedly 64-bit clean. Although JFS for Linux came from OS/2 (32-bit PC) and not AIX/Power (true 64-bit), it was designed to be largely "64-bit clean" too. XFS came from Irix/MIPS4000+ (true 64-bit).
Both JFS and XFS would _not_ work on 32-bit Linux until patched with complete POSIX32 Large File Support (LFS). LFS became standard in the x86 target Linux kernel 2.4 and GLibC 2.2 (Red Hat Linux 7.x / Red Hat Enterprise Linux 2.1).
Joshua, thanks for the reply on this. There's something kludgy about having to do softraid across 2 partitions before formatting.
RAID-0 is an _ideal_ software RAID. Striping is best handled by the OS, which can schedule over multiple I/O options. In 2x and 4x S940 Opteron systems with at least one AMD8131 (dual PCI-X channels), I put a 3Ware card on each PCI-X channel connected to the same CPU and stripe with LVM. The CPU interlaces writes directly over two (2) PCI-X channels to two (2) 3Ware cards. Ultimate I/O affinity, no bus arbitration overhead, etc..., as well as the added performance of striping.
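As a sketch (device names and volume size made up, one exported unit per 3Ware card):

  pvcreate /dev/sdb /dev/sdc
  vgcreate vg0 /dev/sdb /dev/sdc
  lvcreate -i 2 -I 64 -L 500G -n data vg0   # -i 2 stripes across both cards, 64KB stripe size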
The only negative is if one 3Ware card dies. But that's why I keep a spare per N servers (typically 1 for every 4 servers, 8 cards total).
It adds a layer of complexity and reduces reliability.
That varies. Yes, various kernel storage approaches -- especially LVM2/Device Manager (DM) at this point -- have race conditions if you use more than one operation. E.g., resizing and snapshots, RAID-1 (DM) atop of RAID-0, etc... But I _only_ use LVM/LVM2 with its native RAID-0 stripe, and across two (2) 3Ware cards.
I've yet to have an issue. But that's probably because LVM2 doesn't require DM for RAID-0. DM is required for RAID-1, snapshots, FRAID meta-data, etc...
Joshua Baker-LePain jlb17@duke.edu wrote:
I wouldn't call it that odd. RH patches their kernels to a fair extent, both for stability and features.
Yep. They are _very_ well trusted. Now if they'd put that into XFS too, I'd be a happy camper.
mke2fs -b 4096 -j -m 0 -R stride=1024 -T largefile4 /dev/md0
BTW, aren't you worried about running out of inodes?
At the same time, have you benchmarked how much faster a full fsck runs using 1 inode per 4MiB versus the standard 1 per 16-64KiB?
That would be an interesting test IMHO.
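Something along these lines would answer it (devices and sizes illustrative):

  mke2fs -j -T largefile4 /dev/sdX1   # ~1 inode per 4MiB
  mke2fs -j /dev/sdX2                 # distro-default inode density
  # populate both with the same file set, then:
  time fsck.ext3 -f -n /dev/sdX1
  time fsck.ext3 -f -n /dev/sdX2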
Err, it's not a kludge and it's not a trick. Those 2 "disks" are hardware RAID5 arrays from 2 12 port 3ware 9500 cards. I like 3ware's hardware RAID, and those are the biggest (in terms of ports) cards 3ware makes. So, I hook 12 disks up to each card, and the OS sees those as 2 SCSI disks. I then do the software RAID to get 1) speed and 2) one partition to present to the users. Folks (myself included) have been doing this for years.
I am in total agreement with you, with one exception. I always make 2 volumes (one System, one Data) per card (yes, I'm aware of the 9.2 firmware bug, hence why I have avoided the 9500S largely, although 9.2.1.1 seems promising now that it's officially released). So in my case, I'd have two RAID-0 stripes.
BTW, supposedly 3Ware supports volumes across up to 4 cards. Have you tried this? I have not myself.
The one gotcha in this setup (other than not being able to boot from the big RAID5 arrays, since each is >2TiB)
Another reason to create a "System" volume and a "Data" volume. My "System" volume is typically 2/4 drives in RAID-1/10. My "Data" volume is typically RAID-5, or if I really need performance, RAID-10.
is that the version of mdadm shipped with RHEL4 does not support array members bigger than 2TiB. I had to upgrade to an upstream release to get that support.
Which is why I use LVM (and now LVM2) for RAID-0. I know there are claims it is slower than MD (at least LVM2), but I just like the management of LVM. I guess I'm typical of a commercial UNIX weenie.
Chris Mauritz chrism@imntv.com wrote:
For what it's worth, I have also done RAID0 stripes of 2 RAID arrays to get *really* fast read/write performance when used for storing uncompressed video. Recently, when I was at Apple for a meeting, that was their engineers' preferred method for getting huge RAIDs... running software RAID volumes across multiple Xserve RAID devices.
Software RAID-0 at the OS level (and not some FRAID driver) is _always_ going to be the _ultimate_ because you can span peripheral interconnects and cards.
Perhaps I'm just extremely lucky, but I've not run into this magic 1TB barrier that I see bandied about here.
As I said, I've just run into it on kernel 2.4 distributions.
Any filesystem that grows up on a POSIX32 implementation (especially pre-kernel 2.4 / GLibC 2.2 before LFS was standard) is going to have signed 32-bit int structures.
I'm sure Tweedie and the gang have gotten around all of them in kernel 2.6 now. But at the same time, I don't trust how they are doing it.
Unfortunately, a lot of the documentation and FAQs are quite out of date which can lead to some confusion.
Yeah. LVM2 and Device Mapper (DM) are a real PITA if you start playing with newer developments, and race conditions seem to be never-ending.
But when it comes to using intelligent 3Ware RAID with just LVM2 for RAID-0, it has worked flawlessly for myself on kernel 2.6.
On Sun, 2005-09-11 at 21:01 -0400, Joshua Baker-LePain wrote:
Having hit a similar issue (big FS, I wanted XFS, but needed to run centos 4), I just went ahead and stuck with ext3. My FS is 5.5TiB -- a software RAID0 across 2 3w-9xxx arrays. I had no issues formatting it and have had no issues in testing or production with it. So, it can be done.
I don't think I _ever_ said it couldn't be done. In fact, Ext3 support is now up to 17.6TB (16TiB).
But is there any guarantee that volume will work if moved to another set of hardware, kernels, etc...??? As I said, I _never_ create Ext3 filesystems greater than 1TB for this reason. 1TB is the "common denominator" when it comes to Ext3.
Perhaps the bugs you're hitting are in the FC driver layer?
There's all sorts of "requirements" for Ext3 sizes above 1TB, and assuming it will always work is an assumption I'm not willing to make with my client's data. But that's just me. I have to see repeatable results, and I have not with Ext3 above 1TB.
-- Bryan
P.S. Red Hat's going to wake up sooner or later and realize it's just as Sun said: they have not addressed the enterprise filesystem issue. I'm sure SGI and the XFS team would be more than happy to see some engagement from Red Hat on this matter -- and have wished for it for years now -- and the sad thing is that it would _help_ Red Hat's future. XFS is the only option -- ReiserFS and JFS have interface/compatibility issues that are "show stoppers" for Red Hat. XFS does not, and the only issues are newer kernel/distribution developments that just need to be addressed at a distro level.
On 9/11/05, Bryan J. Smith b.j.smith@ieee.org wrote:
I don't think I _ever_ said it couldn't be done. In fact, Ext3 support is now up to 17.6TB (16TiB).
Not according to the ext3 FAQ (see link in thread) and some extensive testing I did over the weekend. Ext2 is 16TB. Ext3 is only 4TB.
But is there any guarantee that volume will work if moved to another set of hardware, kernels, etc...??? As I said, I _never_ create Ext3 filesystems greater than 1TB for this reason. 1TB is the "common denominator" when it comes to Ext3.
I've seen you mention that 1TB limit. Could you please post some references to it? I'm now familiar with limits at 2TB (gpt), 4TB (ext3), 16TB (ext2) but you're the only person talking about a 1TB barrier.
There's all sorts of "requirements" for Ext3 sizes above 1TB
I'd like to hear about that.
Thanks, Francois
On Monday 12 September 2005 00:01, Francois Caen wrote:
I've seen you mention that 1TB limit. Could you please post some references to it? I'm now familiar with limits at 2TB (gpt), 4TB (ext3), 16TB (ext2) but you're the only person talking about a 1TB barrier.
You'll find that a lot with Bryan -- he makes up his own terms or puts his opinion down as facts.
But seriously - 1TB is just a good round number. Unlike the US tax code that always keeps you guessing, saying "No more than 1TB for ext3" is a quick and easy rule - and considering the fsck times you can get with a whole lot of small files 1TB is a good number.
To sum it up, 1TB is not a software limit but rather a number that Bryan, I, and a whole bunch of others (just search Google) see as the maximum filesystem size we feel comfortable with.
Peter.
On Mon, 2005-09-12 at 00:12 -0400, Peter Arremann wrote:
You'll find that a lot with Bryan -- he makes up his own terms or puts his opinion down as facts.
Which is why you should feel free to believe I pull everything out of my asshole -- I've said that before, and I'll say it again.
Of course, it's often my documentation that gets referenced by other people -- Google searches, print publications, etc. -- and "explains" things. And it's _that_ repeatable, technical information, year in and year out, that earns trust. I've only been posting here 4 months. In 4 years, you might feel differently.
But seriously - 1TB is just a good round number. Unlike the US tax code that always keeps you guessing, saying "No more than 1TB for ext3" is a quick and easy rule - and considering the fsck times you can get with a whole lot of small files 1TB is a good number.
It's a signed 32-bit integer for the number of sectors. That results in a 1.1TB (1TiB) limitation. Ext3 has them all over the freak'n place in its codebase, although I haven't seen it on kernel 2.6.5+. In other words, I don't like to have Ext3 data volumes over 1TB because some systems simply can't read them.
REAL WORLD EXAMPLE:
I previously ran into the issue where I created a 2TB Ext3 volume on a SAN device, and select versions/kernels could not use it. But the second I tried to mount a sub-1TB Ext3 filesystem from the same creator on the same system, I had no issue.
I have absolutely _never_ ran into that problem with XFS -- Irix, Linux 2.4, etc...
To sum it up, 1TB is not a software limit but rather a number that Bryan, I, and a whole bunch of others (just search Google) see as the maximum filesystem size we feel comfortable with.
Whatever you think it is, go on and answer for me. I'm used to you answering for me now. ;->
On Sun, 2005-09-11 at 21:01 -0700, Francois Caen wrote:
I've seen you mention that 1TB limit.
Signed LBA32 limitation. Various hardware and kernel limitations are commonplace. But it seems to nip me with Ext3 every time.
I've been able to create large XFS filesystems for non-booting volumes (with a non-PC disk label) in the past without issue, every single time.
Could you please post some references to it?
Not really. It's one of those things that is extremely poorly documented -- just like disk geometry and other things. I bet if I did a presentation or HOWTO, it would be the best documentation compared to what's out there.
I'm now familiar with limits at 2TB (gpt), 4TB (ext3), 16TB (ext2) but you're the only person talking about a 1TB barrier.
What I said was there is _no_consistency_ above 1TiB for Ext3. Some systems absolutely prevent me creating or -- worse yet -- moving a volume that is larger than 1TiB. That's the problem.
With XFS, I can _always_ use a disk label and create the filesystems at just about any size (although booting is another story). This is not true with Ext3.
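I.e., once the disk label is in place it is nothing more than (device name and mount point illustrative):

  mkfs.xfs /dev/sdb1
  mount -t xfs /dev/sdb1 /data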
I'd like to hear about that.
To be honest, I haven't run into them with the 2.6.5+ kernels. But I still do on the 2.4 kernels. It's clearly a signed LBA32 limitation every time.
On Sun, 11 Sep 2005 at 10:34pm, Bryan J. Smith wrote
On Sun, 2005-09-11 at 21:01 -0400, Joshua Baker-LePain wrote:
Having hit a similar issue (big FS, I wanted XFS, but needed to run centos 4), I just went ahead and stuck with ext3. My FS is 5.5TiB -- a software RAID0 across 2 3w-9xxx arrays. I had no issues formatting it and have had no issues in testing or production with it. So, it can be done.
I don't think I _ever_ said it couldn't be done. In fact, Ext3 support is now up to 17.6TB (16TiB).
And I never said that you said that, nor did I mean to imply it.
But is there any guarantee that volume will work if moved to another set of hardware, kernels, etc...??? As I said, I _never_ create Ext3
As I just mentioned in another post, this configuration is explicitly supported by Red Hat. Therefore, if it doesn't work in some other configuration, it's a bug that Red Hat will want to fix.
P.S. Red Hat's going to wake up sooner or later and realize it's just as Sun said: they have not addressed the enterprise filesystem issue. I'm sure SGI and the XFS team would be more than happy to see some engagement from Red Hat on this matter -- and have wished for it for years now -- and the sad thing is that it would _help_ Red Hat's future. XFS is the only option -- ReiserFS and JFS have interface/compatibility issues that are "show stoppers" for Red Hat. XFS does not, and the only issues are newer kernel/distribution developments that just need to be addressed at a distro level.
I too have been waiting for a long while for Red Hat to wake up to XFS. My *other* 5.5TB of RAID space (spread over 4 servers) is all XFS on RH7.3. But this volume needed large block device support (obviously), and I couldn't get consistent results wedging XFS into centos-4, so I went with the supported configuration. I'm not willing to go to SuSE just to get XFS.
On Sun, 2005-09-11 at 16:02, Francois Caen wrote:
My application is a huge backup-to-disk samba-accessed storage. Performance and fsck-caused downtime are not important to me.
Keep in mind when you say that, the downtime in question might be several days for a full fsck to run, depending more on the number of files than the size of the filesystem. Even if you don't happen to need to restore anything then that can be a long time to not be making current backups.
On Sun, 2005-09-11 at 17:11 -0500, Les Mikesell wrote:
Keep in mind when you say that, the downtime in question might be several days for a full fsck to run, depending more on the number of files than the size of the filesystem.
Is there some way you'all are formatting your Ext3 partitions that takes so long? I don't think it has ever taken longer than 8 hours to fully fsck my Ext3 partitions on a multi-TB server -- and typically I only have 1-2 filesystems that need a full fsck, and I'm _never_ down more than 1-2 hours.
I'm wondering if it's because I'm typically using only 100GB (and _never_ more than 1TB) Ext3 filesystems. And I'm deploying 3Ware hardware RAID-10 and, limitedly, RAID-5 on servers that have PCI-X slots. I often find _extremely_poorly_ designed servers are the culprit, and not so much other issues.
So maybe this is one of those "best practices" discussions we should have on another list? I just have to shake my head when I hear people talk about multi-day fsck runs. I've _never_ had that happen.
On Sun, 2005-09-11 at 18:29, Bryan J. Smith wrote:
Is there some way you'all are formatting your Ext3 partitions that takes so long? I don't think has ever taken longer than 8 hours to fully fsck my Ext3 partitions on a multi-TB server -- and typically I only have 1-2 filesystems that need a full fsck and I'm _never_ down more than 1-2 hours.
It goes with the number of files more than the size of the partition or size used. In my case it is a backuppc archive containing multiple copies of most of the other servers at the site. Backuppc compresses files, then links duplicates so it can cram about 10x what you could otherwise fit on a drive. Fsck still has to follow each directory entry to check it even though all the hard links point to the same place.
On Sun, 2005-09-11 at 14:02 -0700, Francois Caen wrote:
My concern with xfs, reiser or jfs is not really how good they are, but how well they are implemented/supported in CentOS.
Ditto. Unfortunately, I don't like the answer.
And it's hard to choose between the better-but-less-supported xfs/reiser/... or the well-supported but not that multi-TB-friendly ext3...
For me, it's Ext3-only on RHEL/CentOS 3/4, as well as Fedora Core 3, with limited XFS testing on the latter.