Hello All
I have a brand spanking new 40TB Hardware Raid6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple file systems as we are going to use it for backups. Other factors are performance and reliability.
CentOS 5.6
array is /dev/sdb
So here is what I have tried so far: reiserfs is limited to 16TB; ext4 does not seem to be fully baked in 5.6 yet; parted 1.8 does not support creating ext4 (strange).
Anyone work with large filesystems like this that have any suggestions/recommendations?
On 04/12/11 12:23 AM, Matthew Feinberg wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array
never mind file systems... is that one raid set? do you have any idea how LONG rebuilding that is going to take when there are any drive hiccups? or how painfully slow writes will be until it's rebuilt? is that something like 22 x 2TB or 16 x 3TB? I'll bet a raid rebuild takes nearly a WEEK, maybe even longer...
I am very strongly NOT in favor of raid6, even for nearline bulk backup storage. I would sacrifice the space and format that as raid10, and have at LEAST a couple hot spares too.
On Apr 12, 2011, at 12:31 AM, John R Pierce wrote:
On 04/12/11 12:23 AM, Matthew Feinberg wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array
never mind file systems... is that one raid set? do you have any idea how LONG rebuilding that is going to take when there are any drive hiccups? or how painfully slow writes will be until its rebuilt? is that something like 22 x 2TB or 16 x 3TB? I'll bet a raid rebuild takes nearly a WEEK, maybe even longer..
I am very strongly NOT in favor of raid6, even for nearline bulk backup storage. I would sacrifice the space and format that as raid10, and have at LEAST a couple hot spares too.
+1 for the 1+0 and a few hot spares.
Raid 6 + spare ran great but rebuilds took 2 days. The likelihood of 2+ failed drives is lower than that of 1 failed drive, but I actually had 2 failed drives, so RAID6 + spare saved me.
Hence I switched to RAID 1+0 + spares.
A tuned XFS fs will work great.
I run my large RAID XFS fs with logbufs=8, noatime and nodiratime.
I also run iozone for testing my tuned options for optimum performance in my env.
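Roughly what that looks like here (device and mount point are just examples, adjust for your setup):

    # /etc/fstab: XFS with 8 in-memory log buffers, no atime/diratime updates
    /dev/sdb1   /backup   xfs   defaults,noatime,nodiratime,logbufs=8   0 0

    # quick iozone pass against the mount to compare option sets
    iozone -a -g 4g -f /backup/iozone.tmp

logbufs=8 trades a little RAM for better metadata throughput, and noatime/nodiratime save a write for every read.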
- aurf
On Tue, Apr 12, 2011 at 9:23 AM, Matthew Feinberg matthew@choopa.com wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple file systems as we are going to use it for backups. Other factors is performance and reliability.
We've been very happy with XFS, as it allows us to add disk space through LVM and grow the filesystem online - we've had to reboot the server when we add new disk enclosures, but that's not XFS's fault...
BR Bent
CentOS 5.6
array is /dev/sdb
So here is what I have tried so far reiserfs is limited to 16TB ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
Anyone work with large filesystems like this that have any suggestions/recommendations?
-- Matthew Feinberg matthew@choopa.com AIM: matthewchoopa
On 12/04/2011 09:23, Matthew Feinberg wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple file systems as we are going to use it for backups. Other factors is performance and reliability.
CentOS 5.6
array is /dev/sdb
So here is what I have tried so far reiserfs is limited to 16TB ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
Anyone work with large filesystems like this that have any suggestions/recommendations?
Hi Matthew,
I would go for xfs, which is now supported in CentOS. This is what I use for 16 TB of storage, with CentOS 5.3 (Rocks Cluster), and it works fine. No problems with lengthy fsck, as with ext3 (which does not support such capacities). I have not tried ext4 yet...
Alain
On Tuesday 12 April 2011 10:36:54 Alain Péan wrote:
On 12/04/2011 09:23, Matthew Feinberg wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple file systems as we are going to use it for backups. Other factors is performance and reliability.
CentOS 5.6
array is /dev/sdb
So here is what I have tried so far reiserfs is limited to 16TB ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
Anyone work with large filesystems like this that have any suggestions/recommendations?
Hi Matthew,
I would go for xfs, which is now supported in CentOS. This is what I use for a 16 TB storage, with CentOS 5.3 (Rocks Cluster), and it woks fine. No problem with lengthy fsck, as with ext3 (which does not support such capacities). I did not try yet ext4...
Alain
I have Raid6 arrays with 30TB. We have tested XFS and its write performance was really disappointing. So we looked at Ext4. It is really good for our workloads, but it lacks the ability to grow over 16TB. So we created two partitions on the raid with ext4.
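Roughly what we did, in case it helps (assuming the array shows up as /dev/sdb; the 50/50 split is illustrative, and on CentOS 5 the ext4 tools may live in the e4fsprogs package):

    # GPT label so partitions over 2TB work, then split the array in two
    parted /dev/sdb mklabel gpt
    parted /dev/sdb mkpart primary 0% 50%
    parted /dev/sdb mkpart primary 50% 100%

    # ext4 on each half, so each filesystem stays under the 16TB limit
    mkfs.ext4 /dev/sdb1      # or: mke4fs -t ext4 /dev/sdb1
    mkfs.ext4 /dev/sdb2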
The RAID rebuild time is around 2 days, max 3 if the workload is higher. So I presume that for 40TB it will be around 4 days.
Marian
On Tue, 12 Apr 2011, Marian Marinov wrote:
On Tuesday 12 April 2011 10:36:54 Alain Péan wrote:
On 12/04/2011 09:23, Matthew Feinberg wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple file systems as we are going to use it for backups. Other factors is performance and reliability.
CentOS 5.6
array is /dev/sdb
So here is what I have tried so far reiserfs is limited to 16TB ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
Anyone work with large filesystems like this that have any suggestions/recommendations?
Hi Matthew,
I would go for xfs, which is now supported in CentOS. This is what I use for a 16 TB storage, with CentOS 5.3 (Rocks Cluster), and it woks fine. No problem with lengthy fsck, as with ext3 (which does not support such capacities). I did not try yet ext4...
Alain
I have Raid6 Arrays with 30TB. We have tested XFS and its write performance was really dissapointing. So we looked at Ext4. It is really good for our workloads, but it lacks the ability to grow over 16TB. So we crated two partitions on the raid with ext4.
The RAID rebuild time is around 2 days, max 3 if the workload is higher. So I presume that for 40TB it will be around 4 days.
Marian
Out of interest, how much *memory* would you need in your raid management node to support "fsck" on a 40TB array? I imagine it would be very high.
Steve
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
On 12/04/2011 09:23, Matthew Feinberg wrote:
Hello All
I have a brand spanking new 40TB Hardware Raid6 array to play around with. I am looking for recommendations for which filesystem to use. I am trying not to break this up into multiple file systems as we are going to use it for backups. Other factors is performance and reliability.
CentOS 5.6
array is /dev/sdb
So here is what I have tried so far reiserfs is limited to 16TB ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
Anyone work with large filesystems like this that have any suggestions/recommendations?
Hi Matthew,
I would go for xfs, which is now supported in CentOS. This is what I use for a 16 TB storage, with CentOS 5.3 (Rocks Cluster), and it woks fine. No problem with lengthy fsck, as with ext3 (which does not support such capacities). I did not try yet ext4...
Alain
--
Alain Péan - LPP/CNRS Administrateur Système/Réseau Laboratoire de Physique des Plasmas - UMR 7648 Observatoire de Saint-Maur 4, av de Neptune, Bat. A 94100 Saint-Maur des Fossés Tel : 01-45-11-42-39 - Fax : 01-48-89-44-33 ==========================================================
I fully second Alain's opinion. An fsck on a 6 TB RAID6 containing about 30 million files takes over 10 hours.
As for XFS, we are running it on a 25 TB array and so far there has been no trouble.
Boris.
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
On Tuesday 12 April 2011 17:36:39 John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
Can someone (who actually knows) share with us what the state of xfs-utils is, and how stable and usable they are for recovery of broken XFS filesystems?
Marian
----- Original Message -----
| On Tuesday 12 April 2011 17:36:39 John Jasen wrote:
| > On 04/12/2011 10:21 AM, Boris Epstein wrote:
| > > On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan
| > > <alain.pean@lpp.polytechnique.fr> wrote:
| > <snipped: two recommendations for XFS>
| >
| > I would chime in with a dis-commendation for XFS. At my previous
| > employer, two cases involving XFS resulted in irrecoverable data
| > corruption. These were on RAID systems running from 4 to 20 TB.
|
| Can someone(who actually knows) share with us, what is the state of
| xfs-utils, how stable and usable are they for recovery of broken XFS
| filesystems?
|
| Marian
On 64-bit platforms the tools are totally stable, but it does depend on the degree of "broken" state that the file system is in. I've had xfs_checks run for days and eat up 96GB of memory because of various degrees of "broken"-ness. These are on 35 and 45TB file systems. Be prepared to throw memory at the problem or lots of swap files if you get really buggered up.
On Tue, Apr 12, 2011 at 06:00:57PM +0300, Marian Marinov wrote:
Can someone(who actually knows) share with us, what is the state of xfs-utils, how stable and usable are they for recovery of broken XFS filesystems?
I have done an XFS repair once or twice on a real filesystem (~4TB) on a 64bit kernel. It worked fine, but I don't think the filesystem was too badly trashed.
As another poster noted, be ready to throw memory or swap at the XFS check and repair tools. (I read that it's slightly better memory-wise to run xfs_repair -n than xfs_check, but I believe that's mainly for 32bit systems, and that may have been fixed anyway.)
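For reference, the dry run and the swap trick look roughly like this (example device, filesystem unmounted):

    # no-modify pass first, to see what xfs_repair *would* do
    xfs_repair -n /dev/sdb1

    # if memory runs out during the real repair, add a big temporary swap file
    dd if=/dev/zero of=/var/tmp/repair.swap bs=1M count=32768    # ~32GB, adjust to taste
    mkswap /var/tmp/repair.swap
    swapon /var/tmp/repair.swap

    xfs_repair /dev/sdb1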
--keith
On 4/12/2011 9:36 AM, John Jasen wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
Was this on a 32 or 64 bit system?
On 04/12/2011 11:30 AM, Les Mikesell wrote:
On 4/12/2011 9:36 AM, John Jasen wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
Was this on a 32 or 64 bit system?
Yes. IE: both.
On Wed, Apr 13, 2011 at 07:18:23PM -0400, John Jasen wrote:
On 04/12/2011 11:30 AM, Les Mikesell wrote:
On 4/12/2011 9:36 AM, John Jasen wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
Was this on a 32 or 64 bit system?
Yes. IE: both.
XFS is known to be broken on 32bit Linux..
XFS was originally developed on 64bit IRIX (iirc), so it also "requires" 64bit Linux.
32bit Linux has too small a stack for XFS. Red Hat only supports XFS on x86_64 RHEL.
-- Pasi
On Tue, Apr 12, 2011 at 10:36:39AM -0400, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
Did you have these problems with XFS on 32bit Linux?
-- Pasi
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
On 04/12/2011 08:19 PM, Christopher Chan wrote:
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
The second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not-so-silent corruption, in that xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. It happened two or three times before we gave up, split up the raid, and went ext3. Again, no issues.
On Apr 13, 2011, at 7:26 PM, John Jasen jjasen@realityfailure.org wrote:
On 04/12/2011 08:19 PM, Christopher Chan wrote:
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem almost too incredible to believe.
Nothing breaks for absolutely no reason, and failure to know where the breakage was shows that maybe there weren't adequately skilled technicians for the technology deployed.
XFS, if run in a properly configured environment, will run flawlessly.
-Ross
On Wed, Apr 13, 2011 at 6:04 PM, Ross Walker rswwalker@gmail.com wrote:
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
That's not entirely true. Even in CentOS 5.3(?), we ran into an issue where XFS running on an md array would lock up for seemingly no reason due to possible corruption. I've even bookmarked the relevant bug thread for posterity's sake since it caused us so much grief.
On Apr 13, 2011, at 9:40 PM, Brandon Ooi brandono@gmail.com wrote:
On Wed, Apr 13, 2011 at 6:04 PM, Ross Walker rswwalker@gmail.com wrote:
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
That's not entirely true. Even in Centos 5.3(?), we ran into an issue of XFS running on an md array would lock up for seemingly no reason due to possible corruption. I've even bookmarked the relevant bug thread for posterity sake since it caused us so much grief.
Once I had ext3 corrupt on a NAS box with a bad controller. Can I not recommend using it? Should it have detected or prevented this corruption from occurring? Maybe it isn't safe?
For every one bad experience with a given technology there are thousands of success stories. All software has bugs, and advocacy really shouldn't play a part in determining the proper technology; it should be picked for the application and on its merits and, as with anything, thoroughly tested before being put into production.
-Ross
On Wed, Apr 13, 2011 at 11:55:08PM -0400, Ross Walker wrote:
On Apr 13, 2011, at 9:40 PM, Brandon Ooi brandono@gmail.com wrote:
On Wed, Apr 13, 2011 at 6:04 PM, Ross Walker rswwalker@gmail.com wrote:
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
Here's some deconstruction of your argument:
"... and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed"
This is blaming the victim. One must have the time, skills and often other resources to do root cause analysis.
"XFS if run in a properly configured environment will run flawlessly."
I think a more narrowly qualified opinion is appropriate: "XFS, properly configured, running on perfect hardware atop a perfect kernel, will have fewer serious bugs than it had on Jan 1, 2009." Here's a summary of XFS bugzilla data from 2009 through today:
Bug Status
Severity      NEW  ASSIGNED  REOPENED  Total
blocker         3         .         .      3
critical       10         2         .     12
major          48         2         .     50
normal        118        46         3    167
minor          26         3         .     29
trivial         7         .         .      7
enhancement    39         9         1     49
Total         251        62         4    317
See also the XFS mailing list for a big dose of reality. Flawlessly is not the label I would use for XFS. /Maybe/ for Ext2.
On Apr 17, 2011, at 3:05 AM, Charles Polisher cpolish@surewest.net wrote:
On Wed, Apr 13, 2011 at 11:55:08PM -0400, Ross Walker wrote:
On Apr 13, 2011, at 9:40 PM, Brandon Ooi brandono@gmail.com wrote:
On Wed, Apr 13, 2011 at 6:04 PM, Ross Walker rswwalker@gmail.com wrote:
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
Here's some deconstruction of your argument:
"... and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed"
This is blaming the victim. One must have the time, skills and often other resources to do root cause analysis.
"XFS if run in a properly configured environment will run flawlessly."
I think a more narrowly qualified opinion is appropriate: "XFS, properly configured, running on perfect hardware atop a perfect kernel, will have fewer serious bugs than it had on Jan 1, 2009." Here's a summary of XFS bugzilla data from 2009 through today:
I already apologized for those comments last week. No need to keep flogging a dead horse here.
Bug Status
Severity      NEW  ASSIGNED  REOPENED  Total
blocker         3         .         .      3
critical       10         2         .     12
major          48         2         .     50
normal        118        46         3    167
minor          26         3         .     29
trivial         7         .         .      7
enhancement    39         9         1     49
Total         251        62         4    317
See also the XFS mailing list for a big dose of reality. Flawlessly is not the label I would use for XFS. /Maybe/ for Ext2.
Basically it comes down to this: all file systems, like all software, have bugs and edge cases, and thinking that one can find a file system that is bug-free is naive.
Test, test, test.
-Ross
On 04/13/2011 09:04 PM, Ross Walker wrote:
On Apr 13, 2011, at 7:26 PM, John Jasen jjasen@realityfailure.org wrote:
<snipped my stuff>
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
Waving your hands and insulting the people who went through XFS failures doesn't make me feel any better or make the problems not have occurred.
I would presume that we were lucky enough to have technicians on-site skilled enough to track the problems down to XFS itself.
On Apr 14, 2011, at 6:54 AM, John Jasen jjasen@realityfailure.org wrote:
On 04/13/2011 09:04 PM, Ross Walker wrote:
On Apr 13, 2011, at 7:26 PM, John Jasen jjasen@realityfailure.org wrote:
<snipped my stuff>
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
Waving your hands and insulting the people who went through XFS failures doesn't make me feel any better or make the problems not have occurred.
You are correct, it came across as rude and condescending; I apologize.
It was a knee-jerk reaction that came from reading many such posts claiming XFS is no good because it caused X, where X came about because people didn't know how to implement XFS safely or correctly.
Of course I'm not trying to make any legitimately bad experiences any less legitimate. We all have them, and over a long enough period of time, with most file systems.
I would presume that we were lucky enough to have technicians on-site skilled enough to track the problems down to XFS itself.
Yes, it is always better to catch these through testing than in production.
-Ross
On Apr 14, 2011, at 6:43 AM, Ross Walker wrote:
On Apr 14, 2011, at 6:54 AM, John Jasen jjasen@realityfailure.org wrote:
On 04/13/2011 09:04 PM, Ross Walker wrote:
On Apr 13, 2011, at 7:26 PM, John Jasen jjasen@realityfailure.org wrote:
<snipped my stuff>
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
Waving your hands and insulting the people who went through XFS failures doesn't make me feel any better or make the problems not have occurred.
W You are correct it came across as rude and condescending, I apologize.
It was a knee jerk reaction that came from reading many such posts that XFS is no good because it caused X where X came about because people didn't know how to implement XFS safely or correctly.
Well, while a fan of anything IRIX, I've had issues with XFS in the past as with all filesystems.
I still use it but not in all cases.
A good fs, fast, reliable for the most part but by no means a fan boi of it.
You did come across as a serious fan though.
However, if you like XFS, I'll assume you like IRIX, so check out the 5dwm project, which is the IRIX desktop for Linux.
- aurf
On Thursday, April 14, 2011 02:17:41 PM aurfalien@gmail.com wrote:
However if you like XFS, I'll assume you liek IRIX so check the 5dwm project which is the IRIX desktop for Linux.
Cool. Now if they ported the Audio DAT ripping program for IRIX to Linux, I'd be able to get rid of my O2..... (SGI got special DAT tape drive firmware made by Seagate that can read and write Audio DAT tapes in a particular Seagate/Archive Python DDS-1 drive; SGI also put the software to work with Audio DAT's in IRIX. I use that program occasionally on my O2, and previously on my Indigo2/IMPACT, to 'rip' Audio DAT's for my professional audio production side business.... I can also master to Audio DAT with the same program, making it quite nice indeed.)
On Apr 14, 2011, at 12:43 PM, Lamar Owen wrote:
On Thursday, April 14, 2011 02:17:41 PM aurfalien@gmail.com wrote:
However if you like XFS, I'll assume you liek IRIX so check the 5dwm project which is the IRIX desktop for Linux.
Cool. Now if they ported the Audio DAT ripping program for IRIX to Linux, I'd be able to get rid of my O2..... (SGI got special DAT tape drive firmware made by Seagate that can read and write Audio DAT tapes in a particular Seagate/Archive Python DDS-1 drive; SGI also put the software wot work with Audio DAT's in IRIX. I use that program occasionally on my O2, and previously on my Indigo2/IMPACT, to 'rip' Audio DAT's for my professional audio production side business.... I can also master to Audio DAT with the same program, making it quite nice indeed.).
Dude, thats killer.
I miss the SGI/Irix dayz.
Solid hardware/OS for sure.
Can you believe that in the early 90s they had real market share for desktop Unix boxes?
- aurf
aurfalien@gmail.com wrote:
On Apr 14, 2011, at 12:43 PM, Lamar Owen wrote:
On Thursday, April 14, 2011 02:17:41 PM aurfalien@gmail.com wrote:
However if you like XFS, I'll assume you liek IRIX so check the 5dwm project which is the IRIX desktop for Linux.
Cool. Now if they ported the Audio DAT ripping program for IRIX to Linux, I'd be able to get rid of my O2..... (SGI got special DAT
<snip>
Dude, thats killer.
I miss the SGI/Irix dayz.
Solid hardware/OS for sure.
Can you believe that in the early 90s they had market share for desktop Unix boxes.
Yeah, I liked SGI's and Irix; liked Suns and Solaris. Sun, er, Oracle, now? HELL, NO!!!
mark
On Thursday, April 14, 2011 09:04 AM, Ross Walker wrote:
On Apr 13, 2011, at 7:26 PM, John Jasenjjasen@realityfailure.org wrote:
On 04/12/2011 08:19 PM, Christopher Chan wrote:
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
HAHAHAHHHHHHHHAAAAAAAAAAAAAAAHAAAAAAAAAAAAAAAAAAAAA
The XFS codebase is the biggest pile of mess in the Linux kernel and you expect it not to run into mysterious problems? Remember, XFS was PORTED over to Linux. It is not a 'native' thing to Linux.
On Thursday, April 14, 2011 09:04 AM, Ross Walker wrote:
On Apr 13, 2011, at 7:26 PM, John Jasenjjasen@realityfailure.org wrote:
On 04/12/2011 08:19 PM, Christopher Chan wrote:
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
HAHAHAHHHHHHHHAAAAAAAAAAAAAAAHAAAAAAAAAAAAAAAAAAAAA
The XFS codebase is the biggest pile of mess in the Linux kernel and you expect it to be not run into mysterious problems? Remember, XFS was PORTED over to Linux. It is not a 'native' thing to Linux.
You're confusing me, I always thought Linux has been ported to XFS :)
There were some issues with XFS and maybe there still are. But you cannot say there are no environments where it works very stably. I started using XFS back in the RH7.2 days and I can also tell some stories, but not all of them were XFS's fault. The only real problem was the fact that RedHat didn't choose XFS as their FS of choice, which meant that only a few resources were put into the XFS code and only a few people actually used it. That's the only thing where ext2,3,4 was better IMHO.
Simon
On Thursday, April 14, 2011 08:55 PM, Simon Matter wrote:
On Thursday, April 14, 2011 09:04 AM, Ross Walker wrote:
On Apr 13, 2011, at 7:26 PM, John Jasenjjasen@realityfailure.org wrote:
On 04/12/2011 08:19 PM, Christopher Chan wrote:
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
Every now and then I hear these XFS horror stories. They seem too impossible to believe.
Nothing breaks for absolutely no reason and failure to know where the breakage was shows that maybe there wasn't adequately skilled techinicians for the technology deployed.
XFS if run in a properly configured environment will run flawlessly.
HAHAHAHHHHHHHHAAAAAAAAAAAAAAAHAAAAAAAAAAAAAAAAAAAAA
The XFS codebase is the biggest pile of mess in the Linux kernel and you expect it to be not run into mysterious problems? Remember, XFS was PORTED over to Linux. It is not a 'native' thing to Linux.
You're confusing me, I always thought Linux has been ported to XFS :)
There were some issues with XFS and maybe there still are. But, you can not say there are no environments where it work very stable. I've started using XFS back in the RH7.2 days and I can also tell some stories, but not all of them were XFS's fault. The only real problem was the fact that RedHat didn't chose XFS as their FS of choice which meant that just a few ressources were put into the XFS code and just a few peoples actually used it. That's the only thing where ext2,3,4 was better IMHO.
Where did I say that there are no environments where it works very stable? I used XFS extensively when I was running mail server farms for the mail queue filesystem and I only remember one or two incidents when the filesystem was marked read-only for no reason (seemingly - never had the time to find out why) but a reboot fixed those. XFS was better performing then but less reliable (yoohoo, hi Linux fake fsync/fdatasync) than ext3. So I personally have not had MAJOR problems with XFS but you bet that I don't think it's 100% safe in a properly configured environment. But that does not mean I am saying one must always encounter issues with it.
Redhat not choosing XFS is because the thing's code base is a quagmire and they had no developer familiar with it. Only Suse supported it because they could since they had XFS developers on their payroll and those developers were kept busy if you ask me.
On Thursday, April 14, 2011 10:37:15 AM Christopher Chan wrote:
I used XFS extensively when I was running mail server farms for the mail queue filesystem and I only remember one or two incidents when the filesystem was marked read-only for no reason (seemingly - never had the time to find out why) but a reboot fixed those.
I've had that happen, recently, with ext3 on CentOS 4.
FWIW.
On Thursday, April 14, 2011 10:54 PM, Lamar Owen wrote:
On Thursday, April 14, 2011 10:37:15 AM Christopher Chan wrote:
I used XFS extensively when I was running mail server farms for the mail queue filesystem and I only remember one or two incidents when the filesystem was marked read-only for no reason (seemingly - never had the time to find out why) but a reboot fixed those.
I've had that happen, recently, with ext3 on CentOS 4.
FWIW.
I wonder if there were any changes to the ext3 code in the CentOS 4 kernel lately...
On 4/14/2011 9:54 AM, Lamar Owen wrote:
On Thursday, April 14, 2011 10:37:15 AM Christopher Chan wrote:
I used XFS extensively when I was running mail server farms for the mail queue filesystem and I only remember one or two incidents when the filesystem was marked read-only for no reason (seemingly - never had the time to find out why) but a reboot fixed those.
I've had that happen, recently, with ext3 on CentOS 4.
Same here, CentOS5 and ext3. Rare and random across identical hardware. So far I've blamed the hardware.
On Thursday, April 14, 2011 11:20:23 AM Les Mikesell wrote:
Same here, CentOS5 and ext3. Rare and random across identical hardware. So far I've blamed the hardware.
I don't have that luxury. This is one VM on a VMware ESX 3.5U5 host, and the storage is EMC Clariion fibre-channel, with the VMware VMFS3 in between. Same storage RAID groups serve other VMs that haven't shown the problem. Happened regardless of the ESX host on which the guest was running; I even svmotioned the vmx/vmdk over to a different RAID group, and after roughly two weeks it did it again.
I haven't had the issue since the 4.9 update, and transitioning from the all-in-one vmware-tools package to the OSP stuff at packages.vmware.com (did that for a different reason, that of the 'can't reboot/restart vmxnet if IPv6 enabled' issue on ESX 3.5).
Only the one VM guest had the problem; we have several other C4 VM's, too. This one has the Scalix mailstore on it. Reboot into single user, disable the journal, fsck, re-enable the journal, and things are ok. Well, the last time it happened I didn't disable the journal before the fsck/reboot, but didn't suffer any data loss even then (journal replay in the 'fs went read-only journal stopped' case isn't something you want to have happen in the general case).
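(For the record, the journal dance is roughly this, with the filesystem unmounted and /dev/sdX1 standing in for the real device:

    tune2fs -O ^has_journal /dev/sdX1    # drop the journal, back to ext2
    e2fsck -f /dev/sdX1                  # full forced check
    tune2fs -j /dev/sdX1                 # recreate the journal, back to ext3
)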
On Thursday, April 14, 2011 04:54:34 PM Lamar Owen wrote:
On Thursday, April 14, 2011 10:37:15 AM Christopher Chan wrote:
I used XFS extensively when I was running mail server farms for the mail queue filesystem and I only remember one or two incidents when the filesystem was marked read-only for no reason (seemingly - never had the time to find out why) but a reboot fixed those.
I've had that happen, recently, with ext3 on CentOS 4.
The default behaviour for ext3 on CentOS-5 is to remount read-only, as a safety measure, when something goes wrong beneath it (see mount option "errors" in man mount). The root cause can be any of a long list of hardware or software (kernel) problems (typically not ext3's fault though).
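For example, the behaviour is selectable per filesystem in fstab (device and mount point are hypothetical):

    # errors= can be continue, remount-ro or panic
    /dev/sdb1   /data   ext3   defaults,errors=remount-ro   1 2

The same default can also be stored in the superblock with tune2fs -e remount-ro /dev/sdb1.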
/Peter
On Thursday, April 14, 2011 11:41:07 AM Peter Kjellström wrote:
The default behaviour for ext3 on CentOS-5 is to remount read-only, as a safety measure, when something goes wrong beneath it (see mount option "errors" in man mount). The root cause can be any of a long list of hardware or software (kernel) problems (typically not ext3's fault though).
The root cause made its appearance as clamd getting oom-killed. Eight hours of rampant oom-killer activity, and the fs goes bang. Plenty of memory allocated by the host; perhaps too much memory for the 32-bit guest. But, as I said, the combination of the 4.9 update and going with VMware's OSP setup from packages.vmware.com seem to have fixed the underlying issue.
Looking at a whole e-mail system overhaul anyway; while Scalix the package is performing well for what we need it to do, Scalix the company has been incredibly slow on the next update. Looking to go to Zarafa on C6 x86_64, perhaps. MS Outlook public folder/shared calendar/shared contacts/group scheduling support is the number one criterion; and Exchange is not the answer. So an upgrade of the existing system isn't on the radar at the moment, but a full migration to something else is.
On 4/14/2011 7:32 AM, Christopher Chan wrote:
HAHAHAHHHHHHHHAAAAAAAAAAAAAAAHAAAAAAAAAAAAAAAAAAAAA
The XFS codebase is the biggest pile of mess in the Linux kernel and you expect it to be not run into mysterious problems? Remember, XFS was PORTED over to Linux. It is not a 'native' thing to Linux.
Well yeah, but the way I remember it, SGI was using it for real work like video editing and storing zillions of files back when Linux was a toy with a 2 gig file size limit and linear directory scans as the only option. If you mean that the Linux side had a not-invented-here attitude about it and did the port badly you might be right...
On Thursday, April 14, 2011 11:30 PM, Les Mikesell wrote:
On 4/14/2011 7:32 AM, Christopher Chan wrote:
HAHAHAHHHHHHHHAAAAAAAAAAAAAAAHAAAAAAAAAAAAAAAAAAAAA
The XFS codebase is the biggest pile of mess in the Linux kernel and you expect it to be not run into mysterious problems? Remember, XFS was PORTED over to Linux. It is not a 'native' thing to Linux.
Well yeah, but the way I remember it, SGI was using it for real work like video editing and storing zillions of files back when Linux was a toy with a 2 gig file size limit and linear directory scans as the only option. If you mean that the Linux side had a not-invented-here attitude about it and did the port badly you might be right...
No, the XFS guys had to work around the differences between the Linux vm and IRIX's, and that eventually led to what we have today - a big messy pile of code. It would be no surprise for there to be stuff that gets triggered, imho.
I am not saying that XFS itself is bad. Just that the implementation on Linux was not quite the same quality as it is on IRIX.
On Thursday, April 14, 2011 07:26 AM, John Jasen wrote:
On 04/12/2011 08:19 PM, Christopher Chan wrote:
On Tuesday, April 12, 2011 10:36 PM, John Jasen wrote:
On 04/12/2011 10:21 AM, Boris Epstein wrote:
On Tue, Apr 12, 2011 at 3:36 AM, Alain Péan <alain.pean@lpp.polytechnique.fr> wrote:
<snipped: two recommendations for XFS>
I would chime in with a dis-commendation for XFS. At my previous employer, two cases involving XFS resulted in irrecoverable data corruption. These were on RAID systems running from 4 to 20 TB.
What were those circumstances? Crash? Power outage? What are the components of the RAID systems?
One was a hardware raid over fibre channel, which silently corrupted itself. System checked out fine, raid array checked out fine, xfs was replaced with ext3, and the system ran without issue.
Second was multiple hardware arrays over linux md raid0, also over fibre channel. This was not so silent corruption, as in xfs would detect it and lock the filesystem into read-only before it, pardon the pun, truly fscked itself. Happened two or three times, before we gave up, split up the raid, and went ext3, Again, no issues.
32-bit kernel by any chance?
On Apr 12, 2011, at 3:23 AM, Matthew Feinberg wrote:
ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
The CentOS homepage states that ext4 is now a fully supported filesystem in 5.6.
On Tuesday 12 April 2011 15:34:21 Torres, Giovanni (NIH/NINDS) [C] wrote:
On Apr 12, 2011, at 3:23 AM, Matthew Feinberg wrote:
ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
The CentOS homepage states that ext4 is now a fully supported filesystem in 5.6.
Steve, I've been managing machines with 30TB of storage for more than two years. And with good reporting and quick reaction we have never had to run fsck.
However, I'm sure that if you have to run fsck on such big file systems, it will be faster to rebuild the array from other storage than to wait a few weeks for it to finish.
On machines like that I use CentOS, but I'm partitioning them before the install with a rescue live CD that I have created myself.
Marian
On Tue, Apr 12, 2011 at 2:47 PM, Marian Marinov mm@yuhu.biz wrote:
Steve, I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
However I'm sure that if you have to run fsck on so big file systems, it will be fater to rebuild the array from other storage then waiting for a few weeks to finish.
On machines like that I use CentOS but I'm pratitioning them before the install with a rescue live cd that I have created for me.
Marian
As a matter of interest, what hardware do you use? I.e. what CPUs, how much RAM and which RAID cards do you use on a system this size?
Everyone always recommends using smaller RAID arrays rather than one big fat one. So I'm interested to know what you use, and how well it works, i.e. if that 30TB were actively used by many hosts, how does it cope? Or is it just archival storage?
Rudi Ahlers wrote:
On Tue, Apr 12, 2011 at 2:47 PM, Marian Marinov mm@yuhu.biz wrote:
I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
However I'm sure that if you have to run fsck on so big file systems, it will be fater to rebuild the array from other storage then waiting for
a few
weeks to finish.
<snip> Here's a question: which would be faster on that huge a filesystem: fsck, or having a second 30TB filesystem, and rsyncing everything over?
mark
On Tuesday 12 April 2011 16:20:22 m.roth@5-cent.us wrote:
Rudi Ahlers wrote:
On Tue, Apr 12, 2011 at 2:47 PM, Marian Marinov mm@yuhu.biz wrote:
I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
However I'm sure that if you have to run fsck on so big file systems, it will be fater to rebuild the array from other storage then waiting for
a few
weeks to finish.
<snip> Here's a question: which would be faster on that huge a filesystem: fsck, or having a second 30TB filesystem, and rsyncing everything over?
For us, it was faster to transfer the information again. At least this was during the tests. We have never had to do it for real.
I guess the time for the fsck depends on the amount of errors that you have. If it has to check only the journal, the fsck will not take long. But if it has to do a full check of the FS... an rsync may be faster.
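The re-transfer itself is just something like (paths made up):

    # preserve permissions, owners, times and hardlinks; drop anything stale on the target
    rsync -aH --delete /mnt/goodcopy/ /mnt/rebuilt-array/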
Marian
On Apr 12, 2011, at 8:53 AM, Rudi Ahlers Rudi@SoftDux.com wrote:
On Tue, Apr 12, 2011 at 2:47 PM, Marian Marinov mm@yuhu.biz wrote:
Steve, I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
However I'm sure that if you have to run fsck on so big file systems, it will be fater to rebuild the array from other storage then waiting for a few weeks to finish.
On machines like that I use CentOS but I'm pratitioning them before the install with a rescue live cd that I have created for me.
Marian
As matter of interest, what hardware do you use? i.e. what CPU's, size of RAM and RAID cards do you use on this size system?
Everyone always recommends to use smaller RAID arrays than one big fat one. So, I'm interested to know what you use, and how effective it works. i.e. if that 30TB was actively used by many hosts how does it cope? Or is it just archival storage?
I would never create a RAID5/6 greater than 8 disks. Usually I create a 6 or 7 disk RAID5, which means I can fit 2 in a 15 disk enclosure and have a hot spare, and stripe them.
The more RAID5 sets you have the greater the write IOPS you can achieve.
Though for max IOPS nothing beats RAID10.
-Ross
On Wednesday, April 13, 2011 04:54:01 AM Ross Walker wrote:
On Apr 12, 2011, at 8:53 AM, Rudi Ahlers Rudi@SoftDux.com wrote:
...
As matter of interest, what hardware do you use? i.e. what CPU's, size of RAM and RAID cards do you use on this size system?
Everyone always recommends to use smaller RAID arrays than one big fat one. So, I'm interested to know what you use, and how effective it works. i.e. if that 30TB was actively used by many hosts how does it cope? Or is it just archival storage?
I would never create a RAID5/6 greater then 8 disks. Usually I create a 6 or 7 disk RAID5 which means I can fit 2 in a 15 disk enclosure and have a hot spare and stripe them.
Personal preference here, personal preference there. Here's a datapoint: We run >PB of data on 12 drive raid6 using sata (no hot spare). Are we happy with that config: yes, would it be faster to use 15K sas in raid10: yes *shrug*
/Peter
The more RAID5 sets you have the greater the write IOPS you can achieve.
Though for max IOPS nothing beats RAID10.
-Ross
On Tuesday 12 April 2011 15:34:21 Torres, Giovanni (NIH/NINDS) [C] wrote:
On Apr 12, 2011, at 3:23 AM, Matthew Feinberg wrote:
ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
The CentOS homepage states that ext4 is now a fully supported filesystem in 5.6.
Steve, I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
That's not the issue. The issue is rebuild time. The longer it takes, the more likely another failure in the array becomes. With RAID6, this does not instantly kill your RAID, as with RAID5 - but I assume it will further decrease overall performance and the rebuild time will go up significantly - adding to the risk. Thus, it's generally advisable to just use RAID10 (in this case, a thin-striped array of RAID1 arrays).
On Tuesday 12 April 2011 15:56:54 rainer@ultra-secure.de wrote:
On Tuesday 12 April 2011 15:34:21 Torres, Giovanni (NIH/NINDS) [C] wrote:
On Apr 12, 2011, at 3:23 AM, Matthew Feinberg wrote:
ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
The CentOS homepage states that ext4 is now a fully supported filesystem in 5.6.
Steve, I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
That's not the issue. The issue is rebuild-time. The longer it takes, the more likely is another failure in the array. With RAID6, this does not instantly kill your RAID, as with RAID5 - but I assume it will further decrease overall-performance and the rebuild-time will go up significantly - adding the the risk. Thus, it's generally advisable to do just use RAID10 (in this case, a thin-striped array of RAID1-arrays).
Yes... but with such a RAID10 solution you get only half of the disk space... so from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
Some of us really need the space. Rebuild time (while it is less than 4 days) is considered good enough. In my case I'm using these servers for backups and the raid rebuilds haven't made any difference to the performance of the backups.
I'm sure that if you use such storage with RAID6 for VMs it won't perform very well.
Marian
On 12.4.2011 15:02, Marian Marinov wrote:
On Tuesday 12 April 2011 15:56:54 rainer-RNrd0m5o0MABOiyIzIsiOw@public.gmane.org wrote:
Yes... but with such RAID10 solution you get only half of the disk space... so from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
From a somewhat theoretical view, this is true for standard raid10, but Linux md raid10 is much more flexible as I understood it. You could do 2 copies over 2 disks, that's like standard 10. Or you could do 2 copies over 2 or 3 or ... x disks. Or you could do 3 copies over 3 or 4 or ... x disks. Do the math. See the manpage for md(4) and http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
However, I have to admit that I have no experience with that but would like to hear about any disadvantages or if I am mislead. I am just interested.
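I believe the mdadm invocations would look roughly like this, though I have not run them myself (disk names are made up):

    # near-2 layout over 4 disks - behaves like a classic RAID10
    mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 /dev/sd[b-e]1

    # the same 2 copies spread over an odd number of disks, which plain 1+0 can't do
    mdadm --create /dev/md1 --level=10 --layout=n2 --raid-devices=5 /dev/sd[f-j]1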
On Tue, Apr 12, 2011 at 3:48 PM, Markus Falb markus.falb@fasel.at wrote:
On 12.4.2011 15:02, Marian Marinov wrote:
On Tuesday 12 April 2011 15:56:54
rainer-RNrd0m5o0MABOiyIzIsiOw@public.gmane.org wrote:
Yes... but with such RAID10 solution you get only half of the disk
space... so
from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
From a somewhat theoretical view, this is true for standard raid10 but Linux md raid10 is much more flexible as I understood it. You could do 2 copys over 2 disks, thats like standard 10. Or you could do 2 copys over 2 or 3 or ... x disks. Or you could do 3 copys over 3 or 4 or ... x disks. Do the math. See the manpage for md(4) and http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
However, I have to admit that I have no experience with that but would like to hear about any disadvantages or if I am mislead. I am just interested.
--
We only use RAID 10 (rather 1+0) and never even bothered with RAID6. And we've had no data loss in the past 3 years with it yet, on hundreds of servers.
But our RAID10 is set up as a stripe of mirrors, i.e. sda1 & sdb1 -> md0, sdc1 + sdd1 -> md1, then sde1 + sdf1 -> md2, and finally md0 + md1 + md2 are striped. The advantage of this is that we can add more disks to the whole RAID set with no downtime (all servers have hot swap HDD cages) and very little performance degradation, since the 2 new drives have to be mirrored on their own first (which takes very little CPU / RAM resources) and are then added to the RAID set. Rebuild is generally quick since it only rebuilds the broken mirror.
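In mdadm terms the layout above is roughly (device names as in the example):

    # three mirrors
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc1 /dev/sdd1
    mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde1 /dev/sdf1

    # striped together on top
    mdadm --create /dev/md3 --level=0 --raid-devices=3 /dev/md0 /dev/md1 /dev/md2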
On 4/12/11, Rudi Ahlers Rudi@softdux.com wrote:
But, our RAID10 is setup as a stripe of mirrors, i.e. sda1 & sdb1 -> md0, sdc1 + sdd1 ->md1, then sde1 + sdf1 ->md2, and finally md0 + md1 + md2 are stripped. The advantage of this is that we can add more disks to the whole RAID set with no downtime
Off-topic, but when you say add more disks, do you mean for the purpose of replacing failing disks or for expanding the array? I'm curious because on initial reading I read it to mean expanding the storage capacity of the array but thought it was currently not possible to expand a mdadm RAID 0 non-destructively.
On Tue, Apr 12, 2011 at 9:35 PM, Emmanuel Noobadmin centos.admin@gmail.comwrote:
Off-topic, but when you say add more disks, do you mean for the purpose of replacing failing disks or for expanding the array? I'm curious because on initial reading I read it to mean expanding the storage capacity of the array but thought it was currently not possible to expand a mdadm RAID 0 non-destructively.
centos 5 can expand raid 0/1/5. just not 6. 10 is just layered 0/1 so you can expand it. centos 6 will be able to expand raid6 as it was a feature in 2.6.20 or something.
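e.g. growing a 4-disk raid5 by one disk is roughly (device names made up; mdadm may want a --backup-file for the critical section):

    mdadm /dev/md0 --add /dev/sdf1            # new disk goes in as a spare
    mdadm --grow /dev/md0 --raid-devices=5    # reshape 4 -> 5 active devices
    # then grow the filesystem on top: resize2fs /dev/md0, xfs_growfs /mnt/point, etc.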
Brandon
On 4/13/11, Brandon Ooi brandono@gmail.com wrote:
centos 5 can expand raid 0/1/5. just not 6. 10 is just layered 0/1 so you can expand it. centos 6 will be able to expand raid6 as it was a feature in 2.6.20 or something.
This is where I'm getting confused. I had been reading up on mdadm, torn between using RAID 5/6 for the ability to grow the array with more disks and RAID 10 for better IOPS. The man pages itself says that "Currently supported growth options include changing the active size of component devices and changing the number of active devices in RAID levels 1/4/5/6,"
That, along with other internet sources, seems to imply that growing RAID 0 is not supported, and therefore by extension neither is RAID 10. Furthermore, I read on Neil Brown's blog that reshaping RAID 10 was a planned but not yet implemented feature.
Is the difference here between using mdadm to directly create a RAID 10 vs manually layering on RAID 0 on RAID 1 devices?
Or is the expansion here limited to replacing the existing components drive with larger ones, e.g. replacing four 1TB drives with four 2TB drives so going from a 2TB to 4TB array?
On Wed, Apr 13, 2011 at 6:35 AM, Emmanuel Noobadmin centos.admin@gmail.comwrote:
On 4/12/11, Rudi Ahlers Rudi@softdux.com wrote:
But, our RAID10 is set up as a stripe of mirrors, i.e. sda1 + sdb1 -> md0, sdc1 + sdd1 -> md1, then sde1 + sdf1 -> md2, and finally md0 + md1 + md2 are striped. The advantage of this is that we can add more disks to the whole RAID set with no downtime
Off-topic, but when you say add more disks, do you mean for the purpose of replacing failing disks or for expanding the array? I'm curious because on initial reading I read it to mean expanding the storage capacity of the array but thought it was currently not possible to expand a mdadm RAID 0 non-destructively.
to expand the array :)
I haven't had problems doing it this way yet.
The other way is to run LVM on top of the three md's, i.e. pvcreate /dev/md0 /dev/md1 /dev/md2 and then vgcreate volume01 on top of them, etc. LVM expands very easily with no downtime either.
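A minimal sketch of that LVM-on-md approach and a later expansion (volume group and mount point names are examples):

pvcreate /dev/md0 /dev/md1 /dev/md2
vgcreate volume01 /dev/md0 /dev/md1 /dev/md2
lvcreate -l 100%FREE -n backup volume01
# later, after building a new mirror /dev/md3:
pvcreate /dev/md3
vgextend volume01 /dev/md3
lvextend -l +100%FREE /dev/volume01/backup
xfs_growfs /backup        # XFS grows online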
On Tuesday 12 April 2011 16:48:14 Markus Falb wrote:
On 12.4.2011 15:02, Marian Marinov wrote:
On Tuesday 12 April 2011 15:56:54 rainer-RNrd0m5o0MABOiyIzIsiOw@public.gmane.org wrote:
Yes... but with such RAID10 solution you get only half of the disk space... so from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
From a somewhat theoretical view, this is true for standard raid10, but Linux md raid10 is much more flexible as I understand it. You could do 2 copies over 2 disks, that's like standard 10. Or you could do 2 copies over 2 or 3 or ... x disks. Or you could do 3 copies over 3 or 4 or ... x disks. Do the math. See the manpage for md(4) and http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
However, I have to admit that I have no experience with that, but I would like to hear about any disadvantages or if I am misled. I am just interested.
It's like doing RAID50 or RAID60... Again, the cheapest solution is RAID6. I really like the software raid in Linux; it has good performance. But I have never tested it on volumes this big. And usually it is really hard to put 10 or more drives in a machine without buying a SATA controller.
Marian
On 04/12/11 6:02 AM, Marian Marinov wrote:
Yes... but with such RAID10 solution you get only half of the disk space... so from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
Those disks are $100 each. What's your data worth?
The rebuild time goes way up as the number of drives in the raid stripe goes up.
in this case, the OP is talking about a 40TB array, so that's a TWENTY-TWO drive raid. NO ONE I know in the storage business will use larger than an 8 or 10 drive raid set. If you really need such a massive volume, you stripe several smaller raid sets, so the raid6 version would be 2 x 12 x 2TB, or 24 drives for raid6+0 == 40TB.
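With md, that 2 x 12-drive raid6+0 layering would look something like this (drive names are illustrative only; a hardware controller would do the equivalent internally):

# two 12-drive raid6 sets, then a stripe across them
mdadm --create /dev/md0 --level=6 --raid-devices=12 /dev/sd[b-m]
mdadm --create /dev/md1 --level=6 --raid-devices=12 /dev/sd[n-y]
mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/md0 /dev/md1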
but the OP's application is backup. For backup, it really doesn't matter what the volume size is; more, smaller file systems are fine, so you can partition your backups by date interval or whatever.
let me throw out another thing. I assume this 40TB backup server is not just ONE backup of the current state, but an archive of point-in-time backups? You'd better have more than one of them, where you back up the backup onto the second. There are any number of scenarios the raid6 won't protect against, including file system corruption, raid controller failure where it dumps across a whole stripe, etc.
On Tuesday, April 12, 2011 02:51:45 PM John R Pierce wrote:
On 04/12/11 6:02 AM, Marian Marinov wrote:
Yes... but with such RAID10 solution you get only half of the disk space... so from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
those disks are $100 each. whats your data worth?
Where can I get an enterprise-class 2TB drive for $100? Commodity SATA isn't enterprise-class. SAS is; FC is; SCSI is. A 500GB FC drive with EMC firmware, new, is going to set you back ten times that, at least. What's your data worth indeed, putting it on commodity disk.... :-)
in this case, the OP is talking about a 40TB array, so that's a TWENTY-TWO drive raid. NO ONE I know in the storage business will use larger than an 8 or 10 drive raid set.
EMC allows RAID groups up to 16 drives on Clariion storage. I've been doing this with EMC stuff for a while, with RAID6 plus a hotspare per DAE; that's a 14 drive RAID group plus the hotspare on one DAE. Some systems I forgo the dedicated per-DAE hotspare and spread a 16 drive RAID6 group and a 14 drive RAID6 group across two DAE's with hotspares on other DAE's. Works ok, and I've had double drive soft failures on a single RAID6 group that successfully hotspared (and back). This is partially due to the custom EMC firmware on the drives, and the interaction with the storage processor.
Rebuild time is several hours, but with more smaller drives it's not too bad.
If you really need such a massive volume, you stripe several smaller raidsets, so the raid6 version would be 2 x 12 x 2TB or 24 drives for raid6+0 == 40TB.
Or you do metaLUNs, or similar using LVM.
On Apr 12, 2011, at 1:54 PM, Lamar Owen wrote:
On Tuesday, April 12, 2011 02:51:45 PM John R Pierce wrote:
On 04/12/11 6:02 AM, Marian Marinov wrote:
Yes... but with such RAID10 solution you get only half of the disk space... so from 10 2TB drives you get only 10TB instead of 16TB with RAID6.
those disks are $100 each. whats your data worth?
Where can I get an enterprise-class 2TB drive for $100?
This is a good point.
The cheapies are the so-called green drives: they spin down often, which is not what you want in a RAID setup.
While I've been able to tweak this in OS X, I haven't yet tried to see what to do in Linux or Winblowz, which I will eventually do, as some turd nugget bought a bunch of these for pro use.
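For what it's worth, on Linux the usual knobs are hdparm's power-management timers; a rough sketch (not every drive honours these, and the WD green head-parking timer needs the vendor idle3/wdidle3 tool instead):

hdparm -S 0 /dev/sdb      # disable the standby (spin-down) timeout
hdparm -B 255 /dev/sdb    # disable APM, where the drive supports it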
Commodity SATA isn't enterprise-class. SAS is; FC is, SCSI is. A 500GB FC drive with EMC firmware new is going to set you back ten times that, at least. What's youre data worth indeed, putting it on commodity disk.... :-)
in this case, the OP is talking about a 40TB array, so thats a TWENTY TWO drive raid. NOONE I know in the storage business will use larger than a 8 or 10 drive raid set.
EMC allows RAID groups up to 16 drives on Clariion storage.
Yea, as does BlueArc, unsure of the rest but agreed.
- aurf
On Tue, Apr 12, 2011 at 02:01:42PM -0700, aurfalien@gmail.com wrote:
The cheapies are the so-called green drives: they spin down often, which is not what you want in a RAID setup.
The WD RE4-GP is a so-called "green" disk that's suitable for RAID arrays. It's marketed and priced as an enterprise drive.
--keith
On Apr 12, 2011, at 3:02 PM, Keith Keller wrote:
On Tue, Apr 12, 2011 at 02:01:42PM -0700, aurfalien@gmail.com wrote:
The cheapies are so called green as they spin down often which is not what you want in a RAID setup.
The WD RE4-GP is a so-called ''green'' disk that's suitable for RAID arrays. It's marketed and priced as an enterprise drive.
Well, it may either be BS marketing, or it is so-called green for a different reason and not the frequent spin-downs.
I'm finding green can mean many things, from the product packaging being made from recycled material, to the manufacturing plant no longer using mercury, to the power consumption being lower than in previous models, etc...
I would say that the $100 price tag would be a caution but then again one doesn't always get what one pays for.
Either way, it makes our jobs more challenging for sure.
- aurf
The WD RE4-GP is a so-called ''green'' disk that's suitable for RAID arrays. It's marketed and priced as an enterprise drive.
I've had good luck with green, 5400 rpm Samsung drives. They don't spin down automatically and work fine in my raid 5 arrays. The cost is about $80 for 2TB drives.
I also have a few 5900 rpm Seagate ST32000542AS drives, but not currently in raids. They don't spin down, so I'm sure they would be fine in a raid.
None of the drives in the raids have failed, although I've replaced a couple that developed reallocated sectors as reported by SMART.
Being so tiny on the outside, 2.5 inch drives like the Seagate Constellation and WD Raptors are great. Unfortunately, they don't come any larger than 1TB, so I use them in special situations.
On Tuesday, April 12, 2011 07:00:26 PM compdoc wrote:
I've had good luck with green, 5400 rpm Samsung drives. They don't spin down automatically and work fine in my raid 5 arrays. The cost is about $80 for 2TB drives.
And that's a good price point for a commodity drive; not something I would count on for long-term use, but still a good price point.
I also have a few 5900 rpm Seagate ST32000542AS drives, but not currently in raids. They don't spin down, so I'm sure they would be fine in a raid.
The biggest issue isn't the spindown. Google 'WDTLER' and see the other, bigger, issue. In a nutshell, TLER (Time-Limited Error Recovery; see https://secure.wikimedia.org/wikipedia/en/wiki/TLER ) allows the drive to not try to recover soft errors quite as long. The error recovery time can cause the drive to drop out of RAID sets and be marked as faulted.
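On drives that expose it, the same timeout can be inspected and shortened via SCT Error Recovery Control with smartctl; a sketch, assuming the drive actually supports SCT ERC (many desktop drives do not, or lose the setting at power cycle):

smartctl -l scterc /dev/sdb           # show the current read/write ERC timeouts
smartctl -l scterc,70,70 /dev/sdb     # set both to 7.0 seconds (values are tenths of a second)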
Just because they are so tiny on the outside, 2.5 inch drives like the Seagate Constellation and WD Raptors are great. Unfortunately, the don't come any larger than 1TB, so I use them in special situations.
FWIW, EMC's new VNX storage systems are at the 2.5 inch formfactor, with SSD and mechanical platter drives as options, using 6G SAS interfaces.
The biggest issue isn't the spindown. Google 'WDTLER' and see the other, bigger, issue. In a nutshell, TLER (Time-Limited Error Recovery; see https://secure.wikimedia.org/wikipedia/en/wiki/TLER ) allows the drive to not try to recover soft errors quite as long. The error recovery time can cause the drive to drop out of RAID sets and be marked as faulted.
Yes, I'm aware of that and it's the reason I have to replace drives developing reallocated sectors: they get dropped by my 3ware controllers. There's a penalty for using cheap drives, but there's also a benefit from the low heat and power savings.
To me, drives and power supplies are a consumable item - something you're going to have to replace from time to time. I'm used to it since I service computers for a living. I've seen enterprise drives fail too, although probably not as often.
By the way, I'm seeing too many people with failing SSDs to start relying on those yet. I own one so far, but it's not used much.
Where can I get an enterprise-class 2TB drive for $100? Commodity SATA isn't enterprise-class. SAS is; FC is, SCSI is. A 500GB FC drive with EMC firmware new is going to set you back ten times that, at least. What's youre data worth indeed, putting it on commodity disk.... :-)
I can get Seagate's Constellation ES series SATA drives in 1TB for $125. 2TB will run me around $225.
They're not something I'd run my database off (I have 15k SAS drives for that), but for large amounts of storage on the cheap, like our backup system, they're just fine.
On Tuesday, April 12, 2011 06:49:08 PM Drew wrote:
Where can I get an enterprise-class 2TB drive for $100? Commodity SATA isn't enterprise-class.
I can get Seagate's Constellation ES series SATA drives in 1TB for $125. 2TB will run me around $225.
Yeah, those are reasonable near-line drives for archival storage, or when you have a very small number of servers accessing the storage, and large amounts of cache.
EMC used Barracuda ES SATA drives in their Clariion CX3 boxes for a while; used a dual attach 4G FC bridge controller to go from the DAE backplane to the SATA port, and emulated the dual attach functionality of FC with it. I'm not 100% sure, but I think the SATA drive itself got EMC-specific firmware.
On Tue, Apr 12, 2011 at 8:56 AM, rainer@ultra-secure.de wrote:
That's not the issue. The issue is rebuild time. The longer it takes, the more likely another failure in the array becomes. With RAID6, this does not instantly kill your RAID, as it does with RAID5 - but I assume it will further decrease overall performance and the rebuild time will go up significantly - adding to the risk. Thus, it's generally advisable to just use RAID10 (in this case, a thin-striped array of RAID1 arrays).
Statistically speaking, that risk isn't really there. RAID6 arrays have a slightly higher mean time between data loss than RAID10s. But the difference here is very small. So if you need the capacity and don't mind the performance difference between these two RAID levels, then RAID6 is perfectly fine in my opinion.
Here's a great blog post on calculating Mean Time To Data Loss, and they have a spreadsheet that you can download to play with. http://info.zetta.net/blog/bid/45661/Calculating-Mean-Time-To-Data-Loss-and-...
In my configuration, which is 12 drives, the chance of a data loss event over a 10 year period with RAID10 is 2.51% and with RAID6 is 1.31%. I would expect those numbers to go up a bit with a 16 drive configuration.
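For reference, the usual first-order approximations behind such MTTDL spreadsheets (they ignore unrecoverable read errors, so treat them as rough):

MTTDL(RAID10, N drives) ~= MTBF^2 / (N * MTTR)
MTTDL(RAID6,  N drives) ~= MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)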
-- David
On Tuesday, April 12, 2011 02:56:54 PM rainer@ultra-secure.de wrote: ...
Steve, I'm managing machines with 30TB of storage for more than two years. And with good reporting and reaction we have never had to run fsck.
That's not the issue. The issue is rebuild time. The longer it takes, the more likely another failure in the array becomes. With RAID6, this does not instantly kill your RAID, as it does with RAID5 - but I assume it will further decrease overall performance and the rebuild time will go up significantly - adding to the risk.
While I do concede the obvious point regarding rebuild time (raid6 takes from long to very long to rebuild) I'd like to point out:
* If you do the math for a 12 drive raid10 vs raid6 then (using actual data from ~500 1T drives on HP cciss controllers during two years) raid10 is ~3x more likely to cause hard data loss than raid6.
* MTBF is not everything; there's also the thing called unrecoverable read errors. If you hit one while rebuilding your raid10 you're toast, while in the raid6 case you'll use your 2nd parity and continue the rebuild.
/Peter (who runs many 12 drive raid6 systems just fine)
Thus, it's generally advisable to just use RAID10 (in this case, a thin-striped array of RAID1 arrays).
2011/4/14 Peter Kjellström cap@nsc.liu.se:
On Tuesday, April 12, 2011 02:56:54 PM rainer@ultra-secure.de wrote: ...
Steve, I'm managing machines with 30TB of storage for more then two years. And with good reporting and reaction we have never had to run fsck.
That's not the issue. The issue is rebuild-time. The longer it takes, the more likely is another failure in the array. With RAID6, this does not instantly kill your RAID, as with RAID5 - but I assume it will further decrease overall-performance and the rebuild-time will go up significantly - adding the the risk.
While I do concede the obvious point regarding rebuild time (raid6 takes from long to very long to rebuild) I'd like to point out:
* If you do the math for a 12 drive raid10 vs raid6 then (using actual data from ~500 1T drives on HP cciss controllers during two years) raid10 is ~3x more likely to cause hard data loss than raid6.
* mtbf is not everything there's also the thing called unrecoverable read errors. If you hit one while rebuilding your raid10 you're toast while in the raid6 case you'll use your 2nd parity and continue the rebuild.
You mean if the other side of the mirror fails while rebuilding it. Yes, this is true. Of course, if this happens with RAID6 it will rebuild from parity IF there is a second hotspare available, because remember the first failure wasn't cleared before the second failure occurred. Now your RAID6 is in a severely degraded state; one more failure before either of these disks is rebuilt will mean toast for the array. Now the performance of the array is practically unusable and the load on the disks is high as it does a full recalculation rebuild, and if they are large it will be high for a very long time. If any other disk in the very large RAID6 array is near failure, or has a bad sector, this taxing load could very well push it over the edge, and the risk of such an event occurring increases with the size of the array and the size of the disk surface.
I think this is where the mdraid raid10 shines, because it can have 3 copies (or more) of the data instead of just two - at, of course, three times (or more) the cost. It also allows for an uneven number of disks, as it just saves copies on different spindles rather than "mirrors". This, I think, provides the best protection against failure and the best performance, but at the worst cost. With 2TB and 4TB disks coming out it may very well be worth it, as the per-GB cost drops lower and lower and one can get 12TB of raw storage out of only 4 drives - imagine 12 drives. I wouldn't mind getting 16TB out of 48TB of raw if it costs me less than what 16TB of raw cost me just 2 years ago, especially if it means I get both performance and reliability.
/Peter (who runs many 12 drive raid6 systems just fine)
Thus, it's generally advisable to do just use RAID10 (in this case, a thin-striped array of RAID1-arrays).
It is not advisable to blanket-recommend any level of RAID.
The RAID level is determined by the needs of the application vs the risks of the RAID level vs the risks of the storage technology.
-Ross
On Thursday, April 14, 2011 05:26:41 PM Ross Walker wrote:
2011/4/14 Peter Kjellström cap@nsc.liu.se:
...
While I do concede the obvious point regarding rebuild time (raid6 takes from long to very long to rebuild) I'd like to point out:
- If you do the math for a 12 drive raid10 vs raid6 then (using actual data from ~500 1T drives on HP cciss controllers during two years) raid10 is ~3x more likely to cause hard data loss than raid6.
- MTBF is not everything; there's also the thing called unrecoverable read errors. If you hit one while rebuilding your raid10 you're toast, while in the raid6 case you'll use your 2nd parity and continue the rebuild.
You mean if the other side of the mirror fails while rebuilding it.
No, the drive (unrecoverably) failing to read a sector is not the same thing as a drive failure. Drive failure frequency expressed as MTBF is around 1M hours (even though, including predictive failures, we see more like 250K hours). Unrecoverable read error rates were until quite recently on the order of one error per 1x to 10x of the drive size read (a drive I looked up now was spec'd a lot higher, at ~1000x drive size). If we assume a raid10 rebuild time of 12h and an unrecoverable read error once every 10x of drive size, then the effective mean time between read errors is 120h (two to ten thousand times worse than the drive MTBF). Admittedly these numbers are hard to get and equally hard to trust (or double check).
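Spelling that arithmetic out with the assumed numbers:

1 URE per 10 x (drive size) read, and a rebuild reads 1 x (drive size) in 12 h
=> roughly 1 URE per 10 rebuilds, i.e. 1 per 10 x 12 h = 120 h of rebuilding
250,000 h / 120 h ~ 2,000x  and  1,000,000 h / 120 h ~ 8,000x worse than the drive MTBF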
What it all comes down to is that raid10 (assuming double, not triple, copies) stores your data with one extra copy/parity, and in a single drive failure scenario you have zero extra data left (on that part of the array). That is, you depend on each and every bit of that data (meaning the degraded part) being correctly read. This means you very much want both:
1) Very fast rebuilds (=> you need a hot-spare)
2) An unrecoverable read error rate much larger than your drive size
or as you suggest below:
3) Triple copy
Yes this is true, of course if this happens with RAID6 it will rebuild from parity IF there is a second hotspare available,
This is wrong; hot-spares are not that necessary when using raid6. This has to do with the fact that rebuild times (the time from when you start being vulnerable to when the rebuild completes) are already long. An added 12h for a tech to swap in the spare only marginally increases your risks.
because remember the first failure wasn't cleared before the second failure occurred. Now your RAID6 is in a severely degraded state; one more failure before either of these disks is rebuilt will mean toast for the array.
All of this was taken into account in my original example above. In the end (with my data) raid10 was around 3x more likely to cause ultimate data loss than raid6.
Now the performance of the array is practically unusable and the load on the disks is high as it does a full recalculation rebuild, and if they are large it will be high for a very long time. If any other disk in the very large RAID6 array is near failure, or has a bad sector, this taxing load could very well push it over the edge
In my example a 12 drive raid6 rebuild takes 6-7 days; this works out to < 5 MB/s sequential read per drive. This added load is not very noticeable in our environment (taking into account normal patrol reads and user data traffic).
Either way, the general problem of "[rebuild stress] pushing drives over the edge" is a larger threat to raid10 than raid6 (it being fatal in the first case...).
and the risk of such an event occurring increases with the size of the array and the size of the disk surface.
I think this is where the mdraid raid10 shines because it can have 3 copies (or more) of the data instead of just two,
I think we've now moved into what most people would call unreasonable. Let's see what we have for a 12 drive box (quite common 2U size):
raid6: 12x on raid6, no hot spare (see argument above) => 10 data drives
raid10: 11x triple store on raid10, one spare => 3.66 data drives
or (if your raid's not odd-drive capable):
raid10: 9x triple store on raid10, one to three spares => 3 data drives
(ok, yes you could get 4 data drives out of it if you skipped hot-spare)
That is almost a 2.7x-3.3x diff! My users sure care if their X $ results in 1/3 the space (or cost => 3x for the same space if you prefer).
On top of this, most raid10 implementations lack triple-copy functionality.
Also note that a raid10 that allows for an odd number of drives is more vulnerable to 2nd drive failures, resulting in an even larger than 3x improvement for raid6 (vs a double-copy, odd-drive-handling raid10).
/Peter
of course a three times (or more) the cost. It also allows for uneven number of disks as it just saves copies on different spindles rather then "mirrors". This I think provides the best protection against failure and the best performance, but at the worst cost, but with 2TB and 4TB disks coming out
...
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Torres, Giovanni (NIH/NINDS) [C] Sent: Tuesday, April 12, 2011 2:34 PM To: CentOS mailing list Subject: Re: [CentOS] 40TB File System Recommendations
On Apr 12, 2011, at 3:23 AM, Matthew Feinberg wrote: ext4 does not seem to be fully baked in 5.6 yet. parted 1.8 does not support creating ext4 (strange)
The CentOS homepage states that ext4 is now a fully supported filesystem in 5.6.
I finalized an install with CentOS 5.6 yesterday on a machine that will be our department fileserver. Ext4 seems to work fine on this raid-array.
In what way is ext4 not "fully baked" on CentOS 5.6?
IIRC, gparted won't be able to manipulate e.g. ext4 partitions if you don't have the appropriate ext4 fs-utils installed. I might be wrong though.
OTOH, gparted doesn't see my software raid array either. Gparted is rather practical for regular plain vanilla partitions, but for more advanced stuff and filesystems, fdisk is probably better.
My two oere.
Thank you everyone for the advice and great information. From what I am gathering XFS is the way to go.
A couple more questions. What partitioning utility is suggested? parted and fdisk do not seem to be doing the job.
Raid Level. I am considering moving away from the raid6 due to possible write performance issues. The array is 22 disks. I am not opposed to going with raid10 but I am looking for a good balance of performance/capacity.
Hardware or software raid. Is there an advantage either way on such a large array?
-----Original Message----- From: centos-bounces@centos.org [mailto:centos-bounces@centos.org] On Behalf Of Matthew Feinberg Sent: Wednesday, April 13, 2011 9:29 AM To: CentOS mailing list Subject: Re: [CentOS] 40TB File System Recommendations
Hardware or software raid. Is there an advantage either way on such a large array?
Doesn't that depend on what sort of backup solution you're planning, and the level of "criticalness" of the backups saved?
Some say that for more serious raid solutions, hardware is the way to go, while software raids are sort of a middle-road.
Me, I usually go with software raid. I've had one too many hardware raid failures where I haven't been able to restore the data contained. With software raid a restore has always worked fine for me, even with broken raids in Windows. While raid in Windows isn't overly performance-inclined, I've come to appreciate the software equivalent in Linux - both performance and stability are top-notch IMHO.
With today's CPU-performance and RAM available, software raids are not a problem to power.
On Wednesday, April 13, 2011 09:29:29 AM Matthew Feinberg wrote:
Thank you everyone for the advice and great information. From what I am gathering XFS is the way to go.
A couple more questions. What partitioning utility is suggested? parted and fdisk do not seem to be doing the job.
My suggestion is don't partition at all, use LVM.
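A minimal sketch of that approach on the whole device, with XFS on top (device, volume group, and mount point names are examples):

pvcreate /dev/sdb
vgcreate vg_backup /dev/sdb
lvcreate -l 100%FREE -n lv_backup vg_backup
mkfs.xfs /dev/vg_backup/lv_backup
mount -o noatime,nodiratime,logbufs=8 /dev/vg_backup/lv_backup /backup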
Raid Level. I am considering moving away from the raid6 due to possible write performance issues. The array is 22 disks. I am not opposed to going with raid10 but I am looking for a good balance of performance/capacity.
Then try both for your use case and your hardware. We have wide raid6 setups that do well over 500 MB/s write (that is: not all raid6 writes suck...).
/Peter
Hardware or software raid. Is there an advantage either way on such a large array?
On Thursday, April 14, 2011 10:47 PM, Peter Kjellström wrote:
On Wednesday, April 13, 2011 09:29:29 AM Matthew Feinberg wrote:
Thank you everyone for the advice and great information. From what I am gathering XFS is the way to go.
A couple more questions. What partitioning utility is suggested? parted and fdisk do not seem to be doing the job.
My suggestion is don't partition at all, use LVM.
Raid Level. I am considering moving away from the raid6 due to possible write performance issues. The array is 22 disks. I am not opposed to going with raid10 but I am looking for a good balance of performance/capacity.
Then try both for your use case and your hardware. We have wide raid6 setups that does well over 500 MB/s write (that is: not all raid6 writes suck...).
/me replaces all of Peter's cache with 64MB modules.
Let's try again.
On 04/14/2011 08:04 AM, Christopher Chan wrote:
Then try both for your use case and your hardware. We have wide raid6 setups that does well over 500 MB/s write (that is: not all raid6 writes suck...).
/me replaces all of Peter's cache with 64MB modules.
Let's try again.
If you are trying to imply that RAID6 can't go fast when the write size is larger than the cache, you are simply wrong. Even with just an 8 x RAID6, I've tested a system at sustained sequential (not burst) 156 Mbytes/s out and 387 Mbytes/s in, using 7200 rpm 1.5 TB drives. Bonnie++ results attached. Bonnie++ by default uses twice as much data as your available RAM to make sure you aren't just seeing cache. IOW: That machine only had 4GB of RAM and 256 MB of controller cache during the test but wrote and read 8 GB of data for the tests.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
xxxx             8G   248  99 155996  74  85600  42   961  99 386900  62 628.3  29
Latency             33323us     224ms    1105ms   19047us   77599us     113ms
Version  1.96       ------Sequential Create------ --------Random Create--------
xxxx                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 17395  56 +++++ +++ 23951  61 27125  84 +++++ +++ 32154  84
Latency               330us     993us     980us     344us      64us      80us
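For anyone who wants to repeat the test, the invocation was roughly of this form (directory, user, and size are examples; bonnie++ defaults to twice RAM if -s is omitted):

bonnie++ -d /mnt/raid6/test -u nobody -s 8g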
On Thursday, April 14, 2011 11:26 PM, Benjamin Franz wrote:
On 04/14/2011 08:04 AM, Christopher Chan wrote:
Then try both for your use case and your hardware. We have wide raid6 setups that does well over 500 MB/s write (that is: not all raid6 writes suck...).
/me replaces all of Peter's cache with 64MB modules.
Let's try again.
If you are trying to imply that RAID6 can't go fast when the write size is larger than the cache, you are simply wrong. Even with just an 8 x RAID6, I've tested a system at sustained sequential (not burst) 156 Mbytes/s out and 387 Mbytes/s in, using 7200 rpm 1.5 TB drives. Bonnie++ results attached. Bonnie++ by default uses twice as much data as your available RAM to make sure you aren't just seeing cache. IOW: That machine only had 4GB of RAM and 256 MB of controller cache during the test but wrote and read 8 GB of data for the tests.
Wanna try that again with 64MB of cache only and tell us whether there is a difference in performance?
There is a reason why 3ware 85xx cards were complete rubbish when used for raid5, and it led to the 95xx/96xx series.
On 04/14/2011 09:00 PM, Christopher Chan wrote:
Wanna try that again with 64MB of cache only and tell us whether there is a difference in performance?
There is a reason why 3ware 85xx cards were complete rubbish when used for raid5 and which led to the 95xx/96xx series. _
I don't happen to have any systems I can test with the 1.5TB drives without controller cache right now, but I have a system with some old 500GB drives (which are about half as fast as the 1.5TB drives in individual sustained I/O throughput) attached directly to onboard SATA ports in an 8 x RAID6 with *no* controller cache at all. The machine has 16GB of RAM and bonnie++ therefore used 32GB of data for the test.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pbox3        32160M   389  98  76709  22  91071  26  2209  95 264892  26 590.5  11
Latency             24190us    1244ms    1580ms   60411us   69901us   42586us
Version  1.96       ------Sequential Create------ --------Random Create--------
pbox3               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 10910  31 +++++ +++ +++++ +++ 29293  80 +++++ +++ +++++ +++
Latency               775us     610us     979us     740us     370us     380us
Given that the underlying drives are effectively something like half as fast as the drives in the other test, the results are quite comparable.
Cache doesn't make a lot of difference when you quickly write a lot more data than the cache can hold. The limiting factor becomes the slowest component - usually the drives themselves. Cache isn't magic performance pixie dust. It helps in certain use cases and is nearly irrelevant in others.
On Friday, April 15, 2011 07:24 PM, Benjamin Franz wrote:
On 04/14/2011 09:00 PM, Christopher Chan wrote:
Wanna try that again with 64MB of cache only and tell us whether there is a difference in performance?
There is a reason why 3ware 85xx cards were complete rubbish when used for raid5 and which led to the 95xx/96xx series. _
I don't happen to have any systems I can test with the 1.5TB drives without controller cache right now, but I have a system with some old 500GB drives (which are about half as fast as the 1.5TB drives in individual sustained I/O throughput) attached directly to onboard SATA ports in a 8 x RAID6 with *no* controller cache at all. The machine has 16GB of RAM and bonnie++ therefore used 32GB of data for the test.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pbox3        32160M   389  98  76709  22  91071  26  2209  95 264892  26 590.5  11
Latency             24190us    1244ms    1580ms   60411us   69901us   42586us
Version  1.96       ------Sequential Create------ --------Random Create--------
pbox3               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 10910  31 +++++ +++ +++++ +++ 29293  80 +++++ +++ +++++ +++
Latency               775us     610us     979us     740us     370us     380us
Given that the underlaying drives are effectively something like half as fast as the drives in the other test, the results are quite comparable.
Woohoo, next we will be seeing md raid6 also giving comparable results if that is the case. I am not the only person on this list who thinks cache is king for raid5/6 on hardware raid boards, and using hardware raid + BBU cache for better performance is one of the two reasons why we don't do md raid5/6.
Cache doesn't make a lot of difference when you quickly write a lot more data than the cache can hold. The limiting factor becomes the slowest component - usually the drives themselves. Cache isn't magic performance pixie dust. It helps in certain use cases and is nearly irrelevant in others.
Yeah, you are right - but the cache is primarily there to buffer writes for performance. Why else go through the expense of getting BBU cache? So what happens when you tweak bonnie a bit?
On Fri, Apr 15, 2011 at 3:05 PM, Christopher Chan < christopher.chan@bradbury.edu.hk> wrote:
On Friday, April 15, 2011 07:24 PM, Benjamin Franz wrote:
On 04/14/2011 09:00 PM, Christopher Chan wrote:
Wanna try that again with 64MB of cache only and tell us whether there is a difference in performance?
There is a reason why 3ware 85xx cards were complete rubbish when used for raid5 and which led to the 95xx/96xx series. _
I don't happen to have any systems I can test with the 1.5TB drives without controller cache right now, but I have a system with some old 500GB drives (which are about half as fast as the 1.5TB drives in individual sustained I/O throughput) attached directly to onboard SATA ports in a 8 x RAID6 with *no* controller cache at all. The machine has 16GB of RAM and bonnie++ therefore used 32GB of data for the test.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
pbox3        32160M   389  98  76709  22  91071  26  2209  95 264892  26 590.5  11
Latency             24190us    1244ms    1580ms   60411us   69901us   42586us
Version  1.96       ------Sequential Create------ --------Random Create--------
pbox3               -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 10910  31 +++++ +++ +++++ +++ 29293  80 +++++ +++ +++++ +++
Latency               775us     610us     979us     740us     370us     380us
Given that the underlaying drives are effectively something like half as fast as the drives in the other test, the results are quite comparable.
Woohoo, next we will be seeing md raid6 also giving comparable results if that is the case. I am not the only person on this list that thinks cache is king for raid5/6 on hardware raid boards and the using hardware raid + bbu cache for better performance one of the two reasons why we don't do md raid5/6.
Cache doesn't make a lot of difference when you quickly write a lot more data than the cache can hold. The limiting factor becomes the slowest component - usually the drives themselves. Cache isn't magic performance pixie dust. It helps in certain use cases and is nearly irrelevant in others.
Yeah, you are right - but cache is primarily to buffer the writes for performance. Why else go through the expense of getting bbu cache? So what happens when you tweak bonnie a bit? _______________________________________________
As a matter of interest, does anyone know how to use an SSD drive for cache purposes with Linux software RAID? ZFS has this feature and it makes a helluva difference to a storage server's performance.
On Apr 15, 2011, at 9:17 AM, Rudi Ahlers Rudi@SoftDux.com wrote:
On Fri, Apr 15, 2011 at 3:05 PM, Christopher Chan christopher.chan@bradbury.edu.hk wrote:
On Friday, April 15, 2011 07:24 PM, Benjamin Franz wrote:
On 04/14/2011 09:00 PM, Christopher Chan wrote:
Wanna try that again with 64MB of cache only and tell us whether there is a difference in performance?
There is a reason why 3ware 85xx cards were complete rubbish when used for raid5 and which led to the 95xx/96xx series. _
I don't happen to have any systems I can test with the 1.5TB drives without controller cache right now, but I have a system with some old 500GB drives (which are about half as fast as the 1.5TB drives in individual sustained I/O throughput) attached directly to onboard SATA ports in a 8 x RAID6 with *no* controller cache at all. The machine has 16GB of RAM and bonnie++ therefore used 32GB of data for the test.
Version 1.96 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP pbox3 32160M 389 98 76709 22 91071 26 2209 95 264892 26 590.5 11 Latency 24190us 1244ms 1580ms 60411us 69901us 42586us Version 1.96 ------Sequential Create------ --------Random Create-------- pbox3 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 10910 31 +++++ +++ +++++ +++ 29293 80 +++++ +++ +++++ +++ Latency 775us 610us 979us 740us 370us 380us
Given that the underlaying drives are effectively something like half as fast as the drives in the other test, the results are quite comparable.
Woohoo, next we will be seeing md raid6 also giving comparable results if that is the case. I am not the only person on this list that thinks cache is king for raid5/6 on hardware raid boards and the using hardware raid + bbu cache for better performance one of the two reasons why we don't do md raid5/6.
Cache doesn't make a lot of difference when you quickly write a lot more data than the cache can hold. The limiting factor becomes the slowest component - usually the drives themselves. Cache isn't magic performance pixie dust. It helps in certain use cases and is nearly irrelevant in others.
Yeah, you are right - but cache is primarily to buffer the writes for performance. Why else go through the expense of getting bbu cache? So what happens when you tweak bonnie a bit? _______________________________________________
As matter of interest, does anyone know how to use an SSD drive for cach purposes on Linux software RAID drives? ZFS has this feature and it makes a helluva difference to a storage server's performance.
Put the file system's log device on it.
-Ross
On Fri, Apr 15, 2011 at 6:26 PM, Ross Walker rswwalker@gmail.com wrote:
On Apr 15, 2011, at 9:17 AM, Rudi Ahlers Rudi@SoftDux.com wrote:
On Fri, Apr 15, 2011 at 3:05 PM, Christopher Chan <christopher.chan@bradbury.edu.hk> wrote:
On Friday, April 15, 2011 07:24 PM, Benjamin Franz wrote:
On 04/14/2011 09:00 PM, Christopher Chan wrote:
Wanna try that again with 64MB of cache only and tell us whether there is a difference in performance?
There is a reason why 3ware 85xx cards were complete rubbish when used for raid5 and which led to the 95xx/96xx series. _
I don't happen to have any systems I can test with the 1.5TB drives without controller cache right now, but I have a system with some old 500GB drives (which are about half as fast as the 1.5TB drives in individual sustained I/O throughput) attached directly to onboard SATA ports in a 8 x RAID6 with *no* controller cache at all. The machine has 16GB of RAM and bonnie++ therefore used 32GB of data for the test.
Version 1.96 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP pbox3 32160M 389 98 76709 22 91071 26 2209 95 264892 26 590.5 11 Latency 24190us 1244ms 1580ms 60411us 69901us 42586us Version 1.96 ------Sequential Create------ --------Random Create-------- pbox3 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 16 10910 31 +++++ +++ +++++ +++ 29293 80 +++++ +++ +++++ +++ Latency 775us 610us 979us 740us 370us 380us
Given that the underlaying drives are effectively something like half as fast as the drives in the other test, the results are quite comparable.
Woohoo, next we will be seeing md raid6 also giving comparable results if that is the case. I am not the only person on this list that thinks cache is king for raid5/6 on hardware raid boards and the using hardware raid + bbu cache for better performance one of the two reasons why we don't do md raid5/6.
Cache doesn't make a lot of difference when you quickly write a lot more data than the cache can hold. The limiting factor becomes the slowest component - usually the drives themselves. Cache isn't magic performance pixie dust. It helps in certain use cases and is nearly irrelevant in others.
Yeah, you are right - but cache is primarily to buffer the writes for performance. Why else go through the expense of getting bbu cache? So what happens when you tweak bonnie a bit? _______________________________________________
As matter of interest, does anyone know how to use an SSD drive for cach purposes on Linux software RAID drives? ZFS has this feature and it makes a helluva difference to a storage server's performance.
Put the file system's log device on it.
-Ross
Well, ZFS has a separate ZIL for that purpose, and the ZIL adds extra protection / redundancy to the whole pool.
But the Cache / L2ARC drive caches all common reads & writes (simply put) onto SSD to improve overall system performance.
So I was wondering if one could do this with mdraid or even just EXT3 / EXT4?
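As far as I know there was no general-purpose SSD block cache in the stock CentOS 5 kernel, but you can at least move the filesystem journal/log onto the SSD; a sketch (device names are examples):

# XFS with an external log device
mkfs.xfs -l logdev=/dev/sdc1,size=128m /dev/md0
mount -o logdev=/dev/sdc1 /dev/md0 /data
# ext3/ext4 with an external journal
mke2fs -O journal_dev /dev/sdc2
mkfs.ext4 -J device=/dev/sdc2 /dev/md1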
On 04/15/2011 06:05 AM, Christopher Chan wrote:
Woohoo, next we will be seeing md raid6 also giving comparable results if that is the case. I am not the only person on this list who thinks cache is king for raid5/6 on hardware raid boards, and using hardware raid + BBU cache for better performance is one of the two reasons why we don't do md raid5/6.
That *is* md RAID6. Sorry I didn't make that clear. I don't use anyone's hardware RAID6 right now because I haven't found a board so far that was as fast as using md. I get better performance from even a BBU backed 95X series 3ware board by using it to serve the drives as JBOD and then using md to do the actual raid.
Yeah, you are right - but cache is primarily to buffer the writes for performance. Why else go through the expense of getting bbu cache? So what happens when you tweak bonnie a bit?
For smaller writes. When writes *do* fit in the cache you get a big bump. As I said: it helps some cases, not all cases. BBU-backed cache helps if you have lots of small writes. Not so much if you are writing gigabytes of stuff more sequentially.