I've been asked for ideas on building a rather large archival storage system for in-house use, on the order of 100-400TB, probably using CentOS 6. The existing system this would replace runs Solaris 10 and ZFS, but I want to explore using Linux instead.
We have our own Tomcat-based archiving software that would run on this storage server, along with an NFS client and server. It's a write-once, read-almost-never kind of application, storing compressed batches of archive files for a year or two. 400TB written over 2 years translates to about 200TB/year, or about 7MB/second average write speed. The rare, occasional read accesses are done in batches: a client makes a web service call to get a specific set of files, which are then pushed as a batch to staging storage where the user can browse them; this can take minutes without any problem.
My general idea is a 2U server with 1-4 SAS cards connected to strings of about 48 SATA disks each (4 x 12 or 3 x 16), all configured as JBOD, so there would potentially be 48, 96, or 192 drives on this one server. I'm thinking they should be laid out as 4, 8, or 16 separate RAID6 sets of 10 disks each, then use LVM to combine those into a larger volume. About 10% of the disks would be reserved as global hot spares.
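A minimal sketch of one such building block under those assumptions, using Linux md RAID plus LVM (device names, and the choice of md over hardware RAID, are placeholders; the post doesn't pin that down):

    # One 10-disk RAID6 set (8 data + 2 parity); /dev/sd[b-k] are placeholder names
    mdadm --create /dev/md0 --level=6 --raid-devices=10 /dev/sd[b-k]

    # Repeat per set, then aggregate the md devices with LVM; by default LVM
    # concatenates (linear allocation) rather than stripes across PVs
    pvcreate /dev/md0 /dev/md1
    vgcreate archive_vg /dev/md0 /dev/md1
    lvcreate -l 100%FREE -n archive_lv archive_vg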
So, my questions...
A) Can CentOS 6 handle that many JBOD disks in one system? Is my upper size too big, such that I should plan for 2 or more servers? What happens with the device names when you've gone past /dev/sdz?
B) What is the status of large file system support in CentOS 6? I know XFS is frequently mentioned with such systems, but I/we have zero experience with it; it's never been natively supported in EL through 5, anyway.
C) Is GFS suitable for this, or is it strictly for clustered storage systems?
D) anything important I've neglected?
-- john r pierce N 37, W 122 santa cruz ca mid-left coast
A) Can CentOS 6 handle that many JBOD disks in one system? Is my upper size too big, such that I should plan for 2 or more servers? What happens with the device names when you've gone past /dev/sdz?
The device names double up: /dev/sdaa, /dev/sdab, and so on.
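One practical aside (my addition, not from the thread): with dozens of drives, the letter-based names can shuffle across reboots, so the persistent links under /dev/disk/ are usually safer to script against:

    # Stable identifiers instead of /dev/sdaa-style names
    ls -l /dev/disk/by-id/
    ls -l /dev/disk/by-path/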
B) What is the status of large file system support in CentOS 6? I know XFS is frequently mentioned with such systems, but I/we have zero experience with it; it's never been natively supported in EL through 5, anyway.
I have used XFS with great success on 5.x, though I have never scaled it that large.
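For what it's worth, a hedged sketch of laying XFS onto one of those RAID6 sets (the 256KiB chunk size, 8 data disks, and volume path are assumptions for illustration; align them to the real array geometry):

    # su = RAID chunk size, sw = number of data disks in the RAID6 set
    mkfs.xfs -d su=256k,sw=8 /dev/archive_vg/archive_lv

    # inode64 lets XFS spread inodes across the whole device, which matters
    # on multi-terabyte filesystems; noatime suits write-once archive data
    mount -o inode64,noatime /dev/archive_vg/archive_lv /srv/archive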
C) Is GFS suitable for this, or is it strictly for clustered storage systems?
Unless you plan on mounting the filesystem from more than one server at once, I most certainly would not add that layer of complexity; it's also slower.
D) anything important I've neglected?
If I understand correctly, it's not your only backup system, so I don't think it's that critical, but the rebuild time on each array, versus the degraded I/O capacity and its impact on serving content, would be something interesting to look at.
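A rough back-of-the-envelope on that rebuild concern (the 100MB/s sustained rate is an assumption, not a figure from the thread): a 3TB member takes at least several hours to resync even with no competing I/O, and typically much longer under load.

    # 3TB drive at an assumed 100MB/s sustained rebuild rate
    echo $((3000000 / 100)) seconds          # 30000 s
    echo $((3000000 / 100 / 3600)) hours     # ~8 hours best case; often far longer under load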
Do you plan on making hot spares available? That many disks will likely have a higher rate of failure...
What kind of disks and controllers do you intend on using?
jlc
On 07/14/11 2:32 AM, Joseph L. Casale wrote:
If I understand correctly, it's not your only backup system, so I don't think it's that critical, but the rebuild time on each array, versus the degraded I/O capacity and its impact on serving content, would be something interesting to look at.
Do you plan on making hot spares available? That many disks will likely have a higher rate of failure...
I'm planning on about 5-10% hot spares in the array. For example, my suggested layout for 192 disks is...
18 x 10-drive RAID6, with 12 hot spares. The 18 separate RAID sets would be merged into larger volume groups, probably as concatenated volume sets and NOT stripes, so the idle disks could spin down, since most of this data, once written, will rarely if ever be looked at again. With 3TB disks, this gives 18 x 8 x 3TB == 432TB total usable capacity (really more like 400TB when you figure binary vs. decimal, etc.).
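A quick check of that capacity arithmetic (illustrative shell only): 18 sets x 8 data disks x 3TB is 432TB decimal, which works out to roughly 393TiB in binary units, i.e. "more like 400" as stated.

    # Usable capacity: 18 sets * 8 data disks per set * 3 TB per disk
    echo $((18 * 8 * 3)) TB                      # 432 TB (decimal)
    # Decimal TB to binary TiB: 432 * 10^12 / 2^40
    echo "scale=1; 432 * 10^12 / 2^40" | bc      # ~392.9 TiB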
What kind of disks and controllers do you intend on using?
Seagate Constellation ES.2 3TB SAS drives, in (one of a couple major label vendors) SAS chassis, with (major vendors' rebranded) LSI SAS HBAs configured for JBOD. Using SAS disks rather than SATA to gain the path redundancy from dual-porting.
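To actually benefit from that dual-porting on Linux, each drive appears twice and wants dm-multipath on top; a minimal sketch for EL6 (package and command names as I recall them, worth verifying against the release notes):

    # Install and enable device-mapper-multipath with a default configuration
    yum install device-mapper-multipath
    mpathconf --enable --with_multipathd y

    # Each dual-ported SAS disk should then show up once as /dev/mapper/mpathN
    multipath -ll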
On Wed, Jul 13, 2011 at 11:32:14PM -0700, John R Pierce wrote:
D) anything important I've neglected?
Remember Solaris ZFS does checksumming for all data, so with weekly/monthly ZFS scrubbing it can detect silent data/disk corruption automatically and fix it. With a lot of data, that might get pretty important..
-- Pasi
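Linux md can't detect or repair silent corruption the way ZFS checksums do, but it can at least scrub arrays for parity and read errors; a hedged sketch (the packaged weekly raid-check job is my recollection of the EL6 mdadm package, so verify it):

    # Kick off a scrub of one md array by hand
    echo check > /sys/block/md0/md/sync_action

    # A non-zero mismatch count after the check deserves investigation
    cat /sys/block/md0/md/mismatch_cnt

    # EL6 ships a raid-check cron job (mdadm package) to schedule this weekly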
On Thu, Jul 14, 2011 at 04:53:11PM +0300, Pasi Kärkkäinen wrote:
Remember Solaris ZFS does checksumming for all data, so with weekly/monthly ZFS scrubbing it can detect silent data/disk corruption automatically and fix it. With a lot of data, that might get pretty important..
+1
What is the reason to avoid ZFS? IMHO for such systems ZFS is the best.
Regards Przemyslaw Bak (przemol) -- http://przemol.blogspot.com/
On Jul 14, 2011, at 8:02 PM, John R Pierce pierce@hogranch.com wrote:
On 07/14/11 7:39 AM, przemolicc@poczta.fm wrote:
What is the reason to avoid ZFS? IMHO for such systems ZFS is the best.
Oracle, mostly.
How about Nexenta then? Their product is solid, their prices are reasonable, and I think they're in good financial health.
-Ross
True. For your kind of usage, I too think (and recommend) you should stick with ZFS.
On Thu, Jul 14, 2011 at 7:23 PM, Pasi Kärkkäinen pasik@iki.fi wrote:
Remember Solaris ZFS does checksumming for all data, so with weekly/monthly ZFS scrubbing it can detect silent data/disk corruption automatically and fix it. With a lot of data, that might get pretty important..
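For comparison with the md/LVM sketch earlier, the ZFS equivalent of those 10-disk RAID6 sets would be raidz2 vdevs in one pool with shared spares; a minimal sketch (device names are Solaris-style placeholders):

    # One pool built from 10-disk raidz2 vdevs, plus shared hot spares
    zpool create archive \
        raidz2 c0t0d0 c0t1d0 c0t2d0 c0t3d0 c0t4d0 \
               c0t5d0 c0t6d0 c0t7d0 c0t8d0 c0t9d0 \
        spare c0t10d0 c0t11d0

    # Periodic scrubs use the block checksums to find and repair silent corruption
    zpool scrub archive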
On Thu, Jul 14, 2011 at 04:53:11PM +0300, Pasi Kärkkäinen wrote:
Remember Solaris ZFS does checksumming for all data, so with weekly/monthly ZFS scrubbing it can detect silent data/disk corruption automatically and fix it. With a lot of data, that might get pretty important..
Oh, and one more thing.. if you're going to use that many JBODs, pay attention to SES chassis management chips/drivers and software, so that you get the error/fault LEDs working on disk failure!
-- Pasi
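For what it's worth, a hedged sketch of how that usually looks on Linux with SES-capable enclosures (tool versions and slot indexing vary by chassis, so treat these as illustrative and double-check the mapping before pulling a drive):

    # List an enclosure's element status (sg3_utils)
    sg_ses /dev/sg24

    # Set the fault/locate LED for a given slot index
    sg_ses --index=7 --set=fault /dev/sg24

    # ledmon/ledctl offers a per-block-device alternative
    ledctl locate=/dev/sdab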
On Jul 14, 2011, at 12:56 PM, Pasi Kärkkäinen wrote:
Oh, and one more thing.. if you're going to use that many JBODs, pay attention to SES chassis management chips/drivers and software, so that you get the error/fault LEDs working on disk failure!
-- Pasi
And make sure the assembler wires it all up correctly. I have a JBOD box, 16 drives in a Supermicro chassis, where the drives are numbered left to right but the error lights assume top to bottom.
The first time we had a drive fail, I opened the RAID management software, clicked "Blink Light" on the failed drive, pulled the unit that was flashing, and toasted the array. (Of course, NOW it's RAID6 with hot spare so that won't happen anymore..)
-- Don Krause "This message represents the official view of the voices in my head."
On Fri, Jul 15, 2011 at 12:35 AM, Don Krause dkrause@optivus.com wrote:
And make sure the assembler wires it all up correctly. I have a JBOD box, 16 drives in a Supermicro chassis, where the drives are numbered left to right but the error lights assume top to bottom.
The first time we had a drive fail, I opened the RAID management software, clicked "Blink Light" on the failed drive, pulled the unit that was flashing, and toasted the array. (Of course, NOW it's RAID6 with hot spare so that won't happen anymore..)
-- Don Krause "This message represents the official view of the voices in my head."
Which is why nobody should use RAID5 for anything other than test purposes :)
On 7/14/2011 1:32 AM, John R Pierce wrote:
If it doesn't have to look exactly like a file system, you might like Luwak, a layer over the Riak NoSQL distributed database for handling large files (http://wiki.basho.com/Luwak.html). The underlying storage is distributed across any number of nodes, with a scheme that lets you add more as needed and keeps redundant copies to handle node failures. A downside of Luwak for most purposes is that because it chunks the data and re-uses duplicates, you can't remove anything, but for archive purposes it might work well.
For something that looks more like a filesystem, but is also distributed and redundant: http://www.moosefs.org/.
Two thoughts:
1. Others have already inquired as to your motivation to move away from ZFS/Solaris. If it is just the hardware & licensing aspect, you might want to consider ZFS on FreeBSD. (I understand that unlike the Linux ZFS implementation, the FreeBSD one is in-kernel.)
2. If you really want to move away from ZFS, one possibility is to use glusterfs, which is a lockless distributed filesystem. With the glusterfs architecture you scale out horizontally over time; instead of buying a single server with massive capacity, you buy smaller servers and add more as your space requirements exceed your current capacity. You also decide over how many nodes you want your data to be mirrored. Think of it as a RAID0/RAID1/RAID10 solution spread over machines rather than just disks. It uses FUSE over native filesystems, so if you decide to back it out, you turn off glusterfs and still have your data on the native filesystem.
From the client perspective the server cluster looks like a single logical entity, either over NFS or the native client software. (The native client is configured with info on all the server nodes; the NFS client depends on round-robin DNS to connect to *some* node of the cluster.)
Caveat: I've only used glusterfs in one small deployment in a mirrored-between-two-nodes configuration. Glusterfs doesn't have as many miles on it as ZFS or the other more common filesystems. I've not run into any serious hiccoughs, but put in a test cluster first and try it out. Commodity hardware is just fine for such a test cluster.
Devin
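A minimal sketch of the two-node mirrored setup Devin describes, using the gluster CLI (hostnames, brick paths, and the volume name are placeholders; the syntax is the 3.x-era CLI as I recall it):

    # Define a volume replicated across two servers' bricks, then start it
    gluster volume create archive replica 2 transport tcp \
        node1:/export/brick1 node2:/export/brick1
    gluster volume start archive

    # Clients mount via the native FUSE client (NFS also works)
    mount -t glusterfs node1:/archive /mnt/archive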
On Thursday, July 14, 2011 11:20 PM, Devin Reade wrote:
Two thoughts:
- Others have already inquired as to your motivation to move away from ZFS/Solaris. If it is just the hardware & licensing aspect, you might want to consider ZFS on FreeBSD. (I understand that unlike the Linux ZFS implementation, the FreeBSD one is in-kernel.)
I would not touch ZFS on FreeBSD with a ten-foot pole.
I don't see a problem with using Nexenta/OpenIndiana, but then I only have twelve disks in my setup currently.
--On Friday, July 15, 2011 10:54:35 PM +0800 Christopher Chan christopher.chan@bradbury.edu.hk wrote:
I would not touch ZFS on FreeBSD with a ten-foot pole.
Would you care to elaborate as to why? And specifically if it is particular to FreeBSD or ZFS or the combination.
I've not used it so I do not have any opinions on it.
Devin
On Saturday, July 16, 2011 04:24 AM, Devin Reade wrote:
--On Friday, July 15, 2011 10:54:35 PM +0800 Christopher Chan christopher.chan@bradbury.edu.hk wrote:
I would not touch ZFS on FreeBSD with a ten-foot pole.
Would you care to elaborate as to why? And specifically if it is particular to FreeBSD or ZFS or the combination.
This is particular to the ZFS implementation on FreeBSD. It is not stable; please check its mailing list. However, I have not bothered checking on things within the last year or so, after I went with OpenIndiana.
On 7/15/2011 6:37 PM, Christopher Chan wrote:
On Saturday, July 16, 2011 04:24 AM, Devin Reade wrote:
--On Friday, July 15, 2011 10:54:35 PM +0800 Christopher Chan christopher.chan@bradbury.edu.hk wrote:
I would not touch ZFS on FreeBSD with a ten-foot pole.
Would you care to elaborate as to why? And specifically if it is particular to FreeBSD or ZFS or the combination.
This is particular to the ZFS implementation on FreeBSD. It is not stable; please check its mailing list. However, I have not bothered checking on things within the last year or so, after I went with OpenIndiana.
I've been thinking about trying FreeNAS 8 as a file/backup server to get ZFS. That's based on FreeBSD 8, which is probably newer than your experience, but maybe they haven't solved all the problems yet.
----- Original Message -----
| I've been thinking about trying FreeNAS 8 as a file/backup server to get ZFS.
| That's based on FreeBSD 8, which is probably newer than your experience, but
| maybe they haven't solved all the problems yet.
While we're talking non-CentOS options, there's also DragonFly BSD with HAMMER FS.
On Wed, Jul 13, 2011 at 11:32:14PM -0700, John R Pierce spake thusly:
If you don't need a POSIX filesystem interface, check out MogileFS. It could greatly simplify a lot of these scalability issues.