Hi,
Over time, the requirements and possibilities regarding filesystems have changed for our users.
Currently I'm faced with the question:
What might be a good way to provide one big filesystem for a few users which could also be enlarged later? Backing up the data is not the question here.
"Big" in that context is maybe up to a couple of hundred TB.
O.K., I could install one hardware RAID with, e.g., N big drives, format it with XFS, and export one big share. Done.
On the other hand, using e.g. 60 4 TB disks in one storage box would be a lot of space, but a nightmare to rebuild after a disk crash.
Now if the share fills up, my users "complain" that they usually just get a new share (which means a new RAID box).
From my POV I could e.g. use hardware RAID boxes with LVM and filesystem growth options to extend the final share, but what if one of the boxes crashes totally? The whole filesystem would be gone.
hm.
So how do you handle big filesystems/storages/shares?
Regards, Götz
I'd highly recommend getting a NetApp storage device for something that big.
It's more expensive up front, but the amount of heartache/time saved in the long run is WELL worth it.
On Fri, Feb 28, 2014 at 8:15 AM, Götz Reinicke - IT Koordinator <goetz.reinicke@filmakademie.de> wrote:
<snip>
On Fri, Feb 28, 2014 at 8:55 AM, Phelps, Matt mphelps@cfa.harvard.edu wrote:
I'd highly recommend getting a NetApp storage device for something that big.
It's more expensive up front, but the amount of heartache/time saved in the long run is WELL worth it.
My vote would be for a ZFS-based storage solution, be it homegrown or appliance (like Nexenta). Remember, as far as ZFS (and similar filesystems whose acronyms are more than 3 letters) is concerned, a petabyte is still small fry.
<snip>
On 02/28/2014 06:30 AM, Mauricio Tavares wrote:
My vote would be for a ZFS-based storage solution, be it homegrown or appliance (like Nexenta). Remember, as far as ZFS (and similar filesystems whose acronyms are more than 3 letters) is concerned, a petabyte is still small fry.
Ditto on ZFS! I've been experimenting with it for about 5-6 months and it really is the way to go for any filesystem greater than about 10 GB IMHO. We're in the process of transitioning several of our big data pools to ZFS because it's so obviously better.
Just remember that ZFS isn't casual! You have to take the time to understand what it is and how it works, because if you make the wrong mistake, it's curtains for your data. ZFS has a few maddening limitations** that you have to plan for. But it is far and away the leader in copy-on-write, large-scale file systems, and once you know how to plan for it, ZFS's capabilities are jaw-dropping. Here are a few off the top of my head:
1) Check for and fix filesystem errors without ever taking it offline.
2) Replace failed HDDs from a raidz pool without ever taking it offline.
3) Works best with inexpensive JBOD drives - it's actually recommended not to use expensive HW RAID devices.
4) Native, built-in compression: double your usable disk space for free.
5) Extend (grow) your zfs pool without ever taking it offline.
6) Create a snapshot in seconds that you can keep or expire at any time. (Snapshots are read-only, and take no disk space initially.)
7) Send a snapshot (entire filesystem) to another server. Binary-perfect copies in a single command, much faster than rsync when you have a large data set.
8) Ability to make a clone - a writable copy of a snapshot in seconds. Snapshots can in turn be created of a clone. A clone initially uses no disk space, and as you use it, it only uses the disk space of the changes between the current state of the clone and the snapshot it's derived from.
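For anyone who hasn't played with ZFS yet, here is roughly what several of the items above look like on the command line. This is only a sketch from memory; the pool, dataset, disk, and host names are all made up, so check the man pages before running anything:

  # create a raidz2 pool from six whole disks, plus a dataset on it
  zpool create tank raidz2 sdb sdc sdd sde sdf sdg
  zfs create tank/data

  # 4) native compression on the whole pool
  zfs set compression=lz4 tank

  # 1) scrub: check and repair the pool while it stays online
  zpool scrub tank
  zpool status tank

  # 2) replace a failed disk without downtime
  zpool replace tank sdd sdh

  # 6) snapshot in seconds, and 8) a writable clone of that snapshot
  zfs snapshot tank/data@2014-02-28
  zfs clone tank/data@2014-02-28 tank/data-clone

  # 7) send the snapshot to another box (assumes a pool "backup" exists there)
  zfs send tank/data@2014-02-28 | ssh backuphost zfs receive backup/data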
** Limitations? ZFS? Say it isn't so! But here they are:
1) You can't add redundancy after creating a vdev in a ZFS pool. So if you make a ZFS vdev and don't make it raidz at the start, you can't add more drives later to turn it into raidz. You also can't "add" redundancy to an existing raidz vdev: once you've made it raidz1, you can't add a drive to get raidz2. I've found a workaround, where you create "fake" drives backed by sparse files, add them to your raidz pool upon creation, and immediately take them offline. But you have to do this at initial creation! http://jeanbruenn.info/2011/01/18/setting-up-zfs-with-3-out-of-4-discs-raidz...
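The workaround looks roughly like this (sketch only; the size and device names are examples, and the linked post has the real details):

  # create a sparse file the same size as the real disks
  truncate -s 4T /var/tmp/fake0

  # build the raidz2 vdev with the fake member included
  zpool create tank raidz2 sdb sdc sdd sde /var/tmp/fake0

  # immediately take the fake member offline; the vdev runs degraded
  zpool offline tank /var/tmp/fake0

  # later, replace it with a real disk to get full redundancy
  zpool replace tank /var/tmp/fake0 sdf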
2) Zpools are built from vdevs, which you can think of as a block device made from 1 or more HDs. You can add vdevs without issue, but you can't remove them. EVER. Combine this fact with #1 and you had better be planning carefully when you extend a file system. See the "Hating your data" section in this excellent ZFS walkthrough: http://arstechnica.com/information-technology/2014/02/ars-walkthrough-using-...
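Growing the pool is then just adding another vdev, and remember that the new vdev is part of the pool forever (names below are examples):

  # add a second raidz2 vdev to grow the pool; this cannot be undone
  zpool add tank raidz2 sdh sdi sdj sdk sdl sdm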
3) Like any COW file system, ZFS tends to fragment. This cuts into performance, especially when you have less than about 20-30% free space. It isn't as bad as it sounds, though, since you can enable compression to roughly double your usable space.
Bug) ZFS on Linux has been quite stable in my testing, but as of this writing it has a memory leak. The workaround is manageable, but if you don't apply it your ZFS servers will eventually lock up. The workaround is fairly simple; google for "zfs /bin/echo 3 > /proc/sys/vm/drop_caches;"
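In practice the workaround just means dropping the page cache periodically, e.g. from cron. This is my own sketch of it, not an official fix, and the interval is a guess:

  # /etc/cron.d/zfs-drop-caches (example): flush caches once an hour
  0 * * * *  root  /bin/sync && /bin/echo 3 > /proc/sys/vm/drop_caches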
Götz Reinicke - IT Koordinator goetz.reinicke@filmakademie.de wrote:
Hi,
<snip>
If you are going to have an ever-growing volume, I'd suggest a distributed FS, like GlusterFS, MooseFS, Lustre… Need more space? Add a box.
We have been happily using MooseFS at work for a couple of years (we began with a couple of TB, now up to 250).
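For what it's worth, the client side is just a FUSE mount; roughly something like this (the master host name here is just an example, and the chunkserver setup is in the MooseFS docs):

  # mount the MooseFS namespace from the master server
  mfsmount /mnt/mfs -H mfsmaster.example.com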
HTH, Laurent.
On 02/28/2014 08:15 AM, Götz Reinicke - IT Koordinator wrote:
... "Big" in that context is maybe up to a couple of hundred TB.
... From my POV I could e.g. use hardware RAID boxes with LVM and filesystem growth options to extend the final share, but what if one of the boxes crashes totally? The whole filesystem would be gone.
... So how do you handle big filesystems/storages/shares?
We handle it with EMC Clariion fibre channel units and LVM. Always make your PVs relatively small, use RAID6 on the array, and mirror it, either at the SAN level with MirrorView or at the OS level, using LUNs as mdraid components for the PVs. Today that would become a set of VNX systems with SAS on the back end and iSCSI on the front end, but, well, a SAN is a SAN. Surplus FC HBAs are cheap; licenses won't be, nor will the array. But how valuable is that data? Or, more to the point, if you were to lose all of that 100TB, what would it cost you to recreate it?
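For what it's worth, the mechanics of growing the share that way are plain LVM. A sketch (the VG/LV names, mount point, and LUN device names here are made up):

  # mirror two SAN LUNs with mdraid, then use the mirror as a PV
  mdadm --create /dev/md10 --level=1 --raid-devices=2 /dev/mapper/lunA /dev/mapper/lunB
  pvcreate /dev/md10
  vgextend bigvg /dev/md10

  # grow the LV and the XFS filesystem on it, online
  lvextend -l +100%FREE /dev/bigvg/share
  xfs_growfs /srv/share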
With this size of data, rolling-your-own should be the last resort, and only if you can't afford something properly engineered for high availability, like basically anything from NetApp, Nimble, or EMC (among others; those are the first three off the top of my head). The value-add with these three (among others) is the long track record of reliability and the software management tools that make life so much easier when a drive or other component inevitably fails.
An enterprise-grade SAN or NAS from a serious vendor is going to cost serious money, but you do get what you pay for, again primarily on the software side. Our four Clariions (two CX3-10c's, one CX3-80, and a CX4-480) just simply don't go down, and upgrades are very easy and reliable, in my experience. The two CX3-10c's have been online continually since mid-2007, and while they are way out of warranty, past the normal service life, even, they just run and run and run and run.

(I even used the redundancy features in one of them to good effect while (slowly and carefully!) moving the array from one room to another... long extension power cords and long fiber jumpers worked to my advantage; of course, a stable rack on wheels made it possible. The array stayed up, and no servers lost connectivity to the SAN during the move, not that I would recommend it for normal operations, but this wasn't a normal operation.)

The storage processor sends alerts when drives fault, and a drive fault is an easy hotswap with the DAE and the drive clearly identified at the front of the array. Everything (drives, power supplies, storage processor modules, LCC's) except a whole DAE or storage processor enclosure is hotswap, and I haven't had a DAE fault yet that required pulling the whole DAE out of service.
If you do roll your own, do not use consumer-class drives. One reason NetApp and the rest charge so much for drives is the extra testing and sometimes the custom firmware that goes into the drives (in a nutshell, you do NOT want the drive doing its own error recovery; that's the array storage processor's job!).
Those are my opinions and experience. YMMV.
On Fri, Feb 28, 2014 at 7:15 AM, Götz Reinicke - IT Koordinator wrote:
What might be a good way to provide one big filesystem for a few users which could also be enlarged later? Backing up the data is not the question here.
Really? Does that mean you already have a backup, that you don't care if you lose the data, or that you want something with redundancy built in (which still isn't quite the same as having a backup)?
So how do you handle big filesystems/storages/shares?
The easy/expensive way is to buy an appliance like a NetApp and mount it via NFS, and use its own tools for snapshots, RAID management, etc. I don't have any experience with it, but from what I've read I would say that Ceph is the up-and-coming way of doing your own distributed/redundant storage, although I'm not sure I'd trust the pieces that turn it into a filesystem yet.
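The client side of the appliance route really is that simple; an /etc/fstab line along these lines (the filer name and export path are made up):

  # mount the filer's export over NFS
  filer01:/vol/bigshare  /srv/bigshare  nfs  defaults,_netdev  0 0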
----- Original Message -----
<snip>
My personal view is that you don't want any single machine to contain a 100TB file system. You'd be best served using a distributed file system such as GlusterFS or Lustre. If you insist on having a single machine with a 100TB file system on it, make sure that you install at least 300GB of memory or more if you think you'll ever have to perform a file system check on it. You're going to need it.
Note, it's not that difficult or expensive to build a Supermicro box with 48 x 4 TB drives to scale out to the size that you need with GlusterFS; however, building it is the easiest part. The hard part is maintaining it and troubleshooting it when things go wrong. Choosing a platform to support also depends on I/O access patterns, number of clients, and connectivity (IB vs. Ethernet vs. iSCSI/FC/AoE, etc.).
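For reference, a minimal GlusterFS scale-out looks roughly like this; the host names and brick paths are examples, not a tuned production layout:

  # on one node: pool the servers and create a replicated volume
  gluster peer probe storage2
  gluster volume create bigvol replica 2 \
      storage1:/bricks/brick1 storage2:/bricks/brick1
  gluster volume start bigvol

  # on a client: mount it via the native FUSE client
  mount -t glusterfs storage1:/bigvol /mnt/bigvol

  # need more space later? add another replicated brick pair
  gluster volume add-brick bigvol storage3:/bricks/brick1 storage4:/bricks/brick1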
Currently we're not using any clustered file system for our data access. We have a single NFS machine which is the "front-end" to the data. It contains a whole bunch of symlinks to other NFS servers (Dell R720XD/36TB each) which the machines automount. This is really simple to maintain and if we want to do replication on a per volume level we can. We are looking into GlusterFS though for certain things.
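If it helps to picture it, that kind of front-end is basically autofs plus symlinks; something like the following (the map names, servers, and paths are invented for the example):

  # /etc/auto.master: automount the back-end NFS servers under /data
  /data  /etc/auto.data

  # /etc/auto.data: one entry per back-end server/volume
  vol01  -rw,hard,intr  nfs01:/export/vol01
  vol02  -rw,hard,intr  nfs02:/export/vol02

  # on the front-end export, symlinks point users at the right volume
  ln -s /data/vol01/projects/alpha  /export/frontend/alpha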
On 28.02.2014 13:15, Götz Reinicke - IT Koordinator wrote:
Hi,
<snip>
What might be a good way to provide one big filesystem for a few users which could also be enlarged later? Backing up the data is not the question here.
I use GlusterFS. http://www.gluster.org/
HTH Lucian