I have been learning about the different filesystems that exist. I used to work on systems where ReiserFS was the star, but since there is no longer any support from its creator, other options have to be considered. I want to ask about a couple of FS options. EXT4 is amazing for a single node, but for more than one it's another story. I have heard about GFS2 and GlusterFS and have read the docs and official materials from RH on them. The RH docs state that the EXT4 limit is 65k files per directory, and I had a directory that was pretty loaded with files; I am not sure exactly how many, but I am almost sure it was more than 65k.
I was considering using GlusterFS for a very large storage system with an NFS front end. I am still unsure whether EXT4 should be able to handle more than 16TB, since the Linux kernel ext4 docs at https://www.kernel.org/doc/Documentation/filesystems/ext4.txt state in section 2.1: "* ability to use filesystems > 16TB (e2fsprogs support not available yet)". So can I use it or not? If there are no tools that can handle this size, then I cannot trust it.
I want to create storage of more than 16TB based on GlusterFS, since it lets me use a 2-3 ring setup and put the storage in the form: 1 client -> HA NFS servers -> GlusterFS cluster.
It also seems to me that GlusterFS is a better choice than Swift, since RH does provide support for it.
Every response will be appreciated.
Thanks, Eliezer
Eliezer Croitoru wrote: <snip>
I was considering using GlusterFS for a very large storage system with an NFS front end. I am still unsure whether EXT4 should be able to handle more than 16TB, since the Linux kernel ext4 docs at https://www.kernel.org/doc/Documentation/filesystems/ext4.txt state in section 2.1: "* ability to use filesystems > 16TB (e2fsprogs support not available yet)". So can I use it or not? If there are no tools that can handle this size, then I cannot trust it.
I would not go over 16TB (actually, I should say that I have not gone over it, and I have several LARGE RAID boxes). The tools aren't there, or don't work usefully (days to check a filesystem isn't "useful").
I want to create storage of more than 16TB based on GlusterFS, since it lets me use a 2-3 ring setup and put the storage in the form: 1 client -> HA NFS servers -> GlusterFS cluster.
We tried out glusterfs a couple years ago, but I gather, from my manager and the user who tried it, that there were some issues. I have no idea how one would fsck a glusterfs.
*shrug* Why not stay under 16TB, and just mount the filesystems where you want? Were you looking at someone creating a single file larger than that?
mark
On Fri, Jul 5, 2013 at 8:45 AM, Eliezer Croitoru eliezer@ngtech.co.il wrote:
<snip>
I want to create storage of more than 16TB based on GlusterFS, since it lets me use a 2-3 ring setup and put the storage in the form: 1 client -> HA NFS servers -> GlusterFS cluster.
If you really only have one client, you might look at ceph for distributed block storage with xfs on top so you don't need to run through fuse. Or if your application can be changed to use the s3 interface you could get away from the posix filesystem bottlenecks completely.
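Untested here, but the rough shape of the "rbd + xfs" idea would be something like this (assumes the ceph cluster and client keyring are already set up; names and sizes are invented):

rbd create bigvol --size 20480     # size is in MB; goes into the default 'rbd' pool
rbd map bigvol                     # shows up as a block device, e.g. /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /mnt/big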
No experience with this stuff - just sounds like the promising up-and-coming thing...
-- Les Mikesell lesmikesell@gmail.com
On Fri, Jul 5, 2013 at 12:05 PM, Les Mikesell lesmikesell@gmail.com wrote:
On Fri, Jul 5, 2013 at 8:45 AM, Eliezer Croitoru eliezer@ngtech.co.il wrote:
I want to create storage of more than 16TB based on GlusterFS, since it lets me use a 2-3 ring setup and put the storage in the form: 1 client -> HA NFS servers -> GlusterFS cluster.
If you really only have one client, you might look at ceph for distributed block storage with xfs on top so you don't need to run
Or DRBD [0] to block-replicate the storage. I know a handful of people on the lists here use DRBD on their virtualization clusters.

A quick Google search for 'ceph vs drbd' yields these URLs [1] [2] that the OP might also look at.
through fuse. Or if your application can be changed to use the s3 interface you could get away from the posix filesystem bottlenecks completely.
No experience with this stuff - just sounds like the promising up-and-coming thing...
-- Les Mikesell lesmikesell@gmail.com
[0] http://www.drbd.org/ [1] http://ceph.com/community/ceph-comes-to-synnefo-and-ganeti/ [2] http://forum.proxmox.com/threads/10452-Why-a-Shared-or-Cluster-Filesystem-is...
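For anyone who hasn't seen DRBD before, a two-node resource definition is roughly this shape (hostnames, devices and addresses are placeholders; the docs at [0] are the authority):

# same file on both nodes
cat > /etc/drbd.d/r0.res <<'EOF'
resource r0 {
  on node1 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.10.1:7789;
    meta-disk internal;
  }
  on node2 {
    device    /dev/drbd0;
    disk      /dev/sdb1;
    address   192.168.10.2:7789;
    meta-disk internal;
  }
}
EOF
drbdadm create-md r0
drbdadm up r0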
----- Original Message -----
| I have been learning about the different filesystems that exist.
| <snip>
| I want to create storage of more than 16TB based on GlusterFS ...
| <snip>
As someone who has some rather large volumes for research storage, I will say that ALL of the file systems have limitations, *especially* in the case of failures. My typical volumes range from 16TB up to 48TB, and the big issue is when it comes to performing file system checks. You see, a lot of information gets loaded into memory in order to perform a file system check. A number of years ago I was unable to perform an EXT4 file system check on a 15TB volume without consuming over 32GB of memory, on a file system with very few files. At the time, the file server only had 8GB of memory, so this presented a problem.
However, while this problem was solvable, it was also subject to usage. The file system in question only had large files on it, typically gigabytes in size. Another filer, this time with 48GB of memory but tens of millions of very small files, needed nearly 96GB of memory to perform a file system check.
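One thing that can take the edge off on a box without that much RAM is telling e2fsck to spill its tables to scratch files on disk instead of holding everything in memory (see man e2fsck.conf; the directory below is just an example, and the check gets slower):

mkdir -p /var/cache/e2fsck
cat >> /etc/e2fsck.conf <<'EOF'
[scratch_files]
directory = /var/cache/e2fsck
EOF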
So far, without a doubt, XFS has been the best "overall" file system for our usages, but YMMV. It would seem that Red Hat is also pushing it as the file system of choice going forward until something better ( btrfs *snicker* ) comes along. XFS is also the recommended file system for use with GlusterFS so that makes it an easy choice too.
GlusterFS itself has some H/A built in. You can talk to any of the GlusterFS servers via NFS and it will fully operate in an active/active manner, so your diagram would be 1 client -> Gluster cluster (via the protocols Gluster supports: NFS/CIFS/native). I have found it to be rather fragile in some respects, and some of my workloads just don't map well to it even though it looks like they should gain some benefit. However, it does seem to work well for other workloads, and it is being actively developed.
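For reference, standing up a small replicated volume and mounting it over NFS is roughly this (hostnames, volume and brick paths are made up, and Gluster's built-in NFS server only speaks NFSv3):

gluster peer probe server2
gluster volume create mailvol replica 2 server1:/bricks/brick1/data server2:/bricks/brick1/data
gluster volume start mailvol
# from the client; any server in the cluster can answer
mount -t nfs -o vers=3,proto=tcp server1:/mailvol /mnt/mail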
GlusterFS also allows you to "import" existing file systems at a later time. So feel free to start off with a standard XFS volume, but be mindful of the XFS options that GlusterFS requires, namely a larger-than-default inode size. Then, if you decide to add clustering to your storage infrastructure, you can perform that "import" and start replicated or distributed file serving from Gluster.
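Preparing a brick ahead of time looks roughly like this (device, label, mount point and the inode size are examples; check the GlusterFS documentation for your release for the exact inode size it wants):

mkfs.xfs -i size=512 -L brick1 /dev/sdb1   # example inode size; use what the Gluster docs specify
mkdir -p /bricks/brick1
mount /dev/sdb1 /bricks/brick1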
On Fri, Jul 5, 2013 at 11:37 AM, James A. Peltier jpeltier@sfu.ca wrote:
As someone who has some rather large volumes for research storage I will say that ALL of the file systems have limitations, *especially* in the case of failures. I have typical volumes that range from 16TB up to 48TB and the big issue is when it comes to performing file system checks.
Have you done anything with ceph? With/without a filesystem on top?
So far, without a doubt, XFS has been the best "overall" file system for our usages, but YMMV. It would seem that Red Hat is also pushing it as the file system of choice going forward until something better ( btrfs *snicker* ) comes along. XFS is also the recommended file system for use with GlusterFS so that makes it an easy choice too.
Is the (snicker) from the slow development or do you think the goals are impossible? Btrfs on top of ceph sounds as good as a posix-looking fs could get.
-- Les Mikesell lesmikesell@gmail.com
----- Original Message -----
| On Fri, Jul 5, 2013 at 11:37 AM, James A. Peltier jpeltier@sfu.ca wrote:
| > <snip>
|
| Have you done anything with ceph? With/without a filesystem on top?
Nope, went with GlusterFS testing, as it was:

1) Something we could get full-stack support for if we opted to
2) Something that deviated as little as possible from the base OS as provided
| > So far, without a doubt, XFS has been the best "overall" file system
| > for our usages, but YMMV. <snip>
|
| Is the (snicker) from the slow development or do you think the goals
| are impossible? Btrfs on top of ceph sounds as good as a
| posix-looking fs could get.
I don't like to start flame wars, so let's just say that I think the limitations imposed on btrfs by its design are such that I don't think there is a chance it will ever reach the capabilities of the file system it is trying to compete against (ZFS). There is a reason the ZFS developers decided to toss out years of experience in file systems and start over: the overhead and limitations of the traditional methods just didn't cut it.
Again, these are only my opinions, based on what I see in front of me today and taking into consideration what I saw ZFS go through over the past 5-7 years.
On Fri, Jul 5, 2013 at 12:15 PM, James A. Peltier jpeltier@sfu.ca wrote:
| Is the (snicker) from the slow development or do you think the goals
| are impossible? Btrfs on top of ceph sounds as good as a
| posix-looking fs could get.
I don't like to start flame wars, so let's just say that I think the limitations imposed on btrfs by its design are such that I don't think there is a chance it will ever reach the capabilities of the file system it is trying to compete against (ZFS). There is a reason the ZFS developers decided to toss out years of experience in file systems and start over: the overhead and limitations of the traditional methods just didn't cut it.
I just think it is sad that the linux kernel license prohibits distribution with 'best-of-breed' components... But conceptually, distributing the block storage seems like a good idea and zfs embeds a lot of the block device management.
-- Les Mikesell lesmikesell@gmail.com
----- Original Message -----
| On Fri, Jul 5, 2013 at 12:15 PM, James A. Peltier jpeltier@sfu.ca wrote:
| > <snip>
|
| I just think it is sad that the linux kernel license prohibits
| distribution with 'best-of-breed' components... But conceptually,
| distributing the block storage seems like a good idea and zfs embeds a
| lot of the block device management.
|
| --
| Les Mikesell
| lesmikesell@gmail.com
I guess that's what FUSE is for, LOL! Indeed, ZFS does implement this, but none of that block storage impacts, or is utilized by, anything other than ZFS. With btrfs this all has to do with maintaining the legacy way of doing things, which is a severely limiting factor. The ZFS devs knew this would not be a backward-compatible change that its native file system (UFS) would be able to use at all anyway. They knew that UFS/SFS was not what was required for the next round of storage technology either.
I just think that there were many correct justifications for doing what the ZFS devs did, and that by doing so ZFS became a much better product for it. I mean, try to understand the btrfs syntax vs the zfs syntax. btrfs is INSANE!
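Just to give a flavor of what I mean, from memory, so treat it as a sketch rather than a reference (device names are placeholders):

# ZFS: create the pool, then datasets, with one consistent tool family
zpool create tank raidz2 sdb sdc sdd sde
zfs create -o compression=on tank/mail

# btrfs: mkfs plus the multiplexed 'btrfs' command
mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
mount /dev/sdb /mnt/tank
btrfs subvolume create /mnt/tank/mail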
OK, so back to the issue at hand. The issue is that I have mail storage with more than 65k users per domain, and ext4 doesn't support a directory listing of that size. ReiserFS indeed fits the purpose, but ext4 doesn't even start to scratch it. Now the real question is: what FS would you use as a Dovecot backend to store a domain with more than 65k users?
Eliezer
On 07/05/2013 04:45 PM, Eliezer Croitoru wrote:
<snip>
On Tue, Aug 6, 2013 at 8:58 PM, Eliezer Croitoru eliezer@ngtech.co.il wrote:
OK, so back to the issue at hand. The issue is that I have mail storage with more than 65k users per domain, and ext4 doesn't support a directory listing of that size. ReiserFS indeed fits the purpose, but ext4 doesn't even start to scratch it. Now the real question is: what FS would you use as a Dovecot backend to store a domain with more than 65k users?
XFS? It's used for situations where one has "lots or large", as Dave Chinner says [0], meaning lots of files or large files.
[0] http://www.youtube.com/watch?v=i3IreQHLELU
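If you go the XFS route for a big maildir tree, the setup itself is about as plain as it gets (device and mount point are examples):

mkfs.xfs -L mailstore /dev/sdb1
mount -o noatime,inode64 /dev/sdb1 /var/vmail
# put the same options in /etc/fstab so they persist across reboots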
<snip>
On Tue, 6 Aug 2013, SilverTip257 wrote:
<snip>
Just for interest: I have had two 44TB "raid 6" arrays using EXT4 running with heavy usage 24/7 on el6 since January 2013, without any problems so far. I rebuilt "e2fsprogs" from source, something along the lines below, looking at my notes.
# rebuild e2fsprogs from the newer SRPM so the tools can handle >16TB
wget http://atoomnet.net/files/rpm/e2fsprogs/e2fsprogs-1.42.6-1.el6.src.rpm
yum-builddep e2fsprogs-1.42.6-1.el6.src.rpm
rpmbuild --rebuild --recompile e2fsprogs-1.42.6-1.el6.src.rpm
cd /root/rpmbuild/RPMS/x86_64
rpm -Uvh *.rpm

###### build array with a partition #######
parted /dev/sda mkpart primary ext4 1 -1
mkfs.ext4 -L sraid1v -E stride=64,stripe-width=384 /dev/sda1

###### build array without a partition #######
mkfs.ext4 -L sraid1v -E stride=64,stripe-width=384 /dev/sda
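In case it helps, this is the usual way those -E numbers are derived (the chunk size and data-disk count below are assumptions that happen to match the values above, not something taken from my notes):

# stride       = RAID chunk size / filesystem block size = 256KiB / 4KiB = 64
# stripe-width = stride * number of data disks           = 64 * 6        = 384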
Maybe this will help someone.
Cheers Steve
On 08/07/2013 03:58 AM, Eliezer Croitoru wrote:
OK, so back to the issue at hand. The issue is that I have mail storage with more than 65k users per domain, and ext4 doesn't support a directory listing of that size. ReiserFS indeed fits the purpose, but ext4 doesn't even start to scratch it. Now the real question is: what FS would you use as a Dovecot backend to store a domain with more than 65k users?
Eliezer
It was back in 1995 when I had this kind of problem with about 0.05 M accounts, and our solution was used until at least 0.5 M accounts, when I left the company. The filesystem in question back then degraded severely in performance when there were more than about 200 files in a directory.
We ended up cooking our own way using FNV-1a hash, but Dovecot has something similar natively:
http://wiki2.dovecot.org/MailLocation
The "Directory hashing" is the interesting part, although that explanation does look like needing a complete rewrite.
Having lots of file names in a directory will likely mean that a) the directory file has grown over time in small extents scattered all over the disk, and b) reading it therefore becomes very inefficient.

Having a hashed subdirectory structure means that each directory will likely not overflow a 4kB file system block, or will at most have a few extent blocks, and reading them will not be _that_ much slower.
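If you want to play with the idea outside of Dovecot, a minimal shell sketch of bucketing mailboxes into hashed subdirectories looks like this (md5sum is used only because it ships with coreutils; the FNV-1a approach is the same idea, and the paths are just examples):

user="someuser"
# first two hex characters of the hash give up to 256 buckets
bucket=$(printf '%s' "$user" | md5sum | cut -c1-2)
mkdir -p "/var/mail/example.com/$bucket/$user"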
Best Regards, Matti Aarnio
Thanks! This was very helpful. I am testing something and will write to the Dovecot mailing list about it.
Eliezer
On 08/07/2013 09:42 PM, Matti Aarnio wrote:
<snip>