Hey,
since RH took control of glusterfs, I've been looking to convert our old independent RAID storage servers to several non-RAID glustered ones.
The thing is that, here and there, I have heard a few frightening stories from some users (even with the latest release). Has anyone used it long enough to say whether one can blindly trust it, or whether it is almost there but not yet ready?
Thx, JD
On 28 August 2012 11:14, John Doe jdmls@yahoo.com wrote:
Hey,
since RH took control of glusterfs, I've been looking to convert our old independent RAID storage servers to several non-RAID glustered ones.
The thing is that, here and there, I have heard a few frightening stories from some users (even with the latest release). Has anyone used it long enough to say whether one can blindly trust it, or whether it is almost there but not yet ready?
I can't say anything about the RH Storage Appliance, but for us, gluster up to 3.2.x was most definitely not ready. We went through a lot of pain, and even after optimizing the OS config with the help of gluster support, we were facing insurmountable problems. One of them was kswapd instances going into overdrive, and once the machine reached a certain load, all networking functions just stopped. I'm not saying this is gluster's fault, but even with support we were unable to configure the machines so that this didn't happen. That was on CentOS 5.6/x86_64.
Another problem was that due to load and frequent updates (each new version was supposed to fix bugs; some weren't fixed, and there were plenty of new ones) the filesystems became inconsistent. In theory, each file lives on a single brick. The reality was that in the end, there were many files that existed on all bricks, one copy fully intact, the others with zero size and funny permissions. You can guess what happens if you're not aware of this and try to copy/rsync data off all bricks to different storage. IIRC there were internal changes that required going through a certain procedure during some upgrades to ensure filesystem consistency, and these procedures were followed.
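If anyone needs to do a similar cleanup, something along these lines (a rough sketch; the brick paths are made up, and it assumes the bogus copies show up as zero-length files) can at least flag the suspect files before you copy data off the bricks:

# Sketch: walk every brick and flag files that exist on more than one brick
# where at least one copy is real and at least one is zero-length.
# Brick paths below are placeholders; adapt them before trusting the output.
import os
from collections import defaultdict

BRICKS = ["/bricks/brick1", "/bricks/brick2", "/bricks/brick3"]  # placeholders

copies = defaultdict(list)   # relative path -> [(brick, size), ...]
for brick in BRICKS:
    for root, _dirs, files in os.walk(brick):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, brick)
            copies[rel].append((brick, os.path.getsize(full)))

for rel, locations in copies.items():
    sizes = [size for _brick, size in locations]
    # present on several bricks, with a real copy plus a zero-length copy:
    # candidate for closer inspection before any copy/rsync off the bricks
    if len(locations) > 1 and 0 in sizes and any(size > 0 for size in sizes):
        print("suspect:", rel, locations)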
We only started out with 3.0.x, and my impression was that development was focusing on new features rather than bug fixes.
From: isdtor isdtor@gmail.com
I can't say anything about the RH Storage Appliance, but for us, gluster up to 3.2.x was most definitely not ready. ... We only started out with 3.0.x, and my impression was that development was focusing on new features rather than bug fixes.
From: David C. Miller millerdc@fusion.gat.com
I'm using gluster 3.3.0-1 ... Been running this since 3.3 came out. I did quite a bit of failure testing before going live. So far it is working well.
I read that 3.3 was the first "RH" release. Let's hope they did/will focus on bug fixing... So I guess I will wait a little bit more.
Thx to both, JD
On 08/29/2012 04:07 AM, John Doe wrote:
From: isdtor isdtor@gmail.com
I can't say anything about the RH Storage Appliance, but for us, gluster up to 3.2.x was most definitely not ready. ... We only started out with 3.0.x, and my impression was that development was focusing on new features rather than bug fixes.
From: David C. Miller millerdc@fusion.gat.com
I'm using gluster 3.3.0-1 ... Been running this since 3.3 came out. I did quite a bit of failure testing before going live. So far it is working well.
I read that 3.3 was the first "RH" release. Let's hope they did/will focus on bug fixing... So I guess I will wait a little bit more.
We use glusterfs in the CentOS build infrastructure ... and for the most part it works fairly well.
It is sometimes very slow on file systems with lots of small files ... especially for operations like find or chmod/chown on a large volume with lots of small files.
BUT, that said, it is very convenient to use commodity hardware and have redundant, large, failover volumes on the local network.
We started with version 3.2.5 and now use 3.3.0-3, which is faster than 3.2.5 ... so it should get better in the future.
I can recommend glusterfs as I have not found anything that does what it does and does it better, but it is challenging and may not be good for all situations, so test it before you use it.
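If you want a feel for the small-file penalty on your own hardware before committing, a quick and dirty timing script along these lines (the mount path is only a placeholder), run once against local disk and once against a gluster mount, will show the difference:

# Crude small-file benchmark: create, stat and chmod N tiny files and time
# each phase.  Compare a local filesystem against a gluster mount.
import os, time

TARGET = "/mnt/gluster-test/smallfile-bench"   # placeholder: point at the volume to test
N = 5000                                       # number of small files

if not os.path.isdir(TARGET):
    os.makedirs(TARGET)

def timed(label, fn):
    start = time.time()
    fn()
    print("%s: %.1fs" % (label, time.time() - start))

def create():
    for i in range(N):
        with open(os.path.join(TARGET, "f%d" % i), "wb") as f:
            f.write(b"x" * 1024)               # 1 KB per file

timed("create", create)
timed("stat", lambda: [os.stat(os.path.join(TARGET, "f%d" % i)) for i in range(N)])
timed("chmod", lambda: [os.chmod(os.path.join(TARGET, "f%d" % i), 0o644) for i in range(N)])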
From: Johnny Hughes johnny@centos.org
We use glusterfs in the CentOS build infrastructure ... and for the most part it works fairly well. It is sometimes very slow on file systems with lots of small files ... especially for operations like find or chmod/chown on a large volume with lots of small files. BUT, that said, it is very convenient to use commodity hardware and have redundant, large, failover volumes on the local network. We started with version 3.2.5 and now use 3.3.0-3, which is faster than 3.2.5 ... so it should get better in the future. I can recommend glusterfs as I have not found anything that does what it does and does it better, but it is challenging and may not be good for all situations, so test it before you use it.
I am not too worried about bad performance. I am afraid of getting paged one night because the 50+ TB of the storage cluster are gone following a bug/crash... It would take days/weeks to set it back up from the backups. If we were rich, I guess we would have two (or more) "geo-replicated" glusters and be able to withstand one failing... I would like the same trust level that I have in RAID.
JD
On 08/29/2012 05:16 AM, John Doe wrote:
From: Johnny Hughes johnny@centos.org
We use glusterfs in the CentOS build infrastructure ... and for the most part it works fairly well. It is sometimes very slow on file systems with lots of small files ... especially for operations like find or chmod/chown on a large volume with lots of small files. BUT, that said, it is very convenient to use commodity hardware and have redundant, large, failover volumes on the local network. We started with version 3.2.5 and now use 3.3.0-3, which is faster than 3.2.5 ... so it should get better in the future. I can recommend glusterfs as I have not found anything that does what it does and does it better, but it is challenging and may not be good for all situations, so test it before you use it.
I am not too worried about bad performance. I am afraid of getting paged one night because the 50+ TB of the storage cluster are gone following a bug/crash... It would take days/weeks to set it back up from the backups. If we were rich, I guess we would have two (or more) "geo-replicated" glusters and be able to withstand one failing... I would like the same trust level that I have in RAID.
I have routinely used DRBD for things like this ... 2 servers, one a complete failover of the other one. Of course, that requires a 50+ TB file system on each machine.
On Wed, Aug 29, 2012 at 7:49 AM, Johnny Hughes johnny@centos.org wrote:
If we were rich, I guess we would have two (or more) "geo-replicated" glusters and be able to withstand one failing... I would like the same trust level that I have in RAID.
I have routinely used DRBD for things like this ... 2 servers, one a complete failover of the other one. Of course, that requires a 50+ TB file system on each machine.
How well do glusterfs or drbd deal with downtime of one of the members? Do they catch up quickly with incremental updates and what kind of impact does that have on performance as it happens? And is either suitable for running over distances where there is some network latency?
On 08/29/12 6:06 AM, Les Mikesell wrote:
How well do glusterfs or drbd deal with downtime of one of the members? Do they catch up quickly with incremental updates and what kind of impact does that have on performance as it happens? And is either suitable for running over distances where there is some network latency?
The extreme case is when one end fails, you rebuild it, and have to replicate the whole thing. How long does it take to move 50 TB across your LAN? How fast can your file system write that much?
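Rough numbers, assuming the network is the bottleneck, the disks can keep up, and you get about 80% of line rate (pure back-of-envelope, nothing measured):

# Transfer time for a full 50 TB resync over a single link.
TOTAL_BYTES = 50e12        # 50 TB to resynchronize
EFFICIENCY = 0.8           # assumed usable fraction of line rate

for name, bits_per_second in [("1 GbE", 1e9), ("10 GbE", 10e9)]:
    usable_bytes = bits_per_second / 8.0 * EFFICIENCY
    hours = TOTAL_BYTES / usable_bytes / 3600.0
    print("%s: about %.0f hours (%.1f days)" % (name, hours, hours / 24.0))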
On 08/29/2012 08:06 AM, Les Mikesell wrote:
On Wed, Aug 29, 2012 at 7:49 AM, Johnny Hughes johnny@centos.org wrote:
If we were rich, I guess we would have two (or more) "geo-replicated" glusters and be able to withstand one failing... I would like the same trust level that I have in RAID.
I have routinely used DRBD for things like this ... 2 servers, one a complete failover of the other one. Of course, that requires a 50+ TB file system on each machine.
How well do glusterfs or drbd deal with downtime of one of the members? Do they catch up quickly with incremental updates and what kind of impact does that have on performance as it happens? And is either suitable for running over distances where there is some network latency?
Well, DRBD is a tried and true solution, but it requires dedicated boxes and crossover network connections, etc. I would consider it by far the best method for providing critical failover.
I would consider glusterfs almost a different thing entirely ... it provides the ability to string several partitions on different machines into one shared network volume.
Glusterfs does also provide redundancy if you set it up that way ... and if you have a fast network and enough volumes then performance is not degraded very much when a gluster volume comes back, etc.
However, I don't think I would trust extremely critical things on glusterfs at this point.
On 08/29/2012 03:17 PM, Johnny Hughes wrote:
On 08/29/2012 08:06 AM, Les Mikesell wrote:
On Wed, Aug 29, 2012 at 7:49 AM, Johnny Hughes johnny@centos.org wrote:
If we were rich, I guess we would have two (or more) "geo-replicated" glusters and be able to withstand one failing... I would like the same trust level that I have in RAID.
I have routinely used DRBD for things like this ... 2 servers, one a complete failover of the other one. Of course, that requires a 50+ TB file system on each machine.
How well do glusterfs or drbd deal with downtime of one of the members? Do they catch up quickly with incremental updates and what kind of impact does that have on performance as it happens? And is either suitable for running over distances where there is some network latency?
Well, DRBD is a tried and true solution, but it requires dedicated boxes and crossover network connections, etc. I would consider it by far the best method for providing critical failover.
I would consider glusterfs almost a different thing entirely ... it provides the ability to string several partitions on different machines into one shared network volume.
Glusterfs does also provide redundancy if you set it up that way ... and if you have a fast network and enough volumes then performance is not degraded very much when a gluster volume comes back, etc.
However, I don't think I would trust extremely critical things on glusterfs at this point.
I think the keyword with solutions like glusterfs, ceph, sheepdog, etc. is "elasticity". DRBD and RAID work well as long as you have a fixed amount of data to deal with, but once you have consistent data growth you need something that offers redundancy yet can be easily extended incrementally.
Glusterfs seems to aim to be a solution that works well right now because it uses a simple file replication approach, whereas ceph and sheepdog seem to go deeper and provide better architectures but will take longer to mature.
Regards, Dennis
Greetings,
On Wed, Aug 29, 2012 at 7:39 PM, Dennis Jacobfeuerborn dennisml@conversis.de wrote:
Where does OpenAFS stand in all these deliberations? http://www.openafs.org/
On Wed, Aug 29, 2012 at 9:20 PM, Rajagopal Swaminathan raju.rajsand@gmail.com wrote:
Greetings,
On Wed, Aug 29, 2012 at 7:39 PM, Dennis Jacobfeuerborn dennisml@conversis.de wrote:
Where does OpenAFS stand in all these deliberations? http://www.openafs.org/
oops, missed this one: http://www.stacken.kth.se/project/arla/
On Wed, Aug 29, 2012 at 10:52 AM, Rajagopal Swaminathan raju.rajsand@gmail.com wrote:
Where does OpenAFS stand in all these deliberations? http://www.openafs.org/
oops, missed this one: http://www.stacken.kth.se/project/arla/
AFS isn't what you expect from a distributed file system. Each machine works with cached copies of whole files, and when one of them writes and closes a file, the others are notified to update their copy. Last write wins.
----- Original Message -----
From: "John Doe" jdmls@yahoo.com To: "Cent O Smailinglist" centos@centos.org Sent: Tuesday, August 28, 2012 3:14:29 AM Subject: [CentOS] Is glusterfs ready?
Hey,
since RH took control of glusterfs, I've been looking to convert our old independent RAID storage servers to several non-RAID glustered ones.
The thing is that, here and there, I have heard a few frightening stories from some users (even with the latest release). Has anyone used it long enough to say whether one can blindly trust it, or whether it is almost there but not yet ready?
Thx, JD
I'm using gluster 3.3.0-1 on two KVM host nodes. I have a 1 TB logical volume used as a brick on each node to create a replicated volume. I store my VMs on this volume and have each node mount the gluster volume via localhost using the native FUSE gluster driver. I get about 75-105 MB/s over 1 Gb Ethernet. Been running this since 3.3 came out. I did quite a bit of failure testing before going live. So far it is working well. I'm only using it as a glorified network RAID1 to make live migration of my VMs fast.
David.
David C. Miller <millerdc@...> writes:
----- Original Message -----
From: "John Doe" <jdmls@...> To: "Cent O Smailinglist" <centos@...> Sent: Tuesday, August 28, 2012 3:14:29 AM Subject: [CentOS] Is glusterfs ready?
Hey,
since RH took control of glusterfs, I've been looking to convert our old independent RAID storage servers to several non-RAID glustered ones.
The thing is that, here and there, I have heard a few frightening stories from some users (even with the latest release). Has anyone used it long enough to say whether one can blindly trust it, or whether it is almost there but not yet ready?
Heya,
Well, I guess I'm one of the frightening stories, or at least a previous employer was. They had a mere 0.1 petabyte store over 6 bricks, yet they had incredible performance and reliability difficulties. I'm talking about a mission-critical system being unavailable for weeks at a time. At least it wasn't customer-facing (there was another set of servers for that).
The system was down more than it was up. Reading was generally OK (but very slow), but multiple threads writing caused mayhem - I'm talking lost files and file system accesses taking multiple minutes.
In the end I implemented a 1 TB store, fuse-unioned over the top of the thing, to absorb the impact of multiple threads writing to it. A single thread (overnight) brought the underlying glusterfs up to date.
That got us more or less running, but the darned thing spent most of its time re-indexing and balancing rather than serving files.
To be fair, some of the problems were undoubtedly of their own making, as 2 nodes were CentOS and 4 were Fedora 12 - apparently the engineer couldn't find the installation CD for the 2 new nodes and 'made do' with what he had! I recall that a difference in the system 'sort' command gave all sorts of grief until it was discovered, never mind the different versions of the gluster drivers.
I'd endorse Johnny's comments about it not handling large numbers of small files well (i.e. roughly under 10 MB). I believe it was designed for large multimedia files such as clinical X-rays, i.e. a small number of large files.
Another factor is that the available space is the physical space divided by 4 due to the replication across the nodes on top of the nodes being RAID'd themselves.
Let's see now - that was all of 6 months ago - unlike most of my war stories, it's not ancient history!
On 09/05/2012 07:14 AM, Bob Hepple wrote:
David C. Miller <millerdc@...> writes:
----- Original Message -----
From: "John Doe" <jdmls@...> To: "Cent O Smailinglist" <centos@...> Sent: Tuesday, August 28, 2012 3:14:29 AM Subject: [CentOS] Is glusterfs ready?
Hey,
since RH took control of glusterfs, I've been looking to convert our old independent RAID storage servers to several non-RAID glustered ones.
The thing is that, here and there, I have heard a few frightening stories from some users (even with the latest release). Has anyone used it long enough to say whether one can blindly trust it, or whether it is almost there but not yet ready?
Heya,
Well, I guess I'm one of the frightening stories, or at least a previous employer was. They had a mere 0.1 petabyte store over 6 bricks, yet they had incredible performance and reliability difficulties. I'm talking about a mission-critical system being unavailable for weeks at a time. At least it wasn't customer-facing (there was another set of servers for that).
The system was down more than it was up. Reading was generally OK (but very slow), but multiple threads writing caused mayhem - I'm talking lost files and file system accesses taking multiple minutes.
In the end I implemented a 1 TB store, fuse-unioned over the top of the thing, to absorb the impact of multiple threads writing to it. A single thread (overnight) brought the underlying glusterfs up to date.
That got us more or less running, but the darned thing spent most of its time re-indexing and balancing rather than serving files.
To be fair, some of the problems were undoubtedly of their own making, as 2 nodes were CentOS and 4 were Fedora 12 - apparently the engineer couldn't find the installation CD for the 2 new nodes and 'made do' with what he had! I recall that a difference in the system 'sort' command gave all sorts of grief until it was discovered, never mind the different versions of the gluster drivers.
That is the problem with most of these stories: the setups tend to be of the "adventurous" kind. Not only was the setup very asymmetrical, but Fedora 12 was long outdated even 6 months ago. This kind of setup should be categorized as "highly experimental" and not something you actually use in production.
I'd endorse Johnny's comments about it not handling large numbers of small files well (i.e. roughly under 10 MB). I believe it was designed for large multimedia files such as clinical X-rays, i.e. a small number of large files.
That's a problem with all distributed filesystems. For a few large files, the additional time needed for round-trips is usually dwarfed by the actual I/O requests themselves, so you don't notice it (much). With a ton of small files you incur lots of metadata-fetching round-trips for every few kilobytes read/written, which slows things down a great deal. So basically, if you want top performance for lots of small files, don't use a distributed filesystem.
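Just to illustrate the shape of the problem, here is a crude back-of-envelope model (all the numbers are made-up assumptions, not glusterfs measurements):

# Effective throughput once a fixed per-file round-trip cost is paid.
ROUND_TRIP = 0.0005        # assumed network round-trip: 0.5 ms
TRIPS_PER_FILE = 4         # assumed lookups/locks/metadata ops per file
BANDWIDTH = 100e6          # assumed usable bandwidth: 100 MB/s

def effective_mb_per_s(file_size):
    # time per file = fixed round-trips + time to move the bytes
    per_file = TRIPS_PER_FILE * ROUND_TRIP + file_size / BANDWIDTH
    return file_size / per_file / 1e6

for size in (4 * 1024, 64 * 1024, 1024 * 1024, 100 * 1024 * 1024):
    print("%10d bytes -> %6.1f MB/s" % (size, effective_mb_per_s(size)))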
Another factor is that the available space is the physical space divided by 4 due to the replication across the nodes on top of the nodes being RAID'd themselves.
That really depends on your setup. I'm not sure what you mean by the nodes being RAID'd themselves. If you run a four-node cluster and keep two copies of each file, you would probably create two pairs of nodes, where one node of each pair is replicated to the other, and then create a stripe over these two pairs, which should actually improve performance. This would mean your available space would be cut in half, not divided by 4.
Regards, Dennis
From: Dennis Jacobfeuerborn dennisml@conversis.de
On 09/05/2012 07:14 AM, Bob Hepple wrote:
Another factor is that the available space is the physical space divided by 4 due to the replication across the nodes on top of the nodes being RAID'd themselves.
That really depends on your setup. I'm not sure what you mean by the nodes being raided themselves.
I think he meant gluster "RAID1" plus hardware RAID (RAID 10, I guess, given the x2, instead of standalone disks).
JD
On Wed, Sep 5, 2012 at 12:00 PM, John Doe jdmls@yahoo.com wrote:
From: Dennis Jacobfeuerborn dennisml@conversis.de
On 09/05/2012 07:14 AM, Bob Hepple wrote:
Another factor is that the available space is the physical space divided by 4 due to the replication across the nodes on top of the nodes being RAID'd themselves.
That really depends on your setup. I'm not sure what you mean by the nodes being raided themselves.
I think he meant gluster "RAID1" plus hardware RAID (RAID 10, I guess, given the x2, instead of standalone disks).
JD
Hello, this comment was posted on a site I administer, where I chronologically publish an archive of some CentOS (and some other distros) lists:
===== [comment] ===== A new comment on the post "Is Glusterfs Ready?"
Author: Jeff Darcy (jeff@pl.atyp.us, http://pl.atyp.us) Comment:
Hi. I'm one of the GlusterFS developers, and I'll try to offer a slightly different perspective.
First, sure GlusterFS has bugs. Some of them even make me cringe. If we really wanted to get into a discussion of the things about GlusterFS that suck, I'd probably be able to come up with more things than anybody, but one of the lessons I learned early in my career is that seeing all of the bugs for a piece of software leads to a skewed perspective. Some people have had problems with GlusterFS but some people have been very happy with it, and I guarantee that every alternative has its own horror stories. GlusterFS and XtreemFS were the only two distributed filesystems that passed some *very simple* tests I ran last year. Ceph crashed. MooseFS hung (and also doesn't honor O_SYNC). OrangeFS corrupted data. HDFS cheats by buffering writes locally, and doesn't even try to implement half of the required behaviors for a general-purpose filesystem. I can go through any of those codebases and find awful bug after horrible bug after data-destroying bug . . . and yet each of them has their fans too, because most users could never possibly hit the edge conditions where those bugs exist. The lesson is that anecdotes do not equal data. Don't listen to vendor hype, and don't listen to anti-vendor bashing either. Find out what the *typical* experience across a large number of users is, and how well the software works in your own testing.
Second, just as every piece of software has bugs, every truly distributed filesystem (i.e. not NFS) struggles with lots of small files. There has been some progress in this area with projects like Pomegranate and GIGA+, and we have some ideas for how to approach it in GlusterFS (see my talk at SDC next week), but overall I think it's important to realize that such a workload is likely to be problematic for *any* offering in the category. You'll have to do a lot of tuning, maybe implement some special workarounds yourself, but if you want to combine this I/O profile with the benefits of scalable storage it can all be worth it.
Lastly, if anybody is paying a 4x disk-space penalty (at one site) I'd say they're overdoing things. Once you have replication between servers, RAID-1 on each server is overkill. I'd say even RAID-6 is overkill. How many simultaneous disk failures do you need to survive? If the answer is two, as it usually seems to be, then GlusterFS replication on top of RAID-5 is a fine solution and requires a maximum of 3x (more typically just a bit more than 2x). In the future we're looking at various forms of compression and deduplication and erasure codes that will all bring the multiple down even further.
So I can't say whether it's ready or whether you can trust it. I'm not objective enough for my opinion on that to count for much. What I'm saying is that distributed filesystems are complex pieces of software, none of the alternatives are where any of us working on them would like to be, and the only way any of these projects get better is if users let us know of problems they encounter. Blog posts or comments describing specific issues, from people whose names appear nowhere on any email or bug report the developers could have seen, don't help to advance the state of the art.
===== [/comment] =====
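As a side note, the disk-space arithmetic in that last point works out roughly as in the sketch below (the disk counts and sizes are purely illustrative assumptions, nothing measured):

# Usable capacity after local RAID plus gluster replication.
def usable_fraction(replica_count, raid_efficiency):
    # fraction of raw disk space left after RAID and replication
    return raid_efficiency / replica_count

disks_per_node = 12        # illustrative assumption
disk_tb = 2.0              # illustrative assumption
nodes = 4                  # illustrative assumption
replica = 2                # two copies of each file across servers
raw_tb = disks_per_node * disk_tb * nodes

for raid, efficiency in [("RAID-1", 0.5),
                         ("RAID-5", (disks_per_node - 1) / float(disks_per_node)),
                         ("RAID-6", (disks_per_node - 2) / float(disks_per_node))]:
    frac = usable_fraction(replica, efficiency)
    print("%s + replica %d: %.2f of raw = %.0f TB out of %.0f TB"
          % (raid, replica, frac, raw_tb * frac, raw_tb))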
Regards,
On Tue, Sep 11, 2012 at 10:45 AM, Jose P. Espinal jose@pavelespinal.com wrote:
Blog posts or comments describing specific issues, from people whose names appear nowhere on any email or bug report the developers could have seen, don't help to advance the state of the art.
Just speaking for myself here, I'm less interested in 'advancing the state of the art' (which usually means running something painfully broken) than in finding something that already works... You didn't paint a very rosy picture there. Would it be better to just forget filesystem semantics and use one of the distributed nosql databases (riak, mongo, cassandra, etc.)?
From: Jose P. Espinal jose@pavelespinal.com
First, sure GlusterFS has bugs. Some of them even make me cringe. If we really wanted to get into a discussion of the things about GlusterFS that suck, I'd probably be able to come up with more things than anybody, but one of the lessons I learned early in my career is that seeing all of the bugs for a piece of software leads to a skewed perspective. Some people have had problems with GlusterFS but some people have been very happy with it, and I guarantee that every alternative has its own horror stories. ... So I can't say whether it's ready or whether you can trust it. I'm not objective enough for my opinion on that to count for much. What I'm saying is that distributed filesystems are complex pieces of software, none of the alternatives are where any of us working on them would like to be, and the only way any of these projects get better is if users let us know of problems they encounter.
By "ready" I just meant "safe enough to move all our production storage onto it and be 99.99% sure that it won't vanish one night"... Again, the same level of trust that one can have in RAID storage. It can still fail, but that is nowadays quite rare (it has luckily never happened to me).
I understand that developers need testers and feedback, and I am sure you are doing an excellent job, but we will start with a small test cluster and follow the project's progress.
Thx for your input, JD