This idea is intriguing...
Suppose one has a set of file servers called A, B, C, D, and so forth, all running CentOS 6.5 64-bit and all interconnected with 10GbE. These file servers can be divided into identical pairs, so A has the same configuration (disks, processors, etc.) as B, C the same as D, and so forth (because this is what I have; there are ten servers in all). Each file server has four Xeon 3GHz processors and 16GB memory. File server A acts as an iscsi target for logical volumes A1, A2,...An, and file server B acts as an iscsi target for logical volumes B1, B2,...Bn, where each LVM volume is 10 TB in size (a RAID-5 set of six 2TB NL-SAS disks). There are no file systems built directly on any of the LVM volumes. Each member of a server pair (A,B) is in a different cabinet (albeit in the same machine room), is on a different power circuit, and has UPS protection.
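For concreteness, a minimal sketch of the target side on file server A, assuming the stock scsi-target-utils (tgtd) package; the volume group name, IQN, and initiator address are hypothetical:

  # carve a 10 TB logical volume A1 out of the RAID-5 set (VG name is made up)
  lvcreate -L 10T -n a1 vg_a

  # /etc/tgt/targets.conf -- export the LV as an iSCSI target, restricted to S
  <target iqn.2014-05.com.example:a1>
      backing-store /dev/vg_a/a1
      initiator-address 192.168.1.50
  </target>

  # pick up the new definition
  tgt-admin --update ALL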
A server system called S (which has six processors and 48 GB memory, and is not one of the file servers) acts as iscsi initiator for all targets. On S, A1 and B1 are combined into the software RAID-1 volume /dev/md101. Similarly, A2 and B2 are combined into /dev/md102, and so forth for as many target pairs as one has. The initial sync of /dev/md101 takes about six hours, with the sync speed being around 400 MB/sec for a 10TB volume. I realize that only half of the 10-gig bandwidth is available while writing, since the data is being written twice.
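On the initiator side, the assembly on S might look roughly like this (portal names and the resulting /dev/sdX names are hypothetical; an internal write-intent bitmap is assumed so that a briefly-absent target resyncs quickly):

  # log in to the targets on A and B
  iscsiadm -m discovery -t sendtargets -p fileserver-a
  iscsiadm -m discovery -t sendtargets -p fileserver-b
  iscsiadm -m node --login

  # mirror A1 (here /dev/sdc) and B1 (here /dev/sdd) into /dev/md101
  mdadm --create /dev/md101 --level=1 --raid-devices=2 \
        --bitmap=internal /dev/sdc /dev/sdd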
All of the /dev/md10X volumes are LVM PVs and are members of the same volume group, and there is one logical volume that occupies the entire volume group. An XFS file system (-i size=512, inode64) is built on top of this logical volume, and S NFS-exports that to the world (an HPC cluster of about 200 systems). In my case, the size of the resulting file system will ultimately be around 80 TB. The I/O performance of the XFS file system is most excellent, and greatly exceeds that of equivalent file systems built with packages such as MooseFS and GlusterFS: I get about 350 MB/sec write speed through the file system, and up to 800 MB/sec read.
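Assembled into commands, the layering above would be something like the following sketch (volume group, logical volume, mount point, and export subnet are all hypothetical; the mkfs and mount options are the ones quoted):

  pvcreate /dev/md101 /dev/md102 /dev/md103 /dev/md104
  vgcreate vg_big /dev/md101 /dev/md102 /dev/md103 /dev/md104
  lvcreate -l 100%FREE -n lv_big vg_big

  mkfs.xfs -i size=512 /dev/vg_big/lv_big
  mount -o inode64 /dev/vg_big/lv_big /export/big

  # /etc/exports -- NFS-export to the cluster subnet
  /export/big  10.10.0.0/16(rw,no_root_squash)
  exportfs -ra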
I have built something like this, and by performing tests such as sending a SIGKILL to one of the tgtd daemons, I have been unable to kill access to the file system. Obviously one has to intervene manually on the return of the tgtd in order to fail/hot-remove/hot-add the relevant target(s) to the md device. Presumably this will be made easier by using persistent device names for the targets on S.
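The manual intervention might amount to something like this sketch, using a persistent /dev/disk/by-path name for the target rather than a bare /dev/sdX (the IQN and portal address are made up):

  DEV=/dev/disk/by-path/ip-192.168.1.10:3260-iscsi-iqn.2014-05.com.example:a1-lun-1

  # after tgtd on A dies, mark and remove the failed member
  mdadm /dev/md101 --fail $DEV
  mdadm /dev/md101 --remove failed

  # once tgtd is back and the session has been re-established:
  iscsiadm -m node -T iqn.2014-05.com.example:a1 -p 192.168.1.10 --login
  mdadm /dev/md101 --add $DEV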
One could probably expand this to supplement the server S with a second server T to allow the possibility of failover of the service should S croak. I haven't tackled that part yet.
So, what failure scenarios can take out the entire file system, assuming that both members of a pair (A,B) or (C,D) don't go down at the same time? There's no doubt that I haven't thought of something.
Steve
On Sat, May 17, 2014 at 10:30 AM, Steve Thompson smt@vgersoft.com wrote:
A server system called S (which has six processors and 48 GB memory, and is not one of the file servers), acts as iscsi initiator for all targets. On S, A1 and B1 are combined into the software RAID-1 volume /dev/md101.
Sounds like you might be reinventing the wheel. DRBD [0] does what it sounds like you're trying to accomplish [1].
Especially since you have two nodes A+B or C+D that are RAIDed over iSCSI.
It's rather painless to set up two nodes with DRBD. But once you want to sync three [2] or more nodes with each other, the number of resources (DRBD block devices) grows quickly. Linbit, the developers behind DRBD, call it resource stacking.
[0] http://www.drbd.org/ [1] http://www.drbd.org/users-guide-emb/ch-configure.html [2] http://www.drbd.org/users-guide-emb/s-three-nodes.html
On Sat, 17 May 2014, SilverTip257 wrote:
Sounds like you might be reinventing the wheel.
I think not; see below.
DRBD [0] does what it sounds like you're trying to accomplish [1]. Especially since you have two nodes A+B or C+D that are RAIDed over iSCSI. It's rather painless to set up two-nodes with DRBD.
I am familiar with DRBD, having used it for a number of years. However, I don't think this does what I am describing. With a conventional two-node DRBD setup, the drbd block device appears on both storage nodes, one of which is primary. In this case, writes to the block device are done from the client to the primary, and the storage I/O is done locally on the primary and is forwarded across the network by the primary to the secondary.
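For reference, the conventional two-node arrangement being described corresponds to a DRBD resource of roughly this shape (DRBD 8.3-style syntax; hostnames, backing devices, and addresses are hypothetical):

  # /etc/drbd.d/r0.res
  resource r0 {
    protocol C;                    # synchronous: a write completes on both nodes
    on nodea {
      device    /dev/drbd0;
      disk      /dev/vg_a/a1;
      address   192.168.1.10:7789;
      meta-disk internal;
    }
    on nodeb {
      device    /dev/drbd0;
      disk      /dev/vg_b/b1;
      address   192.168.1.11:7789;
      meta-disk internal;
    }
  }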
What I am describing in my experiment is a setup in which the block device (/dev/mdXXX) appears on neither of the storage nodes, but on a third node. Writes to the block device are done from the client to the third node and are forwarded over the network to both storage servers. The whole setup can be done with only packages from the base repo.
I don't see how this can be accomplished with DRBD, unless the DRBD two-node setup then iscsi-exports the block device to the third node. With provision for failover, this is surely a great deal more complex than the setup that I have described.
If DRBD had the ability for the drbd block device to appear on a third node (one that *does not have any storage*), then it would perhaps be different.
Steve
How about glusterfs?
On Sat, May 17, 2014 at 1:00 PM, Steve Thompson smt@vgersoft.com wrote:
What I am describing in my experiment is a setup in which the block device (/dev/mdXXX) appears on neither of the storage nodes, but on a third node. Writes to the block device are done from the client to the third node and are forwarded over the network to both storage servers. The whole setup can be done with only packages from the base repo.
Right, DRBD is no longer available from the CentOS Extras repo (like it was in EL5).
I don't see how this can be accomplished with DRBD, unless the DRBD two-node setup then iscsi-exports the block device to the third node. With provision for failover, this is surely a great deal more complex than the setup that I have described.
If DRBD had the ability for the drbd block device to appear on a third node (one that *does not have any storage*), then it would perhaps be different.
Ah, good point.
On 17.05.2014 19:00, Steve Thompson wrote:
If DRBD had the ability for the drbd block device to appear on a third node (one that *does not have any storage*), then it would perhaps be different.
Why specifically do you care about that? Both with your solution and the DRBD one, the clients only see an NFS endpoint, so what does it matter that this endpoint is placed on one of the storage systems? Also, while streaming performance may be OK with your solution, latency is going to be fairly terrible due to the round trips and synchronicity required, so this may be a nice setup for e.g. a backup storage system, but it is not really suited as a more general-purpose solution.
Regards, Dennis
On Sun, 18 May 2014, Dennis Jacobfeuerborn wrote:
Why specifically do you care about that? Both with your solution and the DRBD one, the clients only see an NFS endpoint, so what does it matter that this endpoint is placed on one of the storage systems?
The whole point of the exercise is to end up with multiple block devices on a single system so that I can combine them into one VG using LVM, and then build a single file system that covers the lot. On a budget, of course.
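That is also what keeps growth simple: adding another mirrored pair is just another PV in the same VG, roughly as follows (re-using the hypothetical names from the sketches above):

  pvcreate /dev/md105
  vgextend vg_big /dev/md105
  lvextend -l +100%FREE /dev/vg_big/lv_big
  xfs_growfs /export/big        # XFS grows online; the argument is the mount point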
Also, while streaming performance may be OK with your solution, latency is going to be fairly terrible due to the round trips and synchronicity required, so this may be a nice setup for e.g. a backup storage system, but it is not really suited as a more general-purpose solution.
Yes, I hear what you are saying. However, I have investigated MooseFS and GlusterFS using the same resources, and my experimental iscsi-based setup gives a file system that is *much* faster than either in practical use, latency notwithstanding.
Steve
Have you looked at parallel filesystems such as Lustre and fhgfs?
On Sun, 18 May 2014, Andrew Holway wrote:
Have you looked at parallel filesystems such as Lustre and fhgfs?
I have not looked at Lustre, as I have heard many negative things about it (including the Oracle ownership). The only business using Lustre whose admins I know has had a lot of trouble with it. No redundancy.
Fhgfs looks interesting, and I am planning on looking at it, but have not yet done so.
MooseFS and GlusterFS have both been evaluated, and were too slow. In the case of GlusterFS, waaaay too slow.
Steve
On 05/18/2014 11:47 AM, Steve Thompson wrote:
MooseFS and GlusterFS have both been evaluated, and were too slow. In the case of GlusterFS, waaaay too slow.
How recently have you looked at Gluster? It has seen some significant progress, though small files are still its weakest area. I believe that some use-cases have found that NFS access is faster for small files.
Ted Miller Elkhart, IN
On Sun, 18 May 2014, Ted Miller wrote:
How recently have you looked at Gluster? It has seen some significant progress, though small files are still its weakest area. I believe that some use-cases have found that NFS access is faster for small files.
I last looked at Gluster about two months ago, using version 3.4.2.
Steve
We were using glusterfs for shared home directories and it was really slow. We're using an NFS share now and it's working much faster.
Mark
On Sun, May 18, 2014 at 10:47 AM, Steve Thompson smt@vgersoft.com wrote:
On Sun, 18 May 2014, Andrew Holway wrote:
MooseFS and GlusterFS have both been evaluated, and were too slow. In the case of GlusterFS, waaaay too slow.
Do you really need filesystem semantics or would ceph's object store work?
On Mon, May 19, 2014 at 6:35 AM, Steve Thompson smt@vgersoft.com wrote:
On Sun, 18 May 2014, Les Mikesell wrote:
Do you really need filesystem semantics or would ceph's object store work?
Yes, I really need file system semantics; I am storing home directories.
In that case, wouldn't it be simpler to have several separate DRBD pairs with the directory from the appropriate server automounted at login instead of consolidating them to the point where you have scaling issues?
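A sketch of that automount arrangement (indirect autofs maps; server names and paths are hypothetical):

  # /etc/auto.master
  /home   /etc/auto.home

  # /etc/auto.home -- each user's directory comes from whichever DRBD pair serves it
  alice   -fstype=nfs,rw   pair-ab:/export/home/alice
  bob     -fstype=nfs,rw   pair-cd:/export/home/bob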
And have you tried ceph's filesystem layer?
I have not looked at Lustre, as I have heard many negative things about it (including the Oracle ownership). The only business using Lustre whose admins I know has had a lot of trouble with it. No redundancy.
I know some Lustre admins who indeed have the far-away stare of people who have survived natural disasters. It can be somewhat unstable and difficult to manage when you try to roll it yourself, but if you bring in the professionals and have it properly supported you can have a good time.
Lustre is not owned by Oracle; it is free and open-source software licensed under GPL v2. It does have redundancy, but this is handled at the hardware level with active/active object storage servers and metadata servers.
It is primarily supported by Intel; they have the most developers and sell the most support contracts. It is a very interesting replacement for Hadoop HDFS.
Fhgfs looks interesting, and I am planning on looking at it, but have not yet done so.
The Fraunhofer Parallel Cluster File System (FhGFS) has just been spun out of the German institute in which it was born and has been renamed BeeGFS (the Germans never had a knack for snappy names :).
It is a very strong contender for these kinds of workloads and is probably just about to be fully open-sourced.
In general, parallel file systems such as Lustre are quite hard to get right, and most people fail to grasp the complexity and the skill required to implement them. People have a go, fsck it up (heh), and then blame the software when it doesn't work properly. If you really have a business requirement for insane metadata performance over a single, multi-petabyte namespace, you should be sure to tread lightly and carry a good support contract.
MooseFS and GlusterFS have both been evaluated, and were too slow. In the case of GlusterFS, waaaay too slow.
I believe Gluster to be a rapidly dying project; however, I am willing to be set straight on this point. It seems that anyone looking at Gluster will also be looking at Ceph, and that is an obviously better system.