On Sat, May 17, 2014 at 10:30 AM, Steve Thompson smt@vgersoft.com wrote:
This idea is intriguing...
Suppose one has a set of file servers called A, B, C, D, and so forth, all running CentOS 6.5 64-bit and all interconnected with 10GbE. These file servers can be divided into identical pairs, so A is the same configuration (disks, processors, etc.) as B, C the same as D, and so forth (because this is what I have; there are ten servers in all). Each file server has four Xeon 3GHz processors and 16GB memory. File server A acts as an iSCSI target for logical volumes A1, A2, ..., An, and file server B acts as an iSCSI target for logical volumes B1, B2, ..., Bn, where each LVM volume is 10 TB in size (a RAID-5 set of six 2TB NL-SAS disks). There are no file systems built directly on any of the LVM volumes. The members of each server pair (A,B) are in different cabinets (albeit in the same machine room), are on different power circuits, and have UPS protection.
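For reference, the export of one such volume from A with scsi-target-utils looks roughly like the sketch below (the IQN, volume group, and LV names are placeholders, not my real ones); B carries an equivalent definition for each Bn:

  # /etc/tgt/targets.conf on file server A
  <target iqn.2014-05.com.example:a1>
      # export the 10 TB logical volume A1 as a single LUN
      backing-store /dev/vg_a/a1
      # only the initiator host S may log in
      initiator-address 10.0.0.100
  </target>

  # apply the configuration to the running tgtd
  tgt-admin --update ALL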
A server system called S (which has six processors and 48 GB memory, and is not one of the file servers) acts as the iSCSI initiator for all targets. On S, A1 and B1 are combined into the software RAID-1 volume /dev/md101.
Sounds like you might be reinventing the wheel. DRBD [0] does what it sounds like you're trying to accomplish [1].
Especially since you have two nodes A+B or C+D that are RAIDed over iSCSI.
It's rather painless to set up two nodes with DRBD. But once you want to sync three [2] or more nodes with each other, the number of resources (DRBD block devices) grows quickly, because resources have to be layered on top of one another; Linbit, the developers behind DRBD, call this resource stacking.
[0] http://www.drbd.org/
[1] http://www.drbd.org/users-guide-emb/ch-configure.html
[2] http://www.drbd.org/users-guide-emb/s-three-nodes.html
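For comparison, a two-node DRBD resource mirroring one of the 10 TB volumes over the 10GbE link would look roughly like this (hostnames, addresses, and device paths are placeholders; DRBD 8.4 syntax; protocol C, synchronous replication, is the default):

  # /etc/drbd.d/a1.res, identical on both A and B
  resource a1 {
      device    /dev/drbd1;
      disk      /dev/vg_a/a1;      # local backing LV on each node
      meta-disk internal;
      on A { address 10.0.0.1:7789; }
      on B { address 10.0.0.2:7789; }
  }

  # on both nodes:
  drbdadm create-md a1
  drbdadm up a1
  # on one node only, to start the initial full sync:
  drbdadm primary --force a1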
Similarly, A2 and B2 are combined into /dev/md102, and so forth for as many target pairs as one has. The initial sync of /dev/md101 takes about 6 hours at around 400 MB/sec for a 10TB volume. I realize that only half of the 10-gig bandwidth is available while writing, since the data is written twice.
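On S, the login and array creation for one pair amount to something like the following (the IQNs and the resulting /dev/sdb and /dev/sdc names are assumptions; the write-intent bitmap is an addition worth considering so that a returning target resyncs only the blocks changed while it was away):

  # discover and log in to the targets on A and B
  iscsiadm -m discovery -t sendtargets -p 10.0.0.1
  iscsiadm -m discovery -t sendtargets -p 10.0.0.2
  iscsiadm -m node -T iqn.2014-05.com.example:a1 -p 10.0.0.1 --login
  iscsiadm -m node -T iqn.2014-05.com.example:b1 -p 10.0.0.2 --login

  # mirror the two remote LUNs into md101
  mdadm --create /dev/md101 --level=1 --raid-devices=2 \
        --bitmap=internal /dev/sdb /dev/sdc

  # watch the initial sync
  cat /proc/mdstat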
All of the /dev/md10X volumes are LVM PVs in the same volume group, and a single logical volume occupies the entire volume group. An XFS file system (-i size=512, inode64) is built on top of this logical volume, and S NFS-exports it to the world (an HPC cluster of about 200 systems). In my case, the size of the resulting file system will ultimately be around 80 TB. The I/O performance of the XFS file system is most excellent, and exceeds by a large margin the performance of equivalent file systems built with packages such as MooseFS and GlusterFS: I get about 350 MB/sec write speed through the file system, and up to 800 MB/sec read.
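For the record, the LVM/XFS/NFS layering amounts to something like this (volume group name, mount point, and export options are placeholders):

  # turn the mirrors into PVs and build one big VG with a single LV
  pvcreate /dev/md101 /dev/md102        # ... and so on for md103, md104, ...
  vgcreate vg_nfs /dev/md101 /dev/md102
  lvcreate -l 100%FREE -n lv_nfs vg_nfs

  # XFS with 512-byte inodes, mounted with 64-bit inode numbers
  mkfs.xfs -i size=512 /dev/vg_nfs/lv_nfs
  mount -o inode64 /dev/vg_nfs/lv_nfs /export/data

  # /etc/exports entry (options are illustrative), then publish it:
  #   /export/data  10.0.0.0/16(rw,async,no_root_squash)
  exportfs -ra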
I have built something like this, and by performing tests such as sending a SIGKILL to one of the tgtd's, I have been unable to kill access to the file system. Obviously one has to intervene manually when the tgtd returns, in order to fail/hot-remove/hot-add the relevant target(s) to the md device. Presumably this will be made easier by using persistent device names for the targets on S.
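The recovery step is the usual mdadm fail/remove/add sequence; using the persistent /dev/disk/by-path names for the iSCSI LUNs instead of the bare /dev/sdX names (which can move around across logins) makes it harder to grab the wrong disk. Roughly (the by-path name is illustrative):

  # after the target on A comes back, put its LUN back into the mirror
  DEV=/dev/disk/by-path/ip-10.0.0.1:3260-iscsi-iqn.2014-05.com.example:a1-lun-1
  mdadm /dev/md101 --fail $DEV --remove $DEV
  mdadm /dev/md101 --add $DEV
  # with an internal write-intent bitmap, only blocks dirtied while A
  # was away get resynced rather than the whole 10 TB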
One could probably expand this by supplementing the server S with a second server T, to allow failover of the service should S croak. I haven't tackled that part yet.
So, what failure scenarios can take out the entire file system, assuming that both members of a pair (A,B) or (C,D) don't go down at the same time? There's no doubt that I haven't thought of something.
Steve
Steve Thompson             E-mail:      smt AT vgersoft DOT com
Voyager Software LLC       Web:         http://www DOT vgersoft DOT com
39 Smugglers Path          VSW Support: support AT vgersoft DOT com
Ithaca, NY 14850
  "186,282 miles per second: it's not just a good idea, it's the law"