[CentOS] Large file system idea

Sat May 17 14:30:21 UTC 2014
Steve Thompson <smt at vgersoft.com>

This idea is intriguing...

Suppose one has a set of file servers called A, B, C, D, and so forth, all 
running CentOS 6.5 64-bit, all being interconnected with 10GbE. These file 
servers can be divided into identical pairs, so A is the same 
configuration (disks, processors, etc.) as B, C the same as D, and so forth 
(because this is what I have; there are ten servers in all). Each file 
server has four Xeon 3GHz processors and 16GB memory. File server A acts 
as an iscsi target for logical volumes A1, A2,...An, and file server B 
acts as an iscsi target for logical volumes B1, B2,...Bn, where each LVM 
volume is 10 TB in size (a RAID-5 set of six 2TB NL-SAS disks). There are 
no file systems built directly on any of the LVM volumes. The two members 
of each server pair (A,B) are in different cabinets (albeit in the same 
machine room), are on different power circuits, and have UPS protection.
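
For concreteness, the target definitions on A look roughly like the 
following in /etc/tgt/targets.conf (scsi-target-utils on CentOS 6); the 
IQNs, volume group name, and initiator address here are only placeholders:

   # /etc/tgt/targets.conf on file server A (illustrative names)
   <target iqn.2014-05.com.vgersoft:servera.a1>
       backing-store /dev/vg_targets/A1    # 10 TB LVM logical volume
       initiator-address 10.10.0.100       # server S only
   </target>
   <target iqn.2014-05.com.vgersoft:servera.a2>
       backing-store /dev/vg_targets/A2
       initiator-address 10.10.0.100
   </target>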

A server system called S (which has six processors and 48 GB memory, and 
is not one of the file servers), acts as iscsi initiator for all targets. 
On S, A1 and B1 are combined into the software RAID-1 volume /dev/md101. 
Similarly, A2 and B2 are combined into /dev/md102, and so forth for as 
many target pairs as one has. The initial sync of /dev/md101 takes about 6 
hours, with the sync speed being around 400 MB/sec for a 10TB volume. I 
realize that only half of the 10-gig bandwidth is available while writing, 
since the data is being written twice.
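
The assembly on S is nothing more than ordinary open-iscsi plus mdraid; 
roughly (IQNs, portal addresses, and device names are illustrative):

   # log in to the A1 and B1 targets from S
   iscsiadm -m node -T iqn.2014-05.com.vgersoft:servera.a1 -p 10.10.0.1 --login
   iscsiadm -m node -T iqn.2014-05.com.vgersoft:serverb.b1 -p 10.10.0.2 --login

   # mirror the pair; sdX/sdY are the iSCSI disks backed by A1 and B1
   # (an internal write-intent bitmap could be added to speed later re-syncs)
   mdadm --create /dev/md101 --level=1 --raid-devices=2 /dev/sdX /dev/sdY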

All of the /dev/md10X volumes are LVM PV's and are members of the same 
volume group, and there is one logical volume that occupies the entire 
volume group. An XFS file system (-i size=512, inode64) is built on top of 
this logical volume, and S NFS-exports that to the world (an HPC cluster 
of about 200 systems). In my case, the size of the resulting file system 
will ultimately be around 80 TB. The I/O performance of the XFS file 
system is most excellent, and far exceeds that of equivalent file systems 
built with packages such as MooseFS and GlusterFS: I get about 350 MB/sec 
write speed through the file system, and up to 800 MB/sec read.
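
The stack above the md devices is plain LVM and XFS; roughly (the volume 
group name, mount point, and export options are illustrative):

   pvcreate /dev/md101 /dev/md102        # ...and the rest of the md10X devices
   vgcreate vg_big /dev/md101 /dev/md102
   lvcreate -l 100%FREE -n lv_big vg_big
   mkfs.xfs -i size=512 /dev/vg_big/lv_big
   mount -o inode64 /dev/vg_big/lv_big /export/big

   # with an /etc/exports entry along the lines of:
   #   /export/big  10.10.0.0/16(rw,no_root_squash)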

I have built something like this, and by performing tests such as sending 
a SIGKILL to one of the tgtd processes, I have been unable to kill access 
to the file system. Obviously one has to intervene manually when the tgtd 
returns, in order to fail/hot-remove/hot-add the relevant target(s) to the md 
device. Presumably this will be made easier by using persistent device 
names for the targets on S.
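
With the by-path names from udev, the manual recovery for (say) A1 is just 
the usual mdadm sequence; roughly (the exact by-path name and addresses are 
illustrative):

   DEV=/dev/disk/by-path/ip-10.10.0.1:3260-iscsi-iqn.2014-05.com.vgersoft:servera.a1-lun-1

   mdadm /dev/md101 --fail $DEV      # md has often flagged it faulty already
   mdadm /dev/md101 --remove $DEV
   # once tgtd on A is back and the target is reachable again:
   iscsiadm -m node -T iqn.2014-05.com.vgersoft:servera.a1 -p 10.10.0.1 --login
   mdadm /dev/md101 --add $DEV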

One could probably extend this by supplementing server S with a second 
server T, to allow failover of the service should S croak. I haven't 
tackled that part yet.

So, what failure scenarios can take out the entire file system, assuming 
that both members of a pair (A,B) or (C,D) don't go down at the same time? 
There's no doubt that I haven't thought of something.

Steve
-- 
----------------------------------------------------------------------------
Steve Thompson                 E-mail:      smt AT vgersoft DOT com
Voyager Software LLC           Web:         http://www DOT vgersoft DOT com
39 Smugglers Path              VSW Support: support AT vgersoft DOT com
Ithaca, NY 14850
   "186,282 miles per second: it's not just a good idea, it's the law"
----------------------------------------------------------------------------