[CentOS] XFS on a 25 TB device

Wed Sep 29 22:37:19 UTC 2010

On Sep 29, 2010, at 2:53 PM, Lamar Owen <lowen at pari.edu> wrote:

> On Wednesday, September 29, 2010 01:25:11 pm Peter Kjellstrom wrote:
>> You are a bit mistaken. The raid controller does not "copy data around as it 
>> sees fit". It stores data on each disk in chunk-size'ed pieces. It then 
>> stripes this across all drives giving you a stripe-size'ed piece of chunk 
>> size times the number of data drives.
> 
> [Snip math]
> 
>> Then again, for other workloads the effect could be insignificant. YMMV.
> 
> For a simple RAID controller I can see some benefit.  
> 
> However, in my case the 'RAID controller' is on SAN, consisting of three EMC Clariion arrays: a CX3-10c, a CX3-80, and a CX700.  The EMC Navisphere/Unisphere tools allow LUN migration across RAID groups; I could very well take a LUN from a RAID1/0 with 16 drives to a RAID5 with 9 drives to a RAID6 with 10 drives to a RAID6 with 16 drives and have different stripe sizes.  Further, since this is all being accessed through VMware ESX, I'm limited to 2TB LUNs anyway, even using raw device mappings, which I do, but for a different reason; LVM to the rescue to get this:
> [root at backup-rdc ~]# df -h
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/mapper/VolGroup00-LogVol00
>                       37G   18G   18G  50% /
> /dev/sda1              99M   26M   69M  28% /boot
> /dev/mapper/dasch--backup-volume1
>                       21T   19T  2.6T  88% /opt/backups
> tmpfs                1006M     0 1006M   0% /dev/shm
> /dev/mapper/dasch--rdc-cx3--80
>                       23T   19T  4.2T  82% /opt/dasch-rdc
> [root at backup-rdc ~]# 
> 
> Yeah, the output of pvscan is pretty long (it has been longer, and seeing things like /dev/sdak1 is strange....).
> 
> Using XFS at the moment.  The two volume groups are on two different arrays; one is on the CX700 and the other on the CX3-80, and they're physically separated at two locations on-campus, with single-mode 4Gb/s FC ISL's between switches.  They're soon to be connected to different VMware ESX hosts; the dual fibre-channel connect was so the initial sync time would be reasonable.  
> 
> I looked through all the performance optimization howtos for XFS that I could find, but then realized how futile that would be with these 'RAID controllers' and their massive caches. Our CX3-80 SPs have 8GB of RAM each; that RAM holds the shared write cache and the variable-sized read cache, which I have set up rather large on our CX3-80: 3GB on each SP for read, and 2GB for write. The CX700 has 4GB (actually 3968MB), split 1GB read and 2GB write. The benchmarks that I did (which I can't release due to prohibitions in both EMC's and VMware's EULAs) showed that the performance differences with alignment versus without were insignificant with these 'RAID controllers'.
> 
> But for something inside the server, like a 3ware 9500 or similar, it might be worthwhile to align to stripe size, since that is a fixed constant for the logical drives that controller exports.
> 
> And Peter is very right: YMMV depending upon workload.  Our load for this system is, as can be inferred from the name of the machine, backups of a raw data set that are processed once and then archived.  I/O's per second isn't even on the radar for this workload; throughput, on the other hand, is.  And man these Clariions are fast.

For sequential I/O you won't notice any impact from misalignment, but for random I/O it could mean a 25-33% performance loss.

I'm sure EMC has white papers posted on aligning volumes for Exchange/SQL as well as VMware.

The 8GB cache only goes so far... Get enough server connections, or a couple of sequential I/O hogs like yours, and the cache effect disappears quickly.

Often the misalignment starts at the initiator and travels to the target: the initiator needs to read two blocks because it is off by one sector, then the target needs to read two chunks because one of those blocks crosses a chunk boundary, and so on.
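
A rough sketch of that cascade in Python, just to make the arithmetic concrete. All of the sizes are assumptions for illustration (512-byte sectors, 4 KiB blocks, a 64 KiB chunk, a partition starting at sector 63 versus sector 64); none of them are taken from the Clariion setup above, and the helper names are made up.

```python
SECTOR = 512          # assumed sector size in bytes
BLOCK = 4096          # assumed block size in bytes
CHUNK = 64 * 1024     # assumed RAID chunk size in bytes


def units_touched(offset, length, unit):
    """Count the fixed-size units overlapped by bytes [offset, offset + length)."""
    first = offset // unit
    last = (offset + length - 1) // unit
    return last - first + 1


def chunks_for_block(block_no, partition_start):
    """Chunks the target must touch to serve one block of the partition."""
    offset = partition_start + block_no * BLOCK
    return units_touched(offset, BLOCK, CHUNK)


misaligned = 63 * SECTOR   # hypothetical partition start at sector 63
aligned = 64 * SECTOR      # hypothetical partition start at sector 64

# One block-sized read at the start of each partition: how many
# underlying 4 KiB blocks does it span?
print(units_touched(misaligned, BLOCK, BLOCK))
print(units_touched(aligned, BLOCK, BLOCK))

# The same block number, seen from the target: how many chunks does it span?
print(chunks_for_block(8, misaligned))
print(chunks_for_block(8, aligned))
```

With these assumed numbers only some blocks straddle a chunk boundary, so the penalty at the target depends on how the random I/O happens to land, which is one reason the impact varies so much by workload.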

-Ross