[CentOS] home directory server performance issues

On 12/12/2012 11:52 AM, Matt Garman wrote:
> Now that it appears the hardware + software configuration can handle
> the load.  So I still have the same question: how I can accurately
> *quantify* the kind of IO load these servers have?  I.e., how to
> measure IOPS?

IOPS are given in the output of iostat, which you're logging.  iostat 
will report to you the number of read and write operations sent to the 
device per second in the "r/s" and "w/s" columns.  You also want to pay 
attention to "rrqm/s" and "wrqm/s".  Those two columns report the number 
of read/write requests per second that were merged while queued to the 
device.  To see whether requests are backing up, watch "avgqu-sz" (the 
average queue length) and "await" (the average time a request spends 
queued and being serviced).  If those numbers rise, it means that your 
disks aren't keeping up with the demands of applications.  Finally, 
%util is critical to understanding those numbers.  %util indicates the 
percentage of time during which I/O requests were issued to the device.  
As %util approaches 100%, the r/s and w/s columns show the maximum 
performance of your disks, and the disks are becoming a bottleneck to 
application performance.
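
If it helps, here is a rough Python sketch of pulling IOPS and %util out of 
logged "iostat -x" output.  It assumes the column layout shown in SAMPLE 
(column names differ between sysstat versions, so it keys on the header 
line rather than on positions).  The sample numbers and the 90% threshold 
are made up for illustration.

```python
def parse_iostat(text):
    """Yield one dict per device line of `iostat -x` output, keyed by column name."""
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].rstrip(":") == "Device":
            header = fields[1:]  # column names for the rows that follow
            continue
        if header and len(fields) == len(header) + 1:
            try:
                values = [float(v) for v in fields[1:]]
            except ValueError:
                continue  # not a data row
            row = dict(zip(header, values))
            row["device"] = fields[0]
            yield row


def iops_summary(rows, busy_threshold=90.0):
    """Yield (device, IOPS, %util, saturated?) where IOPS = r/s + w/s."""
    for row in rows:
        iops = row["r/s"] + row["w/s"]
        yield row["device"], iops, row["%util"], row["%util"] >= busy_threshold


# Made-up sample in the layout assumed above; not real measurements.
SAMPLE = """\
Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00    12.00  150.00  350.00  2400.00  5600.00    16.00     4.50    9.00   1.90  95.00
sdb               0.00     1.00   10.00   20.00   160.00   320.00    16.00     0.05    1.50   0.80   2.40
"""

for device, iops, util, saturated in iops_summary(parse_iostat(SAMPLE)):
    print(device, iops, util, "saturated" if saturated else "ok")
```

The IOPS you see on intervals where %util is near 100 is a reasonable 
estimate of what the array can actually deliver for your workload.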

> I agree with all that.  Problem is, there is a higher risk of storage
> failure with RAID-10 compared to RAID-6.  We do have good, reliable
> *data* backups, but no real hardware backup.  Our current service
> contract on the hardware is next business day.  That's too much down
> time to tolerate with this particular system.
...
> How do most people handle this kind of scenario, i.e. can't afford to
> have a hardware failure for any significant length of time?  Have a
> whole redundant system in place?  I would have to "sell" the idea to
> management, and for that, I'd need to precisely quantify our situation
> (i.e. my initial question).

You need an SLA in order to decide what array type is acceptable. 
Nothing is 100% reliable, so you need to decide how frequent a failure 
is acceptable in order to evaluate your options.  Once you establish an 
SLA, you need to gather data on the MTBF of all of the components in 
your array in order to determine the probability of concurrent failures 
which will disrupt service, or the failure of non-redundant components 
which also will disrupt service.  If you don't have an SLA, the 
observation that RAID10 is less resilient than RAID6 is no more useful 
than the observation that RAID10 performance per $ is vastly better, and 
somewhat less useful when you're asking others about improving performance.
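
As a rough illustration of turning MTBF into a probability of concurrent 
failures, here is a minimal sketch.  It assumes disks fail independently 
at a constant rate (an exponential model), which is optimistic in 
practice, and it ignores non-redundant components such as the controller. 
The disk count, rebuild time and MTBF are hypothetical placeholders, not 
figures for any real hardware.

```python
import math


def p_fail_within(hours, mtbf_hours):
    """Probability that one component fails within `hours` (constant failure rate)."""
    return 1.0 - math.exp(-hours / mtbf_hours)


def p_at_least(k, n, p):
    """Probability that at least k of n independent components fail, each with probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def raid10_loss_after_first_failure(rebuild_hours, mtbf_hours):
    # RAID10 loses data if the failed disk's mirror partner also fails
    # before the rebuild completes.
    return p_fail_within(rebuild_hours, mtbf_hours)


def raid6_loss_after_first_failure(n_disks, rebuild_hours, mtbf_hours):
    # RAID6 survives any two failures, so after the first failure it takes
    # two more among the remaining n_disks - 1 during the rebuild window.
    p = p_fail_within(rebuild_hours, mtbf_hours)
    return p_at_least(2, n_disks - 1, p)


# Hypothetical inputs: 12 disks, 24 hour rebuild, 1,000,000 hour MTBF.
r10 = raid10_loss_after_first_failure(24, 1_000_000)
r6 = raid6_loss_after_first_failure(12, 24, 1_000_000)
print("RAID10:", r10)
print("RAID6: ", r6)
```

Whether the difference between those two numbers matters depends entirely 
on what the SLA says is an acceptable failure rate.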

The question isn't whether or not you can afford a hardware failure; 
the question is what such a failure would cost.  Once you know 
how much an outage costs, how much your available options cost, and how 
frequently your available options will fail, you can make informed 
decisions about how much to spend on preventing a problem.  If you don't 
establish those costs and probabilities, you're really just guessing 
blindly.
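
The comparison can be as simple as the sketch below: the yearly cost of an 
option plus the expected yearly cost of the outages it still allows.  All 
option names and numbers here are hypothetical; you'd substitute your own 
outage cost, prices and failure rates.

```python
def expected_annual_cost(option_cost_per_year, failures_per_year,
                         outage_hours, cost_per_outage_hour):
    """Yearly cost of an option plus the expected yearly cost of its outages."""
    expected_outage_cost = failures_per_year * outage_hours * cost_per_outage_hour
    return option_cost_per_year + expected_outage_cost


# Hypothetical options: (cost per year, failures per year, hours down per failure).
options = {
    "next-business-day contract": (2000.0, 0.2, 24.0),
    "4-hour contract": (6000.0, 0.2, 4.0),
    "redundant standby server": (10000.0, 0.2, 0.5),
}

COST_PER_OUTAGE_HOUR = 1500.0  # hypothetical cost of one hour of downtime

for name, (cost, rate, hours) in options.items():
    total = expected_annual_cost(cost, rate, hours, COST_PER_OUTAGE_HOUR)
    print(name, total)
```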

>> data=journal actually offers better performance than the default in some
>> workloads, but not all.  You should try the default and see which is
>> better.  With a hardware RAID controller that has battery backed write
>> cache, data=journal should not perform any better than the default, but
>> probably not any worse.
>
> Right, that was mentioned in another response.  Unfortunately, I don't
> have the ability to test this.  My only system is the real production
> system.  I can't afford the interruption to the users while I fully
> unmount and mount the partition (can't change data= type with
> remount).

You were able to interrupt service to do a full system upgrade, so a 
re-mount seems trivial by comparison.  You don't even need to stop 
applications on the NFS clients.  If you stop NFS service on the server 
long enough to unmount/mount the FS, or just reboot during non-office 
hours, the clients will simply block for the duration of that 
maintenance and then continue.

> In general, it seems like a lot of IO tuning is "change parameter,
> then test".  But (1) what test?

Relative RAID performance is highly dependent on both your RAID 
controller and on your workload, which is why it's so hard to find data 
on the best available configuration.  There just isn't an answer that's 
suitable for everyone.  One RAID controller can be twice as fast as 
another.  In some workloads, RAID10 will be a small improvement over 
RAID6.  In others, it can be easily twice as fast as an array of similar 
size.  If your controller is good, you may be able to get better 
performance from a RAID6 array if you increase the number of member 
disks.  In some controllers, performance degrades as member disks increase.

You need to continue to record data from iostat.  If you were changing 
the array configuration, you'd look at changes in r/s and w/s relative 
to %util.  If you're only able to change the FS parameters, you're 
probably looking for a change that reduces your average %util.  Whereas 
changing the array might allow you more IOPS, changing the filesystem 
parameters will usually just smooth out the %util over time so that IOPS 
are less clustered.
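
To make that comparison concrete, here is a small sketch that summarizes 
two sets of iostat samples, one taken before a change and one after.  The 
(IOPS, %util) pairs are invented for illustration, and the 
"iops_per_util_point" ratio is just my shorthand for "r/s + w/s relative 
to %util", not a standard metric.

```python
from statistics import mean, pstdev


def util_profile(samples):
    """Summarize (iops, util_percent) samples, one tuple per iostat interval."""
    utils = [u for _, u in samples]
    total_iops = sum(i for i, _ in samples)
    total_util = sum(utils)
    return {
        "mean_util": mean(utils),
        "peak_util": max(utils),
        "util_stdev": pstdev(utils),  # lower means the load is less clustered
        "iops_per_util_point": total_iops / total_util if total_util else 0.0,
    }


# Invented samples: bursty before the change, smoother afterwards.
before = [(100.0, 20.0), (900.0, 100.0), (100.0, 20.0), (900.0, 100.0)]
after = [(500.0, 55.0), (500.0, 55.0), (500.0, 55.0), (500.0, 55.0)]

print("before:", util_profile(before))
print("after: ", util_profile(after))
```

A filesystem change that helps should show a lower mean and a lower 
standard deviation of %util for about the same IOPS.  An array change 
that helps should show more IOPS per point of %util.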

That's basically how data=journal operates in some workloads.  The 
journal should be one big contiguous block of sectors.  When 
the kernel flushes memory buffers to disk, it's able to perform the 
writes in sequence, which means that the disk heads don't need to seek 
much, and that transfer of buffers to disk is much faster than it would 
be if they were simply written to more or less random sectors across the 
disk.  The kernel can then use idle time to read those sectors back 
into memory, and then write them back to their final destination sectors.  
As you can see, this doubles the total number of writes, and so it 
greatly increases the number of IOPS on the storage array.  It also only 
works if your IO is relatively clustered, with idle time in between.  If 
your IO is already at steady saturation, it will make overall performance 
much worse.  Finally, data=journal requires that your journal is large 
enough to store all of the write buffers that will accumulate between 
idle periods that are long enough to allow that journal to be emptied. 
If your journal fills up, performance will suddenly and dramatically 
drop while the journal is emptied.  It's up to you to determine whether 
your IO is clustered in that manner, how much data is being written in 
those peaks, and how large your journal needs to be to support that.
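
One way to estimate that from your logs is sketched below.  It is a 
deliberately simplified model, not a description of how the kernel 
schedules journal writeback: every write lands in the journal first, and 
the journal only drains during seconds that count as idle.  The write 
series and drain rate are made-up numbers.

```python
def peak_journal_usage(writes_mb_per_sec, drain_mb_per_sec, idle_threshold_mb=0.0):
    """Return the peak MB held in the journal for a per-second write series.

    Simplified model: all incoming writes are added to the journal, and the
    journal drains at `drain_mb_per_sec` only during seconds where incoming
    writes are at or below `idle_threshold_mb`.
    """
    held = 0.0
    peak = 0.0
    for incoming in writes_mb_per_sec:
        held += incoming
        if incoming <= idle_threshold_mb:
            held = max(0.0, held - drain_mb_per_sec)
        peak = max(peak, held)
    return peak


# Made-up workload: two write bursts separated by idle time.
long_idle = [50] * 10 + [0] * 20 + [80] * 10 + [0] * 30
short_idle = [50] * 10 + [0] * 5 + [80] * 10

print(peak_journal_usage(long_idle, drain_mb_per_sec=30))
print(peak_journal_usage(short_idle, drain_mb_per_sec=30))
```

If the peak from your real write rates is larger than your journal, you're 
in the case where performance drops while the journal is emptied.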

If your RAID controller has a battery backed cache, it operates 
according to the same rules, and in basically the same way as 
data=journal.  All of your writes go to one place (the cache) and are 
written to the disk array during idle periods, and performance will tank 
if the cache fills.  Thus, if you have a battery backed cache, using 
data=journal should never improve performance.

If you're stuck with RAID6 (and, I guess, even if you're not), one of 
the options that you have is to add another fast disk specifically for 
the journal.  External journals and logs are recommended by virtually 
every database and application vendor I know of, but they're one of the 
least deployed options that I see.  Using an external journal on a very 
fast disk or array with data=journal means that your write path is 
separate from your read path, and journal flushes only really need the 
read load to go idle.  If you can add a single fast disk (or RAID1 array) and 
move your ext4 journal there, you will dramatically improve your array 
performance in virtually all workloads where there are mixed reads and 
writes.  I like to use fast SSDs for this purpose.  You don't need them 
to be very large.  An ultra-fast 8GB SSD (or RAID1 pair) is more than 
enough for the journal.
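
For reference, the usual sequence for moving an ext4 journal to another 
device looks roughly like the sketch below.  The script only prints the 
commands; it doesn't run anything.  The device names and mount point are 
hypothetical placeholders, and the 4096 byte block size is an assumption 
(the journal device's block size has to match the filesystem's).  Check 
mke2fs(8) and tune2fs(8) on your release before running any of it, and 
note that the filesystem has to be unmounted for the journal swap.

```python
import shlex


def external_journal_commands(fs_dev, journal_dev, mount_point, block_size=4096):
    """Return the commands (as argument lists) to move an ext4 journal to journal_dev."""
    return [
        # The filesystem must be unmounted before its journal is removed.
        ["umount", mount_point],
        # Format the fast disk as a dedicated journal device.
        ["mke2fs", "-O", "journal_dev", "-b", str(block_size), journal_dev],
        # Drop the existing internal journal.
        ["tune2fs", "-O", "^has_journal", fs_dev],
        # Attach the external journal.
        ["tune2fs", "-J", f"device={journal_dev}", fs_dev],
        # Mount with full data journaling.
        ["mount", "-o", "data=journal", fs_dev, mount_point],
    ]


# Hypothetical devices: /dev/sdb1 is the array, /dev/sdc1 is the SSD.
commands = external_journal_commands("/dev/sdb1", "/dev/sdc1", "/home")
for cmd in commands:
    print(shlex.join(cmd))
```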