[CentOS] home directory server performance issues
Gordon Messmer
yinyang at eburg.com
Fri Dec 14 19:13:04 UTC 2012
On 12/12/2012 11:52 AM, Matt Garman wrote:
> Now it appears the hardware + software configuration can handle the
> load, so I still have the same question: how can I accurately
> *quantify* the kind of IO load these servers have? I.e., how do I
> measure IOPS?
IOPS are given in the output of iostat, which you're logging. iostat
will report the number of read and write operations sent to the
device per second in the "r/s" and "w/s" columns. You also want to pay
attention to "rrqm/s" and "wrqm/s", which report how many adjacent
requests were merged before being issued, and to "avgqu-sz", the
average length of the request queue. If the queue length keeps rising,
it means that your disks aren't keeping up with the demands of
applications. Finally, the %util is critical to understanding those
numbers. %util indicates the percentage of elapsed time during which I/O
requests were issued to the device. As %util approaches 100%, the r/s
and w/s columns indicate the maximum performance of your disks, and
indicate that disks are becoming a bottleneck to application performance.
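For example, here's a quick way to pull those columns out of a captured
iostat -x log. The device name and numbers below are fabricated for
illustration, and the field positions assume the traditional sysstat
column order (rrqm/s wrqm/s r/s w/s ... %util); check your version's
header line before relying on them.

```shell
# One fabricated line of `iostat -x` output for device sda.
sample='sda 0.00 12.40 3.20 45.60 102.40 1843.20 79.70 0.45 9.20 1.80 8.80'

# Extract r/s ($4), w/s ($5), and %util ($12, the last column).
echo "$sample" | awk '{printf "r/s=%s w/s=%s util=%s%%\n", $4, $5, $12}'
```

The same awk one-liner works on a whole log file, which makes it easy
to graph the trend rather than eyeballing single samples.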
> I agree with all that. Problem is, there is a higher risk of storage
> failure with RAID-10 compared to RAID-6. We do have good, reliable
> *data* backups, but no real hardware backup. Our current service
> contract on the hardware is next business day. That's too much down
> time to tolerate with this particular system.
...
> How do most people handle this kind of scenario, i.e. can't afford to
> have a hardware failure for any significant length of time? Have a
> whole redundant system in place? I would have to "sell" the idea to
> management, and for that, I'd need to precisely quantify our situation
> (i.e. my initial question).
You need an SLA in order to decide what array type is acceptable.
Nothing is 100% reliable, so you need to decide how frequent a failure
is acceptable in order to evaluate your options. Once you establish an
SLA, you need to gather data on the MTBF of all of the components in
your array in order to determine the probability of concurrent failures
which will disrupt service, or the failure of non-redundant components
which also will disrupt service. If you don't have an SLA, the
observation that RAID10 is less resilient than RAID6 is no more useful
than the observation that RAID10's performance per dollar is vastly
better, and it's arguably less useful when you're asking others about
improving performance.
The question isn't one of whether or not you can afford a hardware
failure, the question is what would such a failure cost. Once you know
how much an outage costs, how much your available options cost, and how
frequently your available options will fail, you can make informed
decisions about how much to spend on preventing a problem. If you don't
establish those costs and probabilities, you're really just guessing
blindly.
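As a sketch of that arithmetic, here's a back-of-the-envelope estimate
of one concurrent-failure scenario: the chance that a RAID10 mirror
partner dies while its other half is being rebuilt. The annual failure
rate and rebuild window below are assumed values, not measurements;
substitute the MTBF data from your vendor.

```shell
# Probability that a specific second disk fails inside the rebuild
# window, assuming failures are independent and evenly spread in time.
awk 'BEGIN {
    afr       = 0.03    # annual failure rate per disk (assumed 3%)
    rebuild_h = 24      # rebuild window in hours (assumed)
    p = afr * rebuild_h / 8760   # fraction of a year spent rebuilding
    printf "P(partner fails during rebuild) = %.6f\n", p
}'
```

Multiply that by how often you expect to be rebuilding at all, and you
have a number you can weigh against the cost of an outage.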
>> data=journal actually offers better performance than the default in some
>> workloads, but not all. You should try the default and see which is
>> better. With a hardware RAID controller that has battery backed write
>> cache, data=journal should not perform any better than the default, but
>> probably not any worse.
>
> Right, that was mentioned in another response. Unfortunately, I don't
> have the ability to test this. My only system is the real production
> system. I can't afford the interruption to the users while I fully
> unmount and mount the partition (can't change data= type with
> remount).
You were able to interrupt service to do a full system upgrade, so a
re-mount seems trivial by comparison. You don't even need to stop
applications on the NFS clients. If you stop NFS service on the server
long enough to unmount/mount the FS, or just reboot during non-office
hours, the clients will simply block for the duration of that
maintenance and then continue.
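For what it's worth, the whole window can be as short as something like
this. The export path, device, and init-script names are examples for a
CentOS-era system; adjust them to match yours, and update /etc/fstab as
well if you want the option to persist.

```shell
# Sketch of the maintenance window, assuming the export is /srv/home
# on ext4 /dev/sdb1.  NFS clients block and then resume on their own.
service nfs stop                          # stop serving so the FS can unmount
umount /srv/home
mount -o data=journal /dev/sdb1 /srv/home # remount with the new journal mode
service nfs start
```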
> In general, it seems like a lot of IO tuning is "change parameter,
> then test". But (1) what test?
Relative RAID performance is highly dependent on both your RAID
controller and on your workload, which is why it's so hard to find data
on the best available configuration. There just isn't an answer that's
suitable for everyone. One RAID controller can be twice as fast as
another. In some workloads, RAID10 will be a small improvement over
RAID6. In others, it can be easily twice as fast as an array of similar
size. If your controller is good, you may be able to get better
performance from a RAID6 array if you increase the number of member
disks. In some controllers, performance degrades as member disks increase.
You need to continue to record data from iostat. If you were changing
the array configuration, you'd look at changes in r/s and w/s relative
to %util. If you're only able to change the FS parameters, you're
probably looking for a change that reduces your average %util. Whereas
changing the array might allow you more IOPS, changing the filesystem
parameters will usually just smooth out the %util over time so that IOPS
are less clustered.
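For example, to compare before/after averages from your logs: the two
inline samples here are fabricated, and the awk field position assumes
%util is the last column of the traditional sysstat layout.

```shell
# Two fabricated `iostat -x` samples for sda, one per line.
log='sda 0.00 10.0 2.0 40.0 96.0 1600.0 75.0 0.4 9.0 1.7 62.0
sda 0.00 14.0 4.0 52.0 128.0 2100.0 80.0 0.6 9.5 1.9 78.0'

# Average %util across all samples for the device.
echo "$log" | awk '$1 == "sda" { sum += $12; n++ }
                   END { printf "avg %%util = %.1f\n", sum/n }'
```

Run the same calculation over a day's log before and after a change,
and you have a single comparable number instead of a wall of samples.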
That's basically how data=journal operates in some workloads. The
journal should be one big, contiguous block of sectors. When
the kernel flushes memory buffers to disk, it's able to perform the
writes in sequence, which means that the disk heads don't need to seek
much, and that the transfer of buffers to disk is much faster than it
would be if they were simply written to more or less random sectors
across the disk. The kernel can then use idle time to read those
sectors back into memory and write them out to their final destination
sectors.
As you can see, this doubles the total number of writes, and so it
greatly increases the number of IOPS on the storage array. It also only
works if your IO is relatively clustered, with idle time in between. If
your IO is already a steady saturation, it will make overall performance
much worse. Finally, data=journal requires that your journal is large
enough to store all of the write buffers that will accumulate between
idle periods that are long enough to allow that journal to be emptied.
If your journal fills up, performance will suddenly and dramatically
drop while the journal is emptied. It's up to you to determine whether
your IO is clustered in that manner, how much data is being written in
those peaks, and how large your journal needs to be to support that.
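The sizing arithmetic itself is trivial once you've measured the
bursts from your iostat logs; the rate and duration below are
placeholders, not recommendations.

```shell
# The journal must absorb the largest write burst that arrives between
# idle periods long enough to drain it.
awk 'BEGIN {
    burst_mb_s = 40     # assumed peak write rate, MB/s
    burst_s    = 60     # assumed burst duration, seconds
    printf "journal should hold at least %d MB\n", burst_mb_s * burst_s
}'
```

If the product comes out larger than the journal you can afford, steady
saturation is closer than you think, and data=journal will hurt.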
If your RAID controller has a battery backed cache, it operates
according to the same rules, and in basically the same way as
data=journal. All of your writes go to one place (the cache) and are
written to the disk array during idle periods, and performance will tank
if the cache fills. Thus, if you have a battery backed cache, using
data=journal should never improve performance.
If you're stuck with RAID6 (and, I guess, even if you're not), one of
the options that you have is to add another fast disk specifically for
the journal. External journals and logs are recommended by virtually
every database and application vendor I know of, yet they're among the
least deployed options that I see. Using an external journal on a very fast
disk or array and data=journal means that your write path is separate
from your read path, and journal flushes only really depend on the read
load to idle. If you can add a single fast disk (or RAID1 array) and
move your ext4 journal there, you will dramatically improve your array
performance in virtually all workloads where there are mixed reads and
writes. I like to use fast SSDs for this purpose. You don't need them
to be very large. An ultra-fast 8GB SSD (or RAID1 pair) is more than
enough for the journal.
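The mechanics look roughly like this, assuming the filesystem is on
/dev/sdb1, the SSD (or RAID1 pair) is /dev/sdc1, and the filesystem is
unmounted while you work. These are standard e2fsprogs commands, but
treat the sequence as a sketch and test it somewhere disposable first.

```shell
mke2fs -O journal_dev /dev/sdc1     # format the SSD as a journal device
tune2fs -O ^has_journal /dev/sdb1   # remove the internal journal
tune2fs -o journal_data -J device=/dev/sdc1 /dev/sdb1
                                    # attach the external journal and set
                                    # data=journal as a default mount option
```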