On 12/12/2012 11:52 AM, Matt Garman wrote:
> Now that it appears the hardware + software configuration can handle
> the load. So I still have the same question: how I can accurately
> *quantify* the kind of IO load these servers have? I.e., how to
> measure IOPS?

IOPS are given in the output of iostat, which you're logging. iostat reports the number of read and write operations sent to the device per second in the "r/s" and "w/s" columns.

You also want to pay attention to "rrqm/s" and "wrqm/s". Those two columns report the number of read/write requests merged per second before being queued to the device; for the depth of the queue itself, watch "avgqu-sz". If the queue length keeps rising, your disks aren't keeping up with the demands of applications.

Finally, %util is critical to understanding those numbers. %util indicates the percentage of time during which I/O requests were being issued to the device. As %util approaches 100%, the r/s and w/s columns indicate the maximum performance of your disks, and the disks are becoming a bottleneck to application performance.

> I agree with all that. Problem is, there is a higher risk of storage
> failure with RAID-10 compared to RAID-6. We do have good, reliable
> *data* backups, but no real hardware backup. Our current service
> contract on the hardware is next business day. That's too much down
> time to tolerate with this particular system.
...
> How do most people handle this kind of scenario, i.e. can't afford to
> have a hardware failure for any significant length of time? Have a
> whole redundant system in place? I would have to "sell" the idea to
> management, and for that, I'd need to precisely quantify our situation
> (i.e. my initial question).

You need an SLA in order to decide what array type is acceptable. Nothing is 100% reliable, so you need to decide how frequent a failure is acceptable in order to evaluate your options.
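To turn those iostat columns into a number you can put in front of management, here's a minimal sketch. The sample triples and the 50% utilization cutoff are my own made-up assumptions, not anything from your logs; feed it the r/s, w/s and %util values you're already recording.

```python
# Hypothetical sketch: estimate a device's IOPS ceiling from logged
# iostat samples.  The numbers below are invented for illustration.

def estimate_max_iops(samples, min_util=50.0):
    """Extrapolate max IOPS as (r/s + w/s) / (%util / 100).

    Only samples with meaningful utilization are used, since the
    extrapolation is pure noise when the device is nearly idle.
    """
    estimates = [
        (r + w) / (util / 100.0)
        for r, w, util in samples
        if util >= min_util
    ]
    if not estimates:
        raise ValueError("no samples above the utilization threshold")
    return sum(estimates) / len(estimates)

# (r/s, w/s, %util) triples as iostat -dx would report them
samples = [
    (120.0, 60.0, 60.0),   # moderately busy
    (200.0, 100.0, 98.0),  # near saturation
    (5.0, 2.0, 3.0),       # idle -- excluded by the threshold
]
print(round(estimate_max_iops(samples)))
```

The near-saturation samples are the ones that matter: as %util approaches 100, (r/s + w/s) itself is the ceiling, and the division barely changes it.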
Once you establish an SLA, you need to gather data on the MTBF of all of the components in your array in order to determine the probability of concurrent failures that will disrupt service, or of the failure of non-redundant components, which will also disrupt service.

If you don't have an SLA, the observation that RAID10 is less resilient than RAID6 is no more useful than the observation that RAID10 performance per dollar is vastly better, and somewhat less useful when you're asking others about improving performance. The question isn't whether you can afford a hardware failure; the question is what such a failure would cost. Once you know how much an outage costs, how much your available options cost, and how frequently your available options will fail, you can make informed decisions about how much to spend on preventing a problem. If you don't establish those costs and probabilities, you're really just guessing blindly.

>> data=journal actually offers better performance than the default in some
>> workloads, but not all. You should try the default and see which is
>> better. With a hardware RAID controller that has battery backed write
>> cache, data=journal should not perform any better than the default, but
>> probably not any worse.
>
> Right, that was mentioned in another response. Unfortunately, I don't
> have the ability to test this. My only system is the real production
> system. I can't afford the interruption to the users while I fully
> unmount and mount the partition (can't change data= type with
> remount).

You were able to interrupt service to do a full system upgrade, so a re-mount seems trivial by comparison. You don't even need to stop applications on the NFS clients. If you stop NFS service on the server long enough to unmount/mount the FS, or just reboot during non-office hours, the clients will simply block for the duration of that maintenance and then continue.
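To make the earlier MTBF point concrete, a back-of-envelope sketch. The 500,000-hour disk MTBF, the 24-hour rebuild window, and the 10 surviving disks are assumptions for illustration only; plug in your vendor's numbers and your measured rebuild times.

```python
# Back-of-envelope sketch of the concurrent-failure probability.
# All figures here are assumed, not measured from any real array.
import math

def p_fail_within(hours, mtbf_hours):
    """P(a single disk fails within `hours`), exponential failure model."""
    return 1.0 - math.exp(-hours / mtbf_hours)

MTBF = 500_000.0      # assumed per-disk MTBF, hours
REBUILD = 24.0        # assumed rebuild window, hours

p = p_fail_within(REBUILD, MTBF)

# RAID10: after one disk dies, the array is lost only if that disk's
# specific mirror partner also dies during the rebuild.
p_raid10_loss = p

# RAID6: after one disk dies, the array survives one more failure; it
# is lost only if two or more of the surviving disks (assume 10 of
# them) die during the rebuild.  Approximate with a binomial tail.
n_remaining = 10
p_raid6_loss = sum(
    math.comb(n_remaining, k) * p**k * (1 - p)**(n_remaining - k)
    for k in range(2, n_remaining + 1)
)

print(f"{p_raid10_loss:.2e} vs {p_raid6_loss:.2e}")
```

With these assumed inputs the RAID6 loss probability is orders of magnitude smaller, which is exactly the number you'd weigh against RAID10's performance per dollar.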
> In general, it seems like a lot of IO tuning is "change parameter,
> then test". But (1) what test?

Relative RAID performance is highly dependent on both your RAID controller and on your workload, which is why it's so hard to find data on the best available configuration. There just isn't an answer that's suitable for everyone. One RAID controller can be twice as fast as another. In some workloads, RAID10 will be a small improvement over RAID6. In others, it can easily be twice as fast as an array of similar size. If your controller is good, you may be able to get better performance from a RAID6 array by increasing the number of member disks. With some controllers, performance degrades as member disks increase.

You need to continue to record data from iostat. If you were changing the array configuration, you'd look at changes in r/s and w/s relative to %util. If you're only able to change the FS parameters, you're probably looking for a change that reduces your average %util. Whereas changing the array might allow you more IOPS, changing the filesystem parameters will usually just smooth out the %util over time so that IOPS are less clustered.

That's basically how data=journal operates in some workloads. The journal should be one big contiguous block of sectors. When the kernel flushes memory buffers to disk, it's able to perform the writes in sequence, which means that the disk heads don't need to seek much, and that the transfer of buffers to disk is much faster than it would be if they were simply written to more or less random sectors across the disk. The kernel can then use idle time to read those sectors back into memory and write them back to their final destination sectors. As you can see, this doubles the total number of writes, and so it greatly increases the number of IOPS on the storage array. It also only works if your IO is relatively clustered, with idle time in between.
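A toy model makes the seek arithmetic behind that visible. The 8 ms seek and 0.1 ms per-block transfer figures are assumed round numbers for a rotating disk, not measurements.

```python
# Toy model of why data=journal can absorb a write burst faster.
# SEEK_MS and TRANSFER_MS are assumed illustrative figures.

SEEK_MS = 8.0       # assumed average seek + rotational latency
TRANSFER_MS = 0.1   # assumed per-4KiB-block transfer time

def random_writes_ms(n_blocks):
    """Each block lands at a scattered sector: pay a seek every time."""
    return n_blocks * (SEEK_MS + TRANSFER_MS)

def journaled_burst_ms(n_blocks):
    """One seek to the contiguous journal, then sequential writes.

    The later writeback to final sectors still happens (total writes
    double), but it runs during idle time, off the latency path.
    """
    return SEEK_MS + n_blocks * TRANSFER_MS

burst = 1000  # 4KiB blocks arriving in one burst
print(random_writes_ms(burst), journaled_burst_ms(burst))
```

Under these assumptions a 1000-block burst completes in roughly 108 ms against the journal versus roughly 8100 ms scattered across the platter, which is the smoothing of %util described above; the doubled writeback cost is what bites once there's no idle time left.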
If your IO is already at steady saturation, it will make overall performance much worse. Finally, data=journal requires that your journal is large enough to store all of the write buffers that will accumulate between idle periods long enough to allow the journal to be emptied. If your journal fills up, performance will suddenly and dramatically drop while the journal is emptied. It's up to you to determine whether your IO is clustered in that manner, how much data is being written in those peaks, and how large your journal needs to be to support that.

If your RAID controller has a battery backed cache, it operates according to the same rules, and in basically the same way, as data=journal. All of your writes go to one place (the cache) and are written to the disk array during idle periods, and performance will tank if the cache fills. Thus, if you have a battery backed cache, using data=journal should never improve performance.

If you're stuck with RAID6 (and, I guess, even if you're not), one of the options you have is to add another fast disk specifically for the journal. External journals and logs are recommended by virtually every database and application vendor I know of, yet they're one of the least deployed options that I see. Using an external journal on a very fast disk or array with data=journal means that your write path is separate from your read path, and journal flushes only depend on the read load going idle. If you can add a single fast disk (or RAID1 array) and move your ext4 journal there, you will dramatically improve your array performance in virtually all workloads with mixed reads and writes. I like to use fast SSDs for this purpose. You don't need them to be very large. An ultra-fast 8GB SSD (or RAID1 pair) is more than enough for the journal.
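The journal-sizing question above reduces to simple arithmetic once you've measured your bursts. The 50 MB/s peak rate, 60-second burst, and 2x headroom below are assumed examples; substitute what your iostat logs actually show.

```python
# Journal sizing sketch: the journal must hold everything written
# during a burst, since it only drains during idle periods.
# All rates and durations here are assumed examples.

def journal_size_mb(peak_write_mb_s, burst_seconds, headroom=2.0):
    """Minimum journal size: burst volume times a safety factor."""
    return peak_write_mb_s * burst_seconds * headroom

# e.g. bursts of 50 MB/s sustained for 60 s between idle periods
print(journal_size_mb(50, 60))   # MB
```

With those assumed numbers you'd want roughly a 6 GB journal, which is why a small, fast 8 GB SSD pair is usually plenty.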