I’m looking for advice and considerations on how to optimally set up and deploy an NFS-based home directory server. In particular: (1) how to determine hardware requirements, and (2) how to best set up and configure the server. We actually have a system in place, but the performance is pretty bad: the users often experience a fair amount of lag (1--5 seconds) when doing anything on their home directories, including an “ls” or writing a small text file.
So now I’m trying to back up and determine: is it simply a configuration issue, or is the hardware inadequate?
Our scenario: we have about 25 users, mostly software developers and analysts. The users log in to one or more of about 40 development servers. All users’ home directories live on a single server (no login except root); that server does an NFSv4 export which is mounted by all dev servers. The home directory server hardware is a Dell R510 with dual E5620 CPUs and 8 GB RAM. There are eight 15k 2.5” 600 GB drives (Seagate ST3600057SS) configured in hardware RAID-6 with a single hot spare. The RAID controller is a Dell PERC H700 w/512MB cache (Linux sees this as an LSI MegaSAS 9260). The OS is CentOS 5.6; the home directory partition is ext3, with options “rw,data=journal,usrquota”.
I have the HW RAID configured to present two virtual disks to the OS: /dev/sda for the OS (boot, root and swap partitions), and /dev/sdb for the home directories. I’m fairly certain I did not align the partitions optimally:
[root@lnxutil1 ~]# parted -s /dev/sda unit s print
Model: DELL PERC H700 (scsi)
Disk /dev/sda: 134217599s
Sector size (logical/physical): 512B/512B
Partition Table: msdos

Number  Start    End         Size        Type     File system  Flags
 1      63s      465884s     465822s     primary  ext2         boot
 2      465885s  134207009s  133741125s  primary               lvm
[root@lnxutil1 ~]# parted -s /dev/sdb unit s print
Model: DELL PERC H700 (scsi)
Disk /dev/sdb: 5720768639s
Sector size (logical/physical): 512B/512B
Partition Table: gpt

Number  Start  End          Size         File system  Name  Flags
 1      34s    5720768606s  5720768573s                      lvm
Can anyone confirm that the partitions are not aligned correctly, as I suspect? If this is true, is there any way to *quantify* the effects of partition mis-alignment on performance? In other words, what kind of improvement could I expect if I rebuilt this server with the partitions aligned optimally?
In general, what is the best way to determine the source of our performance issues? Right now, I’m running “iostat -dkxt 30” re-directed to a file. I intend to let this run for a day or so, and write a script to produce some statistics.
Here is one iteration from the iostat process:
Time: 09:37:28 AM
Device: rrqm/s  wrqm/s   r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda       0.00   44.09  0.03   107.76    0.13    607.40    11.27     0.89   8.27   7.27  78.35
sda1      0.00    0.00  0.00     0.00    0.00      0.00     0.00     0.00   0.00   0.00   0.00
sda2      0.00   44.09  0.03   107.76    0.13    607.40    11.27     0.89   8.27   7.27  78.35
sdb       0.00 2616.53  0.67   157.88    2.80  11098.83   140.04     8.57  54.08   4.21  66.68
sdb1      0.00 2616.53  0.67   157.88    2.80  11098.83   140.04     8.57  54.08   4.21  66.68
dm-0      0.00    0.00  0.03   151.82    0.13    607.26     8.00     1.25   8.23   5.16  78.35
dm-1      0.00    0.00  0.00     0.00    0.00      0.00     0.00     0.00   0.00   0.00   0.00
dm-2      0.00    0.00  0.67  2774.84    2.80  11099.37     8.00   474.30 170.89   0.24  66.84
dm-3      0.00    0.00  0.67  2774.84    2.80  11099.37     8.00   474.30 170.89   0.24  66.84
What I observe is that whenever sdb (home directory partition) becomes loaded, sda (OS) often does as well. Why is this? I would expect sda to generally be idle, or have minimal utilization. According to both “free” and “vmstat”, this server is not swapping at all.
At one point, our problems were due to a random user writing a huge file to their home directory. We built a second server specifically for people to use for writing large temporary files. Furthermore, for all the dev servers, I used the following tc commands to rate limit how quickly any one server can write to the home directory server (8 Mbps or 1 MB/s):
ETH_IFACE=$( route -n | grep "^0.0.0.0" | awk '{ print $8 }' )
IFACE_RATE=1000mbit
LIMIT_RATE=8mbit
TARGET_IP=1.2.3.4   # home directory server IP
tc qdisc add dev $ETH_IFACE root handle 1: htb default 1
tc class add dev $ETH_IFACE parent 1: classid 1:1 htb rate $IFACE_RATE ceil $IFACE_RATE
tc class add dev $ETH_IFACE parent 1: classid 1:2 htb rate $LIMIT_RATE ceil $LIMIT_RATE
tc filter add dev $ETH_IFACE parent 1: protocol ip prio 16 u32 match ip dst $TARGET_IP flowid 1:2
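To confirm the limit is actually being applied on a dev server, the qdisc/class counters can be inspected with the same interface variable; bytes sent to the home directory server should accumulate against class 1:2:

  tc -s class show dev $ETH_IFACE    # per-class byte/packet counters and drops
  tc -s filter show dev $ETH_IFACE   # the u32 match steering $TARGET_IP into 1:2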
The other interesting thing is that the second server I mentioned—the one specifically designed for users to “torture”—shows very low IO utilization, practically never going above 10%. That server is fairly different though: dual E5-2340 CPUs (more cores, but lower clock), 32 GB RAM. Disk subsystem is a Dell PERC H710 (LSI MegaRAID SAS 2208), and drives are 7200 RPM 1 TB (Seagate ST1000NM0001) in RAID-6. The OS is CentOS 6.3, NFS partition is ext4 with options “rw,relatime,barrier=1,data=ordered,usrquota”.
Ultimately, I plan to rebuild the home directory server with CentOS 6 (instead of 5), and align the partitions properly. But as of now, I don’t have a rational reason for doing that other than that the other server with this config doesn’t have performance problems. I’d like to be able to say specifically (i.e. quantify) exactly where the problems are and how they will be addressed by the upgrade/config change.
I’ll add that we want to use the “sec=krb5p” (i.e. encrypt everything) mount option for the home directories. We tried that with the home directory server, and it became virtually unusable. But we use that option on the other server with no issue. For now, as a stop-gap, we are just using the “sec=krb5” mount option (i.e., Kerberos authentication only, no data encryption). The server is still laggy, but at least usable.
Here is the output of “nfsstat -v” on the home directory server:

[root@lnxutil1 ~]# nfsstat -v
Server packet stats:
packets    udp        tcp        tcpconn
12560989   0          12544002   17146

Server rpc stats:
calls      badcalls   badclnt    badauth    xdrcall
12516995   922        0          922        0

Server reply cache:
hits       misses     nocache
0          0          12512316

Server file handle cache:
lookup     anon       ncachedir  ncachedir  stale
0          0          0          0          160

Server nfs v4:
null         compound
86        0% 12516096 99%

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit
0         0% 0         0% 0         0% 449630    1% 1131528   2% 191998    0%
create       delegpurge   delegreturn  getattr      getfh        link
2053      0% 0         0% 62931     0% 11210081 29% 1638995   4% 275       0%
lock         lockt        locku        lookup       lookup_root  nverify
196       0% 0         0% 196       0% 557606    1% 0         0% 0         0%
open         openattr     open_conf    open_dgrd    putfh        putpubfh
1274780   3% 0         0% 72561     0% 618       0% 12357089 32% 0         0%
putrootfh    read         readdir      readlink     remove       rename
160       0% 1548999   4% 44760     0% 625       0% 140946    0% 4229      0%
renew        restorefh    savefh       secinfo      setattr      setcltid
134103    0% 1157086   3% 1281276   3% 0         0% 133212    0% 143       0%
setcltidconf verify       write        rellockowner
113       0% 0         0% 4896102  12% 196       0%
Let me know if I can provide any more useful information. Thanks in advance for any pointers!
Matt Garman wrote:
I’m looking for advice and considerations on how to optimally setup and deploy an NFS-based home directory server. In particular: (1) how to determine hardware requirements, and (2) how to best setup and configure the server. We actually have a system in place, but the performance is pretty bad---the users often experience a fair amount of lag (1--5 seconds) when doing anything on their home directories, including an “ls” or writing a small text file.
So now I’m trying to back-up and determine, is it simply a configuration issue, or is the hardware inadequate?
<snip> Without poring over your info, let me give you something that bit us here: our home directory servers are all 5.x (in this case, 5.8). Here's the reason: when we tried 6.x, if you were in an NFS-mounted directory, working from the same or another NFS-mounted directory, it was *slow*. Unzipping a file of about 120 MB took 6.5-7 *minutes*, as opposed to 1 minute. After extensive testing (the numbers are still on our whiteboard here from when I did it many months ago), it didn't seem to matter what the workstation was running, but it did matter what the NFS server was running. You *can* solve it by changing from sync to async... if you're not worried about possible data loss or corruption. We do have to worry, since in some cases our researchers might be dumping many gigs of data into their home directories from a job that's been running for days, and no one wants to rerun that.
mark
On Mon, Dec 10, 2012 at 6:37 PM, Matt Garman matthew.garman@gmail.com wrote:
I’m looking for advice and considerations on how to optimally setup and deploy an NFS-based home directory server. In particular: (1) how to determine hardware requirements, and (2) how to best setup and configure the server. We actually have a system in place, but the performance is pretty bad---the users often experience a fair amount of lag (1--5 seconds) when doing anything on their home directories, including an “ls” or writing a small text file.
I know this is the CentOS forum; however, if you are still in a testing phase, then I can recommend you try Solaris derivatives like Nexenta or OmniOS. NFS server performance in Linux is simply not the same as on those systems using the same hardware. You also get true ACLs (not POSIX, but NFSv4 ACLs, comparable to those in NTFS), deduplication, compression, and snapshots (ZFS!).
Nexenta is free as in beer up to 18 TB and has a great web interface; OmniOS is just free, but you need to know how to use Solaris.
If you stay with the linux nfs servers, look into the io scheduler setting of the disks. I managed to double the performance of a proliant raid controller (don't remember which model, sorry) by changing the standard cfq to noop. Shortly after that I came across nexenta and moved all our NFS loads there. Later we got a netapp cluster, but the nexenta filers are still kicking around.
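On the scheduler point, for reference: checking and switching it is just a sysfs write. A minimal sketch, assuming the data array shows up as /dev/sdb as in the original post:

  # the active scheduler is shown in brackets, e.g. "noop anticipatory deadline [cfq]"
  cat /sys/block/sdb/queue/scheduler
  # switch to noop at runtime (not persistent; add elevator=noop to the kernel
  # boot line, or repeat this in rc.local, to keep it across reboots)
  echo noop > /sys/block/sdb/queue/scheduler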
On Mon, Dec 10, 2012 at 11:37:50AM -0600, Matt Garman wrote:
OS is CentOS 5.6, home directory partition is ext3, with options “rw,data=journal,usrquota”.
Is the data=journal option really wanted here? Did you try with the other journalling modes available? I also think you are missing the noatime option here.
The wiki has some information about raid math and ext3 journalling modes: http://wiki.centos.org/HowTos/Disk_Optimization
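For illustration, an fstab line along those lines might look like the following (the device path and mount point are just placeholders for your setup):

  # default (ordered) journalling, atime updates disabled
  /dev/mapper/vg_home-lv_home  /export/home  ext3  rw,noatime,data=ordered,usrquota  1 2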
At one point, our problems were due to a random user writing a huge file to their home directory.
This is the case in data=journal mode; the server has to write the data twice on disk.
On Tue, Dec 11, 2012 at 1:58 AM, Nicolas KOWALSKI nicolas.kowalski@gmail.com wrote:
On Mon, Dec 10, 2012 at 11:37:50AM -0600, Matt Garman wrote:
OS is CentOS 5.6, home directory partition is ext3, with options “rw,data=journal,usrquota”.
Is the data=journal option really wanted here? Did you try with the other journalling modes available? I also think you are missing the noatime option here.
Short answer: I don't know. Intuitively, it seems like it's not the right thing. However, there are a number of articles out there [1] that say data=journal may improve performance dramatically in cases where there is both a lot of reading and writing. That's what a home directory server is to me: a lot of reading and writing. However, I haven't seen any tool or mechanism for precisely quantifying when data=journal will improve performance; everyone just says "change it and test". Unfortunately, in my situation, I didn't have the luxury of testing, because things were unusable "now".
[1] for example: http://www.ibm.com/developerworks/linux/library/l-fs8/index.html
From: Matt Garman matthew.garman@gmail.com
I’m fairly certain I did not align the partitions optimally:

Number  Start    End         Size        Type     File system  Flags
 1      63s      465884s     465822s     primary  ext2         boot
 2      465885s  134207009s  133741125s  primary               lvm

Number  Start  End          Size         File system  Name  Flags
 1      34s    5720768606s  5720768573s                      lvm

Can anyone confirm that the partitions are not aligned correctly, as I suspect? If this is true, is there any way to *quantify* the effects of partition mis-alignment on performance? In other words, what kind of improvement could I expect if I rebuilt this server with the partitions aligned optimally?
They indeed do not look aligned... First, I am no expert, but: at one point, the minimum was to at least start on sector 64 instead of 63. Now, if you add RAID stripes, 4k disks... it is more complicated. https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/ht...
You can see the effects of non-alignment in images like these: http://www.ateamsystems.com/blog/FreeBSD-Partition-Alignment-RAID-SSD-4k-Dri...
Formatting also takes alignment parameters, for example stride and stripe-width for the ext filesystems. See the sketch below.
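As an illustration (the chunk size here is an assumption; the real value comes from the PERC configuration): with a 64 KiB stripe element, 4 KiB filesystem blocks and 6 data spindles in the RAID-6 set, stride = 64/4 = 16 and stripe-width = 16 * 6 = 96, so the format command for the planned ext4 rebuild would be roughly:

  # assumes a 64 KiB RAID chunk and 6 data disks (8-disk RAID-6 minus 2 parity)
  mkfs.ext4 -b 4096 -E stride=16,stripe-width=96 /dev/sdb1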
JD
On Mon, Dec 10, 2012 at 9:37 AM, Matt Garman matthew.garman@gmail.comwrote:
I’m looking for advice and considerations on how to optimally setup and deploy an NFS-based home directory server. In particular: (1) how to determine hardware requirements, and (2) how to best setup and configure the server. We actually have a system in place, but the performance is pretty bad---the users often experience a fair amount of lag (1--5 seconds) when doing anything on their home directories, including an “ls” or writing a small text file.
Just going to throw this out there. What is RPCNFSDCOUNT in /etc/sysconfig/nfs?
On Tue, 11 Dec 2012, Dan Young wrote:
Just going to throw this out there. What is RPCNFSDCOUNT in /etc/sysconfig/nfs?
This is in fact a very interesting question. The default value of RPCNFSDCOUNT (8) is in my opinion way too low for many kinds of NFS servers. My own setup has 7 NFS servers ranging from small ones (7 TB disk served) to larger ones (25 TB served), and there are about 1000 client cores making use of this. After spending some time looking at NFS performance problems, I discovered that the number of nfsd's had to be much higher to prevent stalls. On the largest servers I now use 256-320 nfsd's, and 64 nfsd's on the very smallest ones. Along with suitable adjustment of vm.dirty_ratio and vm.dirty_background_ratio, this makes a huge difference.
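To make that concrete, the knobs involved look roughly like this (the numbers below are only placeholders; pick them by watching for stalls as described):

  rpc.nfsd 128                              # raise the nfsd thread count on the fly
  sysctl -w vm.dirty_background_ratio=5     # start background writeback earlier
  sysctl -w vm.dirty_ratio=10               # throttle writers before dirty pages pile up
  grep ^th /proc/net/rpc/nfsd               # thread count plus a busy-time histogram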
Steve
On Tue, Dec 11, 2012 at 4:01 PM, Steve Thompson smt@vgersoft.com wrote:
This is in fact a very interesting question. The default value of RPCNFSDCOUNT (8) is in my opinion way too low for many kinds of NFS servers. My own setup has 7 NFS servers ranging from small ones (7 TB disk served) to larger ones (25 TB served), and there are about 1000 client cores making use of this. After spending some time looking at NFS performance problems, I discovered that the number of nfsd's had to be much higher to prevent stalls. On the largest servers I now use 256-320 nfsd's, and 64 nfsd's on the very smallest ones. Along with suitable adjustment of vm.dirty_ratio and vm.dirty_background_ratio, this makes a huge difference.
Could you perhaps elaborate a bit on your scenario? In particular, how much memory and CPU cores do the servers have with the really high NFSD counts? Is there a rule of thumb for nfsd counts relative to the system specs? Or, like so many IO tuning situations, just a matter of "test and see"?
On Wed, 12 Dec 2012, Matt Garman wrote:
Could you perhaps elaborate a bit on your scenario? In particular, how much memory and CPU cores do the servers have with the really high NFSD counts? Is there a rule of thumb for nfsd counts relative to the system specs? Or, like so many IO tuning situations, just a matter of "test and see"?
My NFS servers that run 256 nfsd's have four cores (Xeon, 3.16 GHz) and 16 GB memory, with three incoming network segments on which the clients live (each of which is a dual bonded GbE link). I don't know of any rule of thumb; indeed I am using 256 nfsd's at the moment because that is the nature of the current workload. It might be different in a few months' time, especially as we add more clients. Indeed I started with 64 nfsd's and kept adding more until the NFS stalls essentially stopped.
Steve
On 2012-12-11, Dan Young danielmyoung@gmail.com wrote:
On Mon, Dec 10, 2012 at 9:37 AM, Matt Garman matthew.garman@gmail.comwrote:
I’m looking for advice and considerations on how to optimally setup and deploy an NFS-based home directory server. In particular: (1) how to determine hardware requirements, and (2) how to best setup and configure the server. We actually have a system in place, but the performance is pretty bad---the users often experience a fair amount of lag (1--5 seconds) when doing anything on their home directories, including an “ls” or writing a small text file.
Just going to throw this out there. What is RPCNFSDCOUNT in /etc/sysconfig/nfs?
I was also bit by this issue after a recent migration. The default in CentOS 6 is 8, which was too small even for my group, which has only 10 or so NFS clients, and only a handful active at any one time.
It is easy to change the number of nfsd kernel threads on the fly: just do
rpc.nfsd NN
where NN is the number of threads you want. The kernel will adjust the number of running threads on the fly. If that solves your performance issue, then you can adjust RPCNFSDCOUNT accordingly.
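For example, to make a count of 64 permanent for the next restart of the nfs service (64 being just an example):

  # /etc/sysconfig/nfs
  RPCNFSDCOUNT=64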
--keith
On Tue, Dec 11, 2012 at 2:24 PM, Dan Young danielmyoung@gmail.com wrote:
Just going to throw this out there. What is RPCNFSDCOUNT in /etc/sysconfig/nfs?
It was 64 (upped from the default of... 8 I think).
On 12/10/2012 09:37 AM, Matt Garman wrote:
In particular: (1) how to determine hardware requirements
That may be difficult at this point, because you really want to start by measuring the number of IOPS. That's difficult to do if your applications demand more than your hardware currently provides.
the users often experience a fair amount of lag (1--5 seconds) when doing anything on their home directories, including an “ls” or writing a small text file.
This might not be the result of your NFS server performance. You might actually be seeing bad performance in your directory service. What are you using for that service? LDAP? NIS? Are you running nscd or sssd on the clients?
There are eight 15k 2.5” 600 GB drives (Seagate ST3600057SS) configured in hardware RAID-6 with a single hot spare. RAID controller is a Dell PERC H700 w/512MB cache (Linux sees this as a LSI MegaSAS 9260).
RAID 6 is good for $/GB, but bad for performance. If you find that your performance is bad, RAID10 will offer you a lot more IOPS.
Mixing 15k drives with RAID-6 is probably unusual. Typically 15k drives are used when the system needs maximum IOPS, and RAID-6 is used when storage capacity is more important than performance.
It's also unusual to see a RAID-6 array with a hot spare. You already have two disks of parity. At this point, your available storage capacity is only 600GB greater than a RAID-10 configuration, but your performance is MUCH worse.
OS is CentOS 5.6, home directory partition is ext3, with options “rw,data=journal,usrquota”.
data=journal actually offers better performance than the default in some workloads, but not all. You should try the default and see which is better. With a hardware RAID controller that has battery backed write cache, data=journal should not perform any better than the default, but probably not any worse.
I have the HW RAID configured to present two virtual disks to the OS: /dev/sda for the OS (boot, root and swap partitions), and /dev/sdb for the home directories. I’m fairly certain I did not align the partitions optimally:
If your drives are really 4k sectors, rather than the reported 512B, then they're not optimal and writes will suffer. The best policy is to start your first partition at 1M offset. parted should be aligning things well if it's updated, but if your partition sizes (in sectors) are divisible by 8, you should be in good shape.
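For example, when rebuilding, something like the following gives a 1 MiB-aligned data partition (device name as in the original post):

  parted -s /dev/sdb mklabel gpt
  parted -s /dev/sdb mkpart primary 1MiB 100%
  # recent parted versions can verify the result:
  parted /dev/sdb align-check optimal 1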
Here is one iteration from the iostat process:
Time: 09:37:28 AM
Device: rrqm/s  wrqm/s   r/s     w/s   rkB/s     wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda       0.00   44.09  0.03  107.76   0.13    607.40    11.27     0.89   8.27   7.27  78.35
sdb       0.00 2616.53  0.67  157.88   2.80  11098.83   140.04     8.57  54.08   4.21  66.68
If that's normal, you need a faster array configuration. That iteration caught both disks with a very high % of maximum utilization. Consider using RAID-10.
What I observe, is that whenever sdb (home directory partition) becomes loaded, sda (OS) often does as well. Why is this?
Regardless of what you export to the OS, if the RAID controller really only has one big RAID-6 array, you'd expect saturation of either OS disk to affect both.
On Wed, Dec 12, 2012 at 12:29 AM, Gordon Messmer yinyang@eburg.com wrote:
That may be difficult at this point, because you really want to start by measuring the number of IOPS. That's difficult to do if your applications demand more than your hardware currently provices.
Since my original posting, we temporarily moved the data from the centos 5 server to the centos 6 server. We rebuilt the original (slow) server with centos 6, then migrated the data back. So far (fingers crossed) so good. I'm running a constant "iostat -kx 30", and logging it to a file. Disk utilization is virtually always under 50%. Random spikes in the 90% range, but they are few and far between.
So now it appears the hardware + software configuration can handle the load. But I still have the same question: how can I accurately *quantify* the kind of IO load these servers have? I.e., how do I measure IOPS?
This might not be the result of your NFS server performance. You might actually be seeing bad performance in your directory service. What are you using for that service? LDAP? NIS? Are you running nscd or sssd on the clients?
Not using a directory service (manually sync'ed passwd files, and kerberos for authentication). Not running nscd or sssd.
RAID 6 is good for $/GB, but bad for performance. If you find that your performance is bad, RAID10 will offer you a lot more IOPS.
Mixing 15k drives with RAID-6 is probably unusual. Typically 15k drives are used when the system needs maximum IOPS, and RAID-6 is used when storage capacity is more important than performance.
It's also unusual to see a RAID-6 array with a hot spare. You already have two disks of parity. At this point, your available storage capacity is only 600GB greater than a RAID-10 configuration, but your performance is MUCH worse.
I agree with all that. Problem is, there is a higher risk of storage failure with RAID-10 compared to RAID-6. We do have good, reliable *data* backups, but no real hardware backup. Our current service contract on the hardware is next business day. That's too much down time to tolerate with this particular system.
As I typed that, I realized we technically do have a hardware backup---the other server I mentioned. But even the time to restore from backup would make a lot of people extremely unhappy.
How do most people handle this kind of scenario, i.e. can't afford to have a hardware failure for any significant length of time? Have a whole redundant system in place? I would have to "sell" the idea to management, and for that, I'd need to precisely quantify our situation (i.e. my initial question).
OS is CentOS 5.6, home directory partition is ext3, with options “rw,data=journal,usrquota”.
data=journal actually offers better performance than the default in some workloads, but not all. You should try the default and see which is better. With a hardware RAID controller that has battery backed write cache, data=journal should not perform any better than the default, but probably not any worse.
Right, that was mentioned in another response. Unfortunately, I don't have the ability to test this. My only system is the real production system. I can't afford the interruption to the users while I fully unmount and mount the partition (can't change data= type with remount).
In general, it seems like a lot of IO tuning is "change parameter, then test". But (1) what test? It's hard to simulate a very random/unpredictable workload like user home directories, and (2) what to test on when one only has the single production system? I wish there were more "analytic" tools where you could simply measure a number of attributes, and from there, derive the ideal settings and configuration parameters.
If your drives are really 4k sectors, rather than the reported 512B, then they're not optimal and writes will suffer. The best policy is to start your first partition at 1M offset. parted should be aligning things well if it's updated, but if your partition sizes (in sectors) are divisible by 8, you should be in good shape.
It appears that centos 6 does the 1M offset by default. Centos 5 definitely doesn't do that.
Anyway... as I suggested above, the problem appears to be resolved... But the "fix" was kind of a shotgun approach, i.e. I changed too many things at once to know exactly what specific item fixed the problem. I'm sure this will inevitably come up again at some point, so I'd still like to learn/understand more to better handle the situation next time.
Thanks!
On Wed, Dec 12, 2012 at 1:52 PM, Matt Garman matthew.garman@gmail.com wrote:
I agree with all that. Problem is, there is a higher risk of storage failure with RAID-10 compared to RAID-6.
Does someone have the real odds here? I think the big risks are always that you have unnoticed bad sectors on the remaining mirror/parity drive when you lose a disk or that you keep running long enough to develop them before replacing it.
We do have good, reliable *data* backups, but no real hardware backup. Our current service contract on the hardware is next business day. That's too much down time to tolerate with this particular system.
As I typed that, I realized we technically do have a hardware backup---the other server I mentioned. But even the time to restore from backup would make a lot of people extremely unhappy.
How do most people handle this kind of scenario, i.e. can't afford to have a hardware failure for any significant length of time? Have a whole redundant system in place? I would have to "sell" the idea to management, and for that, I'd need to precisely quantify our situation (i.e. my initial question).
The simple-minded approach is to have a spare chassis and some spare drives to match your critical boxes. The most likely things to go are the drives, so all you have to do is rebuild the raid. In the less likely event of a chassis failure, you can swap the drives into a spare a lot faster than copying the data. You only need a few spares to cover the likely failures across many production boxes, but storage servers might be a special case with a different chassis type. You are still going to have some downtime with this approach, though - and it works best where you have operations staff on site to do the swaps. Also, you need to test it to be sure you understand what you have to change to make the system come up with new NICs, etc.
On 12/12/2012 12:16 PM, Les Mikesell wrote:
On Wed, Dec 12, 2012 at 1:52 PM, Matt Garmanmatthew.garman@gmail.com wrote:
I agree with all that. Problem is, there is a higher risk of storage failure with RAID-10 compared to RAID-6.
Does someone have the real odds here? I think the big risks are always that you have unnoticed bad sectors on the remaining mirror/parity drive when you lose a disk or that you keep running long enough to develop them before replacing it.
a decent raid system does periodic 'scrubs' where in the background (when otherwise idle), it reads all the disks and verifies the raid. any marginal sectors should get detected and remapped at this point.
On Wed, Dec 12, 2012 at 2:24 PM, John R Pierce pierce@hogranch.com wrote:
I agree with all that. Problem is, there is a higher risk of storage failure with RAID-10 compared to RAID-6.
Does someone have the real odds here? I think the big risks are always that you have unnoticed bad sectors on the remaining mirror/parity drive when you lose a disk or that you keep running long enough to develop them before replacing it.
a decent raid system does periodic 'scrubs' where in the background (when otherwise idle), it reads all the disks and verifies the raid. any marginal sectors should get detected and remapped at this point.
Yes, but if you are doing it in software you need to make sure that the functionality is enabled and you are getting notifications from smartmon or something about the disk health - and in hardware you need some sort of controller-specific monitor running to track the drive health.
Matt Garman wrote:
On Wed, Dec 12, 2012 at 12:29 AM, Gordon Messmer yinyang@eburg.com wrote:
<snip>
As I typed that, I realized we technically do have a hardware backup---the other server I mentioned. But even the time to restore from backup would make a lot of people extremely unhappy.
How do most people handle this kind of scenario, i.e. can't afford to have a hardware failure for any significant length of time? Have a whole redundant system in place? I would have to "sell" the idea to management, and for that, I'd need to precisely quantify our situation (i.e. my initial question).
<snip> About selling it: ask them to consider what happens if one goes down... and "next day" service means someone shows up the next day (if you convince the OEM that you need on-site support). That does *not* guarantee that the server will be back up that next day (there was a Dell box where we replaced the m/b three times, the second one was dead, and they finally just replaced the box because no one could figure out what was going wrong; that took two weeks or so).
*Then*, once it's up, you get to restore everything to production. Try a tabletop exercise, as we have to do once a year, on what do we do with two or three scenarios, and guesstimate time for each. That might scare management into buying more hardware.
As long as they don't freeze your salary or lay someone off to pay for it....
mark
On 12/12/2012 11:52 AM, Matt Garman wrote:
Now that it appears the hardware + software configuration can handle the load. So I still have the same question: how I can accurately *quantify* the kind of IO load these servers have? I.e., how to measure IOPS?
IOPS are given in the output of iostat, which you're logging. iostat reports the number of read and write operations sent to the device per second in the "r/s" and "w/s" columns. You also want to pay attention to "avgqu-sz", the average request queue length, and the "rrqm/s"/"wrqm/s" columns, which count read/write requests merged per second while queued. If the queue size keeps rising, it means that your disks aren't keeping up with the demands of applications. Finally, the %util is critical to understanding those numbers. %util is the percentage of elapsed time during which I/O requests were being issued to the device. As %util approaches 100%, the r/s and w/s columns indicate the maximum performance of your disks, and the disks are becoming a bottleneck to application performance.
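If you want a quick summary out of the log you're already collecting, here is a rough sketch (it assumes plain "iostat -kx" output and that sdb is the device of interest):

  # average read/write IOPS and utilization for sdb over the whole log
  awk '$1 == "sdb" { r += $4; w += $5; u += $NF; n++ }
       END { if (n) printf "samples=%d avg r/s=%.1f avg w/s=%.1f avg util=%.1f%%\n",
                           n, r/n, w/n, u/n }' iostat.log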
I agree with all that. Problem is, there is a higher risk of storage failure with RAID-10 compared to RAID-6. We do have good, reliable *data* backups, but no real hardware backup. Our current service contract on the hardware is next business day. That's too much down time to tolerate with this particular system.
...
How do most people handle this kind of scenario, i.e. can't afford to have a hardware failure for any significant length of time? Have a whole redundant system in place? I would have to "sell" the idea to management, and for that, I'd need to precisely quantify our situation (i.e. my initial question).
You need an SLA in order to decide what array type is acceptable. Nothing is 100% reliable, so you need to decide how frequent a failure is acceptable in order to evaluate your options. Once you establish an SLA, you need to gather data on the MTBF of all of the components in your array in order to determine the probability of concurrent failures which will disrupt service, or the failure of non-redundant components which also will disrupt service. If you don't have an SLA, the observation that RAID10 is less resilient than RAID6 is no more useful than the observation that RAID10 performance per $ is vastly better, and somewhat less useful when you're asking others about improving performance.
The question isn't one of whether or not you can afford a hardware failure, the question is what would such a failure cost. Once you know how much an outage costs, how much your available options cost, and how frequently your available options will fail, you can make informed decisions about how much to spend on preventing a problem. If you don't establish those costs and probabilities, you're really just guessing blindly.
data=journal actually offers better performance than the default in some workloads, but not all. You should try the default and see which is better. With a hardware RAID controller that has battery backed write cache, data=journal should not perform any better than the default, but probably not any worse.
Right, that was mentioned in another response. Unfortunately, I don't have the ability to test this. My only system is the real production system. I can't afford the interruption to the users while I fully unmount and mount the partition (can't change data= type with remount).
You were able to interrupt service to do a full system upgrade, so a re-mount seems trivial by comparison. You don't even need to stop applications on the NFS clients. If you stop NFS service on the server long enough to unmount/mount the FS, or just reboot during non-office hours, the clients will simply block for the duration of that maintenance and then continue.
In general, it seems like a lot of IO tuning is "change parameter, then test". But (1) what test?
Relative RAID performance is highly dependent on both your RAID controller and on your workload, which is why it's so hard to find data on the best available configuration. There just isn't an answer that's suitable for everyone. One RAID controller can be twice as fast as another. In some workloads, RAID10 will be a small improvement over RAID6. In others, it can be easily twice as fast as an array of similar size. If your controller is good, you may be able to get better performance from a RAID6 array if you increase the number of member disks. In some controllers, performance degrades as member disks increase.
You need to continue to record data from iostat. If you were changing the array configuration, you'd look at changes in r/s and w/s relative to %util. If you're only able to change the FS parameters, you're probably looking for a change that reduces your average %util. Whereas changing the array might allow you more IOPS, changing the filesystem parameters will usually just smooth out the %util over time so that IOPS are less clustered.
That's basically how data=journal operates in some workloads. The journal should be one big contiguous block of sectors. When the kernel flushes memory buffers to disk, it's able to perform the writes in sequence, which means that the disk heads don't need to seek much, and that transfer of buffers to disk is much faster than it would be if they were simply written to more or less random sectors across the disk. The kernel can then use idle time to read those sectors back in to memory, and then write them back to their final destination sectors. As you can see, this doubles the total number of writes, and so it greatly increases the number of IOPS on the storage array. It also only works if your IO is relatively clustered, with idle time in between. If your IO is already a steady saturation, it will make overall performance much worse. Finally, data=journal requires that your journal is large enough to store all of the write buffers that will accumulate between idle periods that are long enough to allow that journal to be emptied. If your journal fills up, performance will suddenly and dramatically drop while the journal is emptied. It's up to you to determine whether your IO is clustered in that manner, how much data is being written in those peaks, and how large your journal needs to be to support that.
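On the journal-size point: the current size can be read from the superblock, and the journal can be recreated larger while the filesystem is unmounted (device name and size below are only examples):

  dumpe2fs -h /dev/sdb1 2>/dev/null | grep -i journal   # current journal parameters
  tune2fs -O ^has_journal /dev/sdb1                     # drop the existing journal (fs unmounted)
  tune2fs -J size=1024 /dev/sdb1                        # recreate it at 1024 MB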
If your RAID controller has a battery backed cache, it operates according to the same rules, and in basically the same way as data=journal. All of your writes go to one place (the cache) and are written to the disk array during idle periods, and performance will tank if the cache fills. Thus, if you have a battery backed cache, using data=journal should never improve performance.
If you're stuck with RAID6 (and, I guess, even if you're not), one of the options that you have is to add another fast disk specifically for the journal. External journals and logs are recommended by virtually every database and application vendor I know of, but one of the least deployed options that I see. Using an external journal on a very fast disk or array and data=journal means that your write path is separate from your read path, and journal flushes only really depend on the read load to idle. If you can add a single fast disk (or RAID1 array) and move your ext4 journal there, you will dramatically improve your array performance in virtually all workloads where there are mixed reads and writes. I like to use fast SSDs for this purpose. You don't need them to be very large. An ultra-fast 8GB SSD (or RAID1 pair) is more than enough for the journal.
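A rough sketch of that setup, assuming the fast device appears as /dev/sdc1 and the home-directory filesystem is /dev/sdb1 (both names hypothetical), with the filesystem unmounted while the journal is moved:

  mke2fs -O journal_dev -b 4096 /dev/sdc1   # format the SSD (or RAID1 pair) as an external journal;
                                            # the block size must match the data filesystem
  tune2fs -O ^has_journal /dev/sdb1         # detach the internal journal
  tune2fs -J device=/dev/sdc1 /dev/sdb1     # attach the external one
  tune2fs -o journal_data /dev/sdb1         # optional: make data=journal the default mount option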