 
            Dear list,
I thought I'd just share my experiences with this 3Ware card, and see if anyone might have any suggestions.
System: Supermicro H8DA8 with 2 x Opteron 250 2.4GHz and 4GB RAM installed. 9550SX-8LP hosting 4x Seagate ST3250820SV 250GB in a RAID 1 plus 2 hot spare config. The array is properly initialized, write cache is on, as is queueing (and supported by the drives). StoreSave set to Protection.
OS is CentOS 4.5 i386, minimal install, default partitioning as suggested by the installer (ext3, small /boot on /dev/sda1, remainder as / on LVM VolGroup with 2GB swap).
Firmware from 3Ware codeset 9.4.1.2 in use, firmware/driver details:
//serv1> /c0 show all
/c0 Driver Version = 2.26.05.007
/c0 Model = 9550SX-8LP
/c0 Memory Installed = 112MB
/c0 Firmware Version = FE9X 3.08.02.005
/c0 Bios Version = BE9X 3.08.00.002
/c0 Monitor Version = BL9X 3.01.00.006
I initially noticed something odd while installing CentOS 4.4: writing the inode tables took much longer than I expected (I thought the installer had frozen), and the system felt sluggish overall during its first yum update, certainly more sluggish than I'd expect from a comparatively powerful machine with hardware RAID 1.
I tried a few simple benchmarks (bonnie++, iozone, dd) and noticed up to 8 pdflush processes sitting in uninterruptible sleep when writing to disk, along with kjournald and kswapd from time to time. Load average during writing climbed considerably (to over 12), with 'ls' taking up to 30 seconds to give any output. I've tried CentOS 4.4, 4.5, RHEL AS 4 update 5 (just in case) and openSUSE 10.2, and they all show the same symptoms.
Googling around makes me think that this may be related to queue depth, nr_requests and possibly VM params (the latter from https://bugzilla.redhat.com/show_bug.cgi?id=121434#c275). These are the default settings:
/sys/block/sda/device/queue_depth = 254
/sys/block/sda/queue/nr_requests = 8192
/proc/sys/vm/dirty_expire_centisecs = 3000
/proc/sys/vm/dirty_ratio = 30
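For anyone wanting to compare their own defaults against the ones listed here, a minimal Python sketch that reads the same four files could look like this. The `root` parameter and the function name are illustrative additions (not from the post); `root` only exists so the function can be pointed at a copy of the tree.

```python
from pathlib import Path

# The four tunables quoted in the post; {dev} is the block device (sda there).
TUNABLES = [
    "sys/block/{dev}/device/queue_depth",
    "sys/block/{dev}/queue/nr_requests",
    "proc/sys/vm/dirty_expire_centisecs",
    "proc/sys/vm/dirty_ratio",
]


def read_tunables(dev="sda", root="/"):
    """Return {path: value string}, with None for files that cannot be read."""
    result = {}
    for template in TUNABLES:
        rel = template.format(dev=dev)
        try:
            value = (Path(root) / rel).read_text().strip()
        except OSError:
            # Missing file, no permission, or not a Linux box at all.
            value = None
        result["/" + rel] = value
    return result


if __name__ == "__main__":
    for path, value in read_tunables().items():
        print(f"{path} = {value}")
```

Files that do not exist on a given kernel simply show up as None rather than raising.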
3Ware mentions elevator=deadline, blockdev --setra 16384 along with nr_requests=512 in their performance tuning doc - these alone seem to make no difference to the latency problem.
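As a rough illustration of what those three settings amount to at the sysfs level, here is a Python sketch. It assumes a kernel that exposes writable `scheduler`, `read_ahead_kb` and `nr_requests` files under `/sys/block/<dev>/queue/` (the post's `elevator=deadline` is a boot parameter, so runtime switching is an assumption), and it relies on `blockdev --setra` counting 512-byte sectors. The function names are made up for the example.

```python
SECTOR_BYTES = 512  # blockdev --setra takes a count of 512-byte sectors


def setra_to_read_ahead_kb(sectors):
    """Convert a blockdev --setra value to the equivalent read_ahead_kb."""
    return sectors * SECTOR_BYTES // 1024


def tuning_plan(dev="sda", setra_sectors=16384, nr_requests=512,
                scheduler="deadline"):
    """Return (sysfs path, value) pairs for the settings quoted in the post."""
    base = f"/sys/block/{dev}/queue"
    return [
        (f"{base}/scheduler", scheduler),
        (f"{base}/read_ahead_kb", str(setra_to_read_ahead_kb(setra_sectors))),
        (f"{base}/nr_requests", str(nr_requests)),
    ]


def apply_plan(plan):
    """Write each value; needs root, and is deliberately not called here."""
    for path, value in plan:
        with open(path, "w") as fh:
            fh.write(value)


if __name__ == "__main__":
    for path, value in tuning_plan():
        print(f"{path} <- {value}")
```

Printing the plan first and only then calling `apply_plan` makes it easy to check what would change before touching a live system.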
Setting dirty_expire_centisecs = 1000 and dirty_ratio = 5 does indeed reduce the number of processes in 'b' state as reported by vmstat 1 during an iozone benchmark (./iozone -s 20480m -r 64 -i 0 -i 1 -t 1 -b filename.xls, as per 3Ware's own tuning doc), but the problem is obviously still there, just mitigated somewhat. The comparison graphs are in a PDF here: http://community.novacaster.com/attach.pl/7411/482/iozone_vm_tweaks_xls.pdf
Incidentally, the vmstat 1 output was directed to an NFS-mounted disk to avoid writing it to the array during the actual testing.
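A back-of-the-envelope calculation may help show why lowering dirty_ratio eases the stalls. The Python sketch below treats dirty_ratio as a plain percentage of the 4GB of installed RAM and uses the roughly 35MB/s write rate mentioned later in the thread. Both are simplifying assumptions: the kernel applies the ratio to a memory figure that varies by version (and by lowmem on 32-bit), so treat the numbers as an upper bound rather than a measurement.

```python
def dirty_limit_mb(ram_mb, dirty_ratio_percent):
    """Upper bound on dirty page cache, taking the ratio against all RAM."""
    return ram_mb * dirty_ratio_percent / 100.0


def drain_seconds(dirty_mb, write_mb_per_s):
    """Time to flush that much dirty data at a sustained write rate."""
    return dirty_mb / write_mb_per_s


if __name__ == "__main__":
    ram_mb = 4096          # installed RAM from the post
    write_rate = 35.0      # MB/s, approximate figure from later in the thread
    for ratio in (30, 5):  # default vs. the tweaked value
        limit = dirty_limit_mb(ram_mb, ratio)
        secs = drain_seconds(limit, write_rate)
        print(f"dirty_ratio={ratio}: up to {limit:.1f} MB dirty, "
              f"about {secs:.1f} s to drain")
```

Under these assumptions the default allows over a gigabyte of dirty data to queue up ahead of any other IO, which is at least consistent with 'ls' taking tens of seconds.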
I've tried eliminating LVM from the equation, going to ext2 rather than ext3 and booting single-processor all to no useful effect. I've also tried benchmarking with different blocksizes from 512B to 1M in powers of 2 and the problem remains - many processes in uninterruptible sleep blocking other IO. I'm about to start downloading CentOS 5 to give it a go, and after that I might have to resort to seeing if WinXP has the same issue.
My only real question is "where do I go from here?" I don't have enough specific tuning knowledge to know what else to look at.
Thanks for any pointers.
Simon
 
            Simon Banton wrote:
Dear list,
I thought I'd just share my experiences with this 3Ware card, and see if anyone might have any suggestions.
System: Supermicro H8DA8 with 2 x Opteron 250 2.4GHz and 4GB RAM installed. 9550SX-8LP hosting 4x Seagate ST3250820SV 250GB in a RAID 1 plus 2 hot spare config. The array is properly initialized, write cache is on, as is queueing (and supported by the drives). StoreSave set to Protection.
Well, the first thing I noted was that the H8DA8 was not on the list of compatible motherboards on the 3ware website.
My only real question is "where do I go from here?" I don't have enough specific tuning knowledge to know what else to look at.
Perhaps update to the latest firmware for both the motherboard and the 3ware board. Also check that the card is actually plugged into a 64-bit PCI-X 100/133 MHz slot and that it is running at those speeds. Next question: are you using a riser board?
 
            At 20:52 +0800 13/9/07, Feizhou wrote:
Well, the first thing I noted was that the H8DA8 was not on the list of compatible motherboards on the 3ware website.
I challenged the vendor about that quite early on and was told that they've used this combo before with no trouble, though I've yet to press them on this point - but yes, that's a possibility I've not ruled out.
The card's running the latest firmware that I know about, certainly the fw in codeset 9.4.1.2 is the version intended to go with the driver supplied in CentOS 4.5.
It's definitely in the right slot, not using a riser and it's running at the right speed:
//serv1> /c0 show all
[snip]
/c0 Controller Bus Type = PCIX
/c0 Controller Bus Width = 64 bits
/c0 Controller Bus Speed = 133 Mhz
I'll see if there's a motherboard BIOS update, thanks for the reminder on that.
If there is anyone on the list using a 9550SX with an H8DA8 could they let me know whether or not they're seeing anything similar to what I've described?
Thanks S.
 
            Simon Banton wrote:
At 20:52 +0800 13/9/07, Feizhou wrote:
Well, the first thing I noted was that the H8DA8 was not on the list of compatible motherboards on the 3ware website.
I challenged the vendor about that quite early on and was told that they've used this combo before with no trouble, though I've yet to press them on this point - but yes, that's a possibility I've not ruled out.
The card's running the latest firmware that I know about, certainly the fw in codeset 9.4.1.2 is the version intended to go with the driver supplied in CentOS 4.5.
It's definitely in the right slot, not using a riser and it's running at the right speed:
//serv1> /c0 show all
[snip]
/c0 Controller Bus Type = PCIX
/c0 Controller Bus Width = 64 bits
/c0 Controller Bus Speed = 133 Mhz
Hmm, how are you creating your ext3 filesystem(s) that you test on? Try creating it with a large journal (maybe 256MB) and run it in full journal mode.
 
            Simon Banton wrote:
Hmm, how are you creating your ext3 filesystem(s) that you test on? Try creating it with a large journal (maybe 256MB) and run it in full journal mode.
The filesystem was created during the initial CentOS installation, and I've tried it with ext2 which made no difference.
The journal size was probably 32MB then. A 128MB or larger journal in full journal mode on a 3ware card with a BBU write cache should make a big difference because fsync calls will now return OK as soon as it hits the write cache....oh....do you have a BBU for your write cache on your 3ware board?
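To make the large-journal suggestion concrete, here is a hedged Python sketch that only builds the two command lines involved and does not run them. It assumes the usual e2fsprogs syntax (`mke2fs -j -J size=<MB>`) and the ext3 `data=journal` mount option; the device and mount point are placeholders, and mke2fs would destroy whatever is on the device.

```python
def mkfs_cmd(device, journal_mb=256):
    """mke2fs invocation for ext3 with an explicit journal size in MB."""
    return ["mke2fs", "-j", "-J", f"size={journal_mb}", device]


def mount_cmd(device, mountpoint):
    """Mount the filesystem in full (data) journal mode."""
    return ["mount", "-t", "ext3", "-o", "data=journal", device, mountpoint]


if __name__ == "__main__":
    # /dev/sdX1 and /mnt/test are placeholders, not taken from the thread.
    print(" ".join(mkfs_cmd("/dev/sdX1")))
    print(" ".join(mount_cmd("/dev/sdX1", "/mnt/test")))
```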
 
            Simon Banton wrote:
At 17:34 +0800 14/9/07, Feizhou wrote:
.oh....do you have a BBU for your write cache on your 3ware board?
Not installed, but the machine's on a UPS.
Ugh. The 3ware code will not give OK then until the stuff has hit disk.
I see where you're going with larger journal idea and I'll give that a go.
Well, I do not think it will help much with a larger journal...you want RAM speed, not single 250GB SATA disk speed.
 
            Feizhou wrote:
Simon Banton wrote:
At 17:34 +0800 14/9/07, Feizhou wrote:
.oh....do you have a BBU for your write cache on your 3ware board?
Not installed, but the machine's on a UPS.
Ugh. The 3ware code will not give OK then until the stuff has hit disk.
I see where you're going with larger journal idea and I'll give that a go.
Well, I do not think it will help much with a larger journal...you want RAM speed, not single 250GB SATA disk speed.
Yes, a write-back cache with a BBU will definitely help, also your config,
4x Seagate ST3250820SV 250GB in a RAID 1 plus 2 hot spare config
is kinda wasteful, why not create a 4 disk RAID10 and get a 5th drive for a hot-spare.
Also think about getting 2 internal SATA drives for the OS and keeping the RAID10 purely for data; that should make things hum nicely and let you upgrade your data storage without messing with your OS/application installation. It wouldn't cost a lot either, 2 SATA drives + 1 SAS drive.
-Ross
______________________________________________________________________ This e-mail, and any attachments thereto, is intended only for use by the addressee(s) named herein and may contain legally privileged and/or confidential information. If you are not the intended recipient of this e-mail, you are hereby notified that any dissemination, distribution or copying of this e-mail, and any attachments thereto, is strictly prohibited. If you have received this e-mail in error, please immediately notify the sender and permanently delete the original and any copy or printout thereof.
 
            At 11:16 -0400 14/9/07, Ross S. W. Walker wrote:
Yes, a write-back cache with a BBU will definitely help, also your config,
The write-cache is enabled, but what I've not known up to now is that the absence of a BBU will impact IO performance in this way - which seems to be what you and Feizhou are saying. Is there any way to tell the card to forget about not having a BBU and behave as if it did?
The main problem here is the latency when under IO load not the throughput (or lack of). I don't care if it can't achieve 300MB/s sustained write speeds, only that it shouldn't bring the machine to its knees in the process of getting 35MB/s.
4x Seagate ST3250820SV 250GB in a RAID 1 plus 2 hot spare config
is kinda wasteful, why not create a 4 disk RAID10 and get a 5th drive for a hot-spare.
Logistics meant that it was more important to be able to cope with a disk failure without needing to visit the hosting centre immediately afterwards (which we'd have to do if there was only one hot spare).
Also think about getting 2 internal SATA drives for the OS and keep the RAID10 as purely for data, that should make things humm nicely and to be able to upgrade your data storage without messing with your OS/application installation. It wouldn't cost a lot either, 2 SATA drives + 1 SAS drive.
The server box is a Supermicro AS2020A - there is no onboard SATA nor any space for internal disks - there are 6 bays on a hot swap backplane and they're all cabled to the 3ware controller.
I've unpacked and fired up one of the other identical machines and moved the drives from the original to this one and booted straight off them.
The only difference between hardware is that the firmware on the 3Ware card in this one has not been updated (it's 3.04.00.005 from codeset 9.3.0 as opposed to 3.08.02.005 from 9.4.1.2).
# /opt/iozone/bin/iozone -s 20480m -r 64 -i 0 -i 1 -t 1
Original box: Initial write 34208.20703 Rewrite 38133.20313 Read 79596.36719 Re-read 79669.22656
Newly unpacked box: Initial write 50230.10547 Rewrite 46108.17969 Read 78739.14844 Re-read 79325.11719
... but the new one still shows the same IO blocking/responsiveness issue.
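To put the two sets of iozone figures side by side, a small Python sketch can compute the relative change per test. The numbers are copied from the results above; iozone's default reporting unit of KB/s is assumed, and the helper name is made up for the example.

```python
# iozone throughput figures from the post (assumed to be KB/s).
ORIGINAL = {
    "Initial write": 34208.20703,
    "Rewrite": 38133.20313,
    "Read": 79596.36719,
    "Re-read": 79669.22656,
}
UNPACKED = {
    "Initial write": 50230.10547,
    "Rewrite": 46108.17969,
    "Read": 78739.14844,
    "Re-read": 79325.11719,
}


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    for test, old in ORIGINAL.items():
        new = UNPACKED[test]
        print(f"{test:14s} {old / 1024:6.1f} -> {new / 1024:6.1f} MB/s "
              f"({percent_change(old, new):+.1f}%)")
```

The write figures differ noticeably between the two boxes while the reads are within about a percent of each other, which fits the observation that the responsiveness problem is not simply a throughput problem.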
S.
 
            Simon Banton wrote:
At 11:16 -0400 14/9/07, Ross S. W. Walker wrote:
Yes, a write-back cache with a BBU will definitely help, also your config,
The write-cache is enabled, but what I've not known up to now is that the absence of a BBU will impact IO performance in this way - which seems to be what you and Feizhou are saying. Is there any way to tell the card to forget about not having a BBU and behave as if it did?
Short of modifying the code...I do not know of any.
The main problem here is the latency when under IO load not the throughput (or lack of). I don't care if it can't achieve 300MB/s sustained write speeds, only that it shouldn't bring the machine to its knees in the process of getting 35MB/s.
SATA 7200RPM disks are not exactly fantastic at random i/o. The cache only boosts writes in a significant way and less so for reads. Maybe you could consider RAID10.
That way, them 7200RPM disks are a match if not better than hardware raid with BBU cache for a mirror of scsi drives.
 
            At 08:18 +0800 15/9/07, Feizhou wrote:
Is there any way to tell the card to forget about not having a BBU and behave as if it did?
Short of modifying the code...I do not know of any.
Well, I've now got BBUs on order for the three identical machines to see if that does anything to improve matters - I'll report back when I've fitted them. A glance through the 2.26.05.007 driver code shows no references to the BBU, so the different code paths (with BBU and without) must be in the firmware itself.
S.
 
            Simon Banton wrote:
At 08:18 +0800 15/9/07, Feizhou wrote:
Is there any way to tell the card to forget about not having a BBU and behave as if it did?
Short of modifying the code...I do not know of any.
Well, I've now got BBUs on order for the three identical machines to see if that does anything to improve matters - I'll report back when I've fitted them. A glance through the 2.26.05.007 driver code shows no references to the BBU, so the different code paths (with BBU and without) must be in the firmware itself.
If your card is on a PCI riser try running it plugged directly in the slot (if you can) and see if that helps.
-Ross
 
            Is there any way to tell the card to forget about not having a BBU and behave as if it did?
Short of modifying the code...I do not know of any.
Well, I've now got BBUs on order for the three identical machines to see if that does anything to improve matters - I'll report back when I've fitted them. A glance through the 2.26.05.007 driver code shows no references to the BBU, so the different code paths (with BBU and without) must be in the firmware itself.
If your card is on a PCI riser try running it plugged directly in the slot (if you can) and see if that helps.
He said his card is directly plugged in.
 
            Feizhou wrote:
Is there any way to tell the card to forget about not having a BBU and behave as if it did?
Short of modifying the code...I do not know of any.
Well, I've now got BBUs on order for the three identical machines to see if that does anything to improve matters - I'll report back when I've fitted them. A glance through the 2.26.05.007 driver code shows no references to the BBU, so the different code paths (with BBU and without) must be in the firmware itself.
If your card is on a PCI riser try running it plugged directly in the slot (if you can) and see if that helps.
He said his card is directly plugged in.
Doh, problem with the long threads, one forgets everything that was mentioned earlier unless they re-read the whole thread again.
-Ross
 
            At 23:07 +0800 14/9/07, Feizhou wrote:
Well, I do not think it will help much with a larger journal...you want RAM speed, not single 250GB SATA disk speed.
Right now, I'd be happy with being able to configure the 3Ware card as a plain old SATA II passthru interface and do software RAID1 with mdadm - but no, Export JBOD doesn't seem possible any more with the 9550 (unless the units have previously been JBODs on earlier cards); you've got to use their 'Single Disk' config, which exhibits exactly the same problems.
S.
 
            At 17:34 +0800 14/9/07, Feizhou wrote:
.oh....do you have a BBU for your write cache on your 3ware board?
Not installed, but the machine's on a UPS.
Ugh. The 3ware code will not give OK then until the stuff has hit disk.
Having now installed BBUs, it's made no difference to the underlying responsiveness problem I'm afraid.
With ports 2 and 3 now configured as RAID 0, with ext3 filesystem and mounted on /mnt/raidtest, running this bonnie++ command:
bonnie++ -m RA-256_NR-8192 -n 0 -u 0 -r 4096 -s 20480 -f -b -d /mnt/raidtest
(RA- and NR- relate to kernel params for readahead and nr_requests respectively - the values above are Centos post-installation defaults)
...causes load to climb:
16:36:12 up 13 min, 2 users, load average: 8.77, 4.78, 1.98
... and uninterruptible processes:
ps ax | grep D
  PID TTY      STAT   TIME COMMAND
   59 ?        D      0:03 [kswapd0]
 2159 ?        D      0:01 [kjournald]
 2923 ?        Ds     0:00 syslogd -m 0
 4155 ?        D      0:00 [pdflush]
 4175 ?        D      0:00 [pdflush]
 4192 ?        D      0:00 [pdflush]
 4193 ?        D      0:00 [pdflush]
 4197 ?        D      0:00 [pdflush]
 4199 ?        D      0:00 [pdflush]
 4201 pts/1    R+     0:00 grep D
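Note that `grep D` matches any line containing a capital D, which is why the header and the grep itself show up in that listing. A small Python sketch that filters on the STAT column instead might look like the following; the sample text is an excerpt of the listing above, and the function name is made up for the example.

```python
SAMPLE = """\
  PID TTY      STAT   TIME COMMAND
   59 ?        D      0:03 [kswapd0]
 2159 ?        D      0:01 [kjournald]
 2923 ?        Ds     0:00 syslogd -m 0
 4155 ?        D      0:00 [pdflush]
 4201 pts/1    R+     0:00 grep D
"""


def uninterruptible(ps_text):
    """Return (pid, command) for rows of `ps ax` output whose STAT starts with D."""
    rows = []
    for line in ps_text.splitlines():
        fields = line.split(None, 4)  # PID, TTY, STAT, TIME, rest is COMMAND
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # header or malformed line
        pid, _tty, stat, _time, command = fields
        if stat.startswith("D"):
            rows.append((int(pid), command))
    return rows


if __name__ == "__main__":
    for pid, command in uninterruptible(SAMPLE):
        print(pid, command)
```

On a live system the same function can be fed the captured output of `ps ax`.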
... plus an Out of Memory kill of sshd. Second time around (logged in on the console rather than over ssh), it's just the same except it's hald that happens to get clobbered instead.
Now that the presence or otherwise of a BBU has been ruled out along with OS, 3ware recommended kernel param tweaks, RAID level, LVM, slot speed, different but identical-spec hardware (both machine and card), what's left to try?
I see there's a new firmware version out today (3ware codeset 9.4.1.3 - driver's still at 2.26.05.007 but the fw's updated from 3.08.02.005 to 3.08.02.007), so I guess I'll update it and push the whole thing back up the hill for another go.
If there's anyone out there with a 9550SX and a two-disk RAID 1 or RAID 0 config on CentOS 4.5 who can give the above bonnie++ benchmark a go (params adjusted for their own installed RAM - I'm benchmarking using 5x my installed amount) and let me know if they also have the same responsiveness problem or not, I'd seriously appreciate it.
S.
 
            Simon Banton wrote:
At 17:34 +0800 14/9/07, Feizhou wrote:
.oh....do you have a BBU for your write cache on your 3ware board?
Not installed, but the machine's on a UPS.
Ugh. The 3ware code will not give OK then until the stuff has hit disk.
Having now installed BBUs, it's made no difference to the underlying responsiveness problem I'm afraid.
So a 3ware card will give OK once the stuff is in the cache and you have selected write-cache enable even if there is no BBU? My apologies. My previous experience has been with the 75xx and 85xx series which do not have ram caches.
With ports 2 and 3 now configured as RAID 0, with ext3 filesystem and mounted on /mnt/raidtest, running this bonnie++ command:
bonnie++ -m RA-256_NR-8192 -n 0 -u 0 -r 4096 -s 20480 -f -b -d /mnt/raidtest
(RA- and NR- relate to kernel params for readahead and nr_requests respectively - the values above are Centos post-installation defaults)
...causes load to climb:
16:36:12 up 13 min, 2 users, load average: 8.77, 4.78, 1.98
... and uninterruptible processes:
ps ax | grep D
  PID TTY      STAT   TIME COMMAND
   59 ?        D      0:03 [kswapd0]
 2159 ?        D      0:01 [kjournald]
 2923 ?        Ds     0:00 syslogd -m 0
 4155 ?        D      0:00 [pdflush]
 4175 ?        D      0:00 [pdflush]
 4192 ?        D      0:00 [pdflush]
 4193 ?        D      0:00 [pdflush]
 4197 ?        D      0:00 [pdflush]
 4199 ?        D      0:00 [pdflush]
 4201 pts/1    R+     0:00 grep D
... plus an Out of Memory kill of sshd. Second time around (logged in on the console rather than over ssh), it's just the same except it's hald that happens to get clobbered instead.
Are you saying that running in RAID0 mode with this card and motherboard combination, you get a memory leak? Who is the culprit?
Now that the presence or otherwise of a BBU has been ruled out along with OS, 3ware recommended kernel param tweaks, RAID level, LVM, slot speed, different but identical-spec hardware (both machine and card), what's left to try?
Bug report...
I see there's a new firmware version out today (3ware codeset 9.4.1.3 - driver's still at 2.26.05.007 but the fw's updated from 3.08.02.005 to 3.08.02.007), so I guess I'll update it and push the whole thing back up the hill for another go.
I hope that fixes things for you.
 
            At 07:46 +0800 24/9/07, Feizhou wrote:
... plus an Out of Memory kill of sshd. Second time around (logged in on the console rather than over ssh), it's just the same except it's hald that happens to get clobbered instead.
Are you saying that running in RAID0 mode with this card and motherboard combination, you get a memory leak? Who is the culprit?
I don't know if it's caused by a memory leak or something else, I'm just describing what happens. I would be tempted to suspect the RAM itself if another identical machine didn't have exactly the same issue.
what's left to try?
Bug report...
I've reported the issue to 3ware but they've not responded. I replicated the problem with RHEL AS 4 update 5 and contacted RedHat but they told me evaluation subscriptions aren't supported.
I see there's a new firmware version out today (3ware codeset 9.4.1.3... I guess I'll update it and push the whole thing back up the hill for another go.
I hope that fixes things for you.
Maybe I'm thinking about this all wrong - maybe this responsiveness issue won't even arise during normal operation, perhaps it's just a symptom of intensive benchmarking when all the resources of the machine are devoted to throwing data at the card/disks as fast as possible. I'm now way out of my depth, frankly.
I'm going to try the latest firmware upgrade, followed by RHEL/CentOS 5, and finally see if I can replicate with a different card (Areca or LSI, perhaps).
Thanks for all the feedback, at least I feel as if I've tried every conceivable obvious thing.
S.
 
            Simon Banton wrote:
At 07:46 +0800 24/9/07, Feizhou wrote:
... plus an Out of Memory kill of sshd. Second time around (logged in on the console rather than over ssh), it's just the same except it's hald that happens to get clobbered instead.
Are you saying that running in RAID0 mode with this card and motherboard combination, you get a memory leak? Who is the culprit?
I don't know if it's caused by a memory leak or something else, I'm just describing what happens. I would be tempted to suspect the RAM itself if another identical machine didn't have exactly the same issue.
what's left to try?
Bug report...
I've reported the issue to 3ware but they've not responded. I replicated the problem with RHEL AS 4 update 5 and contacted RedHat but they told me evaluation subscriptions aren't supported.
I see there's a new firmware version out today (3ware codeset 9.4.1.3... I guess I'll update it and push the whole thing back up the hill for another go.
I hope that fixes things for you.
Maybe I'm thinking about this all wrong - maybe this responsiveness issue won't even arise during normal operation, perhaps it's just a symptom of intensive benchmarking when all the resources of the machine are devoted to throwing data at the card/disks as fast as possible. I'm now way out of my depth, frankly.
I'm going to try the latest firmware upgrade, followed by RHEL/CentOS 5, and finally see if I can replicate with a different card (Areca or LSI, perhaps).
Thanks for all the feedback, at least I feel as if I've tried every conceivable obvious thing.
In the end it just may be that the card cannot perform under the load you need it to.
How about trying your benchmarks with the 'disktest' utility from the LTP (Linux Test Project)? It can benchmark raw block devices and gives very accurate results. It is also a lot easier to set up and use.
-Ross
 
            At 10:04 -0400 24/9/07, Ross S. W. Walker wrote:
How about trying your benchmarks with the 'disktest' utility from the LTP (Linux Test Project),
Now fetched and installed - I'd be grateful for a suggestion as to an appropriate disktest command line for a 4GB RAM twin CPU box with 250GB RAID 1 array, because I think you had your tongue in your cheek when you said:
it is also a lot easier to setup and use.
S.
 
            Simon Banton wrote:
At 10:04 -0400 24/9/07, Ross S. W. Walker wrote:
How about trying your benchmarks with the 'disktest' utility from the LTP (Linux Test Project),
Now fetched and installed - I'd be grateful for a suggestion as to an appropriate disktest command line for a 4GB RAM twin CPU box with 250GB RAID 1 array, because I think you had your tongue in your cheek when you said:
it is also a lot easier to setup and use.
Ok, so maybe it's easy for me... I forget that nothing is easy the first time using it.
Ok, so here is the command I would use:
Sequential reads: disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 300 -r /dev/sdX
Sequential writes: disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 300 -w /dev/sdX
Random reads: disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 300 -r /dev/sdX
Random writes: disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 300 -w /dev/sdX
Description of the options used:
-B 4k = 4k block ios
-h 1 = 1 second heartbeat
-I BD = block device, direct io
-K 4 = 4 threads, or 4 outstanding/overlapping ios, typical pattern (use -K 1 for the raw performance of single drive, aka dd type output)
-p <l|r> = io type, l=linear, r=random
-P T = output metrics type "Throughput"
-T 300 = duration of test 300 seconds
-r = read
-w = write
These tests will run across the whole disk/partition and the write tests WILL BE DESTRUCTIVE so be warned!
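The four invocations differ only in the -p pattern and the -r/-w direction, so they can be generated from one helper. This Python sketch only assembles the argument list from the flags described above and never runs disktest; the function name and its parameters are illustrative, and /dev/sdX remains a placeholder.

```python
def disktest_argv(device, pattern, direction, block="4k", threads=4, seconds=300):
    """Build a disktest command line.

    pattern:   "l" (linear/sequential) or "r" (random)
    direction: "r" (read) or "w" (write, DESTRUCTIVE on the target device)
    """
    if pattern not in ("l", "r"):
        raise ValueError("pattern must be 'l' or 'r'")
    if direction not in ("r", "w"):
        raise ValueError("direction must be 'r' or 'w'")
    return [
        "disktest",
        "-B", block,         # block size per IO
        "-h", "1",           # 1 second heartbeat
        "-I", "BD",          # block device, direct IO
        "-K", str(threads),  # outstanding/overlapping IOs
        "-p", pattern,
        "-P", "T",           # report throughput
        "-T", str(seconds),  # test duration
        "-" + direction,
        device,
    ]


if __name__ == "__main__":
    for label, pattern, direction in [
        ("Sequential reads", "l", "r"),
        ("Sequential writes", "l", "w"),
        ("Random reads", "r", "r"),
        ("Random writes", "r", "w"),
    ]:
        print(f"{label}: {' '.join(disktest_argv('/dev/sdX', pattern, direction))}")
```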
-Ross
 
            At 13:35 -0400 24/9/07, Ross S. W. Walker wrote:
Ok, so here is the command I would use:
Thanks - here are the results (tried CentOS 4.5 and RHEL5, with tests on sdb when configured as both RAID 0 and as RAID 1):
Sequential reads: disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 300 -r /dev/sdX
CentOS 4.5, RAID 0:
| 2007/09/25-14:26:58 | STAT | 13944 | v1.2.8 | /dev/sdb | Total read throughput: 50249728.0B/s (47.92MB/s), IOPS 12268.0/s.
| 2007/09/25-14:26:58 | END | 13944 | v1.2.8 | /dev/sdb | Test Done (Passed)
CentOS 4.5, RAID 1:
| 2007/09/25-14:20:06 | STAT | 13807 | v1.2.8 | /dev/sdb | Total read throughput: 44994150.4B/s (42.91MB/s), IOPS 10984.9/s.
| 2007/09/25-14:20:06 | END | 13807 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 0:
| 2007/09/25-11:07:46 | STAT | 2835 | v1.2.8 | /dev/sdb | Total read throughput: 2405171.2B/s (2.29MB/s), IOPS 587.2/s.
| 2007/09/25-11:07:46 | END | 2835 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 1:
| 2007/09/25-11:35:53 | STAT | 3022 | v1.2.8 | /dev/sdb | Total read throughput: 2461696.0B/s (2.35MB/s), IOPS 601.0/s.
| 2007/09/25-11:35:53 | END | 3022 | v1.2.8 | /dev/sdb | Test Done (Passed)
Sequential writes: disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 300 -w /dev/sdX
CentOS 4.5, RAID 0:
| 2007/09/25-14:28:19 | STAT | 13951 | v1.2.8 | /dev/sdb | Total write throughput: 66150946.1B/s (63.09MB/s), IOPS 16150.1/s.
| 2007/09/25-14:28:19 | END | 13951 | v1.2.8 | /dev/sdb | Test Done (Passed)
CentOS 4.5, RAID 1:
| 2007/09/25-14:21:52 | STAT | 13815 | v1.2.8 | /dev/sdb | Total write throughput: 53170039.5B/s (50.71MB/s), IOPS 12981.0/s.
| 2007/09/25-14:21:52 | END | 13815 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 0:
| 2007/09/25-11:13:44 | STAT | 2850 | v1.2.8 | /dev/sdb | Total write throughput: 66031616.0B/s (62.97MB/s), IOPS 16121.0/s.
| 2007/09/25-11:13:44 | END | 2850 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 1:
| 2007/09/25-11:36:36 | STAT | 3031 | v1.2.8 | /dev/sdb | Total write throughput: 56870229.3B/s (54.24MB/s), IOPS 13884.3/s.
| 2007/09/25-11:36:36 | END | 3031 | v1.2.8 | /dev/sdb | Test Done (Passed)
Random reads: disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 300 -r /dev/sdX
CentOS 4.5, RAID 0:
| 2007/09/25-14:28:59 | STAT | 13958 | v1.2.8 | /dev/sdb | Total read throughput: 504217.6B/s (0.48MB/s), IOPS 123.1/s.
| 2007/09/25-14:28:59 | END | 13958 | v1.2.8 | /dev/sdb | Test Done (Passed)
CentOS 4.5, RAID 1:
| 2007/09/25-14:23:14 | STAT | 13822 | v1.2.8 | /dev/sdb | Total read throughput: 549570.2B/s (0.52MB/s), IOPS 134.2/s.
| 2007/09/25-14:23:14 | END | 13822 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 0:
| 2007/09/25-11:16:21 | STAT | 2875 | v1.2.8 | /dev/sdb | Total read throughput: 273612.8B/s (0.26MB/s), IOPS 66.8/s.
| 2007/09/25-11:16:21 | END | 2875 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 1:
| 2007/09/25-11:39:20 | STAT | 3042 | v1.2.8 | /dev/sdb | Total read throughput: 546816.0B/s (0.52MB/s), IOPS 133.5/s.
| 2007/09/25-11:39:20 | END | 3042 | v1.2.8 | /dev/sdb | Test Done (Passed)
Random writes: disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 300 -w /dev/sdX
CentOS 4.5, RAID 0:
| 2007/09/25-14:29:34 | STAT | 13965 | v1.2.8 | /dev/sdb | Total write throughput: 1379532.8B/s (1.32MB/s), IOPS 336.8/s.
| 2007/09/25-14:29:34 | END | 13965 | v1.2.8 | /dev/sdb | Test Done (Passed)
CentOS 4.5, RAID 1:
| 2007/09/25-14:24:15 | STAT | 13829 | v1.2.8 | /dev/sdb | Total write throughput: 782199.5B/s (0.75MB/s), IOPS 191.0/s.
| 2007/09/25-14:24:15 | END | 13829 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 0:
| 2007/09/25-11:19:21 | STAT | 2894 | v1.2.8 | /dev/sdb | Total write throughput: 1377894.4B/s (1.31MB/s), IOPS 336.4/s.
| 2007/09/25-11:19:21 | END | 2894 | v1.2.8 | /dev/sdb | Test Done (Passed)
RHEL5, RAID 1:
| 2007/09/25-11:40:08 | STAT | 3049 | v1.2.8 | /dev/sdb | Total write throughput: 798310.4B/s (0.76MB/s), IOPS 194.9/s.
| 2007/09/25-11:40:08 | END | 3049 | v1.2.8 | /dev/sdb | Test Done (Passed)
I'm not sure what to make of it, mind you.
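One way to start making sense of these numbers is to pull the throughput and IOPS out of each STAT line and check that they agree with the 4k block size. The Python sketch below does that for one line copied from the results above; the regular expression is written against that output format only and is an assumption about disktest v1.2.8's wording, not a documented interface.

```python
import re

STAT_RE = re.compile(
    r"Total (read|write) throughput: ([\d.]+)B/s \(([\d.]+)MB/s\), "
    r"IOPS ([\d.]+)/s"
)


def parse_stat(line):
    """Extract direction, bytes/s, MB/s and IOPS from a disktest STAT line."""
    match = STAT_RE.search(line)
    if match is None:
        return None
    direction, bps, mbps, iops = match.groups()
    return {
        "direction": direction,
        "bytes_per_s": float(bps),
        "mb_per_s": float(mbps),
        "iops": float(iops),
    }


LINE = ("| 2007/09/25-14:26:58 | STAT | 13944 | v1.2.8 | /dev/sdb | "
        "Total read throughput: 50249728.0B/s (47.92MB/s), IOPS 12268.0/s.")

if __name__ == "__main__":
    stat = parse_stat(LINE)
    block = 4096  # -B 4k
    print(stat)
    # IOPS times block size should reproduce the reported bytes per second.
    print(stat["iops"] * block, stat["bytes_per_s"])
```

Seen this way, the RHEL5 sequential read figure of 587.2 IOPS sits much closer to the random-IO results than to the CentOS 4.5 sequential ones.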
Cheers S.
 
            Simon Banton wrote:
At 13:35 -0400 24/9/07, Ross S. W. Walker wrote:
Ok, so here is the command I would use:
Thanks - here are the results (tried CentOS 4.5 and RHEL5, with tests on sdb when configured as both RAID 0 and as RAID 1):
Sequential reads: disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 300 -r /dev/sdX
CentOS 4.5, RAID 0: | 2007/09/25-14:26:58 | STAT | 13944 | v1.2.8 | /dev/sdb | Total read throughput: 50249728.0B/s (47.92MB/s), IOPS 12268.0/s. | 2007/09/25-14:26:58 | END | 13944 | v1.2.8 | /dev/sdb | Test Done (Passed)
Ok, this is a 2 disk RAID0? If so then this is ok, not the fastest config (60MB/s for fast drives) but mid-level SATA performance.
CentOS 4.5, RAID 1: | 2007/09/25-14:20:06 | STAT | 13807 | v1.2.8 | /dev/sdb | Total read throughput: 44994150.4B/s (42.91MB/s), IOPS 10984.9/s. | 2007/09/25-14:20:06 | END | 13807 | v1.2.8 | /dev/sdb | Test Done (Passed)
Statistically equivalent to RAID0, which is a good sign, as it means the 3ware is doing striped reads off a RAID1.
RHEL5, RAID 0: | 2007/09/25-11:07:46 | STAT | 2835 | v1.2.8 | /dev/sdb | Total read throughput: 2405171.2B/s (2.29MB/s), IOPS 587.2/s. | 2007/09/25-11:07:46 | END | 2835 | v1.2.8 | /dev/sdb | Test Done (Passed)
OK, there is a problem here with the driver on RHEL5. Are you running the latest version from 3ware's site?
Can you send the output of a modinfo <driver name>?
RHEL5, RAID 1: | 2007/09/25-11:35:53 | STAT | 3022 | v1.2.8 | /dev/sdb | Total read throughput: 2461696.0B/s (2.35MB/s), IOPS 601.0/s. | 2007/09/25-11:35:53 | END | 3022 | v1.2.8 | /dev/sdb | Test Done (Passed)
Same bad result here too... at least it's consistently bad, definitely points to a bad driver.
Sequential writes: disktest -B 4k -h 1 -I BD -K 4 -p l -P T -T 300 -w /dev/sdX
CentOS 4.5, RAID 0: | 2007/09/25-14:28:19 | STAT | 13951 | v1.2.8 | /dev/sdb | Total write throughput: 66150946.1B/s (63.09MB/s), IOPS 16150.1/s. | 2007/09/25-14:28:19 | END | 13951 | v1.2.8 | /dev/sdb | Test Done (Passed)
Good write performance here, the BBU cache is definitely helping.
CentOS 4.5, RAID 1: | 2007/09/25-14:21:52 | STAT | 13815 | v1.2.8 | /dev/sdb | Total write throughput: 53170039.5B/s (50.71MB/s), IOPS 12981.0/s. | 2007/09/25-14:21:52 | END | 13815 | v1.2.8 | /dev/sdb | Test Done (Passed)
Also good write performance here with the BBU cache, RAID1 is going to be slower by nature as it writes twice for each write, but the BBU cache is minimizing the hurt.
RHEL5, RAID 0: | 2007/09/25-11:13:44 | STAT | 2850 | v1.2.8 | /dev/sdb | Total write throughput: 66031616.0B/s (62.97MB/s), IOPS 16121.0/s. | 2007/09/25-11:13:44 | END | 2850 | v1.2.8 | /dev/sdb | Test Done (Passed)
Write performance on RHEL 5 doesn't seem to be affected here, maybe it's only read performance, maybe the BBU cache is hiding the problem.
RHEL5, RAID 1: | 2007/09/25-11:36:36 | STAT | 3031 | v1.2.8 | /dev/sdb | Total write throughput: 56870229.3B/s (54.24MB/s), IOPS 13884.3/s. | 2007/09/25-11:36:36 | END | 3031 | v1.2.8 | /dev/sdb | Test Done (Passed)
Same thing here, good write performance.
Random reads: disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 300 -r /dev/sdX
CentOS 4.5, RAID 0: | 2007/09/25-14:28:59 | STAT | 13958 | v1.2.8 | /dev/sdb | Total read throughput: 504217.6B/s (0.48MB/s), IOPS 123.1/s. | 2007/09/25-14:28:59 | END | 13958 | v1.2.8 | /dev/sdb | Test Done (Passed)
And here is where the difference between a 15K drive and a 7200 RPM drive appears, though with RAID0 one would expect to see around 1MB/s. What chunk size does it use, 64K?
CentOS 4.5, RAID 1: | 2007/09/25-14:23:14 | STAT | 13822 | v1.2.8 | /dev/sdb | Total read throughput: 549570.2B/s (0.52MB/s), IOPS 134.2/s. | 2007/09/25-14:23:14 | END | 13822 | v1.2.8 | /dev/sdb | Test Done (Passed)
This is the correct performance of a RAID1 for 7200 RPM drives.
RHEL5, RAID 0: | 2007/09/25-11:16:21 | STAT | 2875 | v1.2.8 | /dev/sdb | Total read throughput: 273612.8B/s (0.26MB/s), IOPS 66.8/s. | 2007/09/25-11:16:21 | END | 2875 | v1.2.8 | /dev/sdb | Test Done (Passed)
This also shows a serious performance degradation; the numbers should be similar to the CentOS 4.5 numbers.
RHEL5, RAID 1: | 2007/09/25-11:39:20 | STAT | 3042 | v1.2.8 | /dev/sdb | Total read throughput: 546816.0B/s (0.52MB/s), IOPS 133.5/s. | 2007/09/25-11:39:20 | END | 3042 | v1.2.8 | /dev/sdb | Test Done (Passed)
This is an oddity, and is inconsistent. I would have expected this number to be low too, but it is showing normal throughput for this configuration. I wouldn't put any faith in this; if you ran it three times in a row, it would probably post slow numbers two out of the three times.
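As a rough illustration of why roughly 130 IOPS is plausible for random 4k reads on 7200 RPM drives, the per-spindle rate can be estimated from average seek time plus average rotational latency (half a revolution). This is only a sketch: the 8.5 ms seek time is an assumed typical value, not a figure from this thread, and the model ignores command queueing and caching.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm

def estimated_iops(rpm, avg_seek_ms):
    """Rough single-spindle random IOPS with one outstanding request."""
    return 1000.0 / (avg_seek_ms + rotational_latency_ms(rpm))

ASSUMED_SEEK_MS = 8.5  # assumption for illustration, not measured here

per_disk = estimated_iops(7200, ASSUMED_SEEK_MS)
print(f"7200 RPM: ~{per_disk:.0f} IOPS per spindle")
print(f"two spindles sharing reads: ~{2 * per_disk:.0f} IOPS")
```

Under that assumed seek time, the two-spindle figure is an upper bound somewhat above the 123 to 134 IOPS measured on CentOS 4.5, while the 66.8 IOPS seen on RHEL5 RAID 0 falls below even the single-spindle estimate.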
Random writes: disktest -B 4k -h 1 -I BD -K 4 -p r -P T -T 300 -w /dev/sdX
CentOS 4.5, RAID 0: | 2007/09/25-14:29:34 | STAT | 13965 | v1.2.8 | /dev/sdb | Total write throughput: 1379532.8B/s (1.32MB/s), IOPS 336.8/s. | 2007/09/25-14:29:34 | END | 13965 | v1.2.8 | /dev/sdb | Test Done (Passed)
Ok, well the write-back cache is helping here as it should, without it these numbers would be around 320KB/s
CentOS 4.5, RAID 1: | 2007/09/25-14:24:15 | STAT | 13829 | v1.2.8 | /dev/sdb | Total write throughput: 782199.5B/s (0.75MB/s), IOPS 191.0/s. | 2007/09/25-14:24:15 | END | 13829 | v1.2.8 | /dev/sdb | Test Done (Passed)
Also good write-cache performance for this config, it would probably be around 160KB/s without it.
RHEL5, RAID 0: | 2007/09/25-11:19:21 | STAT | 2894 | v1.2.8 | /dev/sdb | Total write throughput: 1377894.4B/s (1.31MB/s), IOPS 336.4/s. | 2007/09/25-11:19:21 | END | 2894 | v1.2.8 | /dev/sdb | Test Done (Passed)
Statistically the same as the CentOS 4.5 numbers.
RHEL5 RAID 1: | 2007/09/25-11:40:08 | STAT | 3049 | v1.2.8 | /dev/sdb | Total write throughput: 798310.4B/s (0.76MB/s), IOPS 194.9/s. | 2007/09/25-11:40:08 | END | 3049 | v1.2.8 | /dev/sdb | Test Done (Passed)
Same here.
I'm not sure what to make of it, mind you.
Well bottom line, there is something very wrong with the 3ware drivers on the RHEL 5 implementation.
I recommend making sure they are the latest drivers from the manufacturer and if they are, base your implementation on RHEL 4.5.
Post the modinfo <driver name> to the list just in case somebody else knows of any issues with the version you are running.
-Ross
______________________________________________________________________ This e-mail, and any attachments thereto, is intended only for use by the addressee(s) named herein and may contain legally privileged and/or confidential information. If you are not the intended recipient of this e-mail, you are hereby notified that any dissemination, distribution or copying of this e-mail, and any attachments thereto, is strictly prohibited. If you have received this e-mail in error, please immediately notify the sender and permanently delete the original and any copy or printout thereof.
 
            At 10:36 -0400 25/9/07, Ross S. W. Walker wrote:
Post the modinfo <driver name> to the list just in case somebody else knows of any issues with the version you are running.
This is from RHEL5 - it's the driver that comes built-in:
[root@serv1 ~]# modinfo 3w-9xxx
filename: /lib/modules/2.6.18-8.el5/kernel/drivers/scsi/3w-9xxx.ko
version: 2.26.02.007
license: GPL
description: 3ware 9000 Storage Controller Linux Driver
author: AMCC
srcversion: 029473DD729D96D687985E4
alias: pci:v000013C1d00001003sv*sd*bc*sc*i*
alias: pci:v000013C1d00001002sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.18-8.el5 SMP mod_unload 686 REGPARM 4KSTACKS gcc-4.1
The overall responsiveness of the system under benchmarking when running RHEL5 is somewhat better than that when running CentOS 4.5.
Cheers S.
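When comparing the in-box driver with a vendor release, it helps to pull out just the version field and compare it numerically. The sketch below assumes the standard `modinfo -F version` option is available and that version strings look like the ones quoted in this thread. The helper names are made up for illustration.

```python
import subprocess

def driver_version(module="3w-9xxx"):
    """Ask modinfo for just the version field of a kernel module."""
    out = subprocess.run(
        ["modinfo", "-F", "version", module],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

def version_tuple(version):
    """Turn '2.26.06.002-2.6.18' into (2, 26, 6, 2) for comparison."""
    base = version.split("-", 1)[0]  # drop any kernel suffix
    return tuple(int(part) for part in base.split("."))

# Versions quoted in this thread: the RHEL5 in-box driver and 3ware's release.
inbox = version_tuple("2.26.02.007")
vendor = version_tuple("2.26.06.002-2.6.18")
print("vendor driver is newer:", vendor > inbox)
```

Calling `driver_version()` on the machine itself returns the string for whichever module is installed under the running kernel's module tree.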
 
            Simon Banton wrote:
The overall responsiveness of the system under benchmarking when running RHEL5 is somewhat better than that when running CentOS 4.5.
Off of 3ware's support site I was able to download and compile the latest stable release which has this modinfo:
[root@mfg-nyc-iscsi1 driver]# modinfo 3w-9xxx.ko
filename: 3w-9xxx.ko
version: 2.26.06.002-2.6.18
license: GPL
description: 3ware 9000 Storage Controller Linux Driver
author: AMCC
srcversion: 7F428E7BA74EAFF0FF137E2
alias: pci:v000013C1d00001004sv*sd*bc*sc*i*
alias: pci:v000013C1d00001003sv*sd*bc*sc*i*
alias: pci:v000013C1d00001002sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.20-1.2320.fc5smp SMP mod_unload 686 4KSTACKS
I compiled it on a fc5 box, but that shouldn't matter.
As far as the "responsiveness" is concerned, the 2.6.18 kernel made some substantial improvements to the "cfq" io scheduler that helped with interactive user experience. If you are using the box as a server though I would try the "deadline" or "noop" scheduler with this card as you may see substantial performance improvements with these schedulers and this card.
Try setting the io scheduler to "deadline" and re-run the benchmarks to see what I mean.
-Ross
 
            At 13:26 -0400 25/9/07, Ross S. W. Walker wrote:
Off of 3ware's support site I was able to download and compile the latest stable release which has this modinfo:
[root@mfg-nyc-iscsi1 driver]# modinfo 3w-9xxx.ko filename: 3w-9xxx.ko version: 2.26.06.002-2.6.18
OK, driver source from the 9.4.1.3 codeset (3w-9xxx-2.6.18kernel_9.4.1.3.tgz) now built and installed for RHEL5, new initrd created and machine re-tested.
[root@serv1 ~]# modinfo 3w-9xxx
filename: /lib/modules/2.6.18-8.el5/kernel/drivers/scsi/3w-9xxx.ko
version: 2.26.06.002-2.6.18
license: GPL
description: 3ware 9000 Storage Controller Linux Driver
author: AMCC
srcversion: 7F428E7BA74EAFF0FF137E2
alias: pci:v000013C1d00001004sv*sd*bc*sc*i*
alias: pci:v000013C1d00001003sv*sd*bc*sc*i*
alias: pci:v000013C1d00001002sv*sd*bc*sc*i*
depends: scsi_mod
vermagic: 2.6.18-8.el5 SMP mod_unload 686 REGPARM 4KSTACKS gcc-4.1
tw_cli output just to be sure:
//serv1> /c0 show all
/c0 Driver Version = 2.26.06.002-2.6.18
/c0 Model = 9550SX-8LP
/c0 Memory Installed = 112MB
/c0 Firmware Version = FE9X 3.08.02.007
/c0 Bios Version = BE9X 3.08.00.002
/c0 Monitor Version = BL9X 3.01.00.006
Well bottom line, there is something very wrong with the 3ware drivers on the RHEL 5 implementation.
There still is then, because the figures for the LTP disktest are almost identical post-update.
Sequential reads:
RHEL5, RAID 0: | 2007/09/26-09:11:27 | START | 2962 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 976519167) (-c) (-p u) | 2007/09/26-09:11:57 | STAT | 2962 | v1.2.8 | /dev/sdb | Total read throughput: 2430429.9B/s (2.32MB/s), IOPS 593.4/s.
RHEL5, RAID 1: | 2007/09/26-09:59:41 | START | 3210 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u) | 2007/09/26-10:00:11 | STAT | 3210 | v1.2.8 | /dev/sdb | Total read throughput: 2566280.5B/s (2.45MB/s), IOPS 626.5/s.
Sequential writes:
RHEL5, RAID 0: | 2007/09/26-09:11:57 | START | 2971 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 976519167) (-c) (-p u) | 2007/09/26-09:12:27 | STAT | 2971 | v1.2.8 | /dev/sdb | Total write throughput: 66337450.7B/s (63.26MB/s), IOPS 16195.7/s.
RHEL5, RAID 1: | 2007/09/26-10:00:11 | START | 3217 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u) | 2007/09/26-10:00:41 | STAT | 3217 | v1.2.8 | /dev/sdb | Total write throughput: 54108160.0B/s (51.60MB/s), IOPS 13210.0/s.
Random reads:
RHEL5, RAID 0: | 2007/09/26-09:12:28 | START | 2978 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 976519167) (-c) (-D 100:0) | 2007/09/26-09:12:57 | STAT | 2978 | v1.2.8 | /dev/sdb | Total read throughput: 269206.1B/s (0.26MB/s), IOPS 65.7/s.
RHEL5, RAID 1: | 2007/09/26-10:00:41 | START | 3231 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0) | 2007/09/26-10:01:11 | STAT | 3231 | v1.2.8 | /dev/sdb | Total read throughput: 262144.0B/s (0.25MB/s), IOPS 64.0/s.
Random writes:
RHEL5, RAID 0: | 2007/09/26-09:12:57 | START | 2987 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 976519167) (-c) (-D 0:100) | 2007/09/26-09:13:34 | STAT | 2987 | v1.2.8 | /dev/sdb | Total write throughput: 1378440.5B/s (1.31MB/s), IOPS 336.5/s.
RHEL5, RAID 1: | 2007/09/26-10:01:12 | START | 11539 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100) | 2007/09/26-10:01:41 | STAT | 11539 | v1.2.8 | /dev/sdb | Total write throughput: 638976.0B/s (0.61MB/s), IOPS 156.0/s.
I re-ran the tests again, just to be sure (same order as above - SeqR, SeqW, RandomR, RandomW):
RAID 0:
SR| 2007/09/26-10:16:53 | STAT | 4602 | v1.2.8 | /dev/sdb | Total read throughput: 2456328.8B/s (2.34MB/s), IOPS 599.7/s.
SW| 2007/09/26-10:17:23 | STAT | 4611 | v1.2.8 | /dev/sdb | Total write throughput: 66434662.4B/s (63.36MB/s), IOPS 16219.4/s.
RR| 2007/09/26-10:17:53 | STAT | 4618 | v1.2.8 | /dev/sdb | Total read throughput: 273612.8B/s (0.26MB/s), IOPS 66.8/s.
RW| 2007/09/26-10:18:31 | STAT | 4626 | v1.2.8 | /dev/sdb | Total write throughput: 1424701.8B/s (1.36MB/s), IOPS 347.8/s.
RAID 1:
SR| 2007/09/26-10:12:49 | STAT | 4509 | v1.2.8 | /dev/sdb | Total read throughput: 2479718.4B/s (2.36MB/s), IOPS 605.4/s.
SW| 2007/09/26-10:13:19 | STAT | 4516 | v1.2.8 | /dev/sdb | Total write throughput: 53864721.1B/s (51.37MB/s), IOPS 13150.6/s.
RR| 2007/09/26-10:13:49 | STAT | 4525 | v1.2.8 | /dev/sdb | Total read throughput: 268151.5B/s (0.26MB/s), IOPS 65.5/s.
RW| 2007/09/26-10:14:19 | STAT | 4532 | v1.2.8 | /dev/sdb | Total write throughput: 549287.7B/s (0.52MB/s), IOPS 134.1/s.
Baffled, I am.
S.
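To put a number on "almost identical", the RHEL5 IOPS figures from before and after the driver update can be compared as percentage changes. The values below are copied from the STAT lines posted in this thread (first run after the update). The dictionary layout and function name are just for illustration. Note that the earlier runs used -T 300 and the later ones -T 30, so small differences may be noise.

```python
# (before, after) IOPS on RHEL5: in-box driver 2.26.02.007 versus
# 2.26.06.002-2.6.18, taken from the disktest output quoted above.
RESULTS = {
    ("RAID 0", "seq read"):   (587.2, 593.4),
    ("RAID 0", "seq write"):  (16121.0, 16195.7),
    ("RAID 0", "rand read"):  (66.8, 65.7),
    ("RAID 0", "rand write"): (336.4, 336.5),
    ("RAID 1", "seq read"):   (601.0, 626.5),
    ("RAID 1", "seq write"):  (13884.3, 13210.0),
    ("RAID 1", "rand read"):  (133.5, 64.0),
    ("RAID 1", "rand write"): (194.9, 156.0),
}

def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before

for (level, test), (before, after) in RESULTS.items():
    print(f"{level} {test:<10} {before:>9.1f} -> {after:>9.1f} "
          f"({pct_change(before, after):+.1f}%)")
```

Most rows move by a few percent at most. The RAID 1 random read figure roughly halves, which matches the earlier suspicion that the 133.5 IOPS result on RHEL5 was the outlier.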
 
            Simon Banton wrote:
Baffled, I am.
Could you try the benchmarks with the 'deadline' scheduler?
echo deadline >/sys/block/sdb/queue/scheduler
In my tests here I found the 'cfq' scheduler does not let overlapping io shine, it actually acts as a governor.
-Ross
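The echo command above writes to a sysfs file that also reports which schedulers the kernel offers, with the active one in square brackets (for example `noop anticipatory deadline [cfq]`). Here is a small Python sketch of reading and switching it, assuming that sysfs layout. The function names are hypothetical, and writing the file requires root.

```python
from pathlib import Path

def parse_schedulers(text):
    """Split a sysfs scheduler line into (active, available)."""
    names = text.split()
    active = next((n.strip("[]") for n in names if n.startswith("[")), None)
    return active, [n.strip("[]") for n in names]

def scheduler_path(dev):
    """Path of the per-device scheduler file, e.g. for 'sdb'."""
    return Path("/sys/block") / dev / "queue" / "scheduler"

def set_scheduler(dev, name):
    """Equivalent of: echo NAME > /sys/block/DEV/queue/scheduler (needs root)."""
    path = scheduler_path(dev)
    _, available = parse_schedulers(path.read_text())
    if name not in available:
        raise ValueError(f"{name!r} is not offered by the kernel for {dev}")
    path.write_text(name)

# Example of the line format on a 2.6.18-era kernel with cfq active.
print(parse_schedulers("noop anticipatory deadline [cfq]\n"))
```

A change made this way lasts only until reboot. The `elevator=` kernel boot parameter sets the default for all devices.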
 
            At 09:14 -0400 26/9/07, Ross S. W. Walker wrote:
Could you try the benchmarks with the 'deadline' scheduler?
OK, these are all with RHEL5, driver 2.26.06.002-2.6.18, RAID 1:
elevator=deadline:
Sequential reads:
| 2007/09/26-16:19:30 | START | 3065 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u)
| 2007/09/26-16:20:00 | STAT | 3065 | v1.2.8 | /dev/sdb | Total read throughput: 45353642.7B/s (43.25MB/s), IOPS 11072.7/s.
Sequential writes:
| 2007/09/26-16:20:00 | START | 3082 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u)
| 2007/09/26-16:20:30 | STAT | 3082 | v1.2.8 | /dev/sdb | Total write throughput: 53781186.2B/s (51.29MB/s), IOPS 13130.2/s.
Random reads:
| 2007/09/26-16:20:30 | START | 3091 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:21:00 | STAT | 3091 | v1.2.8 | /dev/sdb | Total read throughput: 545587.2B/s (0.52MB/s), IOPS 133.2/s.
Random writes:
| 2007/09/26-16:21:00 | START | 3098 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:21:44 | STAT | 3098 | v1.2.8 | /dev/sdb | Total write throughput: 795852.8B/s (0.76MB/s), IOPS 194.3/s.
Here are the others for comparison.
elevator=noop:
Sequential reads:
| 2007/09/26-16:24:02 | START | 3167 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u)
| 2007/09/26-16:24:32 | STAT | 3167 | v1.2.8 | /dev/sdb | Total read throughput: 45467374.9B/s (43.36MB/s), IOPS 11100.4/s.
Sequential writes:
| 2007/09/26-16:24:32 | START | 3176 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u)
| 2007/09/26-16:25:02 | STAT | 3176 | v1.2.8 | /dev/sdb | Total write throughput: 53825672.5B/s (51.33MB/s), IOPS 13141.0/s.
Random reads:
| 2007/09/26-16:25:03 | START | 3193 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:25:32 | STAT | 3193 | v1.2.8 | /dev/sdb | Total read throughput: 540954.5B/s (0.52MB/s), IOPS 132.1/s.
Random writes:
| 2007/09/26-16:25:32 | START | 3202 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:26:16 | STAT | 3202 | v1.2.8 | /dev/sdb | Total write throughput: 795989.3B/s (0.76MB/s), IOPS 194.3/s.
elevator=anticipatory:
Sequential reads:
| 2007/09/26-16:37:04 | START | 3277 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u)
| 2007/09/26-16:37:34 | STAT | 3277 | v1.2.8 | /dev/sdb | Total read throughput: 45414126.9B/s (43.31MB/s), IOPS 11087.4/s.
Sequential writes:
| 2007/09/26-16:37:35 | START | 3284 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u)
| 2007/09/26-16:38:04 | STAT | 3284 | v1.2.8 | /dev/sdb | Total write throughput: 53895168.0B/s (51.40MB/s), IOPS 13158.0/s.
Random reads:
| 2007/09/26-16:38:04 | START | 3293 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:38:34 | STAT | 3293 | v1.2.8 | /dev/sdb | Total read throughput: 467080.5B/s (0.45MB/s), IOPS 114.0/s.
Random writes:
| 2007/09/26-16:38:34 | START | 3300 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:39:18 | STAT | 3300 | v1.2.8 | /dev/sdb | Total write throughput: 793122.1B/s (0.76MB/s), IOPS 193.6/s.
elevator=cfq (just to re-check):
Sequential reads:
| 2007/09/26-16:42:18 | START | 3353 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u)
| 2007/09/26-16:42:48 | STAT | 3353 | v1.2.8 | /dev/sdb | Total read throughput: 2463470.9B/s (2.35MB/s), IOPS 601.4/s.
Sequential writes:
| 2007/09/26-16:42:48 | START | 3360 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u)
| 2007/09/26-16:43:18 | STAT | 3360 | v1.2.8 | /dev/sdb | Total write throughput: 54572782.9B/s (52.04MB/s), IOPS 13323.4/s.
Random reads:
| 2007/09/26-16:43:19 | START | 3369 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:43:48 | STAT | 3369 | v1.2.8 | /dev/sdb | Total read throughput: 267652.4B/s (0.26MB/s), IOPS 65.3/s.
Random writes:
| 2007/09/26-16:43:48 | START | 3376 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:44:31 | STAT | 3376 | v1.2.8 | /dev/sdb | Total write throughput: 793122.1B/s (0.76MB/s), IOPS 193.6/s.
Certainly cfq is severely cramping the reads, it appears.
S.
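The four scheduler runs above are easier to compare as ratios against cfq. The IOPS values in this sketch are copied from the STAT lines in this message (RHEL5, driver 2.26.06.002-2.6.18, RAID 1). The table layout and the `speedup` helper are illustrative only.

```python
# IOPS per scheduler, copied from the disktest STAT lines in this message.
IOPS = {
    "deadline":     {"seq read": 11072.7, "seq write": 13130.2,
                     "rand read": 133.2, "rand write": 194.3},
    "noop":         {"seq read": 11100.4, "seq write": 13141.0,
                     "rand read": 132.1, "rand write": 194.3},
    "anticipatory": {"seq read": 11087.4, "seq write": 13158.0,
                     "rand read": 114.0, "rand write": 193.6},
    "cfq":          {"seq read": 601.4, "seq write": 13323.4,
                     "rand read": 65.3, "rand write": 193.6},
}

def speedup(scheduler, test, baseline="cfq"):
    """How many times faster SCHEDULER is than BASELINE on TEST."""
    return IOPS[scheduler][test] / IOPS[baseline][test]

for name in ("deadline", "noop", "anticipatory"):
    row = ", ".join(f"{test}: {speedup(name, test):.2f}x" for test in IOPS[name])
    print(f"{name:<13}{row}")
```

On these figures, sequential reads are about 18 times faster with deadline than with cfq and random reads about twice as fast, while writes are essentially unchanged across all four schedulers.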
 
            Simon Banton wrote:
At 09:14 -0400 26/9/07, Ross S. W. Walker wrote:
Could you try the benchmarks with the 'deadline' scheduler?
OK, these are all with RHEL5, driver 2.26.06.002-2.6.18, RAID 1:
elevator=deadline: Sequential reads: | 2007/09/26-16:19:30 | START | 3065 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u) | 2007/09/26-16:20:00 | STAT | 3065 | v1.2.8 | /dev/sdb | Total read throughput: 45353642.7B/s (43.25MB/s), IOPS 11072.7/s.
That's a lot better, where it should be for those drives.
Sequential writes: | 2007/09/26-16:20:00 | START | 3082 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u) | 2007/09/26-16:20:30 | STAT | 3082 | v1.2.8 | /dev/sdb | Total write throughput: 53781186.2B/s (51.29MB/s), IOPS 13130.2/s.
Yup, with the write-back you'll see better write throughput than read at this block size.
Random reads: | 2007/09/26-16:20:30 | START | 3091 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0) | 2007/09/26-16:21:00 | STAT | 3091 | v1.2.8 | /dev/sdb | Total read throughput: 545587.2B/s (0.52MB/s), IOPS 133.2/s.
Same, random io would really be affected here.
Random writes: | 2007/09/26-16:21:00 | START | 3098 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100) | 2007/09/26-16:21:44 | STAT | 3098 | v1.2.8 | /dev/sdb | Total write throughput: 795852.8B/s (0.76MB/s), IOPS 194.3/s.
Same here.
Here are the others for comparison.
elevator=noop: Sequential reads: | 2007/09/26-16:24:02 | START | 3167 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u) | 2007/09/26-16:24:32 | STAT | 3167 | v1.2.8 | /dev/sdb | Total read throughput: 45467374.9B/s (43.36MB/s), IOPS 11100.4/s.
About the same as deadline, but you'll probably be better off with deadline as deadline will attempt to merge requests from separate sources to the same volume while noop will just send it as it gets it.
Sequential writes: | 2007/09/26-16:24:32 | START | 3176 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u) | 2007/09/26-16:25:02 | STAT | 3176 | v1.2.8 | /dev/sdb | Total write throughput: 53825672.5B/s (51.33MB/s), IOPS 13141.0/s.
Same for the others.
Random reads:
| 2007/09/26-16:25:03 | START | 3193 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:25:32 | STAT | 3193 | v1.2.8 | /dev/sdb | Total read throughput: 540954.5B/s (0.52MB/s), IOPS 132.1/s.
Random writes:
| 2007/09/26-16:25:32 | START | 3202 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:26:16 | STAT | 3202 | v1.2.8 | /dev/sdb | Total write throughput: 795989.3B/s (0.76MB/s), IOPS 194.3/s.
elevator=anticipatory: Sequential reads: | 2007/09/26-16:37:04 | START | 3277 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u) | 2007/09/26-16:37:34 | STAT | 3277 | v1.2.8 | /dev/sdb | Total read throughput: 45414126.9B/s (43.31MB/s), IOPS 11087.4/s.
While anticipatory appears to be an adequate choice here it will cause performance issues from multiple writers as it keeps trying to anticipate those reads. For a server deadline is still the best.
Sequential writes:
| 2007/09/26-16:37:35 | START | 3284 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u)
| 2007/09/26-16:38:04 | STAT | 3284 | v1.2.8 | /dev/sdb | Total write throughput: 53895168.0B/s (51.40MB/s), IOPS 13158.0/s.
Random reads:
| 2007/09/26-16:38:04 | START | 3293 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:38:34 | STAT | 3293 | v1.2.8 | /dev/sdb | Total read throughput: 467080.5B/s (0.45MB/s), IOPS 114.0/s.
Random writes:
| 2007/09/26-16:38:34 | START | 3300 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:39:18 | STAT | 3300 | v1.2.8 | /dev/sdb | Total write throughput: 793122.1B/s (0.76MB/s), IOPS 193.6/s.
elevator=cfq (just to re-check): Sequential reads: | 2007/09/26-16:42:18 | START | 3353 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -r (-N 488259583) (-c) (-p u) | 2007/09/26-16:42:48 | STAT | 3353 | v1.2.8 | /dev/sdb | Total read throughput: 2463470.9B/s (2.35MB/s), IOPS 601.4/s.
CFQ is intended for single-disk workstations and its I/O limits are based on that, so it actually acts as an I/O governor on RAID setups.
Only use 'cfq' on single disk workstations.
Use 'deadline' on RAID setups and servers.
Sequential writes:
| 2007/09/26-16:42:48 | START | 3360 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p l -P T -T 30 -w (-N 488259583) (-c) (-p u)
| 2007/09/26-16:43:18 | STAT | 3360 | v1.2.8 | /dev/sdb | Total write throughput: 54572782.9B/s (52.04MB/s), IOPS 13323.4/s.
Random reads:
| 2007/09/26-16:43:19 | START | 3369 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -r (-N 488259583) (-c) (-D 100:0)
| 2007/09/26-16:43:48 | STAT | 3369 | v1.2.8 | /dev/sdb | Total read throughput: 267652.4B/s (0.26MB/s), IOPS 65.3/s.
Random writes:
| 2007/09/26-16:43:48 | START | 3376 | v1.2.8 | /dev/sdb | Start args: -B 4k -h 1 -I BD -K 4 -p r -P T -T 30 -w (-N 488259583) (-c) (-D 0:100)
| 2007/09/26-16:44:31 | STAT | 3376 | v1.2.8 | /dev/sdb | Total write throughput: 793122.1B/s (0.76MB/s), IOPS 193.6/s.
Certainly cfq is severely cramping the reads, it appears.
Yes, as I mentioned above, it allocates I/O per executing thread based on the typical single-disk I/O pattern, and therefore limits the bandwidth going to disk per thread to a fraction of a single disk's performance.
-Ross
 
            At 12:01 -0400 26/9/07, Ross S. W. Walker wrote:
CFQ is intended for single-disk workstations and its I/O limits are based on that, so it actually acts as an I/O governor on RAID setups.
Only use 'cfq' on single disk workstations.
Use 'deadline' on RAID setups and servers.
Many thanks Ross, that's one variable tied down at least :-)
S.
 
            Simon Banton wrote:
At 07:46 +0800 24/9/07, Feizhou wrote:
... plus an Out of Memory kill of sshd. Second time around (logged in on the console rather than over ssh), it's just the same except it's hald that happens to get clobbered instead.
Are you saying that running in RAID0 mode with this card and motherboard combination, you get a memory leak? Who is the culprit?
I don't know if it's caused by a memory leak or something else, I'm just describing what happens. I would be tempted to suspect the RAM itself if another identical machine didn't have exactly the same issue.
This is worth checking and reporting to 3ware.
what's left to try?
Bug report...
I've reported the issue to 3ware but they've not responded. I replicated the problem with RHEL AS 4 update 5 and contacted RedHat but they told me evaluation subscriptions aren't supported.
If you have something that is reproducible like the above, they will definitely be interested. Nothing like causing instability to get them on the case.
 
            hello,
I saw this thread a bit late, but I had (and am still having) the exact same issues on a dual-CPU, dual-core Opteron box with a 9550SX (CentOS 5 x86_64). What I did to work around them was basically switching to XFS for everything except / (3ware say their cards are fast, but only on XFS) AND using very low nr_requests for every blockdev on the 3ware card (like 32 or 64). That will limit the iowait times for the CPUs and make the 3ware drives respond faster (see for yourself with iostat -x -m 1 while benchmarking). If you can, you could also try _not_ putting the system disks on the 3ware card, because additionally the 3ware driver/card gives writes priority. People suggested the unresponsive system behaviour arises because the CPU hangs in iowait for writes, and reading the system binaries won't happen until those writes are done, so the binaries should be on another IO path.
All this seems to be symptoms of a very complex issue consisting of kernel bugs, bad drivers and so on, and they seem to be worst on an AMD/3ware combination. Here is another link: http://bugzilla.kernel.org/show_bug.cgi?id=7372
regards, matthias
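A rough sketch of the nr_requests part of the workaround above, assuming the queue size is exposed at /sys/block/<dev>/queue/nr_requests. The function name, the SYSFS_ROOT override and the device names in the example are hypothetical; the value 64 is simply one of the figures suggested above, not a tested optimum.

```shell
#!/bin/sh
# Sketch: apply one nr_requests value to several block devices.
# SYSFS_ROOT is a hypothetical override for dry runs; it defaults to /sys.

set_nr_requests() {
    value=$1
    shift
    for dev in "$@"; do
        file="${SYSFS_ROOT:-/sys}/block/$dev/queue/nr_requests"
        if [ -w "$file" ]; then
            echo "$value" > "$file"
            # Read the value back so the change is visible.
            echo "$dev: nr_requests=$(cat "$file")"
        else
            echo "$dev: cannot write $file" >&2
        fi
    done
}

# Example (as root): set_nr_requests 64 sda sdb
# Then watch per-device wait times during a benchmark: iostat -x -m 1
```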
 
            At 12:30 +0200 2/10/07, matthias platzer wrote:
What I did to work around them was basically switching to XFS for everything except / (3ware say their cards are fast, but only on XFS) AND using very low nr_requests for every blockdev on the 3ware card.
Hi Matthias,
Thanks for this. In my CentOS 5 tests the nr_requests turned out by default to be 128, rather than the 8192 of CentOS 4.5. I'll have a go at reducing it still further.
If you can, you could also try _not_ putting the system disks on the 3ware card, because additionally the 3ware driver/card gives writes priority.
I've noticed that kicking off a simultaneous pair of dd reads and writes from/to the RAID 1 array indicates that very clearly - only with cfq as the elevator did reads get any kind of look-in. Sadly, I'm not able to separate the system disks off as there's no on-board SATA on the mboard nor any room for inboard disks; the original intention was to provide the resilience of hardware RAID 1 for the entire machine.
People suggested the unresponsive system behaviour is because the cpu hanging in iowait for writing and then reading the system binaries won't happen till the writes are done, so the binaries should be on another io path.
Yup, that certainly seems to be what's happening. Wish I had another io path...
All this seem to be symptoms of a very complex issue consisting of kernel bugs/bad drivers/... and they seem to be worst on a AMD/3ware Combination. here is another link: http://bugzilla.kernel.org/show_bug.cgi?id=7372
Ouch - thanks for that link :-( Looks like I'm screwed big time.
S.
 
            Simon Banton wrote:
At 12:30 +0200 2/10/07, matthias platzer wrote:
What I did to work around them was basically switching to XFS for everything except / (3ware say their cards are fast, but only on XFS) AND using very low nr_requests for every blockdev on the 3ware card.
Hi Matthias,
Thanks for this. In my CentOS 5 tests the nr_requests turned out by default to be 128, rather than the 8192 of CentOS 4.5. I'll have a go at reducing it still further.
Yes, nr_requests should be a realistic reflection of what the card itself can handle. If it is set too high you will see IO waits stack up.
64 or 128 are good numbers; rarely have I seen a card that can handle a depth larger than 128 (some older SCSI cards did 256, I think).
If you can, you could also try _not_ putting the system disks on the 3ware card, because additionally the 3ware driver/card gives writes priority.
I've noticed that kicking off a simultaneous pair of dd reads and writes from/to the RAID 1 array indicates that very clearly - only with cfq as the elevator did reads get any kind of look-in. Sadly, I'm not able to separate the system disks off as there's no on-board SATA on the mboard nor any room for inboard disks; the original intention was to provide the resilience of hardware RAID 1 for the entire machine.
CFQ will give reads first-in-line priority, but this can cause all sorts of negative side effects for a RAID setup. Workloads can be such that a read operation is dependent on a write succeeding first, yet both were issued in an overlapping IO scenario; you can see the problem. If reads are getting starved with your workload you can try 'anticipatory', but if I remember correctly you have BBU write-back cache enabled, and this should really limit the impact.
You will always see an impact though, that is just the nature of it.
Writes will beat reads, random will beat sequential, it's the rock, paper, scissors game that all storage systems must play.
People suggested the unresponsive system behaviour is because the cpu hanging in iowait for writing and then reading the system binaries won't happen till the writes are done, so the binaries should be on another io path.
Yup, that certainly seems to be what's happening. Wish I had another io path...
You can have another io path, just add more disks to the 3ware, create another RAID array and locate your application data there.
All this seem to be symptoms of a very complex issue consisting of kernel bugs/bad drivers/... and they seem to be worst on a AMD/3ware Combination. here is another link: http://bugzilla.kernel.org/show_bug.cgi?id=7372
Ouch - thanks for that link :-( Looks like I'm screwed big time.
There is always a way out of any mess (without scrapping the whole project).
-Ross
 
            On Tue, Oct 02, 2007 at 09:39:09AM -0400, Ross S. W. Walker wrote:
Simon Banton wrote:
At 12:30 +0200 2/10/07, matthias platzer wrote:
What I did to work around them was basically switching to XFS for everything except / (3ware say their cards are fast, but only on XFS) AND using very low nr_requests for every blockdev on the 3ware card.
Hi Matthias,
Thanks for this. In my CentOS 5 tests the nr_requests turned out by default to be 128, rather than the 8192 of CentOS 4.5. I'll have a go at reducing it still further.
Yes, nr_requests should be a realistic reflection of what the card itself can handle. If it is set too high you will see IO waits stack up.
64 or 128 are good numbers; rarely have I seen a card that can handle a depth larger than 128 (some older SCSI cards did 256, I think).
Hmm.. let's say you have a Linux software md-raid array made of SATA drives.. what kind of nr_requests values should you use for that for optimal performance?
Thanks!
-- Pasi
 
            On Tue, Oct 02, 2007 at 08:57:28PM +0300, Pasi Kärkkäinen wrote:
On Tue, Oct 02, 2007 at 09:39:09AM -0400, Ross S. W. Walker wrote:
Simon Banton wrote:
At 12:30 +0200 2/10/07, matthias platzer wrote:
What I did to work around them was basically switching to XFS for everything except / (3ware say their cards are fast, but only on XFS) AND using very low nr_requests for every blockdev on the 3ware card.
Hi Matthias,
Thanks for this. In my CentOS 5 tests the nr_requests turned out by default to be 128, rather than the 8192 of CentOS 4.5. I'll have a go at reducing it still further.
Yes, nr_requests should be a realistic reflection of what the card itself can handle. If it is set too high you will see IO waits stack up.
64 or 128 are good numbers; rarely have I seen a card that can handle a depth larger than 128 (some older SCSI cards did 256, I think).
Hmm.. let's say you have a Linux software md-raid array made of SATA drives.. what kind of nr_requests values should you use for that for optimal performance?
Or let's put it this way:
You have an md-raid array on dom0. What kind of nr_requests values should you use for normal 7200 rpm SATA NCQ disks on an Intel ICH8 (NCQ) controller?
And then this md-array is seen as xvdb by domU.. what kind of nr_requests values should you use in domU?
io-scheduler/elevator should be deadline in domU I assume.. how about in dom0? deadline there too?
Thanks!
-- Pasi
 
            Pasi Kärkkäinen wrote:
Or let's put it this way:
You have an md-raid array on dom0. What kind of nr_requests values should you use for normal 7200 rpm SATA NCQ disks on an Intel ICH8 (NCQ) controller?
And then this md-array is seen as xvdb by domU.. what kind of nr_requests values should you use in domU?
io-scheduler/elevator should be deadline in domU I assume.. how about in dom0? deadline there too?
Arrr, where thou go thar be monsters...
You got me Pasi, with Xen as the workload it adds a whole new dimension.
Unless you have hardware RAID, stick to the default setting and when you see a bottleneck double check your hardware drivers and RAID config first and only twiddle the queue settings if everything else has been twiddled first.
-Ross
 
            On Tue, Oct 02, 2007 at 03:56:17PM -0400, Ross S. W. Walker wrote:
Arrr, where thou go thar be monsters...
You got me Pasi, with Xen as the workload it adds a whole new dimension.
Unless you have hardware RAID, stick to the default setting and when you see a bottleneck double check your hardware drivers and RAID config first and only twiddle the queue settings if everything else has been twiddled first.
OK.
I'm seeing quite high io-wait times on domU, but hardly any io-wait in dom0.. so I was wondering if I have too high nr_requests on domU. I think it is 256 atm.
Maybe I'll have to do some benchmarking.
What's the best multi-threaded / multi-process io-benchmark utility that works with filesystems instead of raw devices? and can read/write multiple files at once..
-- Pasi
 
            What's the best multi-threaded / multi-process io-benchmark utility that works with filesystems instead of raw devices? and can read/write multiple files at once..
http://untroubled.org/benchmarking/2004-04/
No raw numbers but...
 
            Pasi Kärkkäinen wrote:
Hmm.. let's say you have a Linux software md-raid array made of SATA drives.. what kind of nr_requests values should you use for that for optimal performance?
Thanks!
Pasi,
Good to hear from you again.
I haven't done much testing with software RAID, but after googling around I have found that there truly is no single nr_requests setting that fits every case.
nr_requests is the maximum number of requests that can be queued before the queue is unplugged, and the ideal number of queued requests is a reflection of the workload and the hardware together. Also, most system functions unplug the queue after each request, so it isn't such a big issue unless the system is under high load.
If the default 128 isn't working I would explore hardware or RAID configuration problems first before trying to tweak this setting.
The old nr_requests = 8192 was definitely too high.
-Ross
 
            On Tue, Oct 02, 2007 at 03:51:50PM -0400, Ross S. W. Walker wrote:
Pasi,
Good to hear from you again.
I haven't done much testing with software RAID, but after googling around I have found that there truly is no single nr_requests setting that fits every case.
nr_requests is the maximum number of requests that can be queued before the queue is unplugged, and the ideal number of queued requests is a reflection of the workload and the hardware together. Also, most system functions unplug the queue after each request, so it isn't such a big issue unless the system is under high load.
If the default 128 isn't working I would explore hardware or RAID configuration problems first before trying to tweak this setting.
The old nr_requests = 8192 was definitely too high.
Thanks for the reply Ross :) You're quite active on many lists it seems..
I also understood there's no single correct value for nr_requests..
I wonder if there are any "guidelines" how to find the best value.. I guess monitoring io-wait and throughput should give the best value for nr_requests?
-- Pasi
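One way to put a number on "monitoring io-wait" is to sample the aggregate cpu line in /proc/stat twice and compare the iowait counter with the total. The sketch below assumes the usual field order on that line (user, nice, system, idle, iowait, irq, softirq, steal); the function names are hypothetical, and it only measures iowait, so throughput would still need to be read separately (for instance from iostat).

```shell
#!/bin/sh
# Sketch: percentage of CPU time spent in iowait between two samples of the
# aggregate "cpu" line from /proc/stat.

iowait_pct() {
    # $1 and $2 are two "cpu ..." lines taken some seconds apart.
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user..steal; later guest fields are already
            # included in user/nice, so they are left out of the total.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            t[NR] = total
            w[NR] = $6          # $1 is the label "cpu", so iowait is $6
        }
        END {
            d = t[2] - t[1]
            if (d <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w[2] - w[1]) / d
        }'
}

sample_iowait() {
    # Take two live samples, interval seconds apart (default 5).
    interval=${1:-5}
    a=$(grep '^cpu ' /proc/stat)
    sleep "$interval"
    b=$(grep '^cpu ' /proc/stat)
    iowait_pct "$a" "$b"
}

# Example: run a benchmark in another terminal, then: sample_iowait 10
```

Repeating the measurement for a handful of nr_requests values under the same benchmark gives a crude comparison; it is no substitute for profiling the real workload.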
 
            Dear all,
According to the changelog of the plus kernel, JFS, NTFS and ReiserFS are enabled in the plus kernel. Could you tell me why XFS is not? I know kmod-xfs packages are released in the plus repository. What is the advantage of the kmod over building it in?
Best regards, Yuji
 
            On Monday 08 October 2007, Yuji Tsuchimoto wrote:
Dear all,
According to change-log of plus kernel, JFS, NTFS, ReiserFS are enabled in the plus kernel. Could you tell me why XFS is not?
First, don't reply to random posts; it screws up the threading.
XFS is enabled by installing the kmod-xfs pkg (as you noted below). The xfs code available in the kernel source is outdated and as such not suitable for use.
/Peter
I know kmod-xfs packages are released in the plus repository. What is the advantage of the kmod over building it in?
Best regards, Yuji
 
            matthias platzer wrote:
hello,
I saw this thread a bit late, but I had (and am still having) the exact same issues on a dual-CPU, dual-core Opteron box with a 9550SX (CentOS 5 x86_64). What I did to work around them was basically switching to XFS for everything except / (3ware say their cards are fast, but only on XFS) AND using very low nr_requests for every blockdev on the 3ware card (like 32 or 64). That will limit the iowait times for the CPUs and make the 3ware drives respond faster (see for yourself with iostat -x -m 1 while benchmarking). If you can, you could also try _not_ putting the system disks on the 3ware card, because additionally the 3ware driver/card gives writes priority. People suggested the unresponsive system behaviour arises because the CPU hangs in iowait for writes, and reading the system binaries won't happen until those writes are done, so the binaries should be on another IO path.
All this seems to be symptoms of a very complex issue consisting of kernel bugs, bad drivers and so on, and they seem to be worst on an AMD/3ware combination. Here is another link: http://bugzilla.kernel.org/show_bug.cgi?id=7372
Actually the real-real fix was to use the 'deadline' or 'noop' scheduler with this card. The default 'cfq' scheduler was designed to work with a single drive and not a multiple-drive RAID, so it acts as a governor on the amount of IO that a single process can send to the device, and when you do multiple overlapping IOs performance decreases instead of increasing.
Personally, I always use 'deadline' as my scheduler of choice.
Of course if your drivers are broken that will always negatively impact performance.
-Ross
 
            At 09:24 -0400 2/10/07, Ross S. W. Walker wrote:
Actually the real-real fix was to use the 'deadline' or 'noop' scheduler with this card. The default 'cfq' scheduler was designed to work with a single drive and not a multiple-drive RAID, so it acts as a governor on the amount of IO that a single process can send to the device, and when you do multiple overlapping IOs performance decreases instead of increasing.
Ah - that wasn't actually a complete fix Ross, but it did give a noticeable improvement in certain situations. I'm still chasing a real real 'general purpose' fix.
S.
 
            Simon Banton wrote:
At 09:24 -0400 2/10/07, Ross S. W. Walker wrote:
Actually the real-real fix was to use the 'deadline' or 'noop' scheduler with this card. The default 'cfq' scheduler was designed to work with a single drive and not a multiple-drive RAID, so it acts as a governor on the amount of IO that a single process can send to the device, and when you do multiple overlapping IOs performance decreases instead of increasing.
Ah - that wasn't actually a complete fix Ross, but it did give a noticeable improvement in certain situations. I'm still chasing a real real 'general purpose' fix.
I was unaware of that. I thought changing schedulers did it.
What is the recurring performance problem you are seeing?
-Ross
 
            What is the recurring performance problem you are seeing?
Pretty much exactly the symptoms described in http://bugzilla.kernel.org/show_bug.cgi?id=7372 relating to read starvation under heavy write IO causing sluggish system response.
I recently graphed the blocks in/blocks out from vmstat 1 for the same test using each of the four IO schedulers (see the PDF attached to the article below):
http://community.novacaster.com/showarticle.pl?id=7492
The test was:
dd if=/dev/sda of=/dev/null bs=1M count=4096 & sleep 5; dd if=/dev/zero of=./4G bs=1M count=4096 &
Despite appearances, interactive responsiveness subjectively felt better using deadline than cfq - but this is obviously an atypical workload and so now I'm focusing on finishing building the machine completely so I can try profiling the more typical patterns of activity that it'll experience when in use.
I find myself wondering whether the fact that the array looks like a single SCSI disk to the OS means that cfq is able to perform better in terms of interleaving reads and writes to the card but that some side effect of its work is causing the responsiveness issue at the same time. Pure speculation on my part - this is way outside my experience.
I'm also looking into trying an Areca card instead (avoiding LSI because they're cited as having the same issue in the bugzilla mentioned above).
S.
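For anyone wanting to reproduce graphs like the ones mentioned above, a small sketch that turns captured vmstat 1 output into CSV ready for a spreadsheet. It assumes the standard vmstat layout with a header row naming the bi and bo columns; the function name and the capture file name in the example are hypothetical.

```shell
#!/bin/sh
# Sketch: extract blocks-in (bi) and blocks-out (bo) from saved "vmstat 1"
# output as CSV. Reads the named files, or stdin if none are given.

vmstat_bi_bo() {
    awk '
        BEGIN { print "sample,bi,bo" }
        # Header row: remember which columns hold bi and bo.
        $1 == "r" {
            for (i = 1; i <= NF; i++) {
                if ($i == "bi") bi = i
                if ($i == "bo") bo = i
            }
            next
        }
        # Data rows start with a number; banner lines are skipped.
        bi && $1 ~ /^[0-9]+$/ { n++; print n "," $bi "," $bo }
    ' "$@"
}

# Example: vmstat 1 > /mnt/nfs/vmstat.log   (while the dd test runs)
#          vmstat_bi_bo /mnt/nfs/vmstat.log > bi_bo.csv
```

Note that the first data row vmstat prints reports averages since boot, so it is usually worth discarding before plotting.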
 
            Simon Banton wrote:
What is the recurring performance problem you are seeing?
Pretty much exactly the symptoms described in http://bugzilla.kernel.org/show_bug.cgi?id=7372 relating to read starvation under heavy write IO causing sluggish system response.
I recently graphed the blocks in/blocks out from vmstat 1 for the same test using each of the four IO schedulers (see the PDF attached to the article below):
http://community.novacaster.com/showarticle.pl?id=7492
The test was:
dd if=/dev/sda of=/dev/null bs=1M count=4096 & sleep 5; dd if=/dev/zero of=./4G bs=1M count=4096 &
Despite appearances, interactive responsiveness subjectively felt better using deadline than cfq - but this is obviously an atypical workload and so now I'm focusing on finishing building the machine completely so I can try profiling the more typical patterns of activity that it'll experience when in use.
I find myself wondering whether the fact that the array looks like a single SCSI disk to the OS means that cfq is able to perform better in terms of interleaving reads and writes to the card but that some side effect of its work is causing the responsiveness issue at the same time. Pure speculation on my part - this is way outside my experience.
I'm also looking into trying an Areca card instead (avoiding LSI because they're cited as having the same issue in the bugzilla mentioned above).
If the performance issue is identical to the kernel bug mentioned in the posting then the only real fix that was mentioned was to switch to 32bit from 64bit or to down-rev your kernel, which on CentOS means to go down to 4.5 from 5.0.
I'm trying to get confirmation that the culprit has been isolated, but I have a suspicion that it lies in process scheduling on x86_64 and not in the io scheduler.
And while, yes, the hardware RAID appears as a single disk to the IO scheduler, CFQ makes certain assumptions about a disk's performance characteristics that are single-disk minded.
CFQ is meant to favor reads over writes, which is more important for a single-user workstation than for a multi-user server. A server should handle both fairly while preventing total starvation of either, which is what deadline was designed to do.
So for a server I would use 'deadline' and for a workstation I would use 'cfq'.
I myself am thinking of down-reving to CentOS 4.5 to avoid the x86_64 scheduling issue, but I keep holding out that the issue will be uncovered upstream in time for 5.1...
-Ross
 
            At 12:41 -0400 2/10/07, Ross S. W. Walker wrote:
If the performance issue is identical to the kernel bug mentioned in the posting then the only real fix that was mentioned was to switch to 32bit from 64bit or to down-rev your kernel, which on CentOS means to go down to 4.5 from 5.0.
The irony is that I'm already running 32bit[*], and that the responsiveness problem's worse on 4.5.
S.
* we specifically went for the Opteron 250 so we could stay at 32-bit because some software components we need to use may not yet be 64bit clean. The intention was to migrate later to 64bit on the same hardware, once those wrinkles had been ironed out.
 
            Simon Banton wrote:
At 12:41 -0400 2/10/07, Ross S. W. Walker wrote:
If the performance issue is identical to the kernel bug mentioned in the posting then the only real fix that was mentioned was to switch to 32bit from 64bit or to down-rev your kernel, which on CentOS means to go down to 4.5 from 5.0.
The irony is that I'm already running 32bit[*], and that the responsiveness problem's worse on 4.5.
S.
* we specifically went for the Opteron 250 so we could stay at 32-bit because some software components we need to use may not yet be 64bit clean. The intention was to migrate later to 64bit on the same hardware, once those wrinkles had been ironed out.
Then I don't think your problem is related.
Have you tried calculating the performance of your current drives on paper to see if it matches your "reality"? It may just be that your disks suck...
What is the server going to be doing? What is the workload of your application? It may be that it will work fine for what you need it to do?
-Ross
 
            At 13:03 -0400 2/10/07, Ross S. W. Walker wrote:
Have you tried calculating the performance of your current drives on paper to see if it matches your "reality"? It may just be that your disks suck...
They're performing to spec for 7200rpm SATA II drives - your help in determining which was the appropriate elevator to use showed that.
What is the server going to be doing? What is the workload of your application?
Originally, it was going to be hosting a number of VMWare installations each containing a separate self contained LAMP website (for ease of subsequent migration), but that's gone by the board in favour of dispensing with the VMWare aspect. Now the websites will be NameVhosts under a single Apache directly on the native OS.
The app on each website is MySQL-backed and Perl CGI intensive. DB intended to be on a separate (identical) server. All running swimmingly at present on 4 year old single 1.6GHz P4s with single IDE disks, 512MB RAM and RH7.3 - except at peak times when they're a bit CPU bound. Loadave rarely above 1 or 2 most of the time.
Which is why I'm now focused on getting the non-VMWare approach up and running so I can profile it, instead of getting hung up on benchmarking the empty hardware. I'd never have started if I'd not noticed a terrific slowdown halfway through creating the filesystem when doing an initial CentOS 4.3 install many many weeks ago.
It may be that it will work fine for what you need it to do?
Yeah - but it's the edge cases that bite you. Can't be doing with a production server where it's possible to accidentally step on an indeterminate trigger that sends responsiveness into a nosedive.
S.
 
            Simon Banton wrote:
At 13:03 -0400 2/10/07, Ross S. W. Walker wrote:
Have you tried calculating the performance of your current drives on paper to see if it matches your "reality"? It may just be that your disks suck...
They're performing to spec for 7200rpm SATA II drives - your help in determining which was the appropriate elevator to use showed that.
What is the server going to be doing? What is the workload of your application?
Originally, it was going to be hosting a number of VMWare installations each containing a separate self contained LAMP website (for ease of subsequent migration), but that's gone by the board in favour of dispensing with the VMWare aspect. Now the websites will be NameVhosts under a single Apache directly on the native OS.
Yeah, I wouldn't do VMware guests; it'll just get way too complex way too quickly, as it will turn into more of a grid computing project than a web server project.
The app on each website is MySQL-backed and Perl CGI intensive. DB intended to be on a separate (identical) server. All running swimmingly at present on 4 year old single 1.6GHz P4s with single IDE disks, 512MB RAM and RH7.3 - except at peak times when they're a bit CPU bound. Loadave rarely above 1 or 2 most of the time.
Sounds like the issue is more of a CPU issue than a disk issue, so just upgrading the hardware and OS should make a big difference in itself, but I would profile the SQL queries to make sure they are not trying to bite off more than they need to.
Which is why I'm now focused on getting the non-VMWare approach up and running so I can profile it, instead of getting hung up on benchmarking the empty hardware. I'd never have started if I'd not noticed a terrific slowdown halfway through creating the filesystem when doing an initial CentOS 4.3 install many many weeks ago.
Well when you created the file system the write cache wasn't installed yet right?
And it may be that when you were creating the file system it was right after you created the RAID1 array and the controller may have been still sync'ing up the disks, which will slow things down tremendously.
If you had sync'ing disks and no write-back cache, then you will see terrible writes (and slower reads) until the disks were sync'd up.
It may be that it will work fine for what you need it to do?
Yeah - but it's the edge cases that bite you. Can't be doing with a production server where it's possible to accidentally step on an indeterminate trigger that sends responsiveness into a nosedive.
I agree that it is the edge cases that can come back and bite you; just be sure you don't over-scope those edge cases for situations that will never arise.
-Ross
 
            At 13:49 -0400 2/10/07, Ross S. W. Walker wrote:
Sounds like the issue is more of a CPU issue than a disk issue, so just upgrading the hardware and OS should make a big difference in itself,
Yeah, that was the plan :-) Basically, we worked out what we needed to do (alleviate peak load CPU bottleneck by upgrading hardware), sought what we imagined would be suitable (dual faster CPU, hardware RAID 1, lots of RAM), and then ran into a brick wall with disk performance while testing - something that's never been an issue to date on the existing webservers which have a single IDE disk each.
but I would profile the SQL queries to make sure they are not trying to bite off more than they need to.
Fair point - we've done a lot of database tuning in the 5 years this app's been under development, so that's pretty well covered. With the existing hardware, (the back-end dbserver's a 1GB 1.6GHz P4 with mdadm RAID 1) the dbserver load barely reaches 1 even under peak traffic - we're not SQL- or IO-bound, we're CPU-bound on the front end.
Well when you created the file system the write cache wasn't installed yet right?
True, but there have been many wipes and installs since the BBUs have been available and the same long pauses when the inodes are created (much more noticeable with CentOS 4.5 than 5, but then the default nr_requests is 128 in 5 rather than 8192 in 4.5) that initially drew my attention are still apparent.
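To compare the queue settings between installs, a small helper along these lines could print them for a given device. This is only a sketch: `show_queue_settings` is a hypothetical name, the optional second argument exists purely so the function can be pointed at a test directory instead of the real `/sys`, and the sysfs paths are the ones quoted at the start of this thread.

```shell
# Print the queue settings discussed in this thread for one block device.
# $1 = device name (e.g. sda)
# $2 = sysfs root (optional, defaults to /sys; override only for testing)
show_queue_settings() {
    dev=$1
    sysroot=${2:-/sys}
    for f in device/queue_depth queue/nr_requests; do
        path="$sysroot/block/$dev/$f"
        if [ -r "$path" ]; then
            printf '%s = %s\n' "$f" "$(cat "$path")"
        else
            printf '%s = (not available)\n' "$f"
        fi
    done
}

# Example (as root): show the current values, then try the CentOS 5
# default of 128 on a CentOS 4.5 install.
#   show_queue_settings sda
#   echo 128 > /sys/block/sda/queue/nr_requests
```

A value written this way does not survive a reboot, so it is convenient for A/B testing without committing to a change.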
And it may be that when you were creating the file system it was right after you created the RAID1 array and the controller may have been still sync'ing up the disks, which will slow things down tremendously.
I noted that from the documentation at the outset and did an initial verify of the RAID 1 through the 3ware BIOS before doing the original install. A previous life as a technical author makes me a bit of a RTFM freak :-)
I agree that it is the edge cases that can come back and bite you; just be sure you don't over-scope those edge cases for situations that will never arise.
That's why I'm now building the machine as if there wasn't an issue, so I can hammer it with apachebench and see if I'm tilting at windmills or not.
S.
 
            Simon Banton wrote:
What is the recurring performance problem you are seeing?
Pretty much exactly the symptoms described in http://bugzilla.kernel.org/show_bug.cgi?id=7372 relating to read starvation under heavy write IO causing sluggish system response.
I recently graphed the blocks in/blocks out from vmstat 1 for the same test using each of the four IO schedulers (see the PDF attached to the article below):
http://community.novacaster.com/showarticle.pl?id=7492
The test was:
dd if=/dev/sda of=/dev/null bs=1M count=4096 & sleep 5; dd if=/dev/zero of=./4G bs=1M count=4096 &
Hmmmm, with that workload I think you're going to see performance issues no matter what, as it is using really big request sizes. It reads from /dev/sda for 5 seconds, then at some offset starts writing a large file, and both are sequential, so it is going to turn the IO into 1MB random reads and writes, which on SATA disks is really going to suck badly (actually it'll suck on any disk). Each request is atomic, so it will not start servicing another IO request until the current 1MB IO request is complete, which is a long time in computer terms.
Try running the same benchmark but use bs=4k and count=1048576
This will use a 4k request size, the average VFS IO size, and run up to 4GB.
IO will still end up random but the inter-request latency should be smaller which should provide for a better result.
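For reference, the block count follows directly from the total size divided by the block size, and the same concurrent read/write test can be parameterised on block size. The sketch below assumes the original test layout (read from the raw device, write a 4GB file after a 5 second head start for the reader); `dd_count` and `mixed_io_test` are hypothetical helper names.

```shell
# Number of dd blocks needed to move a given total at a given block size.
# $1 = total bytes, $2 = block size in bytes
dd_count() {
    echo $(( $1 / $2 ))
}

# Concurrent sequential read and write of 4GB each.
# $1 = block size in bytes, $2 = device to read, $3 = file to write
mixed_io_test() {
    count=$(dd_count $((4 * 1024 * 1024 * 1024)) "$1")
    dd if="$2" of=/dev/null bs="$1" count="$count" &
    sleep 5                      # let the reader get going first
    dd if=/dev/zero of="$3" bs="$1" count="$count" &
    wait                         # block until both dd processes finish
}

# Example (not run here): the 4k variant suggested above
#   mixed_io_test 4096 /dev/sda ./4G
```

Keeping the total at 4GB for both block sizes means the two runs move the same amount of data, so the graphs remain directly comparable.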
While these tests are running can you run any processes on another session? How about file system use while running?
<snip>
 
            At 12:59 -0400 2/10/07, Ross S. W. Walker wrote:
Try running the same benchmark but use bs=4k and count=1048576
Just finished doing that now - comparison graphs are here:
http://community.novacaster.com/showarticle.pl?id=7492
While these tests are running can you run any processes on another session?
Yes, but responsiveness is sluggish eg (taken during 4k 'deadline' scheduler test)
# time ls /usr/lib
real 0m15.959s
user 0m0.011s
sys 0m0.016s
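A single timing like this can be turned into a running sample, which makes it easier to see how long the stalls last under each scheduler. The following is a minimal sketch, assuming GNU date (the `%N` nanosecond format is a GNU extension); `elapsed_ms` is a hypothetical helper name.

```shell
# Print how long a command takes, in whole milliseconds.
# Assumes GNU date; %N (nanoseconds) is not available in every date.
elapsed_ms() {
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example: sample directory-listing latency once a second from another
# session while the dd test is running.
#   while sleep 1; do echo "$(date +%T) $(elapsed_ms ls /usr/lib) ms"; done
```

Note that a listing served from the page cache will return quickly regardless of disk load, so the interesting samples are the ones that actually have to hit the disk.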
S.
 
            Simon Banton wrote:
At 12:59 -0400 2/10/07, Ross S. W. Walker wrote:
Try running the same benchmark but use bs=4k and count=1048576
Just finished doing that now - comparison graphs are here:
http://community.novacaster.com/showarticle.pl?id=7492
While these tests are running can you run any processes on another session?
Yes, but responsiveness is sluggish eg (taken during 4k 'deadline' scheduler test)
# time ls /usr/lib
real 0m15.959s
user 0m0.011s
sys 0m0.016s
Well it looks like the CFQ actually was able to get some reads in while the background write was going on so it actually looks better in this workload scenario.
I am going to retract my suggestion that 'deadline' be a general purpose scheduler for servers based on this. Instead I would make the following alternate suggestions.
The issue I had with CFQ is it cannot handle overlapping IO well from multiple threads of the same process, so if your application does that (MySQL?) then it is probably NOT the right scheduler for you and you might be best to consider 'deadline' or 'noop' and put your data on a separate disk/array that doesn't compete with any other process/application.
If an application spawns multiple threads or processes to handle completely separate data workloads (ie not overlapping io from same workload) then you are best using the default 'cfq' I believe.
I would try some web benchmark app next with both cfq and then deadline to see which works better in a web-server environment. For web serving that is read only I suspect that either 'cfq' or 'deadline' will work well, but would like to know the results of your web benchmarks.
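One way to run that comparison is to switch the scheduler through sysfs and repeat the same ApacheBench run under each. The sketch below assumes a kernel that exposes `/sys/block/<dev>/queue/scheduler` for runtime switching (older kernels may only accept the `elevator=` boot parameter); the URL, request count and concurrency are placeholders, and `active_scheduler` and `bench_schedulers` are hypothetical helper names.

```shell
# Print the active scheduler from a sysfs scheduler file, where the active
# entry is shown in square brackets, e.g. "noop anticipatory deadline [cfq]".
# $1 = path to the scheduler file (defaults to the one for sda)
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "${1:-/sys/block/sda/queue/scheduler}"
}

# Run the same ApacheBench load under cfq and then deadline (needs root).
# $1 = device name (e.g. sda), $2 = URL to benchmark (placeholder)
bench_schedulers() {
    for s in cfq deadline; do
        echo "$s" > "/sys/block/$1/queue/scheduler" || return 1
        echo "== $(active_scheduler "/sys/block/$1/queue/scheduler") =="
        ab -n 1000 -c 10 "$2"
    done
}

# Example (not run here):
#   bench_schedulers sda http://localhost/
```

Reading the scheduler back after writing it guards against a silent failure, since the benchmark would otherwise be attributed to the wrong scheduler.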
In the end, since all your "content" will be in mysql and therefore all file system operations will be "read", the whole issue of being able to read while writing a large file isn't very relevant, so I would probably disregard it as an edge case that doesn't fit your workload.
-Ross






