Hi All,
I have a client trying to use a Promise Tech iSCSI array to share 2.8TB via Samba. I have CentOS 4.2 with all updates installed on an IBM server. The installation and setup was pretty straightforward. The Promise box is using Gigabit Ethernet, and is the only device on that net (I think they are using a cross-over cable - I didn't set up the hardware). We're experiencing periodic "stoppages" with some of the Samba users on this server. Their Windows clients, which have drives mapped to the IBM server (which has the iSCSI partitions "mounted"), periodically "pause" for about 30-60 seconds. The machines are NOT locked up, as we can take screenshots, move the mouse, etc, but disk I/O seems "stuck". When it happens, anywhere from 3 to 12 people are affected (but not the other ~80 users).
There are no network errors on either the iSCSI interfaces or the switches and network interfaces. The kernel is not swapping (though the symptoms SEEM a lot like a process getting swapped to disk). CPU usage is not spiking in correlation to the events, as far as we can tell. However, iostat DOES show that %util - the percentage of time the device is busy servicing I/O requests - is saturated. I will paste the iostat info from one of the events below.
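(For reference, the data was captured with something along these lines - this is an assumption on my part, but sysstat's iostat with -x and -t produces exactly the format pasted below:)

    # log extended per-device stats once per second, with timestamps,
    # so the stalls can be correlated against user reports of the pauses
    iostat -x -t 1 >> /var/log/iostat-capture.log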
Has anyone else seen such behavior, and/or do you have suggestions for troubleshooting or otherwise correcting it? Thanks.
-Scott
------------------------------------------------------------------------
This was sent from one of the techs working the problem:
I think I located the problem. Collecting iostat data during the last lockup yielded the following information.
Time: 03:20:38 PM
Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00    0.00   3.09   0.00   24.74    0.00   12.37    0.00     8.00     0.00   0.00   0.00   0.00
sdb        0.00    0.00  85.57   0.00  684.54    0.00  342.27    0.00     8.00     1.03  12.06  12.04 102.99
Time: 03:20:39 PM
Device:  rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s   rkB/s   wkB/s avgrq-sz avgqu-sz  await  svctm  %util
sda        0.00   13.27   0.00  10.20    0.00  187.76    0.00   93.88    18.40     0.00   0.10   0.10   0.10
sdb        0.00    0.00  82.65   0.00  661.22    0.00  330.61    0.00     8.00     1.02  12.23  12.33 101.94
This clearly shows that the device has reached its saturation point: %util was at or above 100% for the entire time that the freeze occurred. The numbers are consistent with that reading - sdb is servicing roughly 85 reads/s at about 12 ms each, which works out to ~1.03 seconds of busy time per second of wall clock, exactly what the %util column reports. I am researching the issue now to see if this is something we can resolve with kernel tweaks or otherwise. Any input regarding the issue is appreciated, thanks.
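In case anyone wants to weigh in on specifics, these are the kinds of knobs I'm looking at (a sketch only - the device name and values are examples, not tested recommendations):

    # request queue depth for the iSCSI LUN (sdb here)
    cat /sys/block/sdb/queue/nr_requests

    # read-ahead, in 512-byte sectors; raising it can help sequential reads
    blockdev --getra /dev/sdb
    blockdev --setra 1024 /dev/sdb

    # on this 2.6.9-era kernel the I/O elevator is normally set at boot,
    # e.g. by appending to the kernel line in grub.conf:
    #   elevator=deadline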