Hi All,
I have a client trying to use a Promise Tech iSCSI array to share
2.8 TB via Samba. I have CentOS 4.2 with all updates installed on an
IBM server. The installation and setup were pretty straightforward. The
Promise box is using Gigabit Ethernet, and is the only device on that
net (I think they are using a cross-over cable - I didn't set up the
hardware). We're experiencing periodic "stoppages" with some of the
Samba users on this server. Their Windows clients, which have drives
mapped to the IBM server (which has the iSCSI partitions "mounted"),
periodically "pause" for about 30-60 seconds. The machines are NOT
locked up, as we can take screenshots, move the mouse, etc., but disk IO
seems "stuck". When it happens, anywhere from 3 to 12 people are
affected (but not the other ~80 users).
There are no network errors on the iSCSI interfaces, the switches, or
the other network interfaces. The kernel is not swapping (though
the symptoms SEEM a lot like a process getting swapped to disk).
The CPU usage is not spiking in correlation with the events, as far
as we can tell. It DOES appear from iostat that the percentage of time
spent servicing IO requests (%util) is saturated. I will paste the
iostat info from one of the events below.
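For anyone wanting to watch for the same thing, here is a rough sketch
of the sampler idea (Python): it reads field 13 of /proc/diskstats
(cumulative milliseconds the device has spent doing I/O) once a second
and prints a timestamped busy percentage, which is essentially iostat's
%util. The device name "sdb" and the one-second interval are
assumptions to adjust for your own box.

#!/usr/bin/env python
# Sketch: log how busy a disk is, once a second, so a freeze leaves a
# timestamped trail. Field 13 of /proc/diskstats is the cumulative
# milliseconds the device has spent doing I/O; the delta between two
# samples, divided by the interval, is essentially iostat's %util.
# Assumption: the iSCSI LUN shows up as "sdb", as in the output below.
import time

DEVICE = "sdb"
INTERVAL = 1.0  # seconds between samples

def io_ticks(device):
    # Cumulative ms spent doing I/O (field 13 of /proc/diskstats).
    for line in open("/proc/diskstats"):
        fields = line.split()
        if len(fields) >= 14 and fields[2] == device:
            return int(fields[12])
    raise ValueError("device %r not in /proc/diskstats" % device)

prev = io_ticks(DEVICE)
while True:
    time.sleep(INTERVAL)
    cur = io_ticks(DEVICE)
    busy = (cur - prev) / (INTERVAL * 10.0)  # ms busy -> % of interval
    print("%s %s %%util=%.1f" % (time.strftime("%H:%M:%S"), DEVICE, busy))
    prev = cur

Run alongside vmstat, something like this should make it obvious
whether a stall lines up with the disk sitting pegged at 100%.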
Has anyone else seen such behavior, and/or do you have suggestions for
troubleshooting or otherwise correcting it? Thanks.
-Scott
------------------------------------------------------------------------
This was sent from one of the techs working the problem:
I think I located the problem. Collecting iostat data during the last
lockup yielded the following information.
Time: 03:20:38 PM
Device:   rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
sda         0.00    0.00   3.09   0.00   24.74    0.00   12.37    0.00      8.00      0.00   0.00   0.00    0.00
sdb         0.00    0.00  85.57   0.00  684.54    0.00  342.27    0.00      8.00      1.03  12.06  12.04  102.99

Time: 03:20:39 PM
Device:   rrqm/s  wrqm/s    r/s    w/s  rsec/s  wsec/s   rkB/s   wkB/s  avgrq-sz  avgqu-sz  await  svctm   %util
sda         0.00   13.27   0.00  10.20    0.00  187.76    0.00   93.88     18.40      0.00   0.10   0.10    0.10
sdb         0.00    0.00  82.65   0.00  661.22    0.00  330.61    0.00      8.00      1.02  12.23  12.33  101.94
This clearly shows that the percentage of CPU time during which I/O
requests were issued to the device (%util) has hit the saturation
point, which is 100%. The utilization stayed at or above 100% for the
entire time the freeze occurred. I am
researching the issue now to see if this is something we can resolve
with kernel tweaks or otherwise. Any input regarding the issue is
appreciated, thanks.
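As a sanity check, those numbers are internally consistent: iostat
derives %util from the request rate and the average service time, so it
should come out to roughly (r/s + w/s) * svctm / 10. Here is that
arithmetic worked through (a quick sketch; the values are copied from
the two sdb samples above):

# Sanity-check the iostat samples: %util should roughly equal
# (r/s + w/s) * svctm / 10, since svctm is in milliseconds and the
# result is a percentage of wall-clock time.
samples = [
    # (r/s + w/s,  svctm,  reported %util) -- from the sdb lines above
    (85.57, 12.04, 102.99),
    (82.65, 12.33, 101.94),
]
for iops, svctm, reported in samples:
    derived = iops * svctm / 10.0
    print("derived %%util = %.2f (iostat reported %.2f)" % (derived, reported))

Both samples land at roughly 102-103%, matching what iostat reported.
Worth noting what kind of saturation this is: avgrq-sz is 8 sectors
(4 KB requests) and avgqu-sz is about 1, so the array is serializing
~85 small reads per second at ~12 ms apiece. That is only ~340 KB/s,
nowhere near gigabit wire speed, which suggests the bottleneck is the
array or the iSCSI path rather than raw network bandwidth.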