[CentOS] SUMMARY : Repair Filesystem prompt , after inode has illegal blocks ; qla2xxx message on reboot

Tue Sep 15 13:25:48 UTC 2009
Ross Walker <rswwalker at gmail.com>

On Sep 14, 2009, at 10:25 PM, "McCulloch, Alan" <alan.mcculloch at agresearch.co.nz> wrote:

> hi All,
>
> thanks for the responses.
>
> After being dropped into the
>
> # Filesystem repair
>
> prompt,
>
> (on account of “inode 27344909 has illegal blocks”)
>
> following a warm reboot (via “reboot”) after finding the SAN
> filesystem in read-only mode yesterday morning (possibly because of
> an HBA fault on the SAN), I ran
>
> fsck -r /data
>
> (Linux version 2.6.18-92.1.18.el5, Red Hat 4.1.2-42, ext3 filesystem)
>
> This took a couple of hours or so, prompting me for various changes,
> all of which I accepted. It appeared to complete OK, but then the
> system would not boot, with the following error from the qla2xxx
> driver:
>
> .
> .
> qla2xxx 0000:05:0d.0: Mailbox command timeout occurred. Scheduling ISP abort.
> qla2xxx 0000:05:0d.0: Mailbox command timeout occurred. Scheduling ISP abort.
> .
> etc
>
> However, after powering down the system and cold-booting, it was able
> to boot and mount the repaired filesystem without any obvious damage,
> but with abnormal, not to mention scary-looking, boot messages and
> ongoing warnings from multipath.
>
> This morning (as I sort of expected) the filesystem had dropped back
> to read-only mode, but meanwhile the source of our woes was
> identified: a fibre port on the SAN controller that was degraded but
> not completely failed. As a result there had been no clean failover
> to the twin controller, and a degraded virtual device was presented
> to the O/S, with consequences for the filesystem.
>
> After that port and controller were quarantined, I did a cold
> power-off reboot of the server. This time the boot looked more normal
> and the filesystem came up without any repair being requested.
>
> (My hypothesis is that in this situation, i.e. when an ext3 filesystem
> has put itself in read-only mode, a warm boot via reboot does not
> cleanly remount the filesystem and apply the journal quite like a cold
> power-off reboot does. I think it is likely that my lengthy session of
> answering “yes” to fsck’s interactive repair the first time around
> simply applied all of the fixes that would have been made
> automatically from the journal had I cold-rebooted in the first place.
> However, that is only a hunch. But I will be making sure to do cold
> power-off reboots in general in future.)
>
> Another lesson is that a sophisticated system of twin SAN controllers
> with failover does not protect against a device that is degrading
> rather than failing completely.
>
> Thanks again for the responses, and sorry if my questions were a bit
> basic, but I have been dropped in a little out of my depth with this
> system.

I always prefer round-robin mpath over fail-over where possible, since a
degraded or failed path simply is not used. There is also the bonus of
twice the bandwidth when both paths are working, which is nice.
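
For illustration only, a minimal /etc/multipath.conf sketch of a
round-robin setup. It assumes the array is active/active across both
controllers, which nothing in this thread confirms, and that no
device-specific entry in a "devices" section overrides these defaults.
Check the storage vendor's recommended settings before using anything
like it.

```
# Sketch only: assumes an active/active array (not confirmed here).
defaults {
        # Put all paths in one priority group so I/O is spread across
        # them; "failover" would instead use one path per group.
        path_grouping_policy    multibus
        # Rotate I/O between the paths in the active group.
        path_selector           "round-robin 0"
}
```

After editing the file, "multipath -ll" shows how the paths are actually
grouped.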

-Ross
