On Sep 14, 2009, at 10:25 PM, "McCulloch, Alan" <alan.mcculloch at agresearch.co.nz> wrote:

> hi All,
>
> thanks for the responses.
>
> After being dropped into the
>
> # Filesystem repair
>
> prompt,
>
> (on account of “inode 27344909 has illegal blocks”)
>
> following warm reboot (via “reboot”) after finding (SAN) filesystem in
> read-only mode yesterday morning (possibly because of HBA fault on SAN),
> I ran
>
> fsck -r /data
>
> (Linux version 2.6.18-92.1.18.el5, Red Hat 4.1.2-42, ext3 filesystem)
>
> This took a couple of hours or so, prompting me for various changes,
> all of which I accepted. This appeared to complete OK, but then the
> system would not boot, with the following error from the qla2xxx
> driver.
>
> .
> .
> qla2xxx 0000:05:0d.0: Mailbox command timeout occurred. Scheduling ISP abort.
> qla2xxx 0000:05:0d.0: Mailbox command timeout occurred. Scheduling ISP abort.
> .
> etc
>
> However after powering down the system and cold-booting, the system
> was able to boot up and mount the repaired filesystem without any
> obvious damage, but with abnormal, not to mention scary looking, boot
> messages and ongoing warnings from multipath.
>
> This morning (as I sort of expected) the filesystem had dropped back
> down to read-only mode, but meanwhile the source of our woes was
> identified: a fibre port on the SAN controller which was degraded but
> not completely failed, so that there had been no clean failover to the
> twin controller, and therefore a degraded virtual device was presented
> to the O/S, with consequence for the filesystem.
>
> After that port and controller was quarantined, this time around I
> did a cold power-off reboot of the server, and this time there was a
> more normal looking boot and the filesystem came up normally without
> any repair being requested.
>
> (My hypothesis is that in this situation, i.e. ext3 filesystem has
> put itself in read-only mode, a warm boot, via reboot, does not
> cleanly remount the filesystem and apply the journal quite like a
> cold power-off reboot does. I think it is likely that the lengthy
> session of me answering “yes” to fsck’s interactive repair, the first
> time around, simply applied all of the fixes that would automatically
> have been done from the journal, had I cold-rebooted in the first
> place. However that is only a hunch. But I will be making sure to do
> cold power-off reboots in general, in future.)
>
> Another lesson is that a sophisticated system of twin SAN controllers
> with failover does not protect against a situation where a device is
> degrading rather than failing completely.
>
> Thanks again for the responses and sorry if my questions were a bit
> basic but I have been dropped in a little out of my depth with this
> system.

I always prefer round-robin mpath over fail-over if possible, as a
degraded or failed path simply is not used. There is also the
twice-the-bandwidth factor when both paths are working, which is nice.

-Ross
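Since the trouble above was first noticed as a filesystem that had quietly gone read-only, a small check along the following lines might help catch it earlier. This is only a sketch, assuming the usual /proc/mounts layout (device, mount point, filesystem type, comma-separated options); the function name read_only_mounts is made up for illustration and is not part of any existing tool.

```python
def read_only_mounts(mounts_text, fstypes=("ext3",)):
    """Return the mount points of the given filesystem types that are mounted 'ro'.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fstype, options = fields[1], fields[2], fields[3]
        # Options are comma-separated; look for an exact 'ro' entry.
        if fstype in fstypes and "ro" in options.split(","):
            found.append(mount_point)
    return found
```

Called with the contents of /proc/mounts (for example from a cron job), it would list any ext3 mounts currently read-only. It says nothing about why the filesystem went read-only, so the kernel log and multipath output would still need checking.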