Replying at the end.
On Apr 4, 2012, at 5:26 AM, Crunch wrote:
On 04/03/2012 05:58 PM, Tony Schreiner wrote:
Two weeks ago I (clean-)installed CentOS 6.2 on a server which had been running 5.7.
There is a 16-disk, ~11 TB data volume running on an Areca ARC-1280 RAID card, with LVM + an xfs filesystem on top. The included arcmsr driver module is loaded.
At first it seemed OK, but within a few hours I started getting I/O error messages on directory listings, and a bit later, when I ran a vgdisplay command, its output contained garbage.
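For anyone who wants to see what I mean, the symptoms looked roughly like this; /data and the VG name are just placeholders for my actual paths:

    # directory listings start failing with I/O errors
    ls /data

    # the kernel log shows the underlying read errors
    dmesg | grep -i error

    # and vgdisplay output for the volume group comes back garbled
    vgdisplay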
The filesystem data are being corrupted. Assuming the original installation was okay, that can only happen through human intervention or hardware failure, and it's a safe assumption to make considering you've reinstalled and it now seems to be okay.
I then ran the volume check from the RAID card BIOS, and it flagged 3 errors. When I restarted the system, things were OK at first, but then the problem reappeared. I ran another volume check and no errors were flagged (I should note the check takes about 9 hours), but upon restarting, the filesystem was OK for a while and then went bad again.
Presumably the card BIOS runs checks only on the firmware and/or the hardware, i.e. the disks and the card itself. The reported errors therefore point to those components.
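One thing you can do without waiting nine hours for a firmware pass is query the disks directly; smartmontools knows how to address drives behind an Areca controller. A rough sketch, assuming the card appears as /dev/sg2 and you want the disk in slot 1 (both will differ on your system):

    # pull SMART data for the disk in slot 1 behind the Areca card
    smartctl -a -d areca,1 /dev/sg2

    # repeat for the other slots; reallocated and pending sector
    # counts are the first fields worth checking
    smartctl -a -d areca,2 /dev/sg2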
Another symptom was that the cli64 RAID management utility, which I got from the Areca site, would just hang.
I would guess the utility is a piece of client code that queries the firmware. Assuming nothing is wrong with the client code, this implies some form of defect in the firmware. It could be unresponsive hardware or corrupt firmware code.
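An easy way to test that theory is to run the utility under strace and see where it blocks. Just a sketch; I don't have the cli64 syntax in front of me, so treat "vsf info" as a stand-in for whatever volume-listing command you normally run:

    # if it hangs inside an ioctl or read against the controller
    # device, the firmware side is the likely culprit
    strace -f -tt ./cli64 vsf info 2>&1 | tail -20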
After a couple of days of this, I decided I could not afford to have this system unavailable, and I reinstalled CentOS 5.8. Everything has been fine since.
The firmware and filesystem may well have corrected the errors on your first pass. But for the corruption to happen again without any detected errors sounds inconsistent; there's something missing here. Maybe the card corrected the errors itself the second time, leaving corruption behind.
I'm not sure what you're saying can be entirely true. What I failed to mention in the original post is that I did not recreate the problematic data volume during either install; it was preserved both for the upgrade and the downgrade. There doesn't appear to be any filesystem corruption independent of the RAID software; xfs_check doesn't discover any.
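To be specific, the check was along these lines, against the unmounted filesystem; the LV path is a placeholder for mine:

    umount /data
    xfs_check /dev/datavg/datalv

    # xfs_repair -n is a read-only dry run and is another way to
    # look for damage without writing anything to the volume
    xfs_repair -n /dev/datavg/datalv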
I'm willing to believe that the RAID firmware is problematic, but it seems to be an issue that shows up under CentOS 6 and not under CentOS 5.
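If anyone wants to compare the two, the loaded arcmsr driver version is easy to pull on each release:

    # report the arcmsr module version and file; run on both the
    # CentOS 5 and CentOS 6 installs
    modinfo arcmsr | egrep '^(filename|version)'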
I'm in the process of reporting this to the bug tracker. As KB mentioned, there is an existing ID, 5517.
Tony Schreiner