[CentOS] CentOS 6.2 + areca raid + xfs problems

Wed Apr 4 15:26:40 UTC 2012
Tony Schreiner <anthony.schreiner at bc.edu>

replying at the end

On Apr 4, 2012, at 5:26 AM, Crunch wrote:

> On 04/03/2012 05:58 PM, Tony Schreiner wrote:
>> Two weeks ago I (clean-)installed CentOS 6.2 on a server which had been running 5.7.
>> 
>> There is a 16 disk = ~11 TB data volume running on an Areca ARC-1280 raid card with LVM + xfs filesystem on it. The included arcmsr driver module is loaded.
>> 
>> At first it seemed ok, but within a few hours I started getting I/O error messages on directory listings, and then a bit later when I ran a vgdisplay command there was garbage in its output.
> 
> The file system data are being corrupted. This can only happen either 
> through human intervention or hardware failure; assuming that the 
> original installation was okay. This is a safe assumption to make 
> considering you've reinstalled and it now seems to be okay.
> 
>> 
>> I then ran the volume check from the RAID card BIOS; it flagged 3 errors. When I restarted the system, things were ok, but then the problem reappeared.
>> I ran another volume check and no errors were flagged (I should note, the check takes about 9 hours). Upon restarting, the file system was ok, but then it went bad again.
> 
> Presumably the card bios runs checks only on the firmware and/or the 
> hardware; say disks and the card itself. The reported errors therefore 
> point to those components.
> 
>> 
>> Another symptom was that the cli64 raid management utility, which I got from the Areca site, would just hang.
> I would guess the utility is a piece of client code that queries the 
> firmware. Assuming nothing is wrong with the client code, this implies 
> some form of defect occurring in the firmware. Could be unresponsive 
> hardware or corrupt firmware code.
> 
>> 
>> After a couple of days of this, I decided I could not afford to have this system unavailable, and I reinstalled CentOS 5.8. Everything has been fine since.
> The firmware and file system may well have corrected the errors on your 
> first pass. But then for the corruption to happen again without any 
> detected errors sounds inconsistent. There's something missing here. 
> Maybe the card corrected the errors itself the second time leaving 
> corruption behind.
> 

I'm not sure what you're saying can be entirely true. What I failed to mention in the original post is that I did not recreate the problematic data volume during either install; it was preserved both for the upgrade and the downgrade. There doesn't appear to be any filesystem corruption independent of the raid software; xfs_check doesn't discover any.

I'm willing to believe that the raid firmware is problematic, but it seems to be an issue with CentOS 6 and not with CentOS 5.

I'm in the process of reporting this to the bug tracker. As KB mentioned, there is an existing report, id 5517.

Tony Schreiner