[CentOS] Areca RAID controller on latest CentOS 7 (1708 i.e. RHEL 7.4) kernel 3.10.0-693.2.2.el7.x86_64

Mon Oct 23 12:13:18 UTC 2017
Noam Bernstein <noam.bernstein at nrl.navy.mil>

> On Oct 22, 2017, at 4:35 PM, Joseph L. Casale <jcasale at activenetwerx.com> wrote:
> -----Original Message-----
> From: CentOS [mailto:centos-bounces at centos.org] On Behalf Of Noam
> Bernstein
> Sent: Sunday, October 22, 2017 8:54 AM
> To: CentOS mailing list <centos at centos.org>
> Subject: [CentOS] Areca RAID controller on latest CentOS 7 (1708 i.e. RHEL
> 7.4) kernel 3.10.0-693.2.2.el7.x86_64
>> Is anyone running any Areca RAID controllers with the latest CentOS 7 kernel,
>> 3.10.0-693.2.2.el7.x86_64?  We recently updated (from
>> 3.10.0-514.26.2.el7.x86_64), and we’ve started having lots of problems.  To add to
>> the confusion, there’s also a hardware problem (either with the controller or
>> the backplane most likely) that we’re in the process of analyzing.  Regardless,
>> we have an ARC1883i, and with the older kernel the system is stable, but
>> with the new kernel it locks up within 1-12 hours of boot, with errors in
>> /var/log/messages that start with things like
>> kernel: arcmsr0: abort device command of scsi id = 0 lun = 0
>> (that is indeed the RAID scsi device) and within a few minutes of those also
>> things like
>> Oct 19 23:06:57 radon kernel: INFO: task xfsaild/dm-9:913 blocked for more
>> than 120 seconds.
> You mention you have hardware problems, what are they?

They’re weird, is what they are.  There’s one slot that’s apparently bad.  It was first showing a failed disk (in the web interface, for example), but the disk itself is apparently fine (as checked by putting other known-good disks into that slot, and putting that disk into other slots or into a different machine).  The disk is currently listed as a hot spare, so it’s not actually being accessed.  Now that slot has apparently spontaneously fixed itself, insofar as it is showing a working disk.  However, the lights that flash as it scans through the slots on boot clearly behave differently for that slot than for all the others (~1 s red flash in the second scan, instead of more like 0.25 s), so I don’t believe that it’s really fixed.  But as far as I can tell, when that slot is empty the array behaves normally, except for these errors, which appear with the new kernel only.

> A write is blocked
> for longer than the host is willing to wait. There are a few sysctl parameters
> that affect this but I'd be more willing to suggest it's related to your hardware
> problems.
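
For reference, a minimal sketch of how one might read the sysctl values that seem most relevant here.  The list of parameters is my assumption, not something named in the message above: kernel.hung_task_timeout_secs controls the "blocked for more than 120 seconds" warning, and the vm.dirty_* values affect how much dirty data can queue up behind slow storage.  The helper names (sysctl_path, read_sysctls) are made up for this example.

```python
from pathlib import Path

# Assumed set of parameters worth looking at; adjust as needed.
SYSCTLS = (
    "kernel.hung_task_timeout_secs",
    "vm.dirty_ratio",
    "vm.dirty_background_ratio",
    "vm.dirty_expire_centisecs",
)


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root) / name.replace(".", "/")


def read_sysctls(names=SYSCTLS, root="/proc/sys"):
    """Return {name: value}; value is None if the entry cannot be read."""
    values = {}
    for name in names:
        try:
            values[name] = sysctl_path(name, root).read_text().strip()
        except OSError:
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sysctls().items():
        shown = value if value is not None else "(unavailable)"
        print(f"{name} = {shown}")
```

This only reads the values, so it should be safe to run on both the old and the new kernel and compare the output.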

As I said, these errors only show up with the latest kernel, so while I agree in principle that it makes sense for it to be related to the hardware problem, it has to be interacting with the kernel somehow as well.
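
To compare the two kernels, something like the following rough sketch could pull the two kinds of messages quoted above out of /var/log/messages.  The regular expressions are modeled only on the two sample lines in this thread, the 15-character timestamp slice assumes the classic syslog "Oct 19 23:06:57" prefix, and the function name scan is made up for this example.

```python
import re
import sys

# Patterns modeled on the two messages quoted in this thread.
ABORT_RE = re.compile(
    r"kernel: (arcmsr\d+): abort device command of scsi id = (\d+) lun = (\d+)"
)
HUNG_RE = re.compile(
    r"kernel: INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def scan(lines):
    """Return (kind, timestamp, subject) tuples for matching log lines."""
    events = []
    for line in lines:
        m = ABORT_RE.search(line)
        if m:
            # subject is the controller name, e.g. arcmsr0
            events.append(("abort", line[:15], m.group(1)))
            continue
        m = HUNG_RE.search(line)
        if m:
            # subject is the blocked task name, e.g. xfsaild/dm-9
            events.append(("hung_task", line[:15], m.group(1)))
    return events


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    with open(path, errors="replace") as fh:
        for kind, stamp, subject in scan(fh):
            print(f"{stamp}  {kind:9s}  {subject}")
```

Run once against the logs from each kernel; if the abort messages never show up under the older kernel, that supports the idea that the kernel is part of the interaction.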


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil