On Oct 22, 2017, at 4:35 PM, Joseph L. Casale jcasale@activenetwerx.com wrote:
-----Original Message-----
From: CentOS [mailto:centos-bounces@centos.org] On Behalf Of Noam Bernstein
Sent: Sunday, October 22, 2017 8:54 AM
To: CentOS mailing list centos@centos.org
Subject: [CentOS] Areca RAID controller on latest CentOS 7 (1708 i.e. RHEL 7.4) kernel 3.10.0-693.2.2.el7.x86_64
Is anyone running any Areca RAID controllers with the latest CentOS 7 kernel, 3.10.0-693.2.2.el7.x86_64? We recently updated (from 3.10.0-514.26.2.el7.x86_64), and we’ve started having lots of problems. To add to the confusion, there’s also a hardware problem (most likely with either the controller or the backplane) that we’re in the process of analyzing. Regardless, we have an ARC1883i, and with the older kernel the system is stable, but with the new kernel it locks up within 1-12 hours of boot, with errors in /var/log/messages that start with things like

    kernel: arcmsr0: abort device command of scsi id = 0 lun = 0

(that is indeed the RAID scsi device) and within a few minutes of those also things like

    Oct 19 23:06:57 radon kernel: INFO: task xfsaild/dm-9:913 blocked for more than 120 seconds.
You mention you have hardware problems, what are they?
They’re weird is what they are. There’s one slot that’s apparently bad. It was first showing a failed disk (in the web interface, e.g.), but the disk is apparently fine (as checked by putting other known good disks into that slot, and putting that disk into other slots or into a different machine), and is currently listed as a hot spare, so it’s not actually being accessed. Now that slot has apparently spontaneously fixed itself, insofar as it is showing as a working disk. However, the lights that flash as it scans through the slots on boot clearly behave differently for that slot than for all the others (~1 s red flash in the second scan, instead of more like 0.25 s), so I don’t believe that it’s really fixed. But as far as I can tell, when that slot is empty the array behaves normally, except for these errors with the new kernel only.
A write is blocked for longer than the host is willing to wait. There are a few sysctl parameters that affect this, but I'd be more willing to suggest it's related to your hardware problems.
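For reference, a minimal Python sketch for checking the current values of a few such sysctls. Which parameters were meant here is not stated, so the list below is an assumption: kernel.hung_task_timeout_secs sets the threshold for the "blocked for more than 120 seconds" warning, and the vm.dirty_* settings influence how much dirty data accumulates before writeback. The helper names (read_sysctl, report) are hypothetical.

```python
from pathlib import Path

# Assumed candidates; the thread does not name specific parameters.
CANDIDATE_SYSCTLS = (
    "kernel.hung_task_timeout_secs",  # threshold for the hung-task warning
    "vm.dirty_ratio",
    "vm.dirty_background_ratio",
    "vm.dirty_expire_centisecs",
)


def read_sysctl(name, root="/proc/sys"):
    """Return the value of a dotted sysctl name as a string, or None if unreadable."""
    path = Path(root).joinpath(*name.split("."))
    try:
        return path.read_text().strip()
    except OSError:
        return None


def report(names=CANDIDATE_SYSCTLS, root="/proc/sys"):
    """Map each sysctl name to its current value (None when not available)."""
    return {name: read_sysctl(name, root) for name in names}


if __name__ == "__main__":
    for name, value in report().items():
        shown = value if value is not None else "(not available)"
        print(f"{name} = {shown}")
```

This only reads the values. Raising kernel.hung_task_timeout_secs would change when the warning is logged, not whether the underlying write stalls.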
As I said, these errors only show up with the latest kernel, so while I agree in principle that it makes sense for it to be related to the hardware problem, it has to be interacting with the kernel somehow as well.
Noam
____________
||
|U.S. NAVAL|
|_RESEARCH_|
LABORATORY

Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil