I have a CentOS 7.9 system with a software RAID 6 root partition. Today something very strange occurred. At 6:45 AM the system crashed. I rebooted, and when the system came up I had multiple emails indicating that 3 out of the 6 drives in the root array had failed. Strangely, I was able to boot into the system and everything was working correctly despite
cat /proc/mdstat
also indicating 3 out of 6 drives had failed. Since the system was up and running despite the fact that more than 2 drives had supposedly failed in the root RAID array, I decided to reboot the system. Actually, I shut it down, waited for the drives to spin down, and then restarted. This time when it came back the 3 missing drives were back in the array, and cat /proc/mdstat showed all 6 drives in the RAID 6 array again. So a few questions:
1.) If 3 out of 6 drives of a RAID 6 array supposedly fail, how does the array still function?
2.) Why would a shutdown/restart sequence supposedly fix the array?
3.) My gut suggests that the RAID array was never degraded and that my system (i.e. cat /proc/mdstat) was lying to me. Any opinions?
Has anybody else ever seen such strange behavior?
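For reference, here is a rough sketch of the checks that should show whether md really kicked the members or just lost track of them momentarily -- assuming the root array is /dev/md0 and the members are partitions like /dev/sd[a-f]2 (placeholder names, substitute the real devices):

    # kernel's current view of the array
    cat /proc/mdstat
    mdadm --detail /dev/md0

    # per-member superblocks: a member that was really kicked out of the
    # array should lag behind the others on its "Events" counter
    for d in /dev/sd[a-f]2; do
        echo "== $d =="
        mdadm --examine "$d" | grep -E 'Update Time|Events|Array State'
    done

    # look for link resets or controller errors around the time of the crash
    grep -iE 'ata[0-9]+|md/raid|I/O error' /var/log/messages | less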
On Wed, 16 Dec 2020 13:57:13 -0700 Paul R. Ganci via CentOS wrote:
My gut suggests that the RAID array was never degraded and that my system (i.e. cat /proc/mdstat) was lying to me. Any opinions?
I wonder if it's a RAM failure in either the main computer or the drive controller. An intermittent RAM failure (or a cold solder joint or something equally hard to track down) could cause all manner of unrepeatable weirdness.
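One quick, if limited, check along those lines -- it only tells you anything if the board has ECC memory and an EDAC driver loaded; otherwise a few passes of memtest86+ from the boot menu is about the best you can do:

    # any EDAC/ECC complaints since boot?
    dmesg | grep -i edac

    # corrected (ce) and uncorrected (ue) ECC error counts per memory controller
    grep . /sys/devices/system/edac/mc/mc*/ce_count \
           /sys/devices/system/edac/mc/mc*/ue_count 2>/dev/null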
I had an issue similar to this years ago when I helped out a former employer on a Dell PowerEdge system with a RAID 5 array (Windows). The system refused to boot, but there were lights on the front of the backplane where the drives slid in, indicating drive fault (amber) or drive OK (green). One of the tests I did was to rearrange the drives among the backplane slots; the same slots went amber regardless of which drives were in them.
The problem wasn't the drives at all; it was the controller card going bad. The IT guy who was there full time ended up shipping the drives off to a recovery service depot, and they recovered the data there, no problem.
When I worked for Sage we had SCSI RAID controller cards with a similar function: the RAID card config was backed up on the drives, and the drive configuration was stored in the RAID controller, so each backed up the other's config.
In the event of a controller card failure, a card of the same model could be put into the system, the config data pulled from the backup copy on the drives, and the system was back up and going again.
Perhaps that is what's happening to your system.
I would take several full bare-metal backups right now (and test-restore the data onto a new system); there may be a looming hardware failure around the corner.
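Something along these lines would at least capture the md layout and a file-level copy before anything else fails -- just a sketch; the device names and the backuphost destination are placeholders to be adjusted for the real setup:

    # record the array layout and each member's superblock, and keep the
    # output somewhere off this machine
    mdadm --detail --scan    >  /root/raid-layout.txt
    mdadm --detail /dev/md0  >> /root/raid-layout.txt
    for d in /dev/sd[a-f]2; do
        mdadm --examine "$d" >> /root/raid-layout.txt
    done

    # file-level copy of the root filesystem to another machine
    rsync -aAXH --exclude={'/proc/*','/sys/*','/dev/*','/run/*','/tmp/*'} \
        / backuphost:/backups/thishost/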
Chris