Christopher Chan wrote:
Funny you should mention software RAID1... I've seen two instances of that
getting silently out-of-sync and royally screwing things up beyond all repair.
Maybe this thread has gone on long enough now?
Not yet :)
Please tell more about your hardware and software. What distro? What kernel? What disk controller? What disks?
I'm interested in this because I have never seen Linux software MD RAID1 failures like this, but some people keep saying they happen frequently.
It could be like Les said - bad RAM. I certainly have not encountered this sort of error on an md RAID1 array.
I'm just wondering why I'm not seeing these failures, or if I've just been lucky so far.
Yeah, you're lucky not to have had bad RAM that passed POST and yet neither brought your system down right from the start nor made it obviously unstable.
On the machine where I had the problem I had to run memtest86 for more than a day to finally catch it. Then, after replacing the RAM and fsck'ing the volume, I still had mysterious problems about once a month until I realized that reads alternate between the two disks, so the fsck pass only saw one copy of any given block and didn't catch everything. I forget the commands to compare and fix the mirroring, but they worked - and I think the CentOS 5.4 update does that periodically as a cron job now. The other worry is that when one drive dies, the surviving drive might have unreadable spots in normally unused areas of the mirror, which will keep a rebuild from working - but the cron job should detect those too, if you check its results.
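
For anyone looking for the compare-and-fix step: on kernels that expose the md sysfs interface it is driven by writing "check" or "repair" to the array's sync_action file and then reading mismatch_cnt once the pass has finished. Below is a rough Python sketch of that, not the exact commands used on the machine above. The array name md0, the helper names and the sysfs_root parameter are my own assumptions for illustration; writing to the real /sys needs root.

```python
from pathlib import Path

# Hypothetical helpers around the md sysfs files; "md0" and the
# sysfs_root parameter are illustrative assumptions, not from the thread.
VALID_ACTIONS = ("check", "repair", "idle")


def md_dir(array: str, sysfs_root: str = "/sys") -> Path:
    """Directory that holds the md attributes for one array."""
    return Path(sysfs_root) / "block" / array / "md"


def request_sync_action(array: str, action: str, sysfs_root: str = "/sys") -> None:
    """Ask md to start a 'check' or 'repair' pass, or go 'idle' to stop one."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    (md_dir(array, sysfs_root) / "sync_action").write_text(action + "\n")


def current_action(array: str, sysfs_root: str = "/sys") -> str:
    """Return what md is doing now, e.g. 'idle' or 'check'."""
    return (md_dir(array, sysfs_root) / "sync_action").read_text().strip()


def mismatch_count(array: str, sysfs_root: str = "/sys") -> int:
    """Mismatch counter from the last pass; only meaningful once the
    array has gone back to 'idle'."""
    return int((md_dir(array, sysfs_root) / "mismatch_cnt").read_text().strip())


if __name__ == "__main__":
    # Start a read-only comparison of the mirrors (run as root).
    request_sync_action("md0", "check")
    print("md0 is now:", current_action("md0"))
    # Later, once current_action("md0") reports "idle":
    #   print(mismatch_count("md0"))
```

A "check" pass only compares and counts; "repair" also rewrites mismatched blocks. Since neither can tell which copy of a mismatched block was the good one, it is worth looking at the count before repairing.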