[CentOS] Looking for a life-save LVM Guru

Sun Mar 1 00:14:45 UTC 2015
Chris Murphy <lists at colorremedies.com>

On Sat, Feb 28, 2015 at 4:29 PM, Valeri Galtsev
<galtsev at kicp.uchicago.edu> wrote:

> You are implying that firmware of hardware RAID cards is somehow buggier
> than software of software RAID plus Linux kernel (sorry if I
> misinterpreted your point).

"Drives, and hardware RAID cards are subject to firmware bugs, just as
we have software bugs in the kernel." makes no assessment of how
common such bugs are relative to each other.

> I disagree: embedded system of RAID card and
> RAID function they have to fulfill are much simpler than everything
> involved into software RAID. Therefore, with the same effort invested,
> firmware of (good) hardware is less buggy.

There's no evidence provided for this. All I've stated is that bugs
happen both in software and in the firmware of hardware RAID cards:
http://www.cs.toronto.edu/~bianca/papers/fast08.pdf

And further there's a widespread misperception that RAID56 (whether
software or hardware) is capable of detecting and correcting
corruption.
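
To illustrate the point about parity, here is a minimal Python sketch. It
covers only the single XOR parity case (RAID5-style), and the block
contents and function names are made up for the example; it is not how
any particular RAID implementation is written. It shows that a scrub can
see that a stripe is inconsistent, but parity alone does not say which
member holds the bad data.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def stripe_is_consistent(blocks, parity):
    """True when the stored parity matches parity recomputed from the data."""
    return xor_parity(blocks) == parity


# Hypothetical three-member stripe plus its parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Silently corrupt one byte of the second block, as a misbehaving drive might.
corrupted = [data[0], b"BXBB", data[2]]

# A scrub notices the mismatch...
mismatch = not stripe_is_consistent(corrupted, parity)

# ...but rebuilding ANY single member from the others plus parity yields a
# consistent stripe, so parity cannot identify which member is wrong.
candidates = []
for i in range(len(corrupted)):
    others = corrupted[:i] + corrupted[i + 1:]
    rebuilt = xor_parity(others + [parity])
    candidates.append(corrupted[:i] + [rebuilt] + corrupted[i + 1:])

all_consistent = all(stripe_is_consistent(c, parity) for c in candidates)
```

Only the candidate that rebuilds the second block restores the original
data; the others are equally "consistent" and wrong, and the array has no
way to tell them apart without a separate checksum.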


> And again, Linux kernel can be
> panicked more likely than trivial embedded system of hardware RAID
> card/box. At least my experience over decade and a half confirms that.

I'd say this is not a scientific sample and therefore unproven. I can
provide my own non-scientific sample: an XServe running OS X with
software raid1 which has never, in 8 years, kernel panicked. Its
longest uptime was over 500 days, and it was only rebooted because a
system upgrade required it. There's nothing special about the XServe
that makes this magic; it's just good hardware with ECC memory,
enterprise SAS drives, and a capable though limited kernel. So there's
no good reason to expect kernel panics. Having them means something is
wrong.

> I have my raids verified once a week. If you don't
> verify them for a year, what happens then: you don't discover individual
> drive degradation until it is too late and larger number than the level of
> redundancy are kicked out because of fatal failures.

The lack of scrubbing is a common problem on software and hardware RAID
alike. Also recognize that software RAID tends to bring along
cheaper drives that aren't well suited for RAID use, whereas people
spending money on hardware RAID tend to invest in appropriate drives.
Those drives avoid problems because they have proper SCT ERC settings.
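
As a rough sketch of what "proper" means here, assuming Linux, where the
SCSI command timer is exposed at /sys/block/<dev>/device/timeout in
seconds, and assuming the SCT ERC value is given in tenths of a second
(the unit smartctl -l scterc reports): the drive should give up on a bad
sector before the kernel gives up on the drive. The function names below
are hypothetical helpers, not part of any tool.

```python
from pathlib import Path


def read_kernel_timeout(device, sysfs_root="/sys"):
    """Return the kernel's command timer for a block device, in seconds."""
    path = Path(sysfs_root) / "block" / device / "device" / "timeout"
    return int(path.read_text().strip())


def erc_fits_kernel_timer(erc_deciseconds, kernel_timeout_seconds):
    """True when the drive reports a read error before the kernel times out.

    erc_deciseconds is the SCT ERC value in tenths of a second; 0 is
    treated as "disabled", in which case the drive may keep retrying for
    longer than the kernel is willing to wait.
    """
    if erc_deciseconds <= 0:
        return False
    return erc_deciseconds / 10 < kernel_timeout_seconds


# Example values (assumptions): 7.0 s ERC against a 30 s kernel timer.
ok = erc_fits_kernel_timer(70, 30)
# ERC disabled: the check fails regardless of the kernel timer.
not_ok = erc_fits_kernel_timer(0, 30)
```

On a real system the second argument would come from something like
read_kernel_timeout("sda"). When the check fails, the RAID layer never
sees a clean read error it could repair from redundancy, which is how a
single bad sector ends up getting a whole drive kicked out.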

>Anyway, these
> horror stories were purely poor sysadmin's job IMHO.

I agree. This is common in any case.


-- 
Chris Murphy