[CentOS] New controller card issues

Thu May 28 18:13:15 UTC 2015
Valeri Galtsev <galtsev at kicp.uchicago.edu>

On Thu, May 28, 2015 11:43 am, Gordon Messmer wrote:
> On 05/28/2015 09:12 AM, Kirk Bocek wrote:
>> I suggest everyone stay away from 3Ware from now on.
>
> My experience has been that 3ware cards are less reliable than software
> RAID for a long, long time.  It took me a while to convince my previous
> employer to stop using them.  Inability to migrate disk sets across
> controller families, data corruption, and boot failures due to bad
> battery daughter cards eventually proved my point.  I have never seen a
> failure due to software RAID, but I've seen quite a lot of 3ware
> failures in the last 15 years.

I strongly disagree with that.

I have a large number of 3ware-based RAIDs in my server room. During the
last 13 years or so I have never had a failure or data loss on any of
these RAIDs, at least not due to hardware (knocking on wood; I guess I
should start calling myself lucky).

Occasionally some researchers (who come mostly from places where they
self-manage machines) bring stories of disasters and data loss. Each of
them, when I looked deep into the details, turned out to be purely due to
a poorly configured hardware RAID in the first place. So I end up telling
them: before telling others that 3ware RAID cards are bad and let you
down, check that what you set up does not contain obvious blunders. Let me
list the major ones I have heard of:

1. Bad choice of drives for RAID. Any "green" (spin-down to conserve
energy) drives are not suitable for RAID. Even drives that do not spin
down but are of poor quality, when they work in parallel (say 16 in a
single RAID unit), have a much larger chance of more than one failing
simultaneously. If you went as far as buying a hardware RAID card, spend
some 10% more on good drives (and buy them from a good source); do not
follow "price grabber".

2. Bad configuration of the RAID itself. You need to run "verification" of
the RAID every so often. My RAIDs are verified once a week. At the very
least this forces the drives to scan the whole surface often, and to
discover and re-allocate bad blocks. If you don't do it for over a year,
you have a fair chance of RAID failure due to several drives failing
(because of accumulated, never-discovered bad blocks) when accessing a
particular stripe... then you lose your RAID with its data. This is purely
a configuration mistake.

3. Smaller, yet still a blunder: having a card without battery backup for
its memory and running the RAID with the cache enabled (in which case the
RAID device is much faster). If in this configuration you yank the power,
you lose the contents of the cache, and the RAID quite likely will be
screwed up big time.
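
To put rough numbers on points 1 and 2 above, here is a minimal
back-of-the-envelope sketch in Python. It is only an illustration under
stated assumptions: drive failures are treated as independent, latent bad
blocks are modeled as a simple Poisson process, and the probabilities and
rates passed in (0.01, 0.05, 0.002) are made-up values, not measurements
from any real drive or controller. The function names are hypothetical.

```python
from math import exp


def p_at_least_two_fail(n_drives, p_single):
    """Probability that two or more of n_drives fail in the same window.

    Assumes independent failures, each with probability p_single
    (an illustrative number, not a vendor figure).
    """
    p_none = (1 - p_single) ** n_drives
    p_one = n_drives * p_single * (1 - p_single) ** (n_drives - 1)
    return 1 - p_none - p_one


def p_latent_error_on_rebuild(n_drives, rate_per_drive_week, weeks_since_verify):
    """Probability that at least one surviving drive holds an undiscovered
    bad block at the moment one drive fails and a rebuild starts.

    Toy Poisson model: bad blocks appear at rate_per_drive_week on each
    drive and stay hidden until the next verification pass.
    """
    surviving = n_drives - 1
    return 1 - exp(-rate_per_drive_week * surviving * weeks_since_verify)


if __name__ == "__main__":
    # Point 1: better drives (lower per-drive failure probability).
    for p in (0.01, 0.05):  # hypothetical per-window failure probabilities
        print(f"16 drives, p={p}: {p_at_least_two_fail(16, p):.4f}")

    # Point 2: weekly verification versus none for a year.
    for weeks in (1, 52):  # hypothetical latent-error rate of 0.002/drive/week
        risk = p_latent_error_on_rebuild(16, 0.002, weeks)
        print(f"{weeks} week(s) since verify: {risk:.3f}")
```

Whatever the real rates are, the shape is the same: the chance of a
multi-drive failure grows quickly with per-drive failure probability and
with the time since the last verification.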

Of course, there are some restrictions. In particular, you cannot always
attach drives to a different card model and have the RAID keep functioning.
3ware cards usually detect that and export the RAID read-only, so you
can copy the data elsewhere, then re-create the RAID so it is compatible
with the internals of the new card.

I do not want to start "software" vs "hardware" RAID wars here, but I
really have to mention this: the software RAID function is implemented in
the kernel. That means your system has to be running for software RAID
to fulfill its function. If you panic the kernel, software RAID is stopped
in the middle of what it was doing, with the rest left undone.

Hardware RAID, on the contrary, does not need the system running. It is
implemented in its own embedded system, and all it needs is power. The
embedded system is rather rudimentary and runs only a single rudimentary
function: it chops the data flow and calculates RAID stripes. I have never
heard of this embedded system panicking (which is mainly a result of its
simplicity).


So, even though I'm strongly in favor of hardware RAID, I still consider
one's choice just a matter of taste. And I would be much happier if
software RAID people had the same attitude as well ;-)


Just my $0.02

Valeri

++++++++++++++++++++++++++++++++++++++++
Valeri Galtsev
Sr System Administrator
Department of Astronomy and Astrophysics
Kavli Institute for Cosmological Physics
University of Chicago
Phone: 773-702-4247
++++++++++++++++++++++++++++++++++++++++