[CentOS] oops, or how to bring a datacenter router down with one setting

Fri Feb 10 21:33:04 UTC 2012
Devin Reade <gdr at gno.org>

--On Friday, February 10, 2012 01:49:05 PM -0600 Les Mikesell
<lesmikesell at gmail.com> wrote:

> I suppose it is possible for a NIC to fail, but I can't recall actually
> ever seeing it.  I've seen lots of complicated failover schemes introduce
> new problems and their own failure modes [...]

+1.

Redundancy is cool.  Redundancy, when needed and properly implemented,
can work and can save your bacon.  However, it is expensive, time
consuming, and significantly increases both the complexity of a
system and the skill needed to analyze problems (or for that matter
predict them and plan for mitigation strategies).  It also needs
to be exercised on a regular basis or, when you need it, you'll
find that someone has made a bad configuration change that prevents
failover.

I, also, have not seen a properly tested NIC fail in quite a few years.
(I'm discounting bad NIC models that don't pass evaluation.) Of course,
just because I've not seen it doesn't mean it can't happen, but I also
don't usually worry about having a redundant SERIAL back-channel for
cluster heartbeat operations, which used to be considered the only
reasonable way to do things.

I do have clusters where bonding is in use, but those have helped not so
much in avoiding NIC failures as in allowing the machines
to continue operating as the network team brings down part of the
redundant switch network for maintenance (or to replace a failed switch,
or when some fool decides that they can unplug a network cable 
briefly so that they can move other cables around).
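
For what it's worth, one cheap way to notice that a bond has quietly
lost a leg (before you actually need the failover) is to look at what
the Linux bonding driver reports under /proc/net/bonding/. The following
is only a rough sketch: the function name and the sample text are made
up for illustration, and it assumes the usual "Slave Interface:" /
"MII Status:" layout of that file.

```python
def down_slaves(bonding_text):
    """Return the slave interfaces whose MII status is not 'up'."""
    down = []
    current = None  # slave whose status line we are waiting for
    for line in bonding_text.splitlines():
        line = line.strip()
        if line.startswith("Slave Interface:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("MII Status:") and current is not None:
            # The bond-level "MII Status" line appears before any slave
            # section, so it is skipped while current is None.
            status = line.split(":", 1)[1].strip()
            if status != "up":
                down.append(current)
            current = None
    return down


# Hypothetical sample in the style of /proc/net/bonding/bond0.
SAMPLE = """\
Bonding Mode: fault-tolerance (active-backup)
Currently Active Slave: eth0
MII Status: up

Slave Interface: eth0
MII Status: up
Link Failure Count: 0

Slave Interface: eth1
MII Status: down
Link Failure Count: 3
"""

if __name__ == "__main__":
    print(down_slaves(SAMPLE))
```

On a real machine you would feed it the contents of the bond's own file
(bond0 is just an assumed name here) and alert on a non-empty result.
It only tells you a link is down; it is no substitute for actually
exercising failover.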

Devin