[CentOS] HA Storage Cookbook?
lists at bitratchet.com
Mon Nov 10 16:37:41 UTC 2008
Les Mikesell wrote:
> But, I think the OP's real problem is that everything is tied to one
> single large drive (i.e. the software mirroring is mostly irrelevant as
I think that Les makes a good point, and I'd like to push it even more
generally: when you provide network file storage, via SAN or NFS, from a
single service instance, you need procedures and/or layers of caching
to deal with outages.
I've been using a DRBD cluster joined by a bonded GigE switch and it
replicates quite quickly. My issues have been related to Heartbeat and
monitoring. We've learned it's very important to practice and tune the
fail-over process, and to base failure detection on file system
performance rather than merely pinging. Also, it's necessary to monitor
application performance to see if your storage nodes are suffering load
issues. I've seen a two-core NFS server perform reliably at a load
average of 6-7, but it starts to get unhappy at any higher load.
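
As a rough illustration of checking file system performance instead of
pinging, here is a minimal Python sketch that times a small write and
fsync on the exported file system. The function names, the 4 KB payload
and the 2-second threshold are my own assumptions, not part of our
Heartbeat setup, and how you hook such a check into your monitoring is
left open.

```python
import os
import tempfile
import time


def probe_write_latency(directory, payload=b"x" * 4096):
    """Return the seconds taken to create, fsync and remove a small file."""
    start = time.monotonic()
    fd, path = tempfile.mkstemp(dir=directory, prefix=".probe-")
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the data to storage so we time real I/O
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def is_healthy(directory, max_seconds=2.0):
    """True if the probe finishes within max_seconds (assumed threshold)."""
    try:
        return probe_write_latency(directory) <= max_seconds
    except OSError:
        # Missing mount point, read-only file system, I/O error, etc.
        return False
```

Note that a hung mount can block the probe itself indefinitely, so in
practice you would probably run it in a separate process and treat a
timeout there as a failure too.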
Ironically, we've had absolutely no hard drive errors yet. Other
hardware failures that come to mind: motherboards and RAM (I've had more
motherboard and RAM failures than drive failures with the systems we've
had) and RAID cards (we've also had to swap out two 3ware RAID
controllers).
Network failures will also take you down if you're looking for uptime:
we recently had a NIC in one of our storage nodes get into a state where
it was spouting 60Mbit of bad packets, which created quite a layer-2
networking issue for two cabinets of web servers and two LDAP servers.
When the LDAP servers couldn't respond, access to the storage nodes
got even worse. It was a black day.
The next thing in our setup has to do with our reliance on NFS. NFS may
not be the best choice to put behind web servers, but it was the
quickest to set up. We're adjusting our application to cache the data
found on NFS nodes on local file systems so that we can handle an NFS
outage.
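
To make the caching idea concrete, here is a minimal Python sketch of a
read-through cache: read from NFS when possible, keep a local copy, and
serve that copy when the NFS read fails. The function and parameter
names are hypothetical and this is not our actual application code. It
also assumes a failed NFS read raises an error (e.g. a soft mount)
rather than hanging forever.

```python
import os


def read_with_local_cache(rel_path, nfs_root, cache_root):
    """Read rel_path from NFS, refreshing a local copy; fall back on failure."""
    src = os.path.join(nfs_root, rel_path)
    cached = os.path.join(cache_root, rel_path)
    try:
        with open(src, "rb") as f:
            data = f.read()
    except OSError:
        # NFS unavailable or file unreadable: serve the last saved copy.
        # This raises OSError itself if nothing was ever cached.
        with open(cached, "rb") as f:
            return f.read()
    os.makedirs(os.path.dirname(cached), exist_ok=True)
    tmp = cached + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
    os.replace(tmp, cached)  # atomic rename: readers never see a partial file
    return data
```

The obvious trade-off is staleness: during an outage you serve whatever
was last read, so this only suits data that can tolerate being a little
old.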
My take is: if you're a competent Linux admin, DRBD will cost you less
and, with appropriate servers, be more maintainable than an appliance.
The challenge, of course, is working out how to reduce response time
when any hardware goes sour.