[CentOS] HA Storage Cookbook?

Mon Nov 10 16:37:41 UTC 2008
Jed Reynolds <lists at bitratchet.com>

Les Mikesell wrote:
> But, I think the OP's real problem is that everything is tied to one 
> single large drive (i.e. the software mirroring is mostly irrelevant as
...

I think that Les makes a good point, and I'd like to push it even more 
generally: when you provide network file storage, via SAN or NFS, from 
a single service instance, you need procedures and/or layers of caching 
to deal with outages.

I've been using a DRBD cluster joined by a bonded GigE switch and it 
replicates quite quickly. My issues have been related to Heartbeat and 
monitoring. We've learned it's very important to practice and tune the 
fail-over process, and to base failure detection on file system 
performance rather than merely pinging. Also, it's necessary to monitor 
application performance to see if your storage nodes are suffering load 
issues. I've seen a two-core NFS server perform reliably at a load 
average of 6-7, but it starts to get unhappy at any higher load.
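
To make "detect on file system performance" concrete, here is a minimal 
sketch in Python. It is not what we actually run, and the function names, 
the 2-second timeout and the load limit of 7 are assumptions for 
illustration only. It times a small write/fsync/read cycle on the mount 
and separately compares the 1-minute load average to a limit.

```python
import os
import time
import uuid


def probe_filesystem(directory, timeout_s=2.0, payload=b"x" * 4096):
    """Write, fsync, read back and delete a small file.

    Returns (healthy, elapsed_seconds). 'healthy' is False if any step
    fails or if the whole cycle takes longer than timeout_s.
    The timeout and payload size are illustrative assumptions.
    """
    path = os.path.join(directory, ".probe-" + uuid.uuid4().hex)
    start = time.monotonic()
    try:
        with open(path, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # push the data to the storage layer
        with open(path, "rb") as f:
            ok = f.read() == payload
    except OSError:
        ok = False
    finally:
        try:
            os.remove(path)
        except OSError:
            pass
    elapsed = time.monotonic() - start
    return (ok and elapsed <= timeout_s), elapsed


def load_is_high(limit=7.0, load=None):
    """Compare the 1-minute load average to a limit (assumed: 7.0)."""
    if load is None:
        load = os.getloadavg()[0]
    return load > limit
```

One caveat: on a hard-mounted NFS export that has hung, the probe itself 
will block, so a real monitor would need to run it under an external 
timeout (for example in a child process that gets killed) and treat 
"no answer" as a failure.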

Ironically, we've had absolutely no hard drive errors yet. The hardware 
failures that come to mind are motherboards and RAM (I've had more of 
those than drive failures with the systems we've had) and RAID cards: 
we've had to swap out two 3ware RAID controllers as well.

Network failures will also take you down if you're looking for uptime: 
we recently had a NIC in one of our storage nodes get into a state where 
it was spouting 60Mbit of bad packets, which created quite a layer-2 
networking issue for two cabinets of web servers and two LDAP servers. 
When the LDAP servers couldn't respond, access to the storage nodes 
got even worse. It was a black day.

The next thing in our setup has to do with our reliance on NFS. NFS may 
not be the best choice to put behind web servers, but it was quickest. 
We're adjusting our application to cache the data found on the NFS nodes 
on local file systems so that we can handle an NFS outage.
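
As a rough illustration of that caching idea (a sketch only, not our 
application code; the function name and the directory layout are 
made up for the example), a read path in Python could prefer NFS, 
refresh a local copy on every successful read, and fall back to that 
copy when NFS fails:

```python
import os


def read_with_local_cache(rel_path, nfs_root, cache_root):
    """Return file bytes, preferring NFS and falling back to a local copy.

    nfs_root and cache_root are hypothetical directories: the NFS mount
    point and a directory on a local disk.
    """
    nfs_path = os.path.join(nfs_root, rel_path)
    cache_path = os.path.join(cache_root, rel_path)
    try:
        with open(nfs_path, "rb") as f:
            data = f.read()
    except OSError:
        # NFS unavailable (or file missing): serve the last cached copy.
        # If there is no cached copy either, this raises OSError.
        with open(cache_path, "rb") as f:
            return f.read()

    # NFS read worked: refresh the local copy. Write to a temp file and
    # rename so readers never see a half-written cache entry.
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    tmp_path = cache_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data
```

This only helps when NFS fails with an error. A hung hard mount would 
block the first open(), so it would have to be combined with mount 
options or timeouts that make the outage visible to the application, 
and it says nothing about how stale a cached copy is allowed to get.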

My take is: if you're a competent Linux admin, DRBD will cost you less 
and, with appropriate servers, be more maintainable than an appliance. 
The challenge of course is working out how to reduce response time when 
any hardware goes sour.


Good luck

Jed