Les Mikesell wrote:
> But, I think the OP's real problem is that everything is tied to one
> single large drive (i.e. the software mirroring is mostly irrelevant as ...

I think that Les makes a good point, and I'd like to push the point even more generally: the problem with providing network file storage, whether via SAN or NFS, is that when you have a single service instance, you need procedures and/or layers of caching to deal with outages.

I've been using a DRBD cluster joined by bonded GigE links through a switch, and it replicates quite quickly. My issues have been related to Heartbeat and monitoring. We've learned it's very important to practice and tune the fail-over process, and to detect failure based on file system performance rather than merely pinging. Also, it's necessary to monitor application performance to see if your storage nodes are suffering load issues. I've seen a two-core NFS server perform reliably at a load of 6-7, but it starts to get unhappy at any higher load.

Ironically, we've had absolutely no hard drive errors yet. Hardware things that come to mind are:

Motherboards: I've had more motherboard and RAM failures than drive failures with the systems we've had.

RAID cards: we've also had to swap out two 3Ware RAID controllers.

Network failures will get you down as well if you're looking for uptime: we recently had a NIC in one of our storage nodes get into a state where it was spouting 60Mbit of bad packets, which created quite a layer-2 networking issue for two cabinets of web servers and two LDAP servers. When the LDAP servers couldn't respond, access to the storage nodes got even worse. It was a black day.

The next thing in our setup has to do with our reliance on NFS. NFS may not be the best choice to put behind web servers, but it was the quickest. We're adjusting our application to cache the data found on the NFS nodes on local file systems so that we can handle an NFS outage.
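To make the local caching idea a bit more concrete, here is a rough Python sketch of what I mean, not our actual code. The function name read_with_local_cache and the nfs_root / cache_root parameters are made up for illustration. It also assumes the NFS mount returns an error when the server is gone (for example a soft mount); with a hard mount the read can simply block, and you would need a timeout around it instead.

```python
import os
import tempfile


def read_with_local_cache(rel_path, nfs_root, cache_root):
    """Read a file from NFS, keeping a local copy to fall back on.

    Hypothetical helper: nfs_root is the NFS mount point, cache_root is a
    directory on a local file system.
    """
    nfs_path = os.path.join(nfs_root, rel_path)
    cache_path = os.path.join(cache_root, rel_path)

    try:
        with open(nfs_path, "rb") as f:
            data = f.read()
    except OSError:
        # NFS read failed: serve the last local copy, if one exists.
        # If there is no local copy either, this raises OSError.
        with open(cache_path, "rb") as f:
            return f.read()

    # NFS read worked: refresh the local copy. Write to a temp file in the
    # same directory and rename it, so readers never see a partial file.
    cache_dir = os.path.dirname(cache_path)
    os.makedirs(cache_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, cache_path)
    return data
```

The trade-off is that during an outage you serve whatever was last read, so this only suits data that can tolerate being somewhat stale.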

My take is: if you're a competent Linux admin, DRBD will cost you less and, with appropriate servers, be more maintainable than an appliance. The challenge of course is working out how to reduce response time when any hardware goes sour.

Good luck,
Jed