On 8/6/10, Les Mikesell <lesmikesell at gmail.com> wrote:
> If you are going to do that, why not also rely on the database engine's
> replication which is aware of the transactions?  Databases rely on
> filesystem write ordering and fsync() actually working - things that
> aren't always reliable locally, much less when clustered.

Mostly because I don't need to set this up for databases only. I can't
just say "ok, the DBMS can ensure transactional integrity as well as
provide remote replication" and ignore the other uses the system has to
support.

A secondary consideration is that I need to be able to add more storage
nodes easily, so it seems to make more sense to use a single technology
that can support both requirements.

Of course, in the end, budget/tech constraints might mean that I have to
cut back somewhere, but it doesn't hurt to plan for things and then know
what I'm cutting out.

> But there are lots of ways things can go wrong, and clustering just adds
> to them.  What happens when your replica host dies?  Or the network to
> it, or the disk where you expect the copy to land?  And if you don't
> wait for a sync to disk, what happens if these things break after the
> remote accepted the buffer copy.

All the nodes will have RAID 1 set up, and I also plan on using at least
2 switches to provide network redundancy. In general, for the planned
setup with minimal replication delay, the only real disaster is if all 4
drives die at the same time. Otherwise, I believe only a small window
exists where a very specific sequence of failures would cause problems,
and even then it would likely affect only one or two transactions because
the window is so short.

However, with a slower replication method like zfs send/receive, which is
a command-line thing, the time window enlarges significantly. Even if the
resulting damage is repairable, it would take far more time to fix simply
because many more transactions could be lost.

> The DB will offer a more optimized alternative.  A VM image won't.

I'm not quite sure what the connection is here. The database runs within
the VM and is stored in the virtual disk. I'm not using a VM as a
substitute for database replication but to segregate functionality. In a
way, it would also allow me to pursue different redundancy arrangements
if the original configuration is not ideal for one of the functions.

> But can you afford to wait for transactional guarantees on all that data
> that mostly doesn't matter?

Possibly, but of course that depends on the results of actual testing
once a final configuration is decided. Data integrity, redundancy and
availability (during working hours anyway) are more important than
absolute performance, since server load is not usually that high. By the
time the customer's load can place significant demands on the hardware,
they should also have the budget for more orthodox/proven/expensive
solutions :D

> So how long do you wait if it is the replica that breaks?  And how do
> you recover/sync later?

I'm not sure which "wait" you are referring to. Is that the wait before
the chosen option flags the node as down, the wait before replacing the
replica machine, or the wait until the system is fully redundant again
with a sync'd replica?

As for the actual recovery/sync, if a drive fails in a storage node, it
would be a straightforward case of replacing the drive and rebuilding the
node's RAID array, wouldn't it? If the storage node itself fails, such as
with a mainboard problem, I'll replace/repair the node and put it back
online, leaving gluster to self-heal/resync. Gluster keeps versioning
data so it would only sync changed files, so that should be pretty fast.
I could also stop both servers at night, externally clone the drives, and
edit the necessary conf files on the new replica, and so avoid mdraid
trying to resync everything.

>> Sorry for the confusion, I don't mean no slow down or expect the
>> underlying fs to be responsible for transactional replication.
>> That's the job of the DBMS, I just need the fs replication not to fail in
>> such a way that it could cause transactional integrity issue as noted
>> in my reply above.
>
> That's a lot to ask.  I'd like to be convinced it is possible.

If I'm not wrong, it's not possible; we can always think of a situation
or sequence of events that would break things. I'm just trying to pick an
option that minimizes the time that window of opportunity exists, which
is why zfs send/receive would not be a good option for live replication.
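
To put rough numbers on that window, here is a back-of-envelope sketch.
The transaction rate and both lag figures are made-up assumptions for
illustration only, not measurements from any gluster or zfs setup, and
the helper name is hypothetical. It simply multiplies the commit rate by
the replication lag to bound how many transactions could exist on the
primary but not yet on the replica when the primary dies.

```python
def transactions_at_risk(tx_per_second: float, replication_lag_seconds: float) -> float:
    """Upper bound on transactions committed locally but not yet replicated."""
    return tx_per_second * replication_lag_seconds


# Hypothetical figures, chosen only to show the scale of the difference.
TX_PER_SECOND = 5.0
scenarios = {
    "near-synchronous replication (assumed 0.2 s lag)": 0.2,
    "periodic snapshot shipping (assumed 15 min interval)": 15 * 60,
}

for name, lag in scenarios.items():
    exposed = transactions_at_risk(TX_PER_SECOND, lag)
    print(f"{name}: up to {exposed:.0f} transactions exposed")
```

Whatever the real numbers turn out to be, the exposure scales linearly
with the lag, which is the reason I'd rather not rely on a periodic
send/receive for live replication.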