On 8/6/10, Les Mikesell <lesmikesell at gmail.com> wrote:
> But even if you have live replicated data you might want historical
> snapshots and/or backup copies to protect against software/operator
> failure modes that might lose all of the replicated copies at once.

That we already do: daily backups of the database, configurations and,
where applicable, website data.  These are kept for two months before
dropping to fortnightly archives, which are then offloaded and kept
for years.

> What you want is difficult to accomplish even in a local file system.
> I think it would be unreasonably expensive (both in speed and cost)
> to put your entire data store on something that provides both
> replication and transactional guarantees.  I'd like to be convinced
> otherwise, though...  Is it a requirement that you can recover your
> transactional state after a complete power loss or is it enough to
> have reached the buffers of a replica system?

For the local side, I can rely on ACID-compliant database engines such
as InnoDB on MySQL to maintain transactional integrity.  What I don't
want is this: a transaction is committed on the primary disk, output is
sent to the user for something supposedly unique such as a serial
number, and then the primary server dies before the replication service
(in this case, the delayed replication of zfs send/receive) kicks in.

For DRBD and gluster, if I'm not mistaken, unless I deliberately set
otherwise, a write must have at least reached the replica's buffers
before it is considered committed.  So that scenario is unlikely to
arise, and I don't see it as a problem with using them as the machine
replication service, compared to the unknown delay of replicating with
zfs send/receive.

While I'm using a DB as the example, the same issue applies to the VM
disk image: the upper layer cannot be told a write is done until it has
at least been sent out to the replica system.  The way I see it, under
DRBD or gluster replication I would only have a consistency issue if
the replica dies after receiving the write, followed by the primary
dying after it has received the ack, reported the result to the user,
AND lost both drives in its mirror.  I know it's not possible to
guarantee 100%, but I can live with that kind of probability, as
opposed to a delay of several seconds in which several
transactions/changes could have taken place before a replica receives
an update.

>> In most cases, I'll expect the drives would fail first than the
>> server. So with the propose configuration, I have for each set of
>> data, a pair of server and 2 pairs of mirror drives. If server goes
>> down, Gluster handles self healing and if I'm not wrong, it's smart
>> about it so won't be duplicating every single inode. On the drive
>> side, even if one server is heavily impacted by the resync process,
>> the system as a whole likely won't notice it as much since the other
>> server is still at full speed.
>
> I don't see how you can have transactional replication if the servers
> don't have to stay in sync, or how you can avoid being slowed down by
> the head motion of a good drive being replicated to a new mirror.
> There's just some physics involved that don't make sense.

Sorry for the confusion.  I don't mean there is no slowdown, nor do I
expect the underlying fs to be responsible for transactional
replication; that's the job of the DBMS.  I just need the fs
replication not to fail in a way that could cause a transactional
integrity issue, as noted in my reply above.
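To make the "reached the replica before it's considered committed"
point concrete, this is roughly what I have in mind for DRBD.  A
minimal resource sketch; the hostnames, devices and addresses are
placeholders for illustration, not our actual layout:

  resource r0 {
    protocol C;      # write returns only after it reaches the peer's disk;
                     # protocol B acks once it reaches the peer's buffer,
                     # protocol A is fully asynchronous
    on node-a {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.0.1:7788;
      meta-disk internal;
    }
    on node-b {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.0.2:7788;
      meta-disk internal;
    }
  }

With protocol B or C, the "serial number handed out but never
replicated" scenario shouldn't arise unless both nodes are lost at more
or less the same time, which is the residual risk I said I can live
with.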
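For comparison, the zfs send/receive replication I was considering is
basically a snapshot-and-ship loop, something like the sketch below
(filesystem name, replica hostname and the state file are made up for
illustration).  Everything committed between the previous snapshot and
the completion of the next send exists only on the primary, which is
exactly the window I'm worried about:

  #!/bin/sh
  # minimal sketch of a cron-driven incremental replicate
  FS=tank/data
  PREV=$(cat /var/run/last-rep-snap)     # last snapshot already on the replica
  NOW=rep-$(date +%Y%m%d%H%M%S)

  zfs snapshot $FS@$NOW
  # ship only the delta since the last replicated snapshot; anything
  # written after $PREV and before this finishes lives only on the primary
  zfs send -i $FS@$PREV $FS@$NOW | ssh replica zfs receive -F $FS
  echo $NOW > /var/run/last-rep-snap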
Also, I expect the impact of a rebuild to be smaller, since gluster can
be configured (temporarily or permanently) to prefer reading from a
particular volume (node).  Responsiveness should still be good (just
that the theoretical read bandwidth is halved), and head motion on the
rebuilding node is reduced because fewer reads are demanded from it.

> As far as I know, linux md devices have to rebuild completely. A raid1

Darn, I was hoping there was the equivalent of the "assemble but do not
rebuild" option which I had on fakeraid controllers several years back.
But I suppose if we clone the drive externally and throw it back into
service, it still helps with reducing the degradation window, since it
is an identical copy even if md doesn't know it.
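On the clone-the-drive idea, the sequence I had in mind is roughly the
one below (device names are placeholders, and it assumes the copy is
made on another box, not the live server).  md will still run its full
resync when the disk is added back, but at least the disk already
carries a near-identical copy of the data while that resync runs:

  # block-copy the good member onto the replacement disk, offline
  dd if=/dev/sdb of=/dev/sdc bs=1M
  # clear the copied md superblock so the array doesn't see a stale member
  mdadm --zero-superblock /dev/sdc1
  # add it back; md still treats it as new and resyncs, as Les says
  mdadm /dev/md0 --add /dev/sdc1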
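And circling back to the read-preference point above: if I'm reading
the gluster docs right, the replicate translator takes a read-subvolume
option, so a volfile fragment along these lines (brick names are
placeholders) should steer reads to the healthy node while the other
one heals:

  volume mirror-0
    type cluster/replicate
    # prefer reads from brick-a while brick-b is resyncing/healing
    option read-subvolume brick-a
    subvolumes brick-a brick-b
  end-volume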