--On Thursday, July 03, 2014 04:47:30 PM -0400 Stephen Harris <lists at spuddy.org> wrote:

> On Thu, Jul 03, 2014 at 12:48:34PM -0700, Lists wrote:
>> Whatever we do, we need the ability to create a point-in-time history.
>> We commonly use our archival dumps for audit, testing, and debugging
>> purposes. I don't think PG + WAL provides this type of capability. So at
>> the moment we're down to:
>
> You can recover WAL files up until the point in time specified in the
> restore file
>
> See, for example
>
> http://opensourcedbms.com/dbms/how-to-do-point-in-time-recovery-with-postgresql-9-2-pitr-3/

I have to back up Stephen on this one:

1. The most efficient way to get minimal diffs is generally to have the
program that understands the semantics of the data make the diffs. In the
DB world this is typically some type of baseline + log shipping. It comes
in various flavours and names, but the concept is the same across the
various enterprise-grade databases. Just as algorithmic changes to an
application will always beat tuning OS-level parameters, doing "dedup" at
the application level (where the capability exists) is always going to be
more efficient than trying to do it at the OS level. (Sketch [1] below
shows what this looks like in PostgreSQL.)

2. Recreating a point-in-time image for audits, testing, etc. then
becomes a matter of exercising your recovery/DR procedures (which is a
very good side effect). Want to do an audit? Recover the db by starting
with the baseline and rolling the log forward to the desired point
(sketch [2] below).

3. Although rolling the log forward can take time, you can strike a
suitable tradeoff between recovery time and disk space by periodically
taking a new baseline (weekly? monthly? it depends on your write load).
Anything older than that baseline is then only of interest for audit
data/retention purposes, and no longer factors into the recovery/DR/test
scenarios (sketch [3] below).

4. Using baseline + log shipping generally results in smaller storage
requirements for offline/offsite backups. (Don't forget that you're not
exercising your DR procedure unless you sometimes recover from your
offsite backups, so it may be a good policy that all audits are performed
from recovery of offsite media only.)

5. With the above mechanisms in place, there's basically zero need for
block- or file-based deduplication, so you can save yourself that level
of complexity / resource usage. You may decide that filesystem-level
snapshots of the filesystem holding the log files still play a part in
your backup strategy, but that's separate from the dedup issue.

Echoing one of John's comments, I would be very surprised if doing dedup
on database-type data realized any significant benefit under common
configurations/loads.

Devin
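
[1] To make point 1 concrete, here is a minimal sketch of the baseline +
log-shipping setup on PostgreSQL (the 9.x series the link above covers).
All paths, file names, and the backup schedule here are mine, purely for
illustration:

    # postgresql.conf -- continuous WAL archiving (the log-shipping half)
    wal_level = archive        # spelled "replica" on 9.6 and later
    archive_mode = on
    # Copy each completed WAL segment to the archive; never overwrite.
    archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

    # Take a baseline: a compressed tar of the whole cluster.
    # Run this weekly/monthly (point 3) from cron or similar.
    pg_basebackup -D /backup/base/$(date +%Y%m%d) -F t -z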
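
[2] Rolling forward to a point in time (point 2). Restore the baseline
into a scratch data directory, drop a recovery.conf like the following
into it, and start the server. (On PostgreSQL 12 and later these settings
moved into postgresql.conf plus an empty recovery.signal file.)

    # recovery.conf -- replay archived WAL up to the moment of interest
    restore_command = 'cp /backup/wal/%f %p'
    recovery_target_time = '2014-06-30 23:59:59'

Recovery stops once it reaches the target time, leaving you a queryable
image of the database as it stood at that moment -- exactly what you want
for an audit or test copy.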
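
[3] Expiring history once a new baseline is in place (points 3 and 4).
pg_archivecleanup ships with PostgreSQL (as a contrib module on older
releases) and deletes archived WAL that logically precedes a given file;
handing it the backup history file that the newest base backup wrote into
the archive keeps exactly the segments that baseline still needs (the
file name below is illustrative):

    pg_archivecleanup /backup/wal 000000010000000000000010.00000020.backup

Everything older than that is then audit/retention material only, and can
move to offline/offsite media.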