[CentOS] block level changes at the file system level?

Fri Jul 4 20:41:05 UTC 2014
Devin Reade <gdr at gno.org>

--On Thursday, July 03, 2014 04:47:30 PM -0400 Stephen Harris 
<lists at spuddy.org> wrote:

> On Thu, Jul 03, 2014 at 12:48:34PM -0700, Lists wrote:
>> Whatever we do, we need the ability to create a point-in-time history.
>> We commonly use our archival dumps for audit, testing, and debugging
>> purposes. I don't think PG + WAL provides this type of capability. So at
>> the moment we're down to:
>
> You can recover WAL files up until the point in time specified in the
> restore file
>
> See, for example
>
> http://opensourcedbms.com/dbms/how-to-do-point-in-time-recovery-with-postgresql-9-2-pitr-3/

I have to back up Stephen on this one:

1. The most efficient way to get minimal diffs is generally to have
   the program that understands the semantics of the data produce
   them.  In the DB world this is typically some form of
   baseline + log shipping.  It comes in various flavours and under
   various names, but the concept is the same across the
   enterprise-grade databases.

   Just as algorithmic changes to an application will almost always
   buy more performance than tuning OS-level parameters, doing
   "dedup" at the application level (where the capability exists) is
   almost always going to be more efficient than trying to do it at
   the OS level.
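
   For PostgreSQL (which is what the original poster is running), a
   minimal sketch of that baseline + log shipping setup looks roughly
   like the following; the paths are placeholders, not recommendations:

      # postgresql.conf -- turn on WAL archiving (9.2-era settings)
      wal_level = archive
      archive_mode = on
      archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

      # then take the baseline while the cluster is running
      pg_basebackup -D /backup/base/20140704 -Ft -z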

2. Recreating a point-in-time image for audits, testing, etc., then
   becomes the process of exercising your recovery/DR procedures (which
   is a very good side effect).  Want to do an audit?  Recover the
   db by starting with the baseline and rolling the log forward to
   the desired point.
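
   Continuing the sketch: to recover to a point in time, you restore
   the baseline into a scratch data directory and drop a recovery.conf
   (pre-9.5 style) into it; the timestamp and paths below are examples:

      restore_command      = 'cp /backup/wal/%f %p'
      recovery_target_time = '2014-07-01 00:00:00'

   Start postgres against that directory and it replays the archived
   WAL up to the requested timestamp, giving you the point-in-time
   image to audit or test against.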

3. Although rolling the log forward can take time, you can find a
   suitable tradeoff between recovery time and disk space by
   periodically taking a new baseline (weekly?  monthly?  depends on
   your write load).  Then anything older than that baseline is only
   of interest for audit data/retention purposes, and no longer
   factors into the recovery/DR/test scenarios.
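
   Automating the periodic baseline can be as simple as a cron entry
   (schedule, user, and paths hypothetical; note the escaped % signs
   that cron requires):

      # /etc/cron.d/pg-baseline -- new baseline every Sunday at 02:00
      0 2 * * 0  postgres  pg_basebackup -D /backup/base/$(date +\%Y\%m\%d) -Ft -z

   Once a new baseline exists, the older baselines and the WAL segments
   that predate the newest one only matter for your retention policy.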

4. Using baseline + log shipping generally results in smaller storage
   requirements for offline / offsite backups.  (Don't forget that
   you're not exercising your DR procedure unless you sometimes
   recover from your offsite backups, so it might be good to have a
   policy that all audits are performed based on recovery from
   offsite media only.)
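
   Shipping that offsite is then just a matter of copying the latest
   baseline and the WAL archive, e.g. (host and paths hypothetical):

      rsync -a /backup/base/ backuphost:/offsite/pgsql/base/
      rsync -a /backup/wal/  backuphost:/offsite/pgsql/wal/

   which is typically far smaller than shipping full nightly dumps.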

5. With the above mechanisms in place, there's basically zero need
   for block- or file-based deduplication, so you can save yourself
   that complexity and resource usage.  You may decide that
   filesystem-level snapshots of the filesystem holding the log files
   still play a part in your backup strategy, but that's separate
   from the dedup issue.
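
   If you do keep filesystem snapshots of the WAL/log filesystem in
   the mix, on CentOS that is usually an LVM snapshot; a rough sketch,
   with hypothetical volume names:

      lvcreate --snapshot --size 2G --name wal_snap /dev/vg_data/lv_wal
      mount -o ro /dev/vg_data/wal_snap /mnt/wal_snap
      # ... copy the snapshot contents to backup media ...
      umount /mnt/wal_snap
      lvremove -f /dev/vg_data/wal_snap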

Echoing one of John's comments, I would be very surprised if dedup on
database-type data realized any significant benefit for common
configurations/loads.

Devin