[CentOS] Deduplicated archives via hardlinks [Was: XFS or EXT3 ?]

Fri Dec 3 22:07:06 UTC 2010
Les Mikesell <lesmikesell at gmail.com>

On 12/3/2010 3:14 PM, Adam Tauno Williams wrote:
>
> I know nothing about backuppc;  I don't use it.  But we use rsync with
> the same concept for a deduplicated archive.

Backuppc is a couple of perl scripts, one of which happens to 
re-implement rsync in a way that lets it use stock rsync on the remote 
while transparently accessing a compressed copy on the server side.  It 
can also use tar or samba to copy files in, then does the same 
compression/dedup operation.

>> (for deduplication) with versioning, I'd have to assume the archive
>> volume gets really messy after awhile, and further, something like that
>> is pretty darn hard to make a replica of it.
>
> I don't see why;  only the archive is deduplicated in this manner, and
> it certainly isn't "messy".  One simply makes a backup [for us that
> means to tape - a disk is not a backup] of the most current snapshot.

It does get messy because backuppc archives typically have millions of 
hardlinked files.  It doesn't just hardlink between subsequent runs of 
the same machine, it hardlinks all files with identical content, whether 
they come from the same machine or from others, using a pool directory 
of hashed filenames as a common link to match them up quickly.
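
To make the pool idea concrete, here is a rough sketch of content-hash 
pooling with hard links.  It is not backuppc's actual code: the function 
name dedup_into_pool is made up for illustration, sha256sum is only a 
stand-in for backuppc's own hashing scheme, and collision handling and 
compression are left out.

```shell
#!/bin/bash
# Hypothetical helper (not part of backuppc): dedup_into_pool POOL FILE
# Replaces FILE with a hard link to a pool entry named after its content hash.
dedup_into_pool() {
    local pool=$1 file=$2 hash entry
    # Assumption: sha256sum stands in for backuppc's real hashing scheme.
    hash=$(sha256sum "$file" | cut -d' ' -f1) || return 1
    entry="$pool/$hash"
    mkdir -p "$pool"
    if [ -e "$entry" ]; then
        # Already linked to the pool entry: nothing to do.
        [ "$file" -ef "$entry" ] && return 0
        # Identical content is already pooled: point FILE at the pooled inode.
        ln -f "$entry" "$file"
    else
        # First time this content is seen: give the pool a link to it.
        ln "$file" "$entry"
    fi
}
```

Running it over every file from every host leaves one inode per distinct 
content, with the pool name acting as the shared link, no matter which 
machine or directory the copies came from.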

> The script just looks like -
>
> export ROOT="/srv/cifs/Arabis-Red"
> export STAMP=`date +%Y%m%d%H`
> export LASTSTAMP=`cat $ROOT/LAST.STAMP`
> mkdir $ROOT/$STAMP
> mkdir $ROOT/$STAMP/home
>
> nice rsync --verbose --archive --delete --acls \
>        --link-dest $ROOT/$LASTSTAMP/home/ \
>        --numeric-ids \
>        -e ssh \
>          archivist@arabis-red:/home/ \
>            $ROOT/$STAMP/home/ \
>            > $ROOT/$STAMP/home.log 2>&1
>
> echo $STAMP > $ROOT/LAST.STAMP

But that won't match up multiple copies of the same file in different 
locations or help with many machines with mostly-duplicate content, 
since --link-dest only compares each file against the same path in the 
previous snapshot. The backuppc scheme works pretty well in normal 
usage, but most file-oriented approaches to copying the whole backuppc 
archive have scaling problems because they have to track all the inodes 
and names to match up the hard links.
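
For a feel of how much state that is, the sketch below counts the 
distinct multiply-linked inodes under a directory, which is roughly the 
number of entries a file-level copier (rsync with --hard-links, for 
example) has to remember so it can recreate the links.  The function 
name count_linked_inodes is made up for illustration, and it assumes GNU 
find for the -printf option.

```shell
#!/bin/bash
# Hypothetical helper: count_linked_inodes DIR
# Prints how many distinct inodes under DIR have more than one name.
# Note: -links +1 looks at the inode's total link count, so names that
# live outside DIR are counted toward it as well.
count_linked_inodes() {
    # %D = device number, %i = inode number (GNU find extensions).
    find "$1" -type f -links +1 -printf '%D:%i\n' | sort -u | wc -l
}
```

Pointing it at the top of a pooled archive gives a rough idea of the 
table size involved before attempting a file-level copy.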

-- 
    Les Mikesell
     lesmikesell at gmail.com