Les Mikesell wrote:
On Mon, 2006-11-06 at 18:42 +0000, Peter Crighton wrote:
You wrote "Hardlinks are key to this backup strategy. Using cp -al creates hardlinks to files, and this simple command is what does all the heavy lifting for daily and weekly backups. Wikipedia has a very good explanation on how hardlinks work. In a nutshell, when there's a hardlink pointing to a file from the hourly directory, to a file in the current directory, and that current file gets deleted, all the links that point to that now deleted current file gets the file data 'pushed' back towards all the links. I'll have to think how to explain this better."
Do you mean that the hourly files are written when created, the hardlink for the daily doesn't actually copy the file (it simply makes a link), but if the file is set to be deleted from its location (because it's gone from the server) then it is actually moved so that it still exists in the daily backup but is removed from the hourly?
Think of all directory entries as links. The real entries that map disk space to files are inodes and links are names pointing to the inodes. There can be any number - including 0 - of links to an inode. The space is not released for re-use until the link count goes to 0 and no process has the file open. So hardlinks are just multiple names pointing to the same data, and the data doesn't go away until the last name is removed.
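A quick way to see this at a shell prompt (the file names here are just examples):

    echo hello > original
    ln original extra        # a second name (hardlink) for the same inode
    ls -li original extra    # same inode number, link count of 2 on each
    rm original              # removes one name, not the data
    cat extra                # still prints "hello"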
You did much better explaining what's going on with hardlinks than I did. I'm going to have to rewrite that part of the blog a few times before it reads better. I can picture it all in my head, but describing how it works is another matter.
Note that this only works as a backup if the original filename is removed. If it is overwritten or truncated instead, all links now point to the changed version.
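A small shell illustration of that caveat (again, made-up file names):

    echo "version 1" > original
    ln original snapshot          # snapshot shares the same inode
    echo "version 2" > original   # overwrites/truncates the same inode in place
    cat snapshot                  # prints "version 2" -- the "backup" changed too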
This is true if you're doing it with only filesystem tools, but this system is using rsync. What's happening is the cp -al occurs first, filling the hourly directory with hardlinks that point at the same files as the current directory; then rsync is run to update current. Because rsync creates a new temp file whenever a file changes, the original name is deleted and its data is 'pushed' back to any hardlinks still pointing at it. Rsync then renames the temp file to the original file name, thereby ensuring the hardlinks always hold the previous copy of any changed file. With rsync running with --delete, any files deleted on the source server also get deleted out of current on the backup server, and again the snapshots' hardlinks keep the deleted files' data. That's how this system creates incremental backups of only the changed data, yet with the hardlinks it looks like full backups are made each and every time. Really saves disk space, that's for sure!
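Roughly, the sequence looks like this (the paths, host name, and rotation depth are made up for illustration; the blog's actual script may differ):

    rm -rf hourly.3                           # drop the oldest snapshot
    mv hourly.2 hourly.3                      # shuffle the others down
    mv hourly.1 hourly.2
    cp -al current hourly.1                   # hardlink "copy" of current -- nearly free
    rsync -a --delete source:/data/ current/  # update current in place
    # rsync writes each changed file to a temp file and renames it over the old
    # name, so the hardlinks in hourly.* keep the previous versions of the files.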
Hope this clears things up...
Mark
On Mon, 2006-11-06 at 14:54 -0800, Mark Schoonover wrote:
That's how this system creates incremental backups of only the changed data, yet with the hardlinks it looks like full backups are made each and every time. Really saves disk space, that's for sure!
BackupPC is even more extreme in the space savings. It first compresses the files, then detects duplicates using an efficient hashing scheme and links all duplicates to one pooled copy, whether they came from the same source or not. It includes a custom rsync on the server side that understands the compressed storage format but works with stock versions on the remote side, so you don't need any special client software. And it has a nice web interface for browsing the backup archive and doing restores.
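Stripped way down, the pooling idea is something like this (a rough sketch of hash-based deduplication with hardlinks, not BackupPC's actual code; the paths are made up):

    pool=/backups/pool
    mkdir -p "$pool"
    for f in /backups/current/*; do
        [ -f "$f" ] || continue                   # only regular files
        hash=$(sha1sum "$f" | awk '{print $1}')   # name pool entries by content hash
        if [ -e "$pool/$hash" ]; then
            ln -f "$pool/$hash" "$f"              # duplicate: relink to the pooled copy
        else
            ln "$f" "$pool/$hash"                 # first occurrence: add it to the pool
        fi
    done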