Ruslan Sivak wrote:
This is a bit different than what I was proposing. I know that BackupPC already does this at the file level, but I want a filesystem that does it at the block level. File-level de-duplication only helps if you're backing up multiple systems that all have exactly the same files; block level would help a lot more, I think. You'd be able to do a full backup every night and have it take up only about the same space as a differential backup. Things like virtual machine disk images, which are often clones of each other, could take up only a small additional amount of space per clone, proportional to the changes made to that disk image.
Well, then I would look at backup software that does block-level de-duplication. Even if the file system did this, the backup software would re-create the duplicate data as it read the files, unless it was intimately married to the file system, which makes things a little too proprietary.
You will find that de-duplication can happen at many different levels here. I was proposing it for near-line data at rest; far-line or archival data at rest would be a different scenario. Near-line needs to be more performance-conscious than far-line.
Nobody really answered this, so I'll ask again. Is there a Windows version of FUSE? How does one test a FUSE filesystem while developing it? It would be nice to just be able to run something from Eclipse once you've made your changes and have a drive mounted and ready to test. Being able to debug a filesystem while it's running would be great too. Anyone here with experience building FUSE filesystems?
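For the testing/debugging part of that question, here is a minimal sketch, assuming the fusepy Python bindings (an assumption on my part, not something mentioned in the thread): a toy read-only filesystem mounted in the foreground with debug output, so it can be launched straight from an IDE and every request libfuse dispatches gets printed.

# toy_fs.py -- minimal sketch using the fusepy bindings (an assumption, not from
# the thread). Run as: python toy_fs.py /mnt/toy
import errno
import stat
import sys

from fuse import FUSE, FuseOSError, Operations


class HelloFS(Operations):
    """A read-only filesystem with a single file, /hello."""

    DATA = b"hello from a toy FUSE filesystem\n"

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/hello":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=len(self.DATA))
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "hello"]

    def read(self, path, size, offset, fh):
        return self.DATA[offset:offset + size]


if __name__ == "__main__":
    # foreground=True keeps the process attached to the terminal/IDE so you can
    # set breakpoints; debug=True makes libfuse print every incoming request.
    FUSE(HelloFS(), sys.argv[1], foreground=True, debug=True)

Running it in the foreground means you can attach a debugger to the live process while poking at the mount point from another shell.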
While FUSE is a distinctly Linux development, Windows has had installable file system filters for a long time. These work a lot like stackable storage drivers in Linux and are the basis of a lot of storage tools on Windows, including anti-virus software (as well as rootkits).
Windows does have a de-duplication service that works at the file level, much like what I proposed, called the Single Instance Storage Groveler (I like to call it the single instance storage mangler :-). High-end backup software companies offer block-level de-duplication options for their software, and proprietary storage appliance companies also have block-level de-duplication for their near- and far-line storage (big $$$).
Ross S. W. Walker wrote:
These are all good and valid issues.
Thinking about it some more, I might just implement it as a system service that scans given disk volumes in the background and keeps a hidden directory where it stores its state information and hardlinks named after the MD5 hash of the files on the volume. If a collision occurs with an existing MD5 hash, the new file is unlinked and re-linked to the MD5 hash file; if an MD5 hash file exists with no secondary links, it is removed. Maybe monitor the journal or use inotify to just pick up new files, and once a week do a full volume scan. This way the file system performs as well as it normally does, and as things go forward duplicate files are eliminated (combined). Of course the problem arises of what to do when one duplicate is modified but the other should remain the same...

Of course, as you said, revisions that differ just a little won't take advantage of this; it's file level, so it only works with whole files. Still, better than nothing.
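A rough sketch of that scanner, in Python, assuming a single volume with a hidden ".dedup" store directory at its root (the store name and the volume path are illustrative, not from the thread):

# dedup_scan.py -- sketch of the background scanner described above. The store
# directory name and volume path are made up for illustration.
import hashlib
import os

STORE = ".dedup"   # hidden directory holding one hardlink per unique MD5


def md5sum(path, chunk=1 << 20):
    """MD5 of a file, read in 1 MiB chunks so large files don't blow up memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def scan(volume):
    volume = os.path.abspath(volume)
    store = os.path.join(volume, STORE)
    os.makedirs(store, exist_ok=True)
    for root, dirs, files in os.walk(volume):
        dirs[:] = [d for d in dirs if os.path.join(root, d) != store]
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            entry = os.path.join(store, md5sum(path))
            if not os.path.exists(entry):
                os.link(path, entry)          # first copy seen: record it
            elif not os.path.samefile(path, entry):
                # duplicate content: swap the file for a hardlink to the store copy
                tmp = path + ".dedup-tmp"
                os.link(entry, tmp)
                os.replace(tmp, path)
    # prune store entries whose only remaining link is the store's own
    for name in os.listdir(store):
        entry = os.path.join(store, name)
        if os.stat(entry).st_nlink == 1:
            os.unlink(entry)


if __name__ == "__main__":
    scan("/srv/data")                         # hypothetical volume to scan

Note that this inherits exactly the problem mentioned above: once two paths are hardlinked, modifying one in place modifies the other, so a real implementation would have to break the link on write (or rely on applications that replace files rather than rewrite them in place).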
-Ross
-----Original Message-----
From: centos-bounces@centos.org
To: CentOS mailing list <centos@centos.org>
Sent: Thu Dec 06 08:10:38 2007
Subject: Re: [CentOS] Filesystem that doesn't store duplicate data
On Thursday 06 December 2007, Ross S. W. Walker wrote:
How about a FUSE file system (userland, i.e. like NTFS-3G) that layers on top of any file system that supports hard links?
That would be easy, but I can see a few issues with that approach:
- At the file level rather than the block level you're going to be much more inefficient. I, for one, have gigabytes of revisions of files that have changed only a little between each revision.
- You have to write all the data blocks to disk and then erase them again if you find a match. That will slow you down and create some weird behavior: you know the FS shouldn't store duplicate data, yet you can't use cp to copy a 10G file if only 9G are free. If you copy an 8G file, you see the usage increase until only 1G is free, then when your app closes the file you go back to 9G free...
- Rather than continuously looking for matches at the block level, you have to search for matches on files that can be any size. That is fine if you have a 100K file, but if you have a 100M or larger file the checksum calculations will take forever. This means that rather than adding a specific, small penalty to every write call, you add an unknown penalty, proportional to file size, when closing the file. Also, the fact that most C coders don't check the return code of close() doesn't make me happy there...
Peter.