[CentOS] Filesystem that doesn't store duplicate data

Thu Dec 6 15:48:27 UTC 2007

This is a bit different than what I was proposing.  I know that BackupPC
already does this at the file level, but I want a filesystem that does it
at the block level.  File level only helps if you're backing up multiple
systems and they all have exactly the same files.  Block level would help
a lot more, I think.  You'd be able to do a full backup every night and
have it take up only about the same space as a differential backup.
Things like virtual machine disk images, which are often clones of each
other, could take up only a small additional amount of space for each
clone, proportional to the changes made to that disk image.
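
To make the block-level idea concrete, here is a rough in-memory sketch.
It is not a filesystem, only an illustration of the bookkeeping.  The
names (BlockStore, BLOCK_SIZE) and the 4096-byte block size are my own
assumptions, and it trusts SHA-256 digests without comparing block
contents byte for byte.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; a real filesystem would choose its own


class BlockStore:
    """Toy content-addressed store: each unique block is kept only once."""

    def __init__(self):
        self.blocks = {}  # digest -> block bytes
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        digests = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            digest = hashlib.sha256(block).digest()
            # Keep the block only if no identical block was stored before.
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        # Reassemble the file from its list of block digests.
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        # Space actually consumed by unique blocks.
        return sum(len(b) for b in self.blocks.values())
```

Writing an 8-block image and then a clone that differs in one block
should leave 9 blocks stored rather than 16, which is the "proportional
to the changes" behaviour I'm after.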

Nobody really answered this, so I'll ask again.  Is there a Windows
version of FUSE?  How does one test a FUSE filesystem while developing
it?  It would be nice to just run something from Eclipse once you've
made your changes and have a drive mounted and ready to test.  Being
able to debug a filesystem while it's running would be great too.
Does anyone here have experience building FUSE filesystems?

Russ

Ross S. W. Walker wrote:
>
> These are all good and valid issues.
>
> Thinking about it some more, I might just implement it as a system
> service that scans given disk volumes in the background and keeps a
> hidden directory where it stores its state information and hardlinks
> named after the md5 hash of the files on the volume. If a collision
> occurs with an existing md5 hash, then the new file is unlinked and
> re-linked to the md5 hash file. If an md5 hash file exists with no
> secondary links, then it is removed. Maybe monitor the journal or use
> inotify to pick up just the new files, and do a full volume scan once
> a week.
>
> This way the file system performs as well as it normally does, and as
> things go forward duplicate files are eliminated (combined). Of course,
> the problem arises of what to do when one duplicate is modified but the
> other should remain the same...
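
A minimal sketch of one scan pass of the service described above,
assuming a POSIX filesystem with hard links.  The function names and
the ".dedupe" directory name are made up for illustration.  It trusts
the md5 hash without comparing file contents, ignores ownership and
permission differences, and does nothing about the modification problem
just mentioned: an in-place write through one name changes every linked
name.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex md5 of a file, read in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_tree(root, state_dir_name=".dedupe"):
    """Hardlink duplicate regular files under root; return count relinked."""
    state_dir = os.path.join(root, state_dir_name)
    os.makedirs(state_dir, exist_ok=True)
    relinked = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the hidden state directory itself.
        if dirpath == root and state_dir_name in dirnames:
            dirnames.remove(state_dir_name)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            canonical = os.path.join(state_dir, file_md5(path))
            if not os.path.exists(canonical):
                os.link(path, canonical)   # first copy seen: register it
            elif not os.path.samefile(path, canonical):
                tmp = path + ".dedupe-tmp"
                os.link(canonical, tmp)    # new link to the stored copy
                os.replace(tmp, path)      # swap it in over the duplicate
                relinked += 1
    # Remove hash files that no other name refers to any more.
    for name in os.listdir(state_dir):
        canonical = os.path.join(state_dir, name)
        if os.stat(canonical).st_nlink == 1:
            os.unlink(canonical)
    return relinked
```

Calling dedupe_tree("/some/volume") a second time should relink
nothing and only clean up hash files whose last real name has been
deleted since the previous pass.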
>
> Of course, as you said, revisions that differ just a little won't
> benefit from this, but it's file level so it only works with whole
> files. Still better than nothing.
>
> -Ross
>
>
> -----Original Message-----
> From: centos-bounces at centos.org <centos-bounces at centos.org>
> To: CentOS mailing list <centos at centos.org>
> Sent: Thu Dec 06 08:10:38 2007
> Subject: Re: [CentOS] Filesystem that doesn't store duplicate data
>
> On Thursday 06 December 2007, Ross S. W. Walker wrote:
> > How about a FUSE file system (userland, ie NTFS 3G) that layers
> > on top of any file system that supports hard links
>
> That would be easy but I can see a few issues with that approach:
>
> 1) On file level rather than block level you're going to be much more
> inefficient. I for one have gigabytes of revisions of files that have
> changed a little between each file.
>
> 2) You have to write all datablocks to disk and then erase them again
> if you find a match. That will slow you down and create some weird
> behavior. I.e. you know the FS shouldn't store duplicate data, yet you
> can't use cp to copy a 10G file if only 9G are free. If you copy a 8G
> file, you see the usage increase till only 1G is free, then when your
> app closes the file, you are going to go back to 9G free...
>
> 3) Rather than continuously looking for matches on block level, you
> have to search for matches on files that can be any size. That is fine
> if you have a 100K file - but if you have a 100M or larger file, the
> checksum calculations will take you forever. This means rather than
> adding a specific, small penalty to every write call, you add an
> unknown penalty, proportional to file size, when closing the file.
> Also, the fact that most C coders don't check the return code of close
> doesn't make me happy there...
>
> Peter.
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
>