[CentOS] Filesystem that doesn't store duplicate data
Ruslan Sivak
rsivak at istandfor.com
Thu Dec 6 15:48:27 UTC 2007
This is a bit different than what I was proposing. I know that BackupPC
already does this on a file level, but I want a filesystem that does it
at a block level. File level only helps if you're backing up multiple
systems and they all have exactly the same files. Block level would help
a lot more, I think. You'd be able to do a full backup every night and
have it take up only about as much space as a differential backup.
Things like virtual machine disk images, which are often clones of each
other, could take up only a small additional amount of space for each
clone, proportional to the changes that are made to that disk image.
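
To make the block-level idea concrete, here is a rough Python sketch of a
content-addressed block store. It is only an illustration under stated
assumptions: the 4096-byte block size, the SHA-256 hash, and the BlockStore
class with its method names are all hypothetical and not taken from any
existing filesystem. It keeps everything in memory and ignores hash
collisions, crash safety and garbage collection.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


class BlockStore:
    """Toy in-memory block store that keeps each distinct block once."""

    def __init__(self):
        self.blocks = {}  # block digest -> block bytes (stored once)
        self.files = {}   # file name -> ordered list of block digests

    def write(self, name, data):
        """Split data into fixed-size blocks and store only unseen blocks."""
        digests = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            # setdefault leaves an existing identical block untouched
            self.blocks.setdefault(digest, block)
            digests.append(digest)
        self.files[name] = digests

    def read(self, name):
        """Reassemble a file from its list of block digests."""
        return b"".join(self.blocks[d] for d in self.files[name])

    def stored_bytes(self):
        """Bytes actually held, counting each distinct block once."""
        return sum(len(block) for block in self.blocks.values())
```

With a store like this, writing a disk image and then a clone of it that
differs in a single block should grow stored_bytes() by one block rather
than by the size of the whole image, which is the behaviour described above
for full nightly backups and cloned VM images.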
Nobody really answered this, so I'll ask again. Is there a Windows
version of FUSE? How does one test a FUSE filesystem while developing
it? It would be nice to be able to run something from Eclipse once
you've made your changes and have a drive mounted and ready to test.
Being able to debug a filesystem while it's running would be great too.
Does anyone here have experience building FUSE filesystems?
Russ
Ross S. W. Walker wrote:
>
> These are all good and valid issues.
>
> Thinking about it some more I might just implement it as a system
> service that scans given disk volumes in the background, keeps a
> hidden directory where it stores its state information and hardlinks
> named after the md5 hash of the files on the volume. If a collision
> occurs with an existing md5 hash then the new file is unlinked and
> re-linked to the md5 hash file, if an md5 hash file exists with no
> secondary links then it is removed. Maybe monitor the journal or use
> inotify to just get new files and once a week do a full volume scan.
>
> This way the file system performs as well as it normally does and as
> things go forward duplicate files are eliminated (combined). Of course
> the problem arises of what to do when 1 duplicate is modified, but the
> other should remain the same...
>
> Of course what you said about revisions that differ just a little
> won't take advantage of this, but it's file level so it only works
> with whole files, still better than nothing.
>
> -Ross
>
>
> -----Original Message-----
> From: centos-bounces at centos.org <centos-bounces at centos.org>
> To: CentOS mailing list <centos at centos.org>
> Sent: Thu Dec 06 08:10:38 2007
> Subject: Re: [CentOS] Filesystem that doesn't store duplicate data
>
> On Thursday 06 December 2007, Ross S. W. Walker wrote:
> > How about a FUSE file system (userland, e.g. NTFS-3G) that layers
> > on top of any file system that supports hard links
>
> That would be easy but I can see a few issues with that approach:
>
> 1) On file level rather than block level you're going to be much more
> inefficient. I for one have gigabytes of revisions of files that have
> changed a little between each file.
>
> 2) You have to write all datablocks to disk and then erase them again
> if you find a match. That will slow you down and create some weird
> behavior. I.e. you know the FS shouldn't store duplicate data, yet you
> can't use cp to copy a 10G file if only 9G are free. If you copy an 8G
> file, you see the usage increase till only 1G is free, then when your
> app closes the file, you are going to go back to 9G free...
>
> 3) Rather than continuously looking for matches on block level, you
> have to search for matches on files that can be any size. That is fine
> if you have a 100K file - but if you have a 100M or larger file, the
> checksum calculations will take you forever. This means rather than
> adding a specific, small penalty to every write call, you add an
> unknown penalty, proportional to file size, when closing the file.
> Also, the fact that most C coders don't check the return code of close
> doesn't make me happy there...
>
> Peter.
> _______________________________________________
> CentOS mailing list
> CentOS at centos.org
> http://lists.centos.org/mailman/listinfo/centos
>
>