<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7652.24">
<TITLE>Re: [CentOS] Filesystem that doesn't store duplicate data</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<BR>
<P><FONT SIZE=2>These are all good and valid issues.<BR>
<BR>
Thinking about it some more, I might just implement it as a system service that scans given disk volumes in the background and keeps a hidden directory where it stores its state information plus hardlinks named after the md5 hash of each file on the volume. If a new file's md5 hash collides with an existing hash file, the new file is unlinked and re-linked to that hash file; if a hash file exists with no secondary links, it is removed. Maybe monitor the journal or use inotify to pick up new files, and do a full volume scan once a week.<BR>
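<BR>
A rough sketch of that service's core, only to make the idea concrete. The directory name ".dedup", the temp-file suffix and the function names are made up for illustration, and the byte-for-byte compare is an extra safety step I'm assuming you'd want, since two different files can in principle share an md5. It also ignores ownership and permission differences between the duplicates:<BR>
<BR>
```python
import filecmp
import hashlib
import os

STORE = ".dedup"  # hypothetical name for the hidden state directory


def md5_of(path, chunk=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never sit fully in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def dedup_file(volume, path):
    """Link path to the store entry for its hash; return True if it was a duplicate."""
    store = os.path.join(volume, STORE)
    os.makedirs(store, exist_ok=True)
    entry = os.path.join(store, md5_of(path))
    if not os.path.exists(entry):
        os.link(path, entry)      # first copy seen: the hash file is a second link
        return False
    if os.path.samefile(path, entry):
        return False              # already linked to the hash file
    if not filecmp.cmp(path, entry, shallow=False):
        return False              # same md5 but different bytes: leave it alone
    tmp = path + ".dedup-tmp"     # hypothetical temp name next to the file
    os.link(entry, tmp)           # create the new link first...
    os.replace(tmp, path)         # ...then swap it in with an atomic rename
    return True


def prune(volume):
    """Remove hash files that no file on the volume links to any more."""
    store = os.path.join(volume, STORE)
    removed = 0
    for name in os.listdir(store):
        entry = os.path.join(store, name)
        if os.stat(entry).st_nlink == 1:
            os.unlink(entry)
            removed += 1
    return removed
```
<BR>
The inotify or journal watcher would call dedup_file() for each new file, and the weekly scan would walk the volume and finish with prune().<BR>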
<BR>
This way the file system performs as well as it normally does, and as things go forward duplicate files are eliminated (combined). Of course, the problem arises of what to do when one duplicate is modified but the other should remain the same...<BR>
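<BR>
To spell the problem out: hardlinked duplicates share one inode, so an in-place write through either name changes both. One possible way around it, sketched below, is to break the link before writing. The function and temp-file names are hypothetical, and a plain background service cannot intercept writes, so something (a cooperating application or a FUSE layer) would have to call this first:<BR>
<BR>
```python
import os
import shutil


def break_link(path):
    """Give path its own private copy of the data if it currently shares an inode."""
    if os.stat(path).st_nlink == 1:
        return False               # not shared, safe to modify in place
    tmp = path + ".cow-tmp"        # hypothetical temp name next to the file
    shutil.copy2(path, tmp)        # copy data and metadata to a new inode
    os.replace(tmp, path)          # path now names the private copy
    return True
```
<BR>
Programs that save by writing a new file and renaming it over the old one break the link on their own; it is the ones that rewrite a file in place that would hit this.<BR>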
<BR>
Of course, as you said, revisions that differ just a little won't take advantage of this; it's file level, so it only works with whole files. Still better than nothing.<BR>
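<BR>
For comparison, here is a small sketch of what block-level matching would catch that whole-file hashing misses. The 4096-byte block size and the function names are assumptions for illustration only:<BR>
<BR>
```python
import hashlib

BLOCK = 4096  # assumed block size


def block_hashes(data, block=BLOCK):
    """Return the md5 digest of each fixed-size block of data."""
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]


def shared_blocks(old, new):
    """Count the blocks of new whose content already appears somewhere in old."""
    known = set(block_hashes(old))
    return sum(1 for h in block_hashes(new) if h in known)
```
<BR>
Two revisions that differ by one byte changed in place share every block but one, while their whole-file hashes simply differ. Note that fixed-size blocks only line up when the edit does not shift the data: inserting a byte near the start moves every later block boundary.<BR>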
<BR>
-Ross<BR>
<BR>
<BR>
-----Original Message-----<BR>
From: centos-bounces@centos.org &lt;centos-bounces@centos.org&gt;<BR>
To: CentOS mailing list &lt;centos@centos.org&gt;<BR>
Sent: Thu Dec 06 08:10:38 2007<BR>
Subject: Re: [CentOS] Filesystem that doesn't store duplicate data<BR>
<BR>
On Thursday 06 December 2007, Ross S. W. Walker wrote:<BR>
> How about a FUSE file system (userland, ie NTFS 3G) that layers<BR>
> on top of any file system that supports hard links<BR>
<BR>
That would be easy but I can see a few issues with that approach:<BR>
<BR>
1) On file level rather than block level you're going to be much more<BR>
inefficient. I for one have gigabytes of revisions of files that have changed<BR>
a little between each file.<BR>
<BR>
2) You have to write all datablocks to disk and then erase them again if you<BR>
find a match. That will slow you down and create some weird behavior. I.e.<BR>
you know the FS shouldn't store duplicate data, yet you can't use cp to copy<BR>
a 10G file if only 9G are free. If you copy an 8G file, you see the usage<BR>
increase till only 1G is free, then when your app closes the file, you are<BR>
going to go back to 9G free...<BR>
<BR>
3) Rather than continuously looking for matches on block level, you have to<BR>
search for matches on files that can be any size. That is fine if you have a<BR>
100K file - but if you have a 100M or larger file, the checksum calculations<BR>
will take you forever. This means rather than adding a specific, small<BR>
penalty to every write call, you add an unknown penalty, proportional to file<BR>
size when closing the file. Also, the fact that most C coders don't check the<BR>
return code of close doesn't make me happy there...<BR>
<BR>
Peter.<BR>
</FONT>
</P>
</BODY>
</HTML>