[CentOS] Filesystem that doesn't store duplicate data

Peter Arremann loony at loonybin.org
Thu Dec 6 13:10:38 UTC 2007


On Thursday 06 December 2007, Ross S. W. Walker wrote:
> How about a FUSE file system (userland, i.e. NTFS-3G) that layers
> on top of any file system that supports hard links

That would be easy, but I can see a few issues with that approach:

1) Working at the file level rather than the block level is going to be much
less efficient. I for one have gigabytes of revisions of files that differ
only slightly from one revision to the next; whole-file matching via hard
links can't share any of that data, while block-level dedup could (see the
first sketch after this list).

2) You have to write all the data blocks to disk and then erase them again if
you find a match. That will slow you down and create some weird behavior: you
know the FS shouldn't store duplicate data, yet you can't use cp to copy a 10G
file if only 9G are free. If you copy an 8G file, you see the usage increase
until only 1G is free, and then, when your app closes the file, you go back to
9G free...

3) Rather than continuously looking for matches at the block level, you have
to search for matches on files that can be any size. That is fine if you have
a 100K file, but if you have a 100M or larger file, the checksum calculation
will take forever. This means that rather than adding a specific, small
penalty to every write call, you add an unknown penalty, proportional to file
size, when closing the file. Also, the fact that most C coders don't check the
return code of close doesn't make me happy there (see the second sketch
below)...
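
To make point 1 a bit more concrete, here is a rough sketch of per-block
checksumming, the kind of thing a block-level dedup layer would do. This is
plain standalone C, not FUSE code, and the 4K block size and the FNV-1a hash
are just placeholders I picked for the illustration; a real implementation
would use a collision-resistant hash and its own block size:

/* Per-block checksumming sketch: hash every fixed-size block of a file
 * separately. Not real FUSE or dedup code, just an illustration. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

/* Simple 64-bit FNV-1a hash; a real dedup layer would use something
 * collision-resistant (SHA-1, SHA-256, ...). */
static uint64_t fnv1a(const unsigned char *buf, size_t len)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= buf[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char buf[BLOCK_SIZE];
    size_t n, blockno = 0;
    /* Two revisions of a file that differ in one block still share the
     * hashes (and so, in a dedup FS, the storage) of every other block.
     * A whole-file checksum shares nothing. */
    while ((n = fread(buf, 1, BLOCK_SIZE, f)) > 0) {
        printf("block %6zu  hash %016llx\n", blockno++,
               (unsigned long long)fnv1a(buf, n));
    }
    fclose(f);
    return 0;
}

Run it against two revisions of the same file and diff the output: only the
blocks that actually changed get new hashes.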
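
And for point 3, this is the kind of close() check I mean. It is ordinary
POSIX I/O, nothing specific to any dedup implementation (the file name
"testfile" is made up for the example), but it shows the one place where work
deferred to close time could even report an error:

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "some data\n";
    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
        perror("write");
        close(fd);
        return 1;
    }

    /* The part almost everyone skips: close() can fail (deferred I/O
     * errors, quota, and in this scenario a failed checksum/merge pass),
     * and this is the caller's last chance to hear about it. */
    if (close(fd) != 0) {
        perror("close");
        return 1;
    }
    return 0;
}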

Peter.


