[CentOS] Filesystem that doesn't store duplicate data

Thu Dec 6 05:00:18 UTC 2007
Ruslan Sivak <rsivak at istandfor.com>

redhat at mckerrs.net wrote:
>
> ----- Original Message -----
> From: rsivak at istandfor.com
> To: "CentOS Mailing list" <centos at centos.org>
> Sent: Thursday, December 6, 2007 11:18:16 AM (GMT+1000) Australia/Brisbane
> Subject: [CentOS] Filesystem that doesn't store duplicate data
>
> Is there such a filesystem available?  It seems like it wouldn't be
> too hard to implement.  Basically, do things on a block-by-block
> basis.  Store the MD5 of each block in a table, and when writing a new
> block, check whether its MD5 already exists; if so, point the new
> block to the old block.  Since MD5 is not guaranteed unique, it might
> need to compare the two blocks and, if they are indeed different,
> handle it somehow.
>
> When modifying an existing block that has multiple pointers, copy the
> block and modify the new block.
>
> I know I'm oversimplifying things a lot, but something like this could
> work, no?  It would be a great filesystem to store backups on, or
> things like VMware volumes...
>
> Russ
>
> You are describing what I understand to be "data de-duplication". It
> is all the rage for backups, as it has the potential to decrease
> backup times and volumes significantly. I went to a presentation by
> Avamar (a partner of EMC?) on this technology, and it seemed
> really nice for your typical Windows file server. I suppose it
> effectively turns your data into "single-instance" storage, which is
> no bad thing. I suppose it could be useful for large database backups
> as well.
>
> You'd think that using this technology on a live filesystem could
> incur a significant performance penalty due to all those hash
> calculations (FUSE module, anyone?). Imagine a hardware-optimized data
> de-duplication disk controller, similar to RAID XOR-optimized CPUs.
> Now that would be cool. All it would need to store is metadata when
> it has already seen the exact same block. I think the result is
> fundamentally similar to on-the-fly disk compression.
>
> Let us know when you have a beta to test!
>
> 8^)
>
I'm not sure this could be done on a disk controller, as I don't think
a controller has enough memory to hold all the hashes.  I am thinking
of writing it as a FUSE module.  I'm most familiar with Java, and there
are FUSE bindings for Java.  I would love to make at least a
proof-of-concept FS that does this.  Does FUSE exist for Windows?  How
does one test a FUSE module while developing it?
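
To make the idea concrete, here is a rough in-memory sketch of the
scheme discussed in this thread: hash each block, compare bytes when a
hash matches, and copy on write.  It is plain Python rather than Java,
and it is not a filesystem or a FUSE module.  The 4096-byte block size
and every name in it (DedupStore, hash_table_bytes, and so on) are
assumptions made up for illustration; nothing here comes from an
existing implementation.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size; the thread does not specify one


class DedupStore:
    """Toy in-memory block store. All names here are hypothetical."""

    def __init__(self):
        self.blocks = {}    # block id -> block contents (bytes)
        self.refcount = {}  # block id -> number of pointers to the block
        self.by_hash = {}   # MD5 digest -> ids of blocks with that digest
        self.next_id = 0

    def write(self, data):
        """Store a block and return its id, reusing an identical block."""
        digest = hashlib.md5(data).digest()
        for bid in self.by_hash.get(digest, []):
            # MD5 is not guaranteed unique, so compare the actual bytes.
            if self.blocks[bid] == data:
                self.refcount[bid] += 1
                return bid
        # New content (or a genuine collision): allocate a fresh block.
        bid = self.next_id
        self.next_id += 1
        self.blocks[bid] = data
        self.refcount[bid] = 1
        self.by_hash.setdefault(digest, []).append(bid)
        return bid

    def release(self, bid):
        """Drop one pointer; free the block when no pointers remain."""
        self.refcount[bid] -= 1
        if self.refcount[bid] == 0:
            data = self.blocks.pop(bid)
            del self.refcount[bid]
            digest = hashlib.md5(data).digest()
            self.by_hash[digest].remove(bid)
            if not self.by_hash[digest]:
                del self.by_hash[digest]

    def modify(self, bid, new_data):
        """Copy-on-write: never edit a block in place.

        Returns the id of the block that now holds new_data; other
        pointers to the old block keep seeing the old contents.
        """
        new_bid = self.write(new_data)  # write first, then release
        self.release(bid)
        return new_bid


def hash_table_bytes(disk_bytes, block_size=BLOCK_SIZE, digest_size=16):
    """Bytes needed to hold one MD5 digest (16 bytes) per block."""
    return (disk_bytes // block_size) * digest_size
```

Under these assumptions, hash_table_bytes(2**40) comes to 4 GiB of
digests for 1 TiB of data in 4 KiB blocks, before counting any pointers
or reference counts.  That is the kind of table size I doubt a
controller could hold.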

Russ