Is there such a filesystem available? It seems like it wouldn't be too hard to implement... Basically, do things on a block-by-block basis: store the md5 of each block in a table, and when writing a new block, check whether that md5 already exists and, if so, point the new block at the old one. Since md5 isn't guaranteed to be unique, you might need to do a diff between the two blocks and, if they really are different, handle it somehow.
When modifying an existing block that has multiple pointers, copy the block and modify the new block.
I know I'm oversimplifying things a lot, but something like this could work, no? It would be a great filesystem to store backups on, or things like VMware volumes...
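A rough sketch of that write path in Java (the 4K block size, the in-memory tables, and the copy-on-write handling are all made-up simplifications for illustration, not a real on-disk format):

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Toy block store: hash each block, and if an identical block is already
// stored, just add another reference to it instead of storing it again.
public class DedupBlockStore {
    static final int BLOCK_SIZE = 4096; // callers are expected to pass fixed-size blocks

    private final Map<String, Integer> hashToBlock = new HashMap<>(); // md5 hex -> block number
    private final Map<Integer, byte[]> blocks = new HashMap<>();      // block number -> data
    private final Map<Integer, Integer> refCount = new HashMap<>();   // block number -> pointer count
    private int nextBlock = 0;

    // Write one block; returns the block number the caller should point at.
    public int write(byte[] data) throws NoSuchAlgorithmException {
        String key = md5Hex(data);
        Integer existing = hashToBlock.get(key);
        // The md5 match is only a hint: compare the actual bytes to rule out a collision.
        if (existing != null && Arrays.equals(blocks.get(existing), data)) {
            refCount.merge(existing, 1, Integer::sum);
            return existing;
        }
        int blockNo = nextBlock++;
        blocks.put(blockNo, data.clone());
        refCount.put(blockNo, 1);
        hashToBlock.putIfAbsent(key, blockNo); // on a true collision the new block simply isn't indexed
        return blockNo;
    }

    // Copy-on-write: modifying a block that has multiple pointers means writing
    // a new block and dropping one reference from the old one.
    public int modify(int blockNo, byte[] newData) throws NoSuchAlgorithmException {
        refCount.merge(blockNo, -1, Integer::sum); // a real FS would free the block at zero
        return write(newData);
    }

    private static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        byte[] digest = MessageDigest.getInstance("MD5").digest(data);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}

Writing the same 4K block twice returns the same block number, so the data is only stored once; everything hard (persistence, freeing blocks, crash safety) is left out.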
Russ
On Wednesday 05 December 2007, redhat@mckerrs.net wrote:
You'd think that using this technology on a live filesystem could incur a significant performance penalty due to all those calculations (FUSE module, anyone?). Imagine a hardware-optimized data de-duplication disk controller, similar to RAID XOR-optimized CPUs. Now that would be cool. All it would need to store would be metadata when it had already seen the exact same block. I think fundamentally it is similar in result to on-the-fly disk compression.
Actually, the impact - if the filesystem is designed correctly - shouldn't be that horrible. After all, Sun has managed to integrate checksums into ZFS and still get great performance. In addition, ZFS doesn't directly overwrite data but uses a new datablock each time...
What you would have to do then is keep a lookup table with the checksums to find possible matches quickly. Then when you find one, do another compare to be 100% sure you didn't have a collision on your checksums. If that works, then you can reference that datablock.
It is still a lot of work, but as Sun showed, on-the-fly compares and checksums are doable without too much of a hit.
Peter.
Peter Arremann wrote:
It is still a lot of work, but as Sun showed, on-the-fly compares and checksums are doable without too much of a hit.
I'm not very knowledgeable on how filesystems work. Is there a primer I can brush up on somewhere? I'm thinking about implementing a proof of concept using Java and FUSE.
Russ
Ruslan Sivak wrote:
I'm not very knowledgeable on how filesystems work. Is there a primer I can brush up on somewhere? I'm thinking about implementing a proof of concept using Java and Fuse.
How about a FUSE file system (userland, i.e. NTFS-3G) that layers on top of any file system that supports hard links: it intercepts the FS API, stores all files in a hidden directory named after their MD5 hash, and hard links them to the file names in the user directory structure. When the # of links drops to 1, the hash is removed; when new files are copied in, if the hash collides with an existing one, the data is discarded and only a hard link is made.
Of course it will be a little more involved than this, but the idea is to keep it really simple so it's less likely to break.
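A rough sketch of that idea with plain java.nio.file calls (the pool directory, the class name, and the whole "store after the fact" shape are assumptions for illustration; a real version would sit behind FUSE and intercept writes, byte-compare on a hash match, and clean up pool entries whose link count drops back to 1):

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Toy hard-link pool: every stored file lives in a hidden pool directory under
// its MD5 hex name, and user-visible names are hard links into that pool.
public class HardLinkPool {
    private final Path pool;

    public HardLinkPool(Path pool) throws IOException {
        this.pool = pool;
        Files.createDirectories(pool);
    }

    // "Store" a newly written file: if its hash is already pooled, drop the
    // duplicate data and keep only a hard link; otherwise move it into the pool.
    public void store(Path userFile) throws IOException, NoSuchAlgorithmException {
        Path pooled = pool.resolve(md5Hex(userFile));
        if (Files.exists(pooled)) {
            Files.delete(userFile);   // careful code would byte-compare first
        } else {
            Files.move(userFile, pooled);
        }
        Files.createLink(userFile, pooled); // hard link back into the user tree
    }

    private static String md5Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) md.update(buf, 0, n);
        }
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest()) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}

Files being rewritten in place and link-count tracking (on Linux, the unix:nlink file attribute) are exactly the parts the FUSE layer would have to handle.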
-Ross
Ross S. W. Walker wrote:
How about a FUSE file system (userland, i.e. NTFS-3G) that layers on top of any file system that supports hard links...
yeah, it'd be REAL fun when an app randomly updates one of said files.
John R Pierce wrote:
yeah, it'd be REAL fun when an app randomly updates one of said files.
Yes, but as the writer of the FUSE file system you can make it inaccessible to apps.
-Ross
John R Pierce wrote:
yeah, it'd be REAL fun when an app randomly updates one of said files.
BackupPC stores its backup archive this way - all files are compressed and all duplicate content is hard-linked to a pooled copy (and it knows how to run a remote rsync against this storage to only transfer changes). You could probably write a FUSE filesystem that would allow direct read-only access - although the web interface isn't bad.
On Thursday 06 December 2007, Ross S. W. Walker wrote:
How about a FUSE file system (userland, i.e. NTFS-3G) that layers on top of any file system that supports hard links
That would be easy but I can see a few issues with that approach:
1) On the file level rather than the block level you're going to be much less efficient. I for one have gigabytes of revisions of files that have changed only a little between revisions.
2) You have to write all datablocks to disk and then erase them again if you find a match. That will slow you down and create some weird behavior. E.g. you know the FS shouldn't store duplicate data, yet you can't use cp to copy a 10G file if only 9G are free. If you copy an 8G file, you see the usage increase until only 1G is free, then when your app closes the file, you are going to go back to 9G free...
3) Rather than continuously looking for matches on the block level, you have to search for matches on files that can be any size. That is fine if you have a 100K file - but if you have a 100M or larger file, the checksum calculations will take you forever. This means rather than adding a specific, small penalty to every write call, you add an unknown penalty, proportional to file size, when closing the file. Also, the fact that most C coders don't check the return code of close doesn't make me happy there...
Peter.
Peter Arremann wrote:
How about a FUSE file system (userland, i.e. NTFS-3G) that layers on top of any file system that supports hard links
That would be easy but I can see a few issues with that approach:
- On the file level rather than the block level you're going to be much less efficient. I for one have gigabytes of revisions of files that have changed only a little between revisions.
That is a problem for the way BackupPC stores things - but at least it can compress the files.
- You have to write all datablocks to disk and then erase them again if you find a match. That will slow you down and create some weird behavior. E.g. you know the FS shouldn't store duplicate data, yet you can't use cp to copy a 10G file if only 9G are free. If you copy an 8G file, you see the usage increase until only 1G is free, then when your app closes the file, you are going to go back to 9G free...
Only using it for backup storage is a special case where this is not so bad. BackupPC also has a way to rsync against the stored copy so matching files (or parts) may not need to be transferred at all.
- Rather than continuously looking for matches on the block level, you have to search for matches on files that can be any size. That is fine if you have a 100K file - but if you have a 100M or larger file, the checksum calculations will take you forever.
The BackupPC scheme is to use a hash of some amount of the uncompressed file as the pooled filename for the link, to quickly weed out most possibilities and permit the compression level to be changed. The full check then only has to be done on collisions (a rough sketch of that two-stage idea follows below).
This means rather than adding a specific, small penalty to every write call, you add an unknown penalty, proportional to file size, when closing the file. Also, the fact that most C coders don't check the return code of close doesn't make me happy there...
In BackupPC, the writer understands the scheme - and the linking is somewhat decoupled from the transfers. But even in a normal filesystem, writes are buffered, and if you don't fsync there is a lot that can go wrong after close() reports success.
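Not BackupPC's actual code, just a small Java illustration of that kind of two-stage test (the 128K prefix size and the names are made up): hash only the start of a file as a cheap filter, and pay for a full comparison only when the cheap hashes agree.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Two-stage duplicate test: a hash of the first 128K picks candidate matches,
// and only candidates pay for a full byte-by-byte comparison.
public class TwoStageMatcher {
    static final int PREFIX = 128 * 1024;

    // Cheap filter: hash only the first PREFIX bytes.
    static String prefixHash(Path file) throws IOException, NoSuchAlgorithmException {
        byte[] data = Files.readAllBytes(file); // fine for a sketch; stream it for real use
        byte[] head = Arrays.copyOf(data, Math.min(data.length, PREFIX));
        byte[] digest = MessageDigest.getInstance("MD5").digest(head);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Expensive confirmation, only run when the prefix hashes agree.
    static boolean sameContent(Path a, Path b) throws IOException {
        return Files.size(a) == Files.size(b)
                && Arrays.equals(Files.readAllBytes(a), Files.readAllBytes(b));
    }

    static boolean isDuplicate(Path candidate, Path pooled)
            throws IOException, NoSuchAlgorithmException {
        return prefixHash(candidate).equals(prefixHash(pooled)) && sameContent(candidate, pooled);
    }
}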
On Thursday 06 December 2007, Ruslan Sivak wrote:
I'm not very knowledgeable on how filesystems work. Is there a primer I can brush up on somewhere? I'm thinking about implementing a proof of concept using Java and Fuse.
Russ
This is really a Google question - most stuff is available online and documented pretty well...
This is one that has a very good structure to it but may be a little hard to read:
http://hssl.cs.jhu.edu/papers/peterson-ext3cow03.pdf
Sun has some nice, entry-level stuff on ZFS:
http://www.sun.com/software/media/real/zfs_learningcenter/high_bandwidth.htm...
If you want to go deeper into ZFS, Jeff Bonwick's blog has a lot of different articles about it:
http://blogs.sun.com/bonwick/category/ZFS
Peter.
redhat@mckerrs.net wrote:
You are describing what I understand to be 'data de-duplication'. It is all the rage for backups, as it has the potential to decrease backup times and volumes by significant amounts. I went to a presentation by Avamar (a partner of EMC?) regarding this technology and it seemed really nice for your typical Windows file server. I suppose it effectively turns your data into 'single-instance' storage, which is no bad thing. I suppose it could be useful for large database backups as well.
You'd think that using this technology on a live filesystem could incur a significant performance penalty due to all those calculations (FUSE module, anyone?). Imagine a hardware-optimized data de-duplication disk controller, similar to RAID XOR-optimized CPUs. Now that would be cool. All it would need to store would be metadata when it had already seen the exact same block. I think fundamentally it is similar in result to on-the-fly disk compression.
Let us know when you have a beta to test!
8^)
I'm not sure this would be possible to do on a disk controller, as I don't think a controller can store the amount of data necessary for the hashes. I am thinking of maybe making it as a FUSE module. I'm most familiar with Java, and there are FUSE bindings for Java. I would love to make at least a proof-of-concept FS that does this. Does FUSE exist for Windows? How does one test a FUSE module while developing it?
Russ