[CentOS] Question about optimal filesystem with many small files.

Sat Jul 11 07:26:56 UTC 2009
Alexander Georgiev <alexander.georgiev at gmail.com>

2009/7/11 oooooooooooo ooooooooooooo <hhh735 at hotmail.com>:
>> You mentioned that the data can be retrieved from somewhere else. Is
>> some part of this filename a unique key?
> The real key is up to 1023 characters long and it's unique, but I have to trim it to 256 characters; that way it is not unique unless I add the hash.

The fact that this 1023-character file name is unique is very nice.
And no trimming is needed!
I think you have 2 issues to deal with:

1) You have files with unique file names, unfortunately with lengths
of up to 1023 characters.
Regarding file names and paths on Linux with ext3 you have:

  file name length limit = 255 bytes
  path length limit = 4096 bytes
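
You can double-check both limits for a given mount point with
getconf (NAME_MAX and PATH_MAX are the POSIX names for them); on a
typical Linux ext3 system it reports 255 and 4096:

$ getconf NAME_MAX /
255
$ getconf PATH_MAX /
4096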

If you try to store such a file directly, you will break the file name
limit. But if you decompose the name into N chunks of at most 250
characters each, you can preserve the file as a sequence of

     N - 1 nested folders, plus a file whose name is the Nth chunk,
residing in the (N-1)th folder.

Via this decomposition you translate the unique 1023-character 'file
name' into an equally unique 'file path' whose total length (the name
plus a few slashes) stays well below the 4096-byte path length limit.
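
As a rough, untested sketch of how the decomposition might look in
bash (it assumes the key contains no '/' characters or newlines, and
/var/data is a made-up storage root):

=====================sketch start
#! /bin/bash
# Sketch only: store stdin under a path derived from a long key.

key="$1"        # the up-to-1023-character unique name
base=/var/data  # hypothetical storage root

path="$base"
rest="$key"
while [ ${#rest} -gt 250 ]; do
        path="$path/${rest:0:250}"   # next 250-character folder name
        rest=${rest:250}
done
mkdir -p "$path"        # creates the N-1 nested folders
cat > "$path/$rest"     # the final chunk becomes the file name
=====================sketch end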

2) You suffer performance degradation when the number of files in a
folder goes beyond 1000.

Filipe Brandenburger has suggested a slick scheme to overcome this
problem, which will work perfectly without a database:

============quote start
$ echo -n example.txt | md5sum
e76faa0543e007be095bb52982802abe  -

Then say you take the first 4 digits of it to build the hash: e/7/6/f

Then you store file example.txt at: e/7/6/f/example.txt
============quote end

Of course, "example.txt" might be a long file name ("exaaaaa ..... 1000
chars here .....txt"), so after the "hash tree" e/7/6/f you store the
chunked path structure described in 1), as sketched below.
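
Putting the two together, again as an untested sketch under the same
assumptions (no '/' or newlines in the key; /var/cache/store is a
made-up root):

=====================sketch start
#! /bin/bash
# Sketch only: hash tree on top, chunked long name below it.

key="$1"
root=/var/cache/store

# First 4 hex digits of the md5 sum give the e/7/6/f prefix.
sum=$(echo -n "$key" | md5sum)
path="$root/${sum:0:1}/${sum:1:1}/${sum:2:1}/${sum:3:1}"

rest="$key"                          # then the chunk decomposition
while [ ${#rest} -gt 250 ]; do
        path="$path/${rest:0:250}"
        rest=${rest:250}
done
mkdir -p "$path"
cat > "$path/$rest"
=====================sketch end

Note that the -n to echo matters: without it the trailing newline
would change the md5 sum.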

As was suggested by Les Mikesell, squid and other products have
already implemented similar strategies, and you might be able to use
either the algorithm or directly the code that implements it. I would
spend some time investigating squid's code. I think squid has to deal
with exactly the same problem - caching the contents of resources
whose URLs might be longer than the 255-byte file name limit.

If you use this approach, there is no need for a database to store hashes!

I did some tests on a CentOS 3 system with the following script:

=====================script start
#! /bin/bash

# Build a directory tree 8 * 250 = 2000 levels deep, one
# single-character directory per level, then drop a file
# at the deepest level.
for a in a b c d e f g j; do
        for i in `seq 1 250`; do
                mkdir $a
                cd $a
        done
done
pwd > some_file.txt
=====================script end

which creates a deeply nested directory structure with a file at the
bottom. The total file path length is > 8 * 250 characters (each of
the 2000 levels adds a one-character name plus a slash, about 4000
bytes in all, still under the 4096 limit). I had no problems
accessing this file by its full path:

$ find ./ -name some\* -exec cat {} \; | wc -c