[CentOS] Question about optimal filesystem with many small files.

Wed Jul 8 22:09:28 UTC 2009
Filipe Brandenburger <filbranden at gmail.com>

Hi,

On Wed, Jul 8, 2009 at 17:59, oooooooooooo
ooooooooooooo<hhh735 at hotmail.com> wrote:
> My original idea was storing the file with a hash of its name, and then
> storing a hash->real filename mapping in MySQL. This way I have direct
> access to the file and I can make a directory hierarchy with the first
> characters of the hash, /c/0/2/a, so I would have 16^4 = 65536 leaves in
> the directory tree, and the files would be evenly distributed, with
> around 200 files per dir (which should not give any performance issues).
> But the requirement is to use the real file name for the directory tree,
> which gives the issue.

You can hash it and still keep the original filename, and you don't
even need a MySQL database to do lookups.

For instance, let's take "example.txt" as the file name.

Then let's hash it, say using MD5 (just for the sake of example; a
simpler hash could give you good enough results and be quicker to
calculate):
$ echo -n example.txt | md5sum
e76faa0543e007be095bb52982802abe  -

Then say you take the first 4 hex digits of the hash to build the
directory prefix: e/7/6/f

Then you store file example.txt at: e/7/6/f/example.txt

The file still has its original name (example.txt), and if you want to
find it, you just calculate the hash of the name again, which gives you
e/7/6/f back, and prepend that to the original name.
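To make this concrete, here is a minimal shell sketch of the store and
lookup steps. The hashpath/store_file/find_file names and the /data base
directory are just placeholders for illustration, not anything you have
to follow:

BASE=/data   # placeholder base directory for the tree

# first 4 hex digits of the MD5 of the name, split as e/7/6/f
hashpath() {
    local h
    h=$(echo -n "$1" | md5sum | cut -c1-4)
    echo "${h:0:1}/${h:1:1}/${h:2:1}/${h:3:1}"
}

# copy a file into the tree, keeping its original name
store_file() {
    local dir="$BASE/$(hashpath "$(basename "$1")")"
    mkdir -p "$dir" && cp "$1" "$dir/"
}

# recompute the prefix from the name alone to locate the file
find_file() {
    echo "$BASE/$(hashpath "$1")/$1"
}

So, for example:

$ store_file example.txt
$ find_file example.txt
/data/e/7/6/f/example.txt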

I would also suggest keeping fewer directory levels with more branches
in each; the optimal performance will come from striking a balance
between depth and fan-out. For example, in this case (4 hex digits) you
would have 4 levels with 16 entries each. If you group the hex digits
two by two, you would have (up to) 256 entries at each level, but only
two levels of subdirectories. For instance: example.txt ->
e7/6f/example.txt. That might (or might not) give you better
performance. A benchmark should tell you which one is better, but in
any case, both of these setups will be many times faster than the one
where you have 400,000 files in a single directory.
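If you want to try the two-by-two layout, the only thing that changes in
the sketch above is how the prefix is split (again, just an
illustration):

hashpath() {
    # two levels of 2 hex digits each, e.g. e7/6f
    local h
    h=$(echo -n "$1" | md5sum | cut -c1-4)
    echo "${h:0:2}/${h:2:2}"
}

You could then time a loop of store_file/find_file calls over a sample
of your real file names with each variant to see which layout your
filesystem handles better.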

Would that help solve your issue?

HTH,
Filipe