[CentOS] lots of small files in a folder on Linux centos

Sun Jul 24 14:13:30 UTC 2011
R P Herrold <herrold at owlriver.com>

On Sun, 24 Jul 2011, yonatan pingle wrote:

> the coder is not tech savvy as one might expect, so it's 
> really hard for me to explain the issue of having lots of 
> files in one folder to the site owner or to the coder.

I do not expect coders to remain 'not tech savvy'.

If the coder is not willing to learn and to test, you are 
already doomed, and should walk away from the project.

To show the problem, take a pile of pennies, and ask the coder 
to find one with a given year.  The coder will have to do a 
linear search, even to know whether the target exists.  Then 
show an egg carton with another pile of pennies sorted and 
labelled by year in each section, and ask them to repeat the 
task -- in the latter case, it is a 'single seek' to solve the 
problem.

Obviously, the target year may not even be present.  With a 
single pile (directory) the full linear search is still 
required to establish that, but with 'binning' by years, the 
absence is obvious by inspection as well.


One approach to lots of files in a single directory (which can 
cause problems in getting timely access to a specific file) is 
to build a permuted directory tree from the file names to 
spread the load around.  If the files are of a form where they 
have 'closely identical' names [pix00001.jpg, pix00002.jpg, 
etc], first build a 'hashed' version of the file name with 
md5sum, or such, so that the leading characters of the hash 
are spread evenly across the possible values.

[herrold@localhost ~]$ ./hashdemo.sh
pix00001.jpg    fd8f49c6487588989cd764eb493251ec
pix00002.jpg    12955d9587d99becf3b2ede46305624c
pix00003.jpg    bfdc8f593676e4f1e878bb6959f14ce2
[herrold@localhost ~]$ cat hashdemo.sh
#!/bin/sh
#
# Print each candidate file name alongside the md5 hash of that name
CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
for i in ${CANDIDATES}; do
         # echo appends a newline, so the hash covers the name plus "\n"
         HASH=$(echo "$i" | md5sum - | awk '{print $1}')
         echo "$i        ${HASH}"
done
[herrold@localhost ~]$

Then we look to the leading character of the hash to design 
our egg carton bins.  We place pix00001.jpg in directory ./f/ 
and pix00002.jpg in directory ./1/ and pix00003.jpg in 
directory ./b/ and so forth -- if the directories get too full 
again, you might go to using the first two characters of the 
hash to perform the 'binning' process.
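
The binning step might be sketched in shell roughly as follows. 
This is only an illustration, not a finished tool: the function 
names bin_for and store_file are made up for this example, and 
it assumes md5sum and cut from coreutils are on the PATH.  The 
name is hashed with a trailing newline, the same way the echo 
in hashdemo.sh does it.

```shell
#!/bin/sh
# Sketch only: bin_for and store_file are hypothetical helper names.

# Print the bin for a file name: the first DEPTH hex characters
# (default 1) of the md5 hash of the name plus a trailing newline.
bin_for() {
    name=$1
    depth=${2:-1}
    printf '%s\n' "$name" | md5sum | cut -c "1-${depth}"
}

# Move FILE into its bin directory under ROOT, creating the
# bin directory first if it does not exist yet.
store_file() {
    file=$1
    root=$2
    depth=${3:-1}
    bin=$(bin_for "$(basename "$file")" "$depth")
    mkdir -p "$root/$bin" && mv "$file" "$root/$bin/"
}
```

Going by the demo output above, 'store_file pix00001.jpg .' 
would move that file into ./f/.  To find a file again later, 
compute bin_for on its name with the same depth and look only 
in that one directory.  Changing the depth after files have 
been stored means re-binning the existing files.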

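An MD5 function, md5(), is readily available in PHP, as are 
directory creation with mkdir() and so forth, so positioning 
the files, and computing the indexes, are straightforward 
there.  Note that md5() hashes exactly the string it is given, 
while the echo in the shell demo adds a trailing newline, so 
the two will give different hashes for the same name; pick one 
convention and use it for both storing and looking up.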

This is all pretty basic stuff, covered by Knuth in TAOCP long 
ago.

-- Russ herrold