[CentOS] lots of small files in a folder on Linux centos
R P Herrold
herrold at owlriver.com
Sun Jul 24 14:13:30 UTC 2011
On Sun, 24 Jul 2011, yonatan pingle wrote:
> the coder is not tech savvy as one might expect, so it's
> really hard for me to explain the issue of having lots of
> files in one folder to the site owner or to the coder.
I do not expect coders to remain 'not tech savvy'.
If the coder is not willing to learn and to test, you are
already doomed, and should walk away from the project.
To show the problem, take a pile of pennies, and ask the coder
to find one with a given year. The coder will have to do a
linear search, to even know if the target exists. Then show an
egg carton with another pile of pennies sorted and labelled by
year in each section, and ask them to repeat the task -- in
the latter case, it is a 'single seek' to solve the problem.
Obviously, the target year may not even be present. With a
single pile (directory) the full linear search is still
required to establish that, but with 'binning' by years, the
absence is obvious by inspection as well.
One approach to lots of files in a single directory (which can
cause problems in getting timely access to a specific file) is
to build a permuted directory tree from the file names to
spread the load around. If the files are of a form where they
have 'closely identical' names [pix00001.jpg, pix00002.jpg,
etc], first build a 'hashed' version of the file name with
md5sum, or such, so that the leading characters of the hash
are spread evenly across the bins:
[herrold at localhost ~]$ ./hashdemo.sh
pix00001.jpg fd8f49c6487588989cd764eb493251ec
pix00002.jpg 12955d9587d99becf3b2ede46305624c
pix00003.jpg bfdc8f593676e4f1e878bb6959f14ce2
[herrold at localhost ~]$ cat hashdemo.sh
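#!/bin/sh
#
# Print each candidate file name next to the md5 hash of that name.
# Note: echo appends a newline, so the hash covers the name plus "\n".
CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
for i in ${CANDIDATES}; do
    HASH=`echo "$i" | md5sum - | awk '{print $1}'`
    echo "$i ${HASH}"
done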
[herrold at localhost ~]$
Then we look at the leading character of the hash to design
our egg carton bins. We place pix00001.jpg in directory ./f/
and pix00002.jpg in directory ./1/ and pix00003.jpg in
directory ./b/ and so forth, for sixteen bins in all (0-9 and
a-f). If the directories get too full again, you might go to
using the first two characters of the hash (256 bins) to
perform the 'binning' process.
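The binning step could be sketched in shell roughly as follows.
This is only an illustration under stated assumptions: bin_for and
store_file are hypothetical helper names, not part of hashdemo.sh,
and the name is hashed with a trailing newline so that the bins
agree with the echo-based hashes shown in the demo.

```shell
#!/bin/sh
# Sketch only: bin_for and store_file are hypothetical helper names.

# Print the first $2 hex characters (default 1) of the md5 of name $1.
# printf '%s\n' emits the name plus a newline, the same bytes that
# echo "$i" feeds to md5sum in hashdemo.sh.
bin_for() {
    printf '%s\n' "$1" | md5sum | cut -c1-"${2:-1}"
}

# Move file $1 into <root>/<bin>/, creating the bin directory if needed.
# $2 is the tree root (default "."), $3 the prefix length (default 1).
store_file() {
    name=`basename "$1"`
    bin=`bin_for "$name" "$3"`
    mkdir -p "${2:-.}/$bin" && mv "$1" "${2:-.}/$bin/$name"
}
```

Going by the hashes listed in the demo output, store_file
pix00001.jpg would move that file to ./f/pix00001.jpg. Passing 2
as the third argument switches to two-character bins if the
one-character directories fill up.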
An MD5 function is readily available in PHP (md5()), as are
directory creation (mkdir()) and so forth, so positioning the
files and computing the indexes are straightforward there.
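Computing the index again at read time is the same calculation.
A minimal sketch, kept in shell to match hashdemo.sh rather than
PHP; path_for is a hypothetical helper name, and it assumes the
files were binned on the first hash character of the name plus a
trailing newline, as echo produces. Whatever language stores the
files must hash exactly the same bytes, or the lookup will land in
the wrong bin.

```shell
#!/bin/sh
# Sketch only: path_for is a hypothetical helper name.

# Print the binned path of file name $1 under tree root $2 (default ".").
path_for() {
    bin=`printf '%s\n' "$1" | md5sum | cut -c1`
    echo "${2:-.}/$bin/$1"
}

# Usage: one existence test on a computed path, no directory scan.
# [ -f "`path_for pix00002.jpg`" ] && echo "present"
```

This is the 'single seek' of the egg carton: the path is derived
from the name alone, so a missing file is detected by one test
instead of a walk through the whole directory.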
This is all pretty basic stuff, covered in Knuth in TAOCP long
ago.
-- Russ herrold