[CentOS] lots of small files in a folder on Linux centos

yonatan pingle yonatan.pingle at gmail.com
Sun Jul 24 14:44:00 UTC 2011


On Sun, Jul 24, 2011 at 5:13 PM, R P Herrold <herrold at owlriver.com> wrote:
> On Sun, 24 Jul 2011, yonatan pingle wrote:
>
>> the coder is not tech savvy as one might expect, so it's
>> really hard for me to explain the issue of having lots of
>> files in one folder to the site owner or to the coder.
>
> I do not expect coders to remain 'not tech savvy'
>
> If the coder is not willing to learn and to test, you are
> already doomed, and should walk away from the project
>
> To show the problem, take a pile of pennies, and ask the coder
> to find one with a given year.  The coder will have to do a
> linear search, to even know if the target exists.  Then show an
> egg carton with another pile of pennies sorted and labelled by
> year in each section, and ask them to repeat the task -- in
> the latter case, it is a 'single seek' to solve the problem
>
> Obviously, the target year may not even be present.  With a
> single pile (directory) the linear search is still required,
> but with 'binning' by years, that is obvious by inspection as
> well
>
>
> One approach to lots of files in a single directory (which can
> cause problems in getting timely access to a specific file) is
> to build a permuted directory tree from the file names to
> spread the load around.  If the files are of a form where they
> have 'closely identical' names [pix00001.jpg, pix00002.jpg,
> etc], first build a 'hashed' version of the file name with
> md5sum, or such, to level the hash leading characters
>
> [herrold at localhost ~]$ ./hashdemo.sh
> pix00001.jpg    fd8f49c6487588989cd764eb493251ec
> pix00002.jpg    12955d9587d99becf3b2ede46305624c
> pix00003.jpg    bfdc8f593676e4f1e878bb6959f14ce2
> [herrold at localhost ~]$ cat hashdemo.sh
> #!/bin/sh
> #
> CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
> for i in ${CANDIDATES}; do
>         HASH=$(echo "$i" | md5sum | awk '{print $1}')
>         echo "$i        ${HASH}"
> done
> [herrold at localhost ~]$
>
> Then we look to the leading character of the hash, to design our
> egg carton bins.  We place pix00001.jpg in directory: ./f/ and
> pix00002.jpg in directory ./1/ and pix00003.jpg in directory
> ./b/ and so forth -- if the directories get too full again,
> you might go to using the first two letters of the hash to
> perform the 'binning' process
>
> The md5 hash function is readily available in PHP (as md5()), as
> are directory creation and so forth, so positioning the files and
> computing the indexes are straightforward there
>
> This is all pretty basic stuff, covered in Knuth in TAOCP long
> ago
>
> -- Russ herrold

Thank you for the excellent analogy; I will actually use it to
explain the matter.

I do hope he understands the simple logic behind a proper directory
tree. It's clearly a design flaw, bad planning or laziness which led
him to this state.

Unfortunately, as bash is easier to read than English for you and me
but not for him, I'll spare him the hashdemo.sh code, simply put it
in words, and hope he figures out the proper way to create a tree.
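
For the list archive, the binning step Russ describes could be sketched
roughly as below. This is only a sketch under assumptions: bin_for,
store_binned and path_for are names made up for illustration, the file
name is hashed with a trailing newline just as hashdemo.sh does, and
md5sum, cut and mkdir -p are assumed to be available as on a stock
CentOS box.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical and are not part of
# Russ's hashdemo.sh.

# Print the bin for a file name: the first DEPTH hex characters of the
# md5 of the name (DEPTH defaults to 1). printf '%s\n' feeds md5sum the
# same bytes that `echo "$i"` does for ordinary names.
bin_for() {
    name=$1
    depth=${2:-1}
    printf '%s\n' "$name" | md5sum | cut -c1-"$depth"
}

# Print the path under ROOT where a given file name would live.
path_for() {
    echo "$2/$(bin_for "$1" "${3:-1}")/$1"
}

# Move a file into ROOT/<bin>/, creating the bin directory if needed.
store_binned() {
    src=$1
    root=$2
    depth=${3:-1}
    base=$(basename "$src")
    bin=$(bin_for "$base" "$depth")
    mkdir -p "$root/$bin" && mv -- "$src" "$root/$bin/$base"
}
```

Something like `store_binned ./pix00001.jpg ./tree` should then move the
file to ./tree/f/pix00001.jpg, going by the hashes in Russ's output, and
passing 2 as the third argument switches to two-character bins if the
directories fill up again. The web code would look a file up with the
same calculation (path_for) instead of scanning one huge directory.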

I am strongly tempted to walk away on this one. Normally, when there is
no co-operation and I get statements like "it's a problem with the
server" when it's clearly a code issue, it's just nerve-racking to try
and help these guys.

As I said earlier, he was hosted directly on a virtual server with
the largest ISP in my country, and they failed to help him (just
selling him more RAM and CPU, until it got to a breaking point).
I actually co-locate at the very same ISP and I know for a fact
they are awesome when it comes to support...

-- 
Best Regards,
Yonatan Pingle
RHCT | RHCSA | CCNA1


