Hello, I have a rather annoying ongoing issue with one of my CentOS virtual servers. The server hosts a website using Apache and MySQL, and there are three people involved in keeping the site up and running: I am the root admin, since the owner does not know anything about Linux; there is a PHP/SQL coder; and there is the site owner, who only knows how to use the CMS and upload new articles to the website.
The coder and the site owner have been working together for a long time already; I am their new admin (the last one was a major ISP which failed to host the site properly).
Lately the server has been under-performing: load averages are high, the MySQL service keeps crashing, and the server is hitting max memory usage (so I added RAM...). After looking into the website folders, I found one folder which, from my point of view, is one of the causes of the server load.
(Sorry for piping ls.)
uploads]# ls | wc -l
3123
I have talked with the site owner, who in turn showed this to the coder; now he throws the ball back, claiming it has nothing to do with server performance. The folder is full of images, about 40K each, and I have good reason to believe this is the problem, as this is not the first time I have seen a folder holding a large number of files cause a server to under-perform.
The coder is not as tech savvy as one might expect, so it's really hard for me to explain the issue of having lots of files in one folder to the site owner or to the coder.
The hardware is a decent machine: dual E5530, 24GB RAM, with six hard drives in RAID. The virtual server has 2GB of RAM and its own CPU share (4 cores, 8 threads). The coder is arguing against the facts, and sad to say he has the site owner on "his side".
Long story short: how should I explain, in the simplest way and in plain English, that having that many files in a folder will cause a server to work slower?
Pros vs. cons of having a large number of small files in the same folder on Linux/CentOS?
2011/7/24 yonatan pingle yonatan.pingle@gmail.com:
I assume that you are using ext3 or ext4 filesystems? Both ext3 and ext4 slow down if there are too many files in the same directory. XFS is a solution to this problem.
-- Eero
Am 24.07.2011 13:03, schrieb Eero Volotinen:
2011/7/24 yonatan pingle yonatan.pingle@gmail.com:
uploads]# ls | wc -l
3123
I assume that you are using ext3 or ext4 filesystems? Both ext3 and ext4 slow down if there are too many files in the same directory. XFS is a solution to this problem.
Eero
Seriously, 3123 files in a single directory is not an issue for any of the extX filesystems. Though ext2 probably performs the worst, ext3 and particularly ext4 should not have any problem with that small a number of file objects, provided that the filesystem is not already filled to nearly 100%.
An issue may be how the code deals with the directory content. Horrible code can certainly impact the speed of the website, but it should not affect the system globally.
Yonatan, if you really are concerned about the uploads directory, then use vmstat, iostat or sar to check system parameters while the directory is accessed.
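For example, something along these lines (the intervals are just a suggestion; iostat and sar come from the sysstat package):

vmstat 5              # memory, swap, block I/O and CPU, sampled every 5 seconds
iostat -x 5           # per-device utilization and service times
sar -q 5 5            # run queue length and load averages, 5 samples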
Your problem is something else, I am pretty sure.
Alexander
On Sun, Jul 24, 2011 at 2:19 PM, Alexander Dalloz ad+lists@uni-x.org wrote:
Hi Alexander, good suggestions. I'll monitor I/O and the MySQL code; it sounds like a code-related issue and not a CentOS issue after all.
It runs on ext3. I can only guess how the code deals with the dir, but it seems the site builds the pages using PHP+MySQL for each visitor, and with about 40K unique visitors a day that is a lot of I/O.
This looks like an issue with MySQL after all:
Queries: 48.0M  qps: 66  Slow: 65.0
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.97    0.00    0.28   97.91    0.00    0.84
runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
      0       102      5.30      3.13      2.06
      2       120      3.14      2.77      2.22
We wait and see. tail -f log-slow-queries.log:
/usr/sbin/mysqld, Version: 5.0.67-community-log (MySQL Community Edition (GPL)). started with:
Tcp port: 3306  Unix socket: /var/lib/mysql/mysql.sock
Time                 Id Command    Argument
thank you
yonatan pingle wrote:
Do you have caching turned on in the CMS? That could help.
There is no caching system; it's a "home-made" CMS.
2011/7/24 yonatan pingle yonatan.pingle@gmail.com:
There is no caching system; it's a "home-made" CMS.
You can use an accelerator too.
http://en.wikipedia.org/wiki/PHP_accelerator http://en.wikipedia.org/wiki/List_of_PHP_accelerators
Please make a full backup before this! (I have never had a problem, but... why tempt the devil?)
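As a sketch only (package availability depends on your repositories, and APC is just one of the accelerators on that list):

yum install php-pecl-apc        # assumes the EPEL repository is enabled
service httpd restart
php -m | grep -i apc            # confirm the extension is loaded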
On Sun, Jul 24, 2011 at 7:52 AM, yonatan pingle yonatan.pingle@gmail.com wrote:
If you are using phpMyAdmin, the status page will aid you in tuning MySQL. Look for values in red; the description will usually tell you what to adjust to improve performance.
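If phpMyAdmin is not installed, the same counters can be pulled straight from the server on the command line (adjust the credentials to taste):

mysql -u root -p -e "SHOW GLOBAL STATUS" | less
mysql -u root -p -e "SHOW GLOBAL VARIABLES" | less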
Ryan
I'm good with mysqltuner.pl; it seems there are slow queries on MySQL, and I have adjusted all the values in my.cnf according to the application's needs.
It looks like it's all in the code and the way the CMS handles the files from that upload directory, so there is nothing wrong with the CentOS machine after all; it's doing its job.
I'll point the coder to the status page and hope he gets a clue.
Thank you everybody for the good advice, I am now sure it's not "my fault" :-)
/thread
On Sun, Jul 24, 2011 at 8:40 AM, yonatan pingle yonatan.pingle@gmail.com wrote:
Sounds like you need to enable slow query logging in MySQL. Give your developer the log and let him know to either optimize the queries or create appropriate indexes to improve performance.
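On a MySQL 5.0 box that usually means a couple of lines in my.cnf and a restart; a rough sketch (the path and threshold are just examples, and the heredoc appends to the end of the file, so make sure the lines end up under [mysqld]):

cat >> /etc/my.cnf <<'EOF'
log-slow-queries = /var/log/mysql-slow.log
long_query_time  = 2
log-queries-not-using-indexes
EOF
touch /var/log/mysql-slow.log
chown mysql:mysql /var/log/mysql-slow.log
service mysqld restart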
Ryan
On Sun, Jul 24, 2011 at 3:43 PM, Ryan Wagoner rswagoner@gmail.com wrote:
Yes Ryan, that's exactly what I have done. He will get the log shortly, and I will get some not-free beer. :-)
On Sun, Jul 24, 2011 at 03:53:46PM +0300, yonatan pingle wrote:
Yes Ryan, that's exactly what I have done. He will get the log shortly, and I will get some not-free beer.
While I'm all for MySQL optimization, it's clearly evident from an earlier posting that your disks are thrashing, with insanely high iowait figures; and while it's _possible_ for this to be caused by MySQL, you really have to go out of your way to achieve that type of behavior.
John
On Sun, Jul 24, 2011 at 4:02 PM, John R. Dennison jrd@gerdesas.com wrote:
This is exactly what I was thinking: that's an insane iowait value. Taking into consideration that it's a VM, not the hardware machine, and the fact that he fills up all his RAM, along with the slow queries showing in the log, it's simply bad code and wrong handling of files.
On Sun, 24 Jul 2011, yonatan pingle wrote:
The coder is not as tech savvy as one might expect, so it's really hard for me to explain the issue of having lots of files in one folder to the site owner or to the coder.
I do not expect coders to remain 'not tech savvy'
If the coder is not willing to learn and to test, you are already doomed, and should walk away from the project
To show the problem, take a pile of pennies, and ask the coder to find one with a given year. The coder will have to do a linear search, to even know if the target exists. Then show an egg carton with another pile of pennies sorted and labelled by year in each section, and ask them to repeat the task -- in the latter case, it is a 'single seek' to solve the problem
Obviously, the target year may not even be present. With a single pile (directory) the linear search is still required, but with 'binning' by years, that is obvious by inspection as well
One approach to lots of files in a single directory (which can cause problems in getting timely access to a specific file) is to build a permuted directory tree from the file names to spread the load around. If the files are of a form where they have 'closely identical' names [pix00001.jpg, pix00002.jpg, etc], first build a 'hashed' version of the file name with md5sum, or such, to level out the leading characters of the hash
[herrold@localhost ~]$ ./hashdemo.sh
pix00001.jpg fd8f49c6487588989cd764eb493251ec
pix00002.jpg 12955d9587d99becf3b2ede46305624c
pix00003.jpg bfdc8f593676e4f1e878bb6959f14ce2
[herrold@localhost ~]$ cat hashdemo.sh
#!/bin/sh
#
CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
for i in `echo "${CANDIDATES}"`; do
        HASH=`echo "$i" | md5sum - | awk {'print $1'}`
        echo "$i ${HASH}"
done
[herrold@localhost ~]$
Then we look to the leading letter of the hash to design our egg carton bins. We place pix00001.jpg in directory ./f/, pix00002.jpg in directory ./1/, pix00003.jpg in directory ./b/, and so forth -- if the directories get too full again, you might go to using the first two letters of the hash to perform the 'binning' process
The md5sum function is readily available in PHP, as are directory creation and so forth, so positioning the files and computing the indexes are straightforward there
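A rough shell sketch of that binning step, assuming the images live in ./uploads and a single-character bin is enough (test it on a copy first):

#!/bin/sh
# move each file into a subdirectory named after the first
# character of the md5 of the file NAME (not its content)
cd ./uploads || exit 1
for f in *.jpg; do
        [ -f "$f" ] || continue
        HASH=`echo "$f" | md5sum - | awk '{print $1}'`
        BIN=`echo "$HASH" | cut -c1`
        mkdir -p "$BIN"
        mv -- "$f" "$BIN/"
done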
This is all pretty basic stuff, covered in Knuth in TAOCP long ago
-- Russ herrold
On Sun, Jul 24, 2011 at 5:13 PM, R P Herrold herrold@owlriver.com wrote:
Thank you for the excellent analogy; I will actually use it to explain the matter.
I do hope he understands the simple logic behind a proper directory tree; it's clearly a design flaw, and bad planning or laziness led him to this state.
Unfortunately, as bash is easier to read than English for you and me, I'll spare him the hashdemo.sh code and simply put it in words, and hope he figures out the proper way to create a tree.
I am strongly tempted to walk away on this one. When there is no co-operation, and statements like "it's a problem with the server" keep coming when clearly it's a code issue, it's just nerve-wracking to try and help these guys.
As I said earlier, he was hosted directly on a virtual server with the largest ISP in my country, and they failed to help him (just selling him more RAM and CPU until it got to a breaking point). I actually co-locate at the very same ISP, and I know for a fact they are awesome when it comes to support...
On Sun, Jul 24, 2011 at 5:13 PM, R P Herrold herrold@owlriver.com wrote:
If the pictures are named sequentially, why not store them at 100 per directory in a structure something like this:
/pix/0/00/pix00001.jpg
/pix/0/26/pix02614.jpg
/pix/6/72/pix67255.jpg
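A quick sketch of deriving the path from the file name under that scheme (it assumes the fixed 'pix' prefix and five-digit numbering shown above):

NAME="pix02614.jpg"
NUM=`echo "$NAME" | sed 's/^pix\([0-9]\{5\}\)\.jpg$/\1/'`
D1=`echo "$NUM" | cut -c1`
D2=`echo "$NUM" | cut -c2-3`
echo "/pix/$D1/$D2/$NAME"       # -> /pix/0/26/pix02614.jpg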
On Sunday 24 July 2011 22:48:23 Always Learning wrote:
As I have worked on projects where the 'coder' is not willing to make any changes, I offer you another temporary solution:
If the pictures are in /home/site/public_html/images, you simply need to create a tmpfs, copy the pictures there and then bind mount the tmpfs in that directory:
# mkdir /home/site/ram
# mount -t tmpfs -o size=200M none /home/site/ram
# cp -a /home/site/public_html/images/* /home/site/ram
# mount --bind /home/site/ram /home/site/public_html/images
Instant performance gain, while you wait for the coder to actually fix the problem.
However, you should make sure that you copy any new images from the ramdisk back to disk. Maybe with inotifywatch.
Keep in mind that this is only a temporary measure, meant to serve as proof that this is the problem and that it needs to be fixed. Try to explain that this hack is not an actual solution.
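A minimal sketch of that sync-back, using inotifywait from the inotify-tools package (a sibling of the inotifywatch tool mentioned above; the staging path is made up, and whatever lands there has to be merged into the real images directory once the bind mount is removed):

#!/bin/sh
# watch the tmpfs copy and keep a persistent copy of anything uploaded
# while the bind mount is in place
RAM=/home/site/ram
STAGE=/home/site/ram-staging      # hypothetical on-disk staging directory
mkdir -p "$STAGE"
inotifywait -m -e create -e moved_to --format '%f' "$RAM" |
while read FILE; do
        cp -a "$RAM/$FILE" "$STAGE/"
done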
On Sun, 24 Jul 2011, Always Learning wrote:
If the pictures are named sequentially, why not store them at 100 per directory in a structure something like this:
/pix/0/00/pix00001.jpg
/pix/0/26/pix02614.jpg
/pix/6/72/pix67255.jpg
Go read Knuth
One does not do that because then one is counting on the end user's data to conform to, and to continue to conform to, your expectations [here you have added an invisible constraint of 'pix' as the first part of the file name which you are hoping remains constant -- it will not, as a survey of naming schemes used by digital camera makers will reveal]. Your explicit constraint of a monotonically increasing image number is also not likely to be realized in a world where people will erase, or for other reasons not submit, all of a given photo shoot
By using a hash, we remove those constraints, and also gain for free the virtuous effect of self-organizing a relatively level dispersion of files across the destination directories
-- Russ herrold
On Sun, 24 Jul 2011, R P Herrold wrote:
By using a hash, we remove those constraints, and also gain for free the virtuous effect of self-organizing a relatively level dispersion of files across the destination directories
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
Kind Regards,
Keith Roberts
On Sun, 24 Jul 2011, Keith Roberts wrote:
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
Fortunately there is a full, and freely accessible, archive of all posts to this mailing list. The link to that archive is in the header of every message through this list. As such you need not speculate
As I read the post initially, the problem was as stated in the subject line, and the database issue was not in the forefront
Per the initial problem description, the files were all splatted into a single directory. The fastest database I know of is the filesystem itself; the addition of the hashing is just a pointer, and so also O(1)
Adding a database engine, with the overhead that it brings (and, as the thread has already pointed out, in a domU as well, not usually the best place to add the overhead of a database), simply adds additional points of mis-design
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified” - Donald Knuth [1]
Once the implementation is 'correct', then it is time to do A:B testing to see where the real problem lies ... which testing was at the head of my initial post on this topic
-- Russ herrold
[1] http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf
A person not willing to pony up $2.73 for a used copy of 'The Art of Computer Programming: Sorting and Searching. Volume 3', which discusses the specific problem space here, may wish to read and consider his rather nice lecture published by the ACM
On Sun, 2011-07-24 at 17:50 -0400, R P Herrold wrote:
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
The answer must be 'yes' to a normal problem of identifying (searching for) then retrieving data. MySQL would be a good choice.
Russ' adoration(?) of Donald KNUTH made me read the first page of
[1] http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf
which includes this
"This study focuses largely on two issues: (a) improved syntax for iterations and error exits, making it possible to write a larger class of programs clearly and efficiently without go to statements; (b) a methodology of program design, beginning with readable and correct, but possibly inefficient programs that are systematically transformed if necessary into efficient and correct, but possibly less readable code."
A computer programmer cannot change the syntax of the language he or she is writing in. The syntax of any programming language is determined by the creator of that programming language.
Spaghetti code is a trademark of confused programmers, usually of little ability, who have certainly never spent days trying to debug someone else's programme. Spaghetti code can always be avoided by a clear understanding of what the user wants, coupled with the programmer's in-depth understanding of how to implement the user's requirements in the chosen programming language, whilst remembering someone else may have to maintain the programme.
Hashing file names is an interesting concept, but a simple MySQL db application (and they are very simple to write) running as HTML pages, with a dash of PHP, makes the application universally accessible and easy to use. Oh, and on CentOS, amazingly quick to run :-)
On 7/24/11 4:08 PM, Keith Roberts wrote:
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
You introduce new issues that way, where the name in the database can't be managed atomically with the name in the directory. Consider what might happen with concurrent operations trying to add different files with the same name, or perhaps an add and a delete at the same time.
And it still doesn't help with the real problem unless you do something to break up the large directory. Unix-like filesystems guarantee atomic operations in filename manipulation, so every time you try to create a file, the system must check that the name does not already exist, find an empty slot for the name and insert it with the directory locked against other changes until that is complete. Filesystems that index directories can help with the lookup, with the tradeoff that additions require an index update.
On Sun, 2011-07-24 at 16:33 -0400, R P Herrold wrote:
One does not do that because then one is counting on the end user's data to conform to, and to continue to conform to, your expectations [here you have added an invisible constraint of 'pix' as the first part of the file name which you are hoping remains constant -- it will not, as a survey of naming schemes used by digital camera makers will reveal]. Your explicit constraint of a monotonically increasing image number is also not likely to be realized in a world where people will erase, or for other reasons not submit, all of a given photo shoot
I did begin with 'IF' :-)
Photo-shoot or whatever, using the 'rename' command means pictures can adopt a uniform numbering system. There is no logical or genuine practical reason to accept a disorganised mess.
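For example, with the util-linux rename shipped on CentOS (the file names here are made up):

rename DSC_ pix0 DSC_*.jpg      # DSC_0123.jpg -> pix00123.jpg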
I have about 21,000+ pictures - all my own work. I can find and display any of them within about 17 seconds (just timed myself) using basic operating system commands. (My database application is unfinished).
On Sunday 24 July 2011 10:13:30 R P Herrold wrote:
#!/bin/sh
#
CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
for i in `echo "${CANDIDATES}"`; do
        HASH=`echo "$i" | md5sum - | awk {'print $1'}`
        echo "$i ${HASH}"
done
I know it absolutely has nothing to do with databases or files in folders, but as we are talking about optimizing:
#!/bin/bash
CANDIDATES=(pix00001.jpg pix00002.jpg pix00003.jpg)
for i in "${CANDIDATES[@]}"; do
        MD5SUM=$(md5sum <(echo $i))
        echo "$i ${MD5SUM% *}"
done
It's more than twice as fast as the previous sh script.
[ willing to learn mode, feel free to ignore this]
Anyway, about the hashes and directories and so on... I assume we'd need a hash table in our application, right?
Would we proceed as follows (correct me if I'm wrong please)?
1- md5sum the file we need
2- look for the first letter of the hash
3- get into the directory
4- now we look for our file
Is this right? I understand this would improve the searching of files when there are a lot of them.
Thanks to anyone that replies, and sorry for the off-topic.
Regards,
Marc Deop
On Mon, 25 Jul 2011, Marc Deop wrote:
It's more than twice as fast as the previous sh script.
In part this is /bin/sh vs /bin/bash, and whether 'bashisms' are used matters, but yes, I did not seek to optimize a teaching throwaway
1- md5sum the file we need
... actually the NAME of the file, to make it explicit we are not looking at content [also a reasonable approach if one is looking to find and de-duplicate a filestore]
2- look for the first letter of the hash
... actually this may be more than a single letter of the hash -- with ca 3000 files and 16 possible leading hash characters (0-9, a-f), we should end up with about 200 files per subdirectory. The filesystem should be doing some sort of index as well -- as I recall, a B-tree in the case of extN, but I've not expressly looked. The php case was mentioned, however, and its directory searching is less optimal
We have a customer with a similar problem with a naively written set of home-brewed PHP code, and are helping them work through similar issues
3- get into the directory
4- now we look for our file
... this is probably a single operation to suck the sub-directory listing into an array in php, and use an associative match
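In shell terms, the lookup boils down to something like this (a sketch; the uploads/ path and the single-character bins are carried over from earlier in the thread as assumptions):

#!/bin/sh
# given a file name, compute its bin and check whether the file is there
NAME="pix00002.jpg"
BASE=./uploads
HASH=`echo "$NAME" | md5sum - | awk '{print $1}'`
BIN=`echo "$HASH" | cut -c1`
if [ -f "$BASE/$BIN/$NAME" ]; then
        echo "found: $BASE/$BIN/$NAME"
else
        echo "not found"
fi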
but you are right, we are moving increasingly away from a CentOS issue to a more general coding style issue
-- Russ herrold
Greetings,
On Sun, Jul 24, 2011 at 2:59 PM, yonatan pingle yonatan.pingle@gmail.com wrote:
Hello, after looking into the website folders, I found one folder which, from my point of view, is one of the causes of the server load.
hmm... does mounting <dir> with -o noatime,nodiratime help speed it up?
On Mon, Jul 25, 2011 at 06:38:33AM +0530, Rajagopal Swaminathan wrote:
hmm... does mounting <dir> with -o noatime,nodiratime help speed it up?
Just an FYI:
noatime is a superset that includes nodiratime.
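For a quick test without touching /etc/fstab, something along these lines should work (the mount point is just an example; make it permanent in fstab only once you are happy with the result):

mount -o remount,noatime /home      # remount the filesystem holding the uploads
mount | grep /home                  # verify the new options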
John
On Sunday, July 24, 2011 05:29:23 AM yonatan pingle wrote: ...
Lately the server has been under-performing: load averages are high, the MySQL service keeps crashing, and the server is hitting max memory usage (so I added RAM...). After looking into the website folders, I found one folder which, from my point of view, is one of the causes of the server load.
...
uploads]# ls | wc -l
3123
...
Pros vs. cons of having a large number of small files in the same folder on Linux/CentOS?
3,123 files is not a large number. From a CentOS 4 file server here.....
[root@pachyderm sky_data]# ls | wc -l
13526
[root@pachyderm sky_data]# cd ../motse
[root@pachyderm motse]# ls | wc -l
28218
[root@pachyderm motse]# cd
[root@pachyderm ~]# du -s /var/lib/pgsql
556420596       /var/lib/pgsql
[root@pachyderm ~]#
(Yeah, 556GB in PostgreSQL....) Pachyderm = 'The elephant never forgets....' But I'm not looking forward to converting it to a post-C4 PostgreSQL....
Performance on this box is pretty good, all things considered.
Large log files, I have found, can be performance problems; check to make sure log files are being rotated properly.
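As a sketch, a logrotate stanza along these lines keeps the MySQL slow log from growing without bound (the log path is assumed from earlier in the thread, and flush-logs needs credentials, e.g. in /root/.my.cnf):

cat > /etc/logrotate.d/mysql-slow <<'EOF'
/var/lib/mysql/log-slow-queries.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        /usr/bin/mysqladmin flush-logs >/dev/null 2>&1 || true
    endscript
}
EOF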
There are some specific MySQL tuning documents out there; I seem to remember a posting on a local LUG list about some serious MySQL performance issues that took a long time to ferret out, but I can't seem to find it quickly.....