Hello, I have a rather annoying ongoing issue with one of my CentOS virtual servers. The server hosts a website using Apache and MySQL, and there are three people involved in keeping the site up and running: I am the root admin, since the owner does not know anything about Linux; there is a PHP/SQL coder; and there is the site owner, who only knows how to use the CMS and upload new articles to the website.
The coder and the site owner have been working together for a long time already; I am their new admin (the last one was a major ISP which failed to host the site properly).
Lately the server has been under-performing: load averages are high, the MySQL service keeps crashing, and the server is hitting max memory usage (so I added RAM...). After looking into the website folders, I found one folder which, from my point of view, is one of the causes of the server load.
(Sorry for piping ls.)
uploads]# ls | wc -l
3123
I have talked with the site owner, who in turn showed this to the coder; now he throws the ball back, claiming it has nothing to do with server performance. The folder is full of images, about 40K each, and I have good reason to believe this is the problem, as this is not the first time I have seen a folder holding a large number of files cause a server to under-perform.
The coder is not as tech savvy as one might expect, so it's really hard for me to explain the issue of having lots of files in one folder to the site owner or to the coder.
The hardware is a decent machine: dual E5530, 24GB RAM, with six hard drives in RAID. The virtual server has 2GB of RAM and its own CPU share (4 cores, 8 threads). The coder is arguing against the facts, and sad to say he has the site owner on "his side".
Long story short: how should I explain, in the simplest way and in plain English, that having that many files in a folder will cause a server to work slower?
Pros vs. cons of having a large number of small files in the same folder on Linux/CentOS?
2011/7/24 yonatan pingle yonatan.pingle@gmail.com:
I assume that you are using ext3 or ext4 filesystems? Both ext3 and ext4 slow down if there are too many files in the same directory. XFS is a solution to this problem.
-- Eero
Am 24.07.2011 13:03, schrieb Eero Volotinen:
2011/7/24 yonatan pingle yonatan.pingle@gmail.com:
uploads]# ls | wc -l
3123
I assume that you are using ext3 or ext4 filesystems? Both ext3 and ext4 slow down if there are too many files in the same directory. XFS is a solution to this problem.
Eero
Seriously, 3123 files in a single directory is not an issue for any of the extX filesystems. Though ext2 probably performs the worst, ext3 and particularly ext4 should not have any problem with that small a number of file objects, provided that the filesystem is not already filled to nearly 100%.
An issue may be how the code deals with the directory content. Horrible code can certainly impact the speed of the website, but it should not affect the system globally.
Yonatan, if you really are concerned about the uploads directory, then use vmstat, iostat or sar to check system parameters while the directory is accessed.
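For example, something along these lines (the intervals are just a suggestion; iostat and sar come from the sysstat package):

vmstat 5              # memory, swap, block I/O and CPU, sampled every 5 seconds
iostat -x 5           # per-device utilization and service times
sar -q 5 5            # run queue length and load averages, 5 samples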
Your problem is something else, I am pretty sure.
Alexander
On Sun, Jul 24, 2011 at 2:19 PM, Alexander Dalloz ad+lists@uni-x.org wrote:
Hi Alexander, good suggestions. I'll monitor I/O and the MySQL code; it sounds like a code-related issue and not a CentOS issue after all.
It runs on ext3. I can only guess how the code deals with the dir, but it seems the site builds the pages using PHP+MySQL for each visitor, and with about 40K unique visitors a day that is a lot of I/O.
This looks like an issue with MySQL after all:
Queries: 48.0M  qps: 66  Slow: 65.0
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.97    0.00    0.28   97.91    0.00    0.84
runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
      0       102      5.30      3.13      2.06
      2       120      3.14      2.77      2.22
We wait and see. tail -f log-slow-queries.log:
/usr/sbin/mysqld, Version: 5.0.67-community-log (MySQL Community Edition (GPL)). started with:
Tcp port: 3306  Unix socket: /var/lib/mysql/mysql.sock
Time                 Id Command    Argument
thank you
yonatan pingle wrote:
Do you have caching turned on in the CMS? That could help.
There is no caching system; it's a "home-made" CMS.
2011/7/24 yonatan pingle yonatan.pingle@gmail.com:
There is no caching system; it's a "home-made" CMS.
You can use an accelerator too.
http://en.wikipedia.org/wiki/PHP_accelerator http://en.wikipedia.org/wiki/List_of_PHP_accelerators
Please make a full backup before this! (I have never had a problem, but... why tempt the devil?)
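As a sketch only (package availability depends on your repositories, and APC is just one of the accelerators on that list):

yum install php-pecl-apc        # assumes the EPEL repository is enabled
service httpd restart
php -m | grep -i apc            # confirm the extension is loaded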
On Sun, Jul 24, 2011 at 7:52 AM, yonatan pingle yonatan.pingle@gmail.com wrote:
If you are using phpMyAdmin, the status page will aid you in tuning MySQL. Look for values in red; the description will usually tell you what to adjust to improve performance.
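If phpMyAdmin is not installed, the same counters can be pulled straight from the server on the command line (adjust the credentials to taste):

mysql -u root -p -e "SHOW GLOBAL STATUS" | less
mysql -u root -p -e "SHOW GLOBAL VARIABLES" | less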
Ryan
I'm good with mysqltuner.pl; it seems there are slow queries on MySQL, and I have adjusted all the values in my.cnf according to the application's needs.
It looks like it's all in the code and the way the CMS handles the files from that upload directory, so there is nothing wrong with the CentOS machine after all; it's doing its job.
I'll point the coder to the status page and hope he gets a clue.
Thank you everybody for the good advice, I am now sure it's not "my fault" :-)
/thread
On Sun, Jul 24, 2011 at 8:40 AM, yonatan pingle yonatan.pingle@gmail.com wrote:
Sounds like you need to enable slow query logging in MySQL. Give your developer the log and let him know to either optimize the queries or create appropriate indexes to improve performance.
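On a MySQL 5.0 box that usually means a couple of lines in my.cnf and a restart; a rough sketch (the path and threshold are just examples, and the heredoc appends to the end of the file, so make sure the lines end up under [mysqld]):

cat >> /etc/my.cnf <<'EOF'
log-slow-queries = /var/log/mysql-slow.log
long_query_time  = 2
log-queries-not-using-indexes
EOF
touch /var/log/mysql-slow.log
chown mysql:mysql /var/log/mysql-slow.log
service mysqld restart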
Ryan
On Sun, Jul 24, 2011 at 3:43 PM, Ryan Wagoner rswagoner@gmail.com wrote:
Yes Ryan, that's exactly what I have done. He will get the log shortly, and I will get some not-free beer. :-)
On Sun, Jul 24, 2011 at 03:53:46PM +0300, yonatan pingle wrote:
Yes Ryan, that's exactly what I have done. He will get the log shortly, and I will get some not-free beer.
While I'm all for MySQL optimization, it's clearly evident from an earlier posting that your disks are thrashing, with insanely high iowait figures; and while it's _possible_ for this to be caused by MySQL, you really have to go out of your way to achieve that type of behavior.
John
On Sun, Jul 24, 2011 at 4:02 PM, John R. Dennison jrd@gerdesas.com wrote:
This is exactly what I was thinking: that's an insane iowait value. Taking into consideration that it's a VM, not the hardware machine, and the fact that he fills up all his RAM, along with the slow queries showing in the log, it's simply bad code and wrong handling of files.
On Sun, 24 Jul 2011, yonatan pingle wrote:
The coder is not as tech savvy as one might expect, so it's really hard for me to explain the issue of having lots of files in one folder to the site owner or to the coder.
I do not expect coders to remain 'not tech savvy'
If the coder is not willing to learn and to test, you are already doomed, and should walk away from the project
To show the problem, take a pile of pennies, and ask the coder to find one with a given year. The coder will have to do a linear search, to even know if the target exists. Then show an egg carton with another pile of pennies sorted and labelled by year in each section, and ask them to repeat the task -- in the latter case, it is a 'single seek' to solve the problem
Obviously, the target year may not even be present. With a single pile (directory) the linear search is still required, but with 'binning' by years, that is obvious by inspection as well
One approach to lots of files in a single directory (which can cause problems in getting timely access to a specific file) is to build a permuted directory tree from the file names to spread the load around. If the files are of a form where they have 'closely identical' names [pix00001.jpg, pix00002.jpg, etc], first build a 'hashed' version of the file name with md5sum, or such, to level out the leading characters of the hash
[herrold@localhost ~]$ ./hashdemo.sh
pix00001.jpg fd8f49c6487588989cd764eb493251ec
pix00002.jpg 12955d9587d99becf3b2ede46305624c
pix00003.jpg bfdc8f593676e4f1e878bb6959f14ce2
[herrold@localhost ~]$ cat hashdemo.sh
#!/bin/sh
#
CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
for i in `echo "${CANDIDATES}"`; do
        HASH=`echo "$i" | md5sum - | awk {'print $1'}`
        echo "$i ${HASH}"
done
[herrold@localhost ~]$
Then we look to the leading letter of the hash to design our egg carton bins. We place pix00001.jpg in directory ./f/, pix00002.jpg in directory ./1/, pix00003.jpg in directory ./b/, and so forth -- if the directories get too full again, you might go to using the first two letters of the hash to perform the 'binning' process
The md5sum function is readily available in PHP, as are directory creation and so forth, so positioning the files and computing the indexes are straightforward there
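A rough shell sketch of that binning step, assuming the images live in ./uploads and a single-character bin is enough (test it on a copy first):

#!/bin/sh
# move each file into a subdirectory named after the first
# character of the md5 of the file NAME (not its content)
cd ./uploads || exit 1
for f in *.jpg; do
        [ -f "$f" ] || continue
        HASH=`echo "$f" | md5sum - | awk '{print $1}'`
        BIN=`echo "$HASH" | cut -c1`
        mkdir -p "$BIN"
        mv -- "$f" "$BIN/"
done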
This is all pretty basic stuff, covered in Knuth in TAOCP long ago
-- Russ herrold
On Sun, Jul 24, 2011 at 5:13 PM, R P Herrold herrold@owlriver.com wrote:
Thank you for the excellent analogy; I will actually use it to explain the matter.
I do hope he understands the simple logic behind a proper directory tree; it's clearly a design flaw, and bad planning or laziness led him to this state.
Unfortunately, as bash is easier to read than English for you and me, I'll spare him the hashdemo.sh code and simply put it in words, and hope he figures out the proper way to create a tree.
I am strongly tempted to walk away on this one. When there is no co-operation, and statements like "it's a problem with the server" keep coming when clearly it's a code issue, it's just nerve-wracking to try and help these guys.
As I said earlier, he was hosted directly on a virtual server with the largest ISP in my country, and they failed to help him (just selling him more RAM and CPU until it got to a breaking point). I actually co-locate at the very same ISP, and I know for a fact they are awesome when it comes to support...
On Sun, Jul 24, 2011 at 5:13 PM, R P Herrold herrold@owlriver.com wrote:
If the pictures are named sequentially, why not store them at 100 per directory in a structure something like this:
/pix/0/00/pix00001.jpg
/pix/0/26/pix02614.jpg
/pix/6/72/pix67255.jpg
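A quick sketch of deriving the path from the file name under that scheme (it assumes the fixed 'pix' prefix and five-digit numbering shown above):

NAME="pix02614.jpg"
NUM=`echo "$NAME" | sed 's/^pix\([0-9]\{5\}\)\.jpg$/\1/'`
D1=`echo "$NUM" | cut -c1`
D2=`echo "$NUM" | cut -c2-3`
echo "/pix/$D1/$D2/$NAME"       # -> /pix/0/26/pix02614.jpg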
On Sunday 24 July 2011 22:48:23 Always Learning wrote:
As I have worked on projects where the 'coder' is not willing to make any changes, I offer you another temporary solution:
If the pictures are in /home/site/public_html/images, you simply need to create a tmpfs, copy the pictures there and then bind mount the tmpfs in that directory:
# mkdir /home/site/ram
# mount -t tmpfs -o size=200M none /home/site/ram
# cp -a /home/site/public_html/images/* /home/site/ram
# mount --bind /home/site/ram /home/site/public_html/images
Instant performance gain, while you wait for the coder to actually fix the problem.
However, you should make sure that you copy any new images from the ramdisk back to disk. Maybe with inotifywatch.
Keep in mind that this is only a temporary measure, meant to serve as proof that this is the problem and that it needs to be fixed. Try to explain that this hack is not an actual solution.
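A minimal sketch of that sync-back, using inotifywait from the inotify-tools package (a sibling of the inotifywatch tool mentioned above; the staging path is made up, and whatever lands there has to be merged into the real images directory once the bind mount is removed):

#!/bin/sh
# watch the tmpfs copy and keep a persistent copy of anything uploaded
# while the bind mount is in place
RAM=/home/site/ram
STAGE=/home/site/ram-staging      # hypothetical on-disk staging directory
mkdir -p "$STAGE"
inotifywait -m -e create -e moved_to --format '%f' "$RAM" |
while read FILE; do
        cp -a "$RAM/$FILE" "$STAGE/"
done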
On Sun, 24 Jul 2011, Always Learning wrote:
If the pictures are named sequentially, why not store them at 100 per directory in a structure something like this:
/pix/0/00/pix00001.jpg
/pix/0/26/pix02614.jpg
/pix/6/72/pix67255.jpg
Go read Knuth
One does not do that because then one is counting on the end user's data to conform to, and to continue to conform to, your expectations [here you have added an invisible constraint of 'pix' as the first part of the file name which you are hoping remains constant -- it will not, as a survey of naming schemes used by digital camera makers will reveal]. Your explicit constraint of a monotonically increasing image number is also not likely to be realized in a world where people will erase, or for other reasons not submit, all of a given photo shoot
By using a hash, we remove those constraints, and also gain for free the virtuous effect of self-organizing a relatively level dispersion of files across the destination directories
-- Russ herrold
On Sun, 24 Jul 2011, R P Herrold wrote:
By using a hash, we remove those constraints, and also gain for free the virtuous effect of self-organizing a relatively level dispersion of files across the destination directories
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
Kind Regards,
Keith Roberts
On Sun, 24 Jul 2011, Keith Roberts wrote:
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
Fortunately there is a full, and freely accessible, archive of all posts to this mailing list. The link to that archive is in the header of every message through this list. As such you need not speculate
As I read the post initially, the problem was as stated in the subject line, and the database issue was not in the forefront
Per the initial problem description, the files were all splatted into a single directory. The fastest database I know of is the filesystem itself; the addition of the hashing is just a pointer, and so also O(1)
Adding a database engine, with the overhead that it brings (and, as the thread has already pointed out, in a domU as well, not usually the best place to add the overhead of a database), simply adds additional points of mis-design
“We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified” - Donald Knuth [1]
Once the implementation is 'correct', then it is time to do A:B testing to see where the real problem lies ... which testing was at the head of my initial post on this topic
-- Russ herrold
[1] http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf
A person not willing to pony up $2.73 for a used copy of 'The Art of Computer Programming: Sorting and Searching. Volume 3', which discusses the specific problem space here, may wish to read and consider his rather nice lecture published by the ACM
On Sun, 2011-07-24 at 17:50 -0400, R P Herrold wrote:
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
The answer must be 'yes' to a normal problem of identifying (searching for) then retrieving data. MySQL would be a good choice.
Russ' adoration(?) of Donald KNUTH made me read the first page of
[1] http://pplab.snu.ac.kr/courses/adv_pl05/papers/p261-knuth.pdf
which includes this
"This study focuses largely on two issues: (a) improved syntax for iterations and error exits, making it possible to write a larger class of programs clearly and efficiently without go to statements; (b) a methodology of program design, beginning with readable and correct, but possibly inefficient programs that are systematically transformed if necessary into efficient and correct, but possibly less readable code."
A computer programmer cannot change the syntax of the language he or she is writing in. The syntax of any programming language is determined by the creator of that programming language.
Spaghetti code is a trademark of confused programmers, usually of little ability, who have certainly never spent days trying to debug someone else's programme. Spaghetti code can always be avoided by a clear understanding of what the user wants, coupled with the programmer's in-depth understanding of how to implement the user's requirements in the chosen programming language, whilst remembering someone else may have to maintain the programme.
Hashing file names is an interesting concept, but a simple MySQL db application (and they are very simple to write) running as HTML pages, with a dash of PHP, makes the application universally accessible and easy to use. Oh, and on CentOS, amazingly quick to run :-)
On 7/24/11 4:08 PM, Keith Roberts wrote:
I've not followed the whole thread, but what about a SQL database index of the actual picture files, giving the path into the directory structure? Would that work?
You introduce new issues that way, where the name in the database can't be managed atomically with the name in the directory. Consider what might happen with concurrent operations trying to add different files with the same name, or perhaps an add and a delete at the same time.
And it still doesn't help with the real problem unless you do something to break up the large directory. Unix-like filesystems guarantee atomic operations in filename manipulation, so every time you try to create a file, the system must check that the name does not already exist, find an empty slot for the name and insert it with the directory locked against other changes until that is complete. Filesystems that index directories can help with the lookup, with the tradeoff that additions require an index update.
On Sun, 2011-07-24 at 16:33 -0400, R P Herrold wrote:
One does not do that because then one is counting on the end user's data to conform to, and to continue to conform to, your expectations [here you have added an invisible constraint of 'pix' as the first part of the file name which you are hoping remains constant -- it will not, as a survey of naming schemes used by digital camera makers will reveal]. Your explicit constraint of a monotonically increasing image number is also not likely to be realized in a world where people will erase, or for other reasons not submit, all of a given photo shoot
I did begin with 'IF' :-)
Photo-shoot or whatever, using the 'rename' command means pictures can adopt a uniform numbering system. There is no logical or genuine practical reason to accept a disorganised mess.
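For example, with the util-linux rename shipped on CentOS (the file names here are made up):

rename DSC_ pix0 DSC_*.jpg      # DSC_0123.jpg -> pix00123.jpg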
I have about 21,000+ pictures - all my own work. I can find and display any of them within about 17 seconds (just timed myself) using basic operating system commands. (My database application is unfinished).
On Sunday 24 July 2011 10:13:30 R P Herrold wrote:
#!/bin/sh
#
CANDIDATES="pix00001.jpg pix00002.jpg pix00003.jpg"
for i in `echo "${CANDIDATES}"`; do
        HASH=`echo "$i" | md5sum - | awk {'print $1'}`
        echo "$i ${HASH}"
done
I know it absolutely has nothing to do with databases or files in folders, but as we are talking about optimizing:
#!/bin/bash
CANDIDATES=(pix00001.jpg pix00002.jpg pix00003.jpg)
for i in "${CANDIDATES[@]}"; do
        MD5SUM=$(md5sum <(echo $i))
        echo "$i ${MD5SUM% *}"
done
It's more than twice as fast as the previous sh script.
[ willing to learn mode, feel free to ignore this]
Anyway, about the hashes and directories and so on... I assume we'd need a hash table in our application, right?
Would we proceed as follows (correct me if I'm wrong please)?
1- md5sum the file we need
2- look for the first letter of the hash
3- get into the directory
4- now we look for our file
Is this right? I understand this would improve the searching of files when there are a lot of them.
Thanks to anyone that replies, and sorry for the off-topic.
Regards,
Marc Deop
On Mon, 25 Jul 2011, Marc Deop wrote:
It's more than twice as fast as the previous sh script.
In part this is /bin/sh vs /bin/bash, and whether 'bashisms' are used matters, but yes, I did not seek to optimize a teaching throwaway
1- md5sum the file we need
... actually the NAME of the file, to make it explicit we are not looking at content [also a reasonable approach if one is looking to find and de-duplicate a filestore]
2- look for the first letter of the hash
... actually this may be more than a single letter of the hash -- with ca 3000 files and 16 possible leading hash characters (0-9, a-f), we should end up with about 200 files per subdirectory. The filesystem should be doing some sort of index as well -- as I recall, a B-tree in the case of extN, but I've not expressly looked. The php case was mentioned, however, and its directory searching is less optimal
We have a customer with a similar problem with a naively written set of home-brewed PHP code, and are helping them work through similar issues
3- get into the directory
4- now we look for our file
... this is probably a single operation to suck the sub-directory listing into an array in php, and use an associative match
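In shell terms, the lookup boils down to something like this (a sketch; the uploads/ path and the single-character bins are carried over from earlier in the thread as assumptions):

#!/bin/sh
# given a file name, compute its bin and check whether the file is there
NAME="pix00002.jpg"
BASE=./uploads
HASH=`echo "$NAME" | md5sum - | awk '{print $1}'`
BIN=`echo "$HASH" | cut -c1`
if [ -f "$BASE/$BIN/$NAME" ]; then
        echo "found: $BASE/$BIN/$NAME"
else
        echo "not found"
fi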
but you are right, we are moving increasingly away from a CentOS issue to a more general coding style issue
-- Russ herrold
Greetings,
On Sun, Jul 24, 2011 at 2:59 PM, yonatan pingle yonatan.pingle@gmail.com wrote:
Hello, after looking into the website folders, I found one folder which, from my point of view, is one of the causes of the server load.
hmm... does mounting <dir> with -o noatime,nodiratime help speed it up?
On Mon, Jul 25, 2011 at 06:38:33AM +0530, Rajagopal Swaminathan wrote:
hmm... does mounting <dir> with -o noatime,nodiratime help speed it up?
Just an FYI:
noatime is a superset that includes nodiratime.
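For a quick test without touching /etc/fstab, something along these lines should work (the mount point is just an example; make it permanent in fstab only once you are happy with the result):

mount -o remount,noatime /home      # remount the filesystem holding the uploads
mount | grep /home                  # verify the new options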
John
On Sunday, July 24, 2011 05:29:23 AM yonatan pingle wrote: ...
Lately the server has been under-performing: load averages are high, the MySQL service keeps crashing, and the server is hitting max memory usage (so I added RAM...). After looking into the website folders, I found one folder which, from my point of view, is one of the causes of the server load.
...
uploads]# ls | wc -l
3123
...
Pros vs. cons of having a large number of small files in the same folder on Linux/CentOS?
3,123 files is not a large number. From a CentOS 4 file server here.....
[root@pachyderm sky_data]# ls | wc -l
13526
[root@pachyderm sky_data]# cd ../motse
[root@pachyderm motse]# ls | wc -l
28218
[root@pachyderm motse]# cd
[root@pachyderm ~]# du -s /var/lib/pgsql
556420596       /var/lib/pgsql
[root@pachyderm ~]#
(Yeah, 556GB in PostgreSQL....) Pachyderm = 'The elephant never forgets....' But I'm not looking forward to converting it to a post-C4 PostgreSQL....
Performance on this box is pretty good, all things considered.
Large log files, I have found, can be performance problems; check to make sure log files are being rotated properly.
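As a sketch, a logrotate stanza along these lines keeps the MySQL slow log from growing without bound (the log path is assumed from earlier in the thread, and flush-logs needs credentials, e.g. in /root/.my.cnf):

cat > /etc/logrotate.d/mysql-slow <<'EOF'
/var/lib/mysql/log-slow-queries.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
    postrotate
        /usr/bin/mysqladmin flush-logs >/dev/null 2>&1 || true
    endscript
}
EOF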
There are some specific MySQL tuning documents out there; I seem to remember a posting on a local LUG list about some serious MySQL performance issues that took a long time to ferret out, but I can't seem to find it quickly.....