Another question some of you may help me make decisions on. I've been doing some reading that indicates that having HT turned on in this dual xeon machine might actually slow down the computing process rather than speeding it up. I rebooted this a.m., and turned HT off, just prior to my main application run. One thing that might be of note, this application is using OMP for utilizing both cpu's, and prior to turning off HT, I had been running the software using 2 cpu's of the 4 that the OS sees. I'm waiting on a model run to finish to see if there is any appreciable difference, but the one thing I do notice right off is cpu utilization is running close to 100% on both, where before, it averaged maybe 50% or thereabouts. Again, sar is showing at last count, 83.56% utilization for user, 10.27% system and only 0.02% nice. Idle was 5.74%.
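For reference, here's roughly how I check what the OS sees and how the run gets kicked off; ./model and the thread count below are just stand-ins for however the model actually gets launched here:

    # logical processors the kernel sees (4 with HT on, 2 with it off)
    grep -c '^processor' /proc/cpuinfo
    # physical packages (should stay 2 either way)
    grep 'physical id' /proc/cpuinfo | sort -u | wc -l

    # tell the OpenMP runtime to use one thread per physical CPU
    export OMP_NUM_THREADS=2
    ./model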
I'm attempting to squeeze every last bit of processing power out of this machine, and would entertain suggestions on tuning if there happen to be any types of tuning that would help.
Sam
On 6/7/06, Sam Drinkard sam@wa4phy.net wrote:
Another question some of you may help me make decisions on. I've been doing some reading that indicates that having HT turned on in this dual xeon machine might actually slow down the computing process rather than speeding it up. I rebooted this a.m., and turned HT off, just prior to my main application run. One thing that might be of note, this application is using OMP for utilizing both cpu's, and prior to turning off HT, I had been running the software using 2 cpu's of the 4 that the OS sees. I'm waiting on a model run to finish to see if there is any appreciable difference, but the one thing I do notice right off is cpu utilization is running close to 100% on both, where before, it averaged maybe 50% or thereabouts. Again, sar is showing at last count, 83.56% utilization for user, 10.27% system and only 0.02% nice. Idle was 5.74%.
It's entirely possible that your system is lying about load when moving back and forth between HT and real SMP. Logical CPUs (HT) are nearly identical to physical CPUs as far as the operating system is concerned. Since you have half the number of CPUs with HT turned off, but you're still running the same number of jobs, the load should be higher. Hopefully this page will explain that a little bit better. http://www.cognitive-dissonance.org/wiki/Load+Average
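If you want to see what each processor is actually doing rather than the blended load number, mpstat from the sysstat package (same place sar comes from) breaks it out per CPU, something like:

    mpstat -P ALL 5      # per-CPU utilization, one sample every 5 seconds
    cat /proc/loadavg    # the raw 1/5/15 minute load averages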
Additionally as far as HT performance is concerned, I've only really found two pages that help, although the IBM load is a bit older and may not be accurate anymore.
http://perfcap.blogspot.com/2005/05/performance-monitoring-with.html http://www-128.ibm.com/developerworks/linux/library/l-htl/
I'm attempting to squeeze every last bit of processing power out of this machine, and would entertain suggestions on tuning if there happen to be any types of tuning that would help.
For performance tuning, I usually start with the filesystem tweaks: mounting ext3 with noatime, and changing the journal commit time from the default 5 seconds to 30.
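In /etc/fstab that ends up looking something like the line below; the device and mount point are just placeholders for whatever partition your data lives on:

    # skip atime updates, flush the ext3 journal every 30s instead of the default 5s
    /dev/sda3   /data   ext3   defaults,noatime,commit=30   1 2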
After that I move to /etc/sysctl.conf and tweak the kernel.shmmax, shmmin, shmall, and vdso values depending on the application I'm most concerned about, as well as fs.file-max.
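A couple of those as an example; they just go in /etc/sysctl.conf and get loaded with 'sysctl -p'. The numbers below are made-up placeholders, and the right values depend entirely on the application:

    # /etc/sysctl.conf -- example values only, size these for your app
    kernel.shmmax = 2147483648    # largest single shared memory segment, in bytes
    kernel.shmall = 2097152       # total shared memory allowed, in pages
    fs.file-max = 65536           # system-wide limit on open file handles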
From there I move to the I/O scheduler/elevator for the system. RH magazine had a decent article about this. http://www.redhat.com/magazine/008jun05/features/schedulers/
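Newer 2.6 kernels let you check, and sometimes switch, the elevator per disk through sysfs; if yours doesn't, it's the elevator= parameter on the kernel line in grub.conf and a reboot. sda below is just an example device:

    cat /sys/block/sda/queue/scheduler              # the active one shows in brackets
    echo deadline > /sys/block/sda/queue/scheduler  # runtime switch, if the kernel supports it
    # or, in grub.conf on the kernel line:  elevator=deadline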
Jim Perrin wrote:
<snip>
Thanks Jim. Those articles were enlightening to say the least. Time for more study on things!
Jim Perrin wrote:
I'm attempting to squeeze every last bit of processing power out of this machine, and would entertain suggestions on tuning if there happen to be any types of tuning that would help.
For performance tuning, I usually start with the filesystem tweaks: mounting ext3 with noatime, and changing the journal commit time from the default 5 seconds to 30.
Jim,
Would you mount all filesystems noatime, or just those that would have any effect on operation? I'm almost positive there are things in one directory that are accessed very often, as well as written to every few seconds. After reading about noatime, I'm not so positive it would help, although I'm willing to give it a try. Here's what I was reading from... http://www.faqs.org/docs/securing/chap6sec73.html
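I suppose I can at least watch whether the access times on those files actually change during a run before deciding; something like the following, with the path being whichever data file gets hit the most:

    ls -lu /path/to/datafile    # -u shows the access time instead of mtime
    stat /path/to/datafile      # Access/Modify/Change timestamps in one shot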
After that I move to /etc/sysctl.conf and tweak the kernel.shmmax, shmmin, shmall, and vdso values depending on the application I'm most concerned about, as well as fs.file-max.
Any hints on what to look for to determine the best settings?
From there I move to the I/O scheduler/elevator for the system. RH magazine had a decent article about this. http://www.redhat.com/magazine/008jun05/features/schedulers/
According to that article, I might be better off staying with CFQ, but since this is a number-crunching thing, I really don't know what to look for, or where to look to see what is really going on behind the monitor. What I'm trying to say is that I'm not savvy enough to know what the numerical model really does in the background. Any suggestions, or could I provide more info that might help you help me?
Thanks....
Sam
Would you mount all filesystems noatime, or just those that would have any effect on operation?
No. I usually only do this on partitions I share out over smb/cifs where I don't trust windows access timestamps anyway, or on filesystems where I don't care when the last time a file was accessed was (webroots, stuff like that). Mounting with noatime simply tells the filesystem not to update data/metadata to account for file access times. This saves a few cycles here and there, and can add up depending on how often something gets hit.
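It's also cheap to test, since you can flip it on a mounted filesystem without a reboot and put it back if it makes no difference (the mount point is just an example):

    mount -o remount,noatime /data    # try it for one model run
    grep /data /proc/mounts           # confirm the option took
    mount -o remount,atime /data      # put it back afterward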
After that I move to /etc/sysctl.conf and tweak the kernel.shmmax, shmmin, shmall, and vdso values depending on the application I'm most concerned about, as well as fs.file-max.
Any hints on what to look for to determine the best settings?
Not really. Most of the apps we use here (oracle and netegrity stuff) have tuning recommendations in the documentation or associated kb/forums so I benchmark, change stuff to meet their recommendations, benchmark again, and use whichever tests better. It's mostly a trial and error kinda thing.
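Nothing fancy is needed for the benchmark step either; for a batch job like yours, timing the same run before and after each change usually tells you whether a tweak helped (run_model.sh is a placeholder for however you kick the model off):

    time ./run_model.sh > /dev/null   # note real/user/sys, repeat a few times, compare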
According to that article, I might be better off staying with CFQ, but since this is a number-crunching thing, I really don't know what to look for, or where to look to see what is really going on behind the monitor. What I'm trying to say is that I'm not savvy enough to know what the numerical model really does in the background. Any suggestions, or could I provide more info that might help you help me?
Yeah, much of what I do here is webhosting/filehosting type stuff, so for me disk tuning is my performance limiter. I was hoping some other individuals would offer other performance tips after noticing that my advice was very heavily disk oriented. I'd be interested in hearing what other people are doing for performance tuning, but it seems like the vast majority of people simply leave the defaults and don't mess with things.
Jim Perrin wrote:
<snip>
After that I move to /etc/sysctl.conf and tweak the kernel.shmmax, shmmin, shmall, and vdso values depending on the application I'm most concerned about, as well as fs.file-max.
Any hints on what to look for to determine the best settings?
Not really. Most of the apps we use here (oracle and netegrity stuff) have tuning recommendations in the documentation or associated kb/forums so I benchmark, change stuff to meet their recommendations, benchmark again, and use whichever tests better. It's mostly a trial and error kinda thing.
I suppose, without having extensive knowledge about how the model behaves internally, or for that matter, writing to disk, it really would be trial and error. There are tools, but I'm not experienced enough to know what they are really telling me either. I know the model does a lot of reads from data, but don't know if that data is cached on read, or if the whole file is read in at once, then the numbers are crunched and written or what.
According to that article, I might be better off staying with CFQ, but since this is a number-crunching thing, I really don't know what to look for, or where to look to see what is really going on behind the monitor. What I'm trying to say is that I'm not savvy enough to know what the numerical model really does in the background. Any suggestions, or could I provide more info that might help you help me?
Yeah, much of what I do here is webhosting/filehosting type stuff, so for me disk tuning is my performance limiter. I was hoping some other individuals would offer other performance tips after noticing that my advice was very heavily disk oriented. I'd be interested in hearing what other people are doing for performance tuning, but it seems like the vast majority of people simply leave the defaults and don't mess with things.
I've posted some questions to the WRF-model.org website so maybe one of the developers or someone with extensive software background info will respond. I know a lot of the weather oriented people who run the model have little to no clue about unix or linux in general, just based on the questions I see posted.
Sam
On Thu, 2006-06-08 at 11:33 -0400, Sam Drinkard wrote:
I suppose, without having extensive knowledge about how the model behaves internally, or for that matter, writing to disk, it really would be trial and error. There are tools, but I'm not experienced enough to know what they are really telling me either. I know the model does a lot of reads from data, but don't know if that data is cached on read, or if the whole file is read in at once, then the numbers are crunched and written or what.
vmstat and iostat will give a reasonable idea of what is happening swap- and disk-wise if you watch them with a 1 or 5 second interval while the app runs. Note that disk writes are normally queued and, especially for SCSI, have little CPU overhead. Reads often mean that the CPU is waiting. Depending on whether the same thing is read more than once or not, you might be able to speed this up with more RAM to cache it. But if sar is reporting near 100% CPU use for most of the run, there probably isn't much you can do.
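Concretely, something like this in another terminal while the model runs; 5-second samples are plenty, and the -x flag needs a reasonably recent sysstat:

    vmstat 5       # si/so columns show swapping, bi/bo show disk traffic
    iostat -x 5    # per-device throughput and utilization
    free -m        # how much RAM is left over for the page cache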
I've posted some questions to the WRF-model.org website so maybe one of the developers or someone with extensive software background info will respond. I know a lot of the weather oriented people who run the model have little to no clue about unix or linux in general, just based on the questions I see posted.
A different algorithm, or something like pre-sorting the data might make a huge difference.
On Thu, 2006-06-08 at 11:32 -0500, Les Mikesell wrote:
On Thu, 2006-06-08 at 11:33 -0400, Sam Drinkard wrote:
<snip>
A different algorithm, or something like pre-sorting the data might make a huge difference.
That can make a huge difference. My post mentioning the order of data and keys is the "live" equivalent of exactly what Les mentions. Taken a step further (if this is what you mean Les, sorry for repeating), it may be possible to sequentially pre-read and sort (or pre-read in indexed order) and feed that to the application *if* it has the ability to take data from a sequential file or stream. *BIG* gains possible that way.
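As a sketch only, and assuming the program can take its input from a pre-built file or a stream at all (it may well not), the pre-sort is nothing more exotic than:

    # sort the raw input on whatever key the model reads by (file names and key are hypothetical)
    sort -t, -k1,1 raw_obs.dat > obs_by_key.dat
    ./model < obs_by_key.dat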
Further, if you have access to the source, or have tunable parameters available (like the size of each read call) and can achieve very large reads of sequential data, system overhead (context switches, ...) is greatly reduced. What split does sar show between user and system time? Might be a lot to gain there too.
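A quick way to get that split over the course of a whole run, using sar from the sysstat package (the interval and count here are arbitrary):

    sar -u 60 120    # %user / %system / %iowait / %idle, one line per minute for two hours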
HTH
William L. Maltby wrote:
A different algorithm, or something like pre-sorting the data might make a huge difference.
That can make a huge difference. My post mentioning the order of data and keys is the "live" equivalent of exactly what Les mentions. Taken a step further (if this is what you mean Les, sorry for repeating), it may be possible to sequentially pre-read and sort (or pre-read in indexed order) and feed that to the application *if* it has the ability to take data from a sequential file or stream. *BIG* gains possible that way.
I don't think there is much I can do to modify how the data is read in. AFAIK, it's all based on time data.
Further, if you have access to the source, or have tunable parameters available (like the size of each read call) and can achieve very large reads of sequential data, system overhead (context switches, ...) is greatly reduced. What split does sar show between user and system time? Might be a lot to gain there too.
Sar reports 85% user and about 9% system, with a current iowait of 0.27%, so there is not much waiting on I/O.
Les Mikesell wrote:
vmstat and iostat will give a reasonable idea of what is happening swap- and disk-wise if you watch them with a 1 or 5 second interval while the app runs. Note that disk writes are normally queued and, especially for SCSI, have little CPU overhead. Reads often mean that the CPU is waiting. Depending on whether the same thing is read more than once or not, you might be able to speed this up with more RAM to cache it. But if sar is reporting near 100% CPU use for most of the run, there probably isn't much you can do.
According to iostat, there does not appear to be any big problem with disk access or wait. In fact, wait is 0. The largest block writes I've seen are around 6,000/sec (30,400 blocks written in 5 seconds). Sar is currently reporting zero swap, and from what I've been watching, it pretty much stays that way until 0420, when updatedb and the other cron jobs kick off. The machine stays busy about 20 hours out of 24, so it's pretty easy to monitor what's going on. I just wish I knew more about the internal workings of the model and how it does things. I doubt the source code would help me much, as I'm not really up to speed on Fortran.
A different algorithm, or something like pre-sorting the data might make a huge difference.
I don't know if there is anything that could be done in that area. The software reads in files, I'd assume, sequentially, as that is how the initialization data is done.. all based on time.
On Thu, 2006-06-08 at 11:33 -0400, Sam Drinkard wrote:
Jim Perrin wrote:
<snip>
I suppose, without having extensive knowledge about how the model behaves internally, or for that matter, writing to disk, it really would be trial and error. There are tools, but I'm not experienced enough to know what they are really telling me either. I know the model does a lot of reads from data, but don't know if that data is cached on read, or if the whole file is read in at once, then the numbers are crunched and written or what.
<snip>
Knowledge = power, so pursue that. In the meantime, some things that *used* to be good "Rules of Thumb" (if not sitting on it) that you might be alert for as you investigate. Unfortunately, some would demand that you have a test bed to be sure you can 1) recover if necessary and 2) see if it really works without killing the end-user attitude and (potentially) your future (although anyone from the 8088 days... ;-)
1) Many DBMS claim (and rightfully so) big performance gains if they are on a "raw partition" rather than residing in a file system. If it's a whole disk, you won't even need to partition it; Linux and at least one other *IX support operations without such bothersome things.
2) If data is read predominantly sequentially and you are in a FS, make a huge block size when you make the file system (see the mke2fs sketch after this list). This has more effect if the number crunching is input-data-intensive. Concomitantly, HDs with large cache will contribute substantially to reduced wait time. As you might surmise, all forms of cache are a total waste if the data (whether key or base) is totally wrongly sequenced.
3) Ditto if random reads, but they tend to be heavily grouped in consecutive key ranges. In order for this to be effective, data should occur in most frequently accessed order and, optimally, the index for that sequence should be smack in the middle of the data blocks (i.e. first on disk is appx 50% of the data, then the index with some slack for growth and then the rest of the data). Better is the index on another disk on another IDE (or whatever) channel. Can everybody say "Bus Mastering"? It's hard to keep things organized this way on a single disk, but since you're doing a batch operation, maybe you can make a copy as needed (or there is an HD backup frequently made?) and operate on that.
Anecdotally demonstrating the effectiveness of the statements about ordering matching application, in 1988(?) a n00b admin on Oracle couldn't understand why my app took 1:15 minutes to gen a screen. I *knew* it wasn't my program, I'm a performance enthusiast. We talked a bit, I told him to reload his DB in a certain sequence.
Result: full screen in about 7 seconds.
To maintain that performance does require periodic reload/reorg.
4) Defrag the appropriate parts occasionally. Whether a file system or raw partition, stuff must occasionally go into "overflow areas". Unless your DBMS reorgs/defrags on-the-fly (bad for user response complaints), the easiest/fastest/cheapest way is cross-HD copies/reloads. After each cross-HD, scripts direct the apps to the right disk (on start up check a timestamp and file/part name gened by the reorg process). This only works if you have a quiescent time.
5) As Jim (IIRC) mentioned, avoiding access time updates can provide gains. If you are fortunate enough that no writes need to be done to the whole partition while you are crunching the chunks (if the partition has only the DB; if not, consider making it so, Mr. Spock), remount the partition read-only (mount -o remount,ro <partition-or-label-or-even-mount-point>) for the duration of the run. This benefits in some VM/kernel-internal ways, to a very small degree too.
There's more, but some may apply to only specific scenarios.
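Re 2) above: on ext2/ext3 the block size is fixed at mkfs time and capped at the page size (4 KB on i386/x86_64), so "huge" mostly means making sure you're at 4K rather than 1K, plus a sparser inode ratio if the files are big. A sketch only; the device name is a placeholder and this wipes the filesystem, so back up first:

    mke2fs -j -b 4096 -T largefile4 /dev/sdb1   # ext3, 4K blocks, one inode per 4 MB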
HTH
William L. Maltby wrote:
On Thu, 2006-06-08 at 11:33 -0400, Sam Drinkard wrote:
Jim Perrin wrote:
<snip>
Knowledge = power, so pursue that. In the meantime, some things that *used* to be good "Rules of Thumb" (if not sitting on it) that you might be alert for as you investigate. Unfortunately, some would demand that you have a test bed to be sure you can 1) recover if necessary and 2) see if it really works without killing the end-user attitude and (potentially) your future (although anyone from the 8088 days... ;-)
Definitely...
- Many DBMS claim (and rightfully so) big performance gains if they are on a "raw partition" rather than residing in a file system. If it's a whole disk, you won't even need to partition it; Linux and at least one other *IX support operations without such bothersome things.
- If data is read predominantly sequentially and you are in a FS, make a huge block size when you make the file system. This has more effect if the number crunching is input-data-intensive. Concomitantly, HDs with large cache will contribute substantially to reduced wait time. As you might surmise, all forms of cache are a total waste if the data (whether key or base) is totally wrongly sequenced.
Block size might be an option for sure. The machine is at the defaults right now, but it would not be difficult to back everything up, remake the filesystem with a larger block size, then restore. Unfortunately, this is nothing like a database, at least from what I know about how it works, so I doubt the DBMS-type stuff would apply.
- Ditto if random reads, but they tend to be heavily grouped in consecutive key ranges. In order for this to be effective, data should occur in most frequently accessed order and, optimally, the index for that sequence should be smack in the middle of the data blocks (i.e. first on disk is appx 50% of the data, then the index with some slack for growth and then the rest of the data). Better is the index on another disk on another IDE (or whatever) channel. Can everybody say "Bus Mastering"? It's hard to keep things organized this way on a single disk, but since you're doing a batch operation, maybe you can make a copy as needed (or there is an HD backup frequently made?) and operate on that.
There again, it would take more knowledge about the model than I have available to know what goes on.
Anecdotally demonstrating the effectiveness of the statements about ordering matching application, in 1988(?) a n00b admin on Oracle couldn't understand why my app took 1:15 minutes to gen a screen. I *knew* it wasn't my program, I'm a performance enthusiast. We talked a bit, I told him to reload his DB in a certain sequence.
Result: full screen in about 7 seconds.
To maintain that performance does require periodic reload/reorg.
- Defrag the appropriate parts occasionally. Whether a file system or raw partition, stuff must occasionally go into "overflow areas". Unless your DBMS reorgs/defrags on-the-fly (bad for user response complaints), the easiest/fastest/cheapest way is cross-HD copies/reloads. After each cross-HD, scripts direct the apps to the right disk (on start up check a timestamp and file/part name gened by the reorg process). This only works if you have a quiescent time.
Not a lot of free time on here. As said before, about 4 hours of idle time per 24 hours.
- As Jim (IIRC) mentioned, avoiding access time updates can provide gains. If you are fortunate enough that no writes need to be done to the whole partition while you are crunching the chunks (if the partition has only the DB; if not, consider making it so, Mr. Spock), remount the partition read-only (mount -o remount,ro <partition-or-label-or-even-mount-point>) for the duration of the run. This benefits in some VM/kernel-internal ways, to a very small degree too.
Have implemented the noatime, and should know something more about whether it helps in about 2 hours when this run completes.
There's more, but some may apply to only specific scenarios.
Yep.. maybe some of the folks at the WRF will respond to the tuning stuff too, or give me some more in depth info about what takes place.
Sam