We have several latency-sensitive "pipeline"-style programs that show a measurable performance degradation when run on CentOS 5.x versus CentOS 4.x.
By "pipeline" program, I mean one that has multiple threads. The mutiple threads work on shared data. Between each thread, there is a queue. So thread A gets data, pushes into Qab, thread B pulls from Qab, does some processing, then pushes into Qbc, thread C pulls from Qbc, etc. The initial data is from the network (generated by a 3rd party).
We basically measure the time from when the data is received to when the last thread performs its task. In our application, we see an increase of anywhere from 20 to 50 microseconds when moving from CentOS 4 to CentOS 5.
I have used a few methods of profiling our application, and determined that the added latency on CentOS 5 comes from queue operations (in particular, popping).
However, I can improve performance on CentOS 5 (to be the same as CentOS 4) by using taskset to bind the program to a subset of the available cores.
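(For reference, the invocation is just something like "taskset -c 0-3 ./pipeline_prog", with the core list picked by hand. The same pinning can be done from inside the program via sched_setaffinity(); a minimal sketch, with illustrative core numbers:)

    /* Pin the whole process to a subset of cores from inside the program --
     * roughly what taskset does externally.  Core numbers are illustrative. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_cores(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        CPU_SET(1, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = self */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }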
So it appears to me that, between CentOS 4 and 5, there was some change (presumably to the kernel) that causes threads to be scheduled differently (and this difference is suboptimal for our application).
While I can "solve" this problem with taskset, my preference is to not have to do this. I'm hoping there's some kind of kernel tunable (or maybe collection of tunables) whose default was changed between versions.
Anyone have any experience with this? Perhaps some more areas to investigate?
Thanks, Matt
On Fri, 20 May 2011, Matt Garman wrote:
We do processing similar to this with financial market data streams. You do not say, but I assume you are blocking on a select() rather than polling (polling is bad here). You also do not say whether all threads are owned by a common process; if not, then, modulo the added complexity of debugging threading, you may want to arrange that.
I say this because, in our testing (both with everything housed in a single process and with co-processes fed through an anonymous pipe), we occasionally get hit with a context or process switch, which messes up the latencies something fierce. An 'at' or 'cron' job firing off can ruin the day as well.
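A quick way to see whether that is biting you is to sample the context-switch counters around the measured section with getrusage(); a rough sketch, not a complete harness:

    /* Report voluntary / involuntary context switches since the last call.
     * Rough diagnostic only; call it before and after a measured section. */
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdio.h>

    void report_ctx_switches(void)
    {
        static struct rusage prev;
        struct rusage now;
        getrusage(RUSAGE_SELF, &now);
        printf("voluntary: %ld  involuntary: %ld\n",
               now.ru_nvcsw  - prev.ru_nvcsw,
               now.ru_nivcsw - prev.ru_nivcsw);
        prev = now;
    }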
Also, system calls are to be avoided, as the timing of when (and if, and in what order) one's process gets the CPU back is not something controllable from userspace.
Average latencies are not so meaningful here. Collecting all of the dispatch and return data and explaining the outliers is probably a good place to continue after addressing the foregoing. Graphviz and gnuplot are lovely for doing this kind of visualization.
-- Russ herrold
I would like to confirm Matt's claim. I too have experienced larger latencies with CentOS 5.x compared to 4.x. My application is very network-sensitive, and it's easy to prove using lat_tcp.
Russ, I am curious about identifying the problem. What tools do you recommend to find where the latency is coming from in the application?
On Mon, 23 May 2011, Mag Gam wrote:
I went through the obvious candidates:

system calls (loss of control over when, if ever, the scheduler decides to let your process run again)

polling v. select -- polling is almost always the wrong approach when latency reduction is in play (reading and understanding man 2 select_tut is time very well spent)

choice of implementation language -- the issue here being that if one uses a scripting language, one cannot 'see' the time leaks
Doing metrics permits 'hot spot' analysis and moves the coding from guesstimation to software engineering. We use Graphviz and gnuplot on plain-text 'CSV-style' timings files to 'see' outliers and hotspots.
Knuth's admonition about premature optimization applies here of course
A sensible process might be: Make it work correctly, THEN make it fast
Some people add a precursor step of "make it compile", but this seems to me a less efficient process than simply proceeding with a clean design from the start and the expedient of 'stubbing' out unimplemented portions, then replacing the stubs with correctly functioning refactorings. (I just did this with part of my build tools, writing a meta-code outline of what I wanted and then implementing the metacode.)
The C++ code of the 'trading-shim' tool (GPLv3+) was produced in just this fashion over the last few years, and it outpaces all the competitors in its class in terms of minimal latency -- most of that competition being Java-based or written in some other scripting language. The 'shim' runs like a scalded dog ;)
-- Russ herrold
On Tue, May 24, 2011 at 02:22:12PM -0400, R P Herrold wrote:
I went through the obvious candidates: system calls (loss of control over when, if ever, the scheduler decides to let your process run again)
This is almost certainly what it is for us. But in this situation, these calls are limited to mutex operations and condition variable signaling.
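Concretely, those are the calls that end up in the kernel: an uncontended pthread mutex stays in userspace, but a condition-variable wait sleeps via futex, and the wake-to-run delay after the signal is entirely the scheduler's call. Below is a sketch of one way to take the scheduler out of the common pop path -- spin briefly before blocking. It reuses the queue_t/node_t definitions from the sketch in my first message and is an illustration of the technique only, not what our code does:

    /* Spin briefly before falling back to the blocking condvar wait, so the
     * consumer usually never sleeps and never pays the futex-wakeup /
     * scheduler latency.  SPIN_ITERS is an arbitrary illustrative value;
     * this burns CPU in exchange for lower wakeup latency. */
    #define SPIN_ITERS 1000

    void *queue_pop_spin(queue_t *q)
    {
        node_t *n;
        void   *data;
        int     i;

        pthread_mutex_lock(&q->lock);
        for (i = 0; q->head == NULL && i < SPIN_ITERS; i++) {
            pthread_mutex_unlock(&q->lock);   /* give the producer a window */
            pthread_mutex_lock(&q->lock);
        }
        while (q->head == NULL)               /* still empty: block for real */
            pthread_cond_wait(&q->not_empty, &q->lock);
        n = q->head;
        q->head = n->next;
        if (q->head == NULL)
            q->tail = NULL;
        pthread_mutex_unlock(&q->lock);
        data = n->data;
        free(n);
        return data;
    }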
polling v. select -- polling is almost always the wrong approach when latency reduction is in play (reading and understanding man 2 select_tut is time very well spent)
We are using select(). However, that is only for the networking part (basically using select() to wait on data from a socket). Here, my concern isn't with network latency---it's with "intra process" latency.
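(The receive side is shaped roughly like this -- a simplified sketch, not the actual code:)

    /* Block in select() until the socket is readable, then read.  'sockfd'
     * is assumed to be an already-connected descriptor; error handling is
     * abbreviated. */
    #include <sys/select.h>
    #include <sys/types.h>
    #include <unistd.h>

    ssize_t wait_and_read(int sockfd, void *buf, size_t len)
    {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(sockfd, &rfds);
        /* NULL timeout: sleep until the kernel marks the socket readable */
        if (select(sockfd + 1, &rfds, NULL, NULL, NULL) <= 0)
            return -1;
        return read(sockfd, buf, len);
    }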
choice of implementation language -- the issue here being that if one uses a scripting language, one cannot 'see' the time leaks
C/C++ here.
Doing metrics permits 'hot spot' analysis and moves the coding from guesstimation to software engineering. We use Graphviz and gnuplot on plain-text 'CSV-style' timings files to 'see' outliers and hotspots.
We're basically doing that. We pre-allocate a huge 2D array for keeping "stopwatch" points throughout the program. Each column represents a different stopwatch point, and each row represents a different iteration through these measured points. After a lot of iterations (usually at least 100k), the numbers are dumped to a file for analysis.
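(Trimmed way down, the instrumentation looks roughly like this; the sizes, names, and clock source are illustrative rather than the real values. On these distros the clock_gettime() call means linking with -lrt.)

    /* Pre-allocated "stopwatch" matrix: rows are iterations, columns are
     * measurement points.  Sizes and the clock source are illustrative. */
    #include <time.h>
    #include <stdio.h>

    #define MAX_ITERS  100000
    #define MAX_POINTS 16

    static struct timespec stamps[MAX_ITERS][MAX_POINTS];

    static void stopwatch(int iter, int point)
    {
        clock_gettime(CLOCK_MONOTONIC, &stamps[iter][point]);
    }

    static void dump_stamps(const char *path, int iters, int points)
    {
        FILE *f = fopen(path, "w");
        int   i, p;
        if (!f)
            return;
        for (i = 0; i < iters; i++)
            for (p = 0; p < points; p++)
                fprintf(f, "%ld.%09ld%s",
                        (long) stamps[i][p].tv_sec,
                        (long) stamps[i][p].tv_nsec,
                        p + 1 < points ? "," : "\n");
        fclose(f);
    }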
Basically, the standard deviation from one iteration to the next is fairly low. It's not like there are a few outliers driving the average intra-process latency up; it's just that, in general, going from point A to point B takes longer with the newer kernels.
For what it's worth, I tried a 2.6.39 mainline kernel (from elrepo), and the intra-process latencies got even worse. It appears that whatever changes are being made to the kernel are bad for our kind of program. I'm trying to figure out, at a conceptual level, what those changes are. I'm looking for an easier way to understand them than reading the kernel source and change history. :)