 
            On Fri, 2005-07-08 at 19:33 -0400, Peter Arremann wrote:
Bruno, you can make benchmarks say whatever you want - and this one seems to be one of the worst distortions I've seen in a while. I know from personal experience that this isn't even close. A dual Xeon 2.0GHz/4GB is about the same performance as a quad 450MHz E4500/4GB in everything we're doing...
For what? That's the question.
The TPC-C benchmarks (the only official, impartial benchmarks I could find) are even worse: a dual 3.6GHz scored 63464; a 14-way 464MHz E4500 scored 67103.
But there are over a _dozen_ TPC-C benchmarks -- some scale linearly on clusters, others prefer shared memory systems, some really tax the interconnect and do much better on NUMA+"true systems" interconnects.
The further you go toward the latter, the more the P4 MCH "sucks." ;->
Oracle has a Word doc on their website talking about performance of an E4500 vs a Dell box (http://download-west.oracle.com/owsf_2003/Oracleworld2003.doc) where a quad Xeon beats a quad E4500 by about 60% - and that was a quad 700MHz P3 based Xeon...
Which has the ALU/FPU equivalent of a 1.1-1.5GHz P4. Now the MCH of a P4 is certainly better, but it is still not a NUMA/"true system" interconnect. Which is why P3/P4 does _not_ scale well beyond 2 CPUs -- heck, the P4 is really no better than the P3: it has somewhat more throughput, but no less contention (actually more in many cases).
Which is why I still recommend dual-P3 servers today, especially refurbs, for the cost.
But here's the kicker ... "Simultaneously running 8 queries"
At that point, I'm _not_ taxing the interconnect at all. So there's no advantage to the NUMA/UPA architecture in that test.
Go search google for Xeon and E4500 and you'll see tons more of these benchmarks - and they all tell the same story...
Of course, because they don't do a _full_ suite.
Compaq-Microsoft came out with a benchmark of the full TPC-C suite a while back showing how Windows clusters beat a Sun shared memory system. What they didn't highlight, and what you'd only see if you read the entire article, is how the PC got _roasted_ -- by up to 10x -- on 4 of the 15 tests, because those tests really stressed the interconnect and the memory-access contention between threads.