What is the maximum number of AMD64 cores supported by CentOS 4?
The RedHat page https://www.redhat.com/software/rhel/configuration/
suggests that RHEL 4 AS supports up to 8 AMD64/EM64T logical cpus. Is that accurate and does it apply to CentOS 4?
I see a few vendors offering boxes with 8 (dual core) CPU sockets. IWILL for one.
Thanks,
Tony Schreiner Boston College
On Thursday 08 September 2005 10:49, Tony Schreiner wrote:
What is the maximum number of AMD64 cores supported by CentOS 4?
The RedHat page https://www.redhat.com/software/rhel/configuration/
suggests that RHEL 4 AS supports up to 8 AMD64/EM64T logical cpus. Is that accurate and does it apply to CentOS 4?
I see a few vendors offering boxes with 8 (dual core) CPU sockets. IWILL for one.
The suggestion of 8 was made mostly because there was no larger x86-64 platform available at that time.
Also, the main reason for the limit is scalability. With more cpus comes more communications overhead, more congestion on the bus, less memory bandwidth for each cpu and so on. How much that affects you depends on your application. I remember in the good old days when SMP was first added to the kernel, people said 2 cpus was the max you could have... we ran a few 4-way systems back then very effectively, simply because our application had only a low volume of communications.
Peter.
Peter Arremann loony@loonybin.org wrote:
The suggestion of 8 was made mostly because there was no larger x86-64 platform available at that time.
Opteron 8xx processors are so named because 8-way is the largest configuration in which Opterons, each with 3 HyperTransport links, can be arranged so that no Opteron is more than 2 hops from any other. With more than 8-way, you start to run into excessive hops, and that requires further design considerations in both hardware and software.
I know many vendors are selling "scalable" 4-way Socket-940 boards these days with two 3.2-8.0GBps HyperTransport connectors for daisy chaining mainboards. But the HyperTransport eXpansion (HTX) is now the preferred way to build clusters of 4-way Socket-940 boards, and each system has its own OS. Infiniband over HTX is capable of a "real world" 1.8GBps -- over 100% faster than "real world" performance of Infiniband PCI-X 2.0 cards (typically used in Xeon/Itanium).
Also, the main reason for the limit is scalability. With more cpus comes more communications overhead, more congestion on the bus,
There is _no_ "bus" in Opteron. Yes, Opterons will "share" HyperTransport links when they cannot directly connect to
less memory bandwidth for each cpu and so on.
Okay, this is _misleading_. You're thinking Intel SMP.
Opterons _always_ have 128-bit of DDR (2 channels) per CPU. Opteron uses NUMA (and HyperTransport partial meshes for CPU-I/O). There is _no_ "less memory bandwidth for each cpu". That is a trait of Intel SMP [A]GTL+, not AMD NUMA/HyperTransport.
Yes, if the Opteron has to access memory over on another CPU, then that is a performance issue. If the other CPU is on another mainboard, then yes, contention can happen there.
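To make the local-versus-remote point concrete, here is a minimal sketch of node-local allocation using libnuma (this assumes libnuma is installed and you link with -lnuma; the node number is just an example):

    #include <numa.h>       /* libnuma NUMA policy API */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        if (numa_available() < 0) {       /* kernel or libc built without NUMA support */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }

        int node = 0;                     /* example node; on Opteron, one node per socket */
        numa_run_on_node(node);           /* keep this process on that node's CPUs */

        size_t len = 64 * 1024 * 1024;
        char *buf = numa_alloc_onnode(len, node);  /* memory on that node's own DDR channels */
        if (buf == NULL) {
            fprintf(stderr, "allocation failed\n");
            return 1;
        }
        memset(buf, 0, len);              /* touched locally: no HyperTransport hop needed */
        numa_free(buf, len);
        return 0;
    }

The same memset() against memory sitting on another CPU's node would have to cross a HyperTransport link, which is exactly the remote-access penalty described above.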
I remember in the good old days when SMP was first added to the kernel, people said 2 cpus was the max you could have...
I remember non-Linux/non-PC platforms where MP, not SMP, was used, from true crossbar switches (not "bus hubs") to the partial mesh we now have in the Opteron.
In fact, it's one of the areas where Linux is very immature.
Its logic is still very SMP, and only has NUMA "hints," and does not scale well on a NUMA platform, let alone the partial mesh of the Opteron 800's 2xDDR/3xHyperTransport _per_ CPU. Especially when it comes to processor affinity for I/O. It's a crapload better than NT, but not better than many UNIX implementations.
Sun's support of the Opteron then became a no-brainer. They could deliver a partial-mesh platform at a commodity cost.
we ran a few 4-way systems back then very effectively, simply because our application had only a low volume of communications.
But you're still accessing memory. I assume it was an Intel SMP solution, and therefore had the memory access limitations you describe.
These are still wholly _inapplicable_ to Opteron if you have an application and operating system that are effective at processor affinity for processes. And when it comes to communication, processor affinity for I/O can do wonders -- but _only_ on Opteron, not even proprietary Xeon MP / Itanium systems (because they are still "Front Side Bottleneck" designs).
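To illustrate the process-affinity half of that, here is a rough sketch on a 2.6 kernel using sched_setaffinity(2); the CPU number is a made-up example, and IRQ affinity via /proc/irq/*/smp_affinity is a separate, equally important knob:

    #define _GNU_SOURCE
    #include <sched.h>      /* sched_setaffinity(), cpu_set_t, CPU_SET */
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;
        CPU_ZERO(&mask);
        CPU_SET(2, &mask);  /* example only: pin to CPU 2, say the socket closest to the NIC/HBA */

        /* pid 0 means the calling process */
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
            perror("sched_setaffinity");
            return 1;
        }

        /* From here on the scheduler keeps the process on that CPU,
           so it stays next to that socket's memory controller. */
        return 0;
    }
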
I understand what you're trying to say. But it's not applicable to Opteron in the least.
On Thu, 8 Sep 2005 at 11:33am, Bryan J. Smith wrote
Peter Arremann loony@loonybin.org wrote:
less memory bandwidth for each cpu and so on.
Okay, this is _misleading_. You're thinking Intel SMP.
Opterons _always_ have 128-bit of DDR (2 channels) per CPU. Opteron uses NUMA (and HyperTransport partial meshes for CPU-I/O). There is _no_ "less memory bandwidth for each cpu". That is a trait of Intel SMP [A]GTL+, not AMD NUMA/HyperTransport.
Yes, if the Opteron has to access memory over on another CPU, then that is a performance issue. If the other CPU is on another mainboard, then yes, contention can happen there.
One has to be even more careful with terminology these days. You can see less memory bandwidth per *core* with dual core Opterons. But, as you point out, each CPU (socket -- what should we call it?) has, essentially, its own bank of memory.
Joshua Baker-LePain jlb17@duke.edu wrote:
One has to be even more careful with terminology these days. You can see less memory bandwidth per *core* with dual core Opterons.
Correct.
But, as you point out, each CPU (socket -- what should we call it?)
Yes, that's what I try to do. S[ocket]940.
has, essentially, its own bank** of memory.
S754 has one (1), glueless, 184-trace 64-bit DDR channel**. S939/940 has two (2), glueless, 184-trace 64-bit DDR channels.**
_All_ other PC sockets do not have memory channels. They have a "front side bus" (FSB) to a bridge. Many of these lines are multiplexed with others.
All Intel GTL platforms bridge into a "hub" all components share -- hence Memory Controller Hub (MCH). Intel S370 and 478 (not to be confused with S423, which uses Rambus, long story) and others have logically (data-wise, which may be muxed with address, control, etc...) _only_ one (1) 64-bit memory channel. Any "dual channel" marketing is an interleaving hack done at the MCH. Intel S603, 604 and 775 logically have two (2) 64-bit memory channels. Intel sometimes "widens" the GTL logic at the MCH for 4-way servers, although that requires additional support.
Digital EV6 platforms have ports (up to 16) into a "crossbar switch." The logical front-side bus is also 64-bit, including S462. Again, any claims of "dual DDR" is actually an interleaving hack, done in an attempt to reduce latency and increase overall throughput. But it is not the same as S939/940's _true_ 368 traces for a _true_ 128-bit DDR.
[ **ANAL NOTE: The term "bank" in traditional PC/RISC architecture is actually 32-bit. So a 64-bit DIMM is 2 banks. ]
BTW, AMD multi-cores simply use an internal HyperTransport. They are ultra-simple to design.
Intel, on the other hand, uses a bridge for dual-core, which has massive limitations. This is why Intel is moving towards dual-ported FSBs, because it's just an evolution of what they've done with dual-core internally to the IC package.
"Bryan J. Smith" b.j.smith@ieee.org wrote:
Digital EV6 platforms have ports (up to 16) into a "crossbar switch." The logical front-side bus is also 64-bit, including S462. Again, any claims of "dual DDR" is actually an interleaving hack, done in an attempt to reduce latency and increase overall throughput. But it is not the same as S939/940's _true_ 368 traces for a _true_ 128-bit DDR.
I should point out that it _is_ possible to have two (2) or even more DDR channels connected to an EV6 crossbar. And you could have _multiple_ CPUs, which are on different "ports," bursting traffic to different memory channels on different memory "ports."
But _none_ of the "dual DDR" Athlon implementations did that AFAIK. Only Digital Alpha 264 systems used multiple 64-bit memory ports. It was commonly 2, 4 or 8 EV6 Alpha 264 CPUs, with 2 or 4 SDR/DDR memory channels and a few 64-bit PCI bridges, to round out the maximum of 16 "ports" into the crossbar.
The Athlon MP (AMD762 northbridge) platform used only 5 ports -- 2 CPU (each CPU had its own interconnect), 1 memory, 1 PCI32@266MHz (AGP2.0 aka x4) and 1 PCI64@33 (AMD766) or 1 PCI64@66 (AMD768). Supposedly up to four (4) AMD762 northbridges could be used in a system -- connected to each other using the PCI32@266MHz channel (normally used by AGP). But no PC mainboard I saw ever did (and by the time API came around, they were only using one AMD762 for Alpha 264 systems).
Just for the record, here are the default values for the kernel variable "CONFIG_NR_CPUS=" for the following arches:
x86_64=1
x86_64.smp=8
i686=1
i686.smp=32
ppc=4
ppc64=64
s390=32
s390x=64
ia64=64
These values are the default values for RH ... the only one I see as a major problem actually is x86_64 ... I would think that 8 might be too small with the dual-core machines out there. I would think a machine with 4 dual-core processors would show up as 8 CPUs ... anything more would probably not be recognized correctly.
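A quick way to sanity-check what a running kernel actually sees (a rough sketch using glibc's sysconf(); counting processor entries in /proc/cpuinfo gives the same answer):

    #include <stdio.h>
    #include <unistd.h>     /* sysconf() */

    int main(void)
    {
        long conf   = sysconf(_SC_NPROCESSORS_CONF);   /* CPUs the kernel knows about */
        long online = sysconf(_SC_NPROCESSORS_ONLN);   /* CPUs actually brought online */

        printf("configured: %ld, online: %ld\n", conf, online);
        /* If a 4-socket dual-core box shows fewer than 8 here, the kernel's
           CONFIG_NR_CPUS limit (or missing platform support) is the likely cap. */
        return 0;
    }
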
Currently we are not changing these values in either the main or centosplus kernels, but we might set all the smp supported config files to 64 in the centosplus kernel for future versions.
Thanks to Karanbir Singh (z00dax on IRC) for helping me research this :)
Johnny Hughes mailing-lists@hughesjr.com wrote:
Just for the record, here are the default values for the kernel variable "CONFIG_NR_CPUS=" for the following arches: x86_64=1 x86_64.smp=8 i686=1 i686.smp=32 ppc=4 ppc64=64 s390=32 s390x=64 ia64=64 These values are the default values for RH ...
Because Red Hat has the hardware support.
You don't find >2 or >4 CPU i686 systems in the wild that don't require some sort of additional code for the hardware. Red Hat has gotten that code for the few Xeon and Itanium architectures. In reality, most of those are _proprietary_ implementations -- and they _differ_ between vendors. GTL+ is very, very simplistic.
Opteron is clearly more of a "tier-2" solution right now (which has more to do with the AMD v. Intel lawsuit than anything), hence why you don't have a large volume of >4 CPU Opterons yet, or the code to support them. Consumers are trying to get HP to change -- e.g., Opterons comprised 30% of their server sales at one point. Backorders are in the weeks, but not because of AMD stock. ;->
So there's been little reason for Red Hat to support more than 4xS940 in a system.
the only one I see as a major problem actually is x86_64
...
I would think that 8 might be too small with the dual core machines out there. I would think 4 dual core machines would show up as 8 CPUS ... anything more would probably not be recognized correctly.
Who says that any arbitrary 32-way Xeon platform will be recognized correctly with RHELv3/4? There are many cases where only 2 processors will be recognized _despite_ the kernel setting.
Why? Again, beyond 2 or 4-way Xeon/Itanium, you're talking _proprietary_ implementations. They have extra bridging ICs and APIC registers that need to be supported. These are not "stock 1/2-way bridge/APIC compatible."
Currently we are not changing these values in either the main or centosplus kernels, but we might set all the smp supported config files to 64 in the centosplus kernel for future versions.
Again, it might not matter if the code to support the system interconnects is not standard. Granted, HyperTransport is quite commodity, and they are using a straight HyperTransport line from an Opteron 8xx CPU between boards. But there are serious hardware support questions that need to be addressed.
E.g., you can't just let the HyperTransport broadcasts ripple through a half-dozen HyperTransport links. That means the hardware has to add additional facilities for inter-board communication, meaning the OS has to know how to deal with them.
And it sounds like those haven't made it into the RHELv4 kernel yet.
On Thursday 08 September 2005 10:49, Tony Schreiner wrote:
What is the maximum number of AMD64 cores supported by CentOS 4?
While I know this doesn't help much in this context, and doesn't directly answer your question, indirectly it is relevant.
The Gigaplane/UPA architecture of the Sun Enterprise XX00 (3000-6500) allows up to 16 connections in the E6X00 (8 connections in the E4X00 and E5X00). Each pair of CPUs has local RAM on its UPA board (crossbar-switched interconnect), and each board has a port on the Gigaplane bus (2.6GB/s throughput in the 83MHz version, 3.2GB/s at 100MHz). Up to 15 CPU/memory boards plus one I/O board fit per E6500 chassis/Gigaplane, giving up to 30 CPUs and 60GB of RAM max.
On this hardware, the 2.6 SPARC kernel is artificially limited to 24 processors; there seem to be stability issues over 24 CPUs. I'm burning in a 2.6.12 SPARC kernel (Corona from the Aurora project) on a 14x400 E6500 now (16GB of RAM).
Oddly enough, the power requirements for this beast and for an octal Opteron are pretty matched, about 1.5KW or a little more. Certainly the 8x Opteron will be faster on many things; but under heavy multiuser load the 14-way SPARC does a surprisingly good job, with around three quarters the performance of a dual 3GHz Xeon (that outclasses the SPARC box in every way possible except interconnect) at a load average of 30 or so. At a load average of 30, the E6500 feels more responsive than my laptop (1.7GHz Pentium M) at a load average of 2.
Lamar Owen lowen@pari.edu wrote:
Certainly the 8x Opteron will be faster on many things; but under heavy multiuser load the 14-way SPARC does a surprisingly good job, with around three quarters the performance of a dual 3GHz Xeon (that outclasses the SPARC box in every way possible except interconnect)
STOP THE MADNESS!!!
Do _not_ use the Xeon as an example of how a SPARC would perform versus the Opteron. The interconnects of the Xeon and Opteron are extremely different! So don't use them in the same context.
I'm begging! ;->
NOTE: There is a major reason why Sun is switching to Opteron. SPARC can't match it interconnect-wise, at least up to 8x S940. Beyond 8x S940, things change.
On Friday 09 September 2005 12:21, Bryan J. Smith wrote:
Do _not_ use the Xeon as an example of how a SPARC would perform versus the Opteron.
Since the Xeon will likely perform worse than an equivalent speed Opteron, this is a valid comparison, and it has nothing to do with interconnect.
The interconnects of the Xeon and Opteron are extremely different! So don't use them in the same context.
Why not? Yes, they are different; why don't you stop assuming that people who use Opteron and Xeon in the same sentence or paragraph are not as clueful as yourself? I am fully aware of the differences and of the similarities in the Hammer versus Xeon architecture; yet, since I have no Opterons here (for servers, we buy Dell (for reasons other than raw performance), and Dell doesn't yet do Opteron), a simple comparison to a Xeon is the best I can do. I was very pleased at the donated E6500's performance.
NOTE: There is a major reason why Sun is switching to Opteron. SPARC can't match it interconnect-wise, at least up to 8x S940. Beyond 8x S940, things change.
There are other reasons, not the least of which is that SPARC is difficult to get to run at higher clock speeds (hardware contexts, IIRC). HyperTransport and UPA share many architectural similarities, though.
Lamar Owen lowen@pari.edu wrote:
Since the Xeon will likely perform worse than an equivalent speed Opteron, this is a valid comparison, and it has nothing to do with interconnect.
??? What world do you live in ??? When it comes to a server, interconnect is everything. Especially the more I/O you do.
Why not? Yes, they are different; why don't you stop assuming that people who use Opteron and Xeon in the same sentence or paragraph are not as clueful as yourself? I am fully aware of the differences
They are _completely_ different.
and of the similarities in the Hammer versus Xeon architecture;
GTL+ and NUMA/HyperTransport are _completely_ different.
yet, since I have no Opterons here
Yes, that was pretty obvious.
(for servers, we buy Dell (for reasons other than raw performance), and Dell doesn't yet do Opteron), a simple comparison to a Xeon is the best I can do.
And I said don't do it. You can't. They are _not_ comparable at all. There is far more similarity between SPARC and Xeon than Opteron when it comes to how the interconnect works, although UPA/SPARC is much closer to Opteron than Xeon.
I was very pleased at the donated E6500's performance.
Yes, because you compared it to Xeon. But you were doing it in the context of what the performance versus Opteron would be. It's wholly inapplicable.
There are other reasons, not the least of which is that SPARC is difficult to get to run at higher clock speeds (hardware contexts, IIRC).
Don't even look at clock speeds. They are not comparable between _any_ platforms, and they have _nothing_ to do with most server operations. Interconnect is everything when it comes to the ability to move data.
Hypertransport and UPA share many architectural similarities, though.
Actually, UPA and EV6 share many architecture similarities.
HyperTransport is actually very generic, and a radical change. Furthermore, it really matters what you are tunneling over HyperTransport.
Some HyperTransport implementations -- e.g., IBM PowerPC 9xx -- tunnel everything over it, including the fact that memory is UMA and not local to the CPU. This is not much different than traditional UPA, EV6 and other "crossbar" system interconnects.
Other HyperTransport implementations -- e.g., AMD Opteron -- use HyperTransport via NUMA, and their performance varies wildly on the ability of the OS to handle processor affinity for programs and I/O. It is very, very, _very_ different from the standpoint of system management, even if the firmware/logic allows transparent use of older, non-affinity or only "affinity hinting" OSes to utilize it.
[One final reply to Bryan before *PLONK* goes in here]
On Friday 09 September 2005 14:37, Bryan J. Smith wrote:
Lamar Owen lowen@pari.edu wrote:
Since the Xeon will likely perform worse than an equivalent speed Opteron, this is a valid comparison, and it has nothing to do with interconnect.
??? What world do you live in ???
The world that shows that, if 3 is less than 4, then 3 is also less than 10. If the E6500 performs worse than a Xeon, then it will also perform worse than an Opteron. Simple enough for a child, yet it escapes you.
They are _completely_ different.
No, they are not completely different. Last time I checked, they both executed instructions, they both used RAM to store those instructions, and they behaved as a Von Neumann machine. They are only different in the details that you wish to emphasize.
GTL+ and NUMA/HyperTransport are _completely_ different.
Xeon != GTL+ and Opteron != HT. GTL+ and HT are merely the interconnect, which, while it very much does impact the performance of a box, it is just the interconnect; having a different interconnect does not make two CPU's _completely_ different, and you should know that. Is aluminum _completely_ different from steel?
yet, since I have no Opterons here
Yes, that was pretty obvious.
Why don't you donate one? :-)
I was very pleased at the donated E6500's performance.
Yes, because you compared it to Xeon. But you were doing it in the context of what the performance versus Opteron would be. It's wholly inapplicable.
The thread was about number of CPUs; I answered in a fashion that indicated that I knew that this was only indirectly applicable (but since when have you paid attention to what someone actually said?). I did a comparison that cast the E6500 (a five year old box) against a decent server (by today's standards) (BTW: not running x86_64 code, either, but i386 code) in a pretty decent light. I made a fairly simple comment that you have blown completely out of proportion, and you have made this worse than useless. You have told me nothing I did not already know, other than that doing a *PLONK* on your incoming e-mail to my servers would be a Big Win.
There are other reasons, not the least of which is that SPARC is difficult to get to run at higher clock speeds (hardware contexts, IIRC).
Don't even look at clock speeds. They are not comparable between _any_ platforms, and they have _nothing_ to do with most server operations. Interconnect is everything when it comes to the ability to move data.
You once again fail to read what I wrote. The UltraSPARC CPU (not interconnect) architecture is difficult to get to run at higher clock speeds. The Opteron, by being lower cost at a given speed (and speed does matter, even though raw clock speed doesn't matter as much as many think; speed is a big factor for number crunching), outclasses the SPARC systems these days. Economic reasons as much as any other probably played a significant role here, and I personally do not agree that interconnect technology was the major factor. Of course, a statement to the contrary by someone in Sun who helped make that decision would prove me wrong.
Other HyperTransport implementations -- e.g., AMD Opteron -- use HyperTransport via NUMA, and their performance varies wildly on the ability of the OS to handle processor affinity for programs and I/O. It is very, very, _very_ different from the standpoint of system management, even if the firmware/logic allows transparent use of older, non-affinity or only "affinity hinting" OSes to utilize it.
The Sun Starfire and Gigaplane architectures both are impacted by processor-RAM affinity, since access to the local RAM on any given CPU/memory card is via the local UPA and doesn't have to hit the board-external interconnect. In a manner of speaking, Gigaplane is a type of NUMA, even though it isn't 'true' NUMA. Starfire OTOH can be true NUMA (the architecture came from SGI). Don't bother to 'correct' me (I know I'm being somewhat generic in those statements, and, if I had time and wanted to do so, I could delve into the nitty-gritty); I won't see your reply.
Lamar Owen lowen@pari.edu wrote:
The world that shows that, if 3 is less than 4, then 3 is also less than 10.
SPARC and Xeon are not comparable in generalities. Xeon and Opteron are not comparable in generalities.
No, they are not completely different. Last time I checked, they both executed instructions,
Are you for real?
they both used RAM to store those instructions,
Are you for real?
and they behaved as a Von Neumann machine.
NOTE: The definition of "Von Neumann" is not as generalized in the EE world. ;-> It's like saying E=mc^2 to a physicist.
They are only different in the details that you wish to emphasize.
But those details _mean_everything_!
Xeon != GTL+ and Opteron != HT. GTL+ and HT are merely the interconnect, which, while it very much does impact the performance of a box, it is just the interconnect; having a different interconnect does not make two CPU's _completely_ different, and you should know that.
No, it's _everything_ when it comes to a server!
Is aluminum _completely_ different from steel?
Poor analogy. It's like calling a pickup and an 18-wheeler both "trucks."
The thread was about number of CPUs;
Yes! And when you use more than 1 Xeon or 1 Opteron, _everything_ changes. And Xeon differs drastically from Opteron. Heck, Xeon and Itanium are about the same in this regard.
Once you pass 2-4 Xeons or Itaniums, then you're talking hardware-specific kernel hacks. Such implementations are typically proprietary -- especially beyond 4 sockets.
Opteron uses a more flexible inter-CPU interconnect. Custom bridging is not required. But beyond 8 sockets is not standardized quite yet. Since they are rare, the hacks have not been added to the kernel.
I don't know how much more "real world" I can be on this. Everyone is talking in generalizations and I'm trying to say exactly what the differences are.
I answered in a fashion that indicated that I knew that this was only indirectly applicable
No, it's utterly inapplicable. If you want to compare Xeon and SPARC, that's one thing. But every single assumption I've read on Opteron in this thread has been so inapplicable, I just can't understand it.
Stock Xeon and Itanium do not support >4 nodes. You have to use non-standard/non-commodity bridging. There are several vendors with such, from a few semi-consistent 8x S604 (Xeon) designs to 32x S604 (HP Xeon) to SGI and other up-to-64-way Itanium. They are _not_ standard implementations, and additional hardware support is required.
Opteron has yet to have such non-commodity implementations, at least in volume. Although HyperTransport does make some things transparent, that only holds up to 8x S940 at this point. I haven't seen a "standard" reference for more than 8x S940, and the modular approaches used with HT extenders are still being worked out.
Hence why the hardware support hasn't been added to the kernel.
CASE-IN-POINT: Merely boosting the x86-64 limit to support more than 8 processors, or even 16 processors, will do _little_ to support that many if the hardware support for those designs isn't there.
I've seen the same thing with 8-way Xeon systems, people only seeing 2 processors. Why? Because the bridging approach was not supported in the kernel.
(but since when have you paid attention to what someone actually said?).
Actually, I paid very close attention! At first I was interested. Then I was disgusted at your poor application after I saw where you went with it.
I did a comparison that cast the E6500 (a five year old box) against a decent server (by today's standards)
You made a comparison to Xeon. It is wholly inapplicable to make assumptions about the Opteron from that -- period.
(BTW: not running x86_64 code, either, but i386 code)
There's actually little difference from a server standpoint when it comes to EM64T, but that's another thread.
in a pretty decent light. I made a fairly simple comment that you have blown completely out of proportion,
Because you feel compelled to discuss a platform you have never used, and seemingly do not understand.
I have seen several posts on this board from people who think they understand Opteron -- several even comparing the HyperTransport interconnect approach of non-Opteron to Opteron, which is a bit different.
and you have made this worse than useless. You have told me nothing I did not already know, other than that doing a *PLONK* on your incoming e-mail to my servers would be a Big Win.
It would really help if you comment on what you have first-hand experience with, and not try to make assumptions based on poor third-hand experience.
I often get lambasted because what most people -- possibly a great majority -- see as an "anal little difference" is actually a very, very _big_ difference. That's why I am very strict on correcting such things.
Not to "be a jerk" -- but to point out that fact that a "common assumption" is very, very _incorrect_. I find I'm actually in the _very_small_minority_ when it comes to several things -- from semiconductor concepts to file/database server design to server performance -- on various IT lists. And I have to say I will _not_ join the majority anytime soon.
I like to think that's why I get repeat business. If you want to take offense to my comments, then that's your choice. But I really do "put my foot down" on things that are wholly inaccurate -- and 9 times out of 10, it's the difference between first-hand experience and third-hand.
You once again fail to read what I wrote. The UltraSPARC CPU (not interconnect) architecture is difficult to get to run at higher clock speeds.
Why does clock speed matter one iota when it comes to servers? Why? Processors don't even need a clock (but that's another story)!
Get off clock speed. The only thing clock speed is good for is measuring performance of the _exact_same_ core design. Otherwise, it's rather useless.
The Opteron, by being lower cost at a given speed (and speed does matter, even though raw clock speed doesn't matter as much as many think; speed is a big factor for number crunching)
Those statements are *DEAD*WRONG*.
Clock speed in a clocked boolean logic (CBL) circuit is when the gates switch. In fact, given the speed of light is too slow, it's quite regionalized below 0.25um feature sizes.
The number of execution units, the type of execution units and their number of stages are what matter -- assuming the design is even superscalar! I've seen 500MHz SGS-Thompson embedded processors that are so-called "P3 class" get killed by a decade-old superscalar NexGen Nx586 at 84MHz.
And proprietary implementations of the Itanium, let alone the standard Alpha 264 which is years older and cheaper, make the Xeon dog-meat at 1/4-1/5th the clock in many, many server applications. In fact, before the Opteron, proprietary Itanium was better than Xeon for many, many server applications if you could afford it.
You can_not_ compare performance by clock speed _between_ products -- period. In fact, clock is slowly but surely being removed from significant portions of the processor. Asynchronous is returning because the clock is a _very_bad_ thing.
But I won't go there. The stupidest thing introduced in a microprocessor was the clock, let alone the operand+operator approach to instruction sets. But that has more to do with the fact that CS majors controlled IC design in the '70s, and engineers didn't come around until the '80s (which is where the "RISC hack" came about).
The good news is the next generation of microprocessor designs does away with the clock, the instruction set architecture (ISA) and other '70s legacy concepts designed by CS approaches. Physicists and engineers now control design, and software-based binary translation solves the compatibility issue.
outclasses the SPARC systems these days.
It has more to do with interconnect than clock. Trust me on this.
Proprietary Itanium systems are a great example. Just like proprietary Xeon systems before that.
But the Opteron is the first to do it commodity.
Economic reasons as much as any other probably played a significant role here,
Of course. Why do SPARC when Opteron has a better interconnect for less money?
And, BTW, SPARC _is_ available at about the _same_ clock as Opteron for the price.
and I personally do not agree that interconnect technology was the major factor.
Then you would be in the minority among system designers. I'm not talking about solution providers, I mean engineers.
It's the commodity systems interconnect of the Opteron that has made it the "cost king." You can't get its interconnect without some non-standard interconnect design with Xeon, Itanium, etc..., or using a less commodity RISC platform like SPARC.
The problem is that beyond 8-way S940 hasn't really taken off ... yet. There are some non-commodity approaches to be standardized on yet, but it's happening. And when it does, the kernel support will be added.
Hence why enabling 32+ way for Opteron won't do squat right now.
Of course, a statement to the contrary by someone in Sun who helped make that decision would prove me wrong.
Feel free to assume I'm pulling everything out of my @$$. ;->
The Sun Starfire and Gigaplane architectures both are impacted by processor-RAM affinity, since access to the local RAM on any given CPU/memory card is via the local UPA and doesn't have to hit the board-external interconnect.
Yes. And there is some I/O affinity too. Opteron adds a few more things with its direct, partial-mesh HyperTransport approach, instead of the crossbar of UPA. But both have NUMA.
Which is why Solaris' maturity in this regard makes it an _ideal_ operating system for Opteron _today_. It's grown up with not only NUMA, but cross-bar interconnects which are half-way to a partial mesh.
Linux has grown up on "front side bottleneck" into a "memory controller hub."
In a manner of speaking, Gigaplane is a type of NUMA, even though it isn't 'true' NUMA.
Yes. But it still has to hit the crossbar to get to I/O. Opteron has processor affinity for I/O too.
Starfire OTOH can be true NUMA (the architecture came from SGI).
Yes, I know. SGI even transferred some of the OS code to Microsoft for NT in their short-lived NT move.
Don't bother to 'correct' me (I know I'm being somewhat generic in those statements, and, if I had time and wanted to do so, I could delve into the nitty-gritty);
As could I.
I won't see your reply.
Ignorance is bliss.
Bryan J. Smith wrote:
Lamar Owen lowen@pari.edu wrote:
The world that shows that, if 3 is less than 4, then 3 is also less than 10.
SPARC and Xeon are not comparable in generalities. Xeon and Opteron are not comparable in generalities.
No, they are not completely different. Last time I checked, they both executed instructions,
Are you for real?
Bryan, may I speak as a friend?
I think that the thread about yum has gotten you (somewhat) understandably agitated, and it's starting to spill over a little into other areas. I suggest that you back off and relax with a cold one (it's 5:00 here in Dallas), and start over tomorrow.
I have no idea about this thread, since I'm not a CPU expert. I have written code with spinlocks for multiprocessors, but that's about it. My main experience is with MIMD loosely-coupled, not tightly-coupled. So I'm not going to comment on the thread in general. I will say that it is pretty ridiculous to claim that two processors are essentially the same because they are both stored-program machines.
Anyway, maybe things are just getting a bit much for you today. I know that they have for me a few times.
Mike
Actually, the other person that Bryan is talking to is saying things that are wholly wrong. Wrong to the point that even a non-engineer type like me knows how off-base the other party is. Bryan is understandably agitated, but the other party needs to do a bit of research (easily done on Google) about CPU and I/O interconnects in a high-I/O environment. Hopefully the other party will see the huge error in his statements and at least acknowledge the errors.
Mike McCarty wrote:
Bryan J. Smith wrote:
Lamar Owen lowen@pari.edu wrote:
The world that shows that, if 3 is less than 4, then 3 is also less than 10.
SPARC and Xeon are not comparable in generalities. Xeon and Opteron are not comparable in generalities.
No, they are not completely different. Last time I checked, they both executed instructions,
Are you for real?
Bryan, may I speak as a friend?
I think that the thread about yum has gotten you (somewhat) understandably agitated, and it's starting to spill over a little into other areas. I suggest that you back off and relax with a cold one (it's 5:00 here in Dallas), and start over tomorrow.
I have no idea about this thread, since I'm not a CPU expert. I have written code with spinlocks for multiprocessors, but that's about it. My main experience is with MIMD loosely-coupled, not tightly-coupled. So I'm not going to comment on the thread in general. I will say that it is pretty ridiculous to claim that two processors are essentially the same because they are both stored-program machines.
Anyway, maybe things are just getting a bit much for you today. I know that they have for me a few times.
Mike
William Warren wrote:
Actually, the other person that Bryan is talking to is saying things that are wholly wrong. Wrong to the point that even a non-engineer type like me knows how off-base the other party is. Bryan is understandably agitated, but the other party needs to do a bit of research (easily done on Google) about CPU and I/O interconnects in a high-I/O environment. Hopefully the other party will see the huge error in his statements and at least acknowledge the errors.
Major oops here!
Yes, it seems to me that you are correct. But that post was INTENDED to be private!
Sorry, Bryan.
I apologize profusely for posting that on the public forum.
It is unforgivable, but maybe you can forgive me anyway...
Mike
Mike McCarty wrote:
Bryan, may I speak as a friend? I think that the thread about yum has gotten you (somewhat) understandably agitated, and it's starting to spill over a little into other areas.
Actually, when I asked "are you for real?" I was being _dead_honest_. He's so off-base, I honestly can't believe he's saying what he's saying.
The YUM thread actually wasn't too agitating. And now that DAG has spoken up, it's a good sign to drop things. I actually came up with a hack that I'm going to try internally. If it does the job, I'll offer it as a patch to various YUM tools and the YUM client.
I suggest that you back off and relax with a cold one (it's 5:00 here in Dallas)
Can't. I'm working late every night, including the weekend right now. I went from my last client last week (and finishing endless documentation) to my new employer this week and we're knee deep in supporting FEMA.
Not to mention there was 1,000 miles between the two.
William Warren hescominsoon@emmanuelcomputerconsulting.com wrote:
Actually, the other person that Bryan is talking to is saying things that are wholly wrong. Wrong to the point that even a non-engineer type like me knows how off-base the other party is. Bryan is understandably agitated, but the other party needs to do a bit of research (easily done on Google) about CPU and I/O interconnects in a high-I/O environment.
That's why I asked if he "was for real"? ;->
I am _purposely_not_ bringing up my background because people say I'm sporting my resume.
Hopefully the other party will see the huge error in his statements and at least acknowledge the errors.
I could care less whether or not he acknowledges them. I'm not into "ego" -- I'm into "accuracy." If you re-read through _all_ my posts, I'm all about "accuracy."
I _never_ hold things over people, although I know my insistence on being "accurate" does cause many people to understandably dislike me. I find the people I have the most "difficulty" with are the ones who constantly differ with me.
Especially ones who have not actually deployed first-hand what they are talking about. No offense, but that's the YUM and maximum CPU threads right there.
But DAG has spoken, so I will now fall silent.
On Friday 09 September 2005 18:06, William Warren wrote:
Actually, the other person that Bryan is talking to is saying things that are wholly wrong.
You know, my name is not hard to spell.
Wrong to the point that even a non-engineer type like me knows how off-base the other party is. Bryan is understandably agitated, but the other party needs to do a bit of research (easily done on Google) about CPU and I/O interconnects in a high-I/O environment.
Stop. Go back and reread the original post. I said: "Certainly the 8x Opteron will be faster on many things; but under heavy multiuser load the 14-way SPARC does a surprisingly good job, with around three quarters the performance of a dual 3GHz Xeon (that outclasses the SPARC box in every way possible except interconnect) at a load average of 30 or so. "
I mentioned nothing there that is patently wrong. I did a simple benchmark that showed the E6500 held up nicely under load. Made a simple statement about it performing AROUND (that is, approximately) 75% of a different box's speed.
This was not an engineering-type post; while my degree IS in engineering, I made a very simple general observation of the capabilities of a crossbar interconnected system versus a bus-type system.
Exactly where is that completely wrong? Where is that off-base? I said the E6500 was outclassed in every way EXCEPT interconnect BY a Xeon box: I said NOTHING about an Opteron box except the very broadest generalization (since I don't have an Opteron on hand to try it out on).
Further, this very E6500 is the one that I'm offering as a build box for CentOS SPARC; this makes that portion on topic for, if not this list, the -devel list.
You can benchmark the dual Xeon against an 8x Opteron yourself.
As to research on I/O interconnects in a high I/O environment, been there, done that. There is more to a server's load than I/O. I have an application, IRAF, that is very compute intensive. Raw FPU gigaflops matters to this application, which runs in a client-server mode. Raw FPU gigaflops rises in standard stored program architectures roughly linearly with clock speed (this is of course not true in massively parallel and DEL-type architectures (such as SRC's MAP processor (getting a MAPstation here for real-time cross-correlation interferometry))); and, given a particular processor (say, SPARC) getting more clock speed will usually (but of course not always) get you more FPU power.
But that's not relevant.
The original post was simply that the Linux kernel does well with a large number of processors, at least on SPARC. Good grief.
Lamar Owen lowen@pari.edu wrote:
You know, my name is not hard to spell.
Actually, many people seem to have a hell of a time spelling mine around here. But if you note, I don't mind one bit. I don't pick on etiquette, misspelling, etc... The only things that get to me are hypocrisy and lack of first-hand experience.
Stop. Go back and reread the original post. I said: "Certainly the 8x Opteron will be faster on many things; but under heavy multiuser load the 14-way SPARC does a surprisingly good job, with around three quarters the performance of a dual 3GHz Xeon (that outclasses the SPARC box in every way possible except interconnect) at a load average of 30 or so." I mentioned nothing there that is patently wrong.
First off, I still don't see how you could move from talking about the Opteron to comparing SPARC to Xeon. There are absolutely no similarities other than ISA, which is _not_ a performance issue.
Secondly, I think someone else was focusing on your _latter_ comments, which prompted me to ask, "are you for real?" I think it was that which prompted the various responses.
I did a simple benchmark that showed the E6500 held up nicely under load. Made a simple statement about it performing AROUND (that is, approximately) 75% of a different box's speed.
And that's _fine_ for comparing the 6500 SPARC UPA system to a dual-Xeon FSB-MCH clusterfsck at a _specific_ application.
How this benchmark translates to Opteron, I have no idea.
This was not an engineering-type post;
I'm still trying to figure out what post it was. But then you really went off "the deep end" with your follow-up -- _that's_ what prompted others to respond AFAICT.
while my degree IS in engineering, I made a very simple general observation of the capabilities of a crossbar interconnected system versus a bus-type system.
And Opteron is _neither_. It's a partial mesh.
Exactly where is that completely wrong? Where is that off-base? I said the E6500 was outclassed in every way EXCEPT interconnect BY a Xeon box: I said NOTHING about an Opteron box except the very broadest generalization (since I don't have an Opteron on hand to try it out on).
Again, first you made the relationship. Then you expanded on it with quite incorrect information. I don't know your reasons, but your statements were trying to be technically specific enough at points, then overly general at others.
In the end, I just asked you not to explain the performance of the Opteron by your benchmark. Then you really went off.
Further, this very E6500 is the one that I'm offering as a build box for CentOS SPARC; this makes that portion on topic for, if not this list, the -devel list.
Actually, the thread was on processor support.
I merely pointed out that you can set Opteron to 1024 and it doesn't make a damn bit of difference if the hardware configuration isn't supported.
Same deal with Xeon. Up to 32 is supported, but I've booted some 8-way Xeon systems and only gotten 2 processors because the bridging logic wasn't supported.
You can benchmark the dual Xeon against an 8x Opteron yourself.
I'm still scratching my head on what you were saying. All I know is that it was wholly inapplicable to the performance of Opteron.
As to research on I/O interconnects in a high I/O environment, been there, done that. There is more to a server's load than I/O.
Don't forget memory and memory mapped I/O, let alone the _system_ interconnect to support any CPU-memory-I/O alongside the _peripheral_ interconnect. Servers are data pushers and manipulators in many, many cases.
I have an application, IRAF, that is very compute intensive. Raw FPU gigaflops matters to this application, which runs in a client-server mode.
Then it's far more about the CPU, although if it's being fed data, memory and system interconnect can affect that. Especially if there is client-server communication.
Raw FPU gigaflops rises in standard stored program architectures roughly linearly with clock speed
On the _same_ core, _not_ different cores. You can_not_ compare different cores by clock.
Otherwise people wouldn't still be running 667-733MHz Alpha 264s, let alone Itanium2 733-800MHz systems with 3.8GHz Xeons out there. Clock is only a measure in the _same_ core design, _not_ different designs.
Heck, a P3 at 1.4GHz is _better_ than a P4 at 2.8GHz when it comes to many FPU operations. Throw the SSE switch to gain a P4 "advantage" and kiss your precision goodbye! Sadly enough, the Pentium M at 2.1GHz is Intel's _fastest_ x86 FPU engine -- you have to go to Itanium at, ironically, sub-1GHz to get anything faster from Intel.
(this is of course not true in massively parallel and DEL-type architectures (such as SRC's MAP processor (getting a MAPstation here for real-time cross-correlation interferometry))); and, given a particular processor (say, SPARC) getting more clock speed will usually (but of course not always) get you more FPU power.
When compared to the _same_ core. Clock is _incomparable_between_ cores. I honestly don't think you understand how superscalar architectures work. Some have more FPU units than others, some have FPU units staged out so they take far more cycles than others.
But that's not relevant. The original post was simply that the Linux kernel does well with a large number of processors, at least on SPARC. Good grief.
You went there too dude.
And my _original_ point continues to be that you can set the number of x86-64 processors to 1,024, and you might still only see 4-8 processors on a 32-way configuration.
Hardware support in the kernel for the system interconnect design is what matters. There is no "generic, transparent scalable system interconnect," although HyperTransport comes as close as you can get.
Bryan J. Smith b.j.smith@ieee.org wrote:
Lamar Owen lowen@pari.edu wrote:
Mommy, Daddy, please stop fighting!
Seriously though, I've gotten so much spam from this argument over the last day that I'm half tempted to unsubscribe from the whole thing and lose the valuable information resource, if only to avoid these inbox-crushing squabbles in the schoolyard.
You disagree. You don't see eye to eye. Both think the other is an insensitive close-minded clod. Let's leave it at that, huh?
Or at least take it private. For the sake of my bandwidth and storage space, if nothing else.
Please?
Solar Canine wrote:
Bryan J. Smith b.j.smith@ieee.org wrote:
Lamar Owen lowen@pari.edu wrote:
Seriously though, I've gotten so much spam from this argument over the last day that I'm half tempted to unsubscribe from the whole thing and lose the valuable information resource, if only to avoid these inbox-crushing squabbles in the schoolyard.
Me too..... ;(
Best Regards, Jon McCauley