[CentOS] Re: Is a new HP-dl380 dual XEON 64 bit or 32 bit -- AMD64 v. EM64T

Thu Jun 16 03:21:09 UTC 2005
Bryan J. Smith <b.j.smith at ieee.org>

On Wed, 2005-06-15 at 22:39 -0400, Peter Arremann wrote:
> <even more anal>Except the iommu, those are limitations of chipset, bus and 
> whatever, not EM64T.</even more anal> 

Yes and no.  In fact, it comes down to Intel still relying on a chipset
to do what most everyone else now does at the CPU interconnect itself.
Even the original Athlon MP moved many of those details into the CPU.
Much of that was forced by the crossbar switch of the Alpha EV6, because
the CPU can't be segmented from the interconnect logic if you use
multiple point-to-point connections.

The legacy concept that the CPU is independent of the interconnect is a
viewpoint that only Intel still largely holds today.  And when you say
"chipset," the context is completely different between AMD and Intel.
With AMD, the "chipset" is rather generic and largely glueless.  With
Intel, you can only have *1* point of contact between the CPU(s) and the
"memory hub."

> EV6 is what slot/socket A was all about... The Athlon64 and Opteron (which 
> just happen to implement the AMD64 instruction set) use HyperTransport.

They only use HyperTransport as a generic transport between other
HyperTransport devices -- be it another CPU or a HyperTransport
tunnel/bridge.  But the addressing in the AMD64 platform -- to local
memory as well as virtualized over HT -- is very much 40-bit EV6.

In other words, EV6 is at the heart of addressing outside the AMD64 CPU,
just like on the 32-bit Athlon before it (which was also capable of
40-bit addressing, at least in the Athlon MP -- long story).
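
(For scale:  40 address bits cover 2^40 = 1,099,511,627,776 bytes, i.e.
1 TB of physical address space, versus the 4 GB you get from plain
32-bit addressing.)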

> The instruction set has nothing to do with the interconnect - PowerPC and
> a whole bunch of other very use specific chips use hypertransport. 

Not true.  PowerPC implementations, like the 970, that use
HyperTransport do _not_ use it to the CPU.  They still use an Intel-like
"memory hub" and single point of contention.  I.e., they only use HT as
a system interconnect for I/O, not for the CPU itself.  They use their
own bus for CPU/memory.

Same deal for the inter-bridge connections between chipset components,
even on AGTL+ platforms like the nVidia and SiS chipsets.  The value of
HyperTransport is not realized there.  There is still only a _single_
point of interconnect to the CPU(s).

Athlon64/Opteron is the first commodity platform to bring something of a
"partial mesh" to the system interconnect.

> Nice flame - but has very little to do with real world. the IA64 architecture 
> got the basics right...

I disagree entirely.  The concept of optimizing the organization of
machine code is based on the premise that machine code is the best way
to feed instructions to silicon.  That concept has been considered
flawed for a long time -- since the '80s.  But because machine code is
how all software is ultimately delivered (even if written at higher
levels), that's why RISC came about:  optimize the machine code for
silicon considerations and run-time optimization in the chip, then hide
the added burden of the eccentric instruction set in the C compiler.

With EPIC, Intel merely thought it could do away with run-time
optimization in the chip and parallelize 3 instructions per instruction
word, to take RISC's typical ~60% stage utilization closer to 100%.  The
reality is that unless you are parallelizing to the depth of the
superscalar design in silicon, it's rather self-defeating.  I.e., you've
gotta turn the _entire_ programmer world upside down and get them to
think like IC design engineers (not likely).  And trying to do it in the
compiler _only_ was just ludicrous IMHO (and makes me wonder if Intel is
full of CS majors and not EEs anymore ;-).  Sorry, but the reality is
that you can't keep the pipes full with that approach _regardless_ of
what tricks you play with the opcode+operand machine code -- it's
inherent in the flaw of sending instructions to the processor as a
traditional machine-code stream.  RISC with a combination of run-time
and compile-time optimization is as good as it gets, not some CS ideal
to somehow make machine code "better."

Then let's talk Predication.  It's the other side of the same concept:
RISC only keeps ~60% of the pipes full anyway, so use those extra cycles
to execute both paths and just forget branch prediction and any logic
dedicated to it.  An analogy is trying to solve the problem of DRAM read
latency by adding more DRAM channels but chucking the SRAM cache because
it's too costly.  Sure, you're going to save on the transistor logic,
but you're just going to have more overhead and the same, increased
latency in the end.  The chance of a branch mispredict and stall is
rather small, just like an SRAM cache miss, so it's worth it to keep
branch prediction around, just like SRAM cache.
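
To make "execute both paths" concrete, here's a toy sketch of my own
(plain C, nothing from an actual IA-64 compiler):  a predicating
compiler if-converts a branch into straight-line code where both arms
are computed and a predicate selects the result.

  /* Branchy version: the hardware has to predict which arm runs. */
  int clamp_branch(int x, int limit)
  {
      if (x > limit)
          return limit;
      return x;
  }

  /* "If-converted" version: both arms are evaluated and a predicate
   * (the comparison result) selects the answer -- no branch for the
   * predictor to get wrong, but you always pay for both paths. */
  int clamp_predicated(int x, int limit)
  {
      int taken = (x > limit);                /* the "predicate" */
      return taken * limit + (1 - taken) * x; /* select, no branch */
  }

That trade only pays off when the branch would have been mispredicted
often -- which, as noted above, is not the common case.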

> The reason for the low performance of Itanium chips (low as in real world
> performance compared to what it could do theoretically) are because of
> the immaturity of the chip (not nearly as tweaked as a P4 is) 
> and platform (slow memory and then you expect great benchmark scores?)
> as well as some really really really stupid decisions.

I'm not even looking at P4, but comparing EPIC to RISC of the same
technology.

If "EPIC" and "Predication" are so good, why are they retrofiting run-
time optimization and traditional branch prediction back into the
design?

That's exactly what the Digital Semiconductor team predicted the IA-64
design teams would have to do -- predicted way back in 1997, years
before the first IA-64 Itanium hit silicon.  They explicitly stated that
the concept of compile-time-only optimization was never going to work.
Intel should have listened.  After all, Digital Semi basically invented
every major interconnect of the '90s, and showed Intel how to fix the
superscalar ALU going from the Pentium to the Pentium Pro (hence the
lawsuit that followed).

Even Itanium2 does not compete well with the aged Alpha 264 built on an
older, larger feature size (much less the new Alpha 364 on a newer one)
in its own, native instruction set.  It doesn't have anything to do with
memory or other technology adoption -- heck, Alpha has been well behind
Itanium in getting the latest silicon fabrication technology and it
still competes very well.

And probably the biggest insult to Itanium was the fact that Digital's
Binary Translation technology from the Alpha has been adopted for
Itanium.  Why?  Because it emulates PA-RISC and x86 _faster_ than the
IA-64 can in hardware.  Digital has always been right on everything,
from RISC to interconnects.  It's much better to build an anal RISC
architecture and then translate from one binary code to another (on the
same OS) than to try to build binary-code compatibility into the
architecture.

I'm sure IA-64 Itanium3 will benefit from completely chucking x86 and
PA-RISC in the hardware thanx to Digital's technology.

> Like allocating too few bits for the template... but these things are
> simply bad decisions on how to implement it, not something wrong with
> VLIW architectures in general.

Oh, I believe very much in VLIW architectures.  Transmeta's design was
an excellent example.
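
Since the template came up:  an IA-64 "bundle" is 128 bits -- three
41-bit instruction slots plus a 5-bit template field that says which
execution-unit types the slots map to and where the instruction-group
stops fall.  Something like this (a C sketch of my own, for illustration
only -- not how you would actually decode one):

  #include <stdio.h>
  #include <stdint.h>

  /* Illustration only: pulling apart an IA-64 bundle.  Template is in
   * bits 4:0, slot 0 in bits 45:5, slot 1 in bits 86:46, slot 2 in
   * bits 127:87.  The 5-bit template encodes the unit types
   * (M/I/F/B/L/X) for the three slots and where the stops fall. */
  static void decode_bundle(uint64_t lo, uint64_t hi)
  {
      uint64_t mask41 = (1ULL << 41) - 1;
      unsigned tmpl   = (unsigned)(lo & 0x1F);
      uint64_t slot0  = (lo >> 5) & mask41;
      uint64_t slot1  = ((lo >> 46) | (hi << 18)) & mask41; /* spans halves */
      uint64_t slot2  = hi >> 23;

      printf("template=%02x slot0=%011llx slot1=%011llx slot2=%011llx\n",
             tmpl, (unsigned long long)slot0,
             (unsigned long long)slot1, (unsigned long long)slot2);
  }

  int main(void)
  {
      decode_bundle(0x0123456789abcdefULL, 0xfedcba9876543210ULL); /* dummy */
      return 0;
  }

Five template bits for three slots is exactly the kind of tight encoding
Peter is talking about.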

But HP-Intel's concept of pure, compiler-side optimization in EPIC and
Predication was a CS ideological fantasy.  And Digital Semiconductor
predicted its monumental _failure_.  The MDR Microprocessor Forum even
opened a few years ago with a "Twilight Zone" hindsight theme in which
Intel had decided to forget EPIC and adopt Alpha, and was just about to
release the Alpha 364 (with all the leading-edge Intel fab technology --
damn, that would be tasty!).

But IA-64 is too far developed to drop now.  It has already replaced
Alpha, because Alpha has no future beyond the 364 and isn't designed for
the latest fab technologies.

> In fact, if you look at the Itanium chips, they are very RISC like. To
> the point where a lot of guys say its a risc core with a VLIW decoder
> in front of it... and that the VLIW decoder happened to be the main
> issue is, at least to me, hysterical.

It is the reliance on the compiler alone for optimization.  It's like
Intel picked a half-assed point between RISC and VLIW and said, let's
merge these concepts and rely entirely on compile-time optimization.  I
honestly don't know what they were thinking with EPIC -- and that's
before we even look at Predication.  As I said, it's like using more
DRAM channels and chucking the SRAM cache because it takes up a lot of
transistors -- the reality is that a 95%+ SRAM cache hit rate is your
best "bang for the buck."
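
Run the numbers with made-up but plausible figures:  say a cache hit
costs ~10 cycles and a trip to DRAM costs ~200.  At a 95% hit rate the
average access is 0.95*10 + 0.05*200 = 19.5 cycles; rip the cache out
and *every* access is ~200 cycles -- extra DRAM channels buy bandwidth,
not latency.  Same math for branch prediction:  a ~5% mispredict rate on
a ~20-cycle flush penalty averages ~1 cycle per branch, which is dirt
cheap compared to always executing both paths.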


-- 
Bryan J. Smith                                     b.j.smith at ieee.org 
--------------------------------------------------------------------- 
It is mathematically impossible for someone who makes more than you
to be anything but richer than you.  Any tax rate that penalizes them
will also penalize you similarly (to those below you, and then below
them).  Linear algebra, let alone differential calculus or even ele-
mentary concepts of limits, is mutually exclusive with US journalism.
So forget even attempting to explain how tax cuts work.  ;->