On Wed, 2005-06-15 at 22:39 -0400, Peter Arremann wrote:
<even more anal>Except the iommu, those are limitations of chipset, bus and whatever, not EM64T.</even more anal>
Yes and no. In fact, it has to do with the fact that Intel is still relying on a chipset to do what most everyone else is doing at the CPU interconnect. Even the original Athlon MP moved many details into the CPU. Much of this was forced by the crossbar switch of Alpha EV6, because the CPU can't be segmented from the interconnect aspects if you use multiple connections.
The legacy concept that the CPU is independent of the interconnect is a viewpoint held largely by Intel alone today. When you say "chipset" -- the context is completely different between AMD and Intel. With AMD, the "chipset" is rather generic and largely glueless. With Intel, you can only have *1* point between the CPU and the "memory hub."
EV6 is what slot/socket A was all about... The Athlon64 and Opteron (which just happen to implement the AMD64 instruction set) use HyperTransport.
They only use HyperTransport as a generic transport between other HyperTransport devices -- be it another CPU or a HyperTransport tunnel/bridge. But the addressing in the AMD64 platform, both to memory and as virtualized over HT, is very much 40-bit EV6.
In other words, EV6 is at the heart of addressing outside the AMD64 CPU, just like on 32-bit Athlon before it (which was also capable of 40-bit, at least in the Athlon MP, long story).
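For a sense of scale, the arithmetic behind those address widths is quick to check. This is only a back-of-envelope sketch in Python; the helper name is made up for illustration, and the 32-bit line is there purely for comparison.

```python
GIB = 2 ** 30  # bytes in one gibibyte
TIB = 2 ** 40  # bytes in one tebibyte

def addressable_bytes(address_bits):
    """Bytes reachable with a flat physical address of the given width."""
    return 1 << address_bits

for bits in (32, 40):
    size = addressable_bytes(bits)
    print(f"{bits}-bit: {size // GIB} GiB")
```

So 40 bits of physical addressing works out to 1 TiB, or 256 times what a flat 32-bit address can reach.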
The instruction set has nothing to do with the interconnect - PowerPC and a whole bunch of other very use-specific chips use HyperTransport.
Not true. PowerPC implementations, like the 970, that use HyperTransport do _not_ use it to the CPU. They still use an Intel-like "memory hub" and single point of contention. I.e., they only use it as a system interconnect for I/O, but not for the CPU itself. They use their own bus for CPU/memory.
Same deal for inter-bridge connections between chipsets, even on AGTL+ platforms such as those built on nVidia and SiS chipsets. The value of HyperTransport is not realized. There is still only a _single_ point of interconnect to the CPU(s).
Athlon64/Opteron is the first commodity platform to bring something like a "partial mesh" to the system interconnect.
Nice flame - but has very little to do with the real world. The IA64 architecture got the basics right...
I disagree entirely. The concept of optimizing the organization of machine code is based on the premise that machine code is the best way to execute instructions in silicon. That concept has been considered flawed for a long time, since the '80s. But because machine code is what all software ultimately gets developed down to (even if written at higher levels), RISC came about. The idea was to optimize the machine code for silicon considerations and run-time optimization in the chip, and then hide the added burden of the eccentric instruction set in the C compiler.
With EPIC, Intel merely thought it could do away with run-time optimization in the chip, and parallelize 3 instructions in the instruction word, to take RISC's typical 60% stage utilization closer to 100%. The reality is that unless you are parallelizing to the depth of the superscalar design in silicon, it's rather self-defeating. I.e., you've gotta turn the _entire_ programmer world upside down and get them to think like IC design engineers (not likely). And trying to do it at _only_ the compiler was just ludicrous IMHO (and makes me wonder if Intel is full of CS majors and not EEs anymore ;-). Sorry, but the reality is that you can't keep the pipes full with this approach _regardless_ of what tricks you play with the opcode+operand machine code -- it's inherent to the flaw of sending instructions to the processor in the traditional machine code string. RISC with a combination of run-time and compile-time optimization is as good as it gets, not some CS ideal to somehow make machine code "better."
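To make "3 instructions in the instruction word" concrete, here is a rough Python sketch of the IA-64 bundle layout as I recall it from the published format: a 128-bit bundle with a 5-bit template in the low bits followed by three 41-bit instruction slots. The function names are my own, and the slot values in the usage lines are arbitrary placeholders, not real opcodes.

```python
TEMPLATE_BITS = 5   # assumed: template occupies bits 0-4
SLOT_BITS = 41      # assumed: three 41-bit slots follow the template
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots

# Placeholder values only; these are not real IA-64 encodings.
bundle = pack_bundle(0x10, [0x1, 0x2, 0x3])
print(unpack_bundle(bundle))
```

Note that 5 + 3 * 41 = 128, so the word is completely full, and a 5-bit template leaves room for only 32 slot-type combinations. Which slots can issue together is fixed by the compiler when it picks the template, which is exactly the compile-time-only dependence argued against above.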
Then let's talk Predication. It's the other side of the same concept: RISC only keeps 60% of the pipes full anyway, so we can use those extra cycles to execute both paths and just forget branch prediction and any logic dedicated to it. An analogy is trying to solve the problem of DRAM read latency by adding more DRAM channels but chucking the SRAM cache because it's too costly. Sure, you're going to save on the transistor logic, but you're just going to have more overhead and the same, increased latency in the end. The chance of a branch mispredict and stall is rather small, just like an SRAM cache miss, so it's worth it to keep branch prediction around, just like SRAM cache.
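The trade-off can be put into a toy cost model. This is only a sketch under loud assumptions: the cycle counts, mispredict rate, and flush penalty below are invented for illustration, and the model pessimistically assumes there are no spare issue slots, so a predicated branch pays for both paths serially.

```python
def predicted_branch_cycles(path_a, path_b, prob_a, mispredict_rate, flush_penalty):
    """Expected cycles when a hardware predictor guesses the branch.

    Only the path actually taken is executed; a mispredict adds a flush penalty.
    """
    useful = prob_a * path_a + (1 - prob_a) * path_b
    return useful + mispredict_rate * flush_penalty

def predicated_cycles(path_a, path_b):
    """Cycles when both paths are executed and the wrong result is discarded.

    Assumes no spare issue slots, so the two paths cost their full sum.
    """
    return path_a + path_b

def break_even_mispredict_rate(path_a, path_b, prob_a, flush_penalty):
    """Mispredict rate above which predication beats prediction in this model."""
    useful = prob_a * path_a + (1 - prob_a) * path_b
    return (predicated_cycles(path_a, path_b) - useful) / flush_penalty

# Hypothetical numbers: two 10-cycle paths, 5% mispredicts, 20-cycle flush.
print(predicted_branch_cycles(10, 10, 0.5, 0.05, 20))
print(predicated_cycles(10, 10))
print(break_even_mispredict_rate(10, 10, 0.5, 20))
```

With these made-up inputs the predictor would have to be wrong about half the time before executing both paths wins. Real predication does get some of the second path for free from idle slots, so the model only shows the shape of the argument, not measured behavior.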
The reason for the low performance of Itanium chips (low as in real world performance compared to what it could do theoretically) is the immaturity of the chip (not nearly as tweaked as a P4 is) and platform (slow memory and then you expect great benchmark scores?) as well as some really really really stupid decisions.
I'm not even looking at P4, but comparing EPIC to RISC of the same technology.
If "EPIC" and "Predication" are so good, why are they retrofitting run-time optimization and traditional branch prediction back into the design?
That's exactly what the Digital Semiconductor team predicted the IA-64 design teams would have to do -- predicted way back in 1997 -- years before the first IA-64 Itanium hit silicon. They explicitly stated that the concept of compile-time-only optimization was never going to work. Intel should have listened. After all, Digital Semi basically invented every major interconnect in the '90s, and also showed Intel how to fix their superscalar ALU in the Pentium Pro from the Pentium (hence the resulting lawsuit later).
Even Itanium2 does not compete well with the aged Alpha 264 at an older, larger feature size (much less the new Alpha 364 at a newer one) on its own, native instruction set. It doesn't have anything to do with memory or other technology adoption -- heck, Alpha has been well behind Itanium in getting the silicon fabrication technology and it still competes very well.
And probably the biggest insult to Itanium was the fact that Digital's Binary Translation technology from the Alpha has been adopted for Itanium. Why? Because it emulates PA-RISC and x86 _faster_ than the IA-64 can do in hardware. Digital has always been right on everything from RISC to interconnects. It's much better to build an anal RISC architecture, and then translate from one byte code to another (of the same OS), than try to build byte code compatibility in the architecture.
I'm sure IA-64 Itanium3 will benefit from completely chucking x86 and PA-RISC in the hardware thanx to Digital's technology.
Like allocating too few bits for the template... but these things are simply bad decisions on how to implement it, not something wrong with VLIW architectures in general.
Oh, I believe very much in VLIW architectures. Transmeta's design was an excellent example.
But HP-Intel's concept of pure, compiler-side optimization in EPIC and Predication was a CS ideological fantasy. And Digital Semiconductor predicted its monumental _failure_. MDR Microprocessor Forum even opened a few years ago with a "Twilight Zone" hindsight theme where Intel decided to forget EPIC and adopt Alpha, and they were just about to release the Alpha 364 (with all the leading-edge Intel fab technology -- damn that would be tasty!).
But IA-64 is too far developed to drop now. It has already replaced Alpha because Alpha has no future beyond 364, and isn't even designed for the latest fab technologies.
In fact, if you look at the Itanium chips, they are very RISC like. To the point where a lot of guys say it's a RISC core with a VLIW decoder in front of it... and that the VLIW decoder happened to be the main issue is, at least to me, hysterical.
It is the reliance on the compiler alone for optimization. It's like Intel picked a half-ass point between RISC and VLIW and said, let's merge these concepts and rely entirely on compile-time optimizations. I honestly don't know what they were thinking with EPIC -- and that's before we even look at Predication. As I said, it's like using more DRAM channels and chucking SRAM cache because it takes up a lot of transistors -- the reality is that 95%+ SRAM cache hits are your "best bang for the buck."
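The cache side of that analogy is easy to put numbers on with the usual average-access-time formula (hit time plus miss rate times miss penalty). A minimal Python sketch follows; the 3-cycle cache and 200-cycle DRAM latencies are assumptions picked for illustration, not figures for any particular chip.

```python
def average_access_cycles(hit_rate, cache_cycles, dram_cycles):
    """Average memory access time: every access pays the cache lookup,
    and misses additionally pay the DRAM latency."""
    miss_rate = 1 - hit_rate
    return cache_cycles + miss_rate * dram_cycles

# Hypothetical latencies: 3-cycle SRAM cache, 200-cycle DRAM.
with_cache = average_access_cycles(0.95, 3, 200)
without_cache = 200  # every access goes straight to DRAM

print(f"with 95% hits: {with_cache:.1f} cycles")
print(f"no cache:      {without_cache} cycles")
```

Under these assumptions a 95% hit rate brings the average down to roughly 13 cycles against 200 without the cache. Extra DRAM channels add bandwidth but do nothing for that per-access latency, which is the point of the analogy.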