New subject: [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing

21 Jun 2005


      From: Maciej Żenczykowski maze@cela.pl
...
That's a good point - does anyone know what the new Intel
Virtualization thingamajig in the new dual core pentium D's is about?
It's all speculation at this point.  But there are _several_ factors.
But I'm sure the first time Intel saw AMD's x86-64/PAE52 presentation,
the same thing popped into my mind that popped into Intel's mind ...
  Virtualization
- The 48-bit/256TiB limitation of x86-64 "Long Mode"
There is a "programmer's limit" of 48-bit/256TiB in x86-64 "Long Mode."
This limitation is due to how i386/i486-TLB works -- 16-bit segment,
32-bit offset.  If AMD had chosen to ignore such compatibility,
it would have been near-impossible for 32-bit/PAE36 programs to run
under a kernel of a different model.  But "Long Mode" was designed
so its PAE52 model could run both 32-bit (and PAE36) as well as new
48-bit programs.
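
As a quick check on the sizes quoted in this post, here is a minimal
Python sketch that converts an address width in bits into the size of
a flat byte-addressed space.  The labels (PAE36, PAE52, etc.) simply
follow the post's own terminology; the helper name address_space is
made up for this illustration.

```python
# Address-space sizes for the address widths mentioned in this post.
# Labels follow the post's terminology; address_space() is a
# hypothetical helper written only for this illustration.

UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def address_space(bits: int) -> str:
    """Return the size of a flat 2**bits byte address space in binary units."""
    size = 1 << bits
    unit = 0
    # Divide by 1024 until the value fits the largest applicable unit.
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size}{UNITS[unit]}"


if __name__ == "__main__":
    widths = [
        ("32-bit", 32),
        ("PAE36", 36),
        ("EV6 40-bit", 40),
        ("Long Mode 48-bit", 48),
        ("PAE52", 52),
    ]
    for label, bits in widths:
        print(f"{label:>18}: {address_space(bits)}")
```

So 48 bits gives the 256TiB figure used throughout, and 40 bits gives
the 1TiB figure quoted for EV6 further down.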
We'll revisit that in a bit.  Now, let's talk about Intel/AMD design
lineage.
- Intel IA-32 Complete Design Lineage
IA-32 Gen 1 (1986):  i386, including i486
- Non-superscalar:  ALU + optional FPU (std. in 486DX), TLB added in i486
IA-32 Gen 2 (1992):  i586, Pentium/MMX (defunct, redesigned in i686)
- Superscalar  2+1 ALU+FPU (pipelined)
IA-32 Gen 3 (1994):  i686, Pentium Pro, II, III, 4 (partial refit)
- Superscalar:  2+2 ALU+FPU (pipelined), FPU 1 complex or 2 ADD
- P3 = +1 SSE pipe, P4 = +2 SSE pipe
Intel hasn't revamped its aging i686 architecture in almost 12 years.
The Pentium Pro through Pentium III are the _exact_same_ 7-issue
(2+2+3 ALU+FPU+control) design (the P3 slaps on one SSE unit),
and the Pentium 4 was a quick, 18-month refit of longer pipes (with
associated reduction in ALU/FPU performance MHz for MHz) that
extended pipes for clock (and added a 2nd SSE unit).
I'm sure Intel's reasoning for not bothering with a complete generation
redesign beyond i686 is because it thought EPIC/Predication would
have taken over by now.  The reality has been quite the opposite
(which I won't get back into).
Since then, Intel has made a number of "hacks" to the i686 architecture.
One is HyperThreading which tries to keep its pipes full by using its
control units to virtualize two instruction schedulers, registers, etc...
In a nutshell, it's a nice way to get "out-of-order and register
renaming for almost free."  Other than basic coherency checking as
necessary in silicon, it "passes the buck" to the OS, leveraging its
context switching (and associated overhead) to manage some details.
That's why HyperThreading can actually be slower for some applications:
they do not thread, and the added overhead in _software_
leaves less processing time for the applications themselves.
"Yamhill" IA-32e aka "EM64T"  was just a P4 ALU refit for x86-64/PAE52,
but it lacks many design considerations that the Athlon has -- especially
outside the programmer/software considerations, and definitely more
at the core interconnect/platform.  I.e., because Intel continues to use
a single-point-of-contention "memory controller hub" (MCH), memory
interconnect and I/O management, among other details, are still left to
the MCH.  This is going to become more and more of a headache.  The
reality is that the Intel IA-32e platform _must_ get past the "northbridge
outside the CPU" attitude to compete with AMD.
As such, I have _always_ theorized that "Yamhill" is a 2-part project.
Part 2 is the first redesign of an x86 core in almost (now) 12 years,
which goes beyond merely adding true register renaming and out-of-
order execution (which are largely hacks in the P4/HT), but goes directly
to the concept of virtualizing cores.  More on that in a bit, now AMD ...
- AMD x86 Complete Design Lineage
AMD Gen 1 (1992*):  i386/486 ISA -- 386, 486, 5x86, K5*
- Non-superscalar:  ALU + optional FPU (std. in K5)
AMD Gen 2 (1994*):  i486/686 ISA -- Nx586+FPU/K5*, Nx686/K6
- Superscalar:  3+1 ALU+FPU (ALUs pipelined, FPU _not_ pipelined)
AMD Gen 3 (1999):  i686/x86-64 ISA -- Athlon, Athlon64/Opteron
- Superscalar:  3+3 ALU+FPU (pipelined), FPU 2 _and_ 1 ADD/MULT
- Extensions are microcoded and leverage ALU/FPU as appropriate
*NOTE:  The NexGen Nx586 released in 1994 forms the basis for
later K5 (i486) and the K6 (i686).  AMD had scalability issues with
its original non-superscalar K5 design and purchased NexGen.
SIDE NOTE:  SSE Comparison
- P4 can do 3 MULT SSE (1 FPU complex + 2 SSE pipes)
- Athlon can do 3 MULT SSE (2 FPU complex + 1 FPU MULT)
Contrary to popular opinion, Athlon64/Opteron is the _same_core_
design as the 32-bit Athlon platform.  It is still the same, ultra-
powerful 3+3 ALU+FPU core, with its 2 complex + 1 ADD/MULT
FPU able to equal Intel's 1 complex _or_ 2 ADD FPU plus 2 SSE
pipes at doing the majority of matrix transforms (which are MULT --
hence why Intel's FPU can't do 2 simultaneously, and relies heavily
on its precision-lacking SSE pipes).
Also contrary to popular opinion, 40-bit/1TiB Digital Alpha EV6
interconnect forms the basis for _all_ addressing in _all_ Athlon
releases, including the 32-bit.  There are a few mainboards that
allow even 32-bit Athlons to safely address above 4GB with
_no_ paging or issues (with an OS that offers such a supporting
kernel, like Linux).  The 3-16 point EV6 crossbar, rather than a
"hub" architecture, forced Athlon MP to put any I/O coherency
logic in the chip, so the AGPgart control is actually on the Athlon
MP, and not in the northbridge.  This has evolved into a full
I/O MMU in Athlon64/Opteron.
Because Athlon is 5 years newer than Intel i686, and there is
a wealth of talent influx from Digital (even though Intel did get
some as well, they haven't redesigned i686 completely), Athlon
has some of the latest, run-time register renaming and out-of-order
execution control in the core itself.  This is why doing something
like HyperThreading would benefit AMD _very_little_ and largely
introduce self-defeating (and even performance reducing) overhead.
In addition to the design of PAE52, the #1 reason why you can
safely assume AMD is moving towards virtualization is because of
the design limits they put on Athlon64/Opteron.  E.g., although the
original 32-bit Athlon platform used logic that allowed up to the
full EV6 8MB SRAM addressing (cache), Athlon64/Opteron has been
artificially limited to 1MB SRAM (saving many considerations and
other benefits).  This clearly indicates AMD did not consider
Athlon64/Opteron
- The Evolution to Virtual Cores
AMD's adoption of '90s concepts of register renaming and out-of-order
execution are great for a single core.  And Intel's HyperThreading
with the minor P4 run-time additions passes-the-buck decently in lieu
of a complete core redesign (which they haven't done since 1994).
But the concept of extending the pipes any further for performance
has been largely broken in the P4, and Intel is actually falling back
to its last rev of the i686 original, P3.
Multiple, _physical_ cores have been the first step.  This is little more
than slapping in a second set of all the non-SRAM transistors, plus
any additional bridging logic, if necessary.  AMD HyperTransport
requires none -- as HyperTransport can "tunnel" anything, EV6
memory/addressing, I/O tunnels/bridges, inter-CPU, etc... all
"gluelessly."  Intel MCH GTL+ cannot, and requires bridges between
the "chipset MCH" and the "multi-core MCH," adding latency.  And
there are nagging 32-bit limitations with GTL+ as well (long story).
The next logical evolution in microprocessor design is to blur the
physical separation between cores.  It's the best way without tearing
down the entire '70s-induced concept of machine code (operator+
operand, possibly control, at least microcoded internally) and the
resulting instruction sets.  Instead of discrete, superscalar units
of a half-dozen to a dozen, pipelined units, there will be numerous,
independent pipes, possibly with their own registers or a number
of generic registers, as a single unit.  Other than the controlling
firmware and/or OS, this is _not_ what software will use.
What the software will use are the virtual instantiations that
partition this set of pipes and registers, which may very well be
dynamic in nature.  Let's say I boot Windows, I might instantiate
a virtual i686/PAE36 core guaranteeing 100% full Win32
compatibility.  Depending on what resources the chip physically
has, I will likely even instantiate multiple i686 processors.  The
concept of multi-CPU and multi-threading has evolved into
virtual-cores with virtual-threading.  Virtualizing more CPUs with
more total pipes/registers than physically exist will allow
more registers and pipelines to be executing instead of the
common 40-50% for superscalar CISC or 60-70% for superscalar
RISC.
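
To make the oversubscription argument concrete, here is a toy Python
model.  It is only a sketch under loud assumptions: it does not
describe any real or announced chip, it assumes the pool can hand any
idle physical pipe to any virtual core with no conflicts, and the
names VirtualCore and pool_utilization are invented for illustration.
The 50% busy figure is picked from the 40-50% range quoted above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class VirtualCore:
    """A hypothetical virtual core as seen by a guest OS."""
    name: str
    pipes: int            # pipes the guest believes it owns
    busy_fraction: float  # share of those pipes it keeps busy per cycle


def pool_utilization(physical_pipes: int, cores: List[VirtualCore]) -> float:
    """Fraction of the physical pipes doing work in an average cycle.

    Assumes ideal scheduling: demand from all virtual cores is pooled
    and served by any free physical pipe, capped at what exists.
    """
    demand = sum(c.pipes * c.busy_fraction for c in cores)
    return min(demand, physical_pipes) / physical_pipes


if __name__ == "__main__":
    one = [VirtualCore("vcpu0", pipes=12, busy_fraction=0.5)]
    two = one + [VirtualCore("vcpu1", pipes=12, busy_fraction=0.5)]
    print("1 virtual core :", pool_utilization(12, one))
    print("2 virtual cores:", pool_utilization(12, two))
```

Under those idealized assumptions, presenting 24 virtual pipes on top
of 12 physical ones takes the pool from half busy to fully busy.  Real
hardware would lose some of that to dependencies and contention, which
this model ignores.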
As an "added bonus," this means the 48-bit/256TiB constraint
for PAE36 compatibility is _removed_.  I.e., you can have a
much larger, true memory pool, and any required windowing/
segmentation is done with_out_ paging by the "host" memory
model, even though the OS is virtually running in a PAE36 or
PAE52 model.
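
The windowing idea can be sketched in a few lines of Python.  This is
purely an illustration of the concept as described here, not any
vendor's mechanism: the Window class, its fields and the choice of a
host base address at 2**50 are all assumptions made up for the
example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Hypothetical window of host physical memory given to one guest."""
    base: int  # host physical address where the guest's memory starts
    bits: int  # width of the guest's own address space (36 for "PAE36")

    def to_host(self, guest_addr: int) -> int:
        """Translate a guest physical address to a host physical address."""
        if not 0 <= guest_addr < (1 << self.bits):
            raise ValueError("guest address outside its memory model")
        return self.base + guest_addr


if __name__ == "__main__":
    # A "PAE36" guest placed above the 48-bit/256TiB line of the host.
    guest = Window(base=1 << 50, bits=36)
    host_addr = guest.to_host(0x1000)
    print(hex(host_addr), host_addr >= (1 << 48))
```

The guest still sees nothing beyond its own 36-bit model, while the
host is free to place that window anywhere in a larger pool.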
This also gives rise to an entirely new platform for virtualization
of simultaneous OSes -- be it the same OS, or different OSes.
Because cores are virtual, you can have multiple, independent
processors with their own registers, memory windows into
physical RAM, etc...  On the more "consumer" front, this will
allow it to work with existing OSes as-is.  On the more "load
balancing server" front, this will often be paired with software
(think EMC/VMWare *SX products) so numerous instances can
be dynamically load-balanced across virtual cores -- but far
more of the overhead is moved onto the chip, increasing efficiency.
But it is still managed by software (just with reduced
context switching overhead in the software).
Again, it's really just a consolidation of all the run-time
optimizations we have now, along with both multi-core and
multi-threading approaches, into a general pool of pipes,
registers and organization.  Additionally, it breaks the physical
constraints of the memory model for the physical hardware,
which is a very big issue for our future.  To ensure x86/PAE36
and x86-64/PAE52 compatibility in the future, such machines
will need to be virtualized or we'll be stuck at 48-bit/256TiB.
...
As in is it worth anything?
Yes -- and almost everything to the future of Microsoft being able to
sustain much of their existing Win32 codebase which does _not_ port
to PAE52 very easily and definitely _not_ with full compatibility.
And we have to break the 48-bit/256TiB limitations of PAE52,
while still ensuring PAE52 OSes/applications, as well as some
legacy PAE36 OSes/applications, still run.  The only way is to
virtualize the whole freak'n chip so we can instantiate a processor,
registers and its memory model -- even if dynamically assigned/
shared.  And that's just for end-users, possibly workstations and
entry servers.
For load-balancing servers, you'll still need a software solution
for management.  The hardware will just offer far
greater efficiency and reduced context switching.  In fact, the
next consolidation is these virtual core chips in blades, where
you not only manage the virtual cores in the individual chips/
blades, but an entire rack of blades as a single unit with multiple
OSes spread across.  This already exists, but this takes it one
step further -- because the processors themselves are virtualized
with greatly reduced overhead on the part of the software.
...
Will it allow a dual simultaneous boot of Linux+WinXP+MacOS
under Xen or something along those lines?
Yes.
It will both give more virtualized processors to a single executing
OS, as well as create segmented, virtualized processors for
independently and simultaneously operating processors.
...
Even on an SMP machine?
First off, remove the Intel-centric notion of "Symmetric" MP (SMP).
Secondly, multi-processing and multi-threading are going to merge
with traditional register renaming and out-of-order execution.  So
the traditional concept of "MP" is _dying_.  In fact, in the '90s,
it really died in general.
I know it's hard to think outside the box and traditional thought,
but most users don't understand superscalar design in the first
place.  Those who do understand why AMD has _not_ bothered
to adopt Intel SMT (HyperThreading) in Athlon, because it won't
benefit (because AMD's cores are 5 years newer in design, and
put far more optimizations in the chip to keep pipes full and
registers used than to virtualize two sets for the OS to use).
...
Anyone have any experience/knowledge about this?
I can only speculate based on the history of the players involved,
as well as AMD's PAE52 design and the limitations of the
current Athlon core (which is largely the _same_ between both the
32-bit and newer 64-bit versions).
But the concept of adding more pipes with lots of stages for
timing is only leaving more and more stages in pipes empty,
or doing little.  There has to be a consolidation of many
run-time optimizations inside of the chip, and the best way to
do that is to create a bank of pipes, registers, etc... and virtually
assemble them into virtual cores that are partitioned with memory
as a traditional PAE36 or PAE52 processor (or multi-processor).
It's going to solve a _lot_ of issues -- both semiconductor and
software.
...
What level of CPU/hardware(?) does the virt-core support?
And is the virt-core 32bit?
You can be certain that the "host" OS (possibly firmware-based?)
will be able to instantiate multiple PAE36 and/or PAE52 virtual
systems with their own and -- I'll use legacy terminology here 
(even if it's not technically correct) -- "Ring 0" access.  So,
technically, it should be possible to run any PAE36 or PAE52
OS simultaneously on the same hardware as any other PAE36 or
PAE52 OS.
The larger issues of firmware-OS interoperability as well as
partitioning resources (memory, disk, etc...) are really more of a
political/market issue.  I.e., AMD and Intel can provide the platform,
but people have to work together to use it.  Furthermore, it also
means that Intel can continue to best AMD in funding of OEMs and
firmware/software vendors, so it still has an advantage in that
capacity.
I'm sure Apple will be protective of its firmware, and Intel's new,
supposed "open" firmware is rather proprietary.  As I've repeatedly
commented elsewhere, the 2 "most open" hardware vendors right
now are AMD and Sun, x86-64 and SPARC, respectively.  Intel
has not only protected non-programmer aspects of IA-64 heavily,
but most of their new platform developments for even IA-32e
(EM64T) are _very_proprietary_.  IBM is partially doing the same
with Power in a microelectronics offering, but it is _not_ the
same in its branded Power solutions (among others).
So it's not going to solve vendors who require firmware and
data organization that is not open and standardized.  We're
fine on legacy Win32 platforms, but it's not going to address
Mactel, nor solve the problem of existing OSes that don't run
under current virtualization solutions because of such
proprietary requirements.
--
Bryan J. Smith   mailto:b.j.smith@ieee.org