[CentOS] [OT] Memory Models and Multi/Virtual-Cores -- WAS: 4.0 -> 4.1 update failing

Tue Jun 28 17:00:01 UTC 2005
Bryan J. Smith <b.j.smith@ieee.org> <thebs413 at earthlink.net>

From: alex at milivojevic.org
> Well, this is how I interpreted Bryan's emails.  He'll probably correct me if
> I'm wrong (yeah, I have EE, but haven't done much EE work since I 
> graduated, it just happened to be mostly "software" things for me, so I'm
> rather rusty here  :-(

If you haven't noticed, I am _not_ big on "credentials."  As far as I'm
concerned, as long as someone has the knowledge, I couldn't care less
about the paper**.  Even most state BoPEs will allow you to replace
an ABET Accredited BSE with 8-12 years of experience (plus another
4-5 years experience post-degree) to qualify to become a PE.

There are semiconductor engineering mindsets, and they differ from
those of a programmer or even a technologist.

[ **NOTE:  Being a consultant, I finally had to "give in" to the paper.
I also decided to major in engineering "just in case I needed it" (and,
later on, got to use it for 2 years in the semiconductor industry). ]

> I don't think Bryan is talking about "external" width of the address bus,
> how many lines you see printed on the motherboard.

That's because it is _irrelevant_ in many cases.  You can mux lines,
which is exactly what the original EV6 Crossbar Athlon does.
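
A minimal sketch of why pin count and address width are separate things,
assuming a hypothetical 20-pin bus that carries a 40-bit address in two
cycles.  The pin count, the names, and the two-cycle scheme are all
illustrative assumptions; this is not a model of actual EV6 signaling.

```python
# Hypothetical illustration only: not a model of EV6 signaling.
ADDRESS_BITS = 40   # logical address width discussed in this thread
PIN_COUNT = 20      # assumed number of physical address pins


def mux_address(address, pins=PIN_COUNT, width=ADDRESS_BITS):
    """Split an address into pin-wide chunks, most significant chunk first."""
    if not 0 <= address < (1 << width):
        raise ValueError("address does not fit in the logical width")
    cycles = -(-width // pins)          # ceiling division
    mask = (1 << pins) - 1
    return [(address >> (pins * i)) & mask for i in reversed(range(cycles))]


def demux_address(chunks, pins=PIN_COUNT):
    """Reassemble the full address on the receiving side."""
    address = 0
    for chunk in chunks:
        address = (address << pins) | chunk
    return address


chunks = mux_address(0xABCDEF0123)
print(len(chunks), [hex(c) for c in chunks])
```

The receiver sees the same 40-bit address even though only 20 lines exist,
which is why counting traces says little about the addressing model.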

> He's talking how the things are organized internally, and about the way
> the bus logic works.  That would be what theoretically implementations
> might use, not what some specific implementation is limiting itself to.

Exactly.  Until the new 40-bit Xeon MPs came out, they were pretty much
"slap on" designs with extended ALUs and microcoding to be x86-64
compatible to a point.

> Sorry for mixing software and hardware from now on,

It's hard not to.

First you have to consider the programmer aspects.
Then you have to realize those will influence hardware compatibility.
Then you have to remember that with every new addressing model,
you _exponentially_ increase the external logic to drive it.

> just trying to make Peter see where his misconception is (the best
> way I can, which might not be good way at all).

It's fairly difficult to do it without breaking out a basic memory
controller circuit at least at the transistor level, or possibly the
combinational NAND gates it somewhat represents.

> The programming model for 32-bit userland applications is obviously
> limited to 32 bits -- the sizeof(void *) will tell you that.  So single 
> process can see only 32-bit logical linear address space.

To a point.  PAE36 is what allows you to break that, because even in
the i386, the 16-bit segment register that is offset 4-bits from the
32-bit offset register results in a 36-bit "normalized" address (it can
actually be 37-bit, but that's another story).  How the processor
handles that in a way that is compatible with the OS is the problem.

PAE36 uses "paging" from above 32-bit (up to 36-bit/64GiB) down to
below 32-bit (up to 4GiB).  PAE36 processors can support this paging,
and then have at least 36-bit traces on the platform.  GTL+ logic
was designed with this "slapped on," and never bothered with direct,
linear access until just recently (with the new 40-bit redesign for
Xeon MP).

Athlon, on the other hand, has always been 40-bit EV6.  AMD
decided to support the PAE36 paging, which added logic.  But they
_always_ had the "40-bit linear addressing for free" inherently in
the platform.  That's where these few BIOS hacks come in,
combined with an OS/model that can take advantage of it.  If you
enable this mode for Windows, it will _not_ work!
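
For scale, here is a small sketch of the address widths mentioned above.
The helper name `address_space_bytes` is made up for illustration, and
`struct.calcsize("P")` only reports the pointer width of the Python
interpreter it runs in, not anything about the processor or the bus.

```python
import struct


def address_space_bytes(bits):
    """Number of bytes addressable with a flat address of the given width."""
    return 1 << bits


GIB = 2 ** 30

# Pointer width of the running process (what sizeof(void *) reports in C).
pointer_bits = struct.calcsize("P") * 8
print(f"this process uses {pointer_bits}-bit pointers")

# Widths taken from the discussion: 32-bit, PAE36, and 40-bit.
for label, bits in [("32-bit", 32), ("PAE36", 36), ("40-bit", 40)]:
    print(f"{label}: {address_space_bytes(bits) // GIB} GiB")
```

The 40-bit case works out to 1024GiB (1TiB), which is well beyond the
64GiB that the PAE36 model can describe.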

> On the other hand, processor (the hardware) doesn't need to have
> such limits, if its internal organization is wider.  So, AMD (processor)
> is able to see 40-bit linear physical address space as one single big
> chunk of memory.  32-bit applications will have their
> 32-bit *logical* address space mapped into this 40-bit linear *physical*
> address space of processor.  I don't know if programming model of
> "32-bit" AMD processors allows you to have wider-than-32-bit pointers
> (even if it did, you would have to have compiler that can generate such
> code, gcc can't do it for sure).

Not true.  The model is _still_ PAE36, up to 36-bit/64GiB.  Normally
PAE36 OSes do paging.  But the combination of Athlon's "inherent
40-bit" addressing with an OS that can take advantage of it will
_not_ use paging.

> Obviously, the kernel needs to know how to manage things in this
> wider physical address space, the reason why you need patched Linux
> kernel to take advantage of it.

The actual amount of space doesn't change -- at least not under the
PAE36 model.  It's just how the OS commands the memory logic to
use it, and that also requires the firmware (BIOS) to pre-configure it
at POST.  Under a nominal PAE36 OS and board, the memory logic and
OS always do paging to access above 32-bit/4GiB.

But I don't think this hack is able to linearly address the entire 40-bit
space because of the limitations of what PAE36 can address.

> Intel (the processor) on the other hand, is able to physically address
> only 32-bit address space.  Anything wider than that, it needs to page.
> Dealing with paging will obviously be additional work for OS, hence
> lower performance.

Paging is a significant hit.  Anyone who runs the 4G+4G "HIGHMEM"
model of their Linux kernel and recompiles for the 1G+3G model will
see a noticeable performance gain (as long as they have only 960MiB
of memory).

> So while both processors will have more than 32 address lines on the
> packaging (and printed on the motherboard), minus a couple of the lowest
> ones that are not needed, as you can see in various specifications that
> you quoted, that doesn't mean the processor's core actually sees all that
> address space.




> How else to try explaining this...  Hm... Remember Intel 8086?  It could
> address only 16-bit address space, but it had more than 16 address lines
> on the packaging.  It used segmenting (hopefully the right term) to see
> wider-than-16-bit address space.  Now try to make analogy with what was
> discussed so far ;-)

It's somewhat correct, although it gets interesting.
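
For reference, a minimal sketch of the 8086 real-mode arithmetic behind
that analogy: a 16-bit segment is shifted left 4 bits and added to a
16-bit offset, and the 8086 drives the result onto 20 address lines.  The
function names are made up for illustration.

```python
def real_mode_linear(segment, offset):
    """8086 real-mode address arithmetic: (segment << 4) + offset."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset are 16-bit values")
    return (segment << 4) + offset


def on_8086_bus(linear):
    """The 8086 has 20 address lines, so anything above them wraps around."""
    return linear & 0xFFFFF


# Two different segment:offset pairs can name the same physical byte.
print(hex(real_mode_linear(0x1234, 0x0005)))
print(hex(real_mode_linear(0x1000, 0x2345)))

# The largest sum needs a 21st bit, which the 8086 cannot drive.
top = real_mode_linear(0xFFFF, 0xFFFF)
print(hex(top), hex(on_8086_bus(top)))
```

The registers stay 16-bit while the physical address is wider, which is
the same split between programming model and bus that this thread is about.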

It's more like EMS than XMS, because XMS used to shunt the processor in
and out of Protected286/386, and that required a 286/386 to access it.
You could run a true 24-bit/32-bit, respectively, to avoid that.

EMS worked on old 8088/8086 (as well as 80286/386), because pages
above 1MB could be mapped in.  The 80386 and some 80286 could
emulate this as well without a special card with special addressing.
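
A toy model of that kind of bank switching, assuming the usual EMS layout
of a 64KiB page frame split into four 16KiB slots.  The class and method
names are invented for illustration; this sketches the idea of mapping
pages into a fixed window and is not the actual EMS driver interface.

```python
PAGE_SIZE = 16 * 1024   # one logical page
FRAME_SLOTS = 4         # four slots make up the 64KiB window


class BankedMemory:
    """Large memory reachable only through a small, remappable window."""

    def __init__(self, logical_pages):
        self.pages = [bytearray(PAGE_SIZE) for _ in range(logical_pages)]
        self.mapping = [None] * FRAME_SLOTS   # slot -> logical page number

    def map_page(self, slot, logical_page):
        """Make a logical page visible in one slot of the window."""
        if not 0 <= slot < FRAME_SLOTS:
            raise IndexError("no such slot in the page frame")
        if not 0 <= logical_page < len(self.pages):
            raise IndexError("no such logical page")
        self.mapping[slot] = logical_page

    def _locate(self, window_offset):
        slot, offset = divmod(window_offset, PAGE_SIZE)
        if not 0 <= slot < FRAME_SLOTS:
            raise IndexError("offset is outside the page frame")
        page = self.mapping[slot]
        if page is None:
            raise RuntimeError("nothing mapped in this slot")
        return self.pages[page], offset

    def read(self, window_offset):
        page, offset = self._locate(window_offset)
        return page[offset]

    def write(self, window_offset, value):
        page, offset = self._locate(window_offset)
        page[offset] = value


mem = BankedMemory(logical_pages=128)   # 2MiB behind a 64KiB window
mem.map_page(0, 100)
mem.write(0, 0x5A)        # lands in logical page 100
mem.map_page(0, 7)        # remap the same slot to a different page
print(mem.read(0))        # page 7 is untouched
mem.map_page(0, 100)
print(hex(mem.read(0)))   # the earlier write is still in page 100
```

Every access to memory outside the window costs a remap first, which is
the overhead that linear addressing avoids.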

In the GTL+ bus, this is exactly what it does.  It offers special
addressing lines for a memory logic that pages, because that's all
the OS does.  Even the early EM64T processors had to deal with
this "limitation" of the GTL+ platform.

You want to ensure you get a new EM64T platform that doesn't
have that approach.


--
Bryan J. Smith   mailto:b.j.smith at ieee.org