Ok earlier you said: "At more than 1GiB on Linux/x86, you must use a 4G+4G kernel (this is the default) to see more than 960MiB. This causes a significant (10%+) performance hit. On more than 4GiB, it is worsened as more extensive paging is used."
Note I said "Linux/x86" and _not_ "Linux/x86-64".
:)
Where does the performance hit for 4G/4G on Intel (whether IA-32e or not) come from?
The performance hit applies to _all_ IA-32-compatible processors running Linux/x86, not just Intel's -- there is definitely a hit.
There's a hit for the 4G+4G HIGHMEM model; it comes mainly from the TLB flush on every kernel/user transition, since kernel and user no longer share one set of page tables. And there is another, bigger one if you go to the 64G model (PAE, for more than 4GiB of physical memory).
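To make the HIGHMEM part of that concrete, here's a kernel-style sketch (illustrative only, not from any real driver) of why merely touching a highmem page costs extra:

    /* Illustrative sketch of the HIGHMEM tax: pages above lowmem have
     * no permanent kernel mapping, so each touch needs a temporary
     * kmap()/kunmap() pair.  With HIGHMEM off, kmap() reduces to
     * page_address() and this overhead disappears. */
    #include <linux/highmem.h>
    #include <linux/string.h>

    static void zero_highmem_page(struct page *page)
    {
            void *vaddr = kmap(page);   /* may install a temporary PTE */
            memset(vaddr, 0, PAGE_SIZE);
            kunmap(page);               /* tear the mapping back down */
    }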
As for _both_ Intel IA-32 on Linux/x86 _and_ Intel IA-32e (EM64T) on Linux/x86-64: you _always_ have "bounce buffers" (courtesy of the software I/O MMU -- the swiotlb on Linux/x86-64 for EM64T) whenever a transfer between two memory areas -- e.g., user memory and memory-mapped I/O -- has _one_ area above 4GiB. There is no way around that, and it is a major problem with Intel right now.
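Roughly what that bounce buffering amounts to, as a toy user-space sketch (the names here are made up for illustration; the real code is the kernel's swiotlb):

    /* Toy model of a bounce buffer: if the device can only address
     * memory below 4GiB and the data sits above it, the kernel copies
     * through a pre-allocated low buffer -- that copy is the cost. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define DMA_LIMIT 0x100000000ULL        /* 32-bit device addressing */

    void *map_for_dma(void *buf, uint64_t phys, size_t len,
                      void *low_pool)       /* pre-allocated below 4GiB */
    {
            if (phys + len <= DMA_LIMIT)
                    return buf;             /* device reaches it directly */
            memcpy(low_pool, buf, len);     /* the "bounce": extra copy */
            return low_pool;                /* device DMAs the low copy */
    }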
Right, so if I have 2G of RAM, I want a 2G/2G (kernel/user) split instead of 1G/3G, so that I don't have to turn on HIGHMEM and can avoid its penalty.
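The arithmetic behind that choice, as a small sketch (assuming the traditional 128MiB vmalloc reserve on x86 -- a 64MiB reserve would give the 960MiB figure quoted earlier; exact numbers vary by config, and a 2G/2G split still loses the reserve-sized sliver of a 2GiB box):

    /* Sketch of the lowmem arithmetic behind the split choice.  The
     * kernel can only direct-map RAM between PAGE_OFFSET and 4GiB,
     * minus a reserve for vmalloc (traditionally 128MiB on x86). */
    #include <stdio.h>

    #define VMALLOC_RESERVE (128UL << 20)

    static unsigned long lowmem_mib(unsigned long page_offset)
    {
            unsigned long kernel_space = 0xFFFFFFFFUL - page_offset + 1;
            return (kernel_space - VMALLOC_RESERVE) >> 20;
    }

    int main(void)
    {
            printf("3G/1G split (PAGE_OFFSET 0xC0000000): %lu MiB lowmem\n",
                   lowmem_mib(0xC0000000UL));   /* prints 896 */
            printf("2G/2G split (PAGE_OFFSET 0x80000000): %lu MiB lowmem\n",
                   lowmem_mib(0x80000000UL));   /* prints 1920 */
            return 0;
    }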
x86-64 (AMD64) on Linux/x86-64 uses its I/O MMU hardware (the GART aperture) to remap DMA instead of bouncing it, which drastically improves performance. There were a few bugs early on, but most of them have been resolved.
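By contrast with the bounce-buffer sketch above, here is a toy model of what a hardware I/O MMU buys you (all names made up for illustration, not kernel API):

    /* Toy contrast: a hardware I/O MMU lets the kernel hand the device
     * a translated bus address inside a <4GiB aperture and skip the
     * copy entirely.  The hardware walks the table on each DMA. */
    #include <stdint.h>

    #define APERTURE_BASE  0x80000000ULL    /* bus window the device sees */
    #define APERTURE_SLOTS 256

    static uint64_t     iommu_table[APERTURE_SLOTS]; /* toy remap table */
    static unsigned int next_slot;

    /* "Map" any physical address: record the translation and return a
     * low bus address.  No memcpy, so no bounce-buffer penalty. */
    uint64_t iommu_map(uint64_t phys)
    {
            unsigned int slot = next_slot++ % APERTURE_SLOTS;
            iommu_table[slot] = phys;
            return APERTURE_BASE + ((uint64_t)slot << 12);
    }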
Does that mean that Linux on AMD64 does not do ZONE_NORMAL <-> ZONE_HIGHMEM buffering/paging?