From: Robert Hanson roberth@abbacomm.net
or should i be more specific with the question(s)? the reason i ask is that i just dumped 2 gig dram in a basic P4 Intel 3.0GHz box to play with. regards and TIA,
At more than 1GiB on Linux/x86, you must use a 4G+4G kernel (this is the default) to see more than 960MiB. This causes a significant (10%+) performance hit. On more than 4GiB, it is worsened as more extensive paging is used.
If you have 1GiB or less, you should rebuild with_out_ "HIGHMEM" support, which is a 1G+3G kernel, and you'll see better performance (and memory will be limited to 960MiB).
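(For a rough picture of where those numbers come from, here is a minimal user-space sketch. The 64MiB reserve is an assumption I picked so the arithmetic lands on the 960MiB figure quoted above; the real reserve depends on the kernel configuration.)

#include <stdio.h>

int main(void)
{
        unsigned long long virt_space   = 1ULL << 32;   /* 4GiB of 32-bit virtual space */
        unsigned long long kernel_win   = 1ULL << 30;   /* 1GiB kernel window (1G+3G)   */
        unsigned long long user_space   = virt_space - kernel_win;
        unsigned long long reserve      = 64ULL << 20;  /* assumed vmalloc/fixmap reserve */
        unsigned long long lowmem_limit = kernel_win - reserve;

        printf("user space   : %llu MiB\n", user_space / (1 << 20));
        printf("lowmem limit : %llu MiB (RAM above this needs HIGHMEM)\n",
               lowmem_limit / (1 << 20));
        return 0;
}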
In a nutshell, you should be running Linux/x86-64 on systems with more than 1GiB for optimal performance. If you have more than 4GiB of combined system and memory mapped I/O, you should be running Opterons with I/O MMUs. Intel EM64T systems need protections in place for both earlier-generation GTL+ limitations and the lack of an I/O MMU.
Much of the additional "tangent" centered on the fact that there are a few so-called "32-bit" Athlons that actually have a BIOS hack and Linux kernel support so they don't take a performance hit. Long story short, it has to do with the fact that even so-called "32-bit" Athlons have a core and underlying interconnect platform that supports 40-bit _linear_ addressing _natively_.
-- Bryan J. Smith mailto:b.j.smith@ieee.org
Bryan J. Smith b.j.smith@ieee.org wrote:
From: Robert Hanson roberth@abbacomm.net
or should i be more specific with the question(s)? the reason i ask is that i just dumped 2 gig dram in a basic P4 Intel 3.0GHz box to play with. regards and TIA,
At more than 1GiB on Linux/x86, you must use a 4G+4G kernel (this is the default) to see more than 960MiB. This causes a significant (10%+) performance hit. On more than 4GiB, it is worsened as more extensive paging is used.
If you have 1GiB or less, you should rebuild with_out_ "HIGHMEM" support which is a 1G+3G kernel, and you'll see better performance (and memory will be limited to 960MiB).
I thought they have done away with the high memory bounce buffers?
Can you explain what Andi means by this? ----quote---- Current X86-64 implementations only support 40 bit of address space, but we support upto 46bits. This expands into MBZ space in the page tables.
-Andi Kleen, Jul 2004 ----quote----
Does it mean that we don't need no fancy tweaks to get direct addressing for over 1G or over 4G?
Is that hack for Athlons limited/useful only to Athlon MP boards with the Linux option in BIOS or do Opterons also need that?
On Wednesday 29 June 2005 00:20, Feizhou wrote:
I thought they have done away with the high memory bounce buffers?
Unfortunately AMD gets punished for Intel's laziness. Intel does not want to implement an iommu. RedHat and others don't want to have to support two separate kernels - so they limit IO to the lowest 4GB no matter if you're running an Opteron or EM64T.
Can you explain what Andi means by this? ----quote---- Current X86-64 implementations only support 40 bit of address space, but we support upto 46bits. This expands into MBZ space in the page tables.
-Andi Kleen, Jul 2004 ----quote----
I assume this quote is from http://lwn.net/Articles/117783/, about the 4th page-table level? The memory that your process can use is split into several different segments, as listed in that article. The processes need to have (among other stuff) access to the kernel, shared memory and so on. For that they have to select a mapping - and the mapping was simply selected to support 46 bits...
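(To make the 4-level walk concrete, here is a small sketch - purely illustrative, not kernel code - of how a 48-bit virtual address is sliced into the four table indices, and where the "MBZ" bits from Andi's quote live.)

#include <stdio.h>
#include <stdint.h>

int main(void)
{
        uint64_t va = 0x00007f3a12345678ULL;   /* example user-space address */

        unsigned pml4 = (va >> 39) & 0x1ff;    /* level 4 (PGD) index, 9 bits */
        unsigned pdpt = (va >> 30) & 0x1ff;    /* level 3 (PUD) index, 9 bits */
        unsigned pd   = (va >> 21) & 0x1ff;    /* level 2 (PMD) index, 9 bits */
        unsigned pt   = (va >> 12) & 0x1ff;    /* level 1 (PTE) index, 9 bits */
        unsigned off  =  va        & 0xfff;    /* 4KiB page offset, 12 bits   */

        printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n",
               pml4, pdpt, pd, pt, off);

        /* The physical address stored in a page-table entry can architecturally
         * go up to 52 bits; bits above what the CPU implements (40 on current
         * parts, 46 in the kernel limit Andi mentions) must be zero - the
         * "MBZ space" from the quote. */
        return 0;
}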
Does it mean that we don't need no fancy tweaks to get direct addressing for over 1G or over 4G?
Is that hack for Athlons limited/useful only to Athlon MP boards with the Linux option in BIOS or do Opterons also need that?
No - Opterons running in 64-bit mode don't need any games to address more than 4GB.
Peter.
On Wed, 2005-06-29 at 00:26 -0400, Peter Arremann wrote:
Unfortunately AMD gets punished for Intel's laziness. Intel does not want to implement an iommu.
Actually, it was more that because AMD doesn't have all processors connecting to a "hub" over the same connection, AMD was _forced_ to develop GARTs for _all_ I/O in the so-called "32-bit" Athlon MP. The I/O MMU in the Athlon 64 and Opteron is just an evolution of that.
GARTs and I/O MMUs are _nothing_ new in RISC platforms that are switched or meshed. Intel just hasn't developed such a beast, and 3rd-party proprietary Xeon or Itanium systems that need one use "glue" logic (and costly system redesign). AMD was just the first to offer a commodity one that comes built-in (along with other goodies).
The "bonus" is that an I/O MMU solves the _real_world_ issue that some I/O cards and drivers only do 32-bit addressing, and are incapable of handling memory mapped I/O above 4GiB.
RedHat and others don't want to have to support two separate kernels - so they limit IO to the lowest 4GB no matter if you're running an Opteron or EM64T.
? I was unaware this is how they handled Opteron. I thought Red Hat _dynamically_ handled EM64T separately in their x86-64 kernels, and that was a major performance hit.
I need to go research this ...
I assume this quote is from http://lwn.net/Articles/117783/? about the 4th page table level? The memory that your process can use is split in several different segments as listed in that article. The processes need to have (among other stuff) access to the kernel, shared memory and so on. For that they have to select a mapping - and the mapping was simply selected to support 46 bits...
Actually, it sounds like they were good with 3-level at 39-bit for the current generation of x86-64, which only does 40-bit/1TiB. Unless, of course, that was a compatibility issue with running 32-bit, PAE36 and PAE52 programs simultaneously.
I wonder if the 4-level is a performance hit, which is not ideal. Maybe there is a way to disable it if there is no compatibility issue?
On Wed, 2005-06-29 at 00:26 -0400, Peter Arremann wrote:
RedHat and others don't want to have to support two separate kernels - so they limit IO to the lowest 4GB no matter if you're running an Opteron or EM64T.
On Wed, 2005-06-29 at 00:01 -0500, Bryan J. Smith wrote:
? I was unaware this is how they handled Opteron. I thought Red Hat _dynamically_ handled EM64T separately in their x86-64 kernels, and that was a major performance hit.
Looking again at the release notes ...
http://www.centos.org/docs/3/release-notes/as-amd64/RELEASE-NOTES-U2-x86_64-en.html#id3938207
From the looks of it, it's not just whether memory mapped I/O is above
4GiB, but _any_ direct memory access (DMA) by a device where either the source or destination is above 4GiB. I.e., the memory mapped I/O might be below 4GiB, but the device might be executing a DMA transfer to user memory above 4GiB.
That's where the "Software IOTLB" comes in, _only_enabled_ on EM64T.
If I remember back to the March 2004 onward threads on the LKML, that's how they dealt with it -- using pre-allocated kernel bounce buffers below 4GiB. A Linux/x86-64 kernel _always_ uses an I/O MMU -- it is just software for EM64T if either the source or destination address of a DMA transfer is above 4GiB.
I don't think it really matters where the memory mapped I/O is itself. Although it obviously is advantageous if it is set up under 4GiB on EM64T -- because it would only need the "bounce buffers" when a DMA transfer is to user memory above 4GiB, instead of _always_ if the memory mapped I/O was above 4GiB.
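(For anyone following along, here is a hedged, user-space illustration of what a bounce buffer / software IOTLB boils down to. The names dma_to_device, bounce_pool and send_buffer are made up for the example; it is not real driver code, just the concept: if the real buffer sits above the device's 32-bit DMA limit, stage the transfer through a pre-allocated low buffer and pay for an extra copy.)

#include <stdint.h>
#include <string.h>

#define DMA_32BIT_LIMIT  0x100000000ULL   /* 4 GiB */
#define BOUNCE_SIZE      4096

static char bounce_pool[BOUNCE_SIZE];     /* stands in for the pre-allocated low pool */

/* Pretend hardware DMA: in reality the device reads physical memory itself. */
static void dma_to_device(const void *buf, size_t len) { (void)buf; (void)len; }

static void send_buffer(const void *buf, uint64_t phys_addr, size_t len)
{
        if (phys_addr + len > DMA_32BIT_LIMIT) {
                /* "Bounce": copy down into a buffer the device can address. */
                memcpy(bounce_pool, buf, len);
                dma_to_device(bounce_pool, len);   /* the extra copy is the cost */
        } else {
                dma_to_device(buf, len);           /* direct, zero-copy path     */
        }
}

int main(void)
{
        char data[64] = "payload";
        send_buffer(data, 0x180000000ULL, sizeof data);  /* pretend it lives at 6GiB */
        return 0;
}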
On Wednesday 29 June 2005 02:15, Bryan J. Smith wrote:
On Wed, 2005-06-29 at 00:26 -0400, Peter Arremann wrote:
RedHat and others don't want to have to support two separate kernels - so they limit IO to the lowest 4GB no matter if you're running an Opteron or EM64T.
On Wed, 2005-06-29 at 00:01 -0500, Bryan J. Smith wrote:
? I was unaware this is how they handled Opteron. I thought Red Hat _dynamically_ handled EM64T separately in their x86-64 kernels, and that was a major performance hit.
Looking again at the release notes ...
http://www.centos.org/docs/3/release-notes/as-amd64/RELEASE-NOTES-U2-x86_64-en.html#id3938207
Yes, I used that URL before as well :-) I interpreted it differently though, as it being implemented for both... pci-gart.c shows in its init function iommu_setup where the initialization is done... the only place where it's called is in setup.c, where there is a define around it with CONFIG_GART_IOMMU... That's set to yes, so that code is compiled in. In pci-gart.c you can see that the string "soft" would have to be passed to that function for the Intel software MMU... I didn't have time to track down where it goes from there - my vacation is over and I need to get back to work...
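(A paraphrased, from-memory sketch of the kind of option parsing being described here - this is not the literal pci-gart.c source, just its rough shape: booting with iommu=soft forces the software IOTLB path.)

#include <stdio.h>
#include <string.h>

static int use_swiotlb;   /* stands in for the kernel's swiotlb flag */

static void iommu_setup_sketch(const char *opt)
{
        if (strncmp(opt, "soft", 4) == 0)
                use_swiotlb = 1;   /* force the bounce-buffer (soft IOTLB) mode */
        /* ... the real function also handles other options (off, force, ...) ... */
}

int main(void)
{
        iommu_setup_sketch("soft");          /* as if booted with iommu=soft */
        printf("use_swiotlb = %d\n", use_swiotlb);
        return 0;
}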
From the looks of it, it's not just whether memory mapped I/O is above
4GiB, but _any_ direct memory access (DMA) by a device where either the source or destination is above 4GiB. I.e., the memory mapped I/O might be below 4GiB, but the device might be executing a DMA transfer to user memory above 4GiB.
That's where the "Software IOTLB" comes in, _only_enabled_ on EM64T.
If I remember back to the March 2004 onward threads on the LKML, that's how they dealt with it -- using pre-allocated kernel bounce buffers below 4GiB. A Linux/x86-64 kernel _always_ uses an I/O MMU -- it is just software for EM64T if either the source or destination address of a DMA transfer is above 4GiB.
I don't think it really matters where the memory mapped I/O is itself. Although it obviously is advantageous if it is setup under 4GiB on EM64T -- because it would only need the "bounce buffers" when a DMA transfer is to user memory above 4GiB, instead of _always_ if the memory mapped I/O was above 4GiB.
Yes - the question was about bounce buffers... You need them for DMA access like I said before - and if Intel had implemented an I/O MMU, you wouldn't need them there either.
Peter.
On Wed, 2005-06-29 at 12:20 +0800, Feizhou wrote:
I thought they have done away with the high memory bounce buffers?
Correct. On x86-64, they have.
Can you explain what Andi means by this? ----quote---- Current X86-64 implementations only support 40 bit of address space, but we support upto 46bits. This expands into MBZ space in the page tables. -Andi Kleen, Jul 2004 ----quote---- Does it mean that we don't need no fancy tweaks to get direct addressing for over 1G or over 4G?
Correct. I haven't looked at how Linux/x86-64's paging works, but as long as they at least support 40-bit, they're good for the current generation of AMD64.
Some EM64T processors only do 36-bit; the newer ones do 40-bit (with a new TLB page table).
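(If you want to check what a given chip implements, CPUID leaf 0x80000008 reports the physical address width in EAX[7:0] and the virtual width in EAX[15:8]. A minimal GCC sketch:)

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
        unsigned eax, ebx, ecx, edx;

        if (!__get_cpuid(0x80000008, &eax, &ebx, &ecx, &edx)) {
                puts("CPUID leaf 0x80000008 not supported");
                return 1;
        }
        printf("physical address bits: %u\n", eax & 0xff);
        printf("virtual  address bits: %u\n", (eax >> 8) & 0xff);
        return 0;
}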
Is that hack for Athlons limited/useful only to Athlon MP boards with the Linux option in BIOS or do Opterons also need that?
The hack is _solely_ for the Athlon MP, and it's quite limited in scope. In a nutshell, it really enables features of the Athlon MP that _break_everything_. Hence it's rare, and Linux is the only OS with the hack.
Opteron with its x86-64 PAE (52-bit) mode is completely linear, up to 40-bit in its current, legacy EV6 logic implementation. The 52-bit virtual addressing using PAE allows 32-bit, legacy 36-bit PAE, as well as new 52-bit PAE applications to run.
Bryan J. Smith wrote:
On Wed, 2005-06-29 at 12:20 +0800, Feizhou wrote:
I thought they have done away with the high memory bounce buffers?
Correct. On x86-64, they have.
Ok earlier you said:
"At more than 1GiB on Linux/x86, you must use a 4G+4G kernel (this is the default) to see more than 960MiB. This causes a signficant (10%+) performance hit. On more than 4GiB, it is worsened as more extensive paging is used."
where does the performance hit for 4G/4G on Intel (whether ia32e or not) come from?
Feizhou wrote:
I thought they have done away with the high memory bounce buffers?
Bryan J. Smith wrote:
Correct. On x86-64, they have.
Feizhou wrote:
Ok earlier you said: "At more than 1GiB on Linux/x86, you must use a 4G+4G kernel (this is the default) to see more than 960MiB. This causes a significant (10%+) performance hit. On more than 4GiB, it is worsened as more extensive paging is used."
Note I said "Linux/x86" and _not_ "Linux/x86-64".
where does the performance hit for 4G/4G on Intel (whether ia32e or not) come from?
The performance hit is for _all_ IA-32 compatible architectures running Linux/x86, because there is definitely a hit.
There's a hit for the 4G+4G HIGHMEM model. And there is another, bigger one if you go 64G model (more than 4GiB user).
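(For the HIGHMEM part of that hit, here is a hedged, conceptual 2.6-era kernel fragment -- not a polished driver, just an illustration: a HIGHMEM page has no permanent kernel mapping, so every kernel-side access pays for a kmap()/kunmap() pair that lowmem pages never need.)

#include <linux/module.h>
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/gfp.h>
#include <linux/string.h>

static int __init highmem_demo_init(void)
{
        struct page *page = alloc_page(GFP_HIGHUSER);  /* may land above lowmem */
        void *vaddr;

        if (!page)
                return -ENOMEM;

        vaddr = kmap(page);             /* set up a temporary kernel mapping      */
        memset(vaddr, 0, PAGE_SIZE);    /* lowmem pages could skip the map/unmap  */
        kunmap(page);                   /* tear the temporary mapping down again  */

        __free_page(page);
        return 0;
}

static void __exit highmem_demo_exit(void)
{
}

module_init(highmem_demo_init);
module_exit(highmem_demo_exit);
MODULE_LICENSE("GPL");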
As far as _both_ Intel IA-32 on Linux/x86 _and_ Intel IA-32e (EM64T) on Linux/x86-64, you _always_ have "bounce buffers" (c/o the Soft I/O MMU, Soft IOTLB in Linux/x86-64 on EM64T) if you are doing a transfer between two memory areas -- e.g., user memory and memory mapped I/O -- when _one_ area is above 4GiB. No way around that, and a major problem with Intel right now.
x86-64 (AMD64) on Linux/x86-64 uses its I/O MMU hardware to drastically improve the performance. There were a few bugs early on, but most of them have been resolved.
I'm still checking on the "hack" for a select few so-called "32-bit" Athlon MP mainboards to find out all of its true capabilities. But I have _never_ been mistaken about the EV6 platform being natively capable of 40-bit addressing.
Ok earlier you said: "At more than 1GiB on Linux/x86, you must use a 4G+4G kernel (this is the default) to see more than 960MiB. This causes a significant (10%+) performance hit. On more than 4GiB, it is worsened as more extensive paging is used."
Note I said "Linux/x86" and _not_ "Linux/x86-64".
:)
where does the performance hit for 4G/4G on Intel (whether ia32e or not) come from?
The performance hit is for _all_ IA-32 compatible architectures running Linux/x86, because there is definitely a hit.
There's a hit for the 4G+4G HIGHMEM model. And there is another, bigger one if you go 64G model (more than 4GiB user).
As far as _both_ Intel IA-32 on Linux/x86 _and_ Intel IA-32e (EM64T) on Linux/x86-64, you _always_ have "bounce buffers" (c/o the Soft I/O MMU, Soft IOTLB in Linux/x86-64 on EM64T) if you are doing a transfer between two memory areas -- e.g., user memory and memory mapped I/O -- when _one_ area is above 4GiB. No way around that, and a major problem with Intel right now.
Right, so if I have 2G of RAM, I want 2G/2G (kernel/user) split instead of 1G/3G so that I don't have to turn on HIGHMEM and thus avoid the penalty of using HIGHMEM.
x86-64 (AMD64) on Linux/x86-64 uses its I/O MMU hardware to drastically improve the performance. There were a few bugs early on, but most of them have been resolved.
Does that mean that Linux on AMD64 does not do ZONE_NORMAL <-> ZONE_HIGHMEM buffering/paging?
where does the performance hit for 4G/4G on Intel (whether ia32e or not) come from?
The performance hit is for _all_ IA-32 compatible architectures running Linux/x86, because there is definitely a hit.
There's a hit for the 4G+4G HIGHMEM model. And there is another, bigger one if you go 64G model (more than 4GiB user).
As far as _both_ Intel IA-32 on Linux/x86 _and_ Intel IA-32e (EM64T) on Linux/x86-64, you _always_ have "bounce buffers" (c/o the Soft I/O MMU, Soft IOTLB in Linux/x86-64 on EM64T) if you are doing a transfer between two memory areas -- e.g., user memory and memory mapped I/O -- when _one_ area is above 4GiB. No way around that, and a major problem with Intel right now.
Right, so if I have 2G of RAM, I want 2G/2G (kernel/user) split instead of 1G/3G so that I don't have to turn on HIGHMEM and thus avoid the penalty of using HIGHMEM.
Ugh... RHEL4 kernels do not provide a 2G/2G split... only a 4G/4G option. The documentation says a 0% -> 30% performance hit... and to treat it as 20%...
Sorry about the 'no need for HIGHMEM' part...that is still needed to see more than 1G, my mistake.
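(If you want to see how the split actually came out on a given box, something like this reads the LowTotal/HighTotal lines that HIGHMEM-enabled kernels expose in /proc/meminfo -- they show how much RAM ended up directly mapped versus in HIGHMEM.)

#include <stdio.h>
#include <string.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f) {
                perror("/proc/meminfo");
                return 1;
        }
        while (fgets(line, sizeof line, f)) {
                /* These two fields only appear on HIGHMEM-enabled kernels. */
                if (strncmp(line, "LowTotal:", 9) == 0 ||
                    strncmp(line, "HighTotal:", 10) == 0)
                        fputs(line, stdout);
        }
        fclose(f);
        return 0;
}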