On Tue, 2005-06-21 at 12:40 -0500, Bryan J. Smith wrote:
From: Dan Pritts danno@internet2.edu
I read a web page that suggested that in some cases software built for "i686" would not in fact work on Via C3 processors (this is near & dear to my heart since I just bought a motherboard based on one). The C3 is definitely a modern platform - it's not fast by modern standards but it works well enough for many applications and its heat/power requirements are wonderful (circa 10 watts).
It all depends how you define "modern."
The IDT-Centaur team found that it was very easy to build a chip that dedicated more transistors to a larger cache than to get into a lot of pipeline optimizations, out-of-order execution, etc... They were able to build the WinChip in 18 months -- instead of 36+ for a traditional design. The first design also ran on standard 3.3V CMOS and was quite a bit more tolerant of variances.
Cyrix did similar with the design of the M2/Geode. The C3 is ViA's evolution of the WinChip-M2 line, in a Socket-370 package for Intel GTL+ under official license from Intel. As much as non-x86 platforms promise more performance for lower power, it's hard to best some of the low-power designs of the x86 world at their economies of scale.
The discussion suggested that the "cmov" instruction was the problem.
Yes, the "cmov" instruction is an optional i686 instruction in Intel's own documentation. ViA (possibly both Cyrix and IDT-Centaur too) probably thought it was either a waste of transistors or, more likely, a timing nightmare to integrate into the ALU/control. In fact, if it was considered optional, it sounds like an engineer realized that it could have design impact when the i686 ISA was written (probably in advance of the actual release of the Pentium Pro, or in consideration of Intel's own, future i686 designs).
<flammage=on> Intel continues to be the poster child for 1970s CS thought when it comes to overbloat of an already CISC pig. Ironically, at one time, they partnered with Digital on the Alpha chip, which is definitely the most over-anal of RISC architectures. I.e., if it could hold up timing, it didn't go in the AXP ISA -- and they were extremely anal on this to the point of not even including an 8 or 16-bit LOAD/STOR (although Digital finally caved in this in 21164 -- but all other 8/16-bit data operations were always left out).
</flammage>
The GCC i686 target, unfortunately, assumes it always exists. I can semi-understand the logic, because having the run-time always test for it would add a number of bytes to every single program. And the software workaround takes over 100x more clock cycles.
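Since the compiled code does not test for cmov itself, the check has to happen before i686 packages go onto the box. A minimal sketch, assuming a Linux system whose /proc/cpuinfo lists feature flags on a "flags" line; the function names and the two sample strings are made up for illustration and are not captured from real hardware:

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_cmov(cpuinfo_text):
    """True if the CPU reports the cmov feature flag."""
    return "cmov" in cpu_flags(cpuinfo_text)


# Hypothetical excerpts, for illustration only.
SAMPLE_WITH_CMOV = "model name\t: Example CPU A\nflags\t\t: fpu de pse tsc msr cx8 cmov mmx\n"
SAMPLE_WITHOUT_CMOV = "model name\t: Example CPU B\nflags\t\t: fpu de pse tsc msr cx8 mmx\n"

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""  # not Linux, or /proc not mounted
    print("cmov present" if has_cmov(text) else "cmov not reported")
```

If cmov is not reported, staying with i386/i486/i586 builds is the safe choice.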
I, of course, had similar upgrade problems to the original poster. I don't really care about the performance optimizations from "i386" to "i686" as long as I'll continue to have something that works.
i486 was liberally licensed by Intel under reasonable terms after a US court said numbers could not be trademarked (Intel was hoping to make money on trademark licensing in conjunction).
i686/GTL+ was only licensed by ViA, although I'll assume the lack of a cmov instruction in Cyrix/IDT-Centaur designs pre-dates that license.
One additional interesting data point. The CentOS 4.0 installer gave me the "i586" glibc and the "i686" kernel. I would hope that this would be consistent.
That's not what you want. You want to drop down to an i486 instead of running i586.
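To see which builds the installer actually picked, one option is to feed the output of `rpm -q --qf '%{NAME} %{ARCH}\n' glibc kernel` through a small script. A rough sketch only; the helper names are hypothetical, and the sample string simply mirrors the i586 glibc / i686 kernel mix described above:

```python
def package_arches(rpm_output):
    """Map package name -> set of architectures from 'name arch' lines."""
    arches = {}
    for line in rpm_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and messages like "package X is not installed"
        name, arch = parts
        arches.setdefault(name, set()).add(arch)
    return arches


def distinct_arches(rpm_output):
    """Sorted list of the distinct architectures in use, ignoring noarch."""
    found = set()
    for arch_set in package_arches(rpm_output).values():
        found |= arch_set
    found.discard("noarch")
    return sorted(found)


# Sample mirroring the mix reported in this thread.
SAMPLE = "glibc i586\nkernel i686\n"

if __name__ == "__main__":
    arches = distinct_arches(SAMPLE)
    if len(arches) > 1:
        print("mixed architectures:", ", ".join(arches))
    else:
        print("consistent:", ", ".join(arches))
```

More than one architecture in the result means the installed set is mixed, as it was here.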
Here's a general guide:
--march=i486
Runs well on a non-superscalar i486, obviously. Runs fairly well on a superscalar ix86 that does i486, although optimizations for the specific architecture should be used. Portions may run like crap on i586 (true Pentium/MMX), especially ALU control.
--march=i486 --mtune=i686
Still runs well on a non-superscalar i486 because superscalar optimizations don't affect it. Runs near-optimal on a superscalar ix86 that has at least the same 7-issue core of the Pentium Pro-P3 (2+2+3 ALU+FPU+control). Even more likely to run like crap on i586 because it assumes the ALU is 2-issue and well-designed, and i586's just ain't anywhere near 'dat!
--march=i486 --mtune=i586 (or --march=i586)
Improves performance of many operations significantly on i586. On i486 (only when built with --march=i486; --march=i586 code will not run there) and on i686, it will use the chip rather inefficiently.
E.g. (and this is just 1 example), generated machine code will ripple integers through what it assumes is a pipelined FPU. On an i486 or anything but an Athlon clone, this will not only tie up the FPU, but significantly and artificially delay integer loads. On a true Intel i686 or AMD Athlon, it will leave the 2 and 3 ALU pipes, respectively, unused. Worse yet, the original Nx586 (through the Athlon) is 3x faster at ALU loads than i586, so this is a clear "de-optimization."
And that's just 1 example. ;->
--march=i686
Offers little performance gain over --march=i486 --mtune=i686, while not running at all on an i486 or i586.
--march=i486|i686 --mtune=pentium4
Offers not only improved performance on the Pentium 4, but all Athlons as well.
--march=i486|i686 --mtune=athlon
This option has been widely debated. It really _kills_ Intel performance because Intel i686's 2+2 (ALU+FPU) is not going to handle optimizations for AMD Athlon's 3+3 (ALU+FPU) and is going to have lots of stalls (especially on Pentium 4, don't get me started ;-). At the same time, the Athlon benefit is debatable for general applications -- although engineering and scientific _will_ see a good boost (40% is not uncommon).
The reason why is simple. Intel's 1 complex _or_ 2 ADD FPU only allows 1 MULT at full 64-bit precision (and not SSE "lossy math") whereas AMD's 2 complex _and_ 1 ADD/MULT allows 3 MULT at full 64-bit precision.
Plus, today, you're typically going to be running Athlon64/Opteron, and you get such "optimizations for free" on x86-64 targets.
SIDE NOTE: -O3 ... _never_ run -O3 with --march=pentium3, pentium4 or athlon. You're asking for "lossy math" in the case of the former, and unstable optimizations in the case of the latter. It should _only_ be used when you developed the application and know what you're doing.
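The compatibility half of the guide above (which baseline runs where) boils down to a simple ordering plus the cmov caveat. A sketch under stated assumptions: the names `LEVELS`, `can_run` and `gcc_flags` are invented for illustration, and treating a C3 as "i686-class without cmov" is a simplification of what this thread describes. Note that the GCC manual documents these options with a single dash (-march=, -mtune=), which is what the helper emits.

```python
# x86 baselines in increasing order of required instruction set.
LEVELS = ["i386", "i486", "i586", "i686"]


def can_run(march, cpu_level, cpu_has_cmov=True):
    """True if code built for the 'march' baseline should run on the given CPU class.

    The tuning target only affects instruction scheduling, not which
    instructions are used, so it plays no part in this check.
    """
    if LEVELS.index(march) > LEVELS.index(cpu_level):
        return False  # baseline needs a newer instruction set than the CPU has
    if march == "i686" and not cpu_has_cmov:
        return False  # GCC's i686 target assumes cmov is present
    return True


def gcc_flags(march, mtune=None):
    """Build the option list for a baseline and an optional tuning target."""
    flags = ["-march=" + march]
    if mtune is not None:
        flags.append("-mtune=" + mtune)
    return flags


if __name__ == "__main__":
    print(" ".join(gcc_flags("i486", "i686")))
    print(can_run("i686", "i686", cpu_has_cmov=False))
```

The point of the i486-baseline-plus-tuning combinations is visible here: `can_run("i486", ...)` holds for every CPU class from i486 up, with or without cmov, while the tuning target is free to change.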
All of that is probably true ... but the optimizations are already set by default by the config.guess and autoconf and automake for almost all RedHat SRPMS ... and CentOS doesn't change those, unless absolutely necessary.
Therefore, for almost all packages on CentOS-3 they are --march=i386 --mtune=i686 ... and for most CentOS-4 packages they are --march=i386 --mtune=pentium4 ... being that the default target is i386.