[CentOS] CentOS 4.0 -> 4.1 update failing

Tue Jun 21 17:40:50 UTC 2005
Bryan J. Smith <b.j.smith@ieee.org> <thebs413 at earthlink.net>

From: Dan Pritts <danno at internet2.edu>
> I read a web page that suggested that in some cases software built for
> "i686" would not in fact work on Via C3 processors (this is near & dear
> to my heart since i just bought a motherboard based on one).  The C3
> is definitely a modern platform - it's not fast by modern standards but it
> works well enough for many applications and its heat/power requirements
> are wonderful (circa 10 watts).  

It all depends how you define "modern."

The IDT-Centaur team found that it was much easier to build a chip that
dedicated more transistors to a larger cache than to get into a lot of pipeline
optimizations, out-of-order execution, etc...  They were able to build the
WinChip in 18 months -- instead of 36+ for a traditional design.  The
first design also ran on standard 3.3V CMOS and was quite a bit more
tolerant of variances.

Cyrix did similar with the design of the M2/Geode.  The C3 is ViA's
evolution of the WinChip-M2 line, in a Socket-370 package for Intel
GTL+ under official license from Intel.  As much as non-x86 platforms
promise more performance for lower power, it's hard to best some of
the low-power designs of the x86 world at their economies of scale.

> The discussion suggested that the "cmov" instruction was the problem.

Yes, the "cmov" instruction is an optional i686 instruction in Intel's
own documentation.  ViA (possibly both Cyrix and IDT-Centaur too)
probably thought it was either a waste of transistors or, more likely,
a timing nightmare to integrate into the ALU/control.  In fact, if
it was considered optional, it sounds like an engineer realized that
it could have design impact when the i686 ISA was written (probably
in advance of the actual release of the Pentium Pro, or in consideration
of Intel's own, future i686 designs).

<flammage=on>
Intel continues to be the poster child for 1970s CS thought when it
comes to overbloat of an already CISC pig.  Ironically, at one time,
they partnered with Digital on the Alpha chip, which is definitely the
most over-anal of RISC architectures.  I.e., if it could hold up timing,
it didn't go in the AXP ISA -- and they were extremely anal on this
to the point of not even including an 8 or 16-bit LOAD/STOR (although
Digital finally caved on this in the 21164 -- but all other 8/16-bit data
operations were always left out).
</flammage>

The GCC i686 target, unfortunately, assumes it always exists.
I can semi-understand the logic, because having the run-time always test
for it would add a number of bytes to every single program.
And the software workaround takes over 100x more clock cycles.
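
Since the i686 target takes cmov for granted, one way to find out ahead of
time whether a box is affected is to look for the "cmov" flag that the Linux
kernel lists in /proc/cpuinfo.  A minimal sketch in C, assuming an x86 Linux
system; the helper names cpu_flags_have() and cpuinfo_has_flag() are made up
for illustration and are not part of any library:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated word in
 * `flags` (e.g. the text of a "flags" line), else 0.
 * Hypothetical helper; assumes `flag` itself contains no whitespace. */
int cpu_flags_have(const char *flags, const char *flag)
{
    size_t len;
    const char *p;

    assert(flags != NULL && flag != NULL);
    len = strlen(flag);
    if (len == 0)
        return 0;

    p = flags;
    while ((p = strstr(p, flag)) != NULL) {
        /* Accept the hit only if it is bounded by whitespace or string ends. */
        int starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        char end = p[len];
        int ends = end == '\0' || end == ' ' || end == '\t' || end == '\n';
        if (starts && ends)
            return 1;
        p += 1;
    }
    return 0;
}

/* Scan a /proc/cpuinfo-style file at `path` for a "flags" line that
 * contains `flag`.  Returns 1 if found, 0 if not, -1 if the file cannot
 * be opened.  Assumes a flags line fits in the 4096-byte buffer. */
int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[4096];
    FILE *fp = fopen(path, "r");
    int found = 0;

    if (fp == NULL)
        return -1;
    while (!found && fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "flags", 5) == 0)
            found = cpu_flags_have(line, flag);
    }
    fclose(fp);
    return found;
}
```

Calling cpuinfo_has_flag("/proc/cpuinfo", "cmov") from a small main() before
pulling in i686 packages should tell you whether the kernel thinks the CPU
has the instruction.  This only reads what the kernel reports; it does not
execute cmov itself.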

> I, of course, had similar upgrade problems to the original poster.  
> I don't really care about the performance optimizations from "i386"
> to "i686" as long as i'll continue to have something that works.

i486 was liberally licensed by Intel under reasonable terms after a US
court said numbers could not be trademarked (Intel was hoping to
make money on trademark licensing in conjunction).

i686/GTL+ was only licensed by ViA, although I'll assume the lack of
a cmov instruction in Cyrix/IDT-Centaur designs pre-dates that license.

> One additional interesting data point.  The CentOS 4.0 installer
> gave me the "i586" glibc and the "i686" kernel.  I would hope that
> this would be consistent.

That's not what you want.  You want to drop down to an i486 instead
of running i586.

Here's a general guide:  

-march=i486

Runs well on a non-superscalar i486, obviously.
Runs fairly well on a superscalar ix86 that does i486, although
optimizations for the specific architecture should be used.
Portions may run like crap on i586 (true Pentium/MMX), especially
ALU control.

-march=i486 -mtune=i686

Still runs well on a non-superscalar i486 because superscalar
optimizations don't affect it.
Runs near-optimal on a superscalar ix86 that has at least the
same 7-issue core of the Pentium Pro-P3 (2+2+3 ALU+FPU+control).
Even more likely to run like crap on i586 because it assumes the ALU
is 2-issue and well-designed, and i586's just ain't anywhere near 'dat!

-march=i486 -mtune=i586 (or -march=i586)

Improves performance of many operations significantly on i586.
On i486 (only if built with -march=i486; a -march=i586 build will
not run there) and on i686, it will use the chip rather inefficiently.

E.g., (and this is just 1 example) generated machine code will
ripple integers through what it assumes is a pipelined FPU.
On an i486 or anything but an Athlon clone, this will not only tie up
the FPU, but significantly and artificially delay integer loads.
On a true Intel i686 or AMD Athlon, it will leave the 2 and 3 ALU
pipes, respectively, unused.
Worse yet, the original Nx586 (through the Athlon) is 3x faster at
ALU loads than i586, so there it is a clear "de-optimization."

And that's just 1 example.  ;->

-march=i686

Offers little performance gain over -march=i486 -mtune=i686,
while not running at all on an i486 or i586.

-march=i486|i686 -mtune=pentium4

Offers not only improved performance on the Pentium 4, but
all Athlons as well.

-march=i486|i686 -mtune=athlon

This option has been widely debated.  It really _kills_ Intel
performance because Intel i686's 2+2 (ALU+FPU) is not going to
handle optimizations for AMD Athlon's 3+3 (ALU+FPU) and is
going to have lots of stalls (especially on Pentium 4, don't get
me started ;-).  At the same time, the Athlon benefit is
debatable for general applications -- although engineering and
scientific _will_ see a good boost (40% is not uncommon).

The reason why is simple.  Intel's 1 complex _or_ 2 ADD FPU
only allows 1 MULT at full 64-bit precision (and not SSE "lossy
math") whereas AMD's 2 complex _and_ 1 ADD/MULT allows
3 MULT at full 64-bit precision.

Plus, today, you're typically going to be running Athlon64/Opteron,
and you get such "optimizations for free" on x86-64 targets.
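
As a way to double-check which of the combinations above a build actually
got, GCC predefines macros for the -march and -mtune it was given (e.g.
__i686__ and __tune_i686__).  A rough sketch in C, assuming a GCC-style
compiler on x86; the function names are made up for illustration and the
macro list is not exhaustive:

```c
#include <assert.h>
#include <stdio.h>

/* ISA baseline (-march) this file was compiled for, judged from GCC's
 * predefined macros.  Hypothetical helper; only covers the targets
 * discussed in this thread. */
const char *built_for_march(void)
{
#if defined(__x86_64__)
    return "x86-64";
#elif defined(__pentium4__)
    return "pentium4";
#elif defined(__athlon__)
    return "athlon";
#elif defined(__i686__)
    return "i686";
#elif defined(__i586__)
    return "i586";
#elif defined(__i486__)
    return "i486";
#elif defined(__i386__)
    return "i386";
#else
    return "non-x86";
#endif
}

/* Scheduling target (-mtune) this file was compiled with, again judged
 * from GCC's predefined __tune_*__ macros. */
const char *built_with_mtune(void)
{
#if defined(__tune_pentium4__)
    return "pentium4";
#elif defined(__tune_athlon__)
    return "athlon";
#elif defined(__tune_i686__)
    return "i686";
#elif defined(__tune_i586__)
    return "i586";
#elif defined(__tune_i486__)
    return "i486";
#else
    return "other";
#endif
}

/* Write "march=... mtune=..." into buf; returns what snprintf returns. */
int describe_build(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "march=%s mtune=%s",
                    built_for_march(), built_with_mtune());
}
```

Compiled with -march=i486 -mtune=i686, this should report an i486 baseline
scheduled for i686, which is the split the guide above is describing: -march
decides which instructions may be emitted at all, -mtune only how they are
ordered.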

SIDE NOTE:
-O3 ... _never_ run -O3 with -march=pentium3, pentium4 or
athlon.  You're asking for "lossy math" in the case of the former,
and unstable optimizations in the case of the latter.  It should _only_
be used when you developed the application and know what you're
doing.


--
Bryan J. Smith   mailto:b.j.smith at ieee.org