On Wed, 2005-05-25 at 22:43 -0500, Les Mikesell wrote:
> I guess I could narrow down my complaint here to the specific RedHat
> policy of not shipping version number upgrades of applications within
> their own distribution versions. In this instance, it is about
> building OS versions with options that required UTF-8 (etc.) character
> set support along with a perl version that didn't handle it correctly,
> (which I can understand because that's the best they could do at the
> time), then *not* providing updates to those broken distributions to
> perl 5.8.3+ which would have fixed them in RH 8.0 -> RHEL3.x but instead
> expecting users to move to the next RH/Fedora release which introduces
> new broken things. Maybe the problems have been fixed in recent
> backed-in patches to RHEL3/centos3 but I don't think so.

The problem was that a lot of CPAN programs were written for
ASCII/ISO8859. The ones included with RHL/RHEL all worked fine and were
tested, but I know people ran into issues with programs they added
themselves. The problem then becomes a Catch-22. Perl 5.8.3 fixed some
issues in 2004, but also introduced other compatibility issues. Fedora
Core 1 did update to Perl 5.8.3 when it became available in 2004, but
Red Hat decided to stick with 5.8.0 for RHEL3. They must have had their
reasons. In a nutshell, disabling the UTF-8 default locale fixes the
problem for ASCII/ISO8859 Perl programs (there's a quick sketch of that
workaround below).

> But Centos 4 includes it, and I assume RHEL4.

Nope, only MySQL 3.23.

> We've already covered why it isn't reasonable to run those. But
> why can't there be an application upgrade to 4.x on a distribution
> that is usable today,

Definitely not! The whole reason RHEL is so well trusted is that Red
Hat sticks with a version and then backports any necessary fixes.
Trust me, it actually takes Red Hat _more_ work to do this, but they do
it to ensure _exact_ functionality over the life of the product.

> and one that will continue to keep itself updated with a
> stock 'yum update' command? I think this is just a policy issue,
> not based on any practical problems.

It's a policy issue, yes. And upgrading from MySQL 3.23 to MySQL 4.x
would throw a massive wrench into a lot of Red Hat's SLAs. Once Red Hat
ships a package version in RHEL, unless they are unable to backport a
fix, they do _not_ typically move forward. Again: SLAs, exact operation
to an anal degree, and _never_ "feature upgrades." If you want that,
that's what Fedora Core is for.

> Allow multiple versions of apps in the update repositories, I think.

Again, a massive wrench into Red Hat's SLAs.

> Why can't we explicitly update to an app version beyond the stock
> release if we want it and then have yum (etc.) track that instead
> of the old one?

SLAs.

> If I had the perl, mysql, and dovecot versions from centos 4 backed
> into centos 3, I'd be happy for a while.

Not the people who pay for RHEL with SLAs, no sir. Trust me on this:
Red Hat is listening to the people who pay, and the people pay for the
attention to bug fixes and that's about it. SuSE was the first to
really prove this was the market; Red Hat just followed them.

> I know it wouldn't be horribly hard to do this myself

Hard is not the problem. It's actually much harder to backport fixes to
old versions, but Red Hat does it for a reason. Remember, updating a
system is more than just taking the latest package and building it.
It's building it, running it in regression tests across a suite of
systems, and _then_ shipping it -- at least when you're talking about
an environment where you're guaranteeing SLAs.
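Coming back to the Perl/UTF-8 point, here's the workaround I mean, as a
minimal sketch. On RHL 8.0 through RHEL3, the system-wide default
locale lives in /etc/sysconfig/i18n, and stripping the ".UTF-8" suffix
puts Perl 5.8.0 back on single-byte semantics. The exact locale names
here are from memory, so treat them as assumptions:

  # /etc/sysconfig/i18n -- system-wide default locale
  # Before (UTF-8 default, trips up ASCII/ISO8859-era CPAN code):
  #   LANG="en_US.UTF-8"
  # After (single-byte locale, pre-RHL-8.0 behavior):
  LANG="en_US"

  # Or override per-process, without touching the system default:
  LANG=en_US perl some-legacy-cpan-script.pl

The per-process form is the safer test: if the script stops
misbehaving under a single-byte locale, you've found your culprit.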
The alternative is being like Microsoft, shipping things that
re-introduce old bugs, have unforeseen consequences, etc... Microsoft
is notorious for "feature creep" and "historical loss" in their
updates.

> but I really hate to break automatic updates and introduce problems
> that may be unique to each system.

Exactomundo. ;->

> I'd be extremely conservative about changes that increase the chances of
> crashing the whole system (i.e. kernel, device drivers, etc.) and stay
> fairly close to the developer's version of applications that just run
> in user mode. Even better, make it easy to pick which version of each
> you want, but make the update-tracking system automatically follow
> what you picked. Then if you need a 2.4 kernel, perl 5.8.5 and
> mysql 4.1 in the same bundle you can have it.

And you are now going to run a suite of regression tests with these
various combinations -- remember, every package that can float between
versions multiplies the number of combinations to test (just two
kernels, three Perls and two MySQLs is already twelve distinct stacks
to certify) -- and guarantee an X-hour Service Level Agreement (SLA)
on it? In reality, what you're looking for is Fedora Core, not RHEL.

> I'm talking about the CIPE author, who had to be involved to write the
> 1.6 version not an RPM maintainer who probably couldn't have.

Not to burst your bubble, but most Red Hat developers go beyond just
being "maintainers." Many actively participate in upstream
development. Red Hat used to actively include CIPE in the kernel and
test it as their standard VPN solution. That changed in 2.6, for a
number of reasons, a big one being that the other developers weren't
even looking at the kernel 2.6-test releases in 2003, let alone 2.6.0
onward once it came out in December. Reading the fall 2003 comments
and others, it became pretty clear that Red Hat was extremely
skeptical about even getting it to work, and whether it was really
worth it.

> So how does any of this relate to the CIPE author, who didn't write CIPE
> for fedora and almost certainly didn't have an experimental 2.6 kernel
> on some unreleased distribution, knowing that CIPE wasn't going to
> work?

Excuse me? The developer didn't have to wait for a "release" distro to
look at what issues were happening with kernel 2.6 -- let alone late
kernel 2.5 developments or the months upon months of 2.6-test
releases. For some reason you seem to believe this "scenario" is
something only CIPE runs into? There are countless kernel features and
packages _external_ to the core kernel developers, and those projects
_do_ "keep up" with kernel developments as they happen. But let's even
assume for a moment they do not. Debian 3.0 "Woody" and Debian 3.1
"Sarge" had kernel 2.6.0 available for download almost immediately.
Kernel 2.6-test testing in early Fedora development showed that CIPE
was totally broken for 2.6. And there are similar threads in 2003,
while 2.6 was in 2.6-test, where people were talking about the lack of
any CIPE compatibility. This was _known_. Your continued insistence on
saying Red Hat released 2.6 "early" is just nonsense. The breakage was
known for some 9 months, months before development even started on
Fedora Core 2, SuSE Linux 9.1, Mandrake Linux 10.0, etc... This is
nowhere near a Red Hat policy, decision or otherwise "unstable" issue.

> On the other hand, someone involved in building FC2 must have
> known and I don't remember seeing any messages going to the CIPE list
> asking if anyone was working on it.
Okay, I'm going to hit the CIPE archives just to see what I don't
know... Hans Steegers seemed very aware and knowledgeable about the
fact that CIPE 1.5 did not run on kernel 2.6 back in September 2003, 3
months before the final kernel 2.6.0 release. Unless I'm mistaken, he
is very involved with CIPE's development.

> Who else knew about the change? Do you expect every author of something
> that has been rpm-packaged to keep checking with Linus to see if he
> feels like changing kernel interfaces this month so as not to disrupt
> the FC release schedule?

I don't think you even understand the issue here. CIPE wasn't just
made incompatible because of some "minor interface change" made in an
odd-ball, interim 2.6 developer release. Kernel 2.6 changed
_massively_ from 2.4, and things like CIPE required _extensive_
rewrites! Hans knew this, as did most other people, at about the same
time -- fall 2003, when the kernel 2.6-test releases were coming out!
This has absolutely *0* to do with Red Hat or any distributor,
_period_!

> I can understand people backing away from a changing interface.

??? I don't understand what you meant by that at all ???

> And, as much as you want this to not be about RH/Fedora policies, you
> are then stuck with something unnecessarily inconvenient because
> of their policy of not upgrading apps within a release.

Fedora Core does upgrade apps within a release, probably a little more
so than Red Hat Linux did before it. But RHEL -- when you ship SLAs,
you ship SLAs -- and you aren't upgrading features mid-release that
can impact compatibility and reliability. Period.

> That's not the issue - I don't expect a hot-plug to go into the raid
> automatically. I do want it to pair them up on a clean reboot as
> it would if they were both directly IDE connected. So far nothing has.

That is _exactly_ the issue! Once you remove a disk from the volume,
you have to _manually_ re-add it, even if you powered off and
re-connected the drive. Once the system has booted without the drive
just once, it doesn't connect it automagically.

> Isn't it? I see different behavior with knoppix and ubuntu. I think
> their startup order and device probing is somewhat different.

Then report it to Bugzilla and use Knoppix and Ubuntu as examples. Red
Hat _likes_ people to find issues and report them, and they will get
fixed. _Unless_ they don't do what Knoppix and Ubuntu do for a reason.
Many times I've seen good reasons not to autodetect things, and
software RAID is one of them, depending on the conditions.

> Separate issues - I'm able to use mdadm to add the firewire drive to
> the raid and it will re-sync, but if I leave the drive mounted and
> busy, every 2.6 kernel based distro I've tried so far will crash after
> several hours.

I've seen this issue with many other OSes as well.

> I can get a copy by unmounting the partition, letting
> the raid resync then removing the external drive (being able to take
> a snapshot offsite is the main point anyway).

Once you do this, you must manually tell the system to trust the drive
again (it's just a couple of mdadm commands -- see the sketch below).
Otherwise, it will assume the drive was taken off-line for other
reasons. If some distros are trumping that logic and just blindly
trusting the drive by default, then they deserve what they get from
that logic -- even if it will only bite them in the ass 1 out of 20
times. I'll take the manual approach the other 19 times to avoid that
1. ;->
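To be concrete, here is roughly what that manual step looks like -- a
sketch only, assuming the array is /dev/md0 and the returning FireWire
disk shows up as /dev/sdb1 (substitute your own device names):

  # See what the kernel currently thinks of the array:
  cat /proc/mdstat
  mdadm --detail /dev/md0

  # If md still lists the departed member, mark it failed and remove it:
  mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

  # Tell md to trust the returning disk again; this kicks off the resync:
  mdadm /dev/md0 --add /dev/sdb1

  # Watch the resync progress:
  cat /proc/mdstat

That --add is the explicit "I trust this disk again" step I'm talking
about.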
> I've seen some bug reports about reiserfs on raid that may relate
> to the crash problem when running with the raid active.

Well, I'm ignorant on ReiserFS in general (I have limited experience
dealing with it -- typically clean-ups, and the off-line tools never
seem to be in sync with the kernel, which otherwise seems good on its
own, I'll admit), but maybe there is a race condition between ReiserFS
and LVM2/MD.

> This didn't happen under FC1 which
> never crashed between weekly disk swaps. There could also be some
> problems with my drive carriers.

It definitely could be a drive carrier issue. In reality, _only_ SATA
(using the edge connectors directly) and SCA SCSI can be trusted to
stage transient power properly. I typically like to use the more
reliable drive-swapping hardware -- again, either SCA SCSI or the
newer SATA.

> A firmware update on one type seems to
> have changed things but none of the problems are strictly reproducible
> so it is taking a long time to pin anything down.

Well, I wish you the best of luck.

> There's really only one CIPE 'developer' and I don't think he has any
> particular interest in any specific distributions.

Then could you _please_ explain how the lack of 2.6 support until
later in 2004 was a "distro-specific" issue? Red Hat, SuSE and many
others just "moved on" and didn't bother to return, despite repeated
attempts to get CIPE working in late 2003 through early 2004.

> If anyone else was talking about it, and in any other place than the
> CIPE mailing list, I'm not surprised that it did not have useful
> results.

From what I've now read, people _were_ aware of it from fall 2003
onward, kernel 2.6-test was out, and basically no one worked on it.

> You use the one that works and has a long history of working until
> the replacement handles all the needed operations.

I don't think you understand what I just said. The standards-compliant
version can_not_ always handle the exact functionality of the variant
that deviates from the standard. Many times, what people think of as
"proven" is actually quite broken. Anyone who exchanged tarballs
between Linux and Solaris, Irix and other systems using GNU Tar
typically ran into such issues. POSIX compliance exists for a reason.
GNU Tar, among many other Linux utilities, has deviated over the
years. Things must break to bring that deviation back to the standard.
I think the LibC4/5 forks and the return to GLibC 2 were a perfect
example. And it doesn't take a rocket scientist to see why GNU gave
the reins to Cygnus (now Red Hat) for GCC 3: GCC 2's C++ was quite the
wasteland.

> A committee decision isn't always the most reliable way to do
> something even if you follow the latest of their dozens of revisions.

I don't think you realize that many times it's not a "committee
decision" that caused the problem in the first place. Sometimes Linux
utilities are just a bit too "eccentric" or introduce their own
"extensions."

> No, but I assume that Gnu tar will be available anywhere I need it.

On Linux, yes. The problem is that it doesn't interact well with other
systems in many cases.

> Given that I've compiled it under DOS, linked to both an aspi scsi
> driver and a tcp stack that could read/feed rsh on another machine
> that seems like a reasonable assumption. I can't think of anything
> less likely to work...

Unfortunately GNU Tar doesn't exactly handle its own extensions well
on different platforms. ;->

> So which is more important when I want to read something from my
> 1990's vintage tapes?

If GNU Tar even reads some of them! You should read up on GNU Tar --
and on telling it to write standard formats in the first place (a
sketch follows below). ;->
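If interchange is the goal, newer GNU Tar can at least be told to
write the standard formats instead of its own. A sketch, assuming GNU
tar 1.14 or later (the --format option isn't in the older 1.13.x that
shipped with RHL/RHEL of that era), with made-up paths:

  # Write a POSIX.1-2001 (pax) archive instead of GNU format -- the
  # best bet for exchange with Solaris, Irix, etc.:
  tar --format=posix -cf backup.tar /home/project

  # Or the older POSIX.1-1988 ustar format for really old readers
  # (no huge files, limited path lengths):
  tar --format=ustar -cf backup.tar /home/project

  # List the archive back to check it; GNU tar warns about any
  # extension keywords it doesn't recognize:
  tar -tvf backup.tar

Writing ustar/pax is what keeps an archive readable by whatever tar
you find on the other system -- or in ten years.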
> Maybe, maybe not. I always set up backups on filesystem boundaries
> anyway so I can prevent them from wandering into CD's or NFS mounts
> by accident, but I can imagine times when you'd want to include them
> and still do correct incrementals.

There are some defaults that are just dangerous. That's one of them.

> You aren't following the scenario. The drives worked as shipped. They
> were running Centos 3.x which isn't supposed to have behavior-changing
> updates. I did a 'yum update' from the nicely-running remote boxes
> that didn't include a kernel and thus didn't do a reboot immediately
> afterwards.

You should have tested this in-house _first_.

> I normally test on a local system, then one or a few
> of the remotes, make sure nothing breaks, then proceed with the
> rest of the remotes. So, after all that, I ended up with a flock
> of running remote boxes that were poised to become unreachable on
> the next reboot.

Again, you should have tested all this in-house _first_.

> And even if I had rebooted a local box after
> the corresponding update, it wouldn't have had the problem because
> I would have either installed that one in place or assigned the IP
> from its own console after swapping the disk in.

But had you followed such a procedure, you would have discovered it.

> But they could at least think about what a behavior change is likely
> to do in different situations, and this one is pretty obvious. If
> eth0 is your only network interface and you refuse to start it at
> bootup, remote servers that used to work become unreachable.

You might want that in the case where you want only a specific
hardware address to access the network. I will reiterate: there are
things in "Common Criteria" standardization that are affecting both
RHEL and SLES.

> I do understand the opposite problem that they were trying to fix
> where a change in kernel detection order changes the interface names
> and has the potential to make a DHCP server start on the wrong
> interface, handing out addresses that don't work. But, it's the
> kind of change that should have come at a version revision or
> along with the kernel with the detection change.

Maybe so. But I'm still waiting on you to detail when this change was,
in fact, made. So far, I'm just going on your comments that you merely
yum'd the updates from the proper repository.

> Note that I did test everything I could, and everything I could have
> tested worked because the pre-shipping behavior was to include the
> hardware address in the /etc/sysconfig/networking/profiles/defaults/xxxx
> file, but to ignore it at startup.

Ahhh, now we're getting to it! After you did a "yum update", did you
check for any ".rpmsave" files? (There's a sketch of the config line I
suspect below.)

> So even when I tested the
> cloned disks after moving to a 2nd box they worked. The 'partially
> replicated environment' to catch this would have had to be a local
> machine with its IP set while the drive was in a different box and
> then rebooted after installing an update that didn't require it. I
> suppose if lives were at stake I might have gone that far.

Maybe I've just been in too many environments where that's the deal,
yes. And even when it's not lives, it's a "one shot deal" and I don't
get a 2nd chance. E.g., people complain about bugs in semiconductor
designs, yet semiconductors aren't like software, where you build it,
run it, and know you've got bugs in 6-8 minutes. You have to go to
layout, then fab it, and then you get it back -- some 6-8 _weeks_
later if you're a major company (possibly 6-8 months if you're not).
So I tend to err on the side of making sure my formal testing is
actually well thought out.
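For the archives, this is the directive I suspect bit you -- a sketch
of a RHL/RHEL-style ifcfg file with a made-up MAC address; HWADDR is
the real initscripts directive, the other values are illustrative:

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  DEVICE=eth0
  BOOTPROTO=static
  IPADDR=192.168.1.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  # When the initscripts enforce HWADDR, eth0 refuses to come up if
  # the NIC's MAC doesn't match -- e.g. after moving a cloned disk
  # into a different box:
  HWADDR=00:0C:29:AB:CD:EF

Deleting or correcting the HWADDR line on a cloned disk before it
boots in the new box sidesteps the whole problem.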
> You are right, of course. I take responsibility for what happened along
> with credit for catching it before it caused any real downtime (which
> was mostly dumb luck from seeing the message on the screen because I
> happened to be at one of the remote locations when the first one was
> rebooted for another reason).

And that's good. If you're going to make a mistake, at least make it
on a minimal number of systems. I've seen far too many people assume
something will work and push it out to all of them.

> Still, it gives me a queasy feeling about
> what to expect from vendors - and I've been burned the other direction
> too by not staying up to the minute with updates so you can't just skip
> them.

If you want to compare the Linux world to any other, at least the
"worst" Linux vendors are still better at patching than any other OS
vendor.

> Hmmm, now I wonder if the code was intended to use the hardware
> address all along but was broken as originally shipped. It would be
> a bit more comforting if it was included in an update because someone
> thought it was a bugfix instead of someone thinking it was a good idea
> to change currently working behavior.

It might be that it was disabled -- possibly by yourself during
configuration -- and then an update changed that. Again, doing a:

  find / -name '*.rpmsave'

(quote the glob so the shell doesn't expand it first) is almost a
mandatory step for me anytime I upgrade. RPM is very good at dumping
out those files when it can't use an existing config file or script
that has been modified.

--
Bryan J. Smith   b.j.smith at ieee.org
---------------------------------------------------------------------
It is mathematically impossible for someone who makes more than you to
be anything but richer than you. Any tax rate that penalizes them will
also penalize you similarly (to those below you, and then below them).
Linear algebra, let alone differential calculus or even elementary
concepts of limits, is mutually exclusive with US journalism. So
forget even attempting to explain how tax cuts work. ;->