[CentOS] Re: Demonizing generic Linux issues as Fedora Core-only issues -- WAS: Hi, Bryan

Thu May 26 18:17:56 UTC 2005
Les Mikesell <lesmikesell at gmail.com>

On Thu, 2005-05-26 at 01:20, Bryan J. Smith wrote:
> > In this instance, it is about
> > building OS versions with options that required UTF-8 (etc.) character
> > set support along with a perl version that didn't handle it correctly,

> In a nutshell, by disabling the UTF-8 default locale, it fixes the
> problem with ASCII/ISO8859 Perl programs.

No, that only fixes the problem of not being able to handle the
default character set.  It does not make explicit conversions
work correctly as needed by Mime-tools, etc.  Maybe this doesn't
fit RedHat's definition of a bug, but it is still broken behavior,
fixed in the upstream version that they only provide if you do
a full distro change.

> > We've already covered why it isn't reasonable to run those.  But
> > why can't there be an application upgrade to 4.x on a distribution
> > that is usable today,
> 
> Definitely not!  The whole reason why RHEL is very well trusted is
> because Red Hat sticks with a version and then backports any necessary
> fixes.  Trust me, it actually takes Red Hat _more_work_ to do this,
> but they do it to ensure _exact_ functionality over the life of the
> product.

If you believe that, you have to believe that Red Hat's programmers
are always better than the original upstream program author.  I'll
agree that they are good and on the average do a good job, but
that stops far short of saying that they know better than the
perl (etc.) teams what version you should be running.

> Once Red Hat ships a package version in RHEL, unless they are unable
> to backport a fix, they do _not_ typically move forward.  Again, SLAs,
> exact operation to an anal power, and _never_ "feature upgrades."
>
> If you want that, that's what Fedora Core is for.

So, you want a working application, take an incomplete kernel. I
understand that's the way things are. I don't understand why
you like it.

> > Allow multiple versions of apps in the update repositories, I think.
> 
> Again, massive wrench into Red Hat SLAs.
> 
> > Why can't we explicitly update to an app version beyond the stock
> > release if we want it and then have yum (etc.) track that instead
> > of the old one?
> 
> SLAs.

OK, that limits what RedHat might offer.  We are sort-of talking
about Centos here as well as how other distributions might be
better.  Is there a reason that a Centos or third-party repository
could not be arranged such that an explicit upgrade could be
requested to a current version which would then be tracked like
your kernel-xxx-version is when you select smp/hugemem/unsupported?
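
To make the idea concrete, here is a minimal sketch of that kind of
tracking, assuming a repository that carries several version lines of
one package.  The function names and the version list are hypothetical
and this is not how yum behaves today; it only shows that "follow the
track I picked" is a simple selection rule.

```python
def parse_version(text):
    """Turn '4.1.12' into (4, 1, 12) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def pick_update(available, track):
    """Return the newest available version inside the chosen track.

    `track` is a version prefix such as '4.1'; only versions whose
    leading components match it are considered.  Returns None when
    the repository has nothing on that track.
    """
    prefix = parse_version(track)
    candidates = [
        v for v in available
        if parse_version(v)[:len(prefix)] == prefix
    ]
    if not candidates:
        return None
    return max(candidates, key=parse_version)


# Hypothetical repository contents for one package.
mysql_versions = ["3.23.58", "4.0.24", "4.1.10", "4.1.12"]

print(pick_update(mysql_versions, "3.23"))  # stock track
print(pick_update(mysql_versions, "4.1"))   # explicitly requested track
```

A machine left on the stock track never sees the 4.x packages, while
one that asked for 4.1 keeps getting 4.1 updates from then on.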

> Unless you're like Microsoft and you ship things that re-introduce
> old bugs, have unforeseen consequences, etc...  Microsoft is notorious
> for "feature creep" and "historical loss" in their updates.

Realistically you are just substituting a different set of people
making different compromises.

> > I'd be extremely conservative about changes that increase the chances of
> > crashing the whole system (i.e. kernel, device drivers, etc.) and stay
> > fairly close to the developer's version of applications that just run
> > in user mode.  Even better, make it easy to pick which version of each
> > you want, but make the update-tracking system automatically follow
> > what you picked.  Then if you need a 2.4 kernel, perl 5.8.5 and
> > mysql 4.1 in the same bundle you can have it.
> 
> And you are now going to run a suite of regression tests with these
> various combinations -- remember, with each added combination, you
> increase the number of tests _exponentially_ -- and guarantee an X
> hour Service Level Agreement (SLA) on it?

There are times when you want predictable behavior, and times when
you want correct behavior.   When an upstream app makes changes
that provide correct behavior but you are ensured of the old
buggy behavior as a matter of policy, something is wrong.  
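
For a sense of scale, the size of the test matrix Bryan describes is
the product of the number of choices per component, so it multiplies
with every component offered in more than one version.  A minimal
sketch, assuming just two hypothetical choices each for three
components (only the 2.4 kernel, perl 5.8.5 and mysql 4.1 come from
this thread):

```python
from itertools import product

# Hypothetical version choices per component; the alternatives to the
# versions named in the thread are made up for illustration.
choices = {
    "kernel": ["2.4", "2.6"],
    "perl": ["5.8.0", "5.8.5"],
    "mysql": ["3.23", "4.1"],
}


def count_combinations(options):
    """Multiply the number of choices for each component."""
    total = 1
    for versions in options.values():
        total *= len(versions)
    return total


# Enumerate every bundle that would need its own regression run.
bundles = list(product(*choices.values()))

print(count_combinations(choices))
for bundle in bundles:
    print(dict(zip(choices, bundle)))
```

Each additional component with two choices doubles the count, which is
the cost being weighed against getting the upstream fix.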

> In reality, what you're looking for is Fedora Core, not RHEL.

Well, FC1 seems like the only way to get the specific mix
of working kernel and apps for certain things right now, but
it is by definition a dead end - and not really up to date
on the app side either.

> > I'm talking about the CIPE author, who had to be involved to write the
> > 1.6 version not an RPM maintainer who probably couldn't have.
> 
> Not to burst your bubble, but most Red Hat developers go beyond just
> being "maintainers."  Many actively participate in many project
> developments.  Red Hat used to actively include CIPE in the kernel, and
> test it as their standard VPN solution.

Hence my surprise at their change of direction.

> Excuse me?  The developer didn't have to wait for a "release" distro to
> look at what issues were happening with kernel 2.6 -- let alone late
> kernel 2.5 developments or the months upon months of 2.6-test releases.
> For some reason you seem to believe this "scenario" is something only
> CIPE runs into?

It's something a decision to change kernels runs into.  The CIPE
author didn't make that decision.

> Various kernel 2.6-test testing in early
> Fedora Development showed that CIPE was totally broken for 2.6.  And
> there are similar threads in 2003, while 2.6 was in 2.6-test, where
> people were talking about the lack of any CIPE compatibility.

I just don't remember seeing any discussion of this on the CIPE
mailing list which is the only place it might have been resolved.

> I don't think you even understand the issue here.  CIPE wasn't just
> made incompatible because of some "minor interface change" made in an
> odd-ball, interim 2.6 developer release.  Kernel 2.6 was changed
> _massively_ from 2.4, and things like CIPE required _extensive_
> re-writes!  Hans knew this, as did most other people, about the same
> time -- Fall 2003 when the kernel 2.6-test releases were coming out!

I don't see how anyone but Olaf Titz could have made the necessary
changes, and I don't see why he would have done so with appropriate
timing for the FC2 release unless someone involved in the release
made him aware of the planned changes.

> This has absolutely *0* to do with Red Hat or any distributor, _period_!

The distribution decided to change the kernel version and you don't
see how this affects the usability of included packages - or the
need to coordinate such changes with the authors of said packages?

> > I can understand people backing away from a changing interface.
> 
> ???  I don't understand what you meant by that at all  ???

An interface is supposed to be a form of contract among programmers
that is not changed.  Linus has consistently refused to freeze his
interfaces, hence the lack of binary driver support from device
vendors, and frankly I'm surprised at the number of open
source developers that have continued to track the moving target.
How interesting can it be to write the same device driver for the
third time for the same OS?

> > And, as much as you want this to not be about RH/Fedora policies, you
> > are then stuck with something unnecessarily inconvenient because
> > of their policy of not upgrading apps within a release.
> 
> Fedora Core does, probably a little more so than Red Hat Linux prior.

How much change is going to happen in the lifetime of an FC release?

[back to firewire/raid]
> > That's not the issue - I don't expect a hot-plug to go into the raid
> > automatically.  I do want it to pair them up on a clean reboot as
> > it would if they were both directly IDE connected.  So far nothing has.
> 
> That is _exactly_ the issue!  Once you remove a disk from the volume,
> you have to _manually_ re-add it, even if you powered off and re-
> connected the drive.  Once the system has booted without the drive
> just once, it doesn't connect it automagically.

No, that isn't the issue on a simple reboot.  A drive that is connected
when you go down cleanly and is still connected when you restart
shouldn't be handled differently just because there is a different
type of wire connecting it.

> > Separate issues - I'm able to use mdadm to add the firewire drive to
> > the raid and it will re-sync, but if I leave the drive mounted and
> > busy, every 2.6 kernel based distro I've tried so far will crash after
> > several hours.
> 
> I've seen this issue with many other OSes as well.

It didn't happen with FC1 on the same box/same drives.

> > I can get a copy by unmounting the partition, letting
> > the raid resync then removing the external drive (being able to take
> > a snapshot offsite is the main point anyway).
> 
> Once you do this, you must manually tell the system to trust it again.
> Otherwise, it will assume the drive was taken off-line for other
> reasons.

Agreed - I expect to have to mdadm --add and have a resync if
I've done a --fail or --remove, or the hardware is disconnected.
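
The sequence being agreed on here can be sketched as below.  The
device names are hypothetical, and the script only builds and prints
the mdadm command lines rather than running them.

```python
def rotate_out_commands(array, member):
    """Commands to drop one mirror half so the disk can go offsite."""
    return [
        ["mdadm", array, "--fail", member],
        ["mdadm", array, "--remove", member],
    ]


def rotate_in_commands(array, member):
    """Command to bring a returning disk back; a resync follows."""
    return [["mdadm", array, "--add", member]]


# Hypothetical device names; substitute the real array and partition.
ARRAY = "/dev/md0"
EXTERNAL = "/dev/sdb1"

commands = rotate_out_commands(ARRAY, EXTERNAL) + rotate_in_commands(ARRAY, EXTERNAL)
for cmd in commands:
    print(" ".join(cmd))
```

To actually run them, each list could be handed to subprocess.run as
root, waiting for the resync to finish before the next rotation.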

> > I've seen some bug reports about reiserfs on raid that may relate
> > to the crash problem when running with the raid active.
> 
> Well, I'm ignorant on ReiserFS in general (I have limited experience
> dealing with it -- typically clean-ups and the off-line tools are
> never in-sync with the kernel, which seems good on its own, I'll
> admit), but maybe there is a race condition between ReiserFS and
> LVM2/MD.

Actually, I think there may be a really horrible race condition
built into any journaled file system that counts on ordered writes
and the software raid level that doesn't guarantee that across
the mirrors which may be working at different speeds, handling
error retries independently, etc.  But nobody seems to be talking
much about it...

> > This didn't happen under FC1 which
> > never crashed between weekly disk swaps.  There could also be some
> > problems with my drive carriers.
> 
> It definitely could be a drive carrier issue.  In reality, _only_ SATA
> (using the edge connectors directly) and SCA SCSI can be trusted to
> stage transient power properly.
> 
> I typically like to use more reliable drive swapping.  Again, either
> SCA SCSI or the newer SATA.

Ummm, great.  When I started doing this with FC1, SATA mostly didn't
work and firewire did, except you had to modprobe it manually and
tell it about new devices.  These are 250 gig drives and I have
3 externals for offsite rotation, so I can't afford scsi.

> > A firmware update on one type seems to
> > have changed things but none of the problems are strictly reproducible
> > so it is taking a long time to pin anything down.
> 
> Well, I wish you the best of luck.

Today it is running with the mirroring on under FC3, but I don't
know if anything is really different yet.  There has been a recent
kernel update, I've updated firmware on this carrier, and run
some diagnostics to fix drive errors that might have been caused
by the earlier firmware or kernels.   The funny thing is that I
started doing this because I thought working with disks would be
easier than tapes...  But it is nice to be able to plug the
drive carrier into my laptop's usb and be able to restore anything
instantly (the drive case does both usb and firewire).

[...]
> > You use the one that works and has a long history of working until
> > the replacement handles all the needed operations.
> 
> I don't think you seem to understand what I just said.  The standards
> compliant version can _not_ always handle the exact functionality of
> the variant from the standard.

Yes, when a standard is changed late in the game, that is to be
expected.  People will already have existing solutions and can
only move away so fast - especially with formats of archived
data.

> Many times, what people think is so-called "proven" is actually quite
> broken.  Anyone who exchanged tarballs between Linux and Solaris, Irix
> and other systems using GNU Tar typically ran into such issues.

An issue easily resolved by compiling GNU tar for the target system.

> POSIX compliance exists for a reason.  GNU Tar, among many other Linux
> utilities, has deviated over the years.  Things must break to bring
> back that deviation to standard.

POSIX is the thing that changed here.  And GNU tar has nothing to
do with Linux other than being included in some distributions
that also include a Linux kernel.  I'm too lazy to look up the
availability dates but I used GNUtar myself long before Linux.
I agree that forward-looking, the current POSIX spec is useful,
but the 'a' in tar is about archives that exist from a time
when it wasn't.

> > So which is more important when I want to read something from my
> > 1990's vintage tapes?
> 
> If GNU Tar even reads some of them!  You should read up on GNU Tar.  ;->

If you are reading the star author's comments, try to duplicate
the situation yourself.  The worst-case issue with GNU tar is that
you have to repeat a restore of an incremental to get back a
directory that was created between a full and incremental with the
same name that an ordinary file had at the time of the full (or
maybe that's backwards - at least your data is all there and you
can restore it).  For several years while the star author was posting
this, star would have completely missed copying many changed files in
an incremental.  He's done some work in the last few months that
probably fixes it but I doubt if that is in current distributions yet. 

Here's the real test that you should try if you are even thinking
about trusting incrementals:
Make a full run of a machine with nearly full filesystems.  Delete
a bunch of files, add enough new ones that the old/new total would
not fit.  Rename some directories that contain old files. Make an
incremental.  Repeat if you plan multi-level incrementals. Restore the
full and subsequent incremental(s) to bare metal. If you get a working
machine with exactly the same files in the same places including your
old files under the directories with new names, your plan will work.
GNUtar gets all of this right with the --listed-incremental form at
least from the mid-90's through recent distros that don't need magic
file attributes to work (i.e. it might not do everything SELinux
expects).  And amanda depends on this behavior.
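
The final comparison step of that test can be automated.  A minimal
sketch, assuming the original filesystem and the restored one are both
reachable as directory trees (the function names are made up for
illustration, and it checks paths and file contents only, not
ownership, modes or timestamps):

```python
import hashlib
import os


def file_digest(path):
    """SHA-256 of a regular file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def manifest(root):
    """Map each relative path under root to a short fingerprint.

    Symlinks map to their target, directories to 'dir', regular files
    to a content hash, anything else (devices, fifos) to 'special'.
    """
    entries = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if os.path.islink(full):
                entries[rel] = "link:" + os.readlink(full)
            elif os.path.isdir(full):
                entries[rel] = "dir"
            elif os.path.isfile(full):
                entries[rel] = file_digest(full)
            else:
                entries[rel] = "special"
    return entries


def compare_trees(original, restored):
    """Return (missing, extra, changed) relative paths, each sorted."""
    before = manifest(original)
    after = manifest(restored)
    missing = sorted(set(before) - set(after))
    extra = sorted(set(after) - set(before))
    changed = sorted(
        path for path in set(before) & set(after)
        if before[path] != after[path]
    )
    return missing, extra, changed
```

Three empty lists mean the restore matches.  Anything in "extra" is a
file that was deleted before the incremental but came back anyway, and
anything in "missing" includes old files that did not follow a renamed
directory.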

> > You aren't following the scenario.  The drives worked as shipped. They
> > were running Centos 3.x which isn't supposed to have behavior-changing
> > updates.  I did a 'yum update' from the nicely-running remote boxes
> > that didn't include a kernel and thus didn't do a reboot immediately
> > afterward.
> 
> You should have tested this in-house _first_.

I did.  It worked.

> > I normally test on a local system, then one or a few
> > of the remotes, make sure nothing breaks, then proceed with the
> > rest of the remotes.  So, after all that, I ended up with a flock
> > of running remote boxes that were poised to become unreachable on
> > the next reboot.
> 
> Again, you should have tested all this in-house _first_.

I did.  It worked.

> > And even if I had rebooted a local box after
> > the corresponding update, it wouldn't have had the problem because
> > I would have either installed that one in place or assigned the IP
> > from its own console after swapping the disk in.
> 
> But had you followed such a procedure, you would have discovered it.

Actually, in retrospect, the funny part is that one of the main
reasons for cloning the disks in the first place was so that
I'd be testing a bit-for-bit duplicate of what was in production.

> > But they could at least think about what a behavior change is likely
> > to do in different situations, and this one is pretty obvious.  If
> > eth0 is your only network interface and you refuse to start it at
> > bootup, remote servers that used to work become unreachable.
> 
> You might want that in the case where you want only a specific hardware
> address to access the network.

Perhaps, but do you really think I'd change my mind about that
well after the machines were deployed?

> Maybe so.  But I'm still waiting on you to detail when this change was,
> in fact, made.  So far, I'm just going on your comments that you merely
> yum'd the updates from the proper repository.

That's because I didn't do a reboot along with the update, so
it could have been any of several runs. It pretty much had to
be from initscripts-7.31.18.EL-1.centos.1.i386.rpm which I see
is dated April 18 in my download cache. How should I associate
this with RHEL3/Centos3 revisions to describe it?

> > Note that I did test everything I could, and everything I could have
> > tested worked because the pre-shipping behavior was to include the
> > hardware address in the /etc/sysconfig/networking/profiles/defaults/xxxx
> > file, but to ignore it at startup.
> 
> Ahhh, now we're getting to it!
> After you did a "yum update", did you check for any ".rpmsave" files?

No, none of my configs changed, just the way they were handled after
the initscript revision.

> Maybe I've just been in too many environments where that's the deal,
> yes.  And even when it's not lives, it's a "one shot deal" and I
> don't get a 2nd chance.

I'll admit to being a little sloppy because the boxes are behind
a load balancer and I know I can lose one in production without
serious problems.  But, an *exact* copy here didn't show any
problem, the updated remote machine didn't show any problem
while still running.  Everything looked like a go...  I suppose
I should have known that a new initscripts package could break
booting, but RHEL3/Centos had a decent track record about that
sort of thing so far. 

> It might be that it was disabled -- possibly by yourself during config.
> But then an update changed that.  Again, doing a:  
>   find / -name '*.rpmsave'
> 
> Is almost a mandatory step for myself anytime I upgrade.  RPM is very
> good at dumping out those files when it can't use an existing config
> file or script that has been modified.

None of the above.  When I get a chance I'll compare the old/new
version of the ifup steps to see if ignoring on a MAC mismatch
was a new addition or if they fixed a broken comparison in the
original.  I'll feel much better about the whole thing if the
check was there all along and they thought this was just a
bugfix.  Still, I have to wonder how many RH/Centos machines are
out there in the same situation (IP set with redhat-config-network,
then the disk or NIC moved, then a post April 18 update) just
waiting to disappear from the network on the next reboot.  It would
also be interesting to see how RH support would respond when
called about an unreachable box, but being a cheapskate running
Centos, I wouldn't know.
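
A sketch of the kind of check that would find machines in that state
before the next reboot, assuming the stored address sits on an HWADDR=
line in the profile file and that the comparison is a plain
case-insensitive string match.  Neither assumption has been verified
against the actual ifup script, and the helper names are hypothetical.

```python
def read_hwaddr(config_text):
    """Return the HWADDR value from KEY=value style text, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "HWADDR":
            return value.strip().strip("\"'")
    return None


def would_skip_interface(config_text, actual_mac):
    """True if a strict MAC comparison would refuse to start the NIC.

    Assumes a config with no HWADDR line is always started.
    """
    configured = read_hwaddr(config_text)
    if not configured:
        return False
    return configured.lower() != actual_mac.strip().lower()


# Made-up config contents and addresses for illustration.
sample = """DEVICE=eth0
ONBOOT=yes
HWADDR=00:11:22:33:44:55
"""

print(would_skip_interface(sample, "00:11:22:33:44:55"))
print(would_skip_interface(sample, "00:AA:BB:CC:DD:EE"))
```

Feeding it the contents of the profile file and the address reported
by ifconfig for the same interface would flag a box that is poised to
drop off the network.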

-- 
  Les Mikesell
    lesmikesell at gmail.com