[CentOS] Re: Why is yum not liked by some? -- CVS analogy (and why you're not getting it)

Thu Sep 8 22:37:45 UTC 2005
Bryan J. Smith <b.j.smith at ieee.org>

Les Mikesell <lesmikesell at gmail.com> wrote:
> What I want is to be able to update more than one machine
> and expect them to have the same versions installed.  If
> that isn't a very common requirement I'd be very surprised.


So what you want is to check out the repository from a
specific tag and/or date.  So you want:

1.  The repository to keep every single package -- be it
whole packages, or some binary delta'ing between RPMs (if
that is even possible)

2.  The repository meta-data to have all history so it can
backtrack to any tag/date.

In other words, you want a repository that maintains the
storage and burns the CPU-I/O power to resolve tens of GBs
of inter-related data and the corresponding versioning
meta-data.
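
In CVS terms, what you're asking for is the equivalent of a
tag or date checkout -- something like this, where the
server, module and tag names are just made-up examples:

  cvs -d :pserver:anoncvs@cvs.example.org:/cvsroot \
      checkout -r RELEASE_4_1 mymodule
  cvs -d :pserver:anoncvs@cvs.example.org:/cvsroot \
      checkout -D "2005-09-01" mymodule

-- except applied to an entire repository of binary packages.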

BTW, your comparison to CVS is extremely poor, so _stop_. 
;->
I'm going to show you why in a moment.

APT, YUM and countless other package repositories store
packages whole, with a "current state" meta-data list.  The
packages and that meta-data are served via HTTP, and the
_client_ resolves what it wants to do.

What you want is more "real-time" resolution logic, "like
CVS."  That requires either:

A)  A massive amount of data transfer if done at the client,
or

B)  A massive amount of CPU-I/O overhead if done at the
server

Getting to your piss-poor and inapplicable analogy to CVS:
"A" is typically done either on _local_ disk or over an NFS
mount, possibly streamed over RSH/SSH.  In any case, "A" is
almost always done locally -- at least when it comes to
multiple GBs of files.  ;->

"B" is what happens when you run in pserver/kserver mode, and
you now limit your transaction size.  I.e., try checking in a
500MB file to a CVS pserver, and see how _slow_ it is.

In other words, what you want is rather impractical for a
remote server, _regardless_ of whether the server or the
client does the work.  Remember, we're talking GBs of files!

I see 2 evolutionary approaches to the problem.

1.  Maintain multiple YUM repositories, even if all but the
original are just links into the original.  The problem then
is: who defines what the "original" is?  That's why you
should maintain your _own_, so it's exactly what _you_
expect it to be.

2.  Modify the YUM repository meta-data files so they store
revisions, whereby each time createrepo is run, the
meta-data grows into a continuing list.

#1 is direct and practical (a quick sketch follows below). 
#2 adds a _lot_ to the initial query YUM does, and could
push it from seconds to minutes or even _hours_ at the
client (not to mention the increase in traffic).  That's the
problem.
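
As a minimal sketch of #1 -- assuming your master mirror
already sits on local disk, and with made-up paths -- you
cut a dated "tag" repository out of hardlinks and give it
its own meta-data:

  # master mirror: every package you've ever pulled down
  MASTER=/srv/repo/centos/4/updates/x86_64
  # dated snapshot, i.e. your "tag"
  SNAP=/srv/repo/snapshots/2005-09-08/updates/x86_64

  mkdir -p "$SNAP"
  # hardlinks (same filesystem), so the snapshot costs
  # almost no extra disk
  cp -al "$MASTER"/*.rpm "$SNAP"/
  # generate the snapshot's own repository meta-data
  createrepo "$SNAP"

Every box you point at that snapshot sees exactly the same
package set, no matter when the master mirror next changes.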

> This isn't Centos-specific - I just rambled on from some
> other mention of it and apologize for dwelling on it here.
> There are 2 separate issues:
> One is that yum doesn't know if a repository or mirror is
> consistent or in the middle of an update with only part of
> a set of RPM's that really need to be installed together.

Not true.  The checks that createrepo does can prevent an
update if there are missing dependencies.  The problem is
that most "automated" repos bypass those checks.

So, again, we're talking "repository management" issues and
_not_ the tool itself.

> The other is that if you update one machine and everything
> works, you have no reason to expect the same results on the
> next machine a few minutes later.

Because there is no tagging/date facility.
But to add that, you'd have to add either (again):
1.  A _lot_ of traffic (client-based), or
2.  A _lot_ of CPU-I/O overhead (server-based)

Again, using your poor analogy to CVS, have you ever done a
500MB checkout over the Internet -- using SSH or, God help
you, pserver/kserver?

> Both issues would be solved if there were some kind of tag
> mechanism that could be applied by the repository updater
> after all files are present and updates could be tied to
> earlier tags even if the repository is continuously
> updated.

So, in other words, you want the client to get repository
info in 15-30 minutes, instead of 15-30 seconds.  ;->

Either that, or you want the server of the repository to deal
with all that overhead, taking "intelligent requests" from
clients, instead of merely serving via HTTP.

> I realize that yum doesn't do what I want - but lots of
> people must be having the same issues and either going to
> a lot of trouble to deal with them or just taking their 
> chances.

Or we do what we've _always_ done.
We maintain _internal_ configuration management.

We maintain the "complete" repository, and then individual
"tag/date" repositories of links.

Understand we are _not_ talking about a few MB of source
code that you resolve via CVS.  We're talking GBs of binary
packages.

You _could_ come up with a server repository solution using
XDelta and a running journal for the meta-data.  And after a
few hits, the repository server would tank.

The alternative is for the repository server to just keep
complete copies of all packages (which some do), and then
keep a running journal for the meta-data.  But that would
still require the client to either download and resolve a
lot (taking 15-30 minutes instead of 15-30 seconds), _or_
push that resolution back onto the server.

_This_ is the point you keep missing.  It's the load that is
required to do what you want.  Not just a few hundred
developers moving around a few MBs of files, but _tens_ of
_thousands_ of users accessing _GBs_ of binaries.

That's why you rsync the repository down, and you do that
locally.  There is no way to avoid that.  Even Red Hat
Network (RHN) and other solutions do that -- they have you
mirror things locally, with resolution going on locally.
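
A minimal sketch of that local mirroring -- the mirror host
and paths are made up, but most public mirrors export an
rsync module along these lines:

  rsync -avH --delete \
      rsync://mirror.example.org/centos/4/updates/x86_64/ \
      /srv/repo/centos/4/updates/x86_64/

Run that from cron, cut your dated snapshots from the result
(as above), and all the resolution happens on your own LAN.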

In other words, local configuration management.  It's very
easy to do with RPM and YUM.  You can't "pass the buck" to
the Internet repository.  Red Hat doesn't even let its
Enterprise customers do it, and they wouldn't want to
either.  They have a _local_ repository.
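
For reference, the client side of that hypothetical
"local-updates" snapshot is one small file on CentOS 4 (the
server name and path are made up):

  # /etc/yum.repos.d/local-updates.repo
  [local-updates]
  name=Local updates snapshot (2005-09-08)
  baseurl=http://repo.example.internal/snapshots/2005-09-08/updates/x86_64/
  enabled=1
  gpgcheck=1

Point every machine at the same baseurl, and "the same
versions installed" falls out for free.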



-- 
Bryan J. Smith                | Sent from Yahoo Mail
mailto:b.j.smith at ieee.org     |  (please excuse any
http://thebs413.blogspot.com/ |   missing headers)