[CentOS] Re: Why is yum not liked by some? -- CVS analogy (and why you're not getting it)

Fri Sep 9 03:16:43 UTC 2005

On Thu, 2005-09-08 at 17:37, Bryan J. Smith wrote:
> > What I want is to be able to update more than one machine
> > and expect them to have the same versions installed.  If
> > that isn't a very common requirement I'd be very surprised.

> So what you want is to check out the repository from a specific
> tag and/or date.  So you want:
> 
> 1.  The repository to have every single package -- be it
> packages as whole, or some binary delta'ing between RPMs (if
> possible)

It just needs to keep every package that it has ever had - at
least as long as it might be useful for someone to install them.
That seems to be the case now.  You need this anyway unless you
are sure that no files that remain have specific dependencies on
anything removed.

> 2.  The repository meta-data to have all history so it can
> backtrack to any tag/date.

If by history, you mean a timestamp of when a file was added,
yes - and that already seems to be there.  That would be
sufficient to make updates repeatable.  I'd like to add
one more thing to make it more or less atomic, and that would
be some indication of the latest timestamp that should be
usable - that is, newer files are in an inconsistent state
of a partial update.  When the repository maintainer has
all the files in place this file would be modified - and
some special consideration should be applied to make sure
it shows up last during mirror updates.  This extra part could
be avoided if the 'latest timestamp' is published somewhere
and you could manually pass it to yum during the update.
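A rough sketch of the 'latest usable timestamp' idea, to make it concrete.
Everything here is hypothetical: the marker file name LAST_CONSISTENT, the
function names, and the use of file mtimes are assumptions for illustration,
not anything yum or createrepo provides.  The maintainer writes the marker
only after all files are in place, and a client ignores anything newer than
the marker (or than a cutoff passed in by hand).

```python
import os
import tempfile

MARKER_NAME = "LAST_CONSISTENT"  # hypothetical file name, not part of yum


def publish_marker(repo_dir, cutoff):
    """Write the marker after all packages are in place.

    The value is written to a temporary file first and then renamed over
    the marker, so a reader never sees a half-written value.
    """
    fd, tmp_path = tempfile.mkstemp(dir=repo_dir)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(f"{cutoff}\n")
    os.replace(tmp_path, os.path.join(repo_dir, MARKER_NAME))


def read_marker(repo_dir):
    """Return the latest timestamp the maintainer declared consistent."""
    with open(os.path.join(repo_dir, MARKER_NAME)) as f:
        return int(f.read().strip())


def usable_files(repo_dir, cutoff=None):
    """List .rpm files whose mtime is at or before the cutoff.

    With no explicit cutoff, fall back to the published marker.
    """
    if cutoff is None:
        cutoff = read_marker(repo_dir)
    names = []
    for name in sorted(os.listdir(repo_dir)):
        if not name.endswith(".rpm"):
            continue
        if os.path.getmtime(os.path.join(repo_dir, name)) <= cutoff:
            names.append(name)
    return names
```

Passing the same cutoff later gives the same file list even if newer
packages have arrived in the meantime, which is the repeatability part.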

> In other words, you want a repository to maintain storage and
> use CPU-I/O power to resolve tens of GBs of inter-related
> data and corresponding versioning meta-data.

No, I want to be able to tell yum not to consider files
newer than a certain date corresponding to the time I
did the update on the baseline/test machine even if
newer ones happen to be sitting in the repository.  And
I'd like yum to always ignore changes that are transitory
and incomplete.

> BTW, Your comparison to CVS is extremely poor, so _stop_. 

CVS would give the result I want.  How it gets done is
not particularly relevant.

> APT, YUM and countless other package repositories store
> packages whole, with a "current state" meta-data list, and
> the packages and that meta-data are served via HTTP and the
> _client_ resolves what it wants to do.

CVS can run with only file-level access to the repository and
no particular intelligence on the server.  However, I agree that
it isn't exactly the service we need here.

> What you want is a more "real-time" resolution logic "like
> CVS."  That either requires:
> 
> A)  A massive amount of data transfer if done at the client,
> or

Yum only needs the headers, which involve a massive amount
of data transfer already.  Using them slightly more intelligently
would not take much more, even if a timestamp/tagname field
had to be added to the header.

> B)  A massive amount of CPU-I/O overhead if done at the
> server

No it doesn't.  All it needs is for yum to observe the timestamps
on files and ignore any past the point you specify even if they
are available.  Or move this info to the headers if you don't
trust timestamps to be maintained.
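For illustration, the client-side filtering could look something like the
sketch below.  It assumes each header carries an 'added' time, which is
the hypothetical extra field discussed above; the Header class and
candidates_as_of are made-up names, and "newest" is decided by the added
time rather than by real RPM version comparison, to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Header:
    name: str      # package name
    version: str   # version-release string, treated as opaque here
    added: int     # when the package entered the repository (epoch seconds)


def candidates_as_of(headers, cutoff):
    """Return, per package name, the newest header added at or before cutoff."""
    best = {}
    for h in headers:
        if h.added > cutoff:
            continue  # present in the repository, but past the chosen point
        current = best.get(h.name)
        if current is None or h.added > current.added:
            best[h.name] = h
    return best
```

The work is one pass over headers the client already has, so nothing
extra is asked of the server.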

> I see 2 evolutionary approaches to the problem.
> 
> 1.  Maintain multiple YUM repositories, even if all but the
> original are links to the original.  The problem is:
> who defines what the "original" is?  That's why you should
> maintain your _own_, so it's what _you_ expect it to be.

The CentOS repository is the only one I've seen that doesn't
keep every file that has ever been added forever.  And they
do have that available.  I'm really not asking for eons of
history here.  I just want repeatable updates for some
small testing window.

> 2.  Modify the YUM repository meta-data files so they store
> revisions, whereby each time createrepo is run, the meta-data
> is a continuing list.

> #1 is direct and practical.  #2 adds a _lot_ to the initial
> query YUM does, and could push it from seconds to minutes or
> even _hours_ at the client (not to mention the increase in
> traffic).  That's the problem.

The only extra piece of data really needed is the latest timestamp
of a consistent update.  The rest could be figured out but you'd
need a way to find what that value was at the time you do one
update so you could re-use it for repeatable results even if
it had subsequently changed in the repository.  If I were doing
the #2 approach, as much as I like an arbitrary number of arbitrary
named tags, I'd probably go with an incrementing 'repository update
version' tag that would be bumped on new sets of files so you don't
ever have to change old ones and you can compute which ones are
past what you specify and should be ignored. Some of those header
files are 100k now - how much more overhead could an update version
entry add?
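A minimal sketch of the incrementing 'repository update version' idea,
under the assumption that metadata entries are append-only and each one
records the update version it arrived in.  The Repo class and its methods
are hypothetical names for illustration, not an existing yum or createrepo
interface.

```python
class Repo:
    """Toy model of repository metadata with an update-version counter."""

    def __init__(self):
        self.version = 0   # last completed update version
        self.entries = []  # (update_version, name, version_string), append-only

    def publish(self, packages):
        """Add a set of packages as one update, bumping the version last."""
        new_version = self.version + 1
        for name, ver in packages:
            self.entries.append((new_version, name, ver))
        # Only once every entry is in place does the new set become current.
        self.version = new_version

    def view(self, as_of=None):
        """Return name -> version as seen at a given update version.

        Entries tagged past as_of are ignored, so an earlier update can be
        repeated later, and entries from an unfinished update (tagged
        beyond self.version) are never picked up by default.
        """
        if as_of is None:
            as_of = self.version
        chosen = {}
        for update_version, name, ver in self.entries:
            if update_version <= as_of:
                chosen[name] = ver  # entries are in order, so later ones win
        return chosen
```

Old entries are never rewritten; the per-entry cost is one small integer,
and the client only has to remember which update version it used on the
test machine.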

> > This isn't Centos-specific - I just rambled on from some
> > other mention of it and apologize for dwelling on it here.
> > There are 2 separate issues:
> > One is that yum doesn't know if a repository or mirror is
> > consistent or in the middle of an update with only part of
> > a set of RPM's that really need to be installed together.
> 
> Not true.  The checks that createrepo does can prevent an
> update if there are missing dependencies.  The problem is
> that most "automated" repos bypass those checks.

Does createrepo do its magic atomically?  What does a yum run
see if it happens concurrently, while createrepo succeeds or fails?

> So, again, we're talking "repository management" issues and
> _not_ the tool itself.

No, I want the repository to be able to be inconsistent and
the tool to be able to perform an update based on a prior
known-good state.

> > The other is that if you update one machine and everything
> > works, you have no reason to expect the same results on the
> > next machine a few minutes later.
> 
> Because there is no tagging/date facility.
> But to add that, you'd have to add either (again):
> 1.  A _lot_ of traffic (client-based)
> 2.  A _lot_ of CPU-I/O overhead (server-based)

Or a sensible approach.

> > Both issues would be solved if there were some kind of tag
> > mechanism that could be applied by the repository updater
> > after all files are present and updates could be tied to
> > earlier tags even if the repository is continuously
> > updated.
> 
> So, in other words, you want the client to get repository
> info in 15-30 minutes, instead of 15-30 seconds.  ;->

No, I want it to get more or less what it already does but
ignore inconsistent changes in progress and have the option
to ignore things newer than a time you did an earlier
update which you'd like to repeat.

> _This_ is the point you keep missing.  It's the load that is
> required to do what you want.  Not just a few hundred
> developers moving around a few MBs of files, but _tens_ of
> _thousands_ of users accessing _GBs_ of binaries.
>
> That's why you rsync the repository down, and you do that
> locally. 

Sorry, I just don't buy the concept that rsync'ing a whole
repository is an efficient way to keep track of the timestamps
on a few updates so you can repeat them later.  Rsync imposes
precisely that big load on the server side that you wanted
to avoid having everyone do.

-- 
  Les Mikesell
    lesmikesell at gmail.com