[CentOS] Why is yum not liked by some?

Fri Sep 9 18:18:51 UTC 2005

Lamar Owen <lowen at pari.edu> wrote:
> It depends on the implementation.  You in your other delta
> message spell out essentially the same idea.

No, my other message was a _completely_different_ idea.  It
is a hack to the HTTP-serviced repository that just keeps
multiple sets of repodata directories.

A major difference between a true delta back-end and that
hack is that while you can re-generate the meta-data at any
point for the former, you can_not_ for the latter.  In other
words, the "repodelta" hack I described can _only_ generate
repodata for the state of the repository then, there and at
_no_ other time.  I.e., you cannot "go back in time" to
re-generate it.

There is no "database" or "interwoven history" of the
repository in the repodelta hack.  It is just a simple hack
to keep multiple copies of the repodata meta-data, that's it.
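
To make the distinction concrete, here is a minimal sketch of what the
"repodelta" hack amounts to.  The function names and the
repodata.<stamp> directory naming are assumptions for illustration
only, not anything yum or createrepo provides.  It copies the current
repodata directory aside, and that is all it does:

```python
import shutil
import time
from pathlib import Path


def snapshot_repodata(repo_root, stamp=None):
    """Copy repo_root/repodata to repo_root/repodata.<stamp>.

    The snapshot reflects the repository only as it is right now;
    nothing here can rebuild metadata for an earlier state.
    """
    repo_root = Path(repo_root)
    # Hypothetical naming scheme: a UTC timestamp suffix.
    stamp = stamp or time.strftime("%Y%m%d%H%M%S", time.gmtime())
    src = repo_root / "repodata"
    dst = repo_root / f"repodata.{stamp}"
    shutil.copytree(src, dst)  # raises FileExistsError if already taken
    return dst


def list_snapshots(repo_root):
    """Return the stamps of the saved snapshots, oldest first."""
    return sorted(
        p.name.split(".", 1)[1]
        for p in Path(repo_root).glob("repodata.*")
        if p.is_dir()
    )
```

A snapshot that was never taken cannot be recovered later, which is
exactly the limitation described above.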

> I have, repeatedly.  If the RPMs in question are stored
> with the payload unpacked, and binary deltas against each
> file (similar to the CVS repository v file) stored, 

I don't think you're realizing what you're suggesting.
Who is going to handle the load of the delta assembly?

It's one thing to do an off-line disassembly and "check-in"
the files, that only happens once -- when you upload the
file.

But the on-line, real-time, end-user assembly during
"check-out" is going to turn even a high-end server into a
big-@$$ door-stop (because it's not able to do much else)
with just a few users checking things out!  Do you understand
this?

BTW/FYI:  I know how deltas work -- not only text, but the
larger issue of delta'ing binary files.  And I have
personally deployed XDelta as a binary delta'ing application
over the last 5 years, since CVS can only store binaries
whole.  I haven't looked into how Subversion stores binaries
(same algorithm as XDelta?).

> then what is happening is not quite as CPU-intensive as you
> make it out to be.

Not true!  Not true at all!  You're talking GBs of
transactions _per_user_.

You're going to introduce:
- Massive overhead
- Greatly increased "resolution time" (even before
considering the server responsiveness)
- Many other issues that will make it "unusable" from the
standpoint of end-users

You can_not_ do this on an Internet server.  At most, you can
do it locally with NFS with GbE connections so the clients
themselves off-load a lot of the overhead.  That's not
feasible over the Internet, so that falls back on the
Internet server.

As I mentioned before, not my Internet server!  ;->

> Most patches are a few bytes here and there in an up to a 
> few megabyte executable, with most package patches touching
> one or a few files, but typically not touching every binary
> in the package.  You store the patch (applied with xdelta or
> similar) and build the payload on the fly (simple CPIO
> here).  You send an RPM out that was packed by the server,
> which is I/O bound, not CPU bound.

Either you have to:  
- Do full xdelta revisions on the entire RPM (ustar/cpio)
- Break up the RPM and use your own approach

In any case, it's a crapload more overhead than merely
serving out files via HTTP.  You're going to reduce your
ability to service users by an order of magnitude, if not 2!

> With forethought to those things that can be 
> prebuilt versus those things that have to be generated
> realtime, the amount of realtime generation can be
> minimized, I think.

That's the key right there -- you think.

Again, keep in mind that repositories merely serve out files
via HTTP today.  Now you're adding in 10-100x the overhead. 
You're sending data back and forth, back and forth, back and
forth, between the I/O, memory, CPU, etc...  Just 1 single
operation is going to choke most servers that can service
10-100 HTTP users.

> Prove exponential CPU usage increase.
> If designed intelligently, it might be no more intensive
> than rsync, which is doing much of what is required 
> already.  Would need information on the loading of rsync on
> a server.

No, you're talking about facilities that go beyond what rsync
does.  You're not just doing simple file differences between
one system and another.  You're talking about _multiple_
steps through _multiple_ deltas and lineage.

There's a huge difference between traversing extensive delta
files and just an rsync delta between existing copies.  ;->
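
The difference can be sketched with a toy model.  This is _not_
XDelta's format or algorithm; the byte-replacement "delta" and the
names make_delta, apply_delta and checkout are assumptions made up for
illustration.  The point is only that a check-out from a lineage walks
every delta between the base and the requested revision, while a
two-point comparison is a single pass:

```python
def make_delta(old: bytes, new: bytes):
    """Toy delta: (offset, new_byte) pairs for two equal-length inputs."""
    assert len(old) == len(new)
    return [(i, new[i]) for i in range(len(old)) if old[i] != new[i]]


def apply_delta(base: bytes, delta) -> bytes:
    """Apply one toy delta; copies the whole buffer each time."""
    buf = bytearray(base)
    for offset, value in delta:
        buf[offset] = value
    return bytes(buf)


def checkout(base: bytes, chain, n: int):
    """Rebuild revision n by walking the first n deltas of the chain.

    Returns (data, passes), where passes is the number of full-buffer
    rewrites that were needed.
    """
    data = base
    passes = 0
    for delta in chain[:n]:
        data = apply_delta(data, delta)
        passes += 1
    return data, passes


# A four-revision lineage stored as a base plus forward deltas.
revisions = [b"aaaa", b"abaa", b"abca", b"abcd"]
chain = [make_delta(a, b) for a, b in zip(revisions, revisions[1:])]

latest, passes = checkout(revisions[0], chain, 3)
two_point = apply_delta(revisions[0], make_delta(revisions[0], revisions[3]))
```

Both paths end with the same bytes, but the lineage path rewrote the
buffer once per stored revision and the two-point path rewrote it once.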

> That's because CVS as it stands is inefficient with
> binaries.

I only referenced CVS because someone else made the analogy.
So yes, I know CVS stores binaries whole.
That aside, the XDelta is _still_ going to cause a sizeable
amount of overhead.
Far more than Rsync.

> Think outside the CVS box, Bryan.

I am.  I _only_ used CVS because it was used prior for
analogy.  Now I'm talking about XDelta, which I _did_ have in
mind previously when I wrote my prior e-mails.

> I did not say 'Use CVS for this'; I said 
> 'Use a CVS-like system for this' meaning simply the guts of
> the mechanism.

I know.  I was already thinking ahead, but since the original
poster doesn't even understand how delta'ing works, I didn't
want to burden him with further understanding.

> CVS per se would be horribly inefficient for this purpose.

Delta'ing _period_ is horribly inefficient for this purpose.
In fact, storing the revisions whole would actually be
_faster_ than reverse deltas of _huge_ binary files.

I don't care how you "break it up" -- it's going to _kill_
your server compared to just an HTTP stream.

> Store the unpacked RPMs and binary deltas for each file.

You're talking about cpio operations _en_masse_ on a server!
Have you ever done just a few smbtar operations from a server
before?  Do you _know_ what happens to your I/O?

_That's_ what I'm talking about.

> Store prebuilt headers if needed.

As far as I'm concerned, that's the _only_ thing you should
_ever_ delta.  I don't relish the idea of a repository of
delta'd cpio archives.  It's just ludicrous to me -- and
even more so over the Internet.

Because on the Internet, now you have to start "buffering" or
"temporarily storing" packages.  When you have tens of
systems getting updates, you're duplicating a lot. 
Case-in-point:  You'd be better off just storing the RPMs
whole on the filesystem itself.

Only revision headers, period.

> Trust the server to sign on the fly rather than at build 
> time (I/O bound).

No, sorry.  I sign _off-line_ for a reason.

> Pack the payload on the fly with CPIO (I/O bound).

But the problem is you have duplicate I/O streams -- back and
forth.  That's a PITA when you've got tens of operations
going on.

Again, have you _ever_ run smbtar from your server to just a
few Windows clients for backup?  Same problem.

> Send the RPM out (I/O bound) when needed.

And buffer it, temporarily store it, etc... for 10+
connections.

> Mirrors rsync the whole unpacked repository (I/O bound).

But it does a delta against 2 existing files -- not an entire
lineage of deltas.  I really don't think you've thought this
through.

> Are there issues with this?  Of course there are.  But the
> tradeoff is mirroring many GB of RPM's (rsync has to take
> some CPU for mirroring this large of a collection) versus
> mirroring fewer GB of unpacked RPM's plus binary deltas,

I think you're minimizing the binary delta operation, big
time.  I don't think you're going to save any size in the end
for mirrors either.

> and signing the on-the-fly RPM.

Again, for security reasons, I very much consider this to be
a "disadvantage."  I like to sign _off-line_ for a reason --
still automated -- but from an _internal_ system.

> Yes, it will take more CPU, but I think linearly more CPU
> and not exponentially.

Here's a "real world" test for you.

Write an Apache script or even a C program that takes XDelta
version files, makes them into a cpio archive, and serves
them up.

Now just serve up the cpio archive without all the
processing.

How many clients can you serve for each?
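
The shape of that experiment can be sketched in-process.  This is a
rough stand-in under loud assumptions: the dictionary "deltas" are
invented for illustration and are not XDelta, there is no cpio packing,
no Apache, no network and no disk I/O, and the names serve_static,
serve_assembled and time_requests are hypothetical.  It shows how to
set up the comparison, not what a real server would measure:

```python
import time


def serve_static(blob: bytes) -> bytes:
    """Baseline: hand back a prebuilt archive unchanged."""
    return blob


def serve_assembled(base: bytes, chain) -> bytes:
    """Rebuild the archive from base plus every delta, per request."""
    buf = bytearray(base)
    for delta in chain:  # each toy delta maps offset -> new byte value
        for offset, value in delta.items():
            buf[offset] = value
    return bytes(buf)


def time_requests(handler, requests: int) -> float:
    """Wall-clock seconds needed to call handler() `requests` times."""
    start = time.perf_counter()
    for _ in range(requests):
        handler()
    return time.perf_counter() - start


# Assumed workload: a 1 MB payload with 20 revisions of 100 changed bytes.
base = bytes(1_000_000)
chain = [{i * 1000 + rev: rev + 1 for i in range(100)} for rev in range(20)]
prebuilt = serve_assembled(base, chain)

t_static = time_requests(lambda: serve_static(prebuilt), 200)
t_assembled = time_requests(lambda: serve_assembled(base, chain), 200)
print(f"static: {t_static:.4f}s  assembled: {t_assembled:.4f}s")
```

Swap in real XDelta files and a real HTTP front end to get numbers that
mean something; the timings printed by this sketch depend entirely on
the machine and the made-up workload.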

> Of course, it would  have to be tried.  The many GB of
> mirror has got to have many GB of redundancy in it.
> The size of the updates is getting out of control; for
> those with limited bandwidth it becomes very difficult
> to stay up to date.

I think you've underestimated the resources required to
XDelta -- not "two points" like in rsync, but _multiple_. 
The cpio operation actually pales in comparison.

-- 
Bryan J. Smith                | Sent from Yahoo Mail
mailto:b.j.smith at ieee.org     |  (please excuse any
http://thebs413.blogspot.com/ |   missing headers)