[CentOS] Why is yum not liked by some?

Fri Sep 9 20:20:05 UTC 2005
Lamar Owen <lowen at pari.edu>

On Friday 09 September 2005 14:18, Bryan J. Smith wrote:
> I don't think you're realizing what you're suggesting.

Yes, I do.  I've suggested something like this before, and there has been some 
work on it (see the Fedora list archives from a year or more ago).

> Who is going to handle the load of the delta assembly?

The update generation process.  Instead of building just an RPM, the 
buildsystem builds the delta package to push to the package server.
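
To make that concrete, here is a rough sketch of the build-time step.  It is 
only a sketch: the xdelta3 syntax is an assumption on my part (substitute 
whatever delta encoder is actually chosen), and the file names and layout are 
invented for illustration.

import subprocess

# Hypothetical sketch: when a new package is built, also emit a delta against
# the previous build, and push only that (much smaller) delta to the package
# server, which already holds the previous full RPM.
def build_delta(old_rpm, new_rpm, delta_out):
    # xdelta3 -e -s OLD NEW DELTA encodes the difference (assumed syntax).
    subprocess.check_call(["xdelta3", "-e", "-s", old_rpm, new_rpm, delta_out])

def push_update(old_rpm, new_rpm):
    delta = new_rpm + ".vcdiff"
    build_delta(old_rpm, new_rpm, delta)
    return delta  # hand this off to whatever pushes files to the server

if __name__ == "__main__":
    push_update("foo-1.0-1.i386.rpm", "foo-1.0-2.i386.rpm")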

> But the on-line, real-time, end-user assembly during
> "check-out" is going to turn even a high-end server into a
> big-@$$ door-stop (because it's not able to do much else)
> with just a few users checking things out!  

Do benchmarks on a working system of this type, then come back to me about the 
unbearable server load.

> Do you understand 
> this?

Do you understand how annoyingly arrogant you sound?  I am not a child, Bryan.

> Not true!  Not true at all!  You're talking GBs of
> transactions _per_user_.

I fail to see how a small update of a few files (none of which approaches 1 GB 
in size!) can produce multiple GBs of transactions per user.  You seem not to 
understand how simple this system could be, nor do you seem willing even to 
try to understand it past your own preconceived notions.

> You're going to introduce:
> - Massive overhead

In your opinion.

> - Greatly increased "resolution time" (even before
> considering the server responsiveness)
> - Many other issues that will make it "unusable" from the
> standpoint of end-users

All in your opinion.

> You can_not_ do this on an Internet server.  At most, you can
> do it locally with NFS with GbE connections so the clients
> themselves off-load a lot of the overhead.  That's not
> feasible over the Internet, so that falls back on the
> Internet server.

How in the world would sending an RPM down the 'net that was built from a delta 
use more bandwidth than sending that same file as it is sent now?  HTTP is 
probably the transport either way.

> As I mentioned before, not my Internet server!  ;->

That is your choice, and your opinion.

> In any case, it's a crapload more overhead than merely
> serving out files via HTTP.  You're going to reduce your
> ability to service users by an order of magnitude, if not 2!

Have you even bothered to analyze this in an orderly fashion, instead of 
flying off the handle like Chicken Little?  Calm down, Bryan.

> > With forethought to those things that can be
> > prebuilt versus those things that have to be generated
> > realtime, the amount of realtime generation can be
> > minimized, I think.

> That's the key right there -- you think.

Against your opinion, because neither of us has empirical data on this.

> Again, keep in mind that repositories merely serve out files
> via HTTP today.  Now you're adding in 10-100x the overhead.
> You're sending data back and forth, back and forth, back and
> forth, between the I/O, memory, CPU, etc...  Just 1 single
> operation is going to choke most servers that can service
> 10-100 HTTP users.

And this is balanced against the existing rsync-driven mirroring, which is 
already moving multiple gigabytes' worth of traffic.  If the size of the files 
being rsync'd is reduced by a sufficient percentage, wouldn't that lighten that 
portion of the load?  Have you worked the numbers for a balance?  I know that 
if I were contracting with you on any of my upcoming multi-terabyte-per-day 
radio astronomy research projects, and you started talking to me this way, 
you'd be looking for another client.
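
As a purely back-of-the-envelope illustration (the numbers are invented, not 
measured): if a typical updated RPM is 10 MB and its delta against the previous 
build is 1 MB, then a mirror that syncs 500 updated packages moves roughly 
500 MB instead of 5 GB on that pass -- a 90% reduction in that portion of the 
traffic.  Whether the real ratio is anywhere near that is exactly what ought to 
be measured rather than asserted.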

> No, you're talking about facilities that go beyond what rsync
> does.  You're not just doing simple file differences between
> one system and another.  You're talking about _multiple_
> steps through _multiple_ deltas and lineage.

If you have, say, ten updates, you apply the ten update deltas in sequence and 
send the result down the pike.  Is applying a delta to a binary file that is a 
few kilobytes in length really that stressful?  What single binary in a typical 
CentOS installation is over a few megabytes?
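
To show the shape of that work, here is a rough sketch of the reassembly step.  
Again, only a sketch: the xdelta3 invocation is an assumption (substitute 
whatever delta tool is actually chosen), and the file layout is invented.

import os
import shutil
import subprocess
import tempfile

# Hypothetical sketch: rebuild the current RPM by applying the stored deltas
# in order, then hand the finished file to the web server to send as an
# ordinary download.
def rebuild(base_rpm, deltas, output_rpm):
    workdir = tempfile.mkdtemp()
    current = base_rpm
    for i, delta in enumerate(deltas):
        step = os.path.join(workdir, "step%d.rpm" % i)
        # xdelta3 -d -s SOURCE DELTA TARGET decodes one delta (assumed syntax).
        subprocess.check_call(["xdelta3", "-d", "-s", current, delta, step])
        current = step
    shutil.copy(current, output_rpm)
    shutil.rmtree(workdir)
    return output_rpm

Each step is one read and one write of a few-megabyte file; ten of those per 
requested package is the kind of load that should be measured before anyone 
declares the server a door-stop.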

> There's a huge difference between traversing extensive delta
> files and just an rsync delta between existing copies.  ;->

Yes, there is.  The rsync delta is computed on the fly, with bidirectional 
traffic: the receiver sends block checksums one way and the sender sends the 
matching literal data back.  A pregenerated delta is a one-way download.

> I only referenced CVS because someone else made the analogy.

You're not even paying enough attention to know who said what; why should I 
listen to a rant about something you have no empirical data to back?

I made the analogy to CVS, and I really think things could be made more 
bandwidth- and storage-efficient for mirrors, master repositories, and end 
users without imposing an undue CPU load at the mirror.  Feel free to 
disagree with me, but at least keep it civil, and without insulting my 
intelligence.

> So yes, I know CVS stores binaries whole.
> That aside, the XDelta is _still_ going to cause a sizeable
> amount of overhead.

How much?  Why not try it?  (Read the Fedora list archives for some folks who 
have indeed tried it.)

> Far more than Rsync.

You think.

> I know.  I was already thinking ahead, but since the original
> poster doesn't even understand how delta'ing works, I didn't
> want to burden him with further understanding.

Oh, just get off the arrogance here, please.  You are not the only genius out 
here, and you don't have anything to prove with me.  I am not impressed with 
resumes, or even with an IEEE e-mail address.  Good attitude beats brilliance 
any day of the week.

> > CVS per se would be horribly inefficient for this purpose.
>
> Delta'ing _period_ is horribly inefficient for this purpose.
> In fact, storing the revisions whole would actually be
> _faster_ than reverse deltas of _huge_ binary files.

But then the mirror still has to store the many GB of full packages.  There are 
only two reasons to do deltas, in my opinion:
1.)	Reduce mirror storage space.
2.)	Reduce the bandwidth required to mirror, and/or reduce the bandwidth to the 
end user (which I didn't address here, but which could be addressed, even 
though it is far more complicated to send deltas straight to the user).

> You're talking about cpio operations _en_masse_ on a server!
> Have you ever done just a few smbtar operations from a server
> before?  Do you _know_ what happens to your I/O?

> _That's_ what I'm talking about.

It once again depends on the process used.  A streaming process could be used 
that would not impact I/O as badly as you state (although you first said it 
would kill my CPU, not my I/O).  But tests and development would have to be 
done.
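
For what it's worth, here is the kind of streaming arrangement I have in mind.  
It is only a sketch: apply_delta_stream() is an imagined helper (it would wrap 
a delta decoder so that it reads its input lazily and exposes the decoded 
output as another file-like object), not an existing tool or API.

# Sketch of the streaming idea: each stage consumes the previous stage's
# output as it is produced, so only chunk-sized buffers ever live in memory
# and the full rebuilt package is never staged on the mirror's disk.
def serve_update(base_path, delta_paths, client_socket, chunk_size=64 * 1024):
    stream = open(base_path, "rb")
    for delta_path in delta_paths:
        # apply_delta_stream() is hypothetical -- see the note above.
        stream = apply_delta_stream(stream, open(delta_path, "rb"))
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        client_socket.sendall(chunk)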

Again, the tradeoff is between the storage and bandwidth required at the 
mirrors and the processing required to rebuild packages there.  Of course, if 
the mirror server is only going to serve HTTP, doing it in the client isn't 
good.

> > Store prebuilt headers if needed.
>
> As far as I'm concerned, that's the _only_ thing you should
> _ever_ delta.  I don't relish the idea of a repository of
> delta'd cpio archives.  It's just ludicrous to me -- and
> even more so over the Internet.

So you think I'm stupid for suggesting it.  (That's how it comes across).  Ok, 
I can deal with that.

*PLONK*
-- 
Lamar Owen
Director of Information Technology
Pisgah Astronomical Research Institute
1 PARI Drive
Rosman, NC  28772
(828)862-5554
www.pari.edu