On Friday 09 September 2005 14:18, Bryan J. Smith wrote:
I don't think you're realizing what you're suggesting.
Yes, I do. I've suggested something like this before, and there has been some work on it (see the Fedora list archives from a year or more ago).
Who is going to handle the load of the delta assembly?
The update generation process. Instead of building just an RPM, the buildsystem builds the delta package to push to the package server.
But the on-line, real-time, end-user assembly during "check-out" is going to turn even a high-end server into a big-@$$ door-stop (because it's not able to do much else) with just a few users checking things out!
Do benchmarks on a working system of this type, then come back to me about the unbearable server load.
Do you understand this?
Do you understand how annoyingly arrogant you sound? I am not a child, Bryan.
Not true! Not true at all! You're talking GBs of transactions _per_user_.
I fail to see how a small update of a few files (none of which approach 1GB in size!) can produce multiple GBs of transactions per user. You don't seem to understand how simple this system could be, nor do you seem willing to even try to understand it past your own preconceived notions.
You're going to introduce:
- Massive overhead
In your opinion.
- Greatly increased "resolution time" (even before considering the server responsiveness)
- Many other issues that will make it "unusable" from the standpoint of end-users
All in your opinion.
You can_not_ do this on an Internet server. At most, you can do it locally with NFS with GbE connections so the clients themselves off-load a lot of the overhead. That's not feasible over the Internet, so that falls back on the Internet server.
How in the world would sending an RPM down the 'net built from a delta use more bandwidth than sending that same file as it is sent now? HTTP is probably the transport for either.
As I mentioned before, not my Internet server! ;->
That is your choice, and your opinion.
In any case, it's a crapload more overhead than merely serving out files via HTTP. You're going to reduce your ability to service users by an order of magnitude, if not 2!
Have you even bothered to analyze this in an orderly fashion, instead of flying off the handle like Chicken Little? Calm down, Bryan.
With forethought to those things that can be prebuilt versus those things that have to be generated realtime, the amount of realtime generation can be minimized, I think.
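To make the prebuilt-versus-realtime split concrete, here is a minimal sketch of a lookup in which the server generates nothing at request time: it only decides which static file to hand out. The function name, the set of prebuilt deltas, and the file naming scheme are all hypothetical, made up for illustration; no existing repository tool is implied.

```python
from typing import Optional, Set, Tuple

# (package name, installed version, latest version) for every delta
# that the build system generated ahead of time. Hypothetical structure.
PrebuiltIndex = Set[Tuple[str, str, str]]


def pick_download(name: str, installed: str, latest: str,
                  prebuilt: PrebuiltIndex) -> Optional[str]:
    """Return the static file to serve, or None if already up to date.

    The file names are an assumed convention, not a real repository layout.
    """
    if installed == latest:
        return None
    if (name, installed, latest) in prebuilt:
        # A delta was built when the update was generated: plain file serve.
        return f"{name}-{installed}_{latest}.delta"
    # No prebuilt delta for this pair: fall back to the full package.
    return f"{name}-{latest}.rpm"


if __name__ == "__main__":
    index = {("foo", "1.0-1", "1.0-2")}
    print(pick_download("foo", "1.0-1", "1.0-2", index))
    print(pick_download("foo", "0.9-1", "1.0-2", index))
```

Under these assumptions the cost per request is a set lookup plus an ordinary file transfer; whether that holds up in a real repository is exactly what would need testing.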
That's the key right there -- you think.
Against your opinion, because neither of us has empirical data on this.
Again, keep in mind that repositories merely serve out files via HTTP today. Now you're adding in 10-100x the overhead. You're sending data back and forth, back and forth, back and forth, between the I/O, memory, CPU, etc... Just 1 single operation is going to choke most servers that can service 10-100 HTTP users.
And this has to be balanced against the existing rsync-driven mirroring that is doing multiple gigabytes worth of traffic. If the size of the files being rsync'd is reduced by a sufficient percentage, wouldn't that lighten that portion of the load? Have you worked the numbers for a balance? I know that if I were contracting with you on any of my upcoming multi-terabyte-per-day radio astronomy research projects, and you started talking to me this way, you'd be looking for another client.
No, you're talking about facilities that go beyond what rsync does. You're not just doing simple file differences between one system and another. You're talking about _multiple_ steps through _multiple_ deltas and lineage.
If you have, say, ten updates, you apply the ten update deltas in sequence and send the result down the pike. Is applying a delta to a binary file that is a few kilobytes in length that stressful? What single binary in a typical CentOS installation is over a few megs?
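As a rough illustration of applying deltas in sequence, here is a toy sketch. The delta format (a list of "copy this range of the old file" and "insert these literal bytes" operations) is an assumption invented for this example; it is not the XDelta or rsync format, and difflib is far slower than a real binary differ. It shows only the shape of the work: one linear pass per delta.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Toy delta: ("copy", start, end) copies old[start:end];
# ("data", payload) inserts literal bytes. Assumed format, not XDelta.
Delta = List[Tuple]


def make_delta(old: bytes, new: bytes) -> Delta:
    """Describe `new` in terms of ranges of `old` plus literal bytes."""
    ops: Delta = []
    matcher = SequenceMatcher(None, old, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("data", new[j1:j2]))
        # "delete" emits nothing: those old bytes are simply not copied.
    return ops


def apply_delta(old: bytes, delta: Delta) -> bytes:
    """Rebuild the new file from the old file and one delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


def apply_chain(base: bytes, deltas: List[Delta]) -> bytes:
    """Apply update deltas in order, oldest first."""
    current = base
    for delta in deltas:
        current = apply_delta(current, delta)
    return current


if __name__ == "__main__":
    base = b"The quick brown fox"
    versions = [base + str(i).encode() * i for i in range(1, 11)]
    deltas = [make_delta(a, b) for a, b in zip([base] + versions, versions)]
    assert apply_chain(base, deltas) == versions[-1]
```

Timing `apply_chain` on real package payloads, rather than on this toy data, is what would settle how stressful ten sequential deltas actually are.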
There's a huge difference between traversing extensive delta files and just an rsync delta between existing copies. ;->
Yes, there is. The rsync delta is bidirectional traffic.
I only referenced CVS because someone else made the analogy.
You're not even paying enough attention to know who said what; why should I listen to a rant about something you have no empirical data to back?
I made an analogy to CVS, and I really think things could be made more bandwidth and storage efficient for mirrors, master repositories, and endusers without imposing an undue CPU load at the mirror. Feel free to disagree with me, but at least keep it civil, and without insulting my intelligence.
So yes, I know CVS stores binaries whole. That aside, the XDelta is _still_ going to cause a sizeable amount of overhead.
How much? Why not try it (read the Fedora lists archives for some folks who have indeed tried it).
Far more than Rsync.
You think.
I know. I was already thinking ahead, but since the original poster doesn't even understand how delta'ing works, I didn't want to burden him with further understanding.
Oh, just get off the arrogance here, please. You are not the only genius out here, and you don't have anything to prove with me. I am not impressed with resumes, or even with an IEEE e-mail address. Good attitude beats brilliance any day of the week.
CVS per se would be horribly inefficient for this purpose.
Delta'ing _period_ is horribly inefficient for this purpose. In fact, storing the revisions whole would actually be _faster_ than reverse deltas of _huge_ binary files.
But then there's still the many GB for the mirror. There are only two reasons to do deltas, in my opinion: 1.) Reduce mirror storage space. 2.) Reduce bandwidth required to mirror, and/or reduce bandwidth to the enduser (which I didn't address in this, but could be addressed, even though it is far more complicated to send deltas straight to the user).
You're talking about cpio operations _en_masse_ on a server! Have you ever done just a few smbtar operations from a server before? Do you _know_ what happens to your I/O?
_That's_ what I'm talking about.
It once again depends on the process used. A streaming process could be used that would not impact I/O as badly as you state (although you first said it would kill my CPU, not my I/O). But tests and development would have to be done.
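One way such a streaming process might look, as a sketch only: a generator that walks a delta and yields the rebuilt file in bounded chunks, so that memory use stays near the chunk size instead of the file size. The delta format (("copy", start, end) and ("data", payload) tuples) and the function name are assumptions for illustration, not any existing tool's interface, and nothing here measures actual disk I/O behaviour.

```python
import io
from typing import BinaryIO, Iterable, Iterator, Tuple


def stream_apply(old: BinaryIO, ops: Iterable[Tuple],
                 chunk_size: int = 64 * 1024) -> Iterator[bytes]:
    """Yield the rebuilt file piece by piece from a seekable old file.

    ops uses an assumed toy format:
      ("copy", start, end)  -> bytes old[start:end]
      ("data", payload)     -> literal bytes
    """
    for op in ops:
        if op[0] == "copy":
            _, start, end = op
            old.seek(start)
            remaining = end - start
            while remaining > 0:
                chunk = old.read(min(chunk_size, remaining))
                if not chunk:
                    raise ValueError("delta refers past the end of the old file")
                remaining -= len(chunk)
                yield chunk
        elif op[0] == "data":
            yield op[1]
        else:
            raise ValueError(f"unknown delta operation: {op[0]!r}")


if __name__ == "__main__":
    old_file = io.BytesIO(b"hello world")
    ops = [("copy", 0, 6), ("data", b"there")]
    rebuilt = b"".join(stream_apply(old_file, ops, chunk_size=4))
    assert rebuilt == b"hello there"
```

Each yielded chunk could be written straight to an HTTP response or to disk; whether the seek pattern is kind to a busy server is the part that still needs tests and development.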
Again, the tradeoff is between the storage and bandwidth required at the mirrors and the processing required. Of course, if the mirror server is only going to serve HTTP, doing it in the client isn't good.
Store prebuilt headers if needed.
As far as I'm concerned, that's the _only_ thing you should _ever_ delta. I don't relish the idea of a repository of delta'd cpio archives. It's just ludicrous to me -- and even more so over the Internet.
So you think I'm stupid for suggesting it. (That's how it comes across). Ok, I can deal with that.
*PLONK*