Lamar Owen lowen@pari.edu wrote:
It depends on the implementation. You in your other delta message spell out essentially the same idea.
No, my other message was a _completely_different_ idea. It is a hack to the HTTP-serviced repository that just keeps multiple sets of repodata directories.
A major difference between a true delta back-end and that hack is that while you can re-generate the meta-data at any point for the former, you can_not_ for the latter. In other words, the "repodelta" hack I described can _only_ generate repodata for the state of the repository then, there and at _no_ other time. I.e., you cannot "go back in time" to re-generate it.
There is no "database" or "interwoven history" of the repository in the repodelta hack. It is just a simple hack to keep multiple copies of the repodata meta-data, that's it.
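To make it concrete, the repodelta hack amounts to roughly the following. This is only a minimal sketch: the "repodata.<stamp>" naming and both function names are my own assumptions for illustration, not anything an existing tool does.

```python
import shutil
import time
from pathlib import Path


def snapshot_repodata(repo_root, stamp=None):
    """Copy the current repodata/ to repodata.<stamp>/ and return the new path.

    The "repodata.<stamp>" layout is a hypothetical naming scheme.
    """
    repo_root = Path(repo_root)
    stamp = stamp or time.strftime("%Y%m%d%H%M%S")
    src = repo_root / "repodata"
    dst = repo_root / f"repodata.{stamp}"
    # A plain copy: no database, no history, just another set of meta-data.
    shutil.copytree(src, dst)
    return dst


def list_snapshots(repo_root):
    """Return the names of the saved repodata copies, oldest stamp first."""
    return sorted(p.name for p in Path(repo_root).glob("repodata.*") if p.is_dir())
```

Note what is missing: nothing here can rebuild the meta-data for a date you did not snapshot. You either kept the copy or you did not.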
I have, repeatedly. If the RPMs in question are stored with the payload unpacked, and binary deltas against each file (similar to the CVS repository ,v file) stored,
I don't think you're realizing what you're suggesting. Who is going to handle the load of the delta assembly?
It's one thing to do an off-line disassembly and "check-in" the files, that only happens once -- when you upload the file.
But the on-line, real-time, end-user assembly during "check-out" is going to turn even a high-end server into a big-@$$ door-stop (because it's not able to do much else) with just a few users checking things out! Do you understand this?
BTW/FYI: I know how deltas work -- not only text, but the larger issue of delta'ing binary files. And I have personally deployed XDelta as a binary delta'ing application over the last 5 years, since CVS can only store binaries whole. I haven't looked into how Subversion stores binaries (same algorithm as XDelta?).
then what is happening is not quite as CPU-intensive as you make it out to be.
Not true! Not true at all! You're talking GBs of transactions _per_user_.
You're going to introduce:
- Massive overhead
- Greatly increased "resolution time" (even before considering the server responsiveness)
- Many other issues that will make it "unusable" from the standpoint of end-users
You can_not_ do this on an Internet server. At most, you can do it locally with NFS with GbE connections so the clients themselves off-load a lot of the overhead. That's not feasible over the Internet, so that falls back on the Internet server.
As I mentioned before, not my Internet server! ;->
Most patches are a few bytes here and there in an up to a few megabyte executable, with most package patches touching one or a few files, but typically not touching every binary in the package. You store the patch (applied with xdelta or similar) and build the payload on the fly (simple CPIO here). You send an RPM out that was packed by the server, which is I/O bound, not CPU bound.
Either you have to:
- Do full xdelta revisions on the entire RPM (ustar/cpio)
- Break up the RPM and use your own approach
In any case, it's a crapload more overhead than merely serving out files via HTTP. You're going to reduce your ability to service users by an order of magnitude, if not 2!
With forethought to those things that can be prebuilt versus those things that have to be generated realtime, the amount of realtime generation can be minimized, I think.
That's the key right there -- you think.
Again, keep in mind that repositories merely serve out files via HTTP today. Now you're adding in 10-100x the overhead. You're sending data back and forth, back and forth, back and forth, between the I/O, memory, CPU, etc... Just 1 single operation is going to choke most servers that can service 10-100 HTTP users.
Prove exponential CPU usage increase. If designed intelligently, it might be no more intensive than rsync, which is doing much of what is required already. Would need information on the loading of rsync on a server.
No, you're talking about facilities that go beyond what rsync does. You're not just doing simple file differences between one system and another. You're talking about _multiple_ steps through _multiple_ deltas and lineage.
There's a huge difference between traversing extensive delta files and just an rsync delta between existing copies. ;->
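Here is a toy model of what I mean by lineage. It is a sketch only: the delta format below is built on Python's difflib and is _not_ XDelta's algorithm or format, and ReverseDeltaStore is a made-up name. The newest revision is stored whole and every older one is a reverse delta from its successor, so a check-out of an old revision has to walk the chain.

```python
import difflib


def make_delta(src, dst):
    """Describe dst as copies out of src plus literal data (toy format)."""
    ops = []
    matcher = difflib.SequenceMatcher(None, src, dst, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif j2 > j1:
            ops.append(("data", dst[j1:j2]))
    return ops


def apply_delta(src, ops):
    """Rebuild the target bytes from src and a delta made by make_delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += src[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class ReverseDeltaStore:
    """Newest revision whole; back[i] rebuilds revision i from revision i+1."""

    def __init__(self, first):
        self.head = first
        self.back = []

    def commit(self, new):
        # Done once, off-line, at "check-in" time.
        self.back.append(make_delta(new, self.head))
        self.head = new

    def checkout(self, rev):
        # Done per request: one delta application per step back in history.
        data, steps = self.head, 0
        for ops in reversed(self.back[rev:]):
            data = apply_delta(data, ops)
            steps += 1
        return data, steps
```

With N revisions stored, checkout(0) applies N-1 deltas in sequence, each one reading and rewriting the whole file. An rsync between two existing copies is a single two-point comparison.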
That's because CVS as it stands is inefficient with binaries.
I only referenced CVS because someone else made the analogy. So yes, I know CVS stores binaries whole. That aside, the XDelta is _still_ going to cause a sizeable amount of overhead. Far more than Rsync.
Think outside the CVS box, Bryan.
I am. I _only_ used CVS because it was used prior for analogy. Now I'm talking about XDelta, which I _did_ have in mind previously when I wrote my prior e-mails.
I did not say 'Use CVS for this'; I said 'Use a CVS-like system for this' meaning simply the guts of the mechanism.
I know. I was already thinking ahead, but since the original poster doesn't even understand how delta'ing works, I didn't want to burden him with further understanding.
CVS per se would be horribly inefficient for this purpose.
Delta'ing _period_ is horribly inefficient for this purpose. In fact, storing the revisions whole would actually be _faster_ than reverse deltas of _huge_ binary files.
I don't care how you "break it up" -- it's going to _kill_ your server compared to just an HTTP stream.
Store the unpacked RPMs and binary deltas for each file.
You're talking about cpio operations _en_masse_ on a server! Have you ever done just a few smbtar operations from a server before? Do you _know_ what happens to your I/O?
_That's_ what I'm talking about.
Store prebuilt headers if needed.
As far as I'm concerned, that's the _only_ thing you should _ever_ delta. I don't relish the idea of a repository of delta'd cpio archives. It's just ludicrous to me -- and even more so over the Internet.
Because on the Internet, now you have to start "buffering" or "temporarily storing" packages. When you have tens of systems getting updates, you're duplicating a lot. Case-in-point: You'd be better off just storing the RPMs whole on the filesystem itself.
Only revision headers, period.
Trust the server to sign on the fly rather than at build time (I/O bound).
No, sorry. I sign _off-line_ for a reason.
Pack the payload on the fly with CPIO (I/O bound).
But the problem is you have duplicate I/O streams -- back and forth. That's a PITA when you've got tens of operations going on.
Again, have you _ever_ run smbtar from your server to just a few Windows clients for backup? Same problem.
Send the RPM out (I/O bound) when needed.
And buffer it, temporarily store it, etc... for 10+ connections.
Mirrors rsync the whole unpacked repository (I/O bound).
But it does a delta against 2 existing files -- not an entire lineage of deltas. I really don't think you've thought this through.
Are there issues with this? Of course there are. But the tradeoff is mirroring many GB of RPM's (rsync has to take some CPU for mirroring this large of a collection) versus mirroring fewer GB of unpacked RPM's plus binary deltas,
I think you're minimizing the binary delta operation, big time. I don't think you're going to save any size in the end for mirrors either.
and signing the on-the-fly RPM.
Again, for security reasons, I very much consider this to be a "disadvantage." I like to sign _off-line_ for a reason -- still automated -- but from an _internal_ system.
Yes, it will take more CPU, but I think linearly more CPU and not exponentially.
Here's a "real world" test for you.
Write an Apache script or even a C program that takes XDelta version files, makes them into a cpio archive, and services them up.
Now just service up the cpio archive without all the processing.
How many clients can you serve for each?
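If you want a starting point, a harness along these lines would do. It is a rough sketch with loud caveats: it runs in-process with no HTTP at all, the (offset, replacement) patches are a toy stand-in for XDelta, and tar stands in for cpio because Python's standard library has tarfile but no cpio writer. All the function names are mine.

```python
import io
import tarfile
import time


def apply_patches(base, patches):
    """Apply (offset, replacement) patches in order; toy stand-in for xdelta."""
    data = bytearray(base)
    for offset, repl in patches:
        data[offset:offset + len(repl)] = repl
    return bytes(data)


def pack(files):
    """Pack {name: bytes} into an in-memory tar archive (stand-in for cpio)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


def serve_assembled(store):
    """store maps name -> (base_bytes, [patch_list, ...]); rebuild, then pack."""
    files = {}
    for name, (base, chain) in store.items():
        data = base
        for patches in chain:
            data = apply_patches(data, patches)
        files[name] = data
    return pack(files)


def timed(fn, repeat=20):
    """Average wall-clock seconds per call of fn over `repeat` calls."""
    start = time.perf_counter()
    for _ in range(repeat):
        fn()
    return (time.perf_counter() - start) / repeat
```

Build the archive once with serve_assembled(store) and keep the bytes, then compare timed(lambda: serve_assembled(store)) against timed(lambda: prebuilt) using realistic file sizes and chain lengths. I'm not putting numbers in here; run it on your own hardware and repeat it with several concurrent clients.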
Of course, it would have to be tried. The many GB of mirror has got to have many GB of redundancy in it. The size of the updates is getting out of control; for those with limited bandwidth it becomes very difficult to stay up to date.
I think you've underestimated the resources required to XDelta -- not "two points" like in rsync, but _multiple_. The cpio operation actually pales in comparison.