On Thu, 2005-09-08 at 17:37, Bryan J. Smith wrote:
What I want is to be able to update more than one machine and expect them to have the same versions installed. If that isn't a very common requirement I'd be very surprised.
So what you want is to check out the repository from a specific tag and/or date. So you want:
- The repository to have every single package -- be it packages as whole, or some binary delta'ing between RPMs (if possible)
It just needs to keep every package that it has ever had - at least as long as it might be useful for someone to install them. That seems to be the case now. You need this anyway unless you are sure that no files that remain have specific dependencies on anything removed.
- The repository meta-data to have all history so it can backtrack to any tag/date.
If by history, you mean a timestamp of when a file was added, yes - and that already seems to be there. That would be sufficient to make updates repeatable. I'd like to add one more thing to make it more or less atomic, and that would be some indication of the latest timestamp that should be usable - that is, newer files are in an inconsistent state of a partial update. When the repository maintainer has all the files in place this file would be modified - and some special consideration should be applied to make sure it shows up last during mirror updates. This extra part could be avoided if the 'latest timestamp' is published somewhere and you could manually pass it to yum during the update.
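A minimal sketch of that selection rule, assuming each package file carries an "added" timestamp and the maintainer publishes a single "latest consistent" timestamp. The names `PackageFile`, `effective_cutoff` and `usable_packages` are made up for illustration and are not part of yum:

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class PackageFile:
    name: str      # file name, e.g. "foo-1.0-1.noarch.rpm"
    added_at: int  # Unix timestamp of when the file appeared in the repository


def effective_cutoff(latest_consistent: int, requested: Optional[int] = None) -> int:
    """Never look past the maintainer's marker; optionally stop even earlier."""
    if requested is None:
        return latest_consistent
    return min(latest_consistent, requested)


def usable_packages(
    files: Iterable[PackageFile],
    latest_consistent: int,
    requested: Optional[int] = None,
) -> List[PackageFile]:
    """Keep only files that were added at or before the effective cutoff."""
    cutoff = effective_cutoff(latest_consistent, requested)
    return [f for f in files if f.added_at <= cutoff]
```

With no `requested` value the client simply skips a partial update in progress; passing the timestamp recorded on the test machine repeats that earlier update.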
In other words, you want a repository to maintain storage and use CPU-I/O power to resolve tens of GBs of inter-related data and corresponding versioning meta-data.
No, I want to be able to tell yum not to consider files newer than a certain date corresponding to the time I did the update on the baseline/test machine even if newer ones happen to be sitting in the repository. And I'd like yum to always ignore changes that are transitory and incomplete.
BTW, Your comparison to CVS is extremely poor, so _stop_.
CVS would give the result I want. How it gets done is not particularly relevant.
APT, YUM and countless other package repositories store packages whole, with a "current state" meta-data list, and the packages and that meta-data are served via HTTP and the _client_ resolves what it wants to do.
CVS can run with only file-level access to the repository and no particular intelligence on the server. However, I agree that it isn't exactly the service we need here.
What you want is a more "real-time" resolution logic "like CVS." That either requires:
A) A massive amount of data transfer if done at the client, or
Yum only needs the headers, which already involve a massive amount of data transfer. Using them slightly more intelligently would not take much more, even if a timestamp/tagname field had to be added to the header.
B) A massive amount of CPU-I/O overhead if done at the server
No it doesn't. All it needs is for yum to observe the timestamps on files and ignore any past the point you specify even if they are available. Or move this info to the headers if you don't trust timestamps to be maintained.
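To show how little work that is, here is a rough sketch that does it against a directory with plain file-level access (a local mirror, say). It is only an illustration: the function name `files_not_newer_than` is hypothetical, and over HTTP the timestamps would have to come from somewhere else, such as the headers.

```python
import os
from typing import List


def files_not_newer_than(directory: str, cutoff: float, suffix: str = ".rpm") -> List[str]:
    """Return the sorted names of files in `directory` whose mtime is <= cutoff."""
    kept = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and anything that is not a package file.
            if not entry.is_file() or not entry.name.endswith(suffix):
                continue
            # Ignore files that arrived after the point the caller specified.
            if entry.stat().st_mtime <= cutoff:
                kept.append(entry.name)
    return sorted(kept)
```

This is one stat per file and no server-side logic at all; the caveat is that it trusts the mirror to preserve modification times.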
I see 2 evolutionary approaches to the problem.
- Maintain multiple YUM repositories, even if all but the original are links to the original. The problem with this is who defines what the "original" is. That's why you should maintain your _own_, so it's what _you_ expect it to be.
The Centos repository is the only one I've seen that doesn't keep every file that has ever been added forever. And they do have that available. I'm really not asking for eons of history here. I just want repeatable updates for some small testing window.
- Modify the YUM repository meta-data files so they store revisions, whereby each time createrepo is run, the meta-data is a continuing list.
#1 is direct and practical. #2 adds a _lot_ to the initial query YUM does, and could push it from seconds to minutes or even _hours_ at the client (not to mention the increase in traffic). That's the problem.
The only extra piece of data really needed is the latest timestamp of a consistent update. The rest could be figured out but you'd need a way to find what that value was at the time you do one update so you could re-use it for repeatable results even if it had subsequently changed in the repository. If I were doing the #2 approach, as much as I like an arbitrary number of arbitrary named tags, I'd probably go with an incrementing 'repository update version' tag that would be bumped on new sets of files so you don't ever have to change old ones and you can compute which ones are past what you specify and should be ignored. Some of those header files are 100k now - how much more overhead could an update version entry add?
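As a sketch of the incrementing-version idea, assume each metadata entry records the repository update version in which it was added. The `Entry` type and `view_at` function are hypothetical, and the version-release string is treated as opaque, so no RPM version comparison is attempted:

```python
from typing import Dict, Iterable, NamedTuple


class Entry(NamedTuple):
    name: str      # package name
    version: str   # opaque version-release string
    added_in: int  # repository update version in which this entry was added


def view_at(entries: Iterable[Entry], pinned: int) -> Dict[str, Entry]:
    """For each package name, pick the most recently added entry
    whose update version does not exceed `pinned`."""
    chosen: Dict[str, Entry] = {}
    for e in entries:
        if e.added_in > pinned:
            continue  # added after the pinned update; ignore it
        current = chosen.get(e.name)
        if current is None or e.added_in > current.added_in:
            chosen[e.name] = e
    return chosen
```

Old entries are never rewritten: a new set of files only appends entries with a higher `added_in`, so a client that recorded the version it updated against can compute the same view later.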
This isn't Centos-specific - I just rambled on from some other mention of it and apologize for dwelling on it here. There are 2 separate issues: One is that yum doesn't know if a repository or mirror is consistent or in the middle of an update with only part of a set of RPMs that really need to be installed together.
Not true. The checks that createrepo does can prevent an update if there are missing dependencies. The problem is that most "automated" repos bypass those checks.
Does createrepo do its magic atomically? What do yum runs that happen concurrently see while it is succeeding or failing?
So, again, we're talking "repository management" issues and _not_ the tool itself.
No, I want the repository to be able to be inconsistent and the tool to be able to perform an update based on a prior known-good state.
The other is that if you update one machine and everything works, you have no reason to expect the same results on the next machine a few minutes later.
Because there is no tagging/date facility. But to add that, you'd have to add either (again):
- A _lot_ of traffic (client-based)
- A _lot_ of CPU-I/O overhead (server-based)
Or a sensible approach.
Both issues would be solved if there were some kind of tag mechanism that could be applied by the repository updater after all files are present and updates could be tied to earlier tags even if the repository is continuously updated.
So, in other words, you want the client to get repository info in 15-30 minutes, instead of 15-30 seconds. ;->
No, I want it to get more or less what it already does but ignore inconsistent changes in progress and have the option to ignore things newer than a time you did an earlier update which you'd like to repeat.
_This_ is the point you keep missing. It's the load that is required to do what you want. Not just a few hundred developers moving around a few MBs of files, but _tens_ of _thousands_ of users accessing _GBs_ of binaries.
That's why you rsync the repository down, and you do that locally.
Sorry, I just don't buy the concept that rsync'ing a whole repository is an efficient way to keep track of the timestamps on a few updates so you can repeat them later. Rsync imposes precisely that big load on the server side that you wanted to avoid having everyone do.