On Jul 24, 2011, at 3:14 PM, Ljubomir Ljubojevic wrote:
Eeek! You are already well beyond my expertise: that's a whole lotta repos.
You are likely paying a significant performance cost carrying around that number of repositories. Can you perhaps estimate how much that performance cost is? Say, how long does it take to do a single package update with only the CentOS repositories configured versus with all of the above configured? I'm just interested in a data point to calibrate my expectations of how yum behaves with lots of repositories. You're one of the few and the brave with that number of repositories …
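(For what it's worth, the data point could be collected with something as crude as the sketch below: drop the metadata cache, then time one package lookup with only the stock CentOS repos enabled, and again with everything enabled. The repo ids "base"/"updates" and the package name "zsh" are just placeholders for whatever matches your setup.)

    #!/usr/bin/env python
    # Crude timing harness (illustration only): compare yum wall-clock time
    # with just the stock CentOS repos enabled vs. everything enabled.
    # "base"/"updates" and the package "zsh" are placeholders -- substitute
    # whatever matches the actual setup.
    import os
    import subprocess
    import time

    CASES = {
        "centos-only": ["--disablerepo=*", "--enablerepo=base", "--enablerepo=updates"],
        "all-repos":   [],   # whatever the .repo files already enable
    }

    devnull = open(os.devnull, "w")

    def timed(cmd):
        """Run cmd, discard its output, return elapsed wall-clock seconds."""
        start = time.time()
        subprocess.call(cmd, stdout=devnull, stderr=devnull)
        return time.time() - start

    for label in sorted(CASES):
        flags = CASES[label]
        # Drop cached metadata so both cases pay the same up-front cost.
        subprocess.call(["yum", "clean", "metadata"], stdout=devnull, stderr=devnull)
        print("%-12s %6.1f sec" % (label, timed(["yum", "-q"] + flags + ["list", "zsh"])))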
> Take notice that only 16 are enabled, and ~24 are disabled by default and used only if I do not find what I am looking for.
I can tell that there are already yum performance problems scaling to that number because you (like any rational person would) are choosing to manually intervene and enable/disable repositories as needed.
> Performance is not much of an issue, since the contributing factor is the number of packages inside those repositories. The biggest of the third-party repos are repoforge and repoforge-dag.
You are correct that the scaling depends on the number of packages, not the number of repositories.
However, the solution to a distributed lookup scaling problem *does* depend on the number of places that have to be searched, as well as on the cost of a failed lookup. If you have to look in a large number of repositories to ensure that some package does NOT exist anywhere, well, there are ways to do that efficiently.
And none of the right solutions to the increasing cost of a failed lookup are implemented in yum afaik.
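(As an illustration only, not anything yum actually implements: the textbook way to make "package X is definitely not in repo Y" cheap is a small negative-lookup structure per repository, e.g. a Bloom filter over package names, consulted before any metadata is parsed. A toy Python sketch, with invented repo and package names:)

    # Toy sketch only -- yum does NOT do this. A small per-repository filter
    # over package names (a Bloom filter here) answers "definitely not in this
    # repo" from a few KB in memory, so a failed lookup never has to walk the
    # repo's full metadata. Repo and package names below are invented.
    import hashlib

    class BloomFilter(object):
        def __init__(self, nbits=8192, nhashes=4):
            self.nbits, self.nhashes = nbits, nhashes
            self.bits = bytearray(nbits // 8)

        def _positions(self, name):
            # Derive nhashes bit positions from salted MD5 digests of the name.
            for salt in range(self.nhashes):
                digest = hashlib.md5(("%d:%s" % (salt, name)).encode()).hexdigest()
                yield int(digest, 16) % self.nbits

        def add(self, name):
            for pos in self._positions(name):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, name):
            # False means "certainly absent"; True means "worth a real lookup".
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(name))

    # One filter per repository, built once when its metadata is fetched.
    repos = {"base": BloomFilter(), "repoforge": BloomFilter(), "plnet": BloomFilter()}
    repos["base"].add("zsh")
    repos["repoforge"].add("aria2")
    repos["plnet"].add("some-local-package")

    wanted = "no-such-package"
    worth_searching = [name for name, bf in repos.items() if bf.might_contain(wanted)]
    # An empty list: the failed lookup cost a handful of hashes, not a scan of
    # every repository's primary.xml.
    print("repos worth searching for %r: %s" % (wanted, worth_searching))

(A real implementation could ship such a filter alongside the repo metadata; the only point here is that a miss is answered without downloading or parsing anything large.)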
I was hoping to get an estimate of how bad the scaling problem actually is from an objective, seat-of-the-pants wall-clock measurement.
Meanwhile I'm happy that you've found a workable solution for your purposes. I'm rather more interested in what happens when there are hundreds of repositories and tens of thousands of packages that MUST be searched.
I suspect that yum will melt into a puddle if/when faced with depsolving on that scale. Not that anyone needs depsolving on the scale of hundreds of repos and tens of thousands of packages in the "real world", but that isn't a proper justification for not considering the cost of a failed lookup carefully. From what you are telling me, you are already seeing that cost, and dealing with it by enabling/disabling repositories and by inserting a high-priority repository that is also acting as a de facto cache and "working set" for the most useful packages.
… again no fault intended: I am seriously interested in the objective number for "engineering" and development purposes, not in criticizing.
<snip>
>> Prefer answers from the same repository.
>>
>> A "nearness" rather than a "priority" metric starts to scale better. E.g.
>> with a "priority" metric, adding a few more repositories likely forces
>> an adjustment in *all* the priorities. There's some chance (I haven't
>> looked) that a "nearness" metric would be more localized and that
>> a "first found" search on a simple repository order might be
>> sufficient to mostly get the right answer without the additional artifact
>> of attaching a "priority" score to every package.
> This is why I chose to create plnet-downloaded. Versions of useful packages are copied and their versions frozen with stable releases, and updated in bulk and under control. It might be easier to just repack them and create a separate repository.
Presumably this is the high-priority (and hence searched-first) repository that is acting as a de facto cache, thereby avoiding the failed-lookup scaling issues I've just alluded to.
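(Illustration only, with invented data: this is roughly the shape of the "first found" search over a simple repository order quoted above, with the curated repo listed first so it doubles as the working-set cache.)

    # Sketch of a "first found" search over a plain repository order, with the
    # curated repo (called "plnet-downloaded" here, after the repo discussed
    # above) listed first so it acts as the working-set cache. Only a package
    # that exists nowhere forces a visit to every repository. Invented data,
    # not yum code.
    SEARCH_ORDER = [
        ("plnet-downloaded", {"zsh", "aria2", "some-local-package"}),  # curated cache
        ("base",             {"zsh", "bash", "coreutils"}),
        ("updates",          {"bash"}),
        ("repoforge",        {"aria2", "unrar"}),
    ]

    def resolve(pkg):
        """Return (providing_repo, number_of_repos_consulted) for pkg."""
        for consulted, (repo, names) in enumerate(SEARCH_ORDER, 1):
            if pkg in names:
                return repo, consulted          # stop at the first hit
        return None, len(SEARCH_ORDER)          # a miss pays the full cost

    for pkg in ("aria2", "unrar", "no-such-package"):
        repo, cost = resolve(pkg)
        print("%-16s -> %-18s after consulting %d repo(s)" % (pkg, repo, cost))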
73 de Jeff