On Jul 24, 2011, at 4:35 PM, Ljubomir Ljubojevic wrote:
Jeff Johnson wrote:
You are correct that the scaling depends on the number of packages, not the number of repositories.
However the solution to a distributed lookup scaling problem *does* depend on the number of places that have to be searched as well as the cost of a failed lookup. If you have to look in a large number of repositories to ensure that some package does NOT exist anywhere, well, there are ways to do that efficiently.
And none of the right solutions to the increasing cost of a failed lookup are implemented in yum afaik.
I was hoping to get an estimate of how bad the scaling problem actually is from an objective, seat-of-the-pants wall-clock measurement.
Meanwhile I'm happy that you've found a workable solution for your purposes. I'm rather more interested in what happens when there are hundreds of repositories and 10's of thousands of packages that MUST be searched.
I suspect that yum will melt into a puddle if/when faced with depsolving on that scale. Not that anyone needs depsolving across hundreds of repos and 10's of thousands of packages in the "real world", but that isn't a proper justification for not considering the cost of a failed lookup carefully. From what you are telling me, you are already seeing that cost, and you are dealing with it by enabling/disabling repositories and by inserting a high-priority repository that acts as a de facto cache and "working set" for the most useful packages.
Thank you! In general those numbers are better than I would have guessed from yum.
<snip>
If you have some specific stress test I would be happy to run it.
If I can think of something, I'll pass it along.
Oh, yeah, yum reads and processes xml metadata, not the actual package files, so searches are fast because of it.
Here's something that might help you:
Using xml is a significant performance hit: see recent patches to yum/createrepo to use sqlite instead of xml … lemme find the check-in claim … here is the claim http://lists.baseurl.org/pipermail/rpm-metadata/2011-July/001353.html and quoting
Tested locally on repodata of 9000 pkgs.
Goes from 1.8-> 2GB of memory in use with the old createrepo code to 325MB of memory in use - same operation - performance-wise it is not considerably different. More testing will bear that out, though.
So -- if I believe those numbers -- there's *lots* of room for improvement in yum ripping out xml and replacing it with a sqlite database. Note that createrepo != yum but some of the usage cases are similar. The general problem in yum (and smart and apt) is the high cost of the cache load, and the amount of xml that must be parsed/read in order to be cached. Adding a sqlite backing store which can just be used, not loaded, is a win.
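To make that concrete, here is a rough sketch of what "just used, not loaded" means in practice: the lookup below goes straight to the on-disk database and touches only the rows it needs, instead of parsing every package stanza into memory first. (Table and column names are loosely modeled on yum's primary.sqlite layout; treat them as illustrative, not gospel.)

    import sqlite3

    def find_providers(db_path, capability):
        """Ask the on-disk database who provides a capability.
        Nothing is parsed or loaded into memory ahead of the query."""
        conn = sqlite3.connect(db_path)
        try:
            # Hypothetical schema: a 'provides' table keyed by pkgKey,
            # joined to a 'packages' table for the package NVR.
            cur = conn.execute(
                "SELECT p.name, p.version, p.release "
                "FROM packages p JOIN provides pr ON pr.pkgKey = p.pkgKey "
                "WHERE pr.name = ?",
                (capability,))
            return cur.fetchall()
        finally:
            conn.close()

The equivalent lookup against primary.xml cannot answer anything until the whole file has been read and parsed, which is exactly the cache-load cost I'm talking about.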
Note that the other problem I alluded to, avoiding the cost of a failed search across a distributed store, is very well researched and modeled (and unimplemented in yum). But most depsolving just needs to find which package is needed, and using priority is a reasonable way to improve that search (if you can choose the priorities sanely, which is hard).
The usual approach is to devise a cheap way to detect and avoid a failing search. This is often done with Bloom filters, but there are other equivalent ways to avoid the cost of failure.
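To illustrate (a toy sketch of the general technique, not anything that exists in yum): keep one small Bloom filter per repository over the capabilities it provides, and consult the filters before doing any real metadata search. A "definitely not here" answer costs a handful of hash probes; only "maybe" answers fall through to the expensive lookup.

    import hashlib

    class BloomFilter:
        def __init__(self, m_bits, k_hashes):
            self.m = m_bits
            self.k = k_hashes
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, key):
            # Derive k bit positions from two hashes (Kirsch-Mitzenmacher trick).
            digest = hashlib.sha1(key.encode('utf-8')).digest()
            h1 = int.from_bytes(digest[:8], 'big')
            h2 = int.from_bytes(digest[8:16], 'big')
            return [(h1 + i * h2) % self.m for i in range(self.k)]

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(key))

    def candidate_repos(filters, capability):
        # 'No' answers skip the repository entirely; 'maybe' answers
        # fall through to a real metadata search.
        return [repo for repo, bf in filters.items() if capability in bf]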
Wikipedia isn't too bad an introduction to Bloom filters if interested. The hard part is choosing the parameters correctly for an "expected" population. If you miss that estimate (or choose the parameters incorrectly) then Bloom filters will just make matters worse.
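For the record, the standard sizing rules are m = -n*ln(p)/(ln 2)^2 bits and k = (m/n)*ln 2 hash functions for n expected entries and a target false-positive rate p. A quick illustration of my own (the numbers are only as good as your estimate of n, which is the whole problem):

    import math

    def bloom_parameters(n_items, false_positive_rate):
        """Bit-array size m and hash count k for n expected items.
        Underestimate n and the real false-positive rate climbs well
        above the target, which is the 'makes matters worse' failure."""
        m = math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))
        k = max(1, round((m / n_items) * math.log(2)))
        return m, k

    # e.g. 10,000 provides at a 1% false-positive target comes out to
    # roughly 95,851 bits (~12 KB) and 7 hash functions.
    print(bloom_parameters(10000, 0.01))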
<snip>
off to study and think a bit … thanks!
73 de Jeff