For me as a mirror admin, the only thing I don't like about MirrorBrain is
that I don't have the ability to log in and "check on" or administer my
mirror. I mirror for a few different distros, and Ubuntu's mirror manager
is quite poor as well. I have an account, but can't get to it. When I fail
a test of some sort, I get a not-very-useful e-mail, and no way to get more
info on what happened. I usually end up just "waiting it out". It would be
nice to get an e-mail alerting me that something is wrong, and then to be
able to log in and see.

I would also like to be able to specify some IP ranges I'm authoritative
for. As my mirror is on a university campus, I'd love to be able to enter
my campus' IP ranges, and that way ensure that all of my campus gets my
mirror. So far, none of the OSes I mirror for (I don't mirror Fedora
presently) allows me to do that.

MirrorBrain sounds like it has a lot of the functionality, but makes it
available only to the distro managers. They're busy people; I'd rather not
bother them if I can handle stuff myself.

--Jim

On Mon, Nov 8, 2010 at 1:02 AM, Peter Pöml <peter at poeml.de> wrote:
> Hi everybody,
>
> [resending, after realizing that I was subscribed with an old address]
>
> On Wed, Oct 27, 2010 at 11:31:56PM +0200, Ralph Angenendt wrote:
>> There is a wiki page for that process now. I put down the notes I took
>> at the meeting for now. There's also a log of the IRC meeting, which I
>> want to redact a bit first, as there is some off-topic chatting in there
>> (and several joins/leaves during the meeting). I won't have time for
>> that before Friday, though.
>>
>> Here's the page, which will fill up with more information:
>>
>> http://wiki.centos.org/InfraWiki/Mirrors
>>
>> I'd like to thank the people who were there and gave us input about
>> other solutions (and questioned why we do things the way we do).
>>
>> Regards,
>>
>> Ralph
>
> I would also like to thank you for the good meeting, and also for
> considering MirrorBrain.
>
> This mail is very long - too long - and I apologize for that, but I
> thought it would be good to provide a comprehensive overview of the
> options that I see.
>
> First off, I think you can't go wrong if you go with MirrorManager,
> because it works for Fedora, and it already has support for the somewhat
> more special requirement that you have, which is yum mirror lists. The
> similarity of Fedora and CentOS might make many things easier.
> MirrorBrain doesn't have this yet, because none of its users has needed
> it so far. As MirrorBrain tries to be a generic solution, it is generally
> agnostic of project or metadata structure, and does everything at the
> file level. That doesn't mean that support for "special" features is
> unwanted, of course, especially if it can be implemented in a way that
> fits into the concept and doesn't make deployment for other users more
> difficult. It is certainly a nice option - there are many Yum-based
> distros, after all.
>
> (Background: being usable not only by Linux distros is a declared goal
> of the MirrorBrain project, in order to get as many users (and potential
> developers) on board to collaborate.
>
> For a mirroring infrastructure, I believe that only collaboration across
> organization borders can yield a mature, flexible and long-lived
> solution. And there are not really many people working on this, only a
> handful. It would be cool to merge MirrorBrain and MirrorManager
> somehow. That might be a lot of work, but useful in the long term.)
>
> Having said all that, I thought that a Yum mirrorlist in MirrorBrain
> should not be hard to implement. I spent some time on it today and got
> quite far; configuring the mapping of URL query arguments to
> directories/files is done, and the actual mapping works.
> I chose Apache config as the vehicle for that, and the following is a
> working config:
>
>     MirrorBrainYumDir release=(5\.5) \
>                       repo=(os|extras|addons|updates|centosplus|contrib) \
>                       arch=x86_64 \
>                       $1/$2/x86_64 repodata/repomd.xml
>
> For instance, $1/$2/x86_64 is the base URL of a repository, and the match
> groups can optionally be replaced with what the client specified in the
> query arguments. ($1 is the first group from the configuration line, $2
> the second, and so on. The names and number of query args are all
> arbitrary.)
> The last argument is a relative path, naming the file that must be
> present on eligible mirrors. The resulting path here would be e.g.
> 5.5/os/x86_64/repodata/repomd.xml, and the client would get a list of
> mirrors in the form
> http://mirror.example.com/path/to/centos/5.5/os/x86_64/
> (That part is still missing from the implementation, but it's the easiest
> part :-)
> So I'm confident that I can promise Yum mirror lists soon. Maybe I can
> finish it this week, maybe the week after, I don't know.
>
> Meanwhile, I would appreciate input from you: is this reasonable? Would
> it serve your needs?
>
> If it does, I think the only feature missing in MirrorBrain for you
> would be sorted out.
>
> (Needless to say, the mirror list that yum gets will be sorted by the
> suitability of the mirrors.)
>
>
> So, on to the other issues that were raised in the meeting.
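[Editor's note: before those other issues, the query-to-path mapping
described above can be made concrete with a small sketch. This is a
hypothetical re-implementation for illustration only; the names and data
layout are assumptions, not MirrorBrain's actual code.]

```python
import re

# One (name, allowed-pattern) pair per query argument, in the order the
# match groups appear in the MirrorBrainYumDir line quoted above.
ARG_PATTERNS = [
    ("release", r"5\.5"),
    ("repo", r"os|extras|addons|updates|centosplus|contrib"),
    ("arch", r"x86_64"),
]
BASE_TEMPLATE = "$1/$2/x86_64"        # $N refers to the N-th matched group
MARKER_FILE = "repodata/repomd.xml"   # must exist on eligible mirrors

def map_query(args):
    """Map yum's query arguments to (base_path, marker_path), or None
    if an argument is missing or not an allowed value."""
    groups = []
    for name, pattern in ARG_PATTERNS:
        value = args.get(name, "")
        if not re.fullmatch(pattern, value):
            return None
        groups.append(value)
    base = BASE_TEMPLATE
    for i, group in enumerate(groups, start=1):
        base = base.replace("$%d" % i, group)
    return base, base + "/" + MARKER_FILE

print(map_query({"release": "5.5", "repo": "os", "arch": "x86_64"}))
# -> ('5.5/os/x86_64', '5.5/os/x86_64/repodata/repomd.xml')
```

The redirector would then list only mirrors known to carry the marker
file under the resulting base path.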
>
> Summarizing what I heard, the following are the problems that you would
> like to solve:
>
> 1) scalability
> 2) cleaning up the historic DVD/non-DVD setup
> 3) partial mirroring
> 4) finer mirror selection (by prefix, autonomous system, state/region,
>    in addition to country/continent)
> 5) consistency problems
> 6) content verification
> 7) (presumably) backwards compatibility with existing installations
> 8) (maybe) satellite setups
>
>
> 1) scalability
>
> The dimensions are:
> - 70,000 files in 500 directories
> - >400 mirrors
> - 40 requests per second
>
> That sounds fine from my point of view. MirrorBrain has handled more
> files, and more requests. The number of mirrors I have run it with was
> smaller, 150 at most, but I wouldn't expect big problems. The little
> mirrorprobe that runs every minute might run into a system limit when
> starting 400 threads to check all mirrors at the same time, so maybe it
> needs to be tweaked, or changed to a different model, using a pool of
> threads or starting some processes as well.
>
>
> 2) cleaning up the historic DVD/non-DVD setup
>
> Sounds like a good idea :-)
>
>
> 3) partial mirroring
>
> Supported well by MirrorBrain.
>
>
> 4) finer mirror selection (by prefix, autonomous system, state/region,
>    in addition to country/continent)
>
> MirrorBrain uses BGP/routing data to find out the network prefix and AS
> of clients and mirrors, and matches them. Other criteria are GeoIP
> country and continent. The closest match is used for mirror selection.
> If there are several mirrors to choose from, a weighted randomization is
> also applied, to be able to give some mirrors more requests and others
> fewer. We talked in our meeting about the need for smarter selection in
> e.g. the US, where one doesn't want to be sent from one coast to the
> other. GeoIP regions were discussed for this.
> I considered going that route, but decided to implement a different
> concept, which I believe is more widely useful, because it also works
> when no mirror within the same state/region is found: using the
> geographical distance between the client and the mirrors. I just
> released this new feature into the wild:
> http://mirrorbrain.org/news/2140-takes-geographical-distances-account/
> You can try it out at
> http://download.services.openoffice.org/files/stable/3.2.1/OOo-SDK_3.2.1_Linux_x86-64_install-deb_en-US.tar.gz.mirrorlist
> and feedback is appreciated.
>
>
> 5) consistency problems
>
> Regarding problems with the consistency of trees on mirrors and the
> clients accessing them, this is indeed a hard problem to solve. From
> discussions with Fedora people I know that they also have/had major
> fights with it. It took me a long time to finally get this sorted out
> when I still worked on the openSUSE infrastructure. The following have
> proved useful for me in the past:
>
> - Always take care to set appropriate cache headers. must-revalidate
>   is the key, because it doesn't prevent caching, but causes clients
>   (and intermediaries) to always validate that a resource is still
>   fresh.
>
>   It is hopeless to get all mirrors to run the same configuration in
>   this regard, and there are also some FTP mirrors (and FTP doesn't have
>   a feature to control caching at all), so for certain content, there is
>   no other option than delivering it from defined places _with_ proper
>   headers. Luckily, this concerns mostly small metadata files.
>
>   This guards against the inconsistency that happens when things come
>   from different places (of different age). If cache control is not
>   exerted by the server (or client), intermediaries (web caches)
>   commonly "guess" how long they should deliver stuff from their cache
>   without revalidating its freshness.
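[Editor's note: the decision an intermediary cache makes can be modeled
roughly as follows. This is a toy model, not a full RFC 7234
implementation; the directive handling is deliberately simplified, and
the 4-hour heuristic window is an assumption for illustration.]

```python
def parse_cache_control(header):
    """Parse 'Cache-Control: a, b=1' into {'a': None, 'b': '1'}."""
    directives = {}
    for part in header.split(","):
        part = part.strip().lower()
        if part:
            name, _, value = part.partition("=")
            directives[name] = value or None
    return directives

def may_serve_without_revalidation(cache_control, age, heuristic=4 * 3600):
    """May a cache serve a copy of the given age (seconds) without
    checking back with the origin server?"""
    cc = parse_cache_control(cache_control)
    if "no-cache" in cc:
        return False
    if cc.get("max-age") is not None and age < int(cc["max-age"]):
        return True                 # still explicitly fresh
    if "must-revalidate" in cc:
        return False                # stale or unknown: must ask the origin
    # No usable freshness information: caches fall back to guessing
    # ("heuristic freshness") -- the unpredictable behaviour described
    # above.
    return age < heuristic

print(may_serve_without_revalidation("max-age=60, must-revalidate", 120))  # False
print(may_serve_without_revalidation("max-age=60, must-revalidate", 30))   # True
print(may_serve_without_revalidation("", 2 * 3600))                        # True (a guess)
```

Note that must-revalidate still allows serving fresh copies from cache;
it only forbids the guessing in the stale case, which is exactly the
property the mail relies on for metadata files.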
>   Typically, a squid assumes freshness for 4-18 hours by default, and
>   the exact time is hard to predict, because cache pruning is complex
>   and may take file size into account. Thus, it is inevitable that
>   clients see an inconsistent picture.
>
> - The second (and even more important) measure is to version metadata.
>   Actually, any data. Always and everywhere. With RPMs, one is in the
>   lucky situation that this is usually done anyway (reliably increasing
>   version/release numbers with each rebuild). Exceptions, like "MD5SUMS"
>   files, definitely need to be treated separately and should never be
>   redirected to a mirror, and not only for security reasons.
>   repo-md metadata, as used by Yum, exists in various incarnations.
>   Unfortunately, the ones I dealt with in the past were not versioned,
>   and files had names like "filelists.xml.gz", which leaves
>   non-redirection as the only 100% solution. (So I did that.)
>   Nowadays, at least the repo-md metadata that the Fedora and openSUSE
>   people build is versioned, as can be seen in this example:
>   http://download.opensuse.org/repositories/Apache:/MirrorBrain/Apache_openSUSE_11.3/repodata/
>   I suppose that createrepo does that these days. Anyway, this is
>   certainly a point where tight cooperation with (and appropriate input
>   from) the build system folks is very important.
>
> - A third line of "defense" can be a client that double-checks that it
>   doesn't get old metadata, by verifying with cryptohashes that the
>   download is the expected one, _and_ falls back to a different mirror
>   if it isn't. That's what Yum does, since MirrorManager sends
>   hashes/timestamps via Metalinks, and what Zypper does, since it uses
>   a Metalink client for all downloads, which allows it to fall back to
>   other mirrors until it gets the expected data. You won't be able to
>   do something fancy like that with CentOS 5, I guess, but maybe with
>   the next version.
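[Editor's note: the fallback logic of such a client boils down to: know
the expected piece hashes from a trusted source, try mirrors in order,
and keep the first download that verifies. The sketch below is
illustrative only; the tiny piece size and the "mirrors" are made up, and
real Metalink clients fetch hashes from the Metalink XML.]

```python
import hashlib

PIECE_SIZE = 4  # bytes, for the toy example; real setups use large pieces

def piece_hashes(data, piece_size=PIECE_SIZE):
    """SHA-256 of each fixed-size piece, Metalink-style."""
    return [hashlib.sha256(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]

def fetch_with_fallback(mirrors, fetch, expected):
    """Try mirrors in order; return the data from the first one whose
    content verifies against the expected piece hashes, else None."""
    for url in mirrors:
        data = fetch(url)
        if data is not None and piece_hashes(data) == expected:
            return data
    return None

# Toy "mirrors": the first serves stale content, the second is correct.
content = b"good-content"
expected = piece_hashes(content)
responses = {"http://a.example/f": b"stale-content",
             "http://b.example/f": content}
got = fetch_with_fallback(list(responses), responses.get, expected)
print(got)  # b'good-content'
```

The piece-wise hashes also enable the partial-content checks mentioned
later in the mail: a checker can download a single piece from each mirror
and compare it against the stored hash, instead of fetching whole DVDs.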
> (Actually, it's not that difficult to teach Yum to use a Metalink client
> -- I once tried it out, and it was a one-liner to replace its usage of
> python-urlgrabber with a call to aria2c (a powerful Metalink client) for
> all downloads. Another great option would be to extend
> python-urlgrabber to be a Metalink client.)
>
>
> That's what I learned, anyway... maybe some of it can be useful to you.
> Verifying 400 mirrors in realtime is not an option with our limited
> means, IMO -- simply not doable. Of course, if anyone knows how to do
> that, I am *very* interested :-)
>
>
> 6) content verification
>
> Regarding content verification: I don't know how exactly you currently
> check, but what can be done with MirrorBrain is:
> - There currently is a tool for downloading a file from one or all
>   mirrors and displaying a hash of it.
> - This obviously doesn't work well for huge files (DVDs), unless it's a
>   close, fast mirror.
> - Since recently, MirrorBrain can keep all hashes of all files in a
>   database. The hashes include block (piece-wise) hashes. It would be
>   fairly easy to fetch the hash of a random block (or a defined one) and
>   download just that piece from all mirrors. (Since the hashes in the
>   database are retrievable from everywhere, such checkers could in fact
>   also run in a _very_ distributed way.)
>   If you look at
>   http://download.documentfoundation.org/libreoffice/testing/3.3.0-beta2/rpm/x86_64/LibO_3.3.0_beta2_Linux_x86-64_install-rpm_en-US.tar.gz.mirrorlist
>   there is various metadata, including the block hashes inside the
>   linked IETF Metalink in the form of XML.
>
> I'm open (and happy) to implement more means of content verification.
> So far, I either didn't have more need for it, or time was lacking. But
> it would be very useful. I would just like to point out that I see a
> need for it mainly for debugging purposes, when something goes wrong,
> and not as a security measure. Content verification is too easy to spoof
> to be trusted significantly.
> It is much more important to give clients the top hash from a trusted
> source, maybe even over a TLS-encrypted web server, and to rely on
> cryptographic signatures for the rest (which is easy with RPM, luckily).
>
> In the context of file-tree consistency and content verification, I
> should note that verifying only certain critical files might not detect
> that a mirror is "half synced", and thus inconsistent. I think that
> running something after syncing is a smart way to discover the moment
> when the mirror is "ready". That's where MirrorManager is very clever.
>
> I wondered if there is a crucial file that can be used as a "marker" to
> determine whether a mirror is up to date or not. A timestamp file might
> work, but maybe there need to be several of them, in different parts of
> the tree, since some setups are complex and sync parts of the tree with
> different scripts. MirrorBrain can also download files from mirrors
> (to look at the timestamp content), but that said, one wouldn't
> necessarily want to disable a mirror that hasn't synced in a day if it
> is still up to date (because no new content has come, except new
> timestamps). Or how do you handle this?
>
> I was tossing around the idea of whether the mirror scanner should
> integrate such a timestamp check, maybe comparing the timestamp of a
> certain "marker" file on the mirror with the known timestamp in its
> database. But I'm not clear yet where this would lead and how it could
> be made useful.
>
> ...Maybe the mirror scanner should simply check all repodata/repomd.xml
> files in the tree frequently, comparing them with the current version.
> With the yum mirrorlist implementation described above, it would be easy
> to have only those mirrors end up on the lists that are known to have
> the current file.
>
>
> 7) backwards compatibility with existing installations
>
> I don't see an issue, once mirror lists work. However, I know far too
> little about CentOS.
> :-)
>
> BTW, one idea for the future that I would like to at least mention is
> that you could change Yum to contact a/the redirector for each request,
> instead of only at the beginning. I cannot judge whether that would be
> better or worse -- I have used Yum for many years, but always in that
> mode, and not with the mirror lists that you guys use. Anyway, that
> would give you more control over what Yum downloads from where, not
> least because of the ability to exert proper cache control. It's also
> good for security if critical hash files (those containing the top hash)
> are downloaded from a trusted server only.
>
>
> 8) (maybe) satellite setups
>
> Here I didn't get the details.
>
>
> Curious what you think about all this.
>
> Again, sorry sorry sorry for the long mail.
>
> Thanks,
> Peter
>
> _______________________________________________
> CentOS-mirror mailing list
> CentOS-mirror at centos.org
> http://lists.centos.org/mailman/listinfo/centos-mirror
>