[CentOS-mirror] IRC meeting regarding new mirroring system for CentOS

For me as a mirror admin, the only feature I don't like about
MirrorBrain is that I don't have the ability to log in and "check on"
or admin my mirror.

I mirror for a few different distros, and Ubuntu's mirror manager is
quite poor as well.  I have an account, but can't get to it.  When I
fail a test of some sort, I get a not-very-useful e-mail, and no way
to get more info on what happened.  I usually end up just "waiting it
out".  It would be nice to get an e-mail alerting me that something
is wrong, and then be able to log in and see what happened.

I also like being able to specify some IP ranges I'm authoritative
for.  As my mirror is on a university campus, I'd love to be able to
enter my campus' IP ranges, and that way ensure that all my campus
gets my mirror.  So far, none of the OSes I mirror for (I don't mirror
Fedora presently) allows me to do that.
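
To illustrate the kind of check I have in mind, here is a minimal Python
sketch. The ranges are documentation-only placeholder prefixes, and the
function name is made up for this example; it is not taken from any
existing mirror manager.

```python
import ipaddress

# Hypothetical campus prefixes (documentation-only address blocks).
CAMPUS_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]

def is_campus_client(client_ip, ranges=CAMPUS_RANGES):
    """Return True if client_ip falls inside any declared range."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv4 address is never "in" an IPv6 network (and vice versa),
    # so a mixed list of ranges is fine.
    return any(addr in net for net in ranges)
```

A redirector could run such a test first and hand out the campus mirror
whenever it returns True, falling back to its normal selection otherwise.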

MirrorBrain sounds like it has a lot of this functionality, but it is
only available to the distro managers.  They're busy people; I'd rather
not bother them if I can handle stuff myself.

--Jim

On Mon, Nov 8, 2010 at 1:02 AM, Peter Pöml <peter at poeml.de> wrote:
> Hi everybody,
>
> [resending, after realizing that I was subscribed with an old address]
>
> On Wed, Oct 27, 2010 at 11:31:56PM +0200, Ralph Angenendt wrote:
>> There is a wiki page for that process now. I put down the notes I took
>> at the meeting for now. There's also a log of the IRC meeting, which I
>> want to redact a bit first, as there is some off topic chatting in there
>> (and several joins/leaves during the meeting). I won't have time for
>> that before friday, though.
>>
>> Here's the page, which will fill up with more information:
>>
>> http://wiki.centos.org/InfraWiki/Mirrors
>>
>> I'd like to thank the people who were there and gave us input about other
>> solutions (and questioned why we do things like we do).
>>
>> Regards,
>>
>> Ralph
>
> I would also like to thank you for the good meeting, and also for
> considering MirrorBrain.
>
> This mail is very long -too long-, which I would like to apologize for,
> but I thought it would be good to provide a comprehensive overview of
> the options that I see.
>
> First off, I think you can't go wrong if you go with MirrorManager,
> because it works for Fedora, and it already has support for the somewhat
> more special requirement that you have, which is yum mirror lists. The
> similarity of Fedora and CentOS might make many things easier.
> MirrorBrain doesn't have this yet, because none of its users needed it
> so far. As MirrorBrain tries to be a generic solution, it is generally
> agnostic of project or metadata structure, and does everything on file
> level. That doesn't mean that support for "special" features is
> unwanted, of course. Especially if it can be implemented in a way that
> it fits into the concept, and doesn't make deployment for other users
> more difficult. It is certainly a nice option - there are many Yum-based
> distros, after all.
>
> (background:
> Being usable not only by Linux distros is a declared goal of the
> MirrorBrain project, in order to get as many users (and potential
> developers) into the boat and collaborate.
>
> For a mirroring infrastructure, I believe that only collaboration across
> organization borders can yield a mature, flexible and long-lived
> solution. And there are not really many people working on this, only a
> handful. It would be cool to merge MirrorBrain and MirrorManager
> somehow. Might be a lot of work but useful in the long-term.
> )
>
> Having said all that, I thought that Yum mirrorlist in MirrorBrain
> should not be hard to implement. I spent some time on it today and got
> quite far; configuring mapping of URL query arguments to
> directories/files is done, and actual mapping works. I chose Apache
> config as vehicle for that, and the following is a working config:
>
> MirrorBrainYumDir release=(5\.5) \
>                  repo=(os|extras|addons|updates|centosplus|contrib) \
>                  arch=x86_64 \
>                  $1/$2/x86_64 repodata/repomd.xml
>
> For instance, $1/$2/x86_64 is the base URL to a repository, and the match
> groups can optionally be replaced with what the client specified in the
> query arguments. ($1 is the first group from the configuration line, $2
> the second, and so on. The names and number of query args are all
> arbitrary.)
> The last argument is a relative path, and the file that must be present
> on eligible mirrors. The resulting path here would be e.g.
> 5.5/os/x86_64/repodata/repomd.xml, and the client would get a list of
> mirrors in the form of
> http://mirror.example.com/path/to/centos/5.5/os/x86_64/
> (That's what's missing to be implemented, but it's the easiest part :-)
> So I'm confident that I can promise Yum mirror lists soon. Maybe I can
> finish it this week, maybe the week after, I don't know.
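
The mapping described above can be sketched in Python roughly as follows.
This is only an illustration of the idea, not MirrorBrain's code: the RULE
structure and the function names are assumptions made for this example,
and only the patterns and paths come from the config line quoted above.

```python
import re

# Hypothetical rule mirroring the MirrorBrainYumDir example: each query
# argument must fully match its pattern; groups fill $1, $2, ... in order.
RULE = {
    "args": [("release", r"(5\.5)"),
             ("repo", r"(os|extras|addons|updates|centosplus|contrib)"),
             ("arch", r"x86_64")],
    "base": "$1/$2/x86_64",
    "marker": "repodata/repomd.xml",
}

def resolve(query, rule=RULE):
    """Return (base_dir, marker_path) for a query dict, or None if no match."""
    groups = []
    for name, pattern in rule["args"]:
        m = re.fullmatch(pattern, query.get(name, ""))
        if m is None:
            return None
        groups.extend(m.groups())
    base = rule["base"]
    # Substitute higher-numbered placeholders first so $1 never clobbers $10.
    for i in range(len(groups), 0, -1):
        base = base.replace("$%d" % i, groups[i - 1])
    return base, base + "/" + rule["marker"]

def mirrorlist(query, mirror_base_urls):
    """Build the list of repository URLs a yum client would receive."""
    resolved = resolve(query)
    if resolved is None:
        return []
    base, _marker = resolved
    return ["%s/%s/" % (url.rstrip("/"), base) for url in mirror_base_urls]
```

In a real redirector, mirror_base_urls would contain only mirrors known to
hold the marker file, already sorted by suitability.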
>
> Meanwhile, I would appreciate input from you: is this reasonable? Would
> it serve your needs?
>
> If it does, I think the only feature missing in MirrorBrain for you
> would be sorted out.
>
> (Needless to say that the mirror list that yum gets will be sorted by
> suitability of the mirrors)
>
>
> So, on to the other issues that were raised in the meeting.
>
> Summarizing what I heard, the following are the problems that you would
> like to solve:
>
> 1) scalability
> 2) cleaning up the historic DVD/nonDVD setup
> 3) partial mirroring
> 4) finer mirror selection (by prefix, autonomous system, state/region, in
>   addition to country/continent)
> 5) consistency problems
> 6) content verification
> 7) (presumably) backwards compatibility to existing installations
> 8) (maybe) satellite setups
>
>
> 1) scalability
>
> The dimensions are:
> - 70,000 files in 500 directories
> - >400 mirrors
> - 40 requests per second
>
> Sounds fine from my point of view. MB has handled more
> files, and more requests. The number of mirrors I have run it with was
> smaller, 150 at most, but I wouldn't expect big problems. The little
> mirrorprobe that runs every minute might run into a system limit when
> starting 400 threads, to check all mirrors at the same time, so maybe it
> needs to be tweaked, or changed to a different model, using a pool of
> threads or starting some processes as well.
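
The "pool of threads" model mentioned above could look like this minimal
Python sketch. The function name, the check callback and the default of 32
workers are assumptions for illustration; this is not how the mirrorprobe
is actually written.

```python
from concurrent.futures import ThreadPoolExecutor

def probe_all(mirrors, check, max_workers=32):
    """Run check(mirror) for every mirror with at most max_workers threads.

    Returns a dict mapping mirror -> True/False; a check that raises
    counts as a failed probe.
    """
    mirrors = list(mirrors)

    def safe_check(mirror):
        try:
            return bool(check(mirror))
        except Exception:
            return False

    # The pool caps concurrency, so 400 mirrors never mean 400 threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_check, mirrors))
    return dict(zip(mirrors, results))
```

The check callback would typically do a short HTTP request with a timeout,
so that one slow mirror only ties up one worker.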
>
>
> 2) cleaning up the historic DVD/nonDVD setup
>
> Sounds like a good idea :-)
>
> 3) partial mirroring
>
> Supported well by MirrorBrain.
>
>
> 4) finer mirror selection (by prefix, autonomous system, state/region, in
>   addition to country/continent)
>
> MirrorBrain uses BGP/routing data to find out the network prefix and AS
> of clients and mirrors, and matches them. Other criteria are GeoIP
> country and continent. The closest match is used for mirror selection.
> If several mirrors are there to choose from, a weighted randomization is
> also applied, to be able to give some mirrors more requests and others
> less. We talked in our meeting about the need for a smarter selection in
> e.g. the US, where one doesn't want to be sent from one coast to the
> other. GeoIP regions were discussed for this. I considered going that
> route, but decided to implement a different concept, which I believe is
> more widely useful, because it works also when no mirror within the same
> state/region is found: using geographical distance between the client and
> the mirrors. I just released this new feature into the wild:
> http://mirrorbrain.org/news/2140-takes-geographical-distances-account/
> You can try it out
> http://download.services.openoffice.org/files/stable/3.2.1/OOo-SDK_3.2.1_Linux_x86-64_install-deb_en-US.tar.gz.mirrorlist
> and feedback is appreciated.
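
As a rough illustration of distance-based selection combined with weighted
randomization, here is a Python sketch. It is not MirrorBrain's actual
algorithm: the haversine formula, the 500 km slack and the dict keys are
all assumptions chosen for this example.

```python
import math
import random

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def pick_mirror(client, mirrors, slack_km=500.0, rng=random):
    """Pick among mirrors within slack_km of the nearest one, by weight.

    client is (lat, lon); each mirror is a dict with the hypothetical
    keys "name", "lat", "lon" and "weight".
    """
    dist = [(haversine_km(client[0], client[1], m["lat"], m["lon"]), m)
            for m in mirrors]
    nearest = min(d for d, _ in dist)
    candidates = [m for d, m in dist if d <= nearest + slack_km]
    weights = [m["weight"] for m in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Unlike a state/region lookup, this always yields a candidate: if nothing is
nearby, the nearest mirror (plus anything within the slack) is still chosen.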
>
>
>
> 5) consistency problems
>
> Regarding problems with consistency of trees on mirrors / clients
> accessing them, this is indeed a hard problem to solve. From discussions
> with Fedora people I know that they also have/had major fights with
> that. It took me a long time to finally get this sorted out when I still
> worked on the openSUSE infrastructure. The following have proved useful
> for me in the past:
>
> - Always take care of setting appropriate cache headers. must-revalidate
>  is the key, because it doesn't prevent caching, but causes clients
>  (and intermediaries) to always validate that a resource is still
>  fresh.
>
>  It is hopeless to get all mirrors to run the same configuration in
>  this regard, and there are also some FTP mirrors (and FTP doesn't have
>  a feature to control caching at all), so for certain content, there is
>  no other option than delivering it from defined places _with_ proper
>  headers. Luckily, this concerns mostly small metadata files.
>
>  This guards against the inconsistency that arises when things come from
>  different places (with different age). If cache control is not exerted by
>  the server (or client), intermediaries (web caches) commonly "guess"
>  how long they should deliver stuff from their cache without
>  revalidating freshness. Typically, a squid assumes freshness for 4-18
>  hours by default, and the exact time is hard to predict, because cache
>  pruning is complex and may take file size into account. Thus, it is
>  inevitable that clients see an inconsistent picture.
>
> - The second (and even more important) measure is to version metadata.
>  Actually, any data. Always and Everywhere. With RPMs, one is in the
>  lucky situation that this is usually done anyway (reliably increasing
>  version/release numbers with each rebuild). Exception files like
>  "MD5SUMS" definitely need to be treated separately and should never be
>  redirected to a mirror, not only for security reasons.
>  repo-md metadata, as used by Yum, exists in various incarnations.
>  Unfortunately, the ones I dealt with in the past were not versioned,
>  and files had names like "filelists.xml.gz", which leaves
>  non-redirection as the only 100% solution. (So I did that.)
>  Nowadays, at least the repo-md metadata that the Fedora and openSUSE
>  people build is versioned, as can be seen in this example:
>  http://download.opensuse.org/repositories/Apache:/MirrorBrain/Apache_openSUSE_11.3/repodata/
>  I suppose that createrepo does that these days. Anyway, this is
>  certainly a point where tight cooperation (and appropriate input) with
>  the build system folks is very important.
>
> - A third line of "defense" can be a client that double-checks itself
>  that it doesn't get old metadata, by checking with cryptohashes if the
>  download is the expected one, _and_ falls back to a different mirror
>  if it isn't the case. That's what Yum does, since MirrorManager sends
>  hashes/timestamps via Metalinks, and what Zypper does, since it uses a
>  Metalink client for all downloads that allows it to fall back to other
>  mirrors until it gets the expected data. You won't be able to do
>  something fancy like that with CentOS 5 I guess, but maybe with the
>  next version.
>  (Actually, it's not that difficult to teach Yum to use a Metalink client
>  -- I once tried it out, and it was a one-liner to replace its usage of
>  python-urlgrabber with a call to aria2c (powerful Metalink client) for
>  all downloads. Another great option would be to extend
>  python-urlgrabber to be a Metalink client.)
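
The third line of defense (verify the hash, fall back to another mirror)
can be sketched in Python as below. This is a generic illustration, not
what Yum, Zypper or aria2c do internally; the function name, the fetch
callback and the choice of SHA-256 are assumptions for this example.

```python
import hashlib

def fetch_verified(path, mirrors, expected_sha256, fetch):
    """Try mirrors in order until one returns data matching the hash.

    fetch(mirror, path) is a caller-supplied function returning bytes
    (e.g. a thin wrapper around urllib.request.urlopen). Returns a
    (mirror, data) tuple, or raises IOError if no mirror matches.
    """
    for mirror in mirrors:
        try:
            data = fetch(mirror, path)
        except OSError:
            continue  # unreachable mirror: try the next one
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return mirror, data
        # stale or corrupt copy: fall through to the next mirror
    raise IOError("no mirror delivered the expected content for " + path)
```

The expected hash has to come from a trusted place (the redirector or a
signed file), never from the mirror that is being checked.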
>
>
> That's what I learnt anyway... maybe some of it can be useful to you.
> Verifying 400 mirrors in realtime is not an option, with our limited means,
> IMO -- simply not doable. Of course, if anyone knows how to do that, I
> am *very* interested :-)
>
>
> 6) content verification
>
> Regarding content verification: I don't know how you currently check
> exactly, but what can be done with MirrorBrain is:
> - there currently is a tool for downloading a file from one or all mirrors
>  and displaying a hash of it.
> - this obviously doesn't work well for huge files (DVDs) (unless the
>  mirror is close and fast).
> - since recently, MB can keep all hashes of all files in a database. The
>  hashes include block (piece-wise) hashes. It would be fairly easy
>  to fetch the hash of a random block (or a defined one) and download
>  just that piece from all mirrors. (Since the hashes in the database
>  are retrievable from everywhere, such checkers could also run _very_
>  distributed in fact.)
>  If you look at
>  http://download.documentfoundation.org/libreoffice/testing/3.3.0-beta2/rpm/x86_64/LibO_3.3.0_beta2_Linux_x86-64_install-rpm_en-US.tar.gz.mirrorlist
>  there is various metadata, including the block hashes inside the
>  linked IETF Metalink in the form of XML.
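
The piece-wise check described above might be sketched in Python like this.
It is an illustration only: the function names and the fetch_range callback
are assumptions, SHA-1 is an arbitrary choice here, and the sketch does not
read hashes from MirrorBrain's database or from a Metalink.

```python
import hashlib
import random

def piece_hashes(data, piece_size):
    """Return the SHA-1 hex digest of each piece_size-byte piece of data."""
    return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
            for i in range(0, len(data), piece_size)]

def piece_range(index, piece_size, total_size):
    """Inclusive byte range (first, last) of piece number index."""
    first = index * piece_size
    last = min(first + piece_size, total_size) - 1
    return first, last

def check_random_piece(known_hashes, piece_size, total_size, fetch_range, rng=random):
    """Fetch one randomly chosen piece and compare it with the known hash.

    fetch_range(first, last) is caller-supplied and returns those bytes,
    e.g. via an HTTP request with a "Range: bytes=first-last" header.
    """
    index = rng.randrange(len(known_hashes))
    first, last = piece_range(index, piece_size, total_size)
    piece = fetch_range(first, last)
    return hashlib.sha1(piece).hexdigest() == known_hashes[index]
```

Only one piece per mirror is transferred, so even a DVD image can be
spot-checked on a distant mirror at low cost.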
>
> I'm open (and happy) to implement more means of content verification. So
> far, I either didn't have more need for it, or time was lacking. But it
> would be very useful. I just would like to point out that I see a need
> for it mainly for debugging purposes, when something goes wrong, and not
> as a security measure.  Content verification is too easy to spoof to
> place significant trust in it. It is much more important to give clients the
> top hash from a trusted source, maybe even over a TLS-encrypted web
> server, and rely on cryptographic signatures for the rest (which is easy
> with RPM, luckily).
>
> In the context of file-tree consistency and content verification, I
> should note that verifying only certain critical files might not rule out
> that a mirror is "half synced", and thus inconsistent. I think that
> running something after syncing is a smart way to discover the moment
> when the mirror is "ready". That's where MirrorManager is very clever.
>
> I wondered if there is a crucial file that can be used as "marker" to
> determine whether a mirror is up to date or not. A timestamp file might
> work, but maybe there need to be several of them, in different parts of
> the tree, if some setups are complex and sync parts of the tree with
> different scripts. MirrorBrain can also download files from mirrors
> (to look at the timestamp content), but that said, one wouldn't
> necessarily want to disable a mirror that hasn't synced for a day if it
> is still up to date (when no new content has come, except new
> timestamps). Or how do you handle this?
>
> I was tossing around the idea whether the mirror scanner should
> integrate such a timestamp check, maybe comparing the timestamp of a
> certain "marker" file on the mirror with the known timestamp in its
> database. But I'm not clear yet where this would lead and how it could
> be made useful.
>
> ...Maybe the mirror scanner should simply check all repodata/repomd.xml
> files in the tree frequently, comparing with the current version. With
> the yum mirrorlist implementation described above, it would be easy to
> have only mirrors end up on the lists that are known to have the current
> file.
>
>
> 7) backwards compatibility to existing installations
>
> I don't see an issue, once mirror lists work. However, I know much too
> little about CentOS. :-)
>
> BTW, one idea for the future, that I would like to at least mention, is
> that you could change Yum to contact a/the redirector for each request,
> instead of only in the beginning. I cannot judge if that would be better
> or worse -- I have used Yum for many years, but always in that mode, and not
> with the mirror lists that you guys use. Anyway, that would give you
> more control over what Yum downloads where, not least because of the
> ability to exert proper cache control. It's also good for security if
> critical hash files (those containing the top hash) are downloaded from
> a trusted server only.
>
>
>
> 8) (maybe) satellite setups
>
> Here I didn't get the details.
>
>
> Curious what you think about all this.
>
> Again, sorry sorry sorry for the long mail.
>
> Thanks,
> Peter
>