[CentOS-mirror] IRC meeting regarding new mirroring system for CentOS

Mon Nov 8 09:02:08 UTC 2010
Peter Pöml <peter at poeml.de>

Hi everybody,

[resending, after realizing that I was subscribed with an old address]

On Wed, Oct 27, 2010 at 11:31:56PM +0200, Ralph Angenendt wrote:
> There is a wiki page for that process now. I put down the notes I took
> at the meeting for now. There's also a log of the IRC meeting, which I
> want to redact a bit first, as there is some off topic chatting in there
> (and several joins/leaves during the meeting). I won't have time for
> that before friday, though.
> 
> Here's the page, which will fill up with more information:
> 
> http://wiki.centos.org/InfraWiki/Mirrors
> 
> I like to thank the people who were there and gave us input about other
> solutions (and questioned why we do things like we do).
> 
> Regards,
> 
> Ralph

I would also like to thank you for the good meeting, and for
considering MirrorBrain.

This mail is very long - too long - and I apologize for that, but I
thought it would be good to provide a comprehensive overview of the
options that I see.

First off, I think you can't go wrong if you go with MirrorManager,
because it works for Fedora, and it already has support for the
somewhat special requirement that you have, namely yum mirror lists.
The similarity of Fedora and CentOS might make many things easier.
MirrorBrain doesn't have this yet, because none of its users has
needed it so far. As MirrorBrain tries to be a generic solution, it is
generally agnostic of project or metadata structure, and does
everything at the file level. That doesn't mean that support for
"special" features is unwanted, of course, especially if it can be
implemented in a way that fits into the concept and doesn't make
deployment more difficult for other users. It is certainly a nice
option - there are many Yum-based distros, after all.

(Background:
Being usable not only by Linux distros is a declared goal of the
MirrorBrain project, in order to get as many users (and potential
developers) on board and collaborating.

For a mirroring infrastructure, I believe that only collaboration
across organizational borders can yield a mature, flexible and
long-lived solution, and there are not really many people working on
this - only a handful. It would be cool to merge MirrorBrain and
MirrorManager somehow. It might be a lot of work, but useful in the
long term.)

Having said all that, I thought that Yum mirror lists should not be
hard to implement in MirrorBrain. I spent some time on it today and
got quite far; configuring the mapping of URL query arguments to
directories/files is done, and the actual mapping works. I chose
Apache config as the vehicle for it, and the following is a working
config:

MirrorBrainYumDir release=(5\.5) \
                  repo=(os|extras|addons|updates|centosplus|contrib) \
                  arch=x86_64 \
                  $1/$2/x86_64 repodata/repomd.xml

Here, $1/$2/x86_64 is the base URL of a repository, and the match
groups are replaced with what the client specified in the query
arguments. ($1 is the first group from the configuration line, $2 the
second, and so on. The names and number of query args are arbitrary.)
The last argument is a relative path to the file that must be present
on eligible mirrors. The resulting path here would be e.g.
5.5/os/x86_64/repodata/repomd.xml, and the client would get a list of
mirrors in the form of
http://mirror.example.com/path/to/centos/5.5/os/x86_64/
(That part is still missing, but it's the easiest part to implement :-)
So I'm confident that I can promise Yum mirror lists soon. Maybe I can
finish it this week, maybe the week after - I don't know.
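
To illustrate the mapping (just a rough Python sketch of the idea, not
the actual MirrorBrain code; the mirror base URLs are invented):

# Rough sketch of the query-arg -> path -> mirror list mapping
# described above. Not the MirrorBrain implementation; the mirror
# base URLs are made up.
import re

RELEASE_RE = re.compile(r'^(5\.5)$')
REPO_RE    = re.compile(r'^(os|extras|addons|updates|centosplus|contrib)$')
ARCH_RE    = re.compile(r'^(x86_64)$')

# hypothetical mirrors that carry the tree
MIRRORS = [
    'http://mirror1.example.com/pub/centos/',
    'http://mirror2.example.org/centos/',
]

def mirrorlist(release, repo, arch):
    """Validate the query args and return repository base URLs."""
    checks = ((RELEASE_RE, release), (REPO_RE, repo), (ARCH_RE, arch))
    for regex, value in checks:
        if not regex.match(value):
            raise ValueError('invalid argument: %r' % (value,))
    # corresponds to "$1/$2/x86_64" in the Apache config above
    relpath = '%s/%s/%s' % (release, repo, arch)
    # a real implementation would list only those mirrors known to
    # have relpath + '/repodata/repomd.xml'
    return ['%s%s/' % (m, relpath) for m in MIRRORS]

if __name__ == '__main__':
    for url in mirrorlist('5.5', 'os', 'x86_64'):
        print(url)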

Meanwhile, I would appreciate input from you: is this reasonable? Would
it serve your needs?

If it does, I think the only feature missing in MirrorBrain for your
use case would be sorted out.

(Needless to say, the mirror list that yum gets will be sorted by the
suitability of the mirrors.)


So, on to the other issues that were raised in the meeting.

Summarizing what I heard, the following are the problems that you would
like to solve: 

1) scalability
2) cleaning up the historic DVD/nonDVD setup
3) partial mirroring
4) finer mirror selection (by prefix, autonomous system, state/region, in
   addition to country/continent)
5) consistency problems
6) content verification
7) (presumably) backwards compatibility to existing installations
8) (maybe) satellite setups


1) scalability

The dimensions are:
- 70,000 files in 500 directories
- >400 mirrors
- 40 requests per second

That sounds fine from my point of view. MirrorBrain has handled more
files, and more requests. The number of mirrors I have run it with was
smaller, 150 at most, but I wouldn't expect big problems. The little
mirrorprobe that runs every minute might run into a system limit when
it starts 400 threads to check all mirrors at the same time, so maybe
it needs to be tweaked, or changed to a different model, using a pool
of threads or starting some processes as well.
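
Just to sketch what I mean with a pool of threads (a hypothetical
example, not the actual mirrorprobe code; the mirror URLs are
invented):

# Probe many mirrors with a bounded pool of threads instead of one
# thread per mirror. Hypothetical sketch, not the mirrorprobe code;
# the mirror URLs are made up.
from concurrent.futures import ThreadPoolExecutor
import urllib.request

MIRRORS = [
    'http://mirror1.example.com/centos/',
    'http://mirror2.example.org/centos/',
    # ... several hundred more
]

def probe(base_url, timeout=10):
    """Return (url, True/False) depending on whether the mirror responds."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return base_url, resp.status < 400
    except OSError:
        return base_url, False

with ThreadPoolExecutor(max_workers=50) as pool:  # 50 threads, not 400
    for url, ok in pool.map(probe, MIRRORS):
        print('%-45s %s' % (url, 'ok' if ok else 'DEAD'))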


2) cleaning up the historic DVD/nonDVD setup

Sounds like a good idea :-)

3) partial mirroring

Supported well by MirrorBrain.


4) finer mirror selection (by prefix, autonomous system, state/region, in
   addition to country/continent)

MirrorBrain uses BGP/routing data to find out the network prefix and
AS of clients and mirrors, and matches them. Other criteria are GeoIP
country and continent. The closest match is used for mirror selection.
If there are several mirrors to choose from, a weighted randomization
is also applied, so that some mirrors can be given more requests and
others fewer. We talked in our meeting about the need for a smarter
selection in e.g. the US, where one doesn't want to be sent from one
coast to the other. GeoIP regions were discussed for this. I
considered going that route, but decided to implement a different
concept, which I believe is more widely useful because it also works
when no mirror within the same state/region is found: using the
geographical distance between the client and the mirrors. I just
released this new feature into the wild:
http://mirrorbrain.org/news/2140-takes-geographical-distances-account/
You can try it out here:
http://download.services.openoffice.org/files/stable/3.2.1/OOo-SDK_3.2.1_Linux_x86-64_install-deb_en-US.tar.gz.mirrorlist
Feedback is appreciated.
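
For illustration, the basic idea of ranking mirrors by geographical
distance and handing out requests with weighted randomization could
look like the following sketch (coordinates, weights and URLs are
invented; this is not the actual MirrorBrain code):

# Distance-based mirror ranking with weighted randomization.
# Coordinates, weights and URLs are made up; not MirrorBrain code.
import math
import random

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2)
           * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

# hypothetical mirrors: (url, latitude, longitude, weight)
MIRRORS = [
    ('http://mirror-east.example.com/centos/', 40.7, -74.0, 100),
    ('http://mirror-west.example.org/centos/', 37.8, -122.4, 50),
]

def pick(client_lat, client_lon):
    def dist(m):
        return distance_km(client_lat, client_lon, m[1], m[2])
    # sort candidates by distance to the client
    ranked = sorted(MIRRORS, key=dist)
    # among the closest few, hand out requests according to weight
    closest = ranked[:2]
    urls = [m[0] for m in closest]
    weights = [m[3] for m in closest]
    return random.choices(urls, weights=weights)[0]

print(pick(38.9, -77.0))  # a client near the US east coast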



5) consistency problems

Regarding problems with consistency of trees on mirrors / clients
accessing them, this is indeed a hard problem to solve. From discussions
with Fedora people I know that they also have/had major fights with
that. It took me a long time to finally get this sorted out when I still
worked on the openSUSE infrastructure. The following have proved useful
for me in the past:

- Always take care to set appropriate cache headers. must-revalidate
  is the key, because it doesn't prevent caching, but causes clients
  (and intermediaries) to always validate that a resource is still
  fresh.

  It is hopeless to get all mirrors to run the same configuration in
  this regard, and there are also some FTP mirrors (and FTP doesn't
  have a feature to control caching at all), so for certain content
  there is no other option than delivering it from defined places
  _with_ proper headers. Luckily, this mostly concerns small metadata
  files. (A small checker like the sketch after this list can help
  verify what a given server actually sends.)

  This guards against the inconsistency that arises when things come
  from different places (with different ages). If cache control is not
  exerted by the server (or client), intermediaries (web caches)
  commonly "guess" how long they should deliver content from their
  cache without revalidating its freshness. Typically, a squid assumes
  freshness for 4-18 hours by default, and the exact time is hard to
  predict, because cache pruning is complex and may take file size
  into account. Thus, it is inevitable that clients see an
  inconsistent picture.

- The second (and even more important) measure is to version metadata.
  Actually, any data - always and everywhere. With RPMs, one is in the
  lucky situation that this is usually done anyway (reliably
  increasing version/release numbers with each rebuild). Exceptions
  like "MD5SUMS" definitely need to be treated separately and should
  never be redirected to a mirror, and not only for security reasons.
  repo-md metadata, as used by Yum, exists in various incarnations.
  Unfortunately, the ones I dealt with in the past were not versioned,
  and files had names like "filelists.xml.gz", which leaves
  non-redirection as the only 100% solution. (So that's what I did.)
  Nowadays, at least the repo-md metadata that the Fedora and openSUSE
  people build is versioned, as can be seen in this example:
  http://download.opensuse.org/repositories/Apache:/MirrorBrain/Apache_openSUSE_11.3/repodata/
  I suppose that createrepo does that these days. Anyway, this is
  certainly a point where tight cooperation with (and appropriate
  input from) the build system folks is very important.

- A third line of "defense" can be a client that double-checks that it
  doesn't get old metadata, by using cryptographic hashes to verify
  that the download is the expected one, _and_ falls back to a
  different mirror if it isn't. That's what Yum does, since
  MirrorManager sends hashes/timestamps via Metalinks, and what Zypper
  does, since it uses a Metalink client for all downloads, which
  allows it to fall back to other mirrors until it gets the expected
  data. You won't be able to do something fancy like that with
  CentOS 5, I guess, but maybe with the next version.
  (Actually, it's not that difficult to teach Yum to use a Metalink
  client -- I once tried it out, and it was a one-liner to replace its
  usage of python-urlgrabber with a call to aria2c (a powerful
  Metalink client) for all downloads. Another great option would be to
  extend python-urlgrabber to be a Metalink client.)
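
As an aside, and related to the first point about cache headers:
checking what a given server actually sends for the small metadata
files is easy to script. A minimal sketch (the URL is just an
example):

# Check whether a server sends sensible cache headers (e.g.
# must-revalidate) for a metadata file. The URL is just an example.
import urllib.request

url = ('http://mirrors.example.com/centos/'
       '5.5/os/x86_64/repodata/repomd.xml')

req = urllib.request.Request(url, method='HEAD')
with urllib.request.urlopen(req, timeout=10) as resp:
    cc = resp.headers.get('Cache-Control', '<none>')
    print('Cache-Control:', cc)
    if 'must-revalidate' not in cc:
        print('warning: clients/proxies may serve stale copies')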
  

That's what I learnt, anyway... maybe some of it can be useful to you.
Verifying 400 mirrors in realtime is not an option with our limited
means, IMO -- simply not doable. Of course, if anyone knows how to do
that, I am *very* interested :-)


6) content verification

Regarding content verification: I don't know exactly how you currently
check, but what can be done with MirrorBrain is:
- There currently is a tool for downloading a file from one or all
  mirrors and displaying a hash of it.
- This obviously doesn't work well for huge files (DVDs), unless it's
  a close, fast mirror.
- Since recently, MirrorBrain can keep all hashes of all files in a
  database. The hashes include block (piece-wise) hashes. It would be
  fairly easy to fetch the hash of a random block (or a defined one)
  and download just that piece from all mirrors (see the sketch after
  this list). Since the hashes in the database are retrievable from
  everywhere, such checkers could in fact also run in a _very_
  distributed fashion.
  If you look at
  http://download.documentfoundation.org/libreoffice/testing/3.3.0-beta2/rpm/x86_64/LibO_3.3.0_beta2_Linux_x86-64_install-rpm_en-US.tar.gz.mirrorlist
  you see various metadata, including the block hashes inside the
  linked IETF Metalink (in XML form).
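
To make the block-hash idea a bit more concrete: fetching a single
piece with an HTTP Range request and hashing it could look like this
sketch (URL, offsets and the expected hash are invented; in practice
they would come from the hash database or the Metalink):

# Verify a single piece of a large file on a mirror via an HTTP Range
# request. URL, offsets and expected hash are invented; in practice
# they would come from the hash database / the Metalink.
import hashlib
import urllib.request

url = ('http://mirror.example.com/centos/5.5/isos/x86_64/'
       'CentOS-5.5-x86_64-bin-DVD-1of2.iso')
piece_length = 262144
piece_offset = piece_length * 42   # start of piece 42 (made up)
expected_sha1 = '0000000000000000000000000000000000000000'  # placeholder

req = urllib.request.Request(url)
req.add_header('Range', 'bytes=%d-%d' % (piece_offset,
                                         piece_offset + piece_length - 1))
with urllib.request.urlopen(req, timeout=30) as resp:
    piece = resp.read()

digest = hashlib.sha1(piece).hexdigest()
print('ok' if digest == expected_sha1 else 'MISMATCH: got %s' % digest)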

I'm open (and happy) to implement more means of content verification.
So far, I either didn't have more need for it, or time was lacking.
But it would be very useful. I would just like to point out that I see
a need for it mainly for debugging purposes, when something goes
wrong, and not as a security measure. Content verification is too easy
to spoof to be trusted significantly. It is much more important to
give clients the top hash from a trusted source, maybe even from a
TLS-encrypted web server, and to rely on cryptographic signatures for
the rest (which is easy with RPM, luckily).

In the context of file-tree consistency and content verification, I
should note that verifying only certain critical files might not
detect that a mirror is "half synced", and thus inconsistent. I think
that running something after syncing is a smart way to discover the
moment when a mirror is "ready". That's where MirrorManager is very
clever.

I wondered if there is a crucial file that can be used as a "marker"
to determine whether a mirror is up to date or not. A timestamp file
might work, but maybe there would need to be several of them, in
different parts of the tree, if some setups are complex and sync parts
of the tree with different scripts. MirrorBrain can also download
files from mirrors (to look at the timestamp contents), but that said,
one wouldn't necessarily want to disable a mirror that hasn't synced
for a day if it is still up to date (because no new content has
arrived, except new timestamps). Or how do you handle this?

I have been tossing around the idea of whether the mirror scanner
should integrate such a timestamp check, maybe comparing the timestamp
of a certain "marker" file on the mirror with the known timestamp in
its database. But I'm not yet clear where this would lead and how it
could be made useful.

...Maybe the mirror scanner should simply check all repodata/repomd.xml
files in the tree frequently, comparing them with the current version.
With the yum mirrorlist implementation described above, it would be
easy to let only those mirrors end up on the lists that are known to
have the current file.
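
A rough sketch of such a check, comparing a mirror's repomd.xml with
the master copy (hostnames and paths are invented):

# Compare a mirror's repomd.xml with the master copy to decide whether
# the mirror should appear on the mirrorlist for that repository.
# Hostnames and paths are invented.
import hashlib
import urllib.request

MASTER = 'http://master.example.org/centos/'
MIRROR = 'http://mirror.example.com/centos/'
REPO = '5.5/os/x86_64/repodata/repomd.xml'

def digest(url):
    with urllib.request.urlopen(url, timeout=10) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

if digest(MASTER + REPO) == digest(MIRROR + REPO):
    print('mirror has the current repodata for', REPO)
else:
    print('mirror is stale for', REPO, '- keep it off the mirrorlist')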


7) backwards compatibility to existing installations

I don't see an issue, once mirror lists work. However, I know far too
little about CentOS. :-)

BTW, one idea for the future that I would like to at least mention is
that you could change Yum to contact a/the redirector for each
request, instead of only at the beginning. I cannot judge whether that
would be better or worse -- I have used Yum for many years, but always
in that mode, and not with the mirror lists that you guys use. Anyway,
it would give you more control over what Yum downloads from where, not
least because of the ability to exert proper cache control. It's also
good for security if critical hash files (those containing the top
hash) are downloaded from a trusted server only.



8) (maybe) satellite setups

Here I didn't get the details. 


Curious what you think about all this.

Again, sorry sorry sorry for the long mail.

Thanks,
Peter