[CentOS-devel] Getting content to mirrors faster

Wed Mar 1 13:44:50 UTC 2017
Anssi Johansson <centos at miuku.net>

We have 600+ active CentOS mirrors[1] and sometimes there are critical 
updates that would need to get published to the mirror network as 
quickly as possible. As of now, it takes around five hours to go from 0 
up-to-date mirrors to 80% up-to-date mirrors, with a longer tail for the 
remaining 20%.

Current guidelines for setting up a mirror[2] specify that mirrors 
should sync 2-4 times per day. Instead of telling mirrors to sync hourly 
I thought we could come up with something smarter.

One option that I have considered would be something similar to what the 
ClamAV guys use to signal end users that there is new content. They use 
a DNS TXT record for that purpose. For example, as of this writing "dig 
txt +short current.cvd.clamav.net" produces 
"0.99.2:57:23148:1488371340:1:63:45637:290", which shows the version 
numbers for ClamAV itself, main virus database, daily virus database, 
timestamp and other version numbers.

We could have something similar, showing the timestamps when the content 
for CentOS, CentOS AltArch and CentOS Vault was last modified, like 
"1488372781:1487767981:1488113581". "Last modified" in the sense that 
new packages got added at that time. The idea is that mirrors could set 
up scripts to check that timestamp from DNS more frequently (such as 
hourly) without causing load issues to msync nodes by rsyncing hourly. 
The TTL for the TXT record could be relatively small, like 10 minutes.
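
To make the format concrete, here is a minimal sketch of how a
mirror-side script might parse such a record. It assumes three
colon-separated epoch timestamps in the order CentOS, AltArch, Vault, as
in the example string above. The field names and the function name are
made up for illustration; no such record exists yet.

```python
# Hypothetical field order, taken from the example string above.
FIELDS = ("centos", "altarch", "vault")


def parse_timestamp_record(txt):
    """Parse a TXT value such as '"1488372781:1487767981:1488113581"'."""
    # "dig +short" prints TXT data wrapped in double quotes, so strip
    # surrounding whitespace and quotes before splitting.
    parts = txt.strip().strip('"').split(":")
    if len(parts) != len(FIELDS):
        raise ValueError(
            "expected %d fields, got %d" % (len(FIELDS), len(parts))
        )
    # int() raises ValueError on anything that is not a plain number.
    return dict(zip(FIELDS, (int(p) for p in parts)))
```

The input would be whatever the DNS lookup (or the http(s) fetch)
returns; a malformed record raises ValueError so the script can skip
that check instead of syncing on bad data.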

As you're all aware, DNS is a prime example of a very scalable system, 
and that's why I'm fond of this solution. Another option would be to 
publish the same data in a central location and serve it over http(s), if 
relaying the timestamp data via DNS is not desired for some reason.

The basic principle would be "if timestamp in TXT record > my current 
timestamp (TIME file), synchronize the mirror". With more frequent 
syncs, mirror admins would need to take care that no two rsync runs 
would happen at the same time. Using a lockfile in the scripts would help 
with this. I hope that many of the mirror admins already use lockfiles, 
but providing an example script might help for the newer mirror admins.
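
As a starting point for such an example script, the comparison and the
locking could be sketched roughly like this. It assumes the local TIME
file holds a single epoch timestamp, and it uses flock(2) via Python's
fcntl module, so it is Unix-only. The function names and paths are
placeholders, not an agreed interface.

```python
import fcntl
import subprocess


def needs_sync(remote_ts, time_file):
    """True if the published timestamp is newer than the local TIME file."""
    try:
        with open(time_file) as f:
            local_ts = int(f.read().strip())
    except (OSError, ValueError):
        # Assumption: with no usable TIME file, syncing is the safe choice.
        return True
    return remote_ts > local_ts


def run_locked(lock_path, command):
    """Run command unless another run already holds the lock.

    Returns the command's exit code, or None if the lock was busy.
    """
    with open(lock_path, "w") as lock:
        try:
            # Non-blocking exclusive lock: fail at once if already held.
            fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # a previous sync is still in progress
        # The lock is released when the file is closed on leaving "with".
        return subprocess.call(command)
```

Here command would be the mirror's existing rsync invocation, passed as
a list of arguments. A None return value means a previous run is still
going, so the script can simply exit and try again at the next check.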

The timestamps should be updated only after it has been verified that 
all (or at least the majority) of msync nodes actually have the content. 
It takes a while for the data to reach all the msync nodes from the master.
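
On the publishing side, the "all or at least the majority" rule might
look something like the sketch below. How the per-node timestamps would
be collected is left open, and the function name and the require_all
switch are only illustrative.

```python
def ready_to_publish(new_ts, node_timestamps, require_all=False):
    """Decide whether the new timestamp may be published.

    node_timestamps holds the content timestamp each msync node
    currently reports.
    """
    total = len(node_timestamps)
    if total == 0:
        return False
    # Count the nodes that already carry content at least as new as new_ts.
    current = sum(1 for ts in node_timestamps if ts >= new_ts)
    if require_all:
        return current == total
    return current * 2 > total  # strict majority
```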

On the other hand, this may cause some traffic peaks for the msync 
nodes. I don't know how well they would handle the peaks. One obvious 
way to alleviate the peaks would be to instruct mirror admins to pick a 
random minute at which to check for new content. In bash, 
echo $(( RANDOM % 60 )) works nicely for this. I don't have statistics, 
but I believe there might be mirrors that sync at "0 */6", i.e. at the 
top of the hour.
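
The same idea can be sketched as a small helper that generates the
crontab entry once at setup time, so each mirror keeps its own fixed
minute afterwards. The function name and the script path in the usage
note are placeholders.

```python
import random


def cron_line(command, rng=random):
    """Return an hourly crontab entry that fires at a random minute."""
    minute = rng.randrange(60)  # 0..59, the same range as RANDOM % 60
    return "%d * * * * %s" % (minute, command)
```

For example, cron_line("/usr/local/bin/check-mirror") returns an entry
whose first field is a number between 0 and 59 and whose other four time
fields are "*".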


If the traffic peaks to msync nodes are deemed to be a problem, there 
might be ways to reduce the load to msync nodes. The following idea 
could be implemented separately from the above timestamp idea, if needed.

There could be some sort of a "web service" which instructs mirrors 
where to sync from. The core idea in this is that the source might not 
always be a msync.centos.org server, but it could also be a nearby 
public mirror that offers rsync and has been verified to have the new 
content. If requested from Finland, that service could say "ok, you're 
from .fi, go sync from ftp.funet.fi as it has the new content already" 
or "uh oh, no nearby external mirrors have the new content, please rsync 
from eu-msync.centos.org". It could simply return a list of rsync 
servers in descending priority, with some msync.centos.org addresses at 
the bottom as fallback. Once the mirror has rsynced, the mirror could 
ping back and say "I have the content now, please check, and if OK, add 
me to the list of mirrors that have the new content".
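
A rough sketch of the selection logic such a service might use, assuming
it knows each public mirror's country code and whether that mirror has
been verified to carry the new content. The data layout and the function
name are invented for illustration, and "nearby" is simplified here to
"same country".

```python
def rsync_sources(country, mirrors, fallbacks):
    """Return rsync hosts for a requester, best candidates first.

    mirrors is a list of dicts with "host", "country" and "verified"
    keys; "verified" means the mirror was checked to have the new
    content. fallbacks are msync addresses appended at the bottom.
    """
    nearby = [
        m["host"]
        for m in mirrors
        if m["verified"] and m["country"] == country
    ]
    return nearby + list(fallbacks)
```

A request from .fi would then get ftp.funet.fi first if it is verified,
followed by the msync fallback; a request from a country with no
verified mirrors would get only the fallback.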

One concern is that the list of rsync sources would need to be 
protected, so that mirrors could not be tricked into syncing from a 
malicious source (think DNS poisoning). Ways to protect from this 
include DNSSEC, TLS and PGP-signed data.


Any thoughts about this?


[1] http://mirror-status.centos.org/
[2] https://wiki.centos.org/HowTos/CreatePublicMirrors