[CentOS-devel] setting up an emergency update route

Thu Feb 5 09:32:35 UTC 2015
Karanbir Singh <mail-lists at karan.org>

On 02/04/2015 03:01 PM, Jeff Sheltren wrote:
> 
> 
> On Tue, Feb 3, 2015 at 12:58 PM, Karanbir Singh <mail-lists at karan.org
> <mailto:mail-lists at karan.org>> wrote:
> 
>     repeated polling is counter-productive. For the 6 times the high-prio
>     push was needed in the last year, it's a waste to destroy mirror caches
>     every 10 min through the entire year.
> 
> 
> What cache are you referring to specifically (filesystem?, reverse proxy
> cache? other?)?

filesystem caches - getting them up and keeping them warm has a massive
impact on deliverability of content from the mirror nodes. A very large
number of machines still run off one or two HDDs, typically in a RAID1,
but they can easily deliver more than a couple of hundred megs of data.
A complete rsync pass over stale content kills that.

John Hawley's paper on mirrors and filesystem caches is largely still
relevant (~7 years down the road from when it was written?)

the main issue is that while there are only a few updates, the rsync
will trawl the entire tree, including components that are potentially 3
or 4 updates behind, for a payload that is ~100GB on disk.
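To make the cost concrete: even a sync pass that finds almost nothing to transfer still has to stat() every file in the tree. A minimal Python sketch of that scan pattern (illustrative only - this is not how rsync is implemented, just the access pattern that churns the cache):

```python
# Illustrative scan: locating the handful of files changed since a
# given timestamp still stats every path under the tree, which is
# what evicts a warm filesystem cache on a ~100GB mirror.
import os

def changed_since(root: str, since: float) -> list[str]:
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime > since:  # one stat per file, every pass
                changed.append(path)
    return changed
```

Even when this comes back with only six changed paths, the pass has touched every inode on disk.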

> Obviously the rsync method where each mirror pretty much "does their own
> thing" is dated and not optimal.  The "hi, I just updated my mirror,
> here's what I have currently" script portion of MirrorManager can at
> least help on the polling side so that you have a more accurate and
> timely idea of which mirrors are up to date.  Leveraging that, or
> similar, may be a small change that could help move things in the right
> direction (and may or may not be part of a long-term way to improve
> distro mirroring).

we tried this - people lied. Not everyone runs a complete mirror, and
having this run client-side dramatically increases the chances of a
dirty mirror being accepted in. If we validate mirrors, it really must
happen from an external source. Maybe publishing a checksum or some
metadata that is used as a component of the overall yes/no decision
might work.
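One sketch of that "external source" validation: the master publishes a digest of its repodata/repomd.xml, and a central crawler fetches each mirror's own copy and compares. The repo path and function names below are assumptions for illustration, not an existing CentOS service:

```python
# Sketch of server-side mirror validation: trust nothing a mirror
# reports about itself; fetch its repomd.xml and compare against the
# digest published by the master. The repo path is hypothetical.
import hashlib
from urllib.request import urlopen

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def mirror_is_current(master_digest: str, mirror_repomd: bytes) -> bool:
    """Accept a mirror into rotation only if its metadata matches the master's."""
    return sha256_hex(mirror_repomd) == master_digest

def check_mirror(mirror_base: str, master_digest: str) -> bool:
    # The crawler pulls the file itself from the mirror - validation
    # happens externally, never from a mirror-side "I just updated" claim.
    url = mirror_base + "/7/updates/x86_64/repodata/repomd.xml"
    with urlopen(url) as resp:
        return mirror_is_current(master_digest, resp.read())
```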

> For starters, why not select a core group (10-20? Just making up a
> number here, but get a good geographic/network spread) of external "tier
> 1" mirrors and ask them to update more frequently (one hour seems
> reasonable to me, and as an ex-mirror-admin I don't think that is asking
> too much).  And scan those more frequently (or use something similar to
> the MirrorManager "I just updated" script) so that the status of those
> mirrors is well known and they can be easily flagged if they are not
> being updated.

at this point, why not just deploy a distributed gluster setup and ask
them to join as replicas?
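For what it's worth, the replica idea on the gluster side would look something like the following - hostnames and brick paths are made up, and this is a sketch of the shape of such a deployment, not a tested one:

```shell
# Illustrative only: a 3-way replicated gluster volume across three
# tier-1 mirror hosts, each contributing one brick.
gluster peer probe mirror2.example.org
gluster peer probe mirror3.example.org
gluster volume create centos-mirror replica 3 \
    mirror1.example.org:/bricks/centos \
    mirror2.example.org:/bricks/centos \
    mirror3.example.org:/bricks/centos
gluster volume start centos-mirror
```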

> Non "tier 1" mirrors are asked to pull from the tier 1 mirrors, and are
> asked to update at least every X hours.  I'm making the assumption that
> one hour may be too frequent for some mirror admins, but perhaps push
> them into updating at least every 2 or 3 hours.  These mirrors could be
> scanned for status less frequently than the tier 1 mirrors because you
> know they will be at least 2 hours behind or so.
> 
> Any other mirrors (not tier 1 or tier 2) are either dropped completely
> from the official mirror list or are kept on a separate "we don't
> endorse these, but here are some mirrors that may be fast for you to
> use, although perhaps slightly out of date").
> 
> I think just that bit of shrinking the update window for mirrors could
> make quite a difference.
> 
> I would argue that people who demand a faster update window than 3-4
> hours should look at a paid, supported alternative.  That said, I don't
> want to use that as an argument against making the updates process as
> fast as we possibly can.

What you say here makes sense, but it works on the assumption that rsync
over massive trees is going to work - it doesn't. I think for
increasingly larger sets of data, rsync as a mechanism to cascade trees
downstream is just broken. We end up with large numbers of machines
serving no real user-facing content, just iterating over the same
content and comparing state with remote machines.

Of course - all this is orthogonal to the 'urgent updates' repo. We need
to find a better way to get content out for the entire tree - but do we
need to have that in place before we do this 'urgent updates' repo? Can
we not just have that run from mirror.centos.org (which has a 10 min
update delta), while we work out what the larger solution might be?
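On the client side, opting into such an interim arrangement could be as small as a repo file pointing straight at mirror.centos.org instead of the mirrorlist. The repo name and path below are hypothetical - this is the repo being proposed, not one that exists today:

```ini
# /etc/yum.repos.d/CentOS-Urgent.repo -- illustrative sketch; the
# "urgent" path and repo id are assumptions, not an existing repo.
[urgent-updates]
name=CentOS-$releasever - Urgent Updates
baseurl=http://mirror.centos.org/centos/$releasever/urgent/$basearch/
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
```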



-- 
Karanbir Singh
+44-207-0999389 | http://www.karan.org/ | twitter.com/kbsingh
GnuPG Key : http://www.karan.org/publickey.asc