[CentOS-devel] setting up an emergency update route

Thu Feb 5 09:37:52 UTC 2015
Karanbir Singh <mail-lists at karan.org>

On 02/04/2015 04:04 PM, centoslistmail at gmail.com wrote:
> On Feb 03  8:58pm, Karanbir Singh wrote:
>>
>> repeated polling is counter-productive. for the 6 times the high-prio
>> push was needed in the last year, it's a waste to destroy mirror caches
>> every 10 min through the entire year.
>>
>> having dedicated nodes just to push rsync targets is also bad - since
>> those machines then don't deliver any user-facing service (or bandwidth)
>> for most of the time.
> 
> Since the collection of mirror hosts is really just a large distributed
> system, it would be prudent to think about it in that context and not worry
> (at this point) about such minor implementation-specific details.

this is not a minor issue... being able to saturate links from our side,
with a focus on what-matters-when, allowed us to reduce the overall
mirror seed time from 7 days to just under 2.5 days for a major release
- and this is in spite of the fact that we seed almost 4,000 external
mirrors at the point of release.

but again, this isn't the question at hand!

> 
> The overview (10,000 ft view) becomes simply the message layer and the
> transport layer. Rsync is perfectly sufficient for the transport layer. 
> The problem being discussed, however, is mostly relevant to the message
> layer. That layer is simply "when is there new stuff to grab?". The
> problem is muddled by the fact that rsync is being used as a part of the
> message layer, too, and that is not optimal. Rsync should be able to say:
> 
> "I am grabbing that which is different"
> 
> Instead of saying:
> 
> "If there is something different, I will grab it"
> 
> The second sentence is primarily a question of when, not a question of
> what. Rsync is a very expensive way of trying to ask when. What is
> needed is a better (not time-based) method of triggering rsync. A simple
> timestamp check of a file grabbed via curl, while not exactly robust,
> would suffice as a trigger. A high rate of polling for such a tiny thing
> would be low cost, and logic based on that poll would determine whether
> rsync is triggered. Other options, like a rabbitMQ-based queue, could be
> more robust, in that a queue can coordinate the external rsync processes
> to manage a thundering herd and lessen the chance of an inadvertent DDoS.
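the polling trigger described above could be sketched roughly as below -
note the state-file URL, rsync module, and paths are all illustrative
placeholders I made up, not real CentOS infrastructure:

```shell
#!/bin/bash
# Hypothetical poll-and-trigger sketch: fetch a tiny state file, and only
# fire rsync when its contents have changed since the last successful run.
STATE_URL="${STATE_URL:-http://msync.example.org/TIMESTAMP}"
STAMP_FILE="${STAMP_FILE:-/var/tmp/mirror.laststamp}"

# stamp_changed <new> <old_file>: succeed when the freshly fetched stamp
# differs from the one stored after the last successful rsync.
stamp_changed() {
    local new="$1" old_file="$2"
    [ ! -e "$old_file" ] && return 0
    [ "$new" != "$(cat "$old_file")" ]
}

poll_once() {
    local new
    new=$(curl -fsS "$STATE_URL") || return 1  # a few bytes; cache-friendly
    if stamp_changed "$new" "$STAMP_FILE"; then
        sleep $((RANDOM % 300))                # jitter against a thundering herd
        rsync -aqz --delete rsync://msync.example.org/centos/ /srv/mirror/centos/ \
            && printf '%s' "$new" > "$STAMP_FILE"
    fi
}

# poll_once   # e.g. run from cron every minute; rsync fires only on change
```

the random sleep before rsync is one cheap way to spread the herd out
when thousands of mirrors all see the stamp change in the same minute.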

if we are able to solve this - "what changed since I saw you last?" -
without needing to walk and compare metadata on every file across a
100GB corpus, we would have quite a nice solution indeed. But how does
one implement that?
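one hedged answer (purely a sketch of mine, nothing the project ships):
keep a tiny per-repo serial file on the master, bumped on every push, so
a mirror answers "what changed" by diffing two small files instead of
walking the tree. All names below are invented for illustration:

```shell
#!/bin/bash
# Master-side sketch: one "<repo> <serial>" line per repo; a push bumps
# the serial, mirrors fetch this file and rsync only the repos that moved.
SERIALS="${SERIALS:-/srv/master/SERIALS}"

# bump_serial <repo>: increment the repo's serial (starting at 1 if absent).
bump_serial() {
    local repo="$1" line cur=0
    touch "$SERIALS"
    if line=$(grep "^$repo " "$SERIALS"); then
        cur=${line##* }
    fi
    grep -v "^$repo " "$SERIALS" > "$SERIALS.tmp" || true
    echo "$repo $((cur + 1))" >> "$SERIALS.tmp"
    mv "$SERIALS.tmp" "$SERIALS"
}

# changed_repos <old_file> <new_file>: list repos added or bumped in new.
changed_repos() {
    comm -13 <(sort "$1") <(sort "$2") | cut -d' ' -f1
}
```

a mirror would then only need to rsync the repos that `changed_repos`
reports, which keeps the expensive walk scoped to what actually moved.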

a reverse, opportunity-driven cache replacing mirror nodes? so we would
have a CDN of sorts, with an on-demand repo-level expunge?


-- 
Karanbir Singh
+44-207-0999389 | http://www.karan.org/ | twitter.com/kbsingh
GnuPG Key : http://www.karan.org/publickey.asc