On 02/04/2015 04:04 PM, centoslistmail@gmail.com wrote:
On Feb 03 8:58pm, Karanbir Singh wrote:
repeated polling is counterproductive. for the 6 times the high-prio push was needed in the last year, it's a waste to destroy mirror caches every 10 min through the entire year.
having dedicated nodes just to push rsync targets is also bad - since those machines then don't deliver any user-facing service (or bandwidth) for most of the time.
Since the collection of mirror hosts is really just a large distributed system, it would be prudent to think about it in that context and not worry (at this point) about such minor implementation-specific details.
this is not a minor issue... being able to saturate links from our side, with a focus on what-matters-when, allowed us to reduce the overall mirror seed time for a major release from 7 days to just under 2.5 days - and that is in spite of the fact that we seed almost 4,000 external mirrors at point of release.
but again, this isn't the question at hand!
The overview (the 10,000 ft view) reduces to just two layers: a message layer and a transport layer. Rsync is perfectly sufficient for the transport layer. The problem being discussed, however, mostly concerns the message layer, which simply answers "when is there new stuff to grab?". The problem is muddled by the fact that rsync is being used as part of the message layer too, and that is not optimal. Rsync should be able to say:
"I am grabbing that which is different"
Instead of saying:
"If there is something different, I will grab it"
The second sentence is primarily a question of when, not a question of what. Rsync is a very expensive way of asking when. What is needed is a better (not time-based) method of triggering rsync. A simple timestamp check on a file grabbed via curl, while not exactly robust, would suffice as a trigger. Polling such a tiny thing at a high rate would be low cost, and logic based on that poll would determine whether rsync is triggered. Other options, like a RabbitMQ-based queue, could be more robust in that they could also coordinate the external rsync processes, managing a thundering herd and lessening the chance of an inadvertent DDoS.
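As a rough illustration only (the state-file URL, rsync module, and paths below are made-up placeholders, not anything that exists today), the mirror-side loop could look something like this:

#!/usr/bin/env python3
# Sketch of "cheap poll, expensive rsync only on change".
# STATE_URL, RSYNC_SRC and LOCAL_DIR are hypothetical placeholders.

import subprocess
import time
import urllib.request

STATE_URL = "http://msync.example.org/TIME"      # tiny file the master rewrites on each push
RSYNC_SRC = "rsync://msync.example.org/CentOS/"  # hypothetical rsync module
LOCAL_DIR = "/srv/mirror/centos/"
POLL_SECS = 600                                  # a 10-minute poll costs a few bytes

def fetch_state():
    # A few bytes per poll instead of an rsync walk over the whole tree.
    with urllib.request.urlopen(STATE_URL, timeout=30) as resp:
        return resp.read().strip()

def main():
    last_seen = None
    while True:
        try:
            state = fetch_state()
        except OSError:
            state = None  # fetch failure: treat as "no change" and retry next cycle
        if state is not None and state != last_seen:
            # Only now pay for a full rsync pass.
            subprocess.run(["rsync", "-aH", "--delete", RSYNC_SRC, LOCAL_DIR])
            last_seen = state
        time.sleep(POLL_SECS)

if __name__ == "__main__":
    main()

A real script would also want locking and a random sleep before the rsync, so that 4,000 mirrors seeing the same state change don't all hit the master in the same second - which is exactly the coordination a queue like RabbitMQ would provide.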
if we are able to solve this - "what changed since i saw you last?" - without needing to walk and compare metadata on every file across a 100GB corpus, we would have quite a nice solution indeed. But how does one implement that?
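to be clear about the shape of what i mean - purely a sketch, with a made-up manifest name and format - the master could publish one small per-repo checksum file, and a mirror would only diff that against its last-seen copy to decide which repos need an rsync pass:

#!/usr/bin/env python3
# sketch only: compare a tiny published per-repo manifest against the local
# copy, instead of walking metadata for every file in a ~100GB corpus.
# MANIFEST_URL and the JSON format are invented for illustration.

import json
import urllib.request

MANIFEST_URL = "http://msync.example.org/repo-state.json"  # e.g. {"7/os/x86_64": "sha256:...", ...}
LOCAL_STATE = "/var/lib/mirror/repo-state.json"

def load_remote():
    with urllib.request.urlopen(MANIFEST_URL, timeout=30) as resp:
        return json.load(resp)

def load_local():
    try:
        with open(LOCAL_STATE) as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}

def changed_repos():
    remote, local = load_remote(), load_local()
    # only repos whose published checksum moved need an rsync pass
    return [repo for repo, csum in remote.items() if local.get(repo) != csum]

if __name__ == "__main__":
    for repo in changed_repos():
        print(repo)  # hand these paths to per-repo rsync jobs, then save the new manifest

the point being: the "what changed" question gets answered from a few kilobytes of metadata, and per-file comparison only ever happens inside the repos that actually moved.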
a reverse, opportunity-driven cache replacing mirror nodes? so we have a CDN of sorts, with an on-demand, repo-level expunge?