On 02/04/2015 04:04 PM, centoslistmail at gmail.com wrote:
> On Feb 03 8:58pm, Karanbir Singh wrote:
>>
>> repeated polling is counterproductive. for the 6 times the high-prio
>> push was needed in the last year, it's a waste to destroy mirror
>> caches every 10 min through the entire year.
>>
>> having dedicated nodes to just push rsync targets is also bad - since
>> those machines then don't deliver any user-facing service (or
>> bandwidth) for most of the time.
>
> Since the collection of mirror hosts is really just a large distributed
> system, it would be prudent to think about it in that context and not
> worry (at this point) about such minor implementation-specific details.

this is not a minor issue... being able to saturate links from our
side, with a focus on what-matters-when, allowed us to reduce the
overall mirror seed time from 7 days to just under 2.5 days for a major
release - and that is in spite of the fact that we seed almost 4,000
external mirrors at point of release. but again, this isn't the
question at hand!

>
> The overview (10,000 ft view) becomes simply the message layer and the
> transport layer. Rsync is perfectly sufficient for the transport
> layer. The problem being discussed, however, is mostly relevant to the
> message layer. That layer is simply "when is there new stuff to
> grab?". The problem is muddled by the fact that rsync is being used as
> part of the message layer, too, and that is not optimal. Rsync should
> be able to say:
>
> "I am grabbing that which is different"
>
> Instead of saying:
>
> "If there is something different, I will grab it"
>
> The second sentence is primarily a question of when, not a question of
> what, and rsync is a very expensive way of asking when. What is needed
> is a better (not time-based) method of triggering rsync. A simple
> timestamp check of a file grabbed via curl, while not exactly robust,
> would suffice as a trigger. A high rate of polling for such a tiny
> thing would be low cost, and logic based on that poll would then
> determine whether rsync is triggered. Other options, like a
> RabbitMQ-based queue, could be very robust, in that they can
> coordinate the external rsync processes to manage a thundering herd
> and lessen the chance of an inadvertent DDoS.

if we are able to solve this: "what changed since I saw you last?",
without needing to walk and compare metadata on every file across a
100GB corpus, we would have quite a nice solution indeed. But how does
one implement that? a reverse, opportunity-driven cache replacing
mirror nodes? so we have a CDN of sorts, with an on-demand repo-level
expunge?

--
Karanbir Singh
+44-207-0999389 | http://www.karan.org/ | twitter.com/kbsingh
GnuPG Key : http://www.karan.org/publickey.asc
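
PS: to make the "cheap poll, expensive sync" idea concrete, here is a
rough sketch of what a mirror-side trigger could look like. purely
illustrative: the state-file URL, rsync module and paths below are
made up, not anything we actually publish today.

#!/usr/bin/env python3
# Rough sketch: poll a tiny state file often (the "message layer"),
# and only run rsync (the "transport layer") when it changes.
# STATE_URL / RSYNC_SRC / RSYNC_DST are hypothetical placeholders.
import subprocess
import time
import urllib.request

STATE_URL = "http://msync.example.org/state.txt"   # hypothetical
RSYNC_SRC = "rsync://msync.example.org/centos/"    # hypothetical
RSYNC_DST = "/srv/mirror/centos/"
POLL_SECS = 60  # fetching a few bytes this often is near-free

def fetch_state():
    # The whole "message layer" is just these few bytes.
    with urllib.request.urlopen(STATE_URL, timeout=10) as resp:
        return resp.read()

def main():
    last_seen = None
    while True:
        try:
            state = fetch_state()
        except OSError:
            state = last_seen   # transient fetch error: retry later
        if state is not None and state != last_seen:
            # Only now do we pay for the expensive metadata walk.
            subprocess.run(["rsync", "-aH", "--delete",
                            RSYNC_SRC, RSYNC_DST], check=False)
            last_seen = state
        time.sleep(POLL_SECS)

if __name__ == "__main__":
    main()

a queue-based setup (e.g. RabbitMQ) would replace the polling loop
with a blocking consume, which would also give the master side a knob
to stagger the herd instead of having every mirror rsync at once.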