On Feb 03 8:58pm, Karanbir Singh wrote:
Repeated polling is counterproductive. For the 6 times the high-prio push was needed in the last year, it's a waste to destroy mirror caches every 10 min through the entire year.
Having dedicated nodes just to push rsync targets is also bad, since those machines then don't deliver any user-facing service (or bandwidth) most of the time.
Since the collection of mirror hosts is really just a large distributed system, it would be prudent to think about it in that context and not worry (at this point) about such minor implementation-specific details.
At the 10,000 ft view, the system reduces to two layers: the message layer and the transport layer. Rsync is perfectly sufficient for the transport layer. The problem being discussed, however, is mostly relevant to the message layer, which simply answers "when is there new stuff to grab?". The problem is muddled by the fact that rsync is being used as part of the message layer too, and that is not optimal. Rsync should be able to say:
"I am grabbing that which is different"
Instead of saying:
"If there is something different, I will grab it"
The second sentence is primarily a question of when, not a question of what. Rsync is a very expensive way of asking when. What is needed is a better (not time-based) method of triggering rsync. A simple timestamp check of a file grabbed via curl, while not exactly robust, would suffice as a trigger. Polling such a tiny file at a high rate would be low cost, and logic based on the result of that poll would then decide whether rsync is triggered. Other options, like a RabbitMQ-based queue, could be very robust in that they can coordinate the external rsync processes to manage a thundering herd and lessen the chance of an inadvertent DDoS.
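To make the timestamp idea concrete, here is a rough sketch in Python rather than curl plus shell. It is only an illustration under assumptions: the URL, the rsync command line, and the function names are all hypothetical, and it assumes the master publishes a small file containing a single integer timestamp that it bumps whenever new content lands.

```python
import subprocess
import urllib.request

# Hypothetical values: neither the URL nor the rsync source comes from this thread.
STAMP_URL = "https://master.example.org/timestamp.txt"
RSYNC_CMD = ["rsync", "-a", "--delete",
             "rsync://master.example.org/repo/", "/srv/mirror/"]


def fetch_stamp(url=STAMP_URL, timeout=10):
    """Fetch the tiny timestamp file and return its content as an int."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return int(resp.read().decode("ascii").strip())


def run_rsync(cmd=RSYNC_CMD):
    """Run the expensive transfer; return True if rsync exited cleanly."""
    return subprocess.run(cmd).returncode == 0


def poll_once(last_seen, fetch=fetch_stamp, sync=run_rsync):
    """One cheap poll. Returns the timestamp to remember for the next poll."""
    try:
        current = fetch()
    except (OSError, ValueError):
        # Master unreachable or file malformed: keep state, retry next poll.
        return last_seen
    if last_seen is not None and current <= last_seen:
        # Nothing new was announced, so rsync is never started.
        return last_seen
    # New content announced: only now pay for rsync. If it fails, keep the
    # old timestamp so the next poll tries again.
    return current if sync() else last_seen
```

A driver would call poll_once() in a loop every minute or so, carrying the returned value forward, ideally with a little random jitter on the sleep so that mirrors do not all start rsync in the same second. That jitter only softens the thundering herd; it does not coordinate mirrors the way a queue could.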
Just my 2¢.