Dear Tru,
I want to give you some input on what I think is the key aspect of a content delivery network (CDN), which is what a mirror network in fact is:
Simply Cost!
I do not want to play down the efforts of the people who dedicated themselves to GeoIP. It is useful for many things, but not so much for managing the cost of a content delivery network, since an IP address is not really a geographical commodity. Every IP belongs to a parent, the autonomous system (AS). An IP can be located literally meters away and still be far more expensive in terms of traffic cost than an IP on the other side of the planet that sits in a network with which the mirror's network owner has established settlement-free peering.
Of course someone can rent a "flatrate" server with "unlimited" bandwidth. However, this is only a marketing concept. Every single bit needs to be carried by a pipe which someone pays for. The providers at both ends of a bidirectional data stream are likely interested in delivering the service to their paying end customer, be it the server owner (content) or the person who accesses the server (eyeball). I don't want to get into the discussion of which network type is more valuable, since this is a hot topic in the "net neutrality" debate. There is a small club of "Tier 1" carriers (http://en.wikipedia.org/wiki/Tier_1_network) sitting in the middle, not paying anyone for transit and only collecting money from the rest of the crowd for carrying traffic to networks which they peer with but which the non-Tier 1 networks can't reach themselves through direct interconnection. Everyone else is by definition not Tier 1 and pays someone cash for traffic. The incentive is to avoid the Tier 1 networks in order to cut that cost. This is a little bit of theory for everyone who is not a carrier himself and just purchases upstream traffic from a provider who blends the price for transit and peering traffic. If you understand this, you will understand why a provider hosting a mirror has no problem committing 1 Gbps on peering routes but is afraid of having too much paid transit traffic, which could not only spike but permanently increase his 95% (95th-percentile) traffic quota on paid routes.
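To illustrate why sustained transit traffic hurts more than a short spike, here is a minimal sketch of the "95%" accounting. It assumes the figure refers to the common burstable billing scheme in which the top 5% of the sampled rates are discarded and the highest remaining sample is billed. The function name and the sample values are made up for illustration.

```python
import math

def percentile_95(samples_mbps):
    """Return the billable rate: the highest sample left after
    discarding the top 5% (nearest-rank 95th percentile)."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    rank = math.ceil(0.95 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

# 100 hypothetical rate samples in Mbps on a paid transit route.
short_burst = [100] * 96 + [900] * 4    # bursts in 4% of the intervals
long_burst = [100] * 90 + [900] * 10    # bursts in 10% of the intervals

print(percentile_95(short_burst))  # bursts fall into the discarded 5%
print(percentile_95(long_burst))   # bursts now set the billed rate
```

Under this assumption a brief spike is free, while transit load that persists for more than 5% of the billing period raises the bill for the whole period.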
Let's take my mirror as an example: there is no real problem with a daily average transfer of only 20 Gb (statistics at http://mirror.silyus.net), but with 200 servers participating in the CentOS CDN, this adds up globally to quite some traffic, which could be engineered far better than by GeoIP round robin. If you look at the AS numbers of the eyeballs pulling from the mirror, there are more transit than peering requesters, thanks to GeoIP's unawareness of AS network topology (http://mirror.silyus.net/webalizer/usage_200806.html#TOPASNS).
I have two solutions in mind:
1. the centralized one
The domain name server only includes a mirror's IP in the round-robin answer to the requester/eyeball of the CentOS mirror URL if the hoster of that mirror has authorized the requester's IP range, because it can be reached "locally" without extensive transit cost. The drawback is that this database needs to be kept up to date, since IP prefixes (ranges) are dynamically assigned to autonomous systems. If nobody wants to serve the IP of a requester because the provider of that eyeball does not peer for free, the eyeball will end up in a black hole unless a failover "take all" transit mirror is provided.
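The selection logic of the centralized idea could look roughly like the following sketch. It is not a DNS server, only the lookup behind one. The prefix table, the mirror addresses (taken from the documentation ranges) and all names are hypothetical.

```python
import ipaddress

# Hypothetical database: mirror IP -> prefixes its hoster has authorized.
MIRROR_PREFIXES = {
    "192.0.2.10": ["198.51.100.0/24"],
    "192.0.2.20": ["198.51.100.0/25", "203.0.113.0/24"],
}
# Failover "take all" transit mirror(s) for otherwise unserved eyeballs.
FALLBACK_MIRRORS = ["192.0.2.99"]

def candidates_for(client_ip):
    """Return the mirrors that have authorized the requester's prefix."""
    addr = ipaddress.ip_address(client_ip)
    matches = [
        mirror
        for mirror, prefixes in MIRROR_PREFIXES.items()
        if any(addr in ipaddress.ip_network(p) for p in prefixes)
    ]
    return matches or list(FALLBACK_MIRRORS)

class RoundRobinResolver:
    """Rotate through the candidate pool, one answer per query."""

    def __init__(self):
        self._counters = {}

    def resolve(self, client_ip):
        pool = candidates_for(client_ip)
        key = tuple(pool)
        n = self._counters.get(key, 0)
        self._counters[key] = n + 1
        return pool[n % len(pool)]

resolver = RoundRobinResolver()
print(resolver.resolve("198.51.100.5"))  # one of the two authorized mirrors
print(resolver.resolve("8.8.8.8"))       # nobody authorized it: fallback
```

In practice the prefix table would have to be refreshed from routing data, which is exactly the maintenance drawback described above.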
2. the decentralized one
The domain name server behaves as it does now, but the mirror itself bounces all file requests which are not "local" according to its ACL. The eyeball still contacts the server and causes some minor transit bandwidth overhead, but content delivery is denied by the access control mechanism on the mirror. This can be developed into a kind of "inter-server peer-to-peer mirror" network if the mirror additionally suggests the next mirror to be tried, until a server accepts the request. This is a bit like the mechanism currently in use among telephony routers: if a prefix does not match the locally connected numbers, the call is routed on to the next switch (default route) until an authoritative switch terminates (accepts or releases) the call. We still need to make sure that every IP in the number space has an authoritative mirror server or a failover default route, and the download client must be able to understand the "hint/redirect" to the next mirror on a protocol level. A mirror giving a hint should not point to mirrors which have been retired. The mirrors should therefore regularly ask their peer servers whether they are still alive, what number space they want to serve and what their current update status is. With that information a mirror can decide whether it should still recommend a lagging server, or, in case it lags itself, retrieve content from a peer server that has more current content than the local one. To summarize: in this design the intelligence is transferred to the mirror servers, and the master only needs to seed the content to a few well-connected peer servers, which then propagate the content to their neighbors. The problem of ensuring the integrity of the files on each node (and locking out zombie mirrors) is still unsolved, and I am not really competent to suggest anything right now.
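The hint/redirect chain described above can be sketched as a small simulation. The class, the mirror names, the prefixes and the hop limit are assumptions for illustration. Over HTTP the hint could be carried by an ordinary redirect response, but that mapping is also only an assumption here.

```python
import ipaddress

class Mirror:
    """A mirror with a local ACL and a hint to the next mirror to try."""

    def __init__(self, name, local_prefixes, next_hint=None, take_all=False):
        self.name = name
        self.local = [ipaddress.ip_network(p) for p in local_prefixes]
        self.next_hint = next_hint   # name of the mirror to suggest next
        self.take_all = take_all     # failover default route

    def handle(self, client_ip):
        addr = ipaddress.ip_address(client_ip)
        if self.take_all or any(addr in net for net in self.local):
            return ("serve", self.name)
        if self.next_hint is not None:
            return ("redirect", self.next_hint)
        return ("reject", None)

def fetch(mirrors, first, client_ip, max_hops=5):
    """Follow hints until a mirror serves the request or hops run out."""
    current, path = first, []
    for _ in range(max_hops):
        path.append(current)
        action, target = mirrors[current].handle(client_ip)
        if action == "serve":
            return target, path
        if action == "reject":
            break
        current = target
    return None, path

mirrors = {
    "a": Mirror("a", ["198.51.100.0/24"], next_hint="b"),
    "b": Mirror("b", ["203.0.113.0/24"], next_hint="c"),
    "c": Mirror("c", [], take_all=True),
}
print(fetch(mirrors, "a", "203.0.113.9"))  # bounced by a, served by b
```

The hop limit guards against hint loops between mirrors. The liveness and update-status exchange between peers is left out of this sketch.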
I guess there is also some risk in the current production architecture that a mirror server delivers malware files, unless the master built and compared the md5 sum of each and every file on each and every downstream mirror. I guess I am getting too paranoid now, since people hosting a mirror would never ever have bad intentions.
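A minimal sketch of such a comparison, assuming the master holds a table of file names and md5 sums and obtains a corresponding table for each mirror by downloading and hashing the files. The function names and the sample tables are hypothetical.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_mismatches(master_sums, mirror_sums):
    """Return the names of files whose mirror checksum is missing
    or differs from the master's checksum."""
    return sorted(
        name
        for name, digest in master_sums.items()
        if mirror_sums.get(name) != digest
    )

# Hypothetical checksum tables for one master and one mirror.
master = {"a.rpm": "aaa", "b.rpm": "bbb", "c.rpm": "ccc"}
mirror = {"a.rpm": "aaa", "b.rpm": "tampered"}
print(find_mismatches(master, mirror))  # b.rpm differs, c.rpm is missing
```

The sums have to be computed by the master from the downloaded files. A checksum list reported by the mirror itself would prove nothing about what it actually serves.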
Well, I have put this out there for now. Does anyone want to comment or pick this up as a project?
Regards, Florian
----- Original Message -----
From: "Tru Huynh" tru@centos.org
To: centos-mirror@centos.org
Sent: Monday, June 09, 2008 4:01 PM
Subject: [CentOS-mirror] RFC on public centos mirrors
CentOS-mirror mailing list CentOS-mirror@centos.org http://lists.centos.org/mailman/listinfo/centos-mirror