[CentOS-mirror] Please remove from mirror list: mirrors.bigtennetwork.com/CentOS

Hi Scott,

On Thu, Jan 21, 2010 at 12:06:47 -0800, Scott Adametz wrote:
> Due to an inordinate amount of Chinese based traffic from only a handful
> of IP addresses (over 145 TB transferred to just 12 IP addresses in only
> 15 days since we started hosting the mirror) we are forced to cease our
> participation in the CentOS mirror project.  

This is an interesting report, because it doesn't fit the patterns that
were seen in the past.

Of course, there could always be something new that we haven't seen yet --
but for the moment, I'll analyze your case on the common ground of
what's known to me.

> In our research before deciding to offer our support we were told to
> expect a sustained 3-5 Mbit/s of mirror traffic.  In reality, and from
> only a handful of IPs, we regularly push over 200Mbit/s on our 300Mbit/s
> line.  Each of the abusive IPs downloads the same DVD iso files over and
> over thousands of times.  We have tried blocking the abusive IPs only to
> see another IP with a sequentially increased last octet take its place.
> Whether this is an outright attack or just an unfortunate coincidence
> matters not.  

Let's check: with a connection that maxes out at 300 MBit/s, I calculate
that at most about 3.2 TB of data can be delivered in 24 hours
(300 MBit/s is 37.5 MB/s, times 86400 seconds).

Within 15 days, you would not be able to deliver more than about 49 TB,
so I think that the figure of 145 TB that you report must be based on a
miscalculation. By the above calculation, at least, it is impossible.
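
As a rough cross-check of that ceiling, here is a minimal Python sketch.
It assumes decimal units (1 MBit = 10**6 bits, 1 TB = 10**12 bytes) and
ignores protocol overhead, so the exact result depends on those
assumptions; the function name is made up for illustration.

```python
# Upper bound on the data a fully saturated link can deliver.
# Assumptions (not from the original mail): decimal units, no protocol
# overhead, link saturated for the whole period.

def max_terabytes(link_mbit_per_s: float, days: float) -> float:
    bytes_per_second = link_mbit_per_s * 1_000_000 / 8
    seconds = days * 24 * 60 * 60
    return bytes_per_second * seconds / 1e12

for days in (1, 15):
    ceiling = max_terabytes(300, days)
    print(f"{days:2d} day(s) at 300 MBit/s: at most {ceiling:.1f} TB")
```

Under these assumptions the 15-day ceiling stays below 50 TB, so 145 TB
cannot have left the machine over a 300 MBit/s line.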

If you measured by looking at mod_status, or by analyzing Apache logs
without using mod_logio, this is to be expected. The numbers that are
logged there are grossly overestimating traffic because (as Adrian
mentioned already) they don't log the effectively transferred amount,
but numbers based on file size.

> Regretfully, I must ask that we be delisted from the mirror list asap.
> Once our links are down, we will shut down the server.  
> 
>  
> 
> At some point in the future we may decide to participate again but for
> now, we cannot justify the inordinate bandwidth use.  

I guess that you already checked actual network traffic by other means,
but if you didn't, I would recommend checking again in more detail.

(Your followup indicates that you used only webalizer and analog, i.e.
pure Apache log analyzers that look only at the default log format, and
that would lead to those skewed numbers.)

If you discover that the numbers were indeed wrong (I would be surprised
if not :-) then the amount of data transferred might not be a problem at
all anymore. However, the number of connections opened by some clients
might be the actual problem that you want (and need) to fight against to
protect your resources.

I have seen as many as 300 (!) parallel connections from single IP
addresses, all downloading the very same file in range requests.

The fact that the connection from many Chinese networks to the rest of
the world is so very bad is the _reason_ why those parallel connections
are opened, and they persist for long periods of time because the net
transfer rate is low.

You can address this issue by limiting parallel connections to your
mirror by IP address, as was suggested in several followups to this
thread. mod_limitipconn is the treatment of choice.

You'll have seen in the thread that a connection limit was discussed as
something that could cut off legitimate users. Indeed, I wouldn't go as
low as 5 parallel connections, because with a number that low I see
that risk as well. However, I can recommend "MaxConnPerIP 20" as a good
value, which I have had no problems with whatsoever. This effectively
limits the harm, as long as you don't encounter dozens of that kind of
downloader -- but I have never seen more than a small handful in
practice on the mirrors that I maintained.

Of course, you could apply the limitation just to the large files
(*.iso), to further reduce the risk of impacting legitimate users.
That can be done with standard Apache configuration, and there's also a
handy NoIPLimit directive to define exclusion rules.

The typical "download accelerator" will download content in chunks (with
partial GET requests / range requests) and often also open parallel
connections to grab a little bandwidth from other users in its own
interest. The HTTP standard suggests at most 2 parallel connections
(the upcoming HTTPbis standard will remove the limit); 4-5 connections
is a frequent default, and some users might (mis-)configure their
clients to use excessive numbers. As mentioned above, I have seen as
many as 300, and that certainly became a problem for the mirror I was
maintaining.

Now there's a little gotcha:
The partial GET requests are logged correctly by Apache, even without
mod_logio, as long as the client doesn't prematurely terminate the
connection. When Apache gets a Range request for bytes x to y, it'll
deliver that range and log the correct number (in the default log file).
However, the more frequently used type of request that typical download
clients send is "Range: bytes=12345-", i.e. they don't specify the end
of the chunk they want, which means "till the end". They then wait until
they have received as much data as they want, and decide whether to
stick to the connection (if it's fast) or to terminate it. If they
terminate it, Apache will log a wrong number (likely the whole file
size).
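
To tell the two kinds of request apart, one would first need the Range
header in the log (mod_log_config can log request headers, e.g. with
%{Range}i). The following Python sketch then classifies a header value,
written out in full as "bytes=12345-" or "bytes=0-1023". The function
name is made up for illustration, and only the simple single-range form
is handled.

```python
import re

# Single-range header values such as "bytes=12345-" or "bytes=0-1023".
# Multi-range and suffix ("bytes=-500") forms are deliberately not
# handled in this sketch and are reported as "not open-ended".
RANGE_RE = re.compile(r"^bytes=(\d+)-(\d*)$")

def is_open_ended(range_header: str) -> bool:
    """True if the range has a start offset but no end offset."""
    m = RANGE_RE.match(range_header.strip())
    return m is not None and m.group(2) == ""

for value in ("bytes=12345-", "bytes=0-1023"):
    kind = "open-ended" if is_open_ended(value) else "closed or other"
    print(f"{value}: {kind}")
```

Open-ended requests are the ones whose logged size becomes unreliable
when the client hangs up early.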

Here's an example:

 % curl -o /dev/null http://doozer.poeml.de/opensuse-education/ISOs/openSUSE-Edu-li-f-e-11.2-1-i686.iso         
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  1 2929M    1 32.7M    0     0  47.5M      0  0:01:01 --:--:--  0:01:01 47.6M^C

(Note how I terminated the client with Ctrl-C after a short while)

For this, Apache logs the following. (I'm breaking the line to make it
more readable.) Note that there are two numbers appended to the common
log format, which are those I added with mod_logio:

87.79.143.238 - - [23/Jan/2010:00:40:29 +0100] 
  "GET /opensuse-education/ISOs/openSUSE-Edu-li-f-e-11.2-1-i686.iso HTTP/1.1" 
  200 3071279104 
  "-" "curl/7.19.7 (i386-apple-darwin9.8.0) libcurl/7.19.7 zlib/1.2.3" 
  189 10020960

As can be seen here, %b from mod_log_config is filled with the whole
file size (3071279104 bytes, about 3 GB in this case), because that's
what Apache intended to send. However, the bytes actually sent were just
10 MB (last number).
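
To compare the two numbers over a whole log file, something like the
following Python sketch could be used. It assumes exactly the layout of
the entry above (combined-style fields followed by the two mod_logio
numbers); the regular expression and function name are illustrative and
would need adapting to any other LogFormat.

```python
import re

# Layout assumed from the example entry: host, ident, user, time,
# request, status, %b, referer, user agent, then the two mod_logio
# fields (bytes received, bytes sent). Escaped quotes inside fields
# are not handled in this sketch.
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<b>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)" '
    r'(?P<bytes_in>\d+) (?P<bytes_out>\d+)$'
)

def logged_vs_sent(line):
    """Return (%b value, bytes actually sent), or None if no match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    logged = 0 if m["b"] == "-" else int(m["b"])
    return logged, int(m["bytes_out"])

# The example entry from above, joined back into a single line.
line = (
    '87.79.143.238 - - [23/Jan/2010:00:40:29 +0100] '
    '"GET /opensuse-education/ISOs/openSUSE-Edu-li-f-e-11.2-1-i686.iso HTTP/1.1" '
    '200 3071279104 '
    '"-" "curl/7.19.7 (i386-apple-darwin9.8.0) libcurl/7.19.7 zlib/1.2.3" '
    '189 10020960'
)
logged, sent = logged_vs_sent(line)
print(f"%b claims {logged} bytes, mod_logio counted {sent} bytes sent")
```

Summing the second value instead of the first over all entries should
give a traffic total that is much closer to reality.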

For connections that run to the end, or for range requests that come
_with_ an end of the range specified, and are not terminated
prematurely, the number would actually be correct, which you can easily
try out.

However, the above "special case" ruins your statistics. Thus, it's good
to always use mod_logio, and/or use vnstat or other (e.g. external)
means to keep an eye on network traffic.

(Don't trust mod_status in this regard, either - the numbers are even
worse; even a HEAD request will count the entire file size.)

I hope you find out that you didn't really experience the extreme
amounts of traffic that it looked like at first! Hopefully, the above
offers a much more welcome explanation for what happened.

Good Luck!

Peter