Hello,
This is a CentOS 5.2 box configured as a router for a network handling about 200-300 Mbps, routing traffic to/from the internet for about 6,000 IPs.
After about 2-3 days, the kernel complains about "dst cache overflow" and, even though it hasn't crashed, the network is unresponsive. All IP forwarding stops and the server cannot be reached on any network interface.
After diagnosing the issue, it appears that there is a rt_cache leak in the kernel. At 5 minute intervals I collect the following two values:
`/sbin/ip -o route ls cache | wc -l`
`grep ip_dst_cache /proc/slabinfo | awk -F' ' '{ print $2; }'`
The first value represents the number of cached routes in the network stack. The second value represents the number of cached route objects the kernel has allocated.
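For anyone who wants to reproduce the measurement, here is a minimal sketch of a sampler built around those two commands. The function names, the `SLABINFO` variable, and the script and log paths in the cron comment are illustrative and not from my actual setup. It assumes the 2.x slabinfo layout, where the second column is the active object count.

```shell
#!/bin/sh
# Sketch only: log both counters with a timestamp so they can be graphed later.
# SLABINFO and the function names are illustrative assumptions.

SLABINFO="${SLABINFO:-/proc/slabinfo}"

# Number of ip_dst_cache objects currently allocated
# (second column of the matching slabinfo line).
dst_cache_objects() {
    awk '$1 == "ip_dst_cache" { print $2 }' "$SLABINFO"
}

# Number of entries currently in the routing cache.
cached_routes() {
    /sbin/ip -o route ls cache | wc -l
}

# Emit one sample line: epoch seconds, cached routes, allocated objects.
sample() {
    echo "$(date +%s) $(cached_routes) $(dst_cache_objects)"
}

# Only take a sample when asked to, e.g. from cron every 5 minutes:
#   */5 * * * * /usr/local/bin/dst-sample.sh sample >> /var/log/dst-cache.log
case "${1:-}" in
    sample) sample ;;
esac
```

Matching the slab name exactly with awk, rather than with grep, is just a precaution against picking up similarly named slabs.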
Over 8 hours of data collection, the cached routes count stayed fairly constant, between 4,000 and 8,000 routes, while the number of cached route objects grew from about 220,000 to 410,000.
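To put a rough number on the leak, the arithmetic can be sketched as follows, taking the figures above at face value and assuming the growth is roughly linear. The `hours_until` helper is hypothetical; it takes the limit as a parameter because the actual max_size on the box is not stated here.

```shell
# Worked arithmetic for the figures above (integer shell arithmetic).
start=220000      # objects at the start of the 8-hour window
end=410000        # objects at the end of the 8-hour window
hours=8           # length of the window in hours

growth=$(( (end - start) / hours ))
echo "ip_dst_cache grows by about $growth objects per hour"

# Hypothetical helper: whole hours until a given limit is reached at a given rate.
# Arguments: limit, current object count, growth per hour.
hours_until() {
    limit=$1; current=$2; rate=$3
    echo $(( (limit - current) / rate ))
}
```

That works out to 23,750 objects per hour. Calling `hours_until "$(cat /proc/sys/net/ipv4/route/max_size)" "$end" "$growth"` on the router would give an estimate of the time left before the overflow.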
I posted a pretty graph at http://www.pier-pro.com/ip_dst_cache_leak.png (the blue line is the value of ip_dst_cache, the green value represents the count of cached routes).
Once ip_dst_cache reaches the value of /proc/sys/net/ipv4/route/max_size, the network fails and the kernel complains about 'dst cache overflow' whenever a packet arrives.
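Going by that observation, a check like the following could warn before the limit is hit. This is a sketch under assumptions: the function names, the overridable path variables, and the default 80 percent threshold are all illustrative.

```shell
# Sketch only: report how full the dst cache is relative to max_size.
# Path variables, function names and the default threshold are assumptions.
SLABINFO="${SLABINFO:-/proc/slabinfo}"
MAX_SIZE_FILE="${MAX_SIZE_FILE:-/proc/sys/net/ipv4/route/max_size}"

# Print ip_dst_cache usage as a whole percentage of max_size.
dst_cache_usage_percent() {
    objects=$(awk '$1 == "ip_dst_cache" { print $2 }' "$SLABINFO")
    max_size=$(cat "$MAX_SIZE_FILE")
    echo $(( objects * 100 / max_size ))
}

# Print a warning when usage is at or above the threshold (default 80).
warn_if_nearly_full() {
    threshold=${1:-80}
    pct=$(dst_cache_usage_percent)
    if [ "$pct" -ge "$threshold" ]; then
        echo "WARNING: ip_dst_cache at ${pct}% of max_size"
    fi
}
```

Running `warn_if_nearly_full 90` from cron would at least give some notice before forwarding stops, although it does nothing about the leak itself.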
The only solution at that point is to perform a reboot.
Flushing the routing cache with `echo 1 > /proc/sys/net/ipv4/route/flush` only clears the cached routes; the value of ip_dst_cache does not change.
According to http://linux.derkeiler.com/Mailing-Lists/Fedora/2005-07/1175.html this is a known bug that was fixed in 2.6.11, however, I'm running 2.6.18 (as updated with `yum update`)
I downloaded the kernel sources, and indeed, the kernel source contains the bug fix in the above article.
Therefore ... I'm at a loss as to where to go from here. Certainly rebooting the server every day is not an option, and increasing the max_size will just delay it.
Suggestions?
Thank you,
Hector
Hector Herrera wrote:
Hello,
This is a CentOS 5.2 box configured as a router for a network handling about 200-300 Mbps, routing traffic to/from the internet for about 6,000 IPs.
Therefore ... I'm at a loss as to where to go from here. Certainly rebooting the server every day is not an option, and increasing the max_size will just delay it.
Suggestions?
Use a real router or L3 switch to do the job instead of a PC? Or run a newer, patched kernel on the system. 300 Mbit/s is trivial.
Even decent modern gigabit L3 switches can forward over 100 million packets per second, which is tens of gigabits per second of data. (Not talking about Cisco gear; it's astonishing how poorly most Cisco gear performs given its price.)
nate
On Wednesday 11 February 2009, Hector Herrera wrote: ...
After about 2-3 days, the kernel complains about "dst cache overflow" and, even though it hasn't crashed, the network is unresponsive. All IP forwarding stops and the server cannot be reached on any network interface.
...
According to http://linux.derkeiler.com/Mailing-Lists/Fedora/2005-07/1175.html this is a known bug that was fixed in 2.6.11, however, I'm running 2.6.18 (as updated with `yum update`)
I downloaded the kernel sources, and indeed, the kernel source contains the bug fix in the above article.
Therefore ... I'm at a loss as to where to go from here. Certainly rebooting the server every day is not an option, and increasing the max_size will just delay it.
Suggestions?
Have a look around the upstream (Red Hat) Bugzilla to see if there is a fix in the pipeline. If not, you'll have to either run a newer kernel or add the patch to the CentOS kernel and rebuild it (both ways are quite messy).
/Peter
According to http://linux.derkeiler.com/Mailing-Lists/Fedora/2005-07/1175.html this is a known bug that was fixed in 2.6.11, however, I'm running 2.6.18 (as updated with `yum update`)
It could be something new. I got dst cache overflows before, and it was a while before the bug behind the one I saw was finally identified. Some references below. All I remember is that the chap who finally paid some attention really had to dig through the code before he found it and informed Dave Miller.
http://oss.sgi.com/cgi-bin/extract-mesg.cgi?a=netdev&m=2004-06&i=40C...