Hello,
This is a CentOS 5.2 box configured as a router for a network handling about 200-300 Mbps, routing traffic to/from the internet for about 6,000 IPs.
After about 2-3 days, the kernel complains about "dst cache overflow" and, even though it hasn't crashed, the network is unresponsive. All IP forwarding stops and the server cannot be reached from any network interface.
From my diagnosis, there appears to be an rt_cache leak in the kernel. At 5-minute intervals I collect the following two values:
`/sbin/ip -o route ls cache | wc -l` `grep ip_dst_cache /proc/slabinfo | awk -F' ' '{ print $2; }'`
The first value represents the number of cached routes in the network stack. The second value represents the number of cached route objects the kernel has allocated.
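For reference, here is a rough sketch of how the two values could be logged side by side. The function names, the `SLABINFO` variable, and the log path in the comment are made up for this example; the commands are the same two shown above, and column 2 of a `/proc/slabinfo` line is the active object count.

```shell
#!/bin/sh
# Sketch only: SLABINFO and the function names are hypothetical helpers.
SLABINFO="${SLABINFO:-/proc/slabinfo}"

# Number of entries currently in the routing cache.
cached_routes() {
    /sbin/ip -o route ls cache | wc -l
}

# Active ip_dst_cache objects: column 2 of the matching slabinfo line.
dst_cache_objs() {
    awk '$1 == "ip_dst_cache" { print $2 }' "$SLABINFO"
}

# Print one sample: epoch seconds, cached routes, slab objects.
sample() {
    echo "$(date +%s) $(cached_routes) $(dst_cache_objs)"
}

# To log every 5 minutes (reading /proc/slabinfo usually needs root):
#   while :; do sample >> /var/log/dst_cache.log; sleep 300; done
```

Matching on the exact slab name in awk avoids picking up any other slab whose name merely contains `ip_dst_cache`.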
Over 8 hours of data collection, the cached route count stayed fairly constant, between 4,000 and 8,000 routes, while the number of cached route objects grew from about 220,000 to 410,000.
I posted a pretty graph at http://www.pier-pro.com/ip_dst_cache_leak.png (the blue line is the value of ip_dst_cache, the green line is the count of cached routes).
Once the ip_dst_cache value reaches the value of /proc/sys/net/ipv4/route/max_size then the network fails and the kernel complains about 'dst cache overflow' whenever a packet arrives.
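As a rough way to see how much time is left before that point, the sketch below extrapolates from two samples of ip_dst_cache. It assumes the growth is linear, which is only an assumption on my part, and `hours_to_overflow` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: estimate hours until the object count reaches max_size,
# assuming linear growth between two samples (an assumption).
# $1 = earlier count, $2 = current count, $3 = hours between samples, $4 = max_size
hours_to_overflow() {
    awk -v s="$1" -v c="$2" -v h="$3" -v m="$4" 'BEGIN {
        rate = (c - s) / h              # objects gained per hour
        if (rate <= 0) { print "inf"; exit }
        printf "%.1f\n", (m - c) / rate # remaining headroom / growth rate
    }'
}

# Example with the counts from my 8-hour sample and the live limit:
#   hours_to_overflow 220000 410000 8 "$(cat /proc/sys/net/ipv4/route/max_size)"
```

With those counts the growth rate works out to (410,000 - 220,000) / 8 = 23,750 objects per hour.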
The only solution at that point is to perform a reboot.
Flushing the routing cache with `echo 1 > /proc/sys/net/ipv4/route/flush` only clears the cached routes; the value of ip_dst_cache does not change.
According to http://linux.derkeiler.com/Mailing-Lists/Fedora/2005-07/1175.html this is a known bug that was fixed in 2.6.11; however, I'm running 2.6.18 (as updated with `yum update`).
I downloaded the kernel sources, and they do indeed contain the bug fix described in the above article.
Therefore ... I'm at a loss as to where to go from here. Certainly rebooting the server every day is not an option, and increasing the max_size will just delay it.
Suggestions?
Thank you,
Hector