On Monday, March 28, 2011 08:29:58 pm mcclnx mcc wrote:
> To answer your questions:
>
> 1. The network is an intranet, not the Internet.
>
> 2. Server CPU and I/O load are very light. We used "sar -u" and "sar -b" to check.
>
> 3. It is NOT only one server that has this network slowness; at least 4 to 5 servers on that rack all report being slow. It is NOT possible that every server on that rack is heavily loaded.

I have seen issues in the past with certain Broadcom gigabit Ethernet NICs and the tg3 Linux kernel driver; occasionally the NIC would just go into 'molasses' mode and get really slow. I haven't seen the problem in quite a while, though, so I don't know whether that issue has been fixed; of course, not seeing the problem doesn't mean it no longer occurs. I never saw a correlation between multiple servers with that NIC going slow at the same time, however.

The next thing I would check is the network switch these servers are attached to. Many switches, especially Cisco Catalyst switches with hardware-assisted forwarding at mixed layers, serve multiple physical ports from a single ASIC. Your networking people can check from the Cisco IOS command line whether the ASIC is throwing errors; the particular commands vary by ASIC, switch model, and operating system. I had an older Catalyst 2900XL (I did say 'older', after all) where a certain set of ports would hang and go slow for minutes at a time; I plugged the devices into ports served by a different ASIC, and things got better. It was the ASIC; on that switch each ASIC serves eight ports. I then put a home-made permaplug into each of the bad ports. (A permaplug, something I make a few dozen of every so often, is an RJ45 plug with no contacts, the latch release cut off, and the back end filled with red silicone; it'll go in, but it takes some work to get back out. I have been known to epoxy them into bad ports to keep people from trying to use them....)
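One quick way to test the flaky-NIC theory from the Linux side is to watch the interface error and drop counters; counters that keep climbing while throughput is poor point at the NIC, driver, or cabling rather than server load. A minimal sketch, assuming the standard Linux /proc/net/dev layout (the helper and sample text below are mine, not from the original thread):

```python
# Hedged sketch (helper name is mine): parse /proc/net/dev-style text
# and pull out per-interface error and drop counters.

def parse_net_dev(text):
    """Return {iface: {rx_errs, rx_drop, tx_errs, tx_drop}}."""
    stats = {}
    for line in text.splitlines()[2:]:        # first two lines are headers
        if ':' not in line:
            continue
        iface, rest = line.split(':', 1)
        f = [int(x) for x in rest.split()]
        # Linux layout: 8 RX fields then 8 TX fields;
        # errs/drop sit at RX indices 2,3 and TX indices 10,11.
        stats[iface.strip()] = {'rx_errs': f[2], 'rx_drop': f[3],
                                'tx_errs': f[10], 'tx_drop': f[11]}
    return stats

# Sample input; on a live box you would read open('/proc/net/dev') instead.
sample = (
    "Inter-|   Receive                  |  Transmit\n"
    " face |bytes packets errs drop fifo frame compressed multicast|...\n"
    "  eth0: 900000 1200 7 3 0 0 0 0 400000 800 0 1 0 0 0 0\n"
)
print(parse_net_dev(sample)['eth0'])
```

Sampling the counters twice a few minutes apart and diffing tells you whether errors are historical or accumulating right now.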
And the last problem I had was related to a new IP security camera with multicast features. Note to self: always check that multicast is set to OFF if the subnet the camera sits on is not carried entirely by multicast-aware switches. I had lots of devices just give up under the sustained 5 Mb/s multicast load. Multicast traffic also doesn't necessarily show up in the usual places you check network traffic; most of the time you need Wireshark running on a SPAN port to catch it. Since I wasn't aware that multicast was on by default, it took an inordinate amount of time to find the issue; I had switches giving up, losing BPDUs, causing spanning-tree loops, etc. It was not a pleasant day. My console terminal servers for devices, SitePlayer Telnets, all stopped responding completely after an hour of that sort of traffic. Like I say, it was not a pleasant day. I have revisited the multicast filtering features of many of my switches since that issue.
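For anyone wondering why one camera can flatten a whole subnet: a switch without IGMP snooping treats multicast frames like broadcasts and floods them out every port, so every device eats the full 5 Mb/s. An IGMP-snooping switch instead keys on the multicast destination MAC, which for IPv4 is the fixed prefix 01:00:5e plus the low 23 bits of the group address. A small sketch of that mapping (the helper name is mine):

```python
# Hedged sketch (helper name is mine): map an IPv4 multicast group
# address to its Ethernet destination MAC. The MAC is the fixed OUI
# prefix 01:00:5e followed by the low 23 bits of the IP address;
# clearing the top bit of the second octet drops bit 24.

def multicast_mac(group_ip):
    octets = [int(o) for o in group_ip.split('.')]
    if not 224 <= octets[0] <= 239:
        raise ValueError("not an IPv4 multicast address: " + group_ip)
    return "01:00:5e:%02x:%02x:%02x" % (octets[1] & 0x7F, octets[2], octets[3])

print(multicast_mac("239.255.0.1"))   # -> 01:00:5e:7f:00:01
```

Because only 23 of the 28 significant IP bits survive, 32 different groups share each MAC, which is one reason snooping switches sometimes still deliver traffic you didn't subscribe to.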