[CentOS] Re: 10Gbit ethernet
Chris Payne
Chris.Payne at triumf.ca
Fri Mar 14 17:19:54 UTC 2008
On Fri, Mar 14, 2008 at 11:06:18AM +0000, Jake Grimmett wrote:
> I'm probably not going to feed 10Gb to the individual blades, as I have few
> MPI users, though it's an option with IBM and HP blades. However, IBM and
> Dell offer a 10Gbit XFP uplink to the blade server's internal switch, and
> this has to be worthwhile with 56 CPUs on the other side of it.
>
> I'm most concerned about whether anyone has tried the NetXen or Chelsio 10Gbit
> NICs on CentOS 5.1; I see drivers in /lib/modules for these...
>
> Also - do people have good / bad experiences of CX4 cabling? As an economical
> short-range solution (<15m) it seems ideal for a server room. A sales rep is
> trying to scare me off it, but he is biased, as the 10Gb SR XFP transceivers
> are very expensive (~£820)...
Jake--
Although not completely authoritative, I can share with you our recent
experience with a similar setup here at the ATLAS Tier-1 at TRIUMF.
We have several IBM BladeCenter H chassis with dual dual-core CPUs (i.e. 4
cores/blade), so 56 cores per chassis. Each chassis uses a 10GigE (SR XFP)
uplink to our Force10 router, with each chassis on a private VLAN and static
routes to the storage (public IP) nodes.
Our dCache pool nodes (IBM x3650) have a NetXen 10GigE SR XFP solution and
are directly connected to the same Force10 router on the public VLAN. Since
we are on SL4.5, we are using NetXen's own driver, as the native kernel
driver has not yet been backported (or has it now?).
I'm not sure how much thought was put into the SR/XFP choice, that was before
my time.
Throughput is good in raw tests (iperf etc.), but we saw issues with our
production transfer applications under certain circumstances; specifically,
when running multiple concurrent GridFTP transfers, each with multiple
parallel streams (see http://www.globus.org/toolkit/data/gridftp/). I believe
10 transfers with 10 streams each (not my department) would cause the network
card to "lock up", with connectivity completely lost. This generally took
only a matter of minutes.
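For reference, the kind of load that triggered it can be sketched roughly as
below. The hostnames and paths are placeholders, not our actual endpoints, and
globus-url-copy's -p flag sets the number of parallel TCP streams. With
RUN=echo it is a dry run that only prints the commands it would launch:

```shell
#!/bin/sh
# Sketch of the lockup-triggering load: 10 concurrent GridFTP
# transfers, each with 10 parallel TCP streams (-p 10).
# Hostnames and paths below are placeholders, not our real setup.
SRC=gsiftp://source.example.org/data/testfile
DST=gsiftp://dest.example.org/scratch
RUN=echo   # dry run: prints the commands; set RUN= to really transfer

stress() {
    for i in $(seq 1 10); do
        $RUN globus-url-copy -p 10 "$SRC" "$DST/part.$i" &
    done
    wait
}

stress
```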
Using the RSA interface on the x3650 we could still get in, but there was
nothing in the logs, dmesg, etc. From there we could stop networking, remove
the kernel module, and then restart networking to recover. However, if the
transfers were still retrying, the card would soon lock up again, and so on.
Occasionally rmmod'ing it would cause a kernel oops, but as far as I could
tell this was not reproducible. If the transfers were killed, the machine
generally recovered.
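In script form, the recovery dance looked roughly like this. Note the module
name nx_nic is an assumption on my part (check lsmod for whatever NetXen's
driver actually registers as on your system); RUN=echo keeps it a dry run:

```shell
#!/bin/sh
# Recovery sequence run over the RSA console after a lockup.
# Module name nx_nic is an assumption; check `lsmod` on your system.
RUN=echo   # dry run: prints the commands; set RUN= to execute them

recover() {
    $RUN service network stop
    $RUN rmmod nx_nic
    $RUN modprobe nx_nic
    $RUN service network start
}

recover
```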
We verified the problem was localized to the 10GigE card by bonding the
onboard 1GigE cards to reach similar rates; the same test transfers then
completed successfully.
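On SL4/RHEL4 that bonding setup amounts to a couple of config fragments,
roughly as follows. Device names, bonding mode, and the address are
illustrative only and will differ per site:

```shell
# /etc/modprobe.conf -- load the bonding driver for bond0
# (mode and miimon values are illustrative)
alias bond0 bonding
options bonding mode=balance-rr miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- example address only
DEVICE=bond0
IPADDR=192.0.2.10
NETMASK=255.255.255.0
ONBOOT=yes
BOOTPROTO=none

# /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for eth1)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```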
Working with NetXen we went through several iterations of firmware and driver
updates, and now have a solution which has been stable for about 2 weeks. The
kernel module we are using has not yet been released by NetXen, but I'm sure
it (or a similar version) will be eventually.
Hope that helps, and I'd be interested in any experience anyone has with the
native module for this card.
Cheers
Chris
--
Chris Payne chris.payne at triumf.ca
TRIUMF ATLAS Tier-1 System Administrator - Networking
TRIUMF +1 604 222 7554
4004 Wesbrook Mall, Vancouver, BC, V6T2A3, CANADA