One of the common questions I see people asking is, “What about 10 Gigabit Ethernet in HPC?” I’ve been seeing lots and lots of FUD (Fear, Uncertainty, and Doubt) floating around about 10 Gigabit Ethernet (10GigE). So I decided to write this blog to start talking about 10GigE and hopefully put to rest some of the FUD.

10GigE – A Little Background

For those that have been watching 10GigE for a while, as I have, let’s go over a little history. It was standardized in 2002. Since that time people have been waiting for the price to drop as it did for GigE. I’ve been one of those people because I wanted a faster network for my clusters at home.

For a good view of the details of 10GigE and the various connection options, look at There have been various companies offering 10GigE NICs such as Chelsio, Neterion, Netxen, Intel, Myricom, Broadcom, Mellanox, and others. There have been a number of switch vendors, from the typical ones (Cisco, Force10, Foundry) to smaller start-up companies.

Without diving too deep, let’s also briefly discuss some Ethernet networking. Recall that Ethernet does not guarantee delivery of a packet. Indeed, that is why TCP was developed: a packet can be dropped or missed or whatever, and the protocol just retransmits the missing packet. Compare this to InfiniBand, where you have guaranteed in-order packet delivery with no dropped packets. In addition, if you build multiple tiers of Ethernet networks, as you would with a large number of nodes and 10GigE, you need to ensure that every switch is running the spanning tree protocol, which makes sure there are no loops in the network and every node is reachable. This can introduce a fair amount of latency (for simple Ethernet networks, people typically recommend turning off spanning tree to get better performance).

Recently, there have been efforts at what was called Data Center Ethernet (DCE) or CEE. It is now being referred to as DCBX (Data Center Bridging Capability Exchange Protocol), and it aims to solve problems in Ethernet-based networks, primarily for FCoE (Fibre Channel over Ethernet), which requires in-order delivery with no packet drops to deliver good performance. This might help in HPC as well. For example, the DCBX initiative is proposing additions to the IEEE standards that allow for lossless Ethernet transmission (matching IB) and for better routing schemes to reduce the impact of spanning tree. But DCBX has not yet been approved and its specification continues to evolve, so it is not a standard at this time.

With these introductory comments, let’s look at 10GigE performance, since HPC is all about performance (well, usually).

10GigE – Performance

There have been many people wanting 10GigE as the computational interconnect for HPC. There are two things that have to be achieved for this to happen on a reasonable scale. The first key is that the application performance with 10GigE has to be good enough to justify the cost. The second key is that the price has to be low enough for people to buy 10GigE systems on a large scale. Price is fairly fluid, even though it doesn’t change much during the year, so I will only talk about it a little bit. But performance is something that we can talk about, and it’s not likely to change too much.

I’m going to talk about 10GigE with respect to TCP since this is the big draw behind 10GigE. Admins everywhere understand TCP. They know how it works, they can debug it using something as simple as tcpdump, and TCP can be routed (which some customers are very interested in). So, let’s focus on 10GigE using TCP. Now comes the fun part – micro benchmarks.

There are several micro benchmarks that are reasonable when examining interconnects. Of course micro benchmarks aren’t always going to be predictors of performance, but without actually benchmarking the specific application, the only thing you can use are micro benchmarks. Over time these micro benchmarks have also become reasonably good predictors of performance for some codes and, more importantly, for certain classes of codes. The three primary micro benchmarks are bandwidth, latency (including multi-core latency scaling), and N/2.

The best tool for computing these micro benchmarks, in my opinion, is netpipe. Netpipe takes two nodes and sends packets of various sizes between them, measuring the time it takes. From this information you can compute three things: latency (the time it takes an extremely small packet to travel from one node to another); bandwidth (the maximum amount of data per unit time that can travel down the wire; recall that you have data going in each direction down the wire); and N/2 (the smallest packet size where you achieve full wire speed in one direction; an easy way to compute this is to find the smallest packet size that reaches half of the maximum bandwidth).
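To make the ping-pong idea concrete, here is a minimal Python sketch of the kind of measurement described above. It is not netpipe and does not reproduce its methodology: it simply bounces messages of several sizes over a TCP connection on the loopback interface and reports half the round-trip time. The function names (`recv_exact`, `echo_server`, `ping_pong`), the message sizes, and the repeat count are all assumptions made for illustration, and loopback numbers say nothing about a real 10GigE link.

```python
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock (TCP may deliver data in pieces)."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(min(remaining, 65536))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_server(listener, sizes, repeats):
    """Accept one connection and echo every message straight back."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for size in sizes:
            for _ in range(repeats):
                conn.sendall(recv_exact(conn, size))


def ping_pong(sizes, repeats=50):
    """Return {size: (one_way_seconds, throughput_MB_per_s)} over loopback."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    server = threading.Thread(target=echo_server, args=(listener, sizes, repeats))
    server.start()

    results = {}
    with socket.create_connection(listener.getsockname()) as client:
        client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for size in sizes:
            payload = b"x" * size
            start = time.perf_counter()
            for _ in range(repeats):
                client.sendall(payload)
                recv_exact(client, size)
            elapsed = time.perf_counter() - start
            one_way = elapsed / (2 * repeats)  # half of the average round trip
            results[size] = (one_way, size / one_way / 1e6)
    server.join()
    listener.close()
    return results


if __name__ == "__main__":
    for size, (one_way, mb_per_s) in ping_pong([2, 1024, 65536]).items():
        print(f"{size:>6} bytes: {one_way * 1e6:8.1f} us one way, {mb_per_s:8.1f} MB/s")
```

Running the same loop between two real nodes would mean replacing the loopback address with the remote host and starting the echo side there; for real measurements netpipe itself is the better tool.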

Below is a sample graph of netpipe results.

Figure 1 – Sample Netpipe Plot of Throughput (MB/s) vs. Packet Size (BlockSize)

This sample is from a GigE network using 3 different MPI libraries. From this graph, you can measure bandwidth (the peak of the chart divided by two) and N/2 (the smallest packet size in bytes where you first reach that bandwidth, that is, half of the peak of the chart).

A second plot, such as the one below, shows throughput against time.

Figure 2 – Sample Netpipe Plot of Throughput vs. Time

From this plot, you can determine the latency, which is the time for a 2-byte packet (or something extremely small).
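The readings taken from the two plots can be sketched in a few lines of Python. This is only an illustration of the conventions used in this post (bandwidth as the peak of the chart divided by two, N/2 as the smallest packet size that reaches half of the peak, latency as the time for the smallest packet). The function name `summarize` and the sample values are made up and are not taken from any real netpipe run.

```python
def summarize(samples):
    """Summarize netpipe-style samples.

    samples: list of (block_size_bytes, time_seconds, throughput_MB_per_s),
    where throughput is the value plotted on the chart.
    Returns (latency_seconds, bandwidth_MB_per_s, n_half_bytes).
    """
    samples = sorted(samples)  # order by block size
    # Latency: the time measured for the smallest block size.
    latency = samples[0][1]
    peak = max(throughput for _, _, throughput in samples)
    # Bandwidth as read in this post: the peak of the chart divided by two.
    bandwidth = peak / 2
    # N/2: the smallest block size whose throughput reaches half of the peak.
    n_half = next(size for size, _, throughput in samples if throughput >= peak / 2)
    return latency, bandwidth, n_half


if __name__ == "__main__":
    # Hypothetical data points, for illustration only.
    made_up = [
        (2, 50e-6, 0.08),
        (1024, 60e-6, 30.0),
        (8192, 120e-6, 110.0),
        (65536, 600e-6, 200.0),
        (1048576, 9e-3, 220.0),
    ]
    latency, bandwidth, n_half = summarize(made_up)
    print(f"latency = {latency * 1e6:.1f} us")
    print(f"bandwidth = {bandwidth:.1f} MB/s")
    print(f"N/2 = {n_half} bytes")
```

A small N/2 means the interconnect reaches its useful bandwidth with small messages, which is why it is reported alongside latency and peak bandwidth.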

In general, the most useful way to test an interconnect is to compute the latency, the maximum bandwidth, and N/2 for a sample configuration. This means having two nodes with a switch in between them, running MPI. This test is the most useful because it contains all the elements of an HPC system: the nodes with the OS, the NICs, the cables, the switch, and the software (MPI).

Several researchers got together and tested the Chelsio 10GigE NIC (T11) that has a TCP Off-load Engine (TOE). They tested the NIC both back-to-back (no switch) and with a 12-port Fujitsu switch. They tested a variety of aspects, but the one that is likely to be most relevant to HPC is the MPI evaluation. They used LAM and an MTU of 1500 bytes. They achieved the following performance:

  • Latency = 10.2 microseconds (native sockets was 8.2 microseconds)
  • Bandwidth = 6.9 Gbps (862.5 MB/s)
  • N/2 = 100,000+ bytes (from Figure 7 in the document)

Unfortunately, the results from this study are the only publicly posted complete results for a pure TCP 10GigE solution that I could find. They are a bit old (2005), but I could not find any other complete set of results. There are other bits and pieces of performance data on the internet, however. For example, Mellanox has a version of their ConnectX HCA that can run TCP natively. This site has some performance information. In particular they list the following:

  • Bandwidth = 9.5 Gbps (1187.5 MB/s) with MTU=1500
  • Bandwidth = 9.9 Gbps (1237.5 MB/s) with MTU=9000

Neither the latency nor the N/2 was listed. In addition, the tests were done with TCP and not MPI.
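The MB/s figures in parentheses above follow from a simple unit conversion, sketched below in Python. The sketch assumes decimal units (1 Gbps = 1000 Mbps, 8 bits per byte), which is what reproduces the numbers in this post, and a nominal line rate of 10 Gbps. The function names are made up for illustration.

```python
LINE_RATE_GBPS = 10.0  # nominal 10GigE line rate (assumption for the efficiency figure)


def gbps_to_mb_per_s(gbps):
    """Convert gigabits per second to megabytes per second (decimal units)."""
    return gbps * 1000 / 8


def line_rate_fraction(gbps):
    """Fraction of the nominal 10 Gbps line rate that was achieved."""
    return gbps / LINE_RATE_GBPS


if __name__ == "__main__":
    # Bandwidth figures quoted in this post.
    for label, gbps in [
        ("Chelsio, MPI, MTU=1500", 6.9),
        ("ConnectX, TCP, MTU=1500", 9.5),
        ("ConnectX, TCP, MTU=9000", 9.9),
    ]:
        print(
            f"{label}: {gbps_to_mb_per_s(gbps):.1f} MB/s, "
            f"{line_rate_fraction(gbps):.0%} of line rate"
        )
```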

10GigE – Observations

I think it’s worthwhile to compare the performance of 10GigE vs. IB. Mellanox has some performance numbers, as does this site, and from these numbers we have the following performance for DDR IB:

  • Latency = below 1 microsecond
  • Bandwidth = 3000 MB/s for InfiniBand DDR using PCIe Gen1, 3800 MB/s for InfiniBand DDR using PCIe Gen2, and 6600 MB/s for InfiniBand QDR using PCIe Gen2 (these are the bi-directional bandwidth numbers)
  • N/2 = 480 bytes

If we compare these results to 10GigE, we see that DDR and QDR IB have much better performance than 10GigE at this time. We’ve also recently seen QDR (Quad-Data Rate) IB introduced that will push up the bandwidth but will not likely have much impact on latency or N/2 for the initial implementations. But recall these are micro-benchmarks and while they can predict performance, they do not replace testing your application.
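One rough way to see how these micro-benchmark numbers interact is a first-order model where the time to move a message is the latency plus the message size divided by the bandwidth. This model is an assumption made here for illustration, not something measured in the studies above, and it ignores N/2 effects entirely. The sketch uses the Chelsio MPI figures for 10GigE (10.2 microseconds, 862.5 MB/s) and, for DDR IB, assumes 1 microsecond (the upper bound of "below 1 microsecond") and 1500 MB/s (half of the bi-directional 3000 MB/s PCIe Gen1 figure).

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """First-order estimate: fixed latency plus time on the wire."""
    return latency_s + nbytes / bandwidth_bytes_per_s


# Figures quoted in this post; the IB values rest on the assumptions stated above.
TEN_GIGE = {"latency_s": 10.2e-6, "bandwidth_bytes_per_s": 862.5e6}
IB_DDR = {"latency_s": 1.0e-6, "bandwidth_bytes_per_s": 1500e6}


def slowdown(nbytes):
    """How many times longer 10GigE takes than DDR IB for one message."""
    return transfer_time(nbytes, **TEN_GIGE) / transfer_time(nbytes, **IB_DDR)


if __name__ == "__main__":
    for nbytes in (8, 1024, 1_000_000):
        print(f"{nbytes:>9} bytes: 10GigE takes {slowdown(nbytes):.1f}x as long")
```

Under this model the gap is widest for small messages, where latency dominates, and narrows toward the bandwidth ratio for large ones. Treat it only as a sketch; it does not replace testing your application.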

I said I would only talk about price a little since it is somewhat fluid, but in general terms, for reasonably sized clusters (32+ nodes), the cost per port of 10GigE is much higher than IB at this time.

Will 10GigE Move Into HPC?

The title of this section is a little provocative because I think the answer is yes. But the caveats are “how much” and “when”. Right now, 10GigE does not have the same level of performance as IB on the common micro-level benchmarks. Moreover, the performance is not likely to improve much. It’s still going to have a latency of about 8-10 microseconds and a bandwidth around 1,100-1,200 MB/s. It definitely does have room to improve the N/2 and it’s likely to do this (I just wish someone would publicly post some more recent numbers).

In addition, there are some issues with using TCP over Ethernet, such as not being able to have a lossless network and the extra latency that spanning tree introduces. DCBX may help in these areas, but until it is a standard it’s not much use to HPC.

So, what’s my advice? Ideally you should benchmark your application on various networks to measure the performance. In particular, look at how the application scales, since HPC systems tend to get bigger and not smaller (i.e. people run on more cores every year). But the micro-benchmarks for 10GigE aren’t the best at this time and will not ever reach the performance of InfiniBand. Right now the cost of 10GigE prohibits its widespread adoption in HPC.

I’ve been waiting at least 4 years for the price of 10GigE to come down. I’m still waiting and unfortunately growing old. During this time InfiniBand has become the dominant network in HPC. Its performance has greatly improved and the prices have come way down, to about $250 a node for smaller systems. I think 10GigE has a long way to go to become a common HPC network.