Many HPC and high-throughput computing (HTC) application environments are well served by gigabit Ethernet as the primary cluster interconnect. However, increasing processor core counts and the availability of cost-effective quad-socket systems are growing the I/O demands of compute nodes. When the available bandwidth of a gigabit connection is exceeded, I/O wait cycles are introduced, the overall throughput of the compute node is constrained, and CPU utilization drops. Transitioning to 10Gb Ethernet is one way to address this increasing I/O demand.

A challenge with 10Gb Ethernet networks for clusters is deploying a network that scales as you grow while minimizing cost. A conventional multi-tier design is one possible way to build a scalable 10Gb network. However, the bandwidth available at the top tier limits the size of the network that can be built, and the first-tier switches often introduce over-subscription because their uplink bandwidth is less than the bandwidth required by the systems connected to them. One solution to these limitations is to build the network with a switch like the Dell Force10 Z9000™ in a "fat tree" topology, taking advantage of the Z9000's distributed core capabilities. The Distributed Core Architecture Using the Z9000 Core Switching System white paper discusses the advantages of this approach and the communications protocols used in implementation, but it does not cover how to actually design a fat tree network.

The Dell Force10 Z9000 is a line-rate, 32-port 40GbE, two-rack-unit, top-of-rack (TOR) switch. Each 40Gb QSFP port can be split into four 10Gb SFP+ ports using a simple splitter cable.
A two-tier fat tree network uses what are commonly called "leaf" and "spine" switches. Leaf switches sit at the edge of the network and connect to the compute or storage nodes in the cluster. Spine switches make up the second tier and connect the leaf switches together. I will cover non-blocking designs in this post and follow up with some possibilities for oversubscribed designs at a later date.
A non-blocking leaf switch configuration using the Z9000 is made by splitting half of the Z9000's 32 40Gb ports into 10Gb ports, enabling the connection of up to 64 compute nodes (each 40Gb port is split into four 10Gb ports). The remaining 16 40Gb ports are used as uplinks to the spine switches. Because an equal number of 40Gb ports is devoted to compute-node connections and spine connections, this is a non-blocking configuration.
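The port budget above can be sketched in a few lines of Python. This is just an illustration of the arithmetic; the only hardware facts assumed are the Z9000's 32 40Gb ports and the 4:1 splitter cables described earlier.

```python
# Non-blocking leaf switch port budget for a Z9000 (illustrative sketch).
Z9000_40G_PORTS = 32   # total 40Gb QSFP ports on the switch
SPLIT_RATIO = 4        # each 40Gb QSFP port splits into four 10Gb SFP+ ports

# Split half the 40Gb ports toward compute nodes; keep half as uplinks.
node_facing_40g = Z9000_40G_PORTS // 2           # 16 ports
uplink_40g = Z9000_40G_PORTS - node_facing_40g   # 16 ports

node_ports_10g = node_facing_40g * SPLIT_RATIO   # 64 compute-node ports

# Non-blocking check: downlink bandwidth equals uplink bandwidth.
assert node_facing_40g * 40 == uplink_40g * 40   # 640Gb each way

print(node_ports_10g, uplink_40g)   # prints: 64 16
```

The same sketch also makes the non-blocking property explicit: 640Gb of node-facing bandwidth is matched by 640Gb of uplink bandwidth.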
To complete the fat tree network, the leaf switches are connected to spine switches. The number of spine switches needed is determined by counting the 40Gb uplinks from all leaf switches and dividing by 32 (the number of 40Gb ports in the Z9000). To connect the network, the leaf switch uplinks are divided evenly among the spine switches.
The following example has four leaf switches in the non-blocking configuration, which can support a 256-node cluster. There are a total of 64 40Gb uplinks (16 per leaf) spread evenly across two Z9000 spine switches.
The maximum number of non-blocking 10Gb-connected systems that can be configured into a single network fabric using the leaf switch configuration described here is 2048: the number of 40Gb ports in a spine switch (32) multiplied by the number of nodes that can be connected to each leaf switch (64). If we instead use 10Gb connections between the spine and leaf switches, by splitting each QSFP port into four SFP+ ports, the Z9000 becomes a 128-port 10Gb switch, and the maximum number of non-blocking 10Gb ports in a single fabric grows to 8192 (128 spine switch ports * 64 leaf switch node connections).
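Both fabric-size limits can be checked with the same arithmetic. This sketch simply restates the two calculations from the text: the spine port count caps the number of leaves, and each leaf carries 64 nodes.

```python
# Maximum non-blocking fabric sizes for the two spine-link options.
SPINE_PORTS_40G = 32    # Z9000 spine with 40Gb links to the leaves
SPINE_PORTS_10G = 128   # Z9000 spine with every QSFP port split 4-way
NODES_PER_LEAF = 64     # 10Gb compute nodes per non-blocking leaf

# One spine port is consumed per leaf uplink, so the spine port count
# bounds the number of leaf switches in the fabric.
max_nodes_40g_spine = SPINE_PORTS_40G * NODES_PER_LEAF   # 2048 nodes
max_nodes_10g_spine = SPINE_PORTS_10G * NODES_PER_LEAF   # 8192 nodes

print(max_nodes_40g_spine, max_nodes_10g_spine)   # prints: 2048 8192
```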
Designing a conventional 10Gb multi-tier topology either limits the size of your network to hundreds of ports or requires the introduction of oversubscription, and purchasing a core switch of sufficient size could cost hundreds of thousands of dollars. A fat tree topology based on the Z9000 enables a massive number of nodes to be connected into a scalable, high-performing network. You can start small, grow as your cluster grows, and dramatically reduce the cost of the solution.
In part 2 I will cover how to design oversubscribed fat tree networks.