In my first post on designing scalable 10Gb Ethernet networks I discussed some of the motivations for migrating HPC and HTC computing environments from 1Gb Ethernet to 10Gb Ethernet and how to design scalable, non-blocking 10Gb networks using the Dell Force10 Z9000.  It may be that your current computing environment needs more bandwidth than 1Gb Ethernet offers, but a fully non-blocking 10Gb network would be overkill and underutilized.  It is possible to design scalable 10Gb networks that meet lower IO throughput requirements by introducing oversubscription into the architecture.

Oversubscription exists when the theoretical peak bandwidth needs of the systems connected to a switch exceed the theoretical peak bandwidth of the uplinks out of the switch.  Oversubscription is expressed as the ratio of inputs to outputs (e.g., 3:1), or as a percentage calculated as 1 – (# outputs / # inputs); for example, 1 – (1 output / 3 inputs) = 67% oversubscribed.  It is important to remember that oversubscription in a network is not inherently bad.  It is a feature that must be designed as part of an overall computing solution.  Note that the oversubscription of a network describes the worst-case bandwidth for an environment.  If the servers connected to a leaf switch are not all saturating their individual 10Gb links at the same time, the actual bandwidth delivered to a server will be higher than the oversubscribed value and may even be the full 10Gb of bandwidth.  The actual available bandwidth will depend on the IO access demands across all servers connected to a switch.
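To make the arithmetic concrete, here is a minimal Python sketch of that calculation; the function name and arguments are illustrative only, not part of any switch tooling.

```python
# Minimal sketch: oversubscription from the aggregate downlink (input) and
# uplink (output) bandwidth on a single leaf switch. Names are illustrative.
def oversubscription(downlink_gbps: float, uplink_gbps: float) -> tuple[float, float]:
    ratio = downlink_gbps / uplink_gbps          # e.g. 3.0 means 3:1
    percent = 1 - (uplink_gbps / downlink_gbps)  # 1 - (# outputs / # inputs)
    return ratio, percent

# The 3:1 example from the text: three units of input bandwidth per unit of uplink bandwidth.
ratio, percent = oversubscription(downlink_gbps=3, uplink_gbps=1)
print(f"{ratio:.0f}:1 oversubscribed, or {percent:.0%}")   # 3:1 oversubscribed, or 67%
```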

Using the Dell Force10 Z9000 to design oversubscribed distributed core 10Gb networks is very similar to designing non-blocking networks.  Simply use more of the switch's 40Gb ports to connect to servers than to connect to spine switches.  Suppose it was determined that a 3:1 oversubscribed 10Gb network was the ideal configuration for an environment.  How do you figure out how many downlinks to servers and uplinks to spine switches you need to achieve 3:1 oversubscription?  I know this sounds almost too easy, but the port counts are determined by dividing the total number of ports in the switch by the sum of the number of inputs and the number of outputs in the oversubscription ratio.  The result of this formula is the number of uplink ports.

Applying the formula to the Z9000, which has 32 40Gb ports, gives 32 / (3 + 1) = 8 uplink ports for a 3:1 oversubscribed leaf switch configuration.  Splitting the remaining 24 40Gb ports into four 10Gb ports each results in 96 10Gb SFP+ links for connecting servers.
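For readers who prefer to see the rule as code, here is a short Python sketch of that port split, with illustrative names, applied to the Z9000 numbers above.

```python
# Sketch of the port-split rule: uplink ports = total ports / (inputs + outputs).
def leaf_port_split(total_40g_ports: int, inputs: int, outputs: int) -> tuple[int, int]:
    uplinks_40g = total_40g_ports // (inputs + outputs)  # 40Gb ports up to the spine
    downlinks_40g = total_40g_ports - uplinks_40g        # 40Gb ports broken out to servers
    server_10g_ports = downlinks_40g * 4                 # each 40Gb port splits into 4x 10Gb
    return uplinks_40g, server_10g_ports

# Z9000 leaf at 3:1 oversubscription: 32x 40Gb ports total.
uplinks, server_ports = leaf_port_split(32, inputs=3, outputs=1)
print(uplinks, server_ports)   # 8 uplink ports, 96 10Gb SFP+ server ports
```

The 3:1 ratio checks out: 96 x 10Gb = 960Gb of server-facing bandwidth against 8 x 40Gb = 320Gb of uplink bandwidth.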

As in a non-blocking configuration, leaf switches are connected to spine switches to complete the fat tree network.  The number of spine switches needed is determined by the same method used for a non-blocking network: count the number of 40Gb uplinks from all leaf switches and divide by 32 (since there are 32 40Gb ports in the Z9000).  Leaf switch uplinks are evenly divided among the spine switches.  To achieve balanced performance, you should have the same number of uplinks from a leaf switch connected to each of the spine switches.  Mathematically, the number of uplinks per leaf switch should be evenly divisible by the number of spine switches.
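Here is a rough Python sketch of that spine sizing step; the leaf count is just an example, and the names are illustrative.

```python
import math

# Sketch of spine sizing for a Z9000 fat tree: total leaf uplinks / 32 spine ports.
def spine_count(num_leaves: int, uplinks_per_leaf: int, spine_ports: int = 32) -> int:
    total_uplinks = num_leaves * uplinks_per_leaf
    return math.ceil(total_uplinks / spine_ports)  # round up so every uplink gets a spine port

# Example: 16 leaf switches from the 3:1 design above, each with 8 uplinks.
spines = spine_count(num_leaves=16, uplinks_per_leaf=8)
print(spines)             # 4 spine switches (16 * 8 = 128 uplinks / 32 ports per spine)
print(8 % spines == 0)    # True -> each leaf can run 2 uplinks to each of the 4 spines
```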

It is also possible to use the Dell Force10 S4810 when building distributed core networks.  The S4810 has 48 10Gb SFP+ ports plus four 40Gb uplink ports and may be used for both leaf and spine switches, or as a leaf switch in combination with Z9000 switches in the spine.
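As an illustrative check (not a recommendation for any particular deployment), an S4810 leaf with every 10Gb port facing a server and all four 40Gb ports running to the spine works out to the same 3:1 ratio discussed above.

```python
# Illustrative arithmetic for an S4810 used as a leaf switch.
downlink_gbps = 48 * 10   # 480Gb of server-facing bandwidth
uplink_gbps = 4 * 40      # 160Gb of uplink bandwidth
print(f"{downlink_gbps / uplink_gbps:.0f}:1")                   # 3:1
print(f"{1 - uplink_gbps / downlink_gbps:.0%} oversubscribed")  # 67% oversubscribed
```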

As I stated in my previous post, using a fat tree topology and the distributed core capabilities of the Dell Force10 Z9000 and Dell Force10 S4810 switches enables a massive number of nodes to be connected into a scalable, high-performing 10Gb network.  Start small and grow as your cluster grows.