Join us at Super Computing 2011!We invite you to visit us at the SC2011 conference in Seattle, Nov 14-17 at Booth #2040 See first-hand how we are enabling research discovery with Dell HPC solutions.
One of the things that make a cluster, a cluster, is the network. It allows the nodes to send data to each other as part of a parallel application (upcoming link to article about how parallel programs work), it is the vehicle for reading and writing data, and for monitoring and controlling the cluster. Let’s begin a discussion about HPCC networks by talking about what kind of data needs to be pushed over the network. Since HPC is focused on performance, people typically first focus on computational traffic. This is the data being exchanged between nodes as part of the computational application. But, in addition to computational traffic, clusters also have the following traffic:
You can see that networks are pretty important to clusters because of all of the traffic that they carry. In particular having the right network(s) to accommodate the needed traffic, especially the management traffic, is vital to having a cluster that functions well and is easy to administer. Given this list of functions that use the network, most HPCC systems have two networks:
One of the great features of clusters is that they are so flexible and can be adapted to the situation or requirements. For example, you can use just a single network if you like, but it needs to be TCP based because of the management traffic. You can also use more than two networks. For example if you have one network for management, one for computation, and one for storage. The point is that the network design, while being one of the most important facets of a cluster, is very flexible.I did leave out the storage network from the previous list. In general you can run the storage traffic over the management network or the computational network depending upon your situation. Or you can even have a dedicated storage network. Determining which one is better suited for your situation is beyond the scope of this article. But for small I/O needs, it’s probably a good idea to put the storage traffic over the management network. For larger amounts of I/O performance, then it’s advisable to put the storage traffic on the computational network but you need to make sure that there won’t be traffic conflicts on the network. If there are conflicts or the amount of I/O is overwhelming, then a dedicated storage network may be needed.Let’s look a little closer at the management and computational networks.Management Network:This network carries management traffic that consists of:
In addition, depending upon the size of the cluster and the configuration, the management network may also have storage traffic.There are many options for networks. Not only are there options for the network hardware itself, but there are also options for network layout, and protocols for the network. So there is a reasonable amount of flexibility in designing and selecting networks for the best performance and the best price/performance. But in general, the management network is TCP based because of the IPMI and imaging traffic. Current clusters almost always use GigE as the management network. Future clusters may use 10GigE with an eye toward putting storage traffic on the management network.If the management network only carries management traffic, you can adjust the network topology to save some money. In general management traffic is fairly small and uses reasonably small packets. Having a management network without full bisectional bandwidth is a design that people typically use. For multi-tiered networks, this saves cables and network ports, allowing you to use smaller switches. This approach also does not overly impact the performance of the management traffic.Computational Network:The computational network does one thing – carry computational traffic. It’s a simple requirement but the exact details of the computational traffic vary from application to application. This can make choosing a network for computations difficult when the applications are varied.To choose a good computational network, you need to understand your applications. What kind of messages do they send? What sizes? How often? What kind of messages (e.g. AlltoAll or point-to-point)? How do the message sizes vary when the same problem is run over an increasing number of cores? Do the applications appear to send lots of small messages, indicating that latency could be an important factor in performance, or does the application send larger messages, perhaps indicating that bandwidth is important?Understanding your applications is perhaps the most important step in choosing a network. The next step is to look at various, so-called, micro-benchmarks of network solutions. Typical benchmarks you might see are:
These micro-benchmarks can be very useful in predicting application performance. On the other hand, sometimes they are not. The general rule is to always benchmark your application if possible and not rely on micro-benchmarks. But if you can’t then you need to know how micro-benchmarks relate to your application performance.Before moving on to a discussion about network options, I want to mention one important point about choosing a network for applications. In addition to knowing how your applications behave (i.e. what their bottlenecks are) as well as perhaps looking at micro-benchmarks for various network options, one extremely important aspect you need to examine is, how many nodes you are likely to use in a single job.Let me explain a bit more. If your application needs InfiniBand to scale well, perhaps over 64 nodes, but your problem sizes are such that you can run on 8 nodes at a time, then you may not need InfiniBand. Since you are running on only a small subset of the nodes, then perhaps GigE will give you good enough performance on 8 nodes. If you had a 128-node machine then you could run 16 jobs at the same time (assuming 8 nodes per job) and GigE would be sufficient. However, this scenario is perhaps a little rosy, but it does point out that in addition to looking at how the performance of your applications scales with the number of nodes, you also need to look at your problem sizes and the number of nodes you are likely to use in a single job. I hate to be critical but many people overlook this simple concept.As mentioned before, there are several options for computational network hardware. The primary options for computational traffic are:
There are others but they are not as popular as the ones on this list. In this document I will be referring to these options as “networking hardware.” In general, this means the network interface cards (NICs), the connecting wires (cables), and the switches. In many cases, it also refers to the controlling software as well. But the actual network protocol can be considered separately because for GigE and 10GigE it’s possible to run different protocols, not just TCP, by using them on the IP layer. Also, in the case of Myrinet 10G it’s possible to run TCP and mx, the Myrinet protocol, on the same hardware. Plus the Mellanox ConnectX HCA for InfiniBand will have the ability to run TCP protocols as well.These are the typical “network hardware” options. But it’s possible to arrange your network in ways that suit your application to improve performance and/or reduce costs. This is referred to as the network topology. The next section talks about the typical approaches people use for both the management network and the computational network.Network TopologiesThere have been, and continues to be, a great of research and discussion about network topologies, particularly for HPC. There are pros and cons about all network topologies, but in general, it’s a good idea to keep things simple as a starting point.Choosing a network topology really depends upon the performance you desire, the price you are willing to pay, and perhaps secondarily, the simplicity of the topology and the ability to upgrade the system. In general the best starting point for the network topology is called a “fat-tree” network topology. This type of network, if constructed properly, allows for full bisectional bandwidth to all nodes. In addition, current switches allow fairly large networks (several thousand nodes) to be constructed for GigE, InfiniBand, and Myrinet 10G. You can use several tiers, or layers, of switches to construct large networks. The only gotcha is that for TCP networks that have more than one level of switches, you have to enable something called the spanning tree protocol. Spanning tree can have a detrimental impact on performance.But as I mentioned there are many ways to skin the cat for network topologies. One of the most popular approaches is called oversubscription. It deals with designing network topologies so that each node does not get the full bandwidth of the network.OversubscriptionOne important phrase you will see mentioned in network topology design is something called oversubscription (definition link?). In general it’s a difficult term to describe much less compute for general networks. However, for fat-tree designs it is fairly easy to compute oversubscription. What oversubscription expresses is the network bandwidth at various levels in a fat-tree design. If the network bandwidth is not the same going into the network layer as going out of the network layer, then that layer or connection is described as being oversubscribed. Generically it means that the “input” connection of the network layer, typically a client, does not get the full bandwidth of its connection throughout the network topology.I know it’s rather difficult to visualize oversubscription if you’ve never seen it before, but let me give you an example that might improve your understanding. Figure 1 below is network switch with a total of 24-ports (impressive isn’t it?).
The ports can be directly connected to nodes within the cluster or to other switches in the network fabric. We typically put the “inputs” to the switches from the lower layers or clients at the bottom of the switch, and the output or next layer of switches is written at the top of the switch. If there are more than 24 nodes that need to be plugged into the switch, then you can use a larger switch (not a big deal). Alternatively you can use a second tier of switches. If you use a second tier of switches, then network traffic from this first switch must go to the second level (second tier) of switches before going to other switches or clients in the network. Figure 2 shows what this would look like.
An easy way to think about this configuration is that you could have maximum network speed traffic from 12 nodes going into the switch. To maintain the same bandwidth through the switch, you need to have the same amount of network traffic out of the switch. Therefore, you need 12 ports from the switch to connect to the next level of the network. This is generically called 1:1 oversubscription, or basically, you have no oversubscription. We have the same amount of traffic going into the switch as going out of the switch. I also like to think of the term “1:1” meaning “traffic in:traffic out.” So we would write that as 16:16. But using the great powers of common denominators or common factor (my daughter is just finishing fractions in school), we can rewrite this as 1:1. Just remember this is written as “traffic in:traffic out.” I always like to think of Mr. Miyagi yelling at the Karate Kid, “Wax On, Wax Off” to describe oversubscription. But in the case of oversubscription he puts more wax on than he takes off Mr. Miyagi’s car (beware the wrath of an angry Okinawian martial arts master).Now let’s look at oversubscription besides 1:1. Figure 3 below illustrates 2:1 oversubscription.
In this figure we have 16 ports going into the switch and 8 ports going out to the next network level. If you like you can write this as “16:8” but using our super common factor powers, this is reduced to “2:1.” So this network configuration is 2:1 oversubscribed. Continuing with this concept we can also create a 3:1 oversubscribed network. Figure 4 illustrates that concept.
Using our wax-on and wax-off theorem, we write this as 18:6, or 3:1. Remember this means we have 3 times as much traffic going into the switch as we have going out of the switch. One last oversubscription example is in Figure 5 below.
For this example, we have 20 ports going in and 4 ports going out. In other words, we write it as 20:5 or 5:1. I kept the oversubscription example fairly simple.I could have continued on to 11:1 and even 23:1 but that level of oversubscription can get a little silly. However, as I discuss below we sometimes use highly oversubscribed networks for management networks for large clusters since there isn’t much traffic and it’s not necessarily critical to the running applications.One last comment before moving on - it doesn’t make too much sense to have an “undersubscribed network.” This is written something like 0.8:1. The reason that it doesn’t make much sense is that we have more traffic coming out of the switch than going into the switch. In essence, we have too much capability within the switch. This means we are wasting money on network hardware that isn’t connected to any nodes. Moreover, the additional network hardware that is pushing traffic out of the switch is not giving us any more performance because we are limited by how much traffic we can put into the switch.Network ProtocolsA sometimes neglected topic in HPCC networks is the network protocol. While it’s not the most interesting topic for high-speed networks such as Myrinet 10G and Infiniband, it is a very interesting topic for IP-based networks.In a classic GigE network, TCP is the protocol that is run on top of the IP layer network. But GigE in general has a fairly high latency of around 29-100 microseconds compared to high-speed networks such as Myrinet 10G and InfiniBand (the 100 microseconds is not so common any more but more typical of older NICs). Some of this latency can be attributed to the protocol itself. To improve latency, there have been a number of efforts to create a different protocol than TCP to run on IP networks. The main ones that I’m aware of are:
These packages usually come with the needed drivers (if any) as well as an MPI implementation that allows MPI codes to be compiled and use the new protocol on IP networks. GAMMA has achieved a latency of about 9 microseconds on GigE networks while OpenMX has achieved a latency of 6.5 microseconds on 10GigE NICs and about 10-15 microseconds on GigE NICs. It is beyond the scope of this article to discuss the pros and cons of various network protocols, but using an alternative protocol is an option to be considered. But as a disclaimer Dell doesn’t support nor deploy any protocols other than TCP on IP networks.Rules of ThumbOne of the struggles people have in cluster design/layout is the network. To perhaps help get around this struggle, here are some rules of thumb you can use to help.Rule of Thumb #1: Keep it simpleI think this is a good rule for almost all aspects of clusters design including networks. Keeping the network simple allows the cluster to be quickly constructed and debugged (if necessary). It can also help simplify cabling depending upon the configuration. Moreover, while not always true, a simple network can help with upgrades.But, as with all rules, this one sometimes needs to be broken :). In some cases, a more complicated network can be better than a simple one but it comes at a price. A more complicated network can be more difficult to deploy, debug, and administer. But if the performance benefit is greater than the complications, then a more complicated network can be considered as an option. However, these cases are fairly rare.Rule of Thumb #2: For a cluster larger than about 16-32 nodes, use a high-speed networkRemember that with quad-core processors, a 16-node cluster actually has 128 cores and a 32-node cluster actually has 256 cores. This is a large number of cores, particularly for a single job (a single run of an application). Since many applications need either a low-latency network and/or a network with a large bandwidth to scale well, for reasonably sized systems, a high-performance network is needed.But as with Rule #1, this rule too, can be broken. There are applications that either run on a single node or run on just a few nodes. Consequently, a simple GigE network works just fine for these applications for clusters of 16-32 nodes and much larger.However, as a rule of thumb, using a high-speed network such as Myrinet 10G or Infiniband for clusters larger than 16-32 nodes is a good starting point. But it’s only a starting point.Rule of Thumb #3: Design the network for growth (if possible)People always like to talk about upgrading their cluster with faster processors or more memory, and to be honest, it rarely happens. But one upgrade that happens with reasonable regularity is the addition of more nodes. Therefore when you design a cluster network, I recommend planning for the addition of at least 2 nodes and as many as 10% the original size of the cluster.In many cases you can design the network for expansion fairly easily. Central or core switches usually consist of a chassis that has a backplane that connects “line cards” to each other. The line cards have a number of ports on them where you plug the network cables (insert picture here). An easy trick is to then choose a core switch or switches that have at least one open slot for a line card. Then if you need or want to add nodes, you just add a line card and plug in the new nodes.For multi-tiered switches, you can design the network such that you can either add switches at the lower level (tier 1) and plug into the core network or plug them directly into the core switches.When in Doubt…If you’ve made it this far in the article, then I congratulate you. At this point you also may be thinking, “network design is too complicated.” You are correct to a point. For someone fairly new to HPCC designing a network can be daunting. So I wanted to give you some basic guidelines or options for clusters of various sizes. Think of these as a “cheat sheet” for cluster design.1-16 Nodes:For clusters of this size, you can use a single TCP network (GigE or 10GigE) for both the management and the computational networks. Typically you will find a 24-port GigE switch that will work well.If it appears that you need two networks for clusters of this size, then I would still recommend two GigE networks (one GigE for management and a 10GigE for computational). You can use a 24-port switch for each network or perhaps use a larger 48-port switch that handles both networks. Personally, I recommend using two switches since it will allow you to easily debug the network. Regardless, for clusters of this size, I would not recommend using oversubscription for either network because you are not likely to save much money (usually the goal of oversubscription).Since GigE is pretty darn inexpensive at this point, including small unmanaged switches, I would recommend using two networks – a management GigE network and a computational GigE network. Just one word of caution – not all switches are created equal. Look for switches with a good track record (google for information about the switch) and that have a fair amount of buffer memory per port or in total.There is also the possibility that you could use an Infiniband network for the computational network. There are inexpensive SDR IB (Single Data Rate InfiniBand) switches (insert link) that make IB a competitive offering for small clusters. If you do this, then I recommend using a full bandwidth TCP network (GigE or 10GigE) for the management network and a full bandwidth IB network for the computational network.32-128 Nodes:This range of nodes is something of a transition area. You will find many applications that perform well over GigE in this range, but there are also applications that need InfiniBand to scale well. To get good scaling whether it’s TCP or InfiniBand, you will need a dedicated network for computations. So at the very minimum I recommend two networks for this range of nodes – a TCP-based management network, and a computational network. The management network is fairly easy. At this size, you can use a full bisectional network fairly easily with a single GigE switch for 32 nodes and even a single GigE switch for 128 nodes. As an option for node counts larger than 32 you can use three 48-port GigE switches and connect two of them to the third switch via 10GigE uplinks as shown in Figure 6 below.
This configuration could also allow you to connect the head node via 10GigE uplink if the central 48-port switch has enough 10GigE ports as shown in Figure 6. Notice that the left-hand and right-hand switches are 43:10 or 4.3:1 oversubscribed. The central switch is a bit more difficult to define. In my opinion it is 52:20 oversubscribed (2.6:1). The ambiguity comes from the 10GigE links from the left and right hand switches and the master node. The 52 comes from forty GigE nodes and the 10GigE link from the master node. This describes the traffic from nodes going into the switch. You also have two 10GigE links going out of the central switch to the neighboring switches which accounts for the 20. Consequently this is written as 52:20 or 2.6:1.This oversubscription approach can sometimes save you money because you don’t need large switches or as many cables. But it comes at a price of having reduced bandwidth and possibly increased latency. Most of the time this is not a problem for management networks, but if you are using a TCP-based computational network, then I recommend a full bisectional bandwidth network rather than an oversubscribed network. There are large GigE switches that can easily scale beyond 128 nodes that aren’t too expensive. Even in the case of 10GigE, I recommend a full bisectional bandwidth network because of the latency issues (using two tiers of TCP switches means that you have to use spanning tree in the switches, which introduces quite a bit of latency).If you are using Infiniband, then there are 48, 96, and 144 port switches from several vendors. With 144 ports you can easily add nodes to a 128-node cluster at a later date. However, I think one overlooked possibility is to oversubscribe the InfiniBand network.To effectively do oversubscription, you will need two tiers of switches. The smallest DDR IB switch has 24 ports. In the case of 64 nodes and 2:1 oversubscription, you would have four 24-port switches in the first tier of switches: 16 ports from the compute nodes to each switch (sometimes people call this first tier of switches, leaf switches). Each switch also has 8 uplinks to the next level of switches. Then you would need a core switch each with at least 32 ports (4 x 8 = 32). A perhaps easier way to do this is to use two 24-port core switches so that each second-tier switch has four sets of uplinks (one from each switch) where each set is actually 8 uplinks. Figure 7 below illustrates this.
For those of you who know something about IB switches, I’m assuming 4X links for each IB port.The interesting question for the 64-node configuration then becomes, is it cheaper to use 5 switches or a single switch with at least 64 ports? To be honest, I don’t know the answer. The single 64-port option is not oversubscribed (1:1) so it will have better bandwidth to the nodes. But if the single 64-port switch dies, then you lose the entire network whereas in the oversubscribed five 24-port switch configuration, if you lose a first-tier switch, you only lose access to 16 nodes. If you lose a second-tier switch you still have access to all of nodes but at a higher oversubscription level (it becomes 4:1).For 128 nodes, things get a bit more complicated. If we use 24-port switches on the first tier (16 ports to nodes and 8 ports to uplinks), then we will have eight 24-port switches at this level. This also means we will have 64 ports to uplink to the second layer (8 x 8 = 64). So you could do this with a single switch capable of 64 ports, probably a 96-port capable switch which gives you room for growth (adding nodes). Or you can create the second layer of switches with four 24-port switches where each switch has 2 uplinks from each first level switch. So each second-tier switch needs at least 16 ports (2 x 8 = 16 ports). Figure 8 below illustrates this configuration.
I’ve only showed the links from the first-tier switches to only one of the second-tier switches because the clutter from all of the links becomes too busy to understand what the configuration looks like. As you can see it can be a little involved to design IB networks since you have various switch sizes and you have the possibility of oversubscription.129-512 nodes:As clusters get larger the networks don’t necessarily get more complicated. For cluster sizes in this range, you will definitely need a management network that is TCP based. For the computational network you can use either GigE or Infiniband and should be full bisectional bandwidth, but doesn’t have to be (always exceptions to the rule). The most likely case where you will use GigE for clusters of this is when the applications are embarrassingly parallel (i.e. run on a single node), but it could also be for applications that run on only a few nodes (perhaps 4 or less).Currently the largest DDR IB switch is 288 ports. So for clusters larger than this size, you will definitely have to use 2 tiers of switches. Configuring these networks uses the same basic principles as I have mentioned in the previous sections. For example, for 1:1 oversubscription (i.e. no oversubscription), you would have to use 144 ports to the nodes and 144 ports to the next level of switches. Fortunately, many IB switches have the ability to use “fatter” network links between switches (12X links). If you do this, you can reduce the number of inter-switch links by a factor of 3 (12X / 4X = 3).The largest complicating factor for clusters in this range is the storage. For clusters of this size you are probably going to think about some kind of distributed parallel storage. The question then becomes, do you run the storage traffic over the management network or the computational network? If you run the storage traffic over the management network then you should probably make that network full bisectional bandwidth or use 10GigE and possibly oversubscribe the management network. If you run the storage traffic over the computational network, be sure there are no conflicts with the computational traffic. Plus remember that you can always oversubscribe the computational network.Parting Comments As you tell by the length of this article and by the discussion about the options, networking can be a bit involved. You can view this as a problem or, as I like to view it, as one of the features of clusters. So much flexibility means that you can design the network for cost, performance, upgradability, or ease of debugging/management. Return to Introduction to HPCC page