Typically, clustered HPC systems have at least one node that is dedicated to handling logins from individual users of the system. From here, users manipulate their data, submit jobs into the system, and track the status of their running jobs. Login nodes are generally not critical to the continued operation of a cluster. In the event of a Login node failure, the cluster will continue to run jobs, and queued jobs will still be dispatched for execution. I’ll talk more about job scheduling reliability in a later section.

Login nodes are typically the customer-facing components of a cluster. As such, an outage will generally disrupt customers and hurt their perception of the system as a whole. During an outage, customers will no longer be able to submit jobs into the queue.

For these reasons it is recommended that a clustered HPC system have at least two Login nodes with redundant functionality. These nodes can be completely independent of each other, in which case the access methods for both nodes will need to be published to the customer. It is also possible to configure two Login nodes as a high-availability pair. If one node fails, the remaining node inherits the network interface properties of the failed node. In this case only one access method needs to be published to the customer.

It is important to distinguish the Login node from the Head node of an HPC system. Head nodes typically handle cluster administration functions such as compute node provisioning, image management, cluster monitoring, and job scheduling. On many smaller systems, though, the login functionality is combined onto the Head node. Because these functions are critical, a Head node requires redundancy and reliability similar to that of a stand-alone Login node. As your HPC system grows, it is wise to start splitting these administrative functions out to independent nodes. For instance, let’s consider a Head node with both job scheduling and cluster provisioning capabilities.
If for some reason the node needs to be rebooted because the job scheduler needs a new kernel revision, you won’t necessarily want to impact provisioning (especially if your compute nodes boot from this node!). Thus it is wise to consider splitting each function out to its own node.

NEXT UP… COMPUTE NODES

-- Blake Gonzales

See other posts from this Blog series:
INTRO
SMP
CLUSTERED SYSTEMS
CLUSTERED SYSTEM INFRASTRUCTURE
POWER DISTRIBUTION
COOLING
MEMORY
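
As a postscript on the high-availability Login pair described above, here is a minimal Python sketch of the takeover idea. It is only an in-memory model, not a real failover implementation: the node names, the 192.0.2.x documentation addresses, and the helpers `fail_over` and `reachable` are all hypothetical. An actual pair would rely on cluster-management software to move the address between network interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class LoginNode:
    """Toy model of a Login node (hypothetical, for illustration only)."""
    name: str
    addresses: set = field(default_factory=set)  # addresses this node answers on
    healthy: bool = True


def fail_over(failed: LoginNode, survivor: LoginNode) -> None:
    """Mark one node as failed and hand all of its addresses to the survivor."""
    failed.healthy = False
    survivor.addresses |= failed.addresses
    failed.addresses = set()


def reachable(address: str, nodes: list):
    """Return the name of the healthy node answering on `address`, or None."""
    for node in nodes:
        if node.healthy and address in node.addresses:
            return node.name
    return None


# Assumed setup: 192.0.2.10 is the single address published to customers.
login1 = LoginNode("login1", {"192.0.2.10"})
login2 = LoginNode("login2", {"192.0.2.11"})
pair = [login1, login2]

before = reachable("192.0.2.10", pair)  # answered by login1
fail_over(login1, login2)               # login1 fails, login2 takes over
after = reachable("192.0.2.10", pair)   # same address, now answered by login2

print(f"before failure: {before}, after failure: {after}")
```

The point of the model is that the published address never changes, so customers keep using one access method regardless of which node is serving it.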