Managing tens of thousands of local and remote server nodes in a cluster is always a challenge. To reduce cluster-management overhead and simplify the setup of cluster nodes, administrators seek the convenience of a single snapshot view. Rapid changes in technology make management, tuning, customization, and settings updates an ongoing necessity, one that needs to be performed quickly and easily as infrastructure is refactored and refreshed.

To address some of these challenges, it is important to fully integrate hardware management with the cluster management solution. The integration between server hardware and the cluster management solution detailed in this blog provides an example of some of the best practices achievable today.

Critical to this integration and design is the Integrated Dell Remote Access Controller (iDRAC). Because the iDRAC is embedded in the server motherboard for in-band and out-of-band system management, it can display and modify BIOS settings as well as perform firmware updates through the Lifecycle Controller and the remote console. Each server's in-depth system profile information is gathered using system tools and utilities and made available in a single graphical user interface for ease of administration, reducing the need to physically access the servers themselves.
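For instance, a BIOS setting can be read directly over the management network with the racadm utility that ships with the iDRAC tools. The snippet below is only illustrative: the IP address and credentials are placeholders, and attribute names vary by server generation.

```bash
# Display a BIOS setting out-of-band through the iDRAC, with no OS agent on
# the node. IP address and credentials are placeholders.
racadm -r 192.168.100.10 -u root -p calvin get BIOS.SysProfileSettings.SysProfile
```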

Figure 1. BIOS-level integration between Dell PowerEdge servers and cluster management solution (Bright 7.1)

Figure 1 (above) depicts the configuration setup for a single node in the cluster. The cluster administration fabric can be accessed via the dedicated iDRAC port or shared with the LAN-on-Motherboard capability, and it is configured at deployment time by built-in scripts in the software stack that automate this step. The system profile of the server is captured in an XML-based schema file that is exported from the iDRAC using racadm commands. Relevant data such as optimal system BIOS settings, boot order, console redirection and network configuration are parsed and displayed on the cluster dashboard of the graphical user interface. By reversing this process, it is possible to change and apply BIOS settings to a server, tuning and setting system profiles from the graphical interface. These choices are stored in an updated XML-based schema file on the head node and pushed out to the appropriate nodes during reboots.
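A minimal sketch of that export/import cycle with remote racadm is shown below; the node address, credentials and file name are placeholders, and the exact options and job behaviour depend on the iDRAC and Lifecycle Controller firmware in use.

```bash
# Export the node's system profile (BIOS settings, boot order, NIC and iDRAC
# configuration) from the iDRAC into an XML file on the head node.
racadm -r 192.168.100.10 -u root -p calvin get -t xml -f node001-profile.xml

# Edit the XML (or regenerate it from the template stored on the head node),
# then push it back; changes that need it are staged as a configuration job
# and take effect when the node reboots.
racadm -r 192.168.100.10 -u root -p calvin set -t xml -f node001-profile.xml
```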

Figure 2. Snapshot of the cluster node configuration via the cluster management solution.

Figure 2 is a screenshot showing the BIOS version and system profile information for a number of Dell PowerEdge servers of the same model. This overview is particularly useful because inappropriate settings and versions can be identified quickly and easily.

A typical use case is when new servers are added to, or replaced in, a cluster. The integration described above helps ensure that all servers have homogeneous performance, BIOS versions, firmware, system profiles and other tuning configurations.
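A quick consistency check of this kind can also be scripted directly against the iDRACs. The short loop below is only a sketch, with hypothetical node names and credentials; it filters each node's getsysinfo output for BIOS information so that mismatched servers stand out.

```bash
#!/bin/bash
# Report the BIOS information from each node's iDRAC so that a newly added or
# replaced server with a mismatched version is easy to spot.
# The iDRAC hostnames and credentials are placeholders.
for idrac in node00{1..4}-idrac; do
    echo "=== ${idrac} ==="
    racadm -r "${idrac}" -u root -p calvin getsysinfo | grep -i bios
done
```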

This integration is also helpful for users who need custom settings, rather than the defaults, applied to their servers. For example, latency-sensitive codes may require a custom profile with C-states disabled. Such servers can be categorized into a node group, with specific BIOS parameters applied to that group.
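As a hedged sketch of what such a group-specific change might look like at the racadm level, C-states can be disabled on a node and the change scheduled for the next reboot. The BIOS attribute name is typical of recent PowerEdge generations but should be verified on the target hardware, and the node name and credentials are placeholders.

```bash
# Disable processor C-states on a node belonging to a latency-sensitive group.
# Verify the attribute name first, e.g. "racadm get BIOS.SysProfileSettings".
racadm -r node003-idrac -u root -p calvin set BIOS.SysProfileSettings.ProcCStates Disabled

# Stage the pending BIOS change as a configuration job; it is applied the
# next time the node reboots (for example during re-provisioning).
racadm -r node003-idrac -u root -p calvin jobqueue create BIOS.Setup.1-1
```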

This tightly coupled BIOS-level integration significantly enhances HPC cluster maintenance by providing a single snapshot view for simplified updates and tuning. As a solution validated and tested on the given hardware, it enables seamless operation and administration of clusters at scale.

References:

  1. http://www.brightcomputing.com/Bright-Cluster-Manager
  2. http://en.community.dell.com/techcenter/systems-management/w/wiki/3204.dell-remote-access-controller-drac-idrac
  3. http://www.brightcomputing.com/Linux-Cluster-Architecture
  4. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2014/09/23/bios-tuning-for-hpc-on-13th-generation-haswell-servers
  5. http://en.community.dell.com/techcenter/high-performance-computing/b/general_hpc/archive/2015/04/29/linpack-benchmarking-on-a-4-nodes-cluster-with-intel-xeon-phi-7120p-coprocessors