High Performance Computing (HPC) at Dell
High Performance Computing Clusters (HPCC) are a very popular type of supercomputer, used for solving large problems through parallelization. Dell offers Linux-based HPCC solutions. Deploying HPC is a comprehensive process, and this document provides a simplified method for installing or updating drivers on a cluster with heterogeneous platforms. In the proposed methodology, the Dell Lifecycle Controller provides the recommended drivers while the Linux-based HPCC image is being provisioned.

The Dell Lifecycle Controller (LCC) is an onboard systems management device that is part of iDRAC Express and above on 11th generation Dell servers. It includes 1 GB of managed, persistent storage that embeds systems management features such as driver updates, firmware updates, and RAID configuration, in addition to the iDRAC features. The LCC contains a driver repository for supported operating systems such as Windows, Red Hat Linux, and SUSE Linux. Because of the integrated persistent storage, the LCC eliminates the need for media-based tools, which further enhances remote systems management. Users can upgrade to iDRAC Enterprise and vFlash for advanced iDRAC features.

In a typical HPCC offering from Dell, the recommended drivers for each supported server are included in the HPC software package and installed during the initial deployment. However, there are situations where device drivers need to be updated during the lifetime of the cluster. For example, a self-deployed HPC cluster needs consistent drivers across all servers, or newer driver versions that provide critical bug fixes or enhanced performance. To obtain similar performance on all compute nodes, the driver and firmware versions should be identical.
By leveraging the LCC features, the administrator can automate all device driver updates with a one-time configuration of the LCC. This makes it easy for any enterprise to install the latest firmware or updated drivers available from http://support.dell.com. The manual process of individually monitoring updated firmware and drivers and then deploying them is avoided, because the process is now automated through the LCC.
Administrator can access LCC
In order to utilize the LCC remotely, the following tools are available:
This blog provides instructions for remotely enabling the LCC using OpenWSMAN in a Linux environment. An HPC architecture is based on head nodes and compute nodes. Using LCC features in an HPC deployment requires a one-to-many implementation (exposing the LCC remotely). Install OpenWSMAN on the master node, and use this utility to access the LCC remotely with the iDRAC IP address and logon credentials.

[Figure: schematic representation of a basic HPC cluster]

Prerequisites:
The anaconda installer is LCC aware, and the driver package for the respective operating system is available.
Servers have iDRAC Express or above.
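The remote access described above (a wsman call from the master node against an iDRAC address with logon credentials) might look roughly like the following. This is a dry-run sketch that only prints the command instead of running it. The function name lcc_command, the address, the credentials, and the certificate file name are hypothetical, and the resource URI and selectors are assumed from Dell's published DCIM_OSDeploymentService examples, so verify them against the Lifecycle Controller documentation for your firmware.

```shell
#!/bin/sh
# Build (but do not run) a wsman invoke command for one iDRAC.
# ASSUMPTION: resource URI and selectors as used in Dell's published
# examples for DCIM_OSDeploymentService; check them for your LCC version.
CLASS_URI='http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/root/dcim/DCIM_OSDeploymentService'
SELECTORS='CreationClassName=DCIM_OSDeploymentService,Name=DCIM:OSDeploymentService,SystemCreationClassName=DCIM_ComputerSystem,SystemName=DCIM:ComputerSystem'

lcc_command() {
    # $1 = method name, $2 = iDRAC IP, $3 = user, $4 = password, $5 = certificate file
    # -V and -v skip peer and host verification, which is common with
    # self-signed iDRAC certificates but weakens security.
    printf 'wsman invoke -a %s "%s?%s" -h %s -P 443 -u %s -p %s -c %s -y basic -V -v\n' \
        "$1" "$CLASS_URI" "$SELECTORS" "$2" "$3" "$4" "$5"
}

# Hypothetical address and credentials, for illustration only.
lcc_command GetDriverPackInfo 192.168.10.2 root changeme idrac.cert
```

Printing the command first makes it easy to review what would be sent to each iDRAC before looping over the node database.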
Steps:
The following steps leverage Dell Lifecycle Controller features for a new HPC cluster deployment.
1. Install the master node.
2. Make the appropriate network connections to connect all the nodes in the cluster. Confirm that AC power is on for all nodes and that the servers are not powered on.
3. Confirm that the master node has received DHCP requests from the iDRAC of every member node. Create a database of the assigned IPs.
4. Once the IP address database is updated on the head node, the remote racadm utility detects the system type and name. A template of the database is given here:
   IP address    Hostname     Server Model
   192.168.2.1   compute000   R710
5. Use appropriate logic to detect the first system of each type. For example, in a cluster with 20 x R610, 20 x R710, and 20 x R715, the logic should pull out the first R610, the first R710, and the first R715. The openwsman tool then needs to expose the LCC of only one server of each type, instead of all 20, to copy the required drivers to the master node. Exposing one LCC per server type saves time, network traffic, and maintenance effort.
6. Use the openwsman tool to expose the LCC of each system identified in the previous step.
7. Using the openwsman tool, unpack the driver pack and copy the RPMs to the NFS contrib folder on the master node.
   # wsman invoke -a GetDriverPackInfo   [This command lists the supported OS information]
   # wsman invoke -a UnpackAndAttach   [This command unpacks the drivers and places them on the OEMDRV device; the LCC exposes the driver pack as an OEMDRV device]
8. Now add the compute nodes using the appropriate method. For Rocks+, use the insert-ethers command, and use a post-install script to push the Dell drivers to the respective servers.
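Step 5 leaves the selection logic open. A minimal sketch of one option, assuming the database from step 4 is a whitespace-separated text file with one header line and the server model in the third column; the function name first_per_model and the sample node entries are hypothetical:

```shell
#!/bin/sh
# Print the first node of each server model found in a node database.
# ASSUMED layout: one header line, then "IP hostname model" per node.
first_per_model() {
    # NR > 1 skips the header; seen[$3]++ is zero only the first time
    # a model appears, so each model is printed exactly once.
    awk 'NR > 1 && NF >= 3 && !seen[$3]++ { print $1, $2, $3 }' "$1"
}

# Demonstration with a temporary, made-up database.
db=$(mktemp)
cat > "$db" <<'EOF'
IP address Hostname Server Model
192.168.2.1 compute000 R710
192.168.2.2 compute001 R710
192.168.2.3 compute002 R610
192.168.2.4 compute003 R715
192.168.2.5 compute004 R610
EOF

first_per_model "$db"
rm -f "$db"
```

For the sample data this prints compute000 (R710), compute002 (R610), and compute003 (R715), one line per model, in the order the models first appear in the file.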
Commands:
1. Create a certificate:
# openssl s_client -connect 192.168.10.2:443
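The s_client command prints the whole TLS handshake, not only the certificate. One way to keep just the certificate is to filter the PEM block out of that output. A minimal sketch, in which the function name extract_pem and the output file name idrac.cert are hypothetical:

```shell
#!/bin/sh
# Keep only the PEM certificate block(s) from text read on stdin.
extract_pem() {
    sed -n '/-----BEGIN CERTIFICATE-----/,/-----END CERTIFICATE-----/p'
}

# Intended use against a reachable iDRAC (commented out because it
# needs a live endpoint; the address is the one from the command above):
#   openssl s_client -connect 192.168.10.2:443 </dev/null 2>/dev/null | extract_pem > idrac.cert

# Offline demonstration with stand-in handshake text (not a real certificate).
sample='CONNECTED(00000003)
-----BEGIN CERTIFICATE-----
MIIBplaceholder
-----END CERTIFICATE-----
SSL handshake has read 1234 bytes'
printf '%s\n' "$sample" | extract_pem
```

Redirecting stdin from /dev/null in the commented command makes s_client exit after the handshake instead of waiting for input.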
Anil works with the HPC Solutions Engineering Team at the Bangalore Design Center. He has a Bachelor of Engineering degree in Information Technology from the University of Rajasthan, India. He is an RHCE and CCNA and has extensive experience in Linux systems design and deployment. He has been with Dell for the past year. Prior to Dell, Anil worked with John Deere and Locuz Enterprise Solution Ltd.

References
Integrated Dell™ Remote Access Controller 6 (iDRAC6) Version 1.7 User Guide: http://support.dell.com/support/edocs/software/smdrac3/idrac/idrac17mono/en/ug/pdf/ug.pdf
Life Cycle Controller 1.5: http://support.dell.com/support/edocs/software/smusc/smlc/lc_1_5/remoteservices/en/index.htm
DCIM_OSDeploymentService MOF: http://www.delltechcenter.com/page/DCIM.Library.MOF.DCIM_OSDeploymentService
http://support.dell.com/support/edocs/software/smusc/smlc/lc_1_2/en/ug/html/lc_osd.htm
WinRM Scripting API, MSDN: http://msdn.microsoft.com/en-us/library/aa384469(VS.85).aspx
DMTF Common Information Model (CIM) Infrastructure Specification (DSP0004): http://www.dmtf.org/standards/published_documents/DSP0004_2.5.0.pdf
Installation and Configuration for Windows Remote Management: http://msdn.microsoft.com/en-us/library/aa384372(VS.85).aspx
Clustercorp: http://www.rocksclusters.org/roll-documentation/base/5.4/roll-base-usersguide.pdf