Scott Collier

As an HPC administrator, I find there are several occasions when I want to go in and make sure everything on my cluster is at the same version. This includes drivers, firmware, software, amount of memory, etc. … There could be a few different reasons why your cluster is out of sync:

  • You are new to the company and you are inheriting this cluster
  • There is more than one systems administrator for this cluster, and you don't know what the others have done
  • You were troubleshooting some problems on the cluster at 2:15 AM on a Saturday morning and didn't get around to documenting what you did - I NEVER suffer from this ;)
  • You are installing a new cluster and would like to baseline it
  • You need to capture information for Dell support
  • You have replaced a few motherboards and need to make sure everything is the same


Regardless of the reason, when things are not consistent on your cluster you could encounter "anomalies" when running jobs, things that are broken now that weren't before. Part of administering a cluster is making sure everything is at the same version, and that can be difficult without the right tools.

Today I'll talk about some of the tools I use to make sure my clusters are consistent.

Cluster Overview:


Compute Nodes: 64 R410s
Frontend Node: PER710
IB HCAs: Mellanox ConnectX-{1,2}
Cluster Middleware: Platform Computing PCM 1.2a


Items I want to ensure are consistent (in no particular order):

Firmware, Drivers and Hardware:


  1. BIOS
  2. BMC
  3. Broadcom
  4. IB HCA
  5. Amount of Memory
  6. Amount of CPU Cores
  7. Etc. ...


So, how do we check all these components to make sure they are the same version? We can use pdsh, native Linux tools and OMSA. You could use DSET to pull full verbose information from each node and then parse each .zip file manually, but I'm just looking for quick snapshots of cluster firmware and software versions. Here are some examples of how I use pdsh and OMSA to check each of the above items. In each one, pdsh -a runs the command on all nodes and dshbak -c groups the nodes that returned identical output, so any node that differs stands out. You have to run pdsh from the frontend node.

1. BIOS

# pdsh -a "dmidecode | grep -A2 -m 1 'BIOS Information'" | dshbak -c

Or

# pdsh -a "omreport chassis bios | grep -i version" | dshbak -c


2. BMC

# pdsh -a "omreport chassis firmware | grep Version" | dshbak -c


3. Network Driver

# pdsh -a "dmesg | grep -Ei -m 1 'Broadcom NetXtreme II'" | dshbak -c


4. IB HCA Firmware and Speeds

For this one, I have two types of HCAs in my cluster. One is Mellanox ConnectX-1 and the other is ConnectX-2. I need to make sure the firmware is the same for both cards.

To check the firmware version for all of your cards, regardless of type -

# pdsh -a "ibstat | grep -i firmware" | dshbak -c

To check the speed of your cards, regardless of type -

# pdsh -a "ibstat | egrep -i 'rate'" | dshbak -c

To check which cards are ConnectX-1 and ConnectX-2 -

# pdsh -a "ibstat | egrep -i 'Hardware Version'" | dshbak -c
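
If what matters is that the firmware matches within each card type, it can help to pair the two values instead of reading them from separate runs. The following is a minimal sketch, not a tested production tool. It assumes ibstat prints one "Firmware version:" line and one "Hardware version:" line per card, and fw_by_hw is a hypothetical helper name introduced only for this example.

```shell
#!/bin/sh
# fw_by_hw (hypothetical helper): reads pdsh-prefixed ibstat lines on stdin
# and prints how many cards were seen for each hardware/firmware combination.
# Assumption: each card reports one "Firmware version:" line and one
# "Hardware version:" line, and the version is the last field on the line.
fw_by_hw() {
    awk '
        {
            host = $1; sub(/:$/, "", host)   # pdsh prefixes each line with "host:"
            line = tolower($0)
            if (line ~ /firmware version:/) fw[host] = $NF
            if (line ~ /hardware version:/) hw[host] = $NF
            # Once both values for a card are known, count the pair and reset.
            if ((host in fw) && (host in hw)) {
                pairs[hw[host] " " fw[host]]++
                delete fw[host]; delete hw[host]
            }
        }
        END {
            for (p in pairs) {
                split(p, v, " ")
                printf "hardware %s firmware %s cards %d\n", v[1], v[2], pairs[p]
            }
        }
    ' | sort
}
```

From the frontend node it could be fed like this:

# pdsh -a "ibstat | egrep -i 'firmware|hardware version'" | fw_by_hw

More than one firmware line for the same hardware version means that card type is not consistent across the cluster.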


5. Amount of Memory (sometimes if you have a faulty DIMM, the server will lower the amount of memory it presents to the OS)

# pdsh -a "free -m | grep Mem: " | awk '{ print $1 "\t" $3 }' | dshbak -c
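
On a large cluster it can be quicker to list only the nodes that disagree with the majority. Here is a rough sketch of that idea, assuming the raw pdsh output has the form "host: Mem: total used free ..." so that the total is the third field. mem_outliers is a hypothetical name, and treating the most common total as the expected value is my assumption, not something the tools decide for you.

```shell
#!/bin/sh
# mem_outliers (hypothetical helper): reads the raw output of
#   pdsh -a "free -m | grep Mem:"
# on stdin and prints every node whose total memory differs from the
# most common total across the cluster.
mem_outliers() {
    awk '
        {
            host = $1; sub(/:$/, "", host)   # strip the colon pdsh appends
            total[host] = $3                 # field 3 is the total in MB
            count[$3]++
        }
        END {
            # Assumption: the most frequent total is the expected one.
            best = ""
            for (t in count)
                if (best == "" || count[t] > count[best]) best = t
            for (h in total)
                if (total[h] != best)
                    print h, total[h] " MB (expected " best " MB)"
        }
    ' | sort
}
```

Usage from the frontend node:

# pdsh -a "free -m | grep Mem:" | mem_outliers

No output means every node reported the same total.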


6. Amount of CPU Cores (I'll use a real-world example here)

# pdsh -a -x compute-00-14 " cat /proc/cpuinfo | grep processor" | dshbak -c

While I was testing this on my cluster, I realized that some nodes had Logical Processing turned on and some didn't. Here's the output:

# pdsh -a " cat /proc/cpuinfo | grep processor" | dshbak -c
----------------
compute-00-00-eth0,compute-00-01-eth0,compute-00-02-eth0,compute-00-03-eth0,compute-00-04-eth0,compute-00-06-eth0,compute-00-07-eth0,compute-00-08-eth0,compute-00-09-eth0,compute-00-10-eth0,compute-00-11-eth0,compute-00-12-eth0,compute-00-13-eth0,compute-00-15-eth0,compute-00-16-eth0,compute-00-17-eth0,compute-00-18-eth0,compute-00-19-eth0,compute-00-20-eth0,compute-00-21-eth0,compute-00-22-eth0,compute-00-23-eth0,compute-00-24-eth0,compute-00-25-eth0,compute-00-26-eth0,compute-00-27-eth0,compute-00-28-eth0,compute-00-29-eth0,compute-00-30-eth0,compute-00-31-eth0,compute-00-35-eth0,compute-00-36-eth0,compute-00-38-eth0,compute-00-47-eth0,compute-00-51-eth0,compute-00-53-eth0,compute-00-54-eth0,compute-00-59-eth0,compute-00-63-eth0
----------------
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
----------------
compute-00-32-eth0,compute-00-33-eth0,compute-00-34-eth0,compute-00-39-eth0,compute-00-40-eth0,compute-00-41-eth0,compute-00-42-eth0,compute-00-43-eth0,compute-00-44-eth0,compute-00-45-eth0,compute-00-46-eth0,compute-00-48-eth0,compute-00-50-eth0,compute-00-52-eth0,compute-00-55-eth0,compute-00-56-eth0,compute-00-57-eth0,compute-00-58-eth0,compute-00-60-eth0,compute-00-61-eth0,compute-00-62-eth0,compute-00-64-eth0,lodmds-00-00-eth0,lodmds-00-01-eth0,lodoss-00-00-eth0,lodoss-00-01-eth0
----------------
processor : 0
processor : 1
processor : 2
processor : 3
----------------
compute-00-05-eth0,compute-00-37-eth0,compute-00-49-eth0
----------------
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
processor : 8
processor : 9
processor : 10
processor : 11
processor : 12
processor : 13
processor : 14
processor : 15
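
The listing above gets long because every processor line is printed for every group. A shorter variant is to have each node report a single number. The sketch below shows one way to do that; count_procs and smt_state are hypothetical helper names, and smt_state assumes an x86 /proc/cpuinfo that contains "siblings" and "cpu cores" lines.

```shell
#!/bin/sh
# count_procs (hypothetical helper): number of logical processors listed in
# a cpuinfo-style file (defaults to /proc/cpuinfo).
count_procs() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# smt_state (hypothetical helper): prints "on" when a package exposes more
# logical processors (siblings) than physical cores, "off" when it does not,
# and "unknown" when either line is missing from the file.
smt_state() {
    awk -F': *' '
        /^siblings/  { s = $2 }
        /^cpu cores/ { c = $2 }
        END {
            if (s == "" || c == "") print "unknown"
            else if (s + 0 > c + 0) print "on"
            else print "off"
        }
    ' "${1:-/proc/cpuinfo}"
}
```

The same count can be collected across the cluster with one line per group of nodes:

# pdsh -a "grep -c '^processor' /proc/cpuinfo" | dshbak -c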


Now I can use syscfg and the instructions from this blog post to make changes and bring my cluster back to a consistent state:

http://www.delltechcenter.com/page/Changing+BIOS+settings+with+syscfg+from+the+DTK

As you can see, there are many items you can check to make sure your cluster is consistent. These simple tools should make the process a bit easier.
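
If you want to baseline a cluster, one option is to save the raw pdsh output of any of the checks above to a file and compare it against a later run. This is only a sketch under that assumption: snapshot_diff is a hypothetical helper name, and the file names in the usage lines are placeholders.

```shell
#!/bin/sh
# snapshot_diff (hypothetical helper): compares two saved files of
# "host: value" lines and prints only the lines that changed.
# Both files are sorted first, so the order in which nodes answered
# does not matter. Returns the exit status of diff (0 means identical).
snapshot_diff() {
    old_sorted=$(mktemp) || return 2
    new_sorted=$(mktemp) || { rm -f "$old_sorted"; return 2; }
    sort "$1" > "$old_sorted"
    sort "$2" > "$new_sorted"
    status=0
    diff "$old_sorted" "$new_sorted" || status=$?
    rm -f "$old_sorted" "$new_sorted"
    return "$status"
}
```

For example, with placeholder file names:

# pdsh -a "omreport chassis bios | grep -i version" > bios.baseline
# pdsh -a "omreport chassis bios | grep -i version" > bios.today
# snapshot_diff bios.baseline bios.today

Lines starting with "<" come from the baseline and lines starting with ">" come from the newer run.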

Another very powerful solution would be to use Intel Cluster Checker (ICC), which is part of the Intel Cluster Ready (ICR) program. It's very flexible and runs well over 100 tests to make sure your cluster is consistent. You can find more information about the Intel Cluster Ready program here:

http://software.intel.com/en-us/cluster-ready/

If you have related tips you'd like to share with the readers of this blog, please post them. I'm curious and interested in what you are doing with your clusters as well to make sure everything is in sync.

-- Scott Collier