Most clustered HPC systems use job scheduling algorithms to maximize the efficiency and utilization of the system. Unlike shared memory systems, where all jobs are generally resident in memory and running simultaneously, clustered systems usually allocate one or more nodes for the exclusive use of a particular job. The job scheduler maintains a mapping of resources in use along with their assigned jobs. It also keeps track of free resources that can be allocated and of jobs waiting in the queue. Together, these resource and job mappings at any moment in time constitute the “job scheduling state.”
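As a rough sketch of what this state comprises, the following Python fragment (with hypothetical names; real schedulers such as Slurm or PBS keep far richer records) models exclusive node allocation, the in-use mapping, and the wait queue:

```python
from dataclasses import dataclass, field
from enum import Enum

class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"

@dataclass
class Job:
    job_id: str
    nodes_requested: int
    state: JobState = JobState.QUEUED
    assigned_nodes: list = field(default_factory=list)

@dataclass
class SchedulerState:
    """The 'job scheduling state': node-to-job mappings plus the queue."""
    free_nodes: set
    allocations: dict = field(default_factory=dict)  # node name -> job_id
    queue: list = field(default_factory=list)        # Jobs awaiting dispatch

    def dispatch(self, job: Job) -> bool:
        """Allocate nodes exclusively to the job, or queue it if short."""
        if len(self.free_nodes) < job.nodes_requested:
            self.queue.append(job)
            return False
        for _ in range(job.nodes_requested):
            node = self.free_nodes.pop()
            self.allocations[node] = job.job_id
            job.assigned_nodes.append(node)
        job.state = JobState.RUNNING
        return True
```

Because every dispatch mutates this state, losing it means losing the only record of which nodes belong to which jobs.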

It is important that the system be able to maintain the job scheduling state at all times, even in the event of multiple failures in other hardware or software subsystems in the cluster (we discuss how to do this later in this section). We would not want a node, memory, or power failure in one part of the system to wipe out the job scheduling state. If the state were corrupted or lost, the scheduler could no longer determine which resources are allocated to running jobs, nor could it keep new jobs from claiming resources already in use. Jobs waiting in the queue would have to be resubmitted, and jobs in the run state would not necessarily be able to complete successfully. Maintaining the job scheduling state is therefore critical for system continuity in a failure, and the mechanism that maintains it must be fault tolerant.

It is also important for the job scheduler to detect failures in the system and mark those resources as unavailable. For instance, if a node were to fail for some reason, the job scheduler would need to detect the failure. Otherwise, the scheduler could dispatch a job for execution while assigning it resources that are no longer available, which would most likely cause the job execution to fail.
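One common approach to this kind of detection is a heartbeat: each node periodically reports in, and any node whose last report is older than some threshold is marked offline. A minimal sketch, assuming a simple timestamp map and an illustrative 30-second timeout:

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; an illustrative threshold

def detect_failed_nodes(last_heartbeat, now=None):
    """Return the set of nodes whose heartbeat has gone stale.

    last_heartbeat: dict mapping node name -> timestamp of its last
    heartbeat. Nodes returned here should be marked unavailable so the
    scheduler never assigns them to a new job.
    """
    if now is None:
        now = time.time()
    return {node for node, ts in last_heartbeat.items()
            if now - ts > HEARTBEAT_TIMEOUT}
```

Real schedulers layer retries and grace periods on top of this so that a transient network hiccup does not immediately drain a healthy node.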

Most commercial job schedulers allow the use of multiple daemons in a failover scenario. One of the daemons will generally take on the primary responsibility for scheduling, while the remaining daemons take on a secondary, or failover, role. These daemons should run on multiple (n) mutually exclusive nodes in the system. In this scenario, even if n-1 of those nodes were to fail, we would still have at least one daemon left that is able to maintain the job scheduling state and dispatch jobs for execution.
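One simple way to decide which daemon is primary (a sketch, not how any particular commercial scheduler implements it) is an exclusive lock on a file that lives on storage all candidate nodes can reach: whichever daemon grabs the lock is primary, and the rest stay in standby and retry:

```python
import fcntl
import os

def try_acquire_primary(lock_path):
    """Attempt to become the primary scheduler daemon.

    Each candidate daemon tries to take an exclusive, non-blocking
    flock() on a file on shared storage. Exactly one succeeds and acts
    as primary; the others get None back and remain in standby,
    retrying periodically. The returned file handle must be kept open
    to hold the lock; if the primary crashes, the OS releases the lock
    and a standby daemon can take over.
    """
    fh = open(lock_path, "w")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        fh.write(str(os.getpid()))
        fh.flush()
        return fh
    except BlockingIOError:
        fh.close()
        return None
```

Note that file locks over network file systems have their own failure modes, which is one reason production schedulers often use dedicated heartbeat channels between primary and backup daemons instead.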

As a last point, we discuss the medium in which the job scheduling state of the system should be kept. Obviously, we need a medium that is non-volatile, so that the data will not be lost in a power failure. Additionally, we require a medium that can be shared among the multiple job scheduling daemons running in primary and secondary roles as described above. Based on these requirements, we will generally want to keep the job scheduling state database on a highly available file system that can be shared among nodes. This could be a clustered file system (such as Lustre or GPFS), or a network attached storage device. We will discuss the specifics of fault tolerant clustered file systems in the next section.
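Whatever shared medium is chosen, the state must be written so that a crash mid-write can never leave a corrupt file. A standard technique (sketched here in Python with JSON standing in for whatever format a real scheduler uses) is to write to a temporary file, flush it to disk, then rename it over the old checkpoint, since rename is atomic on POSIX file systems:

```python
import json
import os

def checkpoint_state(state, path):
    """Atomically persist the scheduling state to a shared file system.

    Readers see either the previous checkpoint or the new one in its
    entirety, never a partially written file.
    """
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())   # force the data to stable storage
    os.replace(tmp, path)       # atomic rename on POSIX file systems

def load_state(path):
    """Recover the last successfully written checkpoint."""
    with open(path) as fh:
        return json.load(fh)
```

A secondary daemon taking over after a failure would call load_state() on the shared file to reconstruct the resource and job mappings before dispatching anything new.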


-- Blake Gonzales
