Sharing The Load: Cluster Resource Schedulers

July 10th, 2009 11:17 am
Posted by Douglas Eadline

Job Schedulers are the workhorse of cluster computing

One thing virtually every cluster has in common is a batch or job scheduler. The job scheduler sits between the user and the cluster and manages the cluster's resources. All user programs run on the cluster are under the control of the resource scheduler (we will use the word scheduler to refer to the resource manager/scheduler). Users submit a "job" to the scheduler, which in turn adds the job to the work queue of the cluster. The work queue can be thought of as the line at the supermarket. This type of computing is often called "batch mode" processing and is sometimes considered a throwback to the days of the shared mainframe. Because HPC jobs usually run for long periods of time (days or weeks), interactive execution becomes somewhat cumbersome. With a batch scheduler, the user can submit and forget. When the job finishes or stops, the user is often notified by email.
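In practice, submit and forget looks something like the following sketch, using PBS-style commands (the script name is made up, and the exact commands depend on the scheduler installed at your site):

    qsub myjob.sh        # hand the job to the scheduler, which prints a job ID and takes over
    qstat -u $USER       # check on your jobs whenever you like -- or just wait for the email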

The goal of the scheduler is to allow multiple users to submit jobs (programs) to the cluster and then run them on the requested resources -- when the resources become available. That last part about resource availability is what makes the scheduler's job so hard. We'll take a deeper look at this issue in a moment, but first let's look at the cluster from the user's perspective.

The User View

Let's be honest, the cluster user is rather selfish! As a user you want to submit a job or program to the cluster and have it run right away. Even if you are the only user, you still want to use a job scheduler because it will keep track of jobs and make sure you don't over-subscribe nodes (place too many programs on a node). The main problem a cluster user faces is the presence of other users! It is those other users that lead to the most common question facing the cluster administrator: "Why is my job not running?" Thus the user's goal is rather simple -- run my code now and tell me when it is done.

The user can usually interact with the scheduler from the command line, but most users write submit scripts (shell scripts) that allow options to be re-used for subsequent jobs. In particular, when using MPI the scheduler usually requires a number of options in addition to the mpirun command itself. The resource manager also provides a list of nodes that can be used by an MPI starter program. Thus, instead of entering a command line with a large number of options, a simple script can be submitted to the scheduler.
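A typical submit script might look like the following sketch. It assumes Torque/PBS directive syntax and an MPI installation whose mpirun accepts a machine file; the job name, program name, and resource amounts are only placeholders:

    #!/bin/bash
    #PBS -N my_mpi_job             # job name
    #PBS -l nodes=4:ppn=8          # 4 nodes with 8 cores each = 32 MPI processes
    #PBS -l walltime=24:00:00      # maximum execution time
    #PBS -m abe                    # email when the job aborts, begins, or ends

    cd $PBS_O_WORKDIR              # start in the directory the job was submitted from
    mpirun -np 32 -machinefile $PBS_NODEFILE ./my_mpi_program

Once the script is written, submitting the job is simply "qsub" followed by the script name.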

The Scheduler View

The scheduler has a more difficult task. Each job or program has a number of parameters that can make scheduling rather difficult. In a general sense, each job requires resources. Examples of resources include the number of cores, the location of those cores, the amount of memory, the amount of execution time, and so on. The resource can also be a particular node owned by a research group, or a group of nodes connected by a specific network or storage technology. The list of possible resources can be rather large and depends on the cluster. At a minimum, a user will need to request the number of cores they need. In the past, when nodes had at most two cores, users would request nodes; with multi-core, the node request has given way to a number of cores or slots, which usually translates into the number of individual MPI processes the user wants to run. Single (sequential) jobs will require only one core, while multi-threaded programs will need to request a number of cores on a single node. As you can see, the scheduler has quite a lot to consider.
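To make the resource list concrete, here are a few PBS/Torque-style request lines, shown as alternatives rather than a single script (the node property "ib" and the queue name are made-up, site-specific examples):

    #PBS -l nodes=1                # a sequential job: one core on any node
    #PBS -l nodes=1:ppn=8          # a multi-threaded job: 8 cores on the same node
    #PBS -l nodes=2:ppn=8:ib       # 16 MPI slots, only on nodes tagged with an "ib" property
    #PBS -l mem=16gb               # requested memory
    #PBS -q chemgroup              # a queue restricted to one research group's nodes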

Because users share the cluster, the scheduler should at least be fair (unless there is some other policy in place). A naive way to be fair is to use a first-come, first-served policy. While this seems like a good idea, this approach may lead to poor utilization. Consider a cluster with 16 cores. If a user submits a job requesting 12 cores, then there are 4 cores remaining for other users. If someone submits a job requiring 14 cores, this job will have to wait until the 12-core job is finished, leaving the four cores idle. Now suppose that in the meantime a user submits a 4-core job. Obviously, a smart scheduler would notice this and try to "backfill" the idle four cores. Utilization is much better, but what happens if the 4-core job continues to run after the initial 12-core job completes? Now there are 12 cores available, but not enough to run the "next in line" 14-core job. Of course the batch scheduler wants to use the idle cores, but it now runs the risk of continually pushing back the 14-core job.

If you add other resource requirements (e.g., large memory size), the fact that users often omit or incorrectly specify the amount of run time they need, and the constant stream of new job submissions, you can appreciate that fair and efficient cluster resource scheduling is a hard problem. From the user's perspective, it all seems rather simple -- submit and forget. From the scheduler's perspective it is one headache after another! Thankfully the scheduler does not complain all that much.

Another interesting note about resource schedulers is that modern multi-core/multi-processor motherboards run their own scheduler as part of the OS. In particular, the OS must decide where to place a process and balance processes so that all cores are equally busy, while also trying to keep each process on the same core so that data in the processor cache can be reused. This scheduling is independent of the global resource scheduler, but some coordination between the two would be useful.
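On Linux, for example, the affinity of a program can be set or inspected by hand with taskset (a small sketch; the core numbers and process ID are arbitrary):

    taskset -c 0-3 ./my_app        # restrict my_app to cores 0 through 3
    taskset -p 4321                # show which cores process 4321 is allowed to use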

Resource Scheduler Background

Because resource schedulers can be complex and vary from one to another, understanding some concepts may be helpful when evaluating various solutions. In general, a resource scheduler has two major components:

  • The Resource Manager keeps track of node state (load, number of jobs, memory use, physical availability) and manages the jobs it runs on the nodes. Most resource managers use a daemon (resident program) on each node that reports status (both job and node) and starts and stops jobs (see Job Execution Daemon below). Most resource managers use a database to keep track of all resources, submitted requests, and running jobs. In addition, the managed resources need not form a single homogeneous cluster; indeed, there can be multiple clusters or computers under the control of a single resource manager. (A brief command sketch of this node and job tracking follows this list.)
  • The Scheduler does the hard work of determining when and where a job will run. It can be naive or very complicated, and its design determines how efficiently the cluster is utilized. Some resource managers provide a modular interface so that different schedulers can be used, and many schedulers have various policies and modes that can be adapted to a site's workflow requirements. Schedulers track job priority, compute resource availability, estimated and elapsed execution time, the execution time allocated to each user, and the number of simultaneous jobs allowed per user; they can also manage license keys if a job uses licensed software. Finally, schedulers may manage multiple queues. There are two basic designs used for job schedulers:
    • A master/worker model, where the job scheduling software is installed on a single machine (the master), while the production machines (nodes) run only a very small component (a daemon) that waits for commands from the master, executes them, and returns the exit code back to the master.
    • A cooperative model is a decentralized design where each machine is capable of helping with scheduling and can offload locally scheduled jobs to other cooperating machines.
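As a sketch of the resource manager's bookkeeping, and assuming a Torque/Maui installation, the node and job state it tracks can be inspected from the command line:

    pbsnodes -a      # per-node state (free, job-exclusive, down, offline) plus load and memory
    qstat -a         # every job known to the resource manager and its current state
    showq            # the scheduler's view of the queue: running, eligible, and blocked jobs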

Resource Scheduler Terminology

While a full discussion of resource scheduling is beyond this article, some basic concepts can be presented. A good reference is The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters.

Utilization and Turnaround

The goal of any scheduler is to run as many jobs as possible in as short a time as possible while utilizing all resources in a fair way. Depending on your job mix and policies, the utilization rate will vary. Heavily used clusters tend to pay more attention to utilization. Keep in mind that on a large 512-core cluster, increasing the utilization rate by even 10% is like adding more than 50 cores!

Prioritization and Fairness

It is the goal of the scheduler administrator to balance resource consumption amongst competing parties and to implement policies that address a large number of political concerns. Fairness can mean different things to different people. It is important that users understand the "fairness" policy of your site, as it can depend on many factors, the least of which is your place in the queue.

Fair Share Policy

A fair share policy tracks each user's usage over a set period (a week, month, or quarter) and ensures they receive their allocated percentage of computing resources. Tracking historical resource utilization for each user makes it possible to adjust job priority accordingly. In fair share mode, users, groups, or projects that give up time to others (because they do not need the cluster) can later get that time back when contention for the cluster is high.
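One common way to express this idea (a sketch, not any particular scheduler's formula) is to nudge a job's priority up or down according to how far its owner's recent usage falls short of, or exceeds, the assigned share:

    \[ p_{\mathrm{fairshare}}(u) \;=\; w_{\mathrm{fs}} \left( s_u - \frac{U_u}{\sum_v U_v} \right) \]

where s_u is user u's allocated fraction of the machine, U_u is that user's (typically decayed) historical usage, and w_fs is a site-chosen weight that sets how strongly history influences priority.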

Urgency Policy

An urgency policy usually assigns an urgency value to each job. This urgency value can include resource requirements, resource attributes, a deadline weight, and a waiting-time weight. For instance, with this type of policy a job waiting in the queue will increase in urgency and move up in priority. Priorities can also be based on users, projects, or departments; that is, if a project has a deadline, it may be given higher priority.
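As a generic sketch of that weighted sum (the weights and terms here are illustrative, not any particular scheduler's defaults):

    \[ \mathrm{urgency} \;=\; w_r\, r_{\mathrm{request}} \;+\; w_d\, \frac{1}{t_{\mathrm{deadline}} - t_{\mathrm{now}}} \;+\; w_w\, t_{\mathrm{waiting}} \]

The first term reflects the size of the resource request, the second grows as a deadline approaches, and the third grows the longer the job waits in the queue.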

Functional Policy

Functional scheduling is sometimes called priority scheduling. A functional policy ensures that a defined share is guaranteed to each user, project, job, or department at any time, because shares or tickets are assigned to the user. Unlike a fair share policy, the scheduler does not remember past use. If the cluster is underutilized, jobs run as they enter the queue. If the cluster is full, then users, groups, or projects are guaranteed their functional fraction of the cluster.

Reservation

Resource reservation allows the construction of a flexible, policy-based batch scheduling environment. The user must specify the needed resources (nodes, cores, etc.) and the execution time.

Job Execution Daemon

Virtually all resource schedulers rely on a remote program to manage the jobs sent by the master resource manager to the cluster nodes. These programs are referred to as "daemons" because they run in the background. The daemon is responsible for starting remote programs, monitoring load and node health, and stopping programs (if necessary). Note that the job execution daemon usually has tighter control over the jobs it starts than over remotely started jobs (those started with rsh or ssh). Many MPI starter programs (i.e., mpirun) use rsh or ssh to start remote processes; however, some MPI versions will now work directly with the resource manager.
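The difference is easy to see from inside a job, assuming a Torque environment (program names are made up):

    # ssh/rsh-style start-up: mpirun reads the node list the resource manager wrote for the job,
    # but the remote processes run outside the execution daemon's direct control
    mpirun -np $(wc -l < $PBS_NODEFILE) -machinefile $PBS_NODEFILE ./my_app

    # an MPI built against the resource manager (e.g., Open MPI with Torque's TM interface)
    # needs no machine file; remote processes are started by the execution daemon itself
    mpirun ./my_app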

Backfill Scheduling

Backfill is a key component of many schedulers. It allows for periodic analysis of the job queue and execution of lower-priority jobs if it is determined that running them will not delay jobs higher in the queue. For instance, single-core jobs can be sprinkled almost anywhere there are open cores. Most small jobs (using a small number of cores) benefit from this feature, and it improves utilization.

Robust Availability

Because the user is assigned cores or nodes at the time their job runs, they have no control over, or knowledge of, the actual hardware that is used (unless they request a specialized node or nodes). This provides a layer of abstraction that protects the user from having to manage hardware failures. Of course, if a job is running and a node crashes, the job will stop, but other jobs not using that node will continue to run. This feature also allows the scheduler or administrator to take nodes out of the active queue for maintenance or repair without the users ever knowing.

Failed Job Restart

Many schedulers have the capability to restart a failed job. This can be helpful if hardware fails during a run, as the job will be restarted. This feature can be dangerous, however, because jobs that fail due to a programmer error will also restart and can thus create a cycle of continuous job failure and re-submission. In addition, failed job restart can also be a problem if the partially completed run has already made changes to permanent files or other resources, such as a database.

SMP Aware Queuing

The advent of multi-core has made SMP (Symmetric Multi-Processing) scheduling an important issue. In addition, controlling processor/core placement is desirable for efficiency with today's multi-core designs, and there is some question as to where this control should live. Currently most MPI packages can provide some control, but ultimately the scheduler should be in charge of cores and processor affinity. This level of scheduling will require coordination between the scheduler and the MPI starter program. Applications may request multiple cores for multiple processes spread over multiple nodes, or a number of cores on one node for a single multi-threaded process.
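For the single multi-threaded process case, a sketch using PBS-style syntax and OpenMP (the core count is arbitrary):

    #PBS -l nodes=1:ppn=8          # all eight slots on one node for one multi-threaded program
    export OMP_NUM_THREADS=8       # match the thread count to the cores that were granted
    ./my_threaded_app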

Popular Open Source Resource Schedulers

The following is a listing of the popular "open" resource schedulers. These packages are freely available, but check the license terms for each package before you deploy it on your cluster. In general, the open packages work quite well for the average cluster installation. It should be noted that these are not cut-down or miniature versions of commercial products but real production-level packages. As such, configuration and administration are often not trivial. All of the packages have excellent documentation and support communities, so it is possible to learn your way to success. Please see the individual package web sites for more information. The following summaries are based on the package descriptions.

  • Platform Lava is an open source entry-level workload scheduler designed to meet a wide range of workload scheduling needs for clusters of up to 512 nodes. Scalability is important for a workload manager in several dimensions, including the number of physical hosts in the system, the number of CPUs managed, the number of users with work in the system, and the number of jobs running and pending. Lava was designed not to lose a job once it has been submitted to the system, and it will reliably continue operation even under very heavy load without losing any pending or active jobs. In addition, as long as one host in the Lava cluster is operational, the system will continue to run. Lava supports multiple candidate master hosts and fails back seamlessly when the master has recovered. This gives Lava tremendous fault tolerance compared to a traditional master-slave setup.
  • Sun Grid Engine (SGE) is a Distributed Resource Management (DRM) software system (i.e., a resource scheduler). Previously known as CODINE (COmputing in DIstributed Networked Environments) or GRD (Global Resource Director), SGE aggregates compute power and delivers it as a network service. Grid Engine software is used with a set of computers to create powerful compute farms and clusters, which are used in a wide range of technical computing applications. SGE presents users with a seamless, integrated computing capability where users can start interactive jobs, batch jobs, and sets of repetitious jobs (parametric processing). SGE supports a wide variety of scheduling policies and is almost identical to the commercial version offered by Sun Microsystems.
  • SLURM (Simple Linux Utility for Resource Management) is an open-source resource manager designed for Linux clusters of all sizes. It provides three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
    Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
  • Condor is a high-throughput computing software framework for coarse-grained distributed parallelization of computationally intensive tasks. It can be used to manage workloads on a dedicated cluster of computers, and/or to farm out work to idle desktop computers -- so-called cycle scavenging. Condor runs on Linux, Unix, Mac OS X, FreeBSD, and contemporary Windows operating systems. Condor can seamlessly integrate both dedicated resources (rack-mounted clusters) and non-dedicated desktop machines (cycle scavenging) into one computing environment.
  • Torque is an open source resource manager providing control over batch jobs and distributed compute nodes. It is a community effort based on the original PBS project (Portable Batch System) and, with more than 1,200 patches, has incorporated significant advances in the areas of scalability, fault tolerance, and feature extensions contributed by NCSA, OSC, USC, the U.S. Dept of Energy, Sandia, PNNL, U of Buffalo, TeraGrid, and many other leading edge HPC organizations. This version may be freely modified and redistributed subject to the constraints of the included license. Torque is designed to work with the Maui Scheduler (below).
  • Maui Cluster Scheduler is an open source job scheduler for clusters and supercomputers. It is an optimized, configurable tool capable of supporting an array of scheduling policies, dynamic priorities, extensive reservations, and fairshare capabilities. It is currently in use at hundreds of government, academic, and commercial sites throughout the world. Maui must be used with a resource manager such as Torque (above); a brief sketch of the two working together follows this list. Note that this Maui is not associated with the Maui Scheduler Molokini Edition, which was developed as a project on SourceForge independent of the original Maui scheduler.
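As a brief sketch of Torque and Maui working together (assuming both are installed; the script name and job ID are made up):

    qsub run.sh          # Torque accepts the job and prints an ID such as 4321.headnode
    checkjob 4321        # Maui reports the job's state and, if it is idle, why it is not running
    qdel 4321            # remove the job if you change your mind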

Commercial Resource Schedulers

While the freely available resource schedulers are quite robust, there are many users who require commercial support and advanced features not found in the open versions. The following are the popular commercial schedulers for x86 platforms.

  • Platform LSF allows you to manage and accelerate batch workload processing for mission-critical compute- or data-intensive application workloads. With Platform LSF you can intelligently schedule and guarantee the completion of batch workloads across your distributed, virtualized, High Performance Computing (HPC) environment. The benefits include maximum resource utilization and a commercially supported product.
  • PBSPro is based on the popular PBS (Portable Batch System) and is considered field-proven, service-oriented grid infrastructure software that increases productivity even in the most complex computing environments. It efficiently distributes workloads across cluster, SMP, and hybrid configurations, scaling easily to hundreds or even thousands of processors. PBSPro is in active use at more than 1,400 sites worldwide.
  • Sun N1 Grid Engine is based on the open source Grid Engine project (see above) and, as such, is focused on development of future versions of Grid Engine software through the community process. In addition to the open source functionality, the Sun-branded N1 Grid Engine software suite offers: ARCo (Accounting and Reporting Console), a web-based interface to accounting and monitoring data; GEMM (Grid Engine Management Module), an SCS (Sun Control Station) plugin for automatic deployment and monitoring of an N1 Grid Engine cluster; Microsoft Windows support for job execution; and commercial support contracts and services available from Sun.
  • Moab Workload Manager is a cluster workload management package that integrates the scheduling, managing, monitoring and reporting of cluster workloads. The Moab Cluster Suite simplifies and unifies management across one or multiple hardware, operating system, storage, network, license and resource manager environments. Moab's development was based on the Open Source Maui job scheduling package (see above).
  • Tivoli Workload Scheduler LoadLeveler is a parallel job scheduling system that allows users to run more jobs in less time by matching each job's processing needs and priority with the available resources, thereby maximizing resource utilization. LoadLeveler also provides a single point of control for effective workload management, offers detailed accounting of system utilization for tracking or chargeback and supports high availability configurations.

Conclusion

Resource schedulers are the backbone of HPC clusters. They represent an important layer between the user and the cluster hardware. Indeed, without a resource scheduler, using a large cluster would be almost impossible and terribly inefficient. Thankfully, there are many options, both freely available and from the commercial sector, that help deliver the full potential of your cluster.



Author Info


Dr. Douglas Eadline has worked with parallel computers since 1988 (anyone remember the Inmos Transputer?). After co-authoring the original Beowulf How-To, he continued to write extensively about Linux HPC clustering and parallel software issues. Much of Doug's early experience was in software tools and application performance. He has been building and using Linux clusters since 1995. Doug holds a Ph.D. in Chemistry from Lehigh University.