 Queue scheduling policy at The Horseshoe.

 Queue scheduling is simply the process of analyzing the PBS queue(s)
 and deciding which job(s) in state "Q" can be selected for execution and
 thus promoted to state "R" (i.e. running).

 However, to understand the dynamics of the queue(s), two questions have
 to be answered:

 a) When is scheduling done?
 b) What exactly is the policy for promoting jobs to state "R"?

 The answer to question a) is:

    - When a job is submitted to the queue.
    - When a job terminates.
    - Every 360 seconds if there have been no such events in the queue(s).

 The answer to question b) is of course much more involved, since the policy
 has to reflect the resource management, which must both optimize the
 utilization and ensure that the research groups are allocated resources
 according to the levels stated in the proposals submitted to DCSC.

 When a job is submitted to a queue the scheduler allocates the job a job
 number, an initial priority, a "space-time" area, and places the job in
 state "Q". The "space-time" area is simply the number of nodes requested
 times the requested walltime, which defaults to 10 minutes if not specified.
 The walltime is the maximum wall-clock time the job will be allowed to spend
 in state "R".
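 The "space-time" bookkeeping can be sketched in a few lines; the function
 name and the choice of node-minutes as the unit are illustrative assumptions,
 not part of the actual scheduler:

```python
# Hypothetical sketch: a job's "space-time" area as described above.
# The scheduler's real internal units are not documented here;
# node-minutes are used purely for illustration.

DEFAULT_WALLTIME_MIN = 10  # walltime defaults to 10 minutes if not specified


def space_time_area(nodes, walltime_min=DEFAULT_WALLTIME_MIN):
    """Return the job's space-time area in node-minutes."""
    return nodes * walltime_min


# A 4-node job with a 2-hour walltime occupies 4 * 120 = 480 node-minutes;
# an 8-node job with no stated walltime gets the 10-minute default.
print(space_time_area(4, 120))   # -> 480
print(space_time_area(8))        # -> 80
```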

 The priority will grow as a function of time as long as the job is in state
 "Q". If the job belongs to a user who is a member of a research group which
 has used more than the target share of resources mentioned above, the
 priority growth is adjusted by a "penalty" depending on the level of the
 group's past usage. This "memory" of past usage decays over a period of 5
 weeks; the decay is in place to promote usage and utilization of the cluster.
 This procedure of steering towards a target usage per group (or user) is
 called FairShare. An explanation of how priorities are calculated, and the
 content of the current FairShare "memory", can be found on the
 FairShare Information page.
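 The growth-with-penalty-and-decay mechanism can be sketched as follows. The
 growth rate, the shape of the penalty, and the linear decay weights are
 made-up illustrative numbers, not the actual MAUI FairShare configuration:

```python
# Illustrative FairShare sketch -- all constants are assumptions for the
# sake of the example, not the real scheduler parameters.

DECAY_WEEKS = 5  # past usage stops counting after roughly 5 weeks


def priority(base, hours_queued, usage_ratio, target_share=1.0):
    """Priority grows while the job waits in state "Q"; growth is slowed
    (the "penalty") for groups whose decayed past usage exceeds their
    target share."""
    growth_rate = 100.0                               # assumed points/hour
    penalty = max(0.0, usage_ratio - target_share)    # over-use slows growth
    return base + growth_rate * hours_queued / (1.0 + penalty)


def decayed_usage(weekly_usage):
    """Weight past usage linearly so that weeks older than DECAY_WEEKS no
    longer count. weekly_usage[0] is the most recent week."""
    return sum(u * (1 - i / DECAY_WEEKS)
               for i, u in enumerate(weekly_usage[:DECAY_WEEKS]))


# A group at its target share gains full priority growth; a group at twice
# its share grows only half as fast.
print(priority(1000, 10, 1.0))   # -> 2000.0
print(priority(1000, 10, 2.0))   # -> 1500.0
```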

 In each scheduling iteration the scheduler will try to place jobs in state
 "R" based on job priorities. However, if this were the only metric for
 starting jobs, it would lead to low utilization of the system: the job in
 state "Q" with the highest priority might simply not be able to run because
 there are not enough idle nodes available, and the scheduler would not let
 any other job run, even if there were enough resources for other,
 lower-priority jobs.

 This is where the "space-time" areas come in: the scheduler is allowed to
 "backfill", i.e. scan forward in the queue to see if there are jobs which
 ask for fewer resources and have stated a walltime, such that even if these
 jobs are allowed to run, the original job with the highest priority will
 still start running at the same time as it would without "backfill".
 For example: The queue has 3 jobs in state "Q" and 100 nodes in idle state.
              The scheduler knows that 28 nodes will be available in 4 days
              according to the walltimes specified for the jobs currently
              running.

              job1 : asking for 128 nodes, walltime = 2 days, priority = 5000
              job2 : asking for 30 nodes,  walltime = 2 days, priority = 4000
              job3 : asking for 60 nodes,  walltime = 2 days, priority = 3000

 In this case the scheduler will be able to run job2 and job3 before job1 is
 even able to start. The "backfill" strategy thus allows the scheduler to fit
 the "space-time" areas of the individual jobs into the total resource
 "space-time" area of the cluster, like puzzle pieces.
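 A minimal, conservative version of this backfill decision can be sketched as
 follows; it is not MAUI's actual algorithm, only an illustration of the rule
 that a backfilled job must finish before the top-priority job's reserved
 start:

```python
# Conservative backfill sketch (illustrative, not MAUI's real algorithm).
# Jobs are (name, nodes, walltime_days, priority) tuples.

def backfill(idle_now, future_free, jobs):
    """idle_now: nodes idle right now.
    future_free: (day, nodes) pairs when running jobs release their nodes.
    jobs: queued jobs in any order.
    Returns the names of jobs that may start immediately."""
    queue = sorted(jobs, key=lambda j: -j[3])      # highest priority first
    started = []
    top_name, top_need, _, _ = queue[0]
    if top_need <= idle_now:                       # top job fits right away
        started.append(top_name)
        idle_now -= top_need
        reserve_day = 0                            # nothing blocks the rest
    else:
        # Earliest day the top job can start, as running jobs free nodes;
        # those nodes are reserved for it from that day on.
        avail, reserve_day = idle_now, 0
        for day, nodes in sorted(future_free):
            avail += nodes
            reserve_day = day
            if avail >= top_need:
                break
    for name, nodes, wall, _ in queue[1:]:
        # Backfill rule: the job must fit in the idle nodes AND finish
        # before the top job's reserved start, so the top job is not delayed.
        if nodes <= idle_now and (reserve_day == 0 or wall <= reserve_day):
            started.append(name)
            idle_now -= nodes
    return started


# The example above: 100 idle nodes, 28 more free in 4 days.
print(backfill(100, [(4, 28)],
               [("job1", 128, 2, 5000),
                ("job2", 30, 2, 4000),
                ("job3", 60, 2, 3000)]))           # -> ['job2', 'job3']
```

 With these inputs job1 is still guaranteed its 128 nodes on day 4, because
 job2 and job3 both finish on day 2.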

 This example illustrates why every job submitted to the queue(s) should have
 a stated walltime, to optimize the utilization! Just remember:
       *** A job exceeding the stated walltime will be terminated. ***

 As a closing remark on the subject of queue scheduling, the administrators at
 The Horseshoe would like to remind all users that no matter how clever the
 scheduling policy is, it cannot compensate for a lack of resources!

 Much more information about the MAUI scheduler used at The Horseshoe is
 available at http://www.supercluster.org/.

 A final note on the walltime resource:

 -- It can be specified on the commandline with the syntax:

    qsub -l walltime=HH:MM:SS pbs_job_script

 -- It can be specified in the job script with the syntax:

    #PBS -l walltime=HH:MM:SS

 As with other #PBS directives found in a script, it must appear in the
 first block of comments, as shown in the example scripts.