Why is the overall utilization of Horseshoe only ~90% when there are always jobs waiting in the queue?

One might think that the machine utilization is essentially 100%, since there are always jobs waiting to be executed. Why is it then that the actual utilization hovers around 90% (typical value)?

This apparent paradox is a consequence of the finite number of nodes and of the job policies on Horseshoe:

  • There are no preset limits on the number of nodes a job can request,
  • The timelimit for jobs in the workq queue is 200 hours.

The 30 Day Profile lists all the jobs during the last month. A typical month has a handful of 128 node jobs, ~twenty 64 node jobs, a few hundred 32 and 16 node jobs, and a few thousand jobs with node requirements less than 16, mostly serial jobs.

The policy of running large multi-node jobs reduces the overall utilization of the cluster. In order to start a 128 node job, there must be 127 idle nodes immediately prior to job start. Collecting these 127 idle nodes necessarily means that a number of cpu-hours are wasted, because nodes that are freed early sit idle until the last one becomes available.

A rough estimate of this utilization loss can be obtained by considering a cluster with M nodes and a job-timelimit T. Collecting N nodes will on the average take N/M times T, i.e. the average node collection time will be

$$T_{\text{queue}} = \frac{N\,T}{M}$$

once the job has acquired the highest priority of all the queued jobs (please review the scheduling strategy). The average number of idle nodes during this collection time will be N/2. The average (accumulated) idle time for collecting N nodes in an M node cluster with a job-timelimit of T is therefore:

$$T_{\text{idle}} = \frac{N^{2}\,T}{2M}$$ (measured in units of #CPU × time, e.g. cpu-hours)
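The step linking the two expressions above can be written out explicitly. This is only a sketch of the estimate already given in the text: it assumes that idle nodes accumulate roughly linearly during the collection phase, so that on average N/2 of them are idle.

```latex
\begin{align}
T_{\text{queue}} &= \frac{N}{M}\,T \\
T_{\text{idle}}  &= \frac{N}{2}\,T_{\text{queue}}
                  = \frac{N}{2}\cdot\frac{N\,T}{M}
                  = \frac{N^{2}\,T}{2M}
\end{align}
```

For example, with N = 128, M = 512 and T = 200 hours, the collection time is 50 hours and the accumulated idle time is 64 × 50 = 3200 cpu-hours.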

Horseshoe has two queues: workq with M = 512 and T = 200 hours, and giga with M = 140 and T = 50 hours. Table 1 illustrates the utilization loss caused by starting an N node job in either of these queues, with the loss given as a percentage of one month of cpu time on all M nodes in the queue.

Table 1. Utilization loss for starting an N node job. T and T_queue are in hours, T_idle in cpu-hours, and T_idle/N in hours per node.

M     T     N     T_queue   T_idle   T_idle/N   %Loss
512   200   256   100       12800    50         3.47
512   200   128   50        3200     25         0.87
512   200   64    25        800      12.5       0.22
512   200   32    12        200      6.25       0.054
512   200   16    6         50       3.1        0.014
140   50    128   46        2926     22.9       2.90
140   50    64    23        731      11.4       0.73
140   50    32    11        183      5.7        0.18
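The rows of Table 1 follow directly from the two formulas, and a minimal Python sketch can recompute them. The function names are illustrative, and the month length of 30 days (720 hours) is an assumption: the text does not state it, but it appears to reproduce the %Loss column.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day month, i.e. 720 hours


def collection_cost(n, m, t):
    """Return (t_queue, t_idle) for collecting n nodes in an m-node queue.

    t is the job time limit in hours; t_idle is in cpu-hours.
    """
    t_queue = n * t / m          # average node collection time
    t_idle = (n / 2) * t_queue   # on average n/2 nodes sit idle meanwhile
    return t_queue, t_idle


def loss_percent(n, m, t, hours=HOURS_PER_MONTH):
    """Idle time as a percentage of one month of capacity on all m nodes."""
    _, t_idle = collection_cost(n, m, t)
    return 100.0 * t_idle / (m * hours)


# Queue parameters (M, T) and job sizes N as listed in Table 1.
queues = {"workq": (512, 200), "giga": (140, 50)}
sizes = {"workq": [256, 128, 64, 32, 16], "giga": [128, 64, 32]}

for name, (m, t) in queues.items():
    for n in sizes[name]:
        t_queue, t_idle = collection_cost(n, m, t)
        print(f"{name:6s} N={n:4d}  T_queue={t_queue:6.1f} h  "
              f"T_idle={t_idle:8.0f} cpu-h  loss={loss_percent(n, m, t):.3f}%")
```

Changing `HOURS_PER_MONTH` or the queue parameters shows how sensitive the loss is to the time limit T and to the square of the job size N.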

Table 1 shows that each 128 node job in the workq queue causes on average 3200 hours of idle time (~130 cpu-days), i.e. 127 nodes waiting 25 hours on average, corresponding to 0.87% of one month's capacity. The same 128 node job run in the giga queue gives a utilization loss of 2.90%. Scheduling a handful of 128 node jobs and ~twenty 64 node jobs thus accounts for the overall utilization hovering around 90%, even though the number of queued jobs is never zero. Note also that short, highly parallel jobs inherently have a low efficiency: a 10 hour 128 node job uses 1280 cpu-hours but generates 3200 hours of idle time, corresponding to an overall efficiency of 29%.
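The arithmetic behind these two claims can be checked with a short Python sketch. The job mix is an assumption made only for illustration: "a handful" is read as 5 jobs and "~twenty" as 20, and a month is taken as 30 days. Efficiency is defined here as useful cpu-hours divided by useful plus idle cpu-hours, which matches the 29% quoted above.

```python
def idle_cpu_hours(n, m, t):
    """Average accumulated idle time in cpu-hours: N^2 * T / (2 * M)."""
    return n * n * t / (2 * m)


def job_efficiency(n, run_hours, m, t):
    """Useful cpu-hours divided by useful plus collection-idle cpu-hours."""
    useful = n * run_hours
    return useful / (useful + idle_cpu_hours(n, m, t))


M, T = 512, 200              # workq values from the text
HOURS_PER_MONTH = 30 * 24    # assumption: 30-day month

# Assumed job mix: "a handful" read as 5, "~twenty" read as 20.
job_mix = {128: 5, 64: 20}

capacity = M * HOURS_PER_MONTH
lost = sum(count * idle_cpu_hours(n, M, T) for n, count in job_mix.items())

print(f"monthly loss from large jobs: {100 * lost / capacity:.1f}%")   # 8.7%
print(f"10 h, 128 node job efficiency: {job_efficiency(128, 10, M, T):.0%}")  # 29%
```

Under these assumptions the large jobs alone cost close to 9% of the monthly capacity, which is consistent with a utilization of roughly 90%.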

To the extent that the Maui scheduler can backfill the idle nodes during the collection phase with smaller jobs (few nodes, short time), the utilization loss will be smaller. Automatic backfilling with a large multi-node job is a rare event: a 128 node job will never get backfilled, and a 64 node job will only in special cases be able to run during the collection phase for a 128 node job. Users can therefore help the backfilling of jobs by adjusting their job resource requests in accordance with the current Backfill Window, as described in the how-to.
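The idea of matching a request to the Backfill Window can be illustrated with a small Python sketch. This is not how Maui itself decides; the real scheduler also weighs priorities and reservations. The `BackfillWindow` class, its fields, and the numbers are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class BackfillWindow:
    """Hypothetical snapshot of idle resources during node collection."""
    idle_nodes: int     # nodes currently idle
    hours_left: float   # time until the waiting large job is due to start


def fits(window, nodes, walltime_hours):
    """True if a job could run to completion inside the window."""
    return nodes <= window.idle_nodes and walltime_hours <= window.hours_left


# Illustrative values only: 40 idle nodes for the next 12 hours.
window = BackfillWindow(idle_nodes=40, hours_left=12.0)

print(fits(window, nodes=8, walltime_hours=10))    # small, short job fits
print(fits(window, nodes=64, walltime_hours=10))   # too many nodes
print(fits(window, nodes=8, walltime_hours=24))    # requested time too long
```

A job that requests fewer nodes or a shorter walltime than it nominally would is more likely to pass such a check, which is the adjustment the text recommends.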

A policy of only allowing jobs with up to, say, 16 nodes and a timelimit of 50 hours would ensure a very high utilization (99+%). But front-line research often requires running massively parallel jobs for a long time. The current job policies reflect that we prioritize front-line research over a high overall utilization.