Slurm - The hs9 job scheduler

For hs9, we have replaced the PBS/Torque job scheduler with Slurm.

A good, quick introduction to Slurm can be found here:

Slurm Quick Start User Guide

To ease the transition from PBS/Torque, wrapper scripts are available that translate PBS commands (e.g., qsub) into Slurm commands (e.g., sbatch). However, you will probably still have to adjust your scripts to ensure that they work on hs9.
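For reference, the most commonly used PBS commands correspond roughly to the following native Slurm commands (job.sh and the job id 12345 are only illustrative placeholders):

# Submit a batch job
qsub job.sh                  # PBS/Torque
sbatch job.sh                # Slurm

# Show the queue / your own jobs
qstat                        # PBS/Torque
squeue -u $USER              # Slurm

# Cancel a job by id
qdel 12345                   # PBS/Torque
scancel 12345                # Slurm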

Basic script example

The following is a basic script:

#! /bin/bash
#
# Name of the job
#SBATCH -J test-2nodes
#
# Use two nodes with 40 cores in total
#SBATCH -N 2 -n 40
#
# Memory: Use at most one of these two options
# MB memory/node. Use 256000 to get fat nodes
##SBATCH --mem=256000
# MB memory/core. Use 12800 to get fat nodes, 20*12.8 ~= 256 GB
##SBATCH --mem-per-cpu=12800
#
# Request access to two GPU cards
##SBATCH --gres=gpu:2
#
# Max Walltime 1h
#SBATCH -t 1:00:00
# Max Walltime 7days-0hours
##SBATCH -t 7-0
#
# Remove one # to run on the express node instead
# Note: the express node only has 1 GPU card
##SBATCH -p express
##SBATCH --gres=gpu:1
#
# Send email
# Valid types are BEGIN, END, FAIL, REQUEUE, and ALL
#SBATCH --mail-type=ALL
# By default, email is sent to the address specified when the user was
# created; this can be overridden here
##SBATCH --mail-user=user@example.com
#
# Write stdout/stderr output, %j is replaced with the job number
# use the same path name to write everything to one file
#SBATCH --output slurm-%j.txt
#SBATCH --error slurm-%j.txt

echo Running on $(hostname)
echo Available nodes: $SLURM_NODELIST
echo Slurm_submit_dir: $SLURM_SUBMIT_DIR
echo Start time: $(date)

# cd $SLURM_SUBMIT_DIR   # not necessary, is done by default

# Load relevant modules
module purge
module add amber/14.2014.04

# Copy all input files to local scratch on all nodes
for f in *.inp *.prmtop *.inpcrd ; do
    sbcast $f $SCRATCH/$f
done

cd $SCRATCH

if [ "${CUDA_VISIBLE_DEVICES:-NoDevFiles}" != NoDevFiles ]; then
    # We have access to at least one GPU
    cmd=pmemd.cuda.MPI
else
    # no GPUs available
    cmd=pmemd.MPI
fi

export INPF=$SCRATCH/input
export OUPF=$SCRATCH/input

mpirun \
    $cmd -O -i em.inp -o $SLURM_SUBMIT_DIR/em.out -r em.rst \
    -p test.prmtop -c test.inpcrd -ref test.inpcrd

echo Done.
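If the script above is saved as, say, test-2nodes.sh (the file name is only an illustration), it is submitted from the front end with sbatch. Because of the --output/--error lines, stdout and stderr end up in slurm-<jobid>.txt in the directory the job was submitted from:

sbatch test-2nodes.sh      # submit; sbatch prints the job id
cat slurm-<jobid>.txt      # inspect the combined output once the job has run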

FAQ

How do I request access to our CUDA devices?
Use the sbatch parameter --gres=gpu:2, as in the sketch below.
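A minimal sketch of a GPU job script (the job name and walltime are only illustrative). Inside the job, Slurm exposes the allocated cards via $CUDA_VISIBLE_DEVICES, which the basic script example above also relies on:

#! /bin/bash
#SBATCH -J gpu-test
#SBATCH -N 1 -n 20
#SBATCH --gres=gpu:2
#SBATCH -t 1:00:00

# The ids of the allocated GPU cards, e.g. "0,1"
echo Allocated GPUs: $CUDA_VISIBLE_DEVICES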
How do I request only nodes with 256 GB memory?
You can specify the amount of RAM per node or per core as shown in the example below. Fat hs9 nodes have 256 GB, normal non-GPU nodes have 128 GB, and GPU nodes have 64 GB.

# Memory: Use at most one of these two options
# MB memory/node. Use 256000 to get fat nodes
#SBATCH --mem=256000
# MB memory/core. Use 12800 to get fat nodes, 20*12.8 ~= 256 GB
#SBATCH --mem-per-cpu=12800
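To check how much memory Slurm has configured for the individual nodes (and thereby which nodes count as fat), a sinfo format string along these lines can be used; the exact columns shown are only a sketch:

# One line per node: node name, memory in MB, partition
sinfo -N -o "%N %m %P"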
Previously I used rsh to copy files to scratch. Why does this no longer work?
Please use sbcast instead, as shown in the example below.

# Copy all input files to local scratch on all nodes
for f in *.inp *.prmtop *.inpcrd ; do
    sbcast $f $SCRATCH/$f
done
/scratch is no longer writable?
To allow multiple users to run jobs on the same node, scratch files are now kept in /scratch/job.$SLURM_JOBID, e.g., /scratch/job.24. The variable $SCRATCH contains the value to use.
For scripts that need to run on both hs9 and hsx (x < 9), use the following snippet at the beginning of the script:

test -n "$SCRATCH" || export SCRATCH=/scratch
Where did $PBS_O_WORKDIR go?
Use the Slurm equivalent instead, $SLURM_SUBMIT_DIR. In your scripts you can use the snippet below to handle either variable:

test -d "$PBS_O_WORKDIR" && cd $PBS_O_WORKDIR
test -d "$SLURM_SUBMIT_DIR" && cd $SLURM_SUBMIT_DIR
How do I run an interactive job?
Use srun --pty /bin/bash as shown in the examples below. (You usually want to add a few extra options.)

user@fe9:~$ srun -p express -N 1 -n 20 --gres=gpu:1 --pty /bin/bash
user@c91n06:~$ hostname
c91n06
user@c91n06:~$ exit
exit
user@fe9:~$

or

user@fe9:~$ srun -p workq -N 1 -n 20 -t 5:00:00 --pty /bin/bash
user@c91n03:~$ hostname
c91n03
user@c91n03:~$ exit
exit
user@fe9:~$
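For a quick interactive check that a GPU has actually been allocated, a session along these lines can be used on the express node (a sketch; the node name is illustrative and nvidia-smi must be available on the node):

user@fe9:~$ srun -p express -N 1 -n 1 --gres=gpu:1 --pty /bin/bash
user@c91n06:~$ echo $CUDA_VISIBLE_DEVICES   # id of the allocated GPU card
user@c91n06:~$ nvidia-smi                   # show the allocated GPU and its current load
user@c91n06:~$ exit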