Slurm

Commands for managing your “jobs”: Memo

Information about a job

sacct reports accounting information about active or completed jobs.

sacct -j job-id 
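By default sacct prints a fixed set of columns. The sketch below selects columns explicitly with --format. The job ID 123456 is a made-up placeholder, and the fallback branch (printing the command when sacct is not installed) is only there so the snippet can be tried outside the cluster.

```shell
#!/bin/bash
# Hypothetical job ID, replace with one of your own.
job_id=123456
# Columns to display: all are standard sacct field names.
fields="JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode"

if command -v sacct >/dev/null 2>&1; then
    sacct -j "$job_id" --format="$fields"
else
    # Not on a Slurm machine: just show the command that would be run.
    echo "sacct -j $job_id --format=$fields"
fi
```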

To submit a job

sbatch submits a batch script to the scheduler. The script typically contains one or more srun commands to launch parallel tasks.

sbatch script.slurm
sbatch -x node037 my_script.sh -> submits the job while excluding compute node node037
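When a job is submitted from another script, it is often handy to keep its ID for later calls to sacct or scancel. A minimal sketch, assuming the --parsable option of sbatch, which prints the job ID alone (optionally followed by ";" and a cluster name). The helper name parse_job_id is hypothetical, and the submission only happens if sbatch and script.slurm are actually present.

```shell
#!/bin/bash
# Keep only the job ID from "jobid" or "jobid;cluster".
parse_job_id() {
    printf '%s\n' "${1%%;*}"
}

# Submit only when sbatch and the script exist, so the snippet is harmless elsewhere.
if command -v sbatch >/dev/null 2>&1 && [ -f script.slurm ]; then
    job_id=$(parse_job_id "$(sbatch --parsable script.slurm)")
    echo "Submitted job ${job_id}"
fi
```

The stored job_id can then be passed to sacct -j or scancel.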

To cancel a job

scancel job-id

Information about partitions and nodes

sinfo

To list free nodes

The partition each node belongs to is shown in the output.

sinfo --states=idle

Node states

  • mix : consumable resources partially allocated
  • idle : not allocated to any job and available for use
  • drain : unavailable for use per system administrator request
  • drng : currently executing a job, but will not be allocated to additional jobs. The node will be changed to state DRAINED when the last job on it completes.
  • alloc : consumable resources fully allocated
  • down : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.

State of your jobs

squeue --me

Job states

  • BF BOOT_FAIL Job terminated due to launch failure.
  • CA CANCELLED Job was explicitly cancelled.
  • CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  • CF CONFIGURING Job has been allocated resources, but is waiting for them to become ready for use.
  • CG COMPLETING Job is in the process of completing.
  • F FAILED Job terminated with a non-zero exit code or other failure condition.
  • NF NODE_FAIL Job terminated due to failure of one or more allocated nodes.
  • OOM OUT_OF_MEMORY Job experienced out of memory error.
  • PD PENDING Job is awaiting resource allocation.
  • PR PREEMPTED Job terminated due to preemption.
  • R RUNNING Job currently has an allocation.
  • RD RESV_DEL_HOLD Job is being held after requested reservation was deleted.
  • RF REQUEUE_FED Job is being requeued by a federation.
  • RH REQUEUE_HOLD Held job is being requeued.
  • RQ REQUEUED Completing job is being requeued.
  • RS RESIZING Job is about to change size.
  • SI SIGNALING Job is being signaled.
  • SE SPECIAL_EXIT The job was requeued in a special state.
  • SO STAGE_OUT Job is staging out files.
  • ST STOPPED Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUs have been retained by this job.
  • S SUSPENDED Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
  • TO TIMEOUT Job terminated upon reaching its time limit.
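The state codes above can be used in scripts, for example to tell jobs that have ended from jobs that are still pending or running. This is only a sketch: the function name job_has_ended is hypothetical, and the set of "ended" codes is simply the entries in the list above whose description says the job terminated or completed.

```shell
#!/bin/bash
# Return success (0) if the given compact state code means the job has ended.
job_has_ended() {
    case "$1" in
        BF|CA|CD|F|NF|OOM|PR|TO) return 0 ;;
        *) return 1 ;;
    esac
}

# Report each of your jobs as ended or still active.
# In squeue's format syntax, %i is the job ID and %t the compact state code.
if command -v squeue >/dev/null 2>&1; then
    squeue --me --noheader --format="%i %t" | while read -r id state; do
        if job_has_ended "$state"; then
            echo "$id ended ($state)"
        else
            echo "$id active ($state)"
        fi
    done
fi
```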

Job in real time

srun submits a job and runs it in real time, waiting until it completes. It accepts a wide variety of options.

srun [options] command [arguments]

Sbatch options

  • #SBATCH --partition=<partition name>
  • #SBATCH --job-name=<job name>
  • #SBATCH --output=<file in which standard output is saved>
  • #SBATCH --error=<file in which standard error is saved>
  • #SBATCH --input=<file used as standard input>
  • #SBATCH --open-mode=<append|truncate> : "append" adds to existing output files, "truncate" overwrites them
  • #SBATCH --mail-type=<BEGIN,END,FAIL,TIME_LIMIT,TIME_LIMIT_50,...> : events that trigger an e-mail notification
  • #SBATCH --sockets-per-node=<1 or 2>
  • #SBATCH --threads-per-core=<number of threads per core> : not usable on the MatriCS platform, whose nodes are not multithreaded (ask us if you need it)
  • #SBATCH --cores-per-socket=<number of cores per socket>
  • #SBATCH --cpus-per-task=<number of CPUs per task>
  • #SBATCH --ntasks=<number of tasks>
  • #SBATCH --mem-per-cpu=<RAM per CPU (core)>
  • #SBATCH --ntasks-per-node=<number of tasks per node>
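To show how these options fit together, here is a minimal batch script sketch. All values (job name, file names, task count, memory) are illustrative assumptions, and no partition is set, so the site default would be used. The #SBATCH lines must come before the first executable command. The fallback branch without srun is only there so the script can be dry-run outside Slurm.

```shell
#!/bin/bash
# Job name and output files; %j expands to the job ID.
#SBATCH --job-name=demo
#SBATCH --output=demo_%j.out
#SBATCH --error=demo_%j.err
# Four tasks, each with one CPU and 2G of RAM (illustrative values).
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
# Send an e-mail when the job ends or fails.
#SBATCH --mail-type=END,FAIL

# SLURM_NTASKS is set by Slurm inside the job; default to 1 elsewhere.
n_tasks="${SLURM_NTASKS:-1}"
echo "Launching ${n_tasks} task(s)"

# Each task prints the name of the node it runs on.
# Outside Slurm (no srun available) the command runs once, directly.
if command -v srun >/dev/null 2>&1; then
    srun uname -n
else
    uname -n
fi
```

Saved as script.slurm, it would be submitted with sbatch script.slurm.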

SBATCH environment variables

  • SLURM_JOB_ID : job ID
  • SLURM_JOB_NAME : job name
  • SLURM_JOB_NODELIST : list of nodes allocated to the job
  • SLURM_SUBMIT_HOST : host from which the job was submitted
  • SLURM_SUBMIT_DIR : directory from which the job was submitted
  • SLURM_JOB_NUM_NODES : number of nodes allocated to the job
  • SLURM_NTASKS_PER_NODE : number of tasks requested per node
  • SLURM_JOB_CPUS_PER_NODE : number of CPUs available to the job on each node
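Inside a batch script these variables can be read like any other shell variable. The sketch below logs a short summary at the start of a job. The fallback values after ":-" are assumptions added so the snippet also runs outside Slurm, where the variables are unset.

```shell
#!/bin/bash
# Work from the directory the job was submitted from.
cd "${SLURM_SUBMIT_DIR:-$PWD}" || exit 1

# One-line summary of the job, useful at the top of the output file.
summary="job=${SLURM_JOB_ID:-none} name=${SLURM_JOB_NAME:-none} nodes=${SLURM_JOB_NUM_NODES:-1}"
echo "$summary"

# Nodes allocated to the job (falls back to the local host name).
echo "node list: ${SLURM_JOB_NODELIST:-$(uname -n)}"
```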