Slurm

Commands for managing your “jobs”: Memo

Information about a job

sacct reports accounting information about active or completed jobs.

sacct -j job-id 
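
By default sacct prints a fixed set of columns. As a minimal sketch, the columns can be chosen with --format. The helper name job_summary is hypothetical, and the field list is only one reasonable selection.

```shell
# Hypothetical helper (not part of Slurm): compact accounting summary for one job.
job_summary() {
    # $1 is the job ID, as printed by sbatch or squeue.
    sacct -j "$1" --format=JobID,JobName,Partition,State,Elapsed,ExitCode
}
```

Usage: job_summary 12345, where 12345 stands in for a real job ID.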

To submit a job

sbatch submits a batch script for later execution. The script typically contains one or more srun commands to launch parallel tasks.

sbatch script.slurm
sbatch -x node037 my_script.sh -> submits the job while excluding compute node node037
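
As a rough illustration of what such a script can look like, the sketch below writes a minimal script.slurm and submits it only if sbatch is installed. The job name, task count, time limit and output file name are placeholder values, not settings taken from any particular cluster.

```shell
# Write a minimal batch script; every value below is a placeholder.
cat > script.slurm <<'EOF'
#!/bin/bash
# Name shown by squeue
#SBATCH --job-name=demo
# Number of parallel tasks
#SBATCH --ntasks=4
# Wall-clock limit (hh:mm:ss)
#SBATCH --time=00:10:00
# Output file; %j expands to the job ID
#SBATCH --output=demo-%j.out

# Each srun launches a job step across the allocated tasks.
srun hostname
EOF

# Submit only where Slurm is available.
if command -v sbatch >/dev/null 2>&1; then
    sbatch script.slurm
fi
```

The #SBATCH lines must come before the first executable command in the script; sbatch stops reading directives once the script body starts.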

To cancel a job

scancel job-id
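
scancel can also select jobs by owner and state instead of by ID. A small sketch, with cancel_my_pending as a hypothetical helper name:

```shell
# Hypothetical helper: cancel only your own jobs that are still pending,
# leaving running jobs untouched.
cancel_my_pending() {
    scancel --user="$(id -un)" --state=PENDING
}
```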

Information about partitions and nodes

sinfo

To list free nodes

The output shows which partition each idle node belongs to.

sinfo --states=idle
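
To get a quick count of idle nodes per partition, the sinfo output can be reduced with awk. This is only a sketch: the helper name idle_per_partition is hypothetical, and it assumes the %N (node name) and %P (partition name) format fields. Note that %P marks the default partition with a trailing "*".

```shell
# Hypothetical helper: count idle nodes in each partition.
idle_per_partition() {
    # One line per node: "<node> <partition>", without the header row.
    sinfo --noheader --Node --states=idle --format='%N %P' |
        awk '{ count[$2]++ } END { for (p in count) print p, count[p] }' |
        sort
}
```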

Node states

  • mix : consumable resources partially allocated
  • idle : not allocated to any job and available for use
  • drain : unavailable for use per system administrator request
  • drng : draining; currently executing a job, but will not be allocated to additional jobs. The node changes to state DRAINED when the last job on it completes.
  • alloc : consumable resources fully allocated
  • down : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.
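
For nodes in the down or drain states, sinfo can also print the reason that was recorded when the node was taken out of service. A minimal sketch, with a hypothetical helper name:

```shell
# Hypothetical helper: show why nodes are unavailable.
# --list-reasons (-R) prints the reason recorded for nodes that are
# down, drained, or failing.
why_unavailable() {
    sinfo --list-reasons
}
```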

State of your jobs

squeue -u your-login 
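
The columns printed by squeue can be adjusted with --format, which is handy for seeing why a job is still waiting. The sketch below is one possible layout; the helper name my_jobs and the column widths are arbitrary choices.

```shell
# Hypothetical helper: list your jobs with their state and reason.
# %i = job ID, %j = job name, %T = state (long form), %M = time used,
# %R = reason (pending jobs) or node list (running jobs).
my_jobs() {
    squeue --user="$(id -un)" --format='%.10i %.20j %.10T %.10M %R'
}
```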

Job states

  • BF BOOT_FAIL Job terminated due to launch failure.
  • CA CANCELLED Job was explicitly cancelled.
  • CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
  • CF CONFIGURING Job has been allocated resources, but is waiting for them to become ready for use.
  • CG COMPLETING Job is in the process of completing.
  • F FAILED Job terminated with a non-zero exit code or other failure condition.
  • NF NODE_FAIL Job terminated due to failure of one or more allocated nodes.
  • OOM OUT_OF_MEMORY Job experienced out of memory error.
  • PD PENDING Job is awaiting resource allocation.
  • PR PREEMPTED Job terminated due to preemption.
  • R RUNNING Job currently has an allocation.
  • RD RESV_DEL_HOLD Job is being held after requested reservation was deleted.
  • RF REQUEUE_FED Job is being requeued by a federation.
  • RH REQUEUE_HOLD Held job is being requeued.
  • RQ REQUEUED Completing job is being requeued.
  • RS RESIZING Job is about to change size.
  • SI SIGNALING Job is being signaled.
  • SE SPECIAL_EXIT The job was requeued in a special state.
  • SO STAGE_OUT Job is staging out files.
  • ST STOPPED Job has an allocation, but execution has been stopped with a SIGSTOP signal. CPUs have been retained by this job.
  • S SUSPENDED Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
  • TO TIMEOUT Job terminated upon reaching its time limit.
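
The two-letter codes above are what squeue prints with the %t format field. As a small sketch, they can be tallied to get an overview of your queue; the helper name jobs_by_state is hypothetical.

```shell
# Hypothetical helper: count your jobs per compact state code (R, PD, CG, ...).
jobs_by_state() {
    # %t prints the compact state code, one job per line.
    squeue --noheader --user="$(id -un)" --format='%t' |
        awk '{ n[$1]++ } END { for (s in n) print s, n[s] }' |
        sort
}
```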

Job in real time

srun submits a job for execution in real time: it waits until resources are allocated, then runs the command. srun has a wide variety of options.

srun [options] command [arguments]
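
A common use of srun is to open an interactive shell on a compute node. The sketch below assumes a single task and a 30-minute limit, both placeholder values; interactive_shell is a hypothetical helper name.

```shell
# Hypothetical helper: open an interactive shell on a compute node.
# --pty runs the first task in a pseudo terminal so the shell is usable
# interactively; the task count and time limit are placeholders.
interactive_shell() {
    srun --ntasks=1 --time=00:30:00 --pty bash
}
```

The shell starts once the scheduler grants the allocation, and leaving it with exit ends the job.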