Commands for managing your “jobs”: Memo
Information about a job
Reports information about an active or completed job.
sacct -j job-id
To submit a job
The script will typically contain one or more srun commands to launch parallel tasks.
sbatch script.slurm
sbatch -x node037 my_script.sh -> submits the job while excluding the compute node node037
To cancel a job
scancel job-id
Information about partitions and nodes
sinfo
To list free nodes
The partition each node belongs to is shown in the output.
sinfo --states=idle
Node states
- mix : consumable resources partially allocated
- idle : consumable resources available for new requests
- drain : unavailable for use per system administrator request
- drng : currently executing a job, but will not be allocated additional jobs. The node will change to state DRAINED when the last job on it completes
- alloc : consumable resources fully allocated
- down : unavailable for use. Slurm can automatically place nodes in this state if some failure occurs.
State of your jobs
squeue --me
Job states
- BF BOOT_FAIL Job terminated due to launch failure.
- CA CANCELLED Job was explicitly cancelled.
- CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
- CF CONFIGURING Job has been allocated resources, but is waiting for them to become ready for use.
- CG COMPLETING Job is in the process of completing.
- F FAILED Job terminated with a non-zero exit code or other failure condition.
- NF NODE_FAIL Job terminated due to failure of one or more allocated nodes.
- OOM OUT_OF_MEMORY Job experienced out of memory error.
- PD PENDING Job is awaiting resource allocation.
- PR PREEMPTED Job terminated due to preemption.
- R RUNNING Job currently has an allocation.
- RD RESV_DEL_HOLD Job is being held after requested reservation was deleted.
- RF REQUEUE_FED Job is being requeued by a federation.
- RH REQUEUE_HOLD Held job is being requeued.
- RQ REQUEUED Completing job is being requeued.
- RS RESIZING Job is about to change size.
- SI SIGNALING Job is being signaled.
- SE SPECIAL_EXIT The job was requeued in a special state.
- SO STAGE_OUT Job is staging out files.
- ST STOPPED Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUs have been retained by this job.
- S SUSPENDED Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
- TO TIMEOUT Job terminated upon reaching its time limit.
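When a script has to react to the state codes listed above, a small helper can sort them into broad groups. The following is a minimal sketch: the function name `classify_job_state` is hypothetical, and treating exactly the "terminated" codes as finished is an assumption drawn from the descriptions in this memo, not an official Slurm classification.

```shell
#!/bin/sh
# Sketch: sort a squeue state code into "finished" or "not finished".
# The grouping is an illustrative assumption based on the list above.
classify_job_state() {
    case "$1" in
        # Codes whose description says the job terminated or was cancelled
        BF|CA|CD|F|NF|OOM|PR|TO)
            echo "finished" ;;
        # Every other code from the list: pending, running, requeued, held...
        CF|CG|PD|R|RD|RF|RH|RQ|RS|SI|SE|SO|ST|S)
            echo "not finished" ;;
        *)
            echo "unknown" ;;
    esac
}

classify_job_state PD
classify_job_state TO
```

On a cluster, the codes could be read from `squeue --me -h -o "%i %t"`, which prints one job id and one compact state code per line without a header.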
Job in real time
srun submits a job and runs it in real time. It accepts a wide variety of options.
srun command with parameters
Sbatch options
- #SBATCH --partition=partition name
- #SBATCH --job-name=job name
- #SBATCH --output=file in which the standard output will be saved
- #SBATCH --error=file in which the standard error will be saved
- #SBATCH --input=file used as the standard input
- #SBATCH --open-mode="append" to append to an existing file, "truncate" to overwrite it
- #SBATCH --mail-type=<BEGIN,END,FAIL,TIME_LIMIT,TIME_LIMIT_50,...> events for which an e-mail is sent
- #SBATCH --sockets-per-node=1 or 2
- #SBATCH --threads-per-core=number of threads per core; not usable on the MatriCS platform, whose nodes are not multithreaded (ask us if you need it)
- #SBATCH --cores-per-socket=number of cores per socket
- #SBATCH --cpus-per-task=number of CPUs for each task
- #SBATCH --ntasks=number of tasks
- #SBATCH --mem-per-cpu=RAM per core
- #SBATCH --ntasks-per-node=number of tasks per node
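To show how these options fit together, here is a minimal sketch of a batch script. The job name, partition name, file names and resource values are hypothetical placeholders, and the `run_task` helper is an illustrative addition that falls back to running the command directly when `srun` is not installed, so the script can be dry-run on any machine.

```shell
#!/bin/bash
# Hypothetical job name, partition and file names: adapt them to your cluster.
#SBATCH --job-name=memo_demo
#SBATCH --partition=my_partition
#SBATCH --output=memo_demo.out
#SBATCH --error=memo_demo.err
#SBATCH --open-mode=truncate
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G

# Launch a command through srun when Slurm is installed;
# otherwise run it directly (illustrative fallback for dry runs).
run_task() {
    if command -v srun >/dev/null 2>&1; then
        srun "$@"
    else
        "$@"
    fi
}

# Under srun, each of the 4 tasks prints the name of the node it runs on.
run_task uname -n
```

Saved as, say, `memo_demo.slurm`, the script would be submitted with `sbatch memo_demo.slurm`. The `#SBATCH` lines are ordinary comments for the shell and are only interpreted by sbatch, which is why they must appear before the first command.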
SBATCH environment variables
- SLURM_JOB_ID : job id
- SLURM_JOB_NAME : job name
- SLURM_JOB_NODELIST : list of nodes allocated to the job
- SLURM_SUBMIT_HOST : host from which the job was submitted
- SLURM_SUBMIT_DIR : directory from which the job was submitted
- SLURM_JOB_NUM_NODES : number of nodes allocated to the job
- SLURM_NTASKS_PER_NODE : number of tasks requested per node
- SLURM_JOB_CPUS_PER_NODE : number of CPUs available to the job on each node
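These variables are typically read inside the batch script, for example to log where and how the job ran. The sketch below assumes nothing beyond the variable names listed here; the function name `slurm_job_summary` and the "not set" fallback text are hypothetical choices so that it also runs outside a job.

```shell
#!/bin/sh
# Sketch: print a short summary from the Slurm environment variables.
# Outside a Slurm job the variables are unset and "not set" is printed instead.
slurm_job_summary() {
    echo "job id: ${SLURM_JOB_ID:-not set}"
    echo "job name: ${SLURM_JOB_NAME:-not set}"
    echo "nodes: ${SLURM_JOB_NODELIST:-not set}"
    echo "node count: ${SLURM_JOB_NUM_NODES:-not set}"
    echo "submit host: ${SLURM_SUBMIT_HOST:-not set}"
    echo "submit dir: ${SLURM_SUBMIT_DIR:-not set}"
}

slurm_job_summary
```

Calling the function at the top of a batch script places the summary at the start of the file given to `--output`, which helps when matching log files to jobs later.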