Slurm – MatriCS Platform

Commands for managing your “jobs”: Memo

Information about a job

sacct reports information about active or completed jobs.

sacct -j job-id
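As an illustration, the default sacct columns can be replaced with an explicit field list. The sketch below wraps this in a small shell function; the function name job_summary is hypothetical, while the field names are standard sacct fields.

```shell
# Hypothetical helper: show a compact accounting summary for one job.
# Usage: job_summary <job-id>
job_summary() {
    # --format selects the columns to display.
    sacct -j "$1" --format=JobID,JobName,Partition,State,Elapsed,MaxRSS,ExitCode
}
```

Calling job_summary with a job ID should print one line per job step, with the state, the elapsed time, and the exit code.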

sbatch submits a batch script to Slurm for later execution. The script will typically contain one or more srun commands to launch parallel tasks.

sbatch script.slurm
sbatch -x node037 my_script.sh -> submits the job while excluding compute node node037
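When a job is submitted from another script, it is often convenient to capture its job ID. A minimal sketch, assuming a single-cluster setup where sbatch --parsable prints only the job ID; the function name submit_job is hypothetical.

```shell
# Hypothetical helper: submit a script and print only the job ID.
# Usage: submit_job <script> [node-to-exclude]
submit_job() {
    script=$1
    excluded=${2:-}
    if [ -n "$excluded" ]; then
        # --exclude is the long form of -x
        sbatch --parsable --exclude="$excluded" "$script"
    else
        sbatch --parsable "$script"
    fi
}
```

For example, jobid=$(submit_job script.slurm node037) stores the job ID, which can then be passed to sacct -j or scancel.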

scancel cancels a pending or running job.

scancel job-id
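scancel can also select jobs by owner and state instead of by job ID. The sketch below cancels only jobs that have not started yet; the function name cancel_my_pending_jobs is hypothetical.

```shell
# Hypothetical helper: cancel all of the current user's jobs that are still pending.
# Running jobs are left untouched.
cancel_my_pending_jobs() {
    scancel --user="$(id -un)" --state=PENDING
}
```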

sinfo reports the state of partitions and nodes.

sinfo

The partition each node belongs to is shown in the output.

sinfo --states=idle

mix : consumable resources partially allocated
idle : available for resource requests
drain : unavailable for use at the system administrator's request
drng : currently executing a job, but will not be allocated additional jobs. The node changes to state DRAINED when the last job on it completes
alloc : consumable resources fully allocated
down : unavailable for use. Slurm can automatically place nodes in this state if a failure occurs.
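The states above can be used as filters. A minimal sketch that lists the nodes in a given state, one node per line; the function name nodes_in_state is hypothetical, and the format specifiers are standard sinfo ones.

```shell
# Hypothetical helper: list nodes in a given state, one node per line.
# Usage: nodes_in_state <state>     e.g. idle, mix, alloc, drain, down
nodes_in_state() {
    # --Node gives node-oriented output; --noheader drops the column titles.
    # %N = node name, %P = partition, %c = CPUs per node,
    # %m = memory per node (MB), %t = state in compact form
    sinfo --Node --noheader --states="$1" --format="%N %P %c %m %t"
}
```

For example, nodes_in_state mix lists the nodes that still have some free resources.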

squeue reports the state of jobs; --me restricts the list to your own jobs.

squeue --me

BF BOOT_FAIL Job terminated due to launch failure.
CA CANCELLED Job was explicitly cancelled.
CD COMPLETED Job has terminated all processes on all nodes with an exit code of zero.
CF CONFIGURING Job has been allocated resources, but is waiting for them to become ready for use.
CG COMPLETING Job is in the process of completing.
F FAILED Job terminated with a non-zero exit code or other failure condition.
NF NODE_FAIL Job terminated due to failure of one or more allocated nodes.
OOM OUT_OF_MEMORY Job experienced out of memory error.
PD PENDING Job is awaiting resource allocation.
PR PREEMPTED Job terminated due to preemption.
R RUNNING Job currently has an allocation.
RD RESV_DEL_HOLD Job is being held after requested reservation was deleted.
RF REQUEUE_FED Job is being requeued by a federation.
RH REQUEUE_HOLD Held job is being requeued.
RQ REQUEUED Completing job is being requeued.
RS RESIZING Job is about to change size.
SI SIGNALING Job is being signaled.
SE SPECIAL_EXIT The job was requeued in a special state.
SO STAGE_OUT Job is staging out files.
ST STOPPED Job has an allocation, but execution has been stopped with a SIGSTOP signal. CPUs have been retained by this job.
S SUSPENDED Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO TIMEOUT Job terminated upon reaching its time limit.
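The state names above can be passed to squeue as a filter. The sketch below shows your pending jobs together with the reason they are waiting; the function name pending_reasons is hypothetical, and the format specifiers are standard squeue ones.

```shell
# Hypothetical helper: show the current user's pending jobs and why they wait.
pending_reasons() {
    # %i = job ID, %j = job name, %P = partition,
    # %T = state (long form), %r = reason the job is in that state
    squeue --me --states=PENDING --format="%i %j %P %T %r"
}
```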

srun submits a job for execution in real time. It has a wide variety of options.

srun [options] command [arguments]
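A minimal sketch of a typical srun call, wrapped in a shell function so that the number of tasks is a parameter; the function name run_parallel is hypothetical.

```shell
# Hypothetical helper: run a command as N parallel tasks.
# Usage: run_parallel <ntasks> <command> [args...]
run_parallel() {
    ntasks=$1
    shift
    # Everything after the task count is passed to srun as the command to run.
    srun --ntasks="$ntasks" "$@"
}
```

For example, run_parallel 4 hostname should print the name of the node that each of the four tasks ran on. An interactive shell on a compute node can be requested in the same spirit with srun --pty bash.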

#SBATCH --partition=partition name
#SBATCH --job-name=job name
#SBATCH --output=file in which the standard output will be saved
#SBATCH --error=file in which the standard error will be saved
#SBATCH --input=file from which the standard input is read
#SBATCH --open-mode="append" to add to existing files, "truncate" to overwrite them
#SBATCH --mail-type=<BEGIN,END,FAIL,TIME_LIMIT,TIME_LIMIT_50,...> events that trigger an e-mail
#SBATCH --sockets-per-node=1 or 2
#SBATCH --threads-per-core=number of threads per core; not usable on the MatriCS platform because the nodes are not multithreaded (ask us if you need it)
#SBATCH --cores-per-socket=number of cores per socket
#SBATCH --cpus-per-task=number of CPUs for each task
#SBATCH --ntasks=number of tasks
#SBATCH --mem-per-cpu=RAM per allocated CPU (core)
#SBATCH --ntasks-per-node=number of tasks per node
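
Put together, these directives form the header of a batch script. The sketch below writes a minimal script to a file; the file name example.slurm, the partition name my_partition, and all resource values are placeholders to adapt, not MatriCS defaults.

```shell
# Write a minimal batch script to example.slurm (hypothetical file name).
cat > example.slurm <<'EOF'
#!/bin/bash
# All values below are placeholders; replace my_partition with a real partition name.
#SBATCH --partition=my_partition
#SBATCH --job-name=example
# %x expands to the job name and %j to the job ID
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G

# Launch one copy of the command per task
srun hostname
EOF
```

The script can then be submitted with sbatch example.slurm. Note that sbatch stops reading #SBATCH lines at the first command in the script, so all directives have to come before it.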