Description
CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by NVIDIA. It enables code execution on GPUs through an extension of C/C++ and a set of low-level and high-level APIs.
CUDA is used in:
- scientific computing and HPC
- AI and deep learning
- numerical simulation
- image and signal processing
- rendering and computer vision pipelines
Key points:
- Programming model based on kernels, threads, blocks, and grids
- Fine-grained GPU memory management (global, shared, constant, texture)
- Compilation via nvcc
- Profiling tools (Nsight Compute, Nsight Systems)
- Optimized libraries: cuBLAS, cuFFT, cuDNN, Thrust, etc. (a short cuBLAS sketch follows this list)
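For example, the cuBLAS routine cublasSaxpy computes y = alpha*x + y on the GPU. The sketch below is illustrative only (the file name saxpy.cu and the array values are assumptions); compile with nvcc saxpy.cu -lcublas -o saxpy.
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;
    float x[] = {1.0f, 2.0f, 3.0f, 4.0f};
    float y[] = {10.0f, 20.0f, 30.0f, 40.0f};
    const float alpha = 2.0f;
    float *d_x, *d_y;

    /* Allocate and fill device vectors */
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    /* y = alpha * x + y, computed by cuBLAS on the GPU */
    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);
    cublasDestroy(handle);

    /* Copy the result back and print it */
    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) {
        printf("y[%d] = %.1f\n", i, y[i]);
    }

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}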
Environment setup
ml cuda/13.0
- Available version(s): 12.6, 12.8, 12.9, 13.0 (default)
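To use a version other than the default, load it explicitly and check what the toolkit reports:
# Example with one of the other available versions
ml cuda/12.8
nvcc --version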
GPU monitoring
# On a GPU node
nvidia-smi
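For continuous monitoring, nvidia-smi can also refresh periodically or report selected fields as CSV:
# Refresh the full report every second
nvidia-smi -l 1
# Report utilization and memory usage as CSV, once per second
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 1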
Compilation
# No requirement to be on a GPU node
ml gcc
ml cuda
nvcc -o my_program my_program.cu
The file extension for CUDA source files is .cu.
Advanced CUDA compilation with host compiler warnings enabled (gcc)
# -Xcompiler forwards options directly to gcc
# -Wall, -Wextra : CPU-side warnings
nvcc -Xcompiler "-Wall -Wextra" add.cu -o add
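You can also target a specific GPU architecture with -arch. The sm_90 value below is an assumption matching Hopper-class GPUs (e.g. the H200 nodes of the bi-h200 partition used in the job script further down); adjust it to the hardware you run on:
# -arch selects the target compute capability (sm_90 = Hopper, e.g. H100/H200)
nvcc -arch=sm_90 -Xcompiler "-Wall -Wextra" add.cu -o add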
My first CUDA kernel
This program is a first example of parallel computing with CUDA.
Each GPU thread computes the sum of one pair of array elements, exploiting the massive parallelism of the GPU.
- Consider the following file add.cu:
#include <stdio.h>
#include <cuda_runtime.h>

/*
 * CUDA kernel:
 * Each thread computes one element of vector c
 */
__global__ void add(int *a, int *b, int *c, int N) {
    // Global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Check that the index is within array bounds
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 256;
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;

    /* Initialize arrays on the CPU */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    /* Allocate memory on the GPU */
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    /* Copy data from CPU to GPU */
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    /* Kernel launch configuration */
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    /* Launch the kernel on the GPU */
    add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

    /* Synchronize to ensure kernel completion */
    cudaDeviceSynchronize();

    /* Copy results from GPU to CPU */
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* Display results */
    int n_out = 10;
    printf("Displaying the first %d elements of the array\n", n_out);
    for (int i = 0; i < n_out; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    /* Free GPU memory */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
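The example above omits error handling for brevity. In real code it is good practice to check the return value of every CUDA runtime call; a common pattern (not part of the CUDA API itself) is a small checking macro, sketched below:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Abort with a readable message if a CUDA runtime call fails */
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",              \
                    __FILE__, __LINE__, cudaGetErrorString(err));     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* Usage, applied to calls from add.cu: */
/* CUDA_CHECK(cudaMalloc((void**)&d_a, N * sizeof(int))); */
/* CUDA_CHECK(cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice)); */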
- Job submission script launch.sh:
#!/bin/bash
#SBATCH --job-name=test_cuda
#SBATCH --partition=bi-h200
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00:10:00
#SBATCH --output=job-%j.out
ml gcc/15.2.0
ml cuda/13.0
nvcc -Xcompiler "-Wall -Wextra" add.cu -o add
./add
- Job submission
sbatch launch.sh
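If the job runs correctly, the output file job-<jobid>.out should contain the following (since a[i] = i and b[i] = 2*i, each result is 3*i):
Displaying the first 10 elements of the array
0 + 0 = 0
1 + 2 = 3
2 + 4 = 6
...
9 + 18 = 27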