CUDA

Description

CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by NVIDIA. It enables code execution on GPUs through an extension of C/C++ and a set of low-level and high-level APIs.

CUDA is used in:

  • scientific computing and HPC
  • AI and deep learning
  • numerical simulation
  • image and signal processing
  • rendering and computer vision pipelines

Key points:

  • Programming model based on kernels, threads, blocks, and grids
  • Fine-grained GPU memory management (global, shared, constant, texture); see the shared-memory sketch after this list
  • Compilation via nvcc
  • Profiling tools (Nsight Compute, Nsight Systems)
  • Optimized libraries: cuBLAS, cuFFT, cuDNN, Thrust, etc.
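
To make these notions concrete, here is a minimal sketch (illustrative, not taken from the official documentation) of a kernel that stages data in shared memory: each block sums its slice of an array, and the host adds the per-block partial sums. The name block_sum and the sizes N and TPB are arbitrary choices for this example.

#include <stdio.h>
#include <cuda_runtime.h>

#define N   1024
#define TPB 256   /* threads per block */

/* Each block computes the partial sum of its slice of x in shared memory */
__global__ void block_sum(const float *x, float *partial, int n) {
    __shared__ float cache[TPB];      /* shared memory: visible to the whole block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                  /* wait until every thread has written */

    /* Tree reduction within the block */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   /* one result per block */
}

int main(void) {
    float x[N], partial[N / TPB];
    float *d_x, *d_partial;

    for (int i = 0; i < N; i++) x[i] = 1.0f;   /* the sum should be N */

    cudaMalloc((void**)&d_x, N * sizeof(float));
    cudaMalloc((void**)&d_partial, (N / TPB) * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<N / TPB, TPB>>>(d_x, d_partial, N);
    cudaMemcpy(partial, d_partial, (N / TPB) * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int i = 0; i < N / TPB; i++) total += partial[i];
    printf("sum = %f (expected %d)\n", total, N);

    cudaFree(d_x);
    cudaFree(d_partial);
    return 0;
}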

Environment setup

ml cuda/13.0
  • Available version(s): 12.6, 12.8, 12.9, 13.0 (default)
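
To load a specific version instead of the default, pass it to ml, then check which compiler ends up in the path (nvcc --version prints the toolkit release):

ml cuda/12.8
nvcc --version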

GPU monitoring

# On a GPU node
nvidia-smi
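
The CUDA runtime API offers a programmatic complement to nvidia-smi; the following minimal sketch (compile it with nvcc and run it on a GPU node) lists the visible devices and a few of their properties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) visible\n", count);

    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MiB global memory, %d SMs\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024), prop.multiProcessorCount);
    }
    return 0;
}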

Compilation

# No requirement to be on a GPU node
ml gcc
ml cuda
nvcc -o my_program my_program.cu

The file extension for CUDA source files is .cu.

Advanced CUDA compilation with host compiler warnings enabled (gcc)

# -Xcompiler forwards the quoted options directly to the host compiler (gcc)
# -Wall, -Wextra: enable host-side (CPU) warnings
nvcc -Xcompiler "-Wall -Wextra" add.cu -o add
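
nvcc can also target the GPU architecture of the compute nodes explicitly with -arch. As an illustration, assuming H200 GPUs (compute capability 9.0, as on the bi-h200 partition used below); adjust sm_XX to your hardware:

# -arch selects the target GPU architecture (sm_90 = Hopper, e.g. H100/H200)
nvcc -arch=sm_90 -Xcompiler "-Wall -Wextra" add.cu -o add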

My first CUDA kernel

This program is a first example of parallel computing with CUDA.
Each GPU thread computes one element of the output array (the sum of the corresponding elements of the two input arrays), which exploits the massive parallelism of the GPU.

  • Consider the following file add.cu:
#include <stdio.h>
#include <cuda_runtime.h>

/*
 * CUDA kernel:
 * Each thread computes one element of vector c
 */
__global__ void add(int *a, int *b, int *c, int N) {
    // Global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Check that the index is within array bounds
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 256;
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;

    /* Initialize arrays on the CPU */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    /* Allocate memory on the GPU */
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    /* Copy data from CPU to GPU */
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    /* Kernel launch configuration */
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    /* Launch the kernel on the GPU */
    add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

    /* Synchronize to ensure kernel completion */
    cudaDeviceSynchronize();
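
    /* Check for launch or execution errors: cudaGetLastError() returns the
       most recent runtime error, cudaGetErrorString() makes it readable */
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }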

    /* Copy results from GPU to CPU */
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* Display results */
    int n_out = 10;
    printf("Displaying the first %d elements of the array\n", n_out);
    for (int i = 0; i < n_out; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    /* Free GPU memory */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
  • Job submission script launch.sh:
#!/bin/bash

#SBATCH --job-name=test_cuda
#SBATCH --partition=bi-h200
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00:10:00
#SBATCH --output=job-%j.out

ml gcc/15.2.0
ml cuda/13.0
nvcc -Xcompiler "-Wall -Wextra" add.cu -o add
./add
  • Job submission
sbatch launch.sh
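
Once the job completes, the output file job-%j.out (%j is replaced by the job ID) should contain something like:

Displaying the first 10 elements of the array
0 + 0 = 0
1 + 2 = 3
2 + 4 = 6
...
9 + 18 = 27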

Documentation

  • CUDA Toolkit Documentation: https://docs.nvidia.com/cuda/