CUDA

Description

CUDA (Compute Unified Device Architecture) is a parallel computing platform developed by NVIDIA. It enables code execution on GPUs through an extension of C/C++ and a set of low-level and high-level APIs.

CUDA is used in:

  • scientific computing and HPC
  • AI and deep learning
  • numerical simulation
  • image and signal processing
  • rendering and computer vision pipelines

Key points:

  • Programming model based on kernels, threads, blocks, and grids
  • Fine-grained GPU memory management (global, shared, constant, texture); see the shared-memory sketch after this list
  • Compilation via nvcc
  • Profiling tools (Nsight Compute, Nsight Systems)
  • Optimized libraries: cuBLAS, cuFFT, cuDNN, Thrust, etc.
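
To make these notions concrete, here is a minimal sketch (illustrative, not taken from the official documentation) of a kernel that stages data in shared memory: each block sums its slice of an array, and the host adds the per-block partial sums. The name block_sum and the sizes N and TPB are arbitrary choices for this example.

#include <stdio.h>
#include <cuda_runtime.h>

#define N   1024
#define TPB 256   /* threads per block */

/* Each block computes the partial sum of its slice of x in shared memory */
__global__ void block_sum(const float *x, float *partial, int n) {
    __shared__ float cache[TPB];      /* shared memory: visible to the whole block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    cache[threadIdx.x] = (i < n) ? x[i] : 0.0f;
    __syncthreads();                  /* wait until every thread has written */

    /* Tree reduction within the block */
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        partial[blockIdx.x] = cache[0];   /* one result per block */
}

int main(void) {
    float x[N], partial[N / TPB];
    float *d_x, *d_partial;

    for (int i = 0; i < N; i++) x[i] = 1.0f;   /* the sum should be N */

    cudaMalloc((void**)&d_x, N * sizeof(float));
    cudaMalloc((void**)&d_partial, (N / TPB) * sizeof(float));
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<N / TPB, TPB>>>(d_x, d_partial, N);
    cudaMemcpy(partial, d_partial, (N / TPB) * sizeof(float), cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int i = 0; i < N / TPB; i++) total += partial[i];
    printf("sum = %f (expected %d)\n", total, N);

    cudaFree(d_x);
    cudaFree(d_partial);
    return 0;
}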

Environment setup

ml cuda/13.0
  • Available version(s): 12.6, 12.8, 12.9, 13.0 (default)
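
To load a specific version instead of the default, pass it to ml, then check which compiler ends up in the path (nvcc --version prints the toolkit release):

ml cuda/12.8
nvcc --version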

GPU monitoring

# On a GPU node
nvidia-smi
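
The CUDA runtime API offers a programmatic complement to nvidia-smi; the following minimal sketch (compile it with nvcc and run it on a GPU node) lists the visible devices and a few of their properties:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("%d CUDA device(s) visible\n", count);

    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s, compute capability %d.%d, %zu MiB global memory, %d SMs\n",
               d, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / (1024 * 1024), prop.multiProcessorCount);
    }
    return 0;
}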

Compilation

# No requirement to be on a GPU node
ml gcc
ml cuda
nvcc -o my_program my_program.cu

The file extension for CUDA source files is .cu.

Advanced CUDA compilation with host compiler warnings enabled (gcc)

# -Xcompiler forwards the quoted options directly to the host compiler (gcc)
# -Wall, -Wextra: enable host-side (CPU) warnings
nvcc -Xcompiler "-Wall -Wextra" add.cu -o add
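
nvcc can also target the GPU architecture of the compute nodes explicitly with -arch. As an illustration, assuming H200 GPUs (compute capability 9.0, as on the bi-h200 partition used below); adjust sm_XX to your hardware:

# -arch selects the target GPU architecture (sm_90 = Hopper, e.g. H100/H200)
nvcc -arch=sm_90 -Xcompiler "-Wall -Wextra" add.cu -o add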

My first CUDA kernel

This program is a first example of parallel computing with CUDA.
Each GPU thread computes one element of the output array (the sum of the corresponding elements of the two input arrays), which exploits the massive parallelism of the GPU.

  • Consider the following file add.cu:
#include <stdio.h>
#include <cuda_runtime.h>

/*
 * CUDA kernel:
 * Each thread computes one element of vector c
 */
__global__ void add(int *a, int *b, int *c, int N) {
    // Global thread index
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Check that the index is within array bounds
    if (i < N) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 256;
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;

    /* Initialize arrays on the CPU */
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = 2 * i;
    }

    /* Allocate memory on the GPU */
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    /* Copy data from CPU to GPU */
    cudaMemcpy(d_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    /* Kernel launch configuration */
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    /* Launch the kernel on the GPU */
    add<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);

    /* Synchronize to ensure kernel completion */
    cudaDeviceSynchronize();
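
    /* Check for launch or execution errors: cudaGetLastError() returns the
       most recent runtime error, cudaGetErrorString() makes it readable */
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }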

    /* Copy results from GPU to CPU */
    cudaMemcpy(c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    /* Display results */
    int n_out = 10;
    printf("Displaying the first %d elements of the array\n", n_out);
    for (int i = 0; i < n_out; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    /* Free GPU memory */
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}
  • Job submission script launch.sh:
#!/bin/bash

#SBATCH --job-name=test_cuda
#SBATCH --partition=bi-h200
#SBATCH --gres=gpu:1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=00:10:00
#SBATCH --output=job-%j.out

ml gcc/15.2.0
ml cuda/13.0
nvcc -Xcompiler "-Wall -Wextra" add.cu -o add
./add
  • Job submission
sbatch launch.sh
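
Once the job completes, the output file job-%j.out (%j is replaced by the job ID) should contain something like:

Displaying the first 10 elements of the array
0 + 0 = 0
1 + 2 = 3
2 + 4 = 6
...
9 + 18 = 27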

Documentation

  • CUDA Toolkit Documentation: https://docs.nvidia.com/cuda/