# CUDA (Compute Unified Device Architecture)

* Collection of accelerated libraries and extensions for C, C++ and Fortran
* Low-level programming model with full control over data placement and code execution
  <!-- * Kernels (device code) and host code use the same programming language -->
* Kernels (device code) cannot run on host CPUs
* CUDA code is not standard C/C++/Fortran, so compiling it requires a CUDA-aware compiler (NVIDIA or LLVM)
* Proprietary software, closed source
* Available since 2007 $\Rightarrow$ most probably the market leader
* Comprehensive library ecosystem (e.g. cuBLAS, cuFFT)
* Links to external resources:
  * [CUDA Homepage](https://developer.nvidia.com/cuda) (documentation, programming guide, tutorials and samples)
  * [CUDA Programming Guide](https://docs.nvidia.com/cuda/cuda-programming-guide/index.html)

## Supported Compilers

* LLVM
* NVIDIA CUDA Compiler (`nvcc`, shipped with the CUDA Toolkit and the NVIDIA HPC SDK)

## Hardware Portability

* NVIDIA GPUs

## Example Code

* Source code available in [exampleCUDA.cu](../src/exampleCUDA.cu)

* Platform information

  ```c++
  int runtime_version = 0;
  error_id = cudaRuntimeGetVersion(&runtime_version);
  cout << "CUDA version: "
       << runtime_version / 1000 << "." << (runtime_version % 100) / 10
       << endl;

  int driver_version = 0;
  error_id = cudaDriverGetVersion(&driver_version);
  cout << "CUDA driver version: "
       << driver_version / 1000 << "." << (driver_version % 100) / 10
       << endl;

  int num_devices = 0;
  error_id = cudaGetDeviceCount(&num_devices);
  if (error_id == cudaErrorNoDevice || num_devices == 0) {
      errx(1, "No CUDA device found");
  }
  cout << "Number of CUDA devices: " << num_devices << endl;

  int device_num = 0;
  error_id = cudaSetDevice(device_num);
  cout << "CUDA Device number: " << device_num << endl;

  size_t memory_free = 0, memory_total = 0;
  error_id = cudaMemGetInfo(&memory_free, &memory_total);
  cout << "Memory on CUDA device: "
       << memory_total / (1024. * 1024. * 1024.) << " GiB"
       << endl;
  cout << "Free Memory on CUDA device: "
       << memory_free / (1024. * 1024. * 1024.) << " GiB"
       << endl;

  cudaDeviceProp device_properties;
  error_id = cudaGetDeviceProperties(&device_properties, device_num);
  cout << "CUDA device name: "
       << device_properties.name
       << endl;
  cout << "CUDA capability: "
       << device_properties.major << "." << device_properties.minor
       << endl;
  cout << "CUDA device max clock rate: "
       << device_properties.clockRate / 1000000. << " GHz"
       << endl;
  cout << "CUDA device max memory clock rate: "
       << device_properties.memoryClockRate / 1000000. << " GHz"
       << endl;
  cout << "CUDA device compute mode: "
       << device_properties.computeMode
       << endl;
  ```

* A list of devices can be obtained with `nvidia-smi --list-gpus`
* Devices can be selected by setting the environment variable `CUDA_VISIBLE_DEVICES`

* Allocate memory on the host

  ```c++
  double *a = (double *) malloc(size * sizeof(double));
  if (a == NULL) {
      errx(1, "malloc a[] failed");
  }
  ```

* Allocate memory on the device

  ```c++
  double *device_a;
  if (cudaMalloc(&device_a, size * sizeof(double)) != cudaSuccess) {
      errx(1, "cudaMalloc device_a[] failed");
  }
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c++
  for (unsigned int i = 0; i < size; i++) {
      a[i] = 1.;
  }
  ```

* Copy data from the host to the device

  ```c++
  error_id = cudaMemcpy(device_a, a, size * sizeof(double), cudaMemcpyHostToDevice);
  ```

* Compute kernel definition

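  ```c++
  // Each thread increments one element; size is passed so threads past the end can exit
  __global__ void inc_kernel(double *device_a, size_t size) {
      const size_t i = (size_t) blockDim.x * blockIdx.x + threadIdx.x;
      if (i < size) {
          device_a[i]++;
      }
  }
  ```

* Compute on the device

  ```c++
  int blockSize = 256;
  // Round up so a partial last block covers sizes that are not a multiple of blockSize
  int numBlocks = (size + blockSize - 1) / blockSize;
  inc_kernel<<<numBlocks, blockSize>>>(device_a, size);
  ```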


* Transfer data back from the device to the host

  ```c++
  error_id = cudaMemcpy(a, device_a, size * sizeof(double), cudaMemcpyDeviceToHost);
  ```


* Delete data on the device

  ```c++
  error_id = cudaFree(device_a);
  ```


* Post-process data on the host
  e.g. write data to storage

  ```c++
  for (unsigned int i = 0; i < size; i++) {
      if (a[i] != 2.) {
          errx(2, "Computation on GPU failed");
      }
  }
  ```


* Free memory on the host

  ```c++
  free(a);
  ```
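
The launch configuration in the "Compute on the device" step depends on how `size` relates to the block size. The following host-only C++ sketch emulates the CUDA index calculation on the CPU to illustrate why rounding the block count up, combined with a bounds guard, touches every element exactly once. The helper names `blocks_for` and `emulate_inc_kernel` are hypothetical; they are not part of the CUDA API or of `exampleCUDA.cu`, and no GPU is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: number of blocks needed to cover `size` elements
// (ceiling division), as one would pass as the grid size in <<<...>>>.
std::size_t blocks_for(std::size_t size, std::size_t block_size) {
    assert(block_size > 0);
    return (size + block_size - 1) / block_size;
}

// Hypothetical CPU emulation of a 1D increment kernel: the two loops stand in
// for blockIdx.x and threadIdx.x, and the guard mirrors `if (i < size)`.
void emulate_inc_kernel(std::vector<double> &a, std::size_t block_size) {
    const std::size_t num_blocks = blocks_for(a.size(), block_size);
    for (std::size_t block = 0; block < num_blocks; block++) {
        for (std::size_t thread = 0; thread < block_size; thread++) {
            const std::size_t i = block_size * block + thread;
            if (i < a.size()) {  // emulated threads past the end do nothing
                a[i]++;
            }
        }
    }
}
```

For example, `blocks_for(1000, 256)` yields 4 blocks (1024 emulated threads), whereas the truncating division `1000 / 256` yields 3 blocks and would leave the last 232 elements untouched.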

### Compilation

```bash
# LLVM
clang++-21 \
    -O2 -march=native -Wall -Wextra -fuse-ld=lld -Wl,-rpath=/usr/lib/llvm-21/lib \
    -x cuda --offload-arch=native \
    "../src/exampleCUDA.cu" -o "../bin/exampleCUDA" \
    -L/usr/local/cuda/targets/x86_64-linux/lib -lcudart
```

```bash
# NVIDIA CUDA Compiler
module purge
module add devel/cuda/12.9
nvcc \
    -O2 -ccbin='gcc' --forward-unknown-to-host-compiler -march=native -Wall -Wextra \
    --gpu-architecture=native \
    "../src/exampleCUDA.cu" -o "../bin/exampleCUDA"
```

### Execution

```bash
module purge
module add devel/cuda/12.9
../bin/exampleCUDA
```

```text
CUDA version: 12.9
CUDA driver version: 12.8
Number of CUDA devices: 4
CUDA Device number: 0
Memory on CUDA device: 93.0955 GiB
Free Memory on CUDA device: 92.554 GiB
CUDA device name: NVIDIA H100
CUDA capability: 9.0
CUDA device max clock rate: 1.98 GHz
CUDA device max memory clock rate: 1.593 GHz
CUDA device compute mode: 0
* Allocate memory on the host
* Allocate memory on the device
* Pre-process / initialize data on the host
  e.g. read data from storage
* Copy data from the host to the device
* Compute on the device
* Transfer data back from the device to the host
* Delete data on the device
* Post-process data on the host
  e.g. write data to storage
* Free memory on the host
```
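
The two version lines in the output are decoded from single integers: the CUDA runtime and driver report a version as `1000 * major + 10 * minor`, so 12090 corresponds to 12.9. A minimal host-only sketch of that decoding follows. The struct `CudaVersion` and the functions `decode_cuda_version` and `format_cuda_version` are hypothetical names introduced for illustration, not CUDA API.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for a decoded version number.
struct CudaVersion {
    int major_version;
    int minor_version;
};

// Decode an integer of the form 1000 * major + 10 * minor, as returned by
// cudaRuntimeGetVersion() and cudaDriverGetVersion().
CudaVersion decode_cuda_version(int encoded) {
    assert(encoded >= 0);
    return CudaVersion{encoded / 1000, (encoded % 1000) / 10};
}

// Format the decoded version as "major.minor".
std::string format_cuda_version(const CudaVersion &v) {
    return std::to_string(v.major_version) + "." + std::to_string(v.minor_version);
}
```

For single-digit minor versions, the expression `(v % 100) / 10` used in the platform information listing gives the same minor number.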
