# A comparison of GPU offloading techniques

<!-- Author: Holger Obermaier -->
<!-- Date: 2025-12-04 -->

* Why offload to GPUs?
  * Dedicated fast memory (e.g. HBM)
  * Many parallel execution units
  * Energy efficiency
    <!-- Same performance using CPUs would exceed the energy budget. -->
  * The majority of HoreKA's computing power comes from GPUs
* Offloading challenges
  * Manage separate CPU and GPU memories
  * Hide or avoid data transfers between CPU and GPU
  * Hide the higher GPU memory latency
  * Keep many, individually weaker, GPU execution units busy
* Many techniques for GPU offloading
  * Compiler pragmas
  * Programming language extensions
  * Libraries
* Comparison based on
  * Usability, simplicity
    <!-- * Low-level / High-level implementation --> 
    <!-- * Degree of abstraction -->
  * Porting effort for existing codes
  * Achievable performance
    <!-- * data management (layout, placement) -->
    <!-- * explicit vs. implicit data copies (Unified Shared Memory (USM)) -->
    <!-- * distinctly separated memory between CPU and GPU vs. Unified Memory Architecture (UMA) -->
    <!-- * device selection -->
  * Supported compilers
    <!-- * only compilers available on HoreKA (GNU, LLVM, Intel oneAPI Compiler, NVIDIA HPC SDK Compiler) -->
  * Hardware portability
* Comparison shows: **No clear winner**
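
The challenge "hide or avoid data transfers" from the list above can be made concrete with a small sketch. It uses OpenMP target directives, one of the pragma-based techniques compared later. The function name `scale_then_shift` and its parameters are hypothetical and exist only for this illustration. The enclosing `target data` region is assumed to keep the array resident on the device, so both kernels work on the same device copy and the data crosses the host-device link only once in each direction.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the example notebooks): applies two
 * kernels to a[0:size] with a single transfer in each direction. */
void scale_then_shift(double *a, size_t size, double factor, double offset)
{
    assert(a != NULL);

    /* Copy a[] to the device once on entry and back once on exit. */
    #pragma omp target data map(tofrom: a[0:size])
    {
        /* First kernel: the data is already present on the device. */
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < size; i++) {
            a[i] *= factor;
        }

        /* Second kernel: reuses the device copy, no host round trip. */
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < size; i++) {
            a[i] += offset;
        }
    }
}
```

Without the enclosing `target data` region, each kernel would map the array on its own, which typically means two transfers per kernel. If the compiler has no offloading support, the regions simply run on the host and produce the same result.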

## Example Program

  * For each offloading technique, one or more *example* source codes
  * They demonstrate usability, simplicity, and achievable performance
  * Not a full introduction to these offloading techniques!

### GPU offloading workflow

  * Retrieve platform information
  * Allocate memory on the host
    <!-- Host and device each have their own dedicated memory -->

    ```c
    // Needs <stdlib.h> for malloc() and <err.h> for errx();
    // size is the number of array elements.
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        errx(1, "malloc a[] failed");
    }
    ```

  * Pre-process / initialize data on the host (e.g. read data from storage)
    <!-- GPUs normally do not support I/O operations -->

    ```c
    for (unsigned int i = 0; i < size; i++) {
        a[i] = 1.;
    }
    ```

  * Allocate memory on the device
  * Copy data from the host to the device
    <!-- * Transfer through PCIe, much slower than memory access -> Expensive operation -->
    <!-- * Should be avoided as much as possible -->
  * Compute on the device
    <!-- Should be faster / accelerated compared to the host:
         * Faster memory (e.g. HBM)
         * More parallel execution units -->

    ```c
    for (unsigned int i = 0; i < size; i++) {
        a[i]++;
    }
    ```

  * Transfer data back from the device to the host
  * Delete data on the device
  * Post-process data on the host (e.g. write data to storage)

    ```c
    for (unsigned int i = 0; i < size; i++) {
        if (a[i] != 2.) {
            errx(2, "Computation on GPU failed");
        }
    }
    ```
  
  * Free host memory

    ```c
    free(a);
    ```
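
The device-side steps of the workflow above (allocate, copy, compute, copy back, delete) have no code attached because they differ per technique. As one possible illustration, the following sketch maps all steps to OpenMP target directives. The function name `offload_increment` and its return codes are hypothetical, and the snippet is an assumption about how the pieces could fit together, not the code from the linked example notebooks.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: runs the workflow for `size` elements.
 * Returns 0 on success, 1 on allocation failure, 2 on a wrong result. */
int offload_increment(size_t size)
{
    assert(size > 0);

    /* Retrieve platform information (only when built with OpenMP). */
#ifdef _OPENMP
    printf("OpenMP devices available: %d\n", omp_get_num_devices());
#endif

    /* Allocate and initialize memory on the host. */
    double *a = (double *) malloc(size * sizeof(double));
    if (a == NULL) {
        return 1;
    }
    for (size_t i = 0; i < size; i++) {
        a[i] = 1.;
    }

    /* Allocate memory on the device and copy the host data to it. */
    #pragma omp target enter data map(to: a[0:size])

    /* Compute on the device. */
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < size; i++) {
        a[i]++;
    }

    /* Transfer data back to the host and delete it on the device. */
    #pragma omp target exit data map(from: a[0:size])

    /* Post-process on the host: verify the result. */
    int status = 0;
    for (size_t i = 0; i < size; i++) {
        if (a[i] != 2.) {
            status = 2;
            break;
        }
    }

    /* Free host memory. */
    free(a);
    return status;
}
```

Calling `offload_increment(1000)` from `main` should return 0. When the code is built without offloading support, the target regions fall back to the host, so the result is the same but nothing is accelerated.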

## Comparison

GPU offloading based on:

* Compiler pragmas
  * [OpenMP (Open Multi-Processing)](./exampleOpenMP.ipynb)
  * [OpenACC (Open Accelerators)](./exampleOpenACC.ipynb)
* Programming language extensions
  * [C++ Standard Parallelism](./exampleStdpar.ipynb)
  * [SYCL](./exampleSYCL.ipynb)
  * [CUDA (Compute Unified Device Architecture)](./exampleCUDA.ipynb)
  * [HIP (Heterogeneous-Compute Interface for Portability)](./exampleHIP.ipynb)
  * [Thrust](./exampleThrust.ipynb)
* Libraries
  * [OpenCL (Open Computing Language)](./exampleOpenCL.ipynb)
  * [Kokkos](./exampleKokkos.ipynb)