# OpenMP (Open Multi-Processing)

* Directive-based parallel programming model for C, C++ and Fortran
  <!-- * primary mode of programming is directives -->
  <!-- * Additional runtime functions -->
* Originally targeted only shared-memory multiprocessing on CPUs
* GPU (accelerator) offload support added in OpenMP 4.0 (2013) and extended in later versions
* Managed by the nonprofit corporation *OpenMP Architecture Review Board* (ARB)
  <!-- * multiple Accelerators of same kind -->
  <!-- * OpenMP 5.0 improves performance portability with meta directive
         -> allows different variants of directives for different accelerators -->
  <!-- * GPU multiple level of parallelism: SIMD, threads (parallel for), thread blocks (teams distribute) -->
  <!-- * A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. For better process and data mapping, threads are grouped into thread blocks. The threads in the same thread block run on the same stream processor. Threads in the same block can communicate with each other via shared memory, barrier synchronization or other synchronization primitives such as atomic operations. -->
* Links to external resources:
  * [OpenMP Homepage](https://www.openmp.org) (Specification and Reference Guides)
  * [OpenMP 6.0 Reference Guide](https://www.openmp.org/wp-content/uploads/OpenMP-RefGuide-6.0-OMP60SC24-web.pdf)
  * [OpenMP 6.0 API Specification](https://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-6-0.pdf)
  * [OpenMP Examples Document](https://github.com/OpenMP/Examples)
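
To illustrate the directive-based model in its original shared-memory form, here is a minimal sketch of a host-side parallel loop. The function name `host_sum` is hypothetical and not part of the example sources linked in this document.

```c
#include <stddef.h>

/* Sum an array on the host. The directive asks the compiler to split the
   loop iterations across CPU threads and to combine the per-thread partial
   sums. When compiled without OpenMP support the pragma is ignored and the
   loop simply runs serially. */
double host_sum(const double *a, size_t n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

The serial code stays intact; the parallelism is expressed only through the directive, which is the main idea behind OpenMP.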

## Supported Compilers

* AMD ROCm Compiler
* GCC
* Intel oneAPI Compiler
* LLVM
* NVIDIA HPC SDK Compiler

## Hardware Portability

* CPUs
* AMD GPUs
* Intel GPUs
* NVIDIA GPUs

## Example Code

* Source code available in [exampleOpenMP.c](../src/exampleOpenMP.c)

* Include OpenMP support in your code

  ```c
  #ifdef _OPENMP
      #include <omp.h>
  #else
      #error "OpenMP support required"
  #endif

  int main() {
      // ...

      return 0;
  }
  ```

* Platform information

  ```c
  printf("OpenMP version (yyyymm): %i\n", _OPENMP);
  printf("Number of OpenMP devices: %i\n", omp_get_num_devices());
  ```
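
The `_OPENMP` macro holds the release date of the supported specification rather than a version number. As a small sketch, the hypothetical helper `openmp_spec_version` below translates the date into the corresponding specification version; it is not part of the OpenMP API.

```c
/* Map the yyyymm value of the _OPENMP macro to the specification version
   that was released on that date. Unknown dates yield "unknown". */
const char *openmp_spec_version(long date) {
    switch (date) {
        case 200805L: return "3.0";
        case 201107L: return "3.1";
        case 201307L: return "4.0";
        case 201511L: return "4.5";
        case 201811L: return "5.0";
        case 202011L: return "5.1";
        case 202111L: return "5.2";
        case 202411L: return "6.0";
        default:      return "unknown";
    }
}
```

A possible use is `printf("OpenMP %s\n", openmp_spec_version(_OPENMP));` inside code that is compiled with OpenMP enabled.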

* Select the default device by setting the environment variable `OMP_DEFAULT_DEVICE`
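
A device can also be chosen in the code itself, either per construct with the `device` clause or globally with `omp_set_default_device()`. The sketch below uses the `device` clause and reports where the region actually ran. The helper names `current_default_device` and `target_ran_on_host` are hypothetical, and the non-OpenMP fallbacks are an assumption added so the snippet also builds without OpenMP.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Return the device number that target constructs use by default. */
int current_default_device(void) {
#ifdef _OPENMP
    return omp_get_default_device();
#else
    return 0; /* assumption: no OpenMP, so only the host exists */
#endif
}

/* Run an empty target region on device `dev` and return 1 if it executed
   on the host (initial device), 0 if it executed on an accelerator. */
int target_ran_on_host(int dev) {
    int on_host = 1;
#ifdef _OPENMP
    #pragma omp target device(dev) map(from: on_host)
    {
        on_host = omp_is_initial_device();
    }
#else
    (void) dev; /* unused without OpenMP */
#endif
    return on_host;
}
```

Calling `target_ran_on_host(current_default_device())` is a quick way to notice a silent host fallback, for example when no offload target was compiled in.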

* Allocate memory on the host

  ```c
  double *a = (double *) malloc(size * sizeof(double));
  if (a == NULL) {
      errx(1, "malloc a[] failed");
  }
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c
  for (unsigned int i = 0; i < size; i++) {
      a[i] = 1.;
  }
  ```

* Automatically allocate memory on the device
* Copy data from the host to the device

  ```c
  #pragma omp target enter data map(to: a[0:size])
  {}
  ```
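
As an alternative to the unstructured `target enter data` directive, a structured `target data` region ties the lifetime of the device copy to a code block. The following is a minimal sketch under that assumption; the function name `increment_twice` is hypothetical.

```c
#include <stddef.h>

/* Keep a[] resident on the device for two consecutive kernels.
   map(tofrom:) copies the array to the device when the block is entered
   and back to the host when the block is left. */
void increment_twice(double *a, size_t n) {
    #pragma omp target data map(tofrom: a[0:n])
    {
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i++) {
            a[i] += 1.0;
        }

        /* No transfer happens between the two kernels. */
        #pragma omp target teams distribute parallel for
        for (size_t i = 0; i < n; i++) {
            a[i] += 1.0;
        }
    }
}
```

The unstructured form is the better fit when allocation and release happen in different functions, as in the step-by-step walkthrough here.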

* Compute on the device

  ```c
  #pragma omp target teams distribute parallel for
  for (unsigned int i = 0; i < size; i++) {
      a[i]++;
  }
  ```
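
The combined construct spreads the loop over two levels: `teams distribute` assigns chunks of iterations to teams (thread blocks on a GPU), and `parallel for` splits each chunk among the threads of a team. A reduction over both levels might look like the sketch below; `device_sum` is a hypothetical name.

```c
#include <stddef.h>

/* Sum an array on the device. The scalar result is mapped tofrom so that
   the reduced value is copied back to the host. If a[] is already present
   on the device, map(to:) does not trigger another transfer. */
double device_sum(const double *a, size_t n) {
    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+ : sum) \
        map(to: a[0:n]) map(tofrom: sum)
    for (size_t i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```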

* Transfer data back from the device to the host

  ```c
  #pragma omp target update from(a[0:size])
  {}
  ```
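
`target update` works in both directions, which helps when host and device steps alternate on the same data. The sketch below assumes such a workflow; the function names are hypothetical. It refreshes a single element on the device after a host-side edit instead of mapping the whole array again.

```c
#include <stddef.h>

/* Increment every element on the device; a[] must already be mapped. */
static void increment_on_device(double *a, size_t n) {
    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; i++) {
        a[i] += 1.0;
    }
}

/* Alternate device and host work on the same array. */
void mixed_host_device_steps(double *a, size_t n) {
    if (n == 0) {
        return;
    }
    #pragma omp target enter data map(to: a[0:n])
    increment_on_device(a, n);

    /* Bring the device result to the host before editing it there. */
    #pragma omp target update from(a[0:n])
    a[0] = 10.0;

    /* Push only the edited element back to the device. */
    #pragma omp target update to(a[0:1])
    increment_on_device(a, n);

    #pragma omp target update from(a[0:n])
    #pragma omp target exit data map(release: a[0:n])
}
```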

* Delete data on the device

  ```c
  #pragma omp target exit data map(release: a[0:size])
  {}
  ```
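
With `map(release: ...)` the reference count of the mapping is decremented and the device copy is freed once the count reaches zero, without any copy back to the host. If the final transfer and the release should happen in one step, `map(from: ...)` on `target exit data` can replace the separate `target update`. A minimal sketch, with the hypothetical function name `increment_with_exit_copy`:

```c
#include <stddef.h>

/* Map a[] to the device, increment it there, and copy the result back
   while releasing the device copy in a single exit directive. */
void increment_with_exit_copy(double *a, size_t n) {
    #pragma omp target enter data map(to: a[0:n])

    #pragma omp target teams distribute parallel for
    for (size_t i = 0; i < n; i++) {
        a[i] += 1.0;
    }

    /* Copies back only if this drops the reference count to zero. */
    #pragma omp target exit data map(from: a[0:n])
}
```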

* Post-process data on the host
  e.g. write data to storage

  ```c
  for (unsigned int i = 0; i < size; i++) {
      if (a[i] != 2.) {
          errx(2, "Computation on GPU failed");
      }
  }
  ```

* Free memory on the host

  ```c
  free(a);
  ```

### Compilation

In [None]:
#!/usr/bin/bash
# AMD ROCm Compiler
! amdclang \
    -O2 -march=native -flto -Wall -Wextra \
    -fopenmp --offload-arch=native -foffload-lto \
    "../src/exampleOpenMP.c" -o "../bin/exampleOpenMP"

In [None]:
#!/usr/bin/bash
# GCC
! gcc \
    -O2 -march=native -Wall -Wextra \
    -fopenmp \
    "../src/exampleOpenMP.c" -o "../bin/exampleOpenMP"

In [None]:
#!/usr/bin/bash
# LLVM
! clang-21 \
    -O2 -march=native -Wall -Wextra -fuse-ld=lld -Wl,-rpath=/usr/lib/llvm-21/lib \
    -fopenmp --offload-arch=native -foffload-lto \
    "../src/exampleOpenMP.c" -o "../bin/exampleOpenMP"

In [None]:
#!/usr/bin/bash
# Intel oneAPI Compiler
! icx \
    -O2 -xHost -ipo -Wall -Wextra \
    -qopenmp -fopenmp-targets=spir64 -foffload-lto \
    "../src/exampleOpenMP.c" -o "../bin/exampleOpenMP"

In [None]:
#!/usr/bin/bash
# NVIDIA HPC SDK Compiler
! module purge; \
  module add toolkit/nvidia-hpc-sdk/25.3; \
  nvc \
    -O2 -tp=host -Minform=inform \
    -mp -target=gpu -gpu=ccnative \
    "../src/exampleOpenMP.c" -o "../bin/exampleOpenMP"

### Execution

In [None]:
#!/usr/bin/bash
! export OMP_PROC_BIND=close; \
  export OMP_PLACES=cores; \
  export OMP_TARGET_OFFLOAD=MANDATORY; \
  export OMP_DISPLAY_ENV=VERBOSE; \
  export OMP_DISPLAY_AFFINITY=TRUE; \
  ../bin/exampleOpenMP

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP='202011'
  OMP_SCHEDULE='STATIC',0
  OMP_NUM_THREADS='1'
  OMP_NUM_TEAMS='0'
  OMP_NUM_TEAMS_DEV_0='0'
  OMP_NUM_TEAMS_DEV_1='0'
  OMP_NUM_TEAMS_DEV_2='0'
  OMP_NUM_TEAMS_DEV_3='0'
  OMP_NUM_TEAMS_DEV_4='0'
  OMP_NUM_TEAMS_DEV_5='0'
  OMP_NUM_TEAMS_DEV_6='0'
  OMP_NUM_TEAMS_DEV_7='0'
  OMP_NUM_TEAMS_DEV_8='0'
  OMP_NUM_TEAMS_DEV_9='0'
  OMP_NUM_TEAMS_DEV_10='0'
  OMP_NUM_TEAMS_DEV_11='0'
  OMP_NUM_TEAMS_DEV_12='0'
  OMP_NUM_TEAMS_DEV_13='0'
  OMP_NUM_TEAMS_DEV_14='0'
  OMP_NUM_TEAMS_DEV_15='0'
  OMP_DYNAMIC='TRUE'
  OMP_PROC_BIND='close'
  OMP_PLACES='cores'
  OMP_STACKSIZE='0B,0B'
  OMP_STACKSIZE_DEV_0='0B'
  OMP_STACKSIZE_DEV_1='0B'
  OMP_STACKSIZE_DEV_2='0B'
  OMP_STACKSIZE_DEV_3='0B'
  OMP_STACKSIZE_DEV_4='0B'
  OMP_STACKSIZE_DEV_5='0B'
  OMP_STACKSIZE_DEV_6='0B'
  OMP_STACKSIZE_DEV_7='0B'
  OMP_STACKSIZE_DEV_8='0B'
  OMP_STACKSIZE_DEV_9='0B'
  OMP_STACKSIZE_DEV_10='0B'
  OMP_STACKSIZE_DEV_11='0B'
  OMP_STACKSIZE_DEV_12='0B'
  OMP_STACKSIZ

## Example Code using Unified Shared Memory (USM)

* Source code available in [exampleOpenMP-unified_shared_memory.c](../src/exampleOpenMP-unified_shared_memory.c)

* Include OpenMP support in your code

  ```c
  #ifdef _OPENMP
      #include <omp.h>
  #else
      #error "OpenMP support required"
  #endif
  #pragma omp requires unified_shared_memory

  int main() {
      // ...

      return 0;
  }
  ```

* Platform information
 
  ```c
  printf("OpenMP version (yyyymm): %i\n", _OPENMP);
  printf("Number of OpenMP devices: %i\n", omp_get_num_devices());
  ```

* Allocate memory on the host

  ```c
  double *a = (double *) malloc(size * sizeof(double));
  if (a == NULL) {
      errx(1, "malloc a[] failed");
  }
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c
  for (unsigned int i = 0; i < size; i++) {
      a[i] = 1.;
  }
  ```

* Automatically allocate memory on the device
* Automatically copy data from the host to the device

* Compute on the device

  ```c
  #pragma omp target teams distribute parallel for
  for (unsigned int i = 0; i < size; i++) {
      a[i]++;
  }
  ```
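
Unified shared memory is most convenient for pointer-based data structures, which would otherwise need a deep copy with one map clause per level. The following sketch assumes a system and compiler that support `unified_shared_memory`; the function name `increment_rows` and the jagged-array layout are hypothetical.

```c
#include <stddef.h>

#pragma omp requires unified_shared_memory

/* Increment every element of a jagged array on the device.
   rows[r] points to a separately allocated row of length len[r].
   No map clauses are used: with unified shared memory the host pointers
   stored in rows[] are assumed to be valid on the device as well. */
void increment_rows(double **rows, const size_t *len, size_t nrows) {
    #pragma omp target teams distribute parallel for
    for (size_t r = 0; r < nrows; r++) {
        for (size_t i = 0; i < len[r]; i++) {
            rows[r][i] += 1.0;
        }
    }
}
```

Without unified shared memory the same loop would need every row mapped and attached to `rows` explicitly before the kernel runs.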

* Automatically transfer data back from the device to the host
* Automatically delete data on the device

* Post-process data on the host
  e.g. write data to storage

  ```c
  for (unsigned int i = 0; i < size; i++) {
      if (a[i] != 2.) {
          errx(2, "Computation on GPU failed");
      }
  }
  ```

* Free memory on the host

  ```c
  free(a);
  ```

### Compilation

In [None]:
#!/usr/bin/bash
# AMD ROCm Compiler
! amdclang \
    -O2 -march=native -flto -Wall -Wextra \
    -fopenmp --offload-arch=native -foffload-lto \
    "../src/exampleOpenMP-unified_shared_memory.c" -o "../bin/exampleOpenMP-unified_shared_memory"

In [None]:
#!/usr/bin/bash
# GCC
! gcc \
    -O2 -march=native -Wall -Wextra \
    -fopenmp \
    "../src/exampleOpenMP-unified_shared_memory.c" -o "../bin/exampleOpenMP-unified_shared_memory"

In [None]:
#!/usr/bin/bash
# Intel oneAPI Compiler
! icx \
    -O2 -xHost -ipo -Wall -Wextra \
    -qopenmp -fopenmp-targets=spir64 -foffload-lto \
    "../src/exampleOpenMP-unified_shared_memory.c" -o "../bin/exampleOpenMP-unified_shared_memory"

In [None]:
#!/usr/bin/bash
# LLVM
! clang-21 \
    -O2 -march=native -Wall -Wextra -fuse-ld=lld -Wl,-rpath=/usr/lib/llvm-21/lib \
    -fopenmp --offload-arch=native -foffload-lto \
    "../src/exampleOpenMP-unified_shared_memory.c" -o "../bin/exampleOpenMP-unified_shared_memory"

In [None]:
#!/usr/bin/bash
# NVIDIA HPC SDK Compiler
! module purge; \
  module add toolkit/nvidia-hpc-sdk/25.3; \
  nvc \
    -O2 -tp=host -Minform=inform \
    -mp -target=gpu -gpu=mem:unified \
    "../src/exampleOpenMP-unified_shared_memory.c" -o "../bin/exampleOpenMP-unified_shared_memory"

### Execution

In [None]:
#!/usr/bin/bash
! export OMP_PROC_BIND=close; \
  export OMP_PLACES=cores; \
  export OMP_TARGET_OFFLOAD=MANDATORY; \
  export OMP_DISPLAY_ENV=VERBOSE; \
  export OMP_DISPLAY_AFFINITY=TRUE; \
  ../bin/exampleOpenMP-unified_shared_memory

OPENMP DISPLAY ENVIRONMENT BEGIN
  _OPENMP='202011'
  OMP_SCHEDULE='STATIC',0
  OMP_NUM_THREADS='1'
  OMP_NUM_TEAMS='0'
  OMP_NUM_TEAMS_DEV_0='0'
  OMP_NUM_TEAMS_DEV_1='0'
  OMP_NUM_TEAMS_DEV_2='0'
  OMP_NUM_TEAMS_DEV_3='0'
  OMP_NUM_TEAMS_DEV_4='0'
  OMP_NUM_TEAMS_DEV_5='0'
  OMP_NUM_TEAMS_DEV_6='0'
  OMP_NUM_TEAMS_DEV_7='0'
  OMP_NUM_TEAMS_DEV_8='0'
  OMP_NUM_TEAMS_DEV_9='0'
  OMP_NUM_TEAMS_DEV_10='0'
  OMP_NUM_TEAMS_DEV_11='0'
  OMP_NUM_TEAMS_DEV_12='0'
  OMP_NUM_TEAMS_DEV_13='0'
  OMP_NUM_TEAMS_DEV_14='0'
  OMP_NUM_TEAMS_DEV_15='0'
  OMP_DYNAMIC='TRUE'
  OMP_PROC_BIND='close'
  OMP_PLACES='cores'
  OMP_STACKSIZE='0B,0B'
  OMP_STACKSIZE_DEV_0='0B'
  OMP_STACKSIZE_DEV_1='0B'
  OMP_STACKSIZE_DEV_2='0B'
  OMP_STACKSIZE_DEV_3='0B'
  OMP_STACKSIZE_DEV_4='0B'
  OMP_STACKSIZE_DEV_5='0B'
  OMP_STACKSIZE_DEV_6='0B'
  OMP_STACKSIZE_DEV_7='0B'
  OMP_STACKSIZE_DEV_8='0B'
  OMP_STACKSIZE_DEV_9='0B'
  OMP_STACKSIZE_DEV_10='0B'
  OMP_STACKSIZE_DEV_11='0B'
  OMP_STACKSIZE_DEV_12='0B'
  OMP_STACKSIZ