# C++ Standard Parallelism

* C++17 introduced parallel algorithms, extended in C++20
  * Includes parallel loop operations, e.g. `for_each` and `transform_reduce`
  * Execution policies (`seq`, `par`, `par_unseq`, `unseq`) tell the implementation which kinds of parallel or vectorized execution are permitted
  * Single source code for CPU and accelerator
* No explicit data placement / device selection
* Execution can be **serial**!
  * Parallel execution on CPUs or GPUs needs compiler support
  * Parallel execution on GPUs needs hardware capability: *Unified Shared Memory* or *Managed Memory*
* Links to external resources:
  * [C++ Reference Algorithms library](https://en.cppreference.com/w/cpp/algorithm.html)
  * [C++ Reference Ranges library](https://en.cppreference.com/w/cpp/ranges.html)
  * [HIPSTDPAR](https://github.com/ROCm/roc-stdpar) (Implementation and Status)
  * [NVIDIA HPC Compilers User's Guide - Using Stdpar](https://docs.nvidia.com/hpc-sdk/compilers/hpc-compilers-user-guide/index.html#using-stdpar)
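
As a minimal sketch of how an execution policy is passed to an algorithm, the following computes a dot product with `std::transform_reduce`. The helper name `dot` and the use of `std::vector` are assumptions for illustration and are not taken from the example source. The policy only permits parallel execution; the implementation may still run serially.

```c++
#include <cassert>
#include <execution>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of exampleStdpar.cpp): dot product of x and y.
double dot(const std::vector<double> &x, const std::vector<double> &y) {
    assert(x.size() == y.size()); // both ranges must have the same length
    return std::transform_reduce(
        std::execution::par, // parallel execution is permitted, not guaranteed
        x.begin(), x.end(), y.begin(),
        0.0); // initial value; defaults to multiplying pairs and summing
}
```

Replacing `std::execution::par` with `std::execution::seq` requests serial execution through the same call. With GCC's libstdc++, parallel execution on CPUs typically also requires linking against oneTBB (e.g. `-ltbb`).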

## Supported Compilers

* AMD ROCm Compiler
* GCC (CPU only)
* Intel oneAPI Compiler (CPU only)
* LLVM (CPU only)
* NVIDIA HPC SDK Compiler

## Hardware portability

* CPUs
* AMD GPUs
* NVIDIA GPUs

## Example Code

* Source code available in [exampleStdpar.cpp](../src/exampleStdpar.cpp)

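* Include C++ Standard Parallelism support in your code

  ```c++
  #include <algorithm> // std::for_each, std::for_each_n
  #include <cstddef>   // std::size_t
  #include <err.h>     // errx (BSD/glibc extension)
  #include <execution> // std::execution::par_unseq
  #include <iostream>  // std::cout
  #include <new>       // std::nothrow
  #include <ranges>    // std::ranges::iota_view

  int main() {
      // ...

      return 0;
  }
  ```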

* Allocate memory on the host

  ```c++
  // Non-throwing allocation: yields nullptr instead of throwing std::bad_alloc
  double *a = new (std::nothrow) double[size];
  if (a == nullptr) {
      errx(1, "allocation of a[] failed");
  }
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      a[i] = 1.;
  }
  ```

* Automatically allocate memory on the device
* Automatically copy data from the host to the device
* Compute on the device
  * Without access to vector index

    ```c++
    std::for_each_n(
      std::execution::par_unseq, // parallel, unsequenced order
      a, size,
      // GPU kernel expressed as lambda expression
      [](double &a_i) {
          a_i++;
      });
    ```

  * With access to vector index

    ```c++
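    // Index range [0, size) to iterate over
    auto range = std::ranges::iota_view<std::size_t, std::size_t>{0, size};
    std::for_each(
        std::execution::par_unseq, // parallel, unsequenced order
        range.begin(), range.end(),
        // GPU kernel expressed as lambda expression;
        // the pointer is captured by value because a reference to a
        // host stack variable may not be accessible from the device
        [a](const auto &i) {
            a[i]++;
        });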
    ```

* Automatically transfer data back from the device to the host
* Automatically delete data on the device
* Post-process data on the host
  e.g. write data to storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      if (a[i] != 3.) {
          std::cout << "a[" << i << "] = " << a[i] << std::endl;
          errx(2, "Computation on GPU failed");
      }
  }
  ```

* Free memory on the host

  ```c++
  delete[] a;
  ```
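
The post-processing check above runs serially on the host. As a sketch only, the same verification could itself be expressed as a parallel algorithm, here with `std::all_of`. The helper `all_equal` is hypothetical and not part of exampleStdpar.cpp.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>

// Hypothetical helper: true if every element of a[0..size) equals expected.
bool all_equal(const double *a, std::size_t size, double expected) {
    assert(a != nullptr || size == 0); // a null pointer is only valid for an empty range
    return std::all_of(
        std::execution::par_unseq, // parallel, unsequenced order
        a, a + size,
        // expected is captured by value so the lambda holds its own copy
        [expected](double a_i) { return a_i == expected; });
}
```

In the example above, `all_equal(a, size, 3.)` would stand in for the explicit loop, at the cost of no longer reporting which element differs.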

### Compilation

```bash
#!/usr/bin/bash
# AMD ROCm Compiler
hipcc \
    -O2 -march=native -flto -std=c++20 -Wall -Wextra \
    --hipstdpar --offload-arch=native -foffload-lto \
    "../src/exampleStdpar.cpp" -o "../bin/exampleStdpar"
```

```bash
#!/usr/bin/bash
# NVIDIA HPC SDK Compiler
module purge
module add toolkit/nvidia-hpc-sdk/25.3
nvc++ \
    -O2 -tp=host -std=c++20 -Minform=inform \
    -Minfo=stdpar -Mneginfo=stdpar -fast -stdpar=gpu -gpu=ccnative \
    "../src/exampleStdpar.cpp" -o "../bin/exampleStdpar"
```


```text
main:
     29, stdpar: Generating NVIDIA GPU code
         29, std::for_each_n with std::execution::par_unseq policy parallelized on GPU
main:
     39, stdpar: Generating NVIDIA GPU code
         39, std::for_each with std::execution::par_unseq policy parallelized on GPU
NVC++-I-0162-Not equal test of loop control variable .inl_.inl_i_3586_44329 replaced with < or > test. (../src/exampleStdpar.cpp: 66)
```

### Execution

```bash
#!/usr/bin/bash
# NVIDIA HPC SDK Compiler
module purge
module add toolkit/nvidia-hpc-sdk/25.3
../bin/exampleStdpar
```

In summary, the example follows these steps:

* Allocate memory on the host
* Pre-process / initialize data on the host
  e.g. read data from storage
* Automatically allocate memory on the device
* Automatically copy data from the host to the device
* Compute on the device
* Automatically transfer data back from the device to the host
* Automatically delete data on the device
* Post-process data on the host
  e.g. write data to storage
* Free memory on the host
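
The steps listed above could be condensed into a single function, as sketched below. This is an illustrative variant under stated assumptions, not the code from exampleStdpar.cpp: it swaps the manual `new`/`delete[]` for `std::vector`, and the name `run_example` is hypothetical. Device allocation and data movement stay implicit, as in the original.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <execution>
#include <vector>

// Hypothetical condensed variant of the example; returns true on success.
bool run_example(std::size_t size) {
    assert(size > 0); // an empty array would make the final check vacuous

    // Allocate and initialize on the host; the vector releases its memory
    // automatically when it goes out of scope.
    std::vector<double> a(size, 1.0);

    // Compute: two increments per element, mirroring the two kernels.
    for (int pass = 0; pass < 2; ++pass) {
        std::for_each(std::execution::par_unseq, a.begin(), a.end(),
                      [](double &a_i) { a_i += 1.0; });
    }

    // Post-process on the host: every element should now equal 3.
    return std::all_of(a.begin(), a.end(),
                       [](double a_i) { return a_i == 3.0; });
}
```

Whether the two `for_each` calls actually run on a GPU depends on the compiler and flags, as described in the compilation section; with a CPU-only toolchain the same code may simply run on the host.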
