# Kokkos

* C++ programming model for performance-portable applications
* Abstractions for both parallel code execution and data management
* Open Source, Linux Foundation project
* Links to external resources:
  * [Kokkos Homepage](https://kokkos.org) (Documentation, API References and Tutorials)

## Supported Compilers

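* Any C++17-capable compiler supported by the chosen backend (Kokkos 4.x requires C++17), e.g. GCC, Clang, NVCC via `nvcc_wrapper`, or `hipcc`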

## Hardware portability

* CPUs (Serial, OpenMP and Threads backends)
* AMD GPUs (HIP backend)
* Intel GPUs (SYCL backend)
* NVIDIA GPUs (CUDA backend)

## Example (Mirror View)

* Source code available in [exampleKokkos.cpp](../src/exampleKokkos.cpp)

* Include Kokkos support in your code

  ```c++
  #include <Kokkos_Core.hpp>

  void mainKokkos() {
      // ...
  }

  int main(int argc, char **argv) {
      Kokkos::initialize(argc, argv);
      // Bundle all Kokkos objects in function mainKokkos to ensure that destructors are called before Kokkos::finalize
      mainKokkos();
      // all Kokkos objects must be destroyed before Kokkos::finalize gets called!
      Kokkos::finalize();
      return 0;
  }
  ```

* List the available devices with `../bin/exampleKokkos --config`
* Select a device with `../bin/exampleKokkos --kokkos-device-id=...`

* Platform information

  ```c++
  Kokkos::print_configuration(cout);
  ```


* Allocate memory on the device

  ```c++
  Kokkos::View<double *> device_a(Kokkos::ViewAllocateWithoutInitializing("device_a"), size);
  ```

* Allocate memory on the host

  ```c++
  auto a = Kokkos::create_mirror_view(device_a);
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      a[i] = 1.;
  }
  ```

* Copy data from the host to the device

  ```c++
  Kokkos::deep_copy(device_a, a);
  ```

* Compute on the device

  ```c++
  Kokkos::parallel_for(
      "Increment a[] on device", size,
      KOKKOS_LAMBDA(int i) {
          device_a[i]++;
      });
  ```

* Transfer data back from the device to the host

  ```c++
  Kokkos::deep_copy(a, device_a);
  ```

* Post-process data on the host
  e.g. write data to storage or perform consistency checks

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      if (a[i] != 2.) {
          cout << "a[" << i << "] = " << a[i] << endl;
          errx(2, "Computation on GPU failed");
      }
  }
  ```

* Device memory is freed automatically once the last view referencing it goes out of scope
* Host memory is freed automatically in the same way
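
The lifetime rules used in this example (all views are gone before `Kokkos::finalize`, and memory is released without explicit calls) can be illustrated with a small toy model in plain C++. This is only a sketch and not Kokkos code: `ToyRuntime`, `ToyView`, `mainToy` and `liveAtFinalize` are hypothetical names, and the reference counting is delegated to `std::shared_ptr` as an assumption about how shared ownership could be modelled.

```c++
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a runtime that tracks live allocations.
// This is NOT Kokkos; it only mimics the lifetime rules described above.
struct ToyRuntime {
    int live_allocations = 0;
};

// Reference-counted array handle: copies share one allocation, and the
// storage is released when the last handle goes out of scope.
class ToyView {
public:
    ToyView(ToyRuntime &rt, std::size_t n)
        : data_(new std::vector<double>(n), [&rt](std::vector<double> *p) {
              delete p;               // release the storage ...
              --rt.live_allocations;  // ... and tell the runtime about it
          }) {
        ++rt.live_allocations;
    }

    double &operator[](std::size_t i) const {
        assert(i < data_->size());
        return (*data_)[i];
    }

private:
    std::shared_ptr<std::vector<double>> data_;
};

// All handles live inside this function, so they are released on return.
// Returns the number of allocations alive while the function is running.
int mainToy(ToyRuntime &rt) {
    ToyView a(rt, 4);
    ToyView alias = a;  // copying a handle shares the allocation
    alias[0] = 1.;      // visible through `a` as well
    return rt.live_allocations;
}

// Returns the number of allocations still alive at "finalize" time.
int liveAtFinalize() {
    ToyRuntime rt;  // plays the role of initialize
    mainToy(rt);    // every handle is destroyed when this returns
    return rt.live_allocations;
}
```

In this model `mainToy` sees one live allocation (two handles, one buffer), and `liveAtFinalize` returns 0 because every handle was destroyed when `mainToy` returned. Declaring the handles directly in the function that also tears down the runtime would leave them alive at that point, which is the situation the `mainKokkos()` wrapper avoids.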

### Compilation

```bash
# AMD ROCm compiler
hipcc \
    -O2 -march=native -flto -Wall -Wextra -fopenmp \
    --offload-arch=native -foffload-lto \
    "../src/exampleKokkos.cpp" -o "../bin/exampleKokkos" -lkokkoscore
```

```bash
# NVIDIA CUDA compiler (via nvcc_wrapper)
module purge
module add compiler/gnu/14
module add devel/cuda/12.9
module add devel/kokkos/4.7.01
nvcc_wrapper --extended-lambda -O2 -march=native -Wall -Wextra -fopenmp \
    -arch=sm_90 \
    "../src/exampleKokkos.cpp" -o "../bin/exampleKokkos" -lkokkoscore -lcuda
```

### Execution

```bash
module purge
module add compiler/gnu/14
module add devel/cuda/12.9
module add devel/kokkos/4.7.01
../bin/exampleKokkos
```

```text
  Kokkos Version: 4.7.1
Compiler:
  KOKKOS_COMPILER_GNU: 1420
  KOKKOS_COMPILER_NVCC: 1290
Architecture:
  CPU architecture: none
  Default Device: Cuda
  GPU architecture: HOPPER90
  platform: 64bit
Atomics:
Vectorization:
  KOKKOS_ENABLE_PRAGMA_IVDEP: no
  KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
  KOKKOS_ENABLE_PRAGMA_UNROLL: no
  KOKKOS_ENABLE_PRAGMA_VECTOR: no
Memory:
Options:
  KOKKOS_ENABLE_ASM: yes
  KOKKOS_ENABLE_CXX17: yes
  KOKKOS_ENABLE_CXX20: no
  KOKKOS_ENABLE_CXX23: no
  KOKKOS_ENABLE_CXX26: no
  KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
  KOKKOS_ENABLE_HWLOC: yes
  KOKKOS_ENABLE_LIBDL: yes
Host Parallel Execution Space:
  KOKKOS_ENABLE_OPENMP: yes

OpenMP Runtime Configuration:
Kokkos::OpenMP thread_pool_topology[ 1 x 1 x 1 ]
Host Serial Execution Space:
  KOKKOS_ENABLE_SERIAL: yes

Serial Runtime Configuration:
Device Execution Space:
  KOKKOS_ENABLE_CUDA: yes
Cuda Options:
  KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
  KOKKOS_ENABLE_CUDA_UVM: no
  KOKKOS_ENABLE_IMPL_CUDA_MAL
```

## Example (Dual View)

* Source code available in [exampleKokkos.DualView.cpp](../src/exampleKokkos.DualView.cpp)

* Include Kokkos support in your code

  ```c++
  #include <Kokkos_Core.hpp>
  #include <Kokkos_DualView.hpp>

  void mainKokkos() {
      // ...
  }

  int main(int argc, char **argv) {
      Kokkos::initialize(argc, argv);
      // Bundle all Kokkos objects in function mainKokkos to ensure that destructors are called before Kokkos::finalize
      mainKokkos();
      // all Kokkos objects must be destroyed before Kokkos::finalize gets called!
      Kokkos::finalize();
      return 0;
  }
  ```

* List the available devices with `../bin/exampleKokkos.DualView --config`
* Select a device with `../bin/exampleKokkos.DualView --kokkos-device-id=...`

* Platform information

  ```c++
  Kokkos::print_configuration(cout);
  ```

* Allocate memory on the host and the device

  ```c++
  Kokkos::DualView<double *> aDualView(Kokkos::ViewAllocateWithoutInitializing("vector a"), size);
  auto a = aDualView.view_host();
  auto device_a = aDualView.view_device();
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      a[i] = 1;
  }
  ```

* Copy data from the host to the device

  ```c++
  aDualView.modify_host(); // mark the host copy as modified
  aDualView.sync_device(); // copy to the device if the host copy is marked as modified
  ```

* Compute on the device

  ```c++
  Kokkos::parallel_for(
      "Increment a[] on device", size,
      KOKKOS_LAMBDA(int i) {
          device_a[i]++;
      });
  ```

* Transfer data back from the device to the host

  ```c++
  aDualView.modify_device(); // mark the device copy as modified
  aDualView.sync_host();     // copy to the host if the device copy is marked as modified
  ```

* Post-process data on the host
  e.g. write data to storage or perform consistency checks

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      if (a[i] != 2.) {
          cout << "a[" << i << "] = " << a[i] << endl;
          errx(2, "Computation on GPU failed");
      }
  }
  ```

* Device memory is freed automatically once the last view referencing it goes out of scope
* Host memory is freed automatically in the same way
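
The modify/sync pairs in the steps above only copy data when the other side has been marked as modified. The following toy model in plain C++ sketches that bookkeeping. It is not the Kokkos implementation: `ToyDualView`, `ReplayResult` and `replayExample` are hypothetical names, both copies live in host memory, and the `sync_*` functions return a `bool` purely so the effect can be observed.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a host/device pair with "modified" flags.
// NOT Kokkos::DualView; both copies are ordinary host vectors here.
class ToyDualView {
public:
    explicit ToyDualView(std::size_t n) : host(n), device(n) { assert(n > 0); }

    std::vector<double> host;    // stands in for view_host()
    std::vector<double> device;  // stands in for view_device()

    void modify_host() { host_modified_ = true; }
    void modify_device() { device_modified_ = true; }

    // Copy host -> device only if the host side was marked as modified.
    // Returns true when a copy actually happened.
    bool sync_device() {
        if (!host_modified_) return false;
        device = host;
        host_modified_ = false;
        return true;
    }

    // Copy device -> host only if the device side was marked as modified.
    bool sync_host() {
        if (!device_modified_) return false;
        host = device;
        device_modified_ = false;
        return true;
    }

private:
    bool host_modified_ = false;
    bool device_modified_ = false;
};

struct ReplayResult {
    int copies;         // number of syncs that actually copied data
    double host_stale;  // host value after a sync without modify_device()
    double host_fresh;  // host value after modify_device() + sync_host()
};

// Replays the steps of the example, plus one deliberately forgotten mark.
ReplayResult replayExample() {
    ToyDualView a(4);
    int copies = 0;
    for (double &x : a.host) x = 1.;     // initialize on the host
    a.modify_host();
    if (a.sync_device()) ++copies;       // copies: host was marked
    if (a.sync_device()) ++copies;       // no copy: nothing changed since
    for (double &x : a.device) x += 1.;  // "compute on the device"
    if (a.sync_host()) ++copies;         // no copy: device was never marked
    const double stale = a.host[0];      // host still holds the old value
    a.modify_device();
    if (a.sync_host()) ++copies;         // copies: device is now marked
    return {copies, stale, a.host[0]};
}
```

In this sketch only two of the four sync calls copy data, and the host keeps its old value until `modify_device()` has been called before `sync_host()`. That mirrors why the example marks each side as modified right before synchronizing.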

### Compilation

```bash
# AMD ROCm compiler
hipcc \
    -O2 -march=native -flto -Wall -Wextra -fopenmp \
    --offload-arch=native -foffload-lto \
    "../src/exampleKokkos.DualView.cpp" -o "../bin/exampleKokkos.DualView" -lkokkoscore
```

```bash
# NVIDIA CUDA compiler (via nvcc_wrapper)
module purge
module add compiler/gnu/14
module add devel/cuda/12.9
module add devel/kokkos/4.7.01
nvcc_wrapper --extended-lambda \
    -O2 -march=native -Wall -Wextra -fopenmp \
    -arch=sm_90 \
    "../src/exampleKokkos.DualView.cpp" -o "../bin/exampleKokkos.DualView" -lkokkoscore -lcuda
```

### Execution

```bash
module purge
module add compiler/gnu/14
module add devel/cuda/12.9
module add devel/kokkos/4.7.01
../bin/exampleKokkos.DualView
```

```text
  Kokkos Version: 4.7.1
Compiler:
  KOKKOS_COMPILER_GNU: 1420
  KOKKOS_COMPILER_NVCC: 1290
Architecture:
  CPU architecture: none
  Default Device: Cuda
  GPU architecture: HOPPER90
  platform: 64bit
Atomics:
Vectorization:
  KOKKOS_ENABLE_PRAGMA_IVDEP: no
  KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
  KOKKOS_ENABLE_PRAGMA_UNROLL: no
  KOKKOS_ENABLE_PRAGMA_VECTOR: no
Memory:
Options:
  KOKKOS_ENABLE_ASM: yes
  KOKKOS_ENABLE_CXX17: yes
  KOKKOS_ENABLE_CXX20: no
  KOKKOS_ENABLE_CXX23: no
  KOKKOS_ENABLE_CXX26: no
  KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
  KOKKOS_ENABLE_HWLOC: yes
  KOKKOS_ENABLE_LIBDL: yes
Host Parallel Execution Space:
  KOKKOS_ENABLE_OPENMP: yes

OpenMP Runtime Configuration:
Kokkos::OpenMP thread_pool_topology[ 1 x 1 x 1 ]
Host Serial Execution Space:
  KOKKOS_ENABLE_SERIAL: yes

Serial Runtime Configuration:
Device Execution Space:
  KOKKOS_ENABLE_CUDA: yes
Cuda Options:
  KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
  KOKKOS_ENABLE_CUDA_UVM: no
  KOKKOS_ENABLE_IMPL_CUDA_MAL
```