# SYCL

* Higher-level programming model (APIs, ecosystem)
* Provides APIs to discover devices and to manage data and code execution on them
* Standard C++, single-source code for CPU and accelerator
* SYCLomatic: CUDA to SYCL converter
* Developed by the Khronos Group
* Links to external resources:
  * [SYCL Khronos Homepage](https://www.khronos.org/sycl/) (Specification and Reference)
  * [SYCL.tech](https://sycl.tech/getting-started) (Community, Tutorial, eBook)

## Supported Compilers

* AdaptiveCpp (previously known as hipSYCL / Open SYCL)
* Intel oneAPI Compiler

## Hardware portability

* CPUs
* AMD GPUs (AdaptiveCpp, Intel oneAPI Compiler with Codeplay Plugin)
* Intel GPUs (Intel oneAPI Compiler)
* NVIDIA GPUs (AdaptiveCpp, Intel oneAPI Compiler with Codeplay Plugin)

## Example

* Source code available in [exampleSYCL.cpp](../src/exampleSYCL.cpp)

* Include SYCL support in your code

  ```c++
  #include <sycl/sycl.hpp>
  
  int main() {

      // Create a queue on the default device
      sycl::queue queue;
      auto device = queue.get_device();
      auto platform = device.get_platform();

      // ..
  }
  ```

* A list of devices can be obtained with `sycl-ls --verbose`
* Devices can be selected by setting the environment variable `ONEAPI_DEVICE_SELECTOR` (or the older `SYCL_DEVICE_FILTER`)

* Platform information

  ```c++
  cout << "SYCL Platform: "
       << platform.get_info<sycl::info::platform::name>()
       << endl;
  cout << "SYCL Device name: "
       << device.get_info<sycl::info::device::name>()
       << endl;
  cout << "SYCL Driver version: "
       << device.get_info<sycl::info::device::driver_version>()
       << endl;
  cout << "Global memory size: "
       << device.get_info<sycl::info::device::global_mem_size>() / 1000000000. << " GB"
       << endl;
  ```

* Allocate memory on the host

  ```c++
  auto a = sycl::malloc_host<double>(size, queue);
  if (a == nullptr) {
      errx(1, "malloc a[] failed");
  }
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      a[i] = 1.;
  }
  ```

* Allocate memory on the device

  ```c++
  double *device_a = sycl::malloc_device<double>(size, queue);
  if (device_a == nullptr) {
      errx(1, "sycl::malloc_device device_a[] failed");
  }
  ```

* Copy data from the host to the device

  ```c++
  queue.copy(a, device_a, size).wait();
  ```

* Compute on the device

  ```c++
  queue.parallel_for(
      size,
      // kernel expressed as lambda expression
      [=](sycl::id<1> idx) {
          device_a[idx]++;
      });
  queue.wait();
  ```

* Transfer data back from the device to the host

  ```c++
  queue.copy(device_a, a, size).wait();
  ```

* Delete data on the device

  ```c++
  sycl::free(device_a, queue);
  ```

* Post-process data on the host
  e.g. write data to storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      if (a[i] != 2.) {
          cout << "a[" << i << "] = " << a[i] << endl;
          errx(2, "Computation on GPU failed");
      }
  }
  ```

* Free memory on the host

  ```c++
  sycl::free(a, queue);
  ```
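
The compute step above applies the kernel lambda once per index in `[0, size)`. As a rough mental model, the sketch below emulates that on the host with a sequential loop. `host_parallel_for` and `run_increment_example` are illustrative names and not part of SYCL, and a real device runs the invocations in parallel and in no guaranteed order.

```c++
#include <cstddef>
#include <vector>

// Host-only emulation (NOT the SYCL API): calls `kernel` once per index in
// [0, size). A SYCL device runs these invocations in parallel and in no
// guaranteed order, so a kernel must not depend on the iteration order.
template <typename Kernel>
void host_parallel_for(std::size_t size, Kernel kernel) {
    for (std::size_t idx = 0; idx < size; ++idx) {
        kernel(idx);
    }
}

// Mirrors the example: initialize every element to 1, then increment it.
inline std::vector<double> run_increment_example(std::size_t size) {
    std::vector<double> a(size, 1.);
    double *data = a.data();  // stands in for the device pointer device_a
    host_parallel_for(size, [=](std::size_t idx) { data[idx]++; });
    return a;
}
```

Calling `run_increment_example(4)` yields four elements equal to 2, which is the value the post-processing step checks for.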

### Compilation

In [None]:
#!/usr/bin/bash
# Intel oneAPI Compiler with Codeplay NVIDIA plugin for NVIDIA H100 GPU
! icpx -O2 -march=native -Wall -Wextra \
    -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_90 \
    "../src/exampleSYCL.cpp" -o "../bin/exampleSYCL"

In [9]:
#!/usr/bin/bash
# AdaptiveCpp for AMD GPUs
! acpp \
    -O2 -march=native -flto -Wall -Wextra \
    --acpp-targets=hip:gfx1032 \
    "../src/exampleSYCL.cpp" -o "../bin/exampleSYCL"

In [2]:
#!/usr/bin/bash
# AdaptiveCpp for NVIDIA GPUs
! module purge; \
  module add compiler/llvm/19; \
  module add devel/cuda/12.9; \
  unset OMP_NUM_THREADS; \
  /software/all/compiler/AdaptiveCpp/v25.10.0/bin/acpp \
    -O2 -march=native -flto -Wall -Wextra \
    --acpp-targets=cuda:sm_90 \
    "../src/exampleSYCL.cpp" -o "../bin/exampleSYCL"



### Execution

In [4]:
! module purge; \
  module add compiler/llvm/19; \
  module add devel/cuda/12.9; \
  unset OMP_NUM_THREADS; \
  ../bin/exampleSYCL

SYCL Platform: CUDA (platform 0)
SYCL Device name: NVIDIA H100
SYCL Driver version: 12080
Global memory size: 99.9605 GB
* Allocate memory on the host
* Pre-process / initialize data on the host
  e.g. read data from storage
* Allocate memory on the device
* Copy data from the host to the device
* Compute on the device
* Transfer data back from the device to the host
* Delete data on the device
* Post-process data on the host
  e.g. write data to storage
* Free memory on the host
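
The `Global memory size` line in the output comes from dividing the byte count by `1000000000.`, so it is in decimal gigabytes (10^9 bytes). A minimal sketch of that conversion next to binary gibibytes (2^30 bytes) follows. The helper names `to_gigabytes` and `to_gibibytes` are illustrative and not part of SYCL.

```c++
#include <cstdint>

// Hypothetical helpers (not part of SYCL): convert a byte count, such as the
// value queried via sycl::info::device::global_mem_size, to readable units.

// Decimal gigabytes: 1 GB = 10^9 bytes (the unit used in the example).
constexpr double to_gigabytes(std::uint64_t bytes) {
    return static_cast<double>(bytes) / 1e9;
}

// Binary gibibytes: 1 GiB = 2^30 bytes.
constexpr double to_gibibytes(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

For the same device, the GiB figure is always smaller than the GB figure, which is worth keeping in mind when comparing against vendor data sheets.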


## Example (using `sycl::accessor`)

* Source code available in [exampleSYCL-accessor.cpp](../src/exampleSYCL-accessor.cpp)

* Include SYCL support in your code

  ```c++
  #include <sycl/sycl.hpp>
  
  int main() {

      // Create a queue on the default device
      sycl::queue queue;
      auto device = queue.get_device();
      auto platform = device.get_platform();

      // ..
  }
  ```

* A list of devices can be obtained with `sycl-ls --verbose`
* Devices can be selected by setting the environment variable `ONEAPI_DEVICE_SELECTOR` (or the older `SYCL_DEVICE_FILTER`)

* Platform information

  ```c++
  cout << "SYCL Platform: "
       << platform.get_info<sycl::info::platform::name>()
       << endl;
  cout << "SYCL Device name: "
       << device.get_info<sycl::info::device::name>()
       << endl;
  cout << "SYCL Driver version: "
       << device.get_info<sycl::info::device::driver_version>()
       << endl;
  cout << "Global memory size: "
       << device.get_info<sycl::info::device::global_mem_size>() / 1000000000. << " GB"
       << endl;
  ```

* Create buffer
  * Automatically allocate memory on the host
  * Automatically allocate memory on the device

  ```c++
  sycl::buffer<double, 1> a{size};
  ```

* Pre-process / initialize data on the host
  e.g. read data from storage

  ```c++
  {
      auto host_access_a = a.get_host_access(sycl::write_only);
      for (std::size_t i = 0; i < size; i++) {
          host_access_a[i] = 1.;
      }
  }
  ```

* Automatically copy data from the host to the device
* Compute on the device

  ```c++
  queue.submit(
      // command group expressed as lambda expression
      [&](sycl::handler &handler) {
          auto device_access_a = a.get_access(handler, sycl::read_write);

          handler.parallel_for(
              size,
              // kernel expressed as lambda expression
              [=](sycl::id<1> idx) {
                  device_access_a[idx]++;
              });
      });
  queue.wait();
  ```

* Automatically transfer data back from the device to the host

  ```c++
  auto host_access_a = a.get_host_access(sycl::read_only);
  ```

* Post-process data on the host
  e.g. write data to storage

  ```c++
  for (std::size_t i = 0; i < size; i++) {
      if (host_access_a[i] != 2.) {
          cout << "a[" << i << "] = " << host_access_a[i] << endl;
          errx(2, "Computation on GPU failed");
      }
  }
  ```
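
The post-processing check above only needs element access by index, so it can be factored into a small helper. The sketch below is one way to do that. `first_mismatch` is a hypothetical name, and the assumption is that the argument supports `operator[]` with a `std::size_t` index, as a raw pointer does and as the host accessor is used in the loop above.

```c++
#include <cstddef>
#include <optional>

// Hypothetical helper (not part of SYCL): returns the index of the first
// element that differs from `expected`, or std::nullopt if all match.
// `Indexable` only needs operator[](std::size_t), so a raw pointer or a
// host accessor used as in the loop above would fit.
template <typename Indexable>
std::optional<std::size_t> first_mismatch(const Indexable &values,
                                          std::size_t size, double expected) {
    for (std::size_t i = 0; i < size; i++) {
        if (values[i] != expected) {
            return i;
        }
    }
    return std::nullopt;
}
```

The exact comparison is acceptable here because 1 + 1 is exactly representable as a `double`. For results affected by rounding, a tolerance-based comparison would be the safer choice.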

### Compilation

In [None]:
#!/usr/bin/bash
# Intel oneAPI Compiler with Codeplay NVIDIA plugin for NVIDIA H100 GPU
! icpx -O2 -march=native -Wall -Wextra \
    -fsycl -fsycl-targets=nvptx64-nvidia-cuda -Xsycl-target-backend --cuda-gpu-arch=sm_90 \
    "../src/exampleSYCL-accessor.cpp" -o "../bin/exampleSYCL-accessor"

In [6]:
#!/usr/bin/bash
# AdaptiveCpp for AMD GPUs
! acpp \
    -O2 -march=native -flto -Wall -Wextra \
    --acpp-targets=hip:gfx1032 \
    "../src/exampleSYCL-accessor.cpp" -o "../bin/exampleSYCL-accessor"

In [None]:
#!/usr/bin/bash
# AdaptiveCpp for NVIDIA GPUs
! module purge; \
  module add compiler/llvm/19; \
  module add devel/cuda/12.9; \
  unset OMP_NUM_THREADS; \
  /software/all/compiler/AdaptiveCpp/v25.10.0/bin/acpp \
    -O2 -march=native -flto -Wall -Wextra \
    --acpp-targets=cuda:sm_90 \
    "../src/exampleSYCL-accessor.cpp" -o "../bin/exampleSYCL-accessor"



### Execution

In [6]:
#!/usr/bin/bash
! module purge; \
  module add compiler/llvm/19; \
  module add devel/cuda/12.9; \
  unset OMP_NUM_THREADS; \
  ../bin/exampleSYCL-accessor

SYCL Platform: CUDA (platform 0)
SYCL Device name: NVIDIA H100
SYCL Driver version: 12080
Global memory size: 99.9605 GB
* Create buffer
* Automatically allocate memory on the host
* Automatically allocate memory on the device
https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/performance.md
* Pre-process / initialize data on the host
  e.g. read data from storage
* Automatically copy data from the host to the device
* Compute on the device
* Automatically transfer data back from the device to the host
* Post-process data on the host
  e.g. write data to storage
