

## GPU PROGRAMMING 101 GRIDKA SCHOOL 2018

30 August 2018 | Andreas Herten | Forschungszentrum Jülich Handout Version



Member of the Helmholtz Association

# About, Outline

About me

- Physics: Dr. at PANDA (Particle Tracking with GPUs)
- Since then: NVIDIA Application Lab, POWER Acceleration and Design Centre
- Optimizing scientific applications for/on GPUs at Jülich Supercomputing Centre

- Motivation
- Platform
  - Hardware Features
  - High Throughput
  - Summary
- Programming GPUs
  - Libraries
  - Directives
  - Languages
  - Abstraction Libraries/DSL
  - Tools
- Conclusions



JURECA: Top 500 #70

- 1999: General computations with shaders of graphics hardware
- 2001: NVIDIA GeForce 3 with programmable shaders [2]; 2003: DirectX 9 at ATI
- 2007: CUDA
- 2018: Top 500: 20 % with GPUs (#1, #3), Green 500: 7 of top 10 with GPUs




Theoretical Peak Performance, Single Precision









Theoretical Peak Performance, Double Precision







# But why?!

# Let's find out!

# Platform



### CPU vs. GPU

#### A matter of specialties



Transporting one



**Transporting many** 



#### CPU vs. GPU Chip







### **GPU** Architecture

Overview

Aim: Hide Latency *Everything else follows* 

SIMT

Asynchronicity

Memory




### Memory

#### GPU memory ain't no CPU memory

- GPU: accelerator / extension card
- → Separate device from CPU: separate memory, but Unified Virtual Addressing (UVA) and Unified Memory (UM)
  - Memory transfers need special consideration! Do as little as possible!
  - Formerly: explicitly copy data to/from GPU. Now: done automatically (performance...?)
- P100: 16 GB RAM, 720 GB/s
- V100: 32 GB RAM, 900 GB/s

### Asynchronicity





Following different streams

- Problem: Memory transfer is comparatively slow
  Solution: Do something else in the meantime (computation)!
- → Overlap tasks
  - Copy and compute engines run separately (streams)



- GPU needs to be fed: Schedule many computations
- CPU can do other work while GPU computes; synchronization
- Also: Fast switching of contexts to keep GPU busy (KGB)



### SIMT





#### Of threads and warps

- CPU:
  - Single Instruction, Multiple Data (SIMD)
  - Simultaneous Multithreading (SMT)
- GPU: Single Instruction, Multiple Threads (SIMT)
  - CPU core  $\simeq$  GPU multiprocessor (SM)
  - Working unit: set of threads (32, a *warp*)
  - Fast switching of threads (large register file)
  - Branching if \_\_\_\_\_
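To make the warp idea more concrete, here is a minimal host-side sketch in plain C++ of how a warp of 32 threads might step through a branch. It is a simplified model, not how the hardware is implemented: the function name `warp_branch` and the example branch are hypothetical, and real GPUs (especially with Volta's independent thread scheduling) are more sophisticated. The point it illustrates is that lanes taking different sides of an `if` are processed in separate passes, with the non-participating lanes masked out.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kWarpSize = 32;  // threads per warp, as stated above

// Hypothetical model of one warp executing
//   if (x[i] > 0) y[i] = 2 * x[i]; else y[i] = -x[i];
// All lanes step through the same instruction; lanes whose condition does
// not match the branch currently executed are masked out.
// Returns the number of branch passes the warp needed (1 or 2).
int warp_branch(const std::array<float, kWarpSize>& x,
                std::array<float, kWarpSize>& y) {
    std::array<bool, kWarpSize> mask{};
    bool any_then = false, any_else = false;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        mask[lane] = x[lane] > 0.0f;
        any_then = any_then || mask[lane];
        any_else = any_else || !mask[lane];
    }

    int passes = 0;
    if (any_then) {  // pass over the "then" side: only matching lanes are active
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            if (mask[lane]) y[lane] = 2.0f * x[lane];
        ++passes;
    }
    if (any_else) {  // pass over the "else" side: the remaining lanes are active
        for (std::size_t lane = 0; lane < kWarpSize; ++lane)
            if (!mask[lane]) y[lane] = -x[lane];
        ++passes;
    }
    return passes;
}
```

In this model, a warp whose lanes all agree on the condition needs one pass, while a single disagreeing lane already forces a second pass for the whole warp.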



Figure: Vector, SMT (threads sharing cores), and SIMT execution models. Graphics: volta-pictures

Vector:

 $\begin{bmatrix} A_0 \\ A_1 \\ A_2 \\ A_3 \end{bmatrix} + \begin{bmatrix} B_0 \\ B_1 \\ B_2 \\ B_3 \end{bmatrix} = \begin{bmatrix} C_0 \\ C_1 \\ C_2 \\ C_3 \end{bmatrix}$ 

### **New: Tensor Cores**

New in Volta

- 8 Tensor Cores per Streaming Multiprocessor (SM) (640 total for V100)
- Performance: 125 TFLOP/s (half precision)
- Calculate  $\mathbf{A} \times \mathbf{B} + \mathbf{C} = \mathbf{D}$  (4 × 4 matrices;  $\mathbf{A}$ ,  $\mathbf{B}$ : half precision)
- → 64 floating-point FMA operations per clock (mixed precision)
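The numbers above can be checked with a small sketch in plain C++. `tensor_core_op` spells out the $4 \times 4$ operation $\mathbf{A} \times \mathbf{B} + \mathbf{C} = \mathbf{D}$ and counts its fused multiply-adds; `peak_tflops` turns the per-clock figure into a peak rate. Both function names are hypothetical, and `float` stands in for half precision as an assumption, since standard C++ has no portable half type. Using the 1462 MHz boost clock from the appendix table gives about 120 TFLOP/s, the value listed there; the 125 TFLOP/s quoted on this slide would correspond to a clock of roughly 1.53 GHz, which is an inference and not stated in the source.

```cpp
#include <array>
#include <cmath>

using Mat4 = std::array<std::array<float, 4>, 4>;

// One Tensor Core operation as described above: D = A x B + C on 4x4 matrices.
// Each innermost step is one fused multiply-add (FMA).
// If fma_count is non-null, it receives the number of FMAs performed.
Mat4 tensor_core_op(const Mat4& A, const Mat4& B, const Mat4& C,
                    int* fma_count = nullptr) {
    Mat4 D = C;
    int count = 0;
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            for (int k = 0; k < 4; ++k) {
                D[i][j] = std::fma(A[i][k], B[k][j], D[i][j]);
                ++count;
            }
    if (fma_count) *fma_count = count;
    return D;
}

// Peak rate in TFLOP/s; every FMA counts as two floating-point operations.
constexpr double peak_tflops(int tensor_cores, int fma_per_clock, double clock_ghz) {
    return tensor_cores * fma_per_clock * 2.0 * clock_ghz / 1000.0;
}
```

The triple loop runs $4 \cdot 4 \cdot 4 = 64$ times, which is where the 64 FMA operations per clock come from, and `peak_tflops(640, 64, 1.462)` evaluates to roughly 119.8.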






# Low Latency vs. High Throughput

Maybe GPU's ultimate feature

- CPU: Minimizes latency within each thread
- GPU: Hides latency with computations from other thread warps

Figure: CPU core (low latency) vs. GPU streaming multiprocessor (high throughput). Legend: thread/warp, processing, context switch, ready, waiting.



### CPU vs. GPU

#### Let's summarize this!



#### CPU: Optimized for low latency

- + Large main memory
- + Fast clock rate
- + Large caches
- + Branch prediction
- + Powerful ALU
- − Relatively low memory bandwidth
- − Cache misses costly
- − Low performance per watt

#### GPU: Optimized for high throughput

- + High bandwidth main memory
- + Latency tolerant (parallelism)
- + More compute resources
- + High performance per watt
- − Limited memory capacity
- − Low per-thread performance
- − Extension card

# **Programming GPUs**



### Preface: CPU

A simple CPU program as reference!

SAXPY: $\vec{y} = a\vec{x} + \vec{y}$, with single precision. Part of LAPACK BLAS Level 1.

```
// y = a * x + y for n single-precision elements
void saxpy(int n, float a, float * x, float * y) {
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}

int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

saxpy(n, a, x, y);
```



## Libraries

Programming GPUs is easy: Just don't!

Use applications & libraries





### cuBLAS

- GPU-parallel BLAS (all 152 routines)
- Single, double, complex data types
- Constant competition with Intel's MKL
- Multi-GPU support
- → https://developer.nvidia.com/cublas http://docs.nvidia.com/cuda/cublas



### cuBLAS

#### Code example

```
#include <cublas_v2.h>

float a = 42; int n = 10;
float x[n], y[n];
// fill x, y
cublasHandle_t handle;
cublasCreate(&handle);
float * d_x, * d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));
cudaMallocManaged(&d_y, n * sizeof(y[0]));
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);
// cuBLAS v2 API: handle comes first, the scalar is passed by pointer
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);
cudaFree(d_x); cudaFree(d_y);
cublasDestroy(handle);
```

### cuBLAS

#### Code example

| Code | Step |
|------|------|
| `float a = 42; int n = 10;`<br>`float x[n], y[n];`<br>`// fill x, y` | |
| `cublasHandle_t handle;`<br>`cublasCreate(&handle);` | Initialize |
| `float * d_x, * d_y;`<br>`cudaMallocManaged(&d_x, n * sizeof(x[0]));`<br>`cudaMallocManaged(&d_y, n * sizeof(y[0]));` | Allocate GPU memory |
| `cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);`<br>`cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);` | Copy data to GPU |
| `cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);` | Call BLAS routine |
| `cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);` | Copy result to host |
| `cudaFree(d_x); cudaFree(d_y);`<br>`cublasDestroy(handle);` | Finalize |


### Thrust

Iterators! Iterators everywhere! 🚀

- $\frac{\text{Thrust}}{\text{CUDA}} = \frac{\text{STL}}{\text{C++}}$
- Template library
- Based on iterators
- Data-parallel primitives (scan(), sort(), reduce(),...)
- Fully compatible with plain CUDA C (comes with CUDA Toolkit)
- Great with [](){} lambdas!
- → http://thrust.github.io/ http://docs.nvidia.com/cuda/thrust/



### Thrust

#### Code example with lambdas

```
int a = 42;
int n = 10;
thrust::host_vector<float> x(n), y(n);
// fill x, y

// copy the input to the device
thrust::device_vector<float> d_x = x, d_y = y;

using namespace thrust::placeholders;
// d_y = a * d_x + d_y
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_y.begin(), a * _1 + _2);

// copy the result back to the host
y = d_y;
```



### Thrust

Code example with lambdas

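```
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

constexpr int gGpuThreshold = 10000;

// x and y must be accessible from both host and device (e.g. managed memory)
void saxpy(float *x, float *y, float a, int N) {
    auto r = thrust::counting_iterator<int>(0);
    // host/device lambdas need nvcc's extended lambda support
    auto lambda = [=] __host__ __device__ (int i) {
        y[i] = a * x[i] + y[i];
    };
    if (N > gGpuThreshold)
        thrust::for_each(thrust::device, r, r+N, lambda);
    else
        thrust::for_each(thrust::host, r, r+N, lambda);
}
```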



# **Programming GPUs**

### **Directives**



# **GPU** Programming with Directives

Keepin' you portable

Annotate usual source code by directives

    #pragma acc loop
    for (int i = 0; i < 1; i++) {};

- Also: Generalized API functions acc\_copy();
- Compiler interprets directives, creates according instructions

### Pro

- Portability
  - Other compiler? No problem! To it, it's a serial program
  - Different target architectures from same code
- Easy to program

### Con

- Compiler support limited
- Raw power hidden
- Somewhat harder to debug




# **GPU** Programming with Directives

The power of... two.

OpenMP: Standard for multithread programming on CPU; GPU since 4.0, better since 4.5

```
#pragma omp target map(tofrom:y), map(to:x)
#pragma omp teams num_teams(10) thread_limit(10)
#pragma omp distribute
for ( ) {
    #pragma omp parallel for
    for ( ) {
        // ...
    }
}
```

OpenACC: Similar to OpenMP, but more specifically for GPUs. Might eventually be merged back into the OpenMP standard
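As a sketch of how the OpenMP offloading skeleton could be filled in for SAXPY, the following uses the combined `target teams distribute parallel for` construct on a single loop. The function name `saxpy_omp` is hypothetical, and the array sections `x[0:n]` and `y[0:n]` are needed because the arguments are plain pointers. Whether the loop really runs on a GPU depends on the compiler and its offloading setup, which is assumed here and not shown.

```cpp
// SAXPY with OpenMP target offloading (a sketch, assuming an OpenMP 4.5+ compiler).
// Built without OpenMP support, the pragma is ignored and the loop runs
// serially on the CPU, which is the portability argument made above.
void saxpy_omp(int n, float a, const float* x, float* y) {
    // x is only read on the device (to); y is read and written (tofrom)
    #pragma omp target teams distribute parallel for map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Calling `saxpy_omp(n, a, x, y)` with host arrays looks the same as calling the CPU reference; the data movement is described entirely by the `map` clauses.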



## OpenACC

#### Code example

```
void saxpy_acc(int n, float a, float * x, float * y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
```



# OpenACC

Code example

```
void saxpy_acc(int n, float a, float * x, float * y) {
    // x and y are pointers, so the data clauses state the array extent
    #pragma acc parallel loop copy(y[0:n]) copyin(x[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
int a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);
```






### **Programming GPUs**

### Languages



# Programming GPU Directly

Finally...

Two solutions:

- OpenCL: Open Computing Language by Khronos Group (Apple, IBM, NVIDIA, ...), 2009
  - Platform: Programming language (OpenCL C/C++), API, and compiler
  - Targets CPUs, GPUs, FPGAs, and other many-core machines
  - Fully open source
  - Different compilers available
- CUDA: NVIDIA's GPU platform, 2007
  - Platform: Drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, ...
  - Only NVIDIA GPUs
  - Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA needed for last step
  - Also: CUDA Fortran
- Choose what flavor you like, what colleagues/collaboration is using
- Hardest: Come up with parallelized algorithm


# **CUDA Threading Model**

Warp the kernel, it's a thread!

Methods to exploit parallelism:



- Thread → Block
- Block → Grid
- Threads & blocks in 3D
- Parallel function: kernel
  - `__global__ void kernel(int a, float * b) { }`
  - Access own ID by global variables `threadIdx.x`, `blockIdx.y`, ...
- Execution entity: threads
  - Lightweight → fast switching!
  - 1000s of threads execute simultaneously → order non-deterministic!
- ⇒ SAXPY!
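The indexing scheme behind the threading model can be sketched on the host in plain C++. This is an emulation for illustration only, not CUDA code: `saxpy_thread` and `launch_saxpy` are hypothetical names, and the parameters `block_idx`, `block_dim`, and `thread_idx` mimic CUDA's built-in `blockIdx.x`, `blockDim.x`, and `threadIdx.x`. In actual CUDA, the first function would be a `__global__` kernel and the two loops would be replaced by a kernel launch over a grid of blocks.

```cpp
// Host-side stand-in for a kernel body: one call corresponds to one thread.
void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                  int n, float a, const float* x, float* y) {
    int i = block_idx * block_dim + thread_idx;  // global ID of this thread
    if (i < n)                                   // surplus threads in the last block do nothing
        y[i] = a * x[i] + y[i];
}

// Emulates launching a 1D grid of blocks. On a GPU the threads would run in
// parallel and in no guaranteed order; here they are visited one by one.
void launch_saxpy(int n, float a, const float* x, float* y, int block_dim = 256) {
    int grid_dim = (n + block_dim - 1) / block_dim;  // enough blocks to cover n elements
    for (int block_idx = 0; block_idx < grid_dim; ++block_idx)
        for (int thread_idx = 0; thread_idx < block_dim; ++thread_idx)
            saxpy_thread(block_idx, block_dim, thread_idx, n, a, x, y);
}
```

Because every thread touches only its own element `i`, the result does not depend on the order in which threads run, which is what makes SAXPY a good first kernel.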






### **CUDA SAXPY**

#### With runtime-managed data transfers



### **Programming GPUs**

### **Abstraction Libraries/DSL**



### **Abstraction Libraries & DSLs**

- Libraries with ready-programmed abstractions; partly compiler/transpiler necessary
- Have different backends to choose from for targeted accelerator
- Between Thrust, OpenACC, and CUDA
- Examples: Kokkos, Alpaka, Futhark, HIP, C++AMP, ...



## An Alternative: Kokkos

From Sandia National Laboratories

- C++ library for *performance* portability
- Data-parallel patterns, architecture-aware memory layouts, ...

```
Kokkos::View<double*> x("X", length);
Kokkos::View<double*> y("Y", length);
double a = 2.0;
// Fill x, y
Kokkos::parallel_for(length, KOKKOS_LAMBDA (const int& i) {
    x(i) = a*x(i) + y(i);
});
```

→ https://github.com/kokkos/kokkos/



# **Programming GPUs**

### Tools



### **GPU** Tools

The helpful helpers helping helpless (and others)

- NVIDIA
  - cuda-gdb: GDB-like command line utility for debugging
  - cuda-memcheck: Like Valgrind's memcheck, for checking errors in memory accesses
  - Nsight: IDE for GPU developing, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
  - nvprof: Command line profiler, including detailed performance counters
  - Visual Profiler: Timeline profiling and annotated performance experiments
- OpenCL: CodeXL (Open Source, GPUOpen/AMD) for debugging and profiling





### nvprof

Command that line

Usage: `nvprof ./app`

`$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024`

`==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024`

`==37064== Profiling result:`

| Time(%) | Time     | Calls | Avg      | Min      | Max      | Name                                                           |
|---------|----------|-------|----------|----------|----------|----------------------------------------------------------------|
| 99.19%  | 262.43ms | 301   | 871.86us | 863.88us | 882.44us | `void matrixMulCUDA<int=32>(float*, float*, float*, int, int)` |
| 0.58%   | 1.5428ms |       | 771.39us | 764.65us | 778.12us | [CUDA memcpy HtoD]                                             |
| 0.23%   | 599.40us |       | 599.40us | 599.40us | 599.40us | [CUDA memcpy DtoH]                                             |

`==37064== API calls:`

| Time(%) | Time     | Calls | Avg      | Min      | Max      | Name                  |
|---------|----------|-------|----------|----------|----------|-----------------------|
| 61.26%  | 258.38ms |       | 258.38ms | 258.38ms | 258.38ms | cudaEventSynchronize  |
| 35.68%  | 150.49ms |       | 50.164ms | 914.97us | 148.65ms | cudaMalloc            |
| 0.73%   | 3.0774ms |       | 1.0258ms | 1.0097ms | 1.0565ms | cudaMemcpy            |
| 0.62%   | 2.6287ms |       | 657.17us | 655.12us | 660.56us | cuDeviceTotalMem      |
| 0.56%   | 2.3408ms | 301   | 7.7760us | 7.3810us | 53.103us | cudaLaunch            |
| 0.48%   | 2.0111ms | 364   | 5.5250us | 235ns    | 201.63us | cuDeviceGetAttribute  |
| 0.21%   | 872.52us |       | 872.52us | 872.52us | 872.52us | cudaDeviceSynchronize |



### nvprof

Command that line

With metrics: `nvprof --metrics flop_sp_efficiency ./app`

Application output while profiling `./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024` (excerpt):

- `[Matrix Multiply Using CUDA]`
- `Replaying kernel "void matrixMulCUDA<int=32>(float*, float*, float*, int, int)" (0 of 2)`
- `Performance= 26.61 GFlop/s, Time= 80.697 msec, Size= 2147483648 Ops, WorkgroupSize= 1024 threads/block`

`==37122== Metric result:` for device "Tesla P100-SXM2-16GB (0)", kernel `void matrixMulCUDA<int=32>(float*, float*, float*, int, int)`:

| Invocations | Metric Name        | Metric Description           | Min    | Max    | Avg    |
|-------------|--------------------|------------------------------|--------|--------|--------|
| 301         | flop_sp_efficiency | FLOP Efficiency(Peak Single) | 22.96% | 23.40% | 23.15% |



# **Visual Profiler**

Figure: Screenshot of the NVIDIA Visual Profiler showing the timeline of `matrixMul` on a Tesla K40m, with rows for Runtime API, Driver API, profiling overhead, MemCpy (HtoD), MemCpy (DtoH), compute, and streams. The console pane reports `Performance= 351.01 GFlop/s`.

### Conclusions



## **Summary of Acceleration Possibilities**




## Omitted

### There's so much more!

### What I did not talk about

- Atomic solutions
- Shared memory
- Pinned memory
- Managed memory
- Debugging
- Overlapping streams
- Multi-GPU programming (intra-node; MPI)
- Cooperative groups
- Independent thread progress
- Half precision FP16
- ...








### **Summary & Conclusion**

- GPUs can improve your performance many-fold
- For a fitting, parallelizable application
- Libraries are easiest
- Direct programming (plain CUDA) is most powerful
- OpenACC is somewhere in between (and portable)
- There are many tools helping the programmer
- → See it in action this afternoon at the OpenACC tutorial





### **APPENDIX**



- Further Reading & Links
- GPU Performances
- Glossary
- References



# **Further Reading & Links**

More!

- A discussion of SIMD, SIMT, SMT by Y. Kreinin.
- NVIDIA's documentation: docs.nvidia.com
- NVIDIA's Parallel For All blog



### Volta Performance

| Tesla Product                   | Tesla K40            | Tesla M40           | Tesla P100          | Tesla V100                 |
|---------------------------------|----------------------|---------------------|---------------------|----------------------------|
| GPU                             | GK180 (Kepler)       | GM200 (Maxwell)     | GP100 (Pascal)      | GV100 (Volta)              |
| SMs                             | 15                   | 24                  | 56                  | 80                         |
| TPCs                            | 15                   | 24                  | 28                  | 40                         |
| FP32 Cores / SM                 | 192                  | 128                 | 64                  | 64                         |
| FP32 Cores / GPU                | 2880                 | 3072                | 3584                | 5120                       |
| FP64 Cores / SM                 | 64                   | 4                   | 32                  | 32                         |
| FP64 Cores / GPU                | 960                  | 96                  | 1792                | 2560                       |
| Tensor Cores / SM               | NA                   | NA                  | NA                  | 8                          |
| Tensor Cores / GPU              | NA                   | NA                  | NA                  | 640                        |
| GPU Boost Clock                 | 810/875 MHz          | 1114 MHz            | 1480 MHz            | 1462 MHz                   |
| Peak FP32 TFLOPS <sup>1</sup>   | 5                    | 6.8                 | 10.6                | 15                         |
| Peak FP64 TFLOPS <sup>1</sup>   | 1.7                  | 0.21                | 5.3                 | 7.5                        |
| Peak Tensor TFLOPS <sup>1</sup> | NA                   | NA                  | NA                  | 120                        |
| Texture Units                   | 240                  | 192                 | 224                 | 320                        |
| Memory Interface                | 384-bit GDDR5        | 384-bit GDDR5       | 4096-bit HBM2       | 4096-bit HBM2              |
| Memory Size                     | Up to 12 GB          | Up to 24 GB         | 16 GB               | 16 GB                      |
| L2 Cache Size                   | 1536 KB              | 3072 KB             | 4096 KB             | 6144 KB                    |
| Shared Memory Size / SM         | 16 KB/32 KB/48 KB    | 96 KB               | 64 KB               | Configurable up to 96 KB   |
| Register File Size / SM         | 256 KB               | 256 KB              | 256 KB              | 256 KB                     |
| Register File Size / GPU        | 3840 KB              | 6144 KB             | 14336 KB            | 20480 KB                   |
| TDP                             | 235 Watts            | 250 Watts           | 300 Watts           | 300 Watts                  |
| Transistors                     | 7.1 billion          | 8 billion           | 15.3 billion        | 21.1 billion               |
| GPU Die Size                    | 551 mm²              | 601 mm²             | 610 mm²             | 815 mm²                    |
| Manufacturing Process           | 28 nm                | 28 nm               | 16 nm FinFET+       | 12 nm FFN                  |

Figure: Tesla V100 performance characteristics in comparison [volta-pictures]



## Appendix

### **Glossary & References**



## **Glossary** I

- API: A programmatic interface to software by well-defined functions. Short for application programming interface.
- ATI: Canada-based GPU manufacturing company; bought by AMD in 2006.
- CUDA: Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.
- DSL: A Domain-Specific Language is a specialization of a more general language to a specific domain.
- MPI: The Message Passing Interface, an API definition for multi-node computing.
- NVIDIA: US technology company creating GPUs.





## **Glossary II**

- OpenACC: Directive-based programming, primarily for many-core machines.
- OpenCL: The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.
- OpenMP: Directive-based programming, primarily for multi-threaded machines.
- POWER: CPU architecture from IBM, earlier: PowerPC. See also POWER8.
- POWER8: Version 8 of IBM's POWER processor, available also under the OpenPOWER Foundation.
- SAXPY: Single-precision  $A \times X + Y$ . A simple code example of scaling a vector and adding an offset.





### **Glossary III**

- Thrust: A parallel algorithms library for (among others) GPUs. See https://thrust.github.io/.
- Volta: GPU architecture from NVIDIA (announced 2017).



### **References I**

- [2] Chris McClanahan. "History and Evolution of GPU Architecture". In: A Survey Paper (2010). URL: http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpuhist-paper.pdf.
- [3] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/.
- [7] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/.



## **References: Images, Graphics I**

- [1] Igor Ovsyannykov. Yarn. Freely available at Unsplash. URL: https://unsplash.com/photos/hvILKk7SlH4.
- [4] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/.
- [5] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/.
- [6] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/Volta-Architecture-Whitepaper-v1.0.pdf.

