# From Milliwatts to PFLOPS

#### GridKa School 2014

Karlsruhe, Germany, 02 September 2014

Dr.-Ing. Michael Klemm, Intel



INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel product plans in this presentation do not constitute Intel plan of record product roadmaps. Please contact your Intel representative to obtain Intel's current plan of record product roadmaps.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804

All products, computer systems, dates, and figures specified are preliminary based on current expectations, and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. Go to: http://www.intel.com/products/processor\_number

Intel processors, chipsets, and desktop boards may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel, Intel Xeon, Intel Xeon Phi, Intel Hadoop Distribution, Intel Cluster Ready, Intel OpenMP, Intel Cilk Plus, Intel Threading Building Blocks, Intel Cluster Studio, Intel Parallel Studio, Intel Coarray Fortran, Intel Math Kernel Library, Intel Enterprise Edition for Lustre Software, Intel Composer, the Intel Xeon Phi logo, the Intel Xeon logo and the Intel logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Intel does not control or audit the design or implementation of third party benchmark data or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmark data are reported and confirm whether the referenced benchmark data are accurate and reflect performance of systems available for purchase.

Other names, brands, and images may be claimed as the property of others.

Copyright © 2013, Intel Corporation. All rights reserved.

# **Legal Disclaimers: Performance**

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, go to <u>http://www.intel.com/performance/resources/benchmark\_limitations.htm</u>.

Intel does not control or audit the design or implementation of third party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites or others where similar performance benchmarks are reported and confirm whether the referenced benchmarks are accurate and reflect performance of systems available for purchase.

Relative performance is calculated by assigning a baseline value of 1.0 to one benchmark result, and then dividing the actual benchmark result for the baseline platform into each of the specific benchmark results of each of the other platforms, and assigning them a relative performance number that correlates with the performance improvements reported.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. See <u>http://www.spec.org</u> for more information.

TPC Benchmark is a trademark of the Transaction Processing Council. See <u>http://www.tpc.org</u> for more information.

SAP and SAP NetWeaver are the registered trademarks of SAP AG in Germany and in several other countries. See <u>http://www.sap.com/benchmark</u> for more information.


Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.



# **Optimization Notice**

Intel<sup>®</sup> compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel<sup>®</sup> and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel<sup>®</sup> Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel<sup>®</sup> compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel<sup>®</sup> compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel<sup>®</sup> compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel<sup>®</sup> Streaming SIMD Extensions 2 (Intel<sup>®</sup> SSE2), Intel<sup>®</sup> Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel<sup>®</sup> and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not. Notice revision #20101101



Copyright ©2014 Intel Corporation. All rights reserved.

# The Path to Discovery & Innovation

- **Experiment:** Observation
- **Theory:** Mathematical Model
- **HPC:** Numerical Simulation






# **Performance** it's all about Parallelism and Energy Efficiency

# **HPC IMPERATIVES**

- **High Performance:** Capabilities & Capacity
- **Energy Efficiency:** TCO
- **Ease of Use:** Productivity & Sustainability




# Intel's Assets for HPC



## ... and many, many application experts




# **Simplicity** is the ultimate sophistication.

- Leonardo da Vinci

# Transforming the Economics of HPC



#### Executing to Moore's Law

Predictable Silicon Track Record – alive and well at Intel. Enabling new devices with higher performance and functionality while controlling power, cost, and size.



Future options subject to change without notice.

#### Tick-Tock Development Cycles Integrate. Innovate.



\*\* Intel® Architecture Instruction Set Extensions Programming Reference, #319433-012A, FEBRUARY 2012

++ Intel® Architecture Instruction Set Extensions Programming Reference, #319433-015, JULY 2013

Potential future options, subject to change without notice.





# From MILLIWATTS to TERAFLOPS



- Smartphones with Intel® Inside
- Intel® Xeon® Processors
- Intel<sup>®</sup> Many Integrated Core Architecture

# Energy Efficient



#### **Driving Innovation and Integration** Enabled by Leading Edge Process Technologies





#### Coming in the Future

Integrated Today

#### SYSTEM LEVEL BENEFITS IN COST, POWER, DENSITY, SCALABILITY & PERFORMANCE



12th International GridKa School 2014

## **The Magic of Integration** Moore's Law at Work & Architecture Innovations



- 1970s: CRAY-1, 150 MFLOPS
- 2013: Intel® Xeon Phi<sup>™</sup>, 1000000 MFLOPS




**#1** on the TOP500 list, June 2014:

- **33** PFLOPS HPL, 54 PFLOPS Peak
- **32000** Intel<sup>®</sup> Xeon<sup>®</sup> E5v2 Processors
- **48000** Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessors
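
Taken together, the two performance figures give the fraction of theoretical peak that the HPL run sustained. This ratio is derived here from the rounded numbers above and is not stated on the slide:

$$
\frac{R_{\text{HPL}}}{R_{\text{peak}}} = \frac{33\ \text{PFLOPS}}{54\ \text{PFLOPS}} \approx 0.61
$$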

# Increasing Processor Performance

Through Many-Core Technologies for Highly Parallel Workloads



For illustration only. All dates, product descriptions, features, availability, and plans are forecasts and subject to change without notice.

# "Big Core" – "Small Core"



Different Optimization Points

Common Programming Models and Architectural Elements



#### Intel<sup>®</sup> Xeon<sup>®</sup> Processor

- Simply aggregating more cores generation after generation is not sufficient
- Performance per core/thread must increase each generation and be as fast as possible
- Power envelopes should stay flat or go down each generation
- Balanced platform (Memory, I/O, Compute)
- Cores, Threads, Caches, SIMD

#### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor

- Optimized for highest compute per watt
- Willing to trade performance per core/thread for aggregate performance
- Power envelopes should also stay flat or go down every generation
- Optimized for highly parallel workloads
- Cores, Threads, Caches, SIMD

For illustration only




# Intel SIMD Evolution



Potential future options and features subject to change without notice.

# Intel Roadmap to Exascale

#### Intel's Exascale Goal:

Reach Exascale by ~2020 with Intel technologies including Intel® Xeon Phi™ Coprocessors



#### Intel<sup>®</sup> Xeon Phi<sup>™</sup> Product Family

Key ingredient in Intel Exascale Roadmap:

- Programmability
- Power efficiency
- Scalability
- Resiliency

Future options subject to change without notice.



# **Common Programming Models & Software Tools**

Common Intel<sup>®</sup> architecture enables applications to run across the full spectrum of Intel<sup>®</sup> Xeon<sup>®</sup> family based servers so programmers don't have to "start over".





Use the same development tools you used for Intel<sup>®</sup> Xeon<sup>®</sup> processors, such as Intel<sup>®</sup> Cluster Studio XE and Intel<sup>®</sup> Parallel Studio XE



# Intel<sup>®</sup> Xeon<sup>®</sup> E5 Processor Family

Foundation of HPC Performance suited for full scope of workloads

Industry leading performance and performance/watt for serial & parallel workloads

General purpose with focus on fast single core/thread performance with "moderate" number of cores



www.intel.com/xeon







#### www.intel.com/xeonphi



# Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor

- Up to 61 Cores, 244 Threads
- 512-bit SIMD instructions
- \>1TFLOPS DP-F.P. peak
- Up to 16GB GDDR5 Memory, 352 GB/s
- PCIe\* x16
- Up to 300W TDP (card)
- 22nm with the world's first 3-D Tri-Gate transistors
- Linux\* operating system
- IP addressable native node
- Common x86/IA Programming Models and SW-Tools
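
The ">1 TFLOPS" peak can be related to the core count and vector width listed above. The sketch below is only a back-of-the-envelope calculation: the 1.238 GHz clock and the factor of 2 flops per lane per cycle (one fused multiply-add) are assumptions that do not appear on the slide, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Theoretical double-precision peak in GFLOPS.
 * cores and simd_bits come from the slide; clock_ghz and
 * flops_per_lane are assumptions supplied by the caller. */
double peak_dp_gflops(int cores, double clock_ghz, int simd_bits, int flops_per_lane)
{
    assert(cores > 0 && simd_bits > 0 && simd_bits % 64 == 0);
    int dp_lanes = simd_bits / 64;   /* a 512-bit vector holds 8 doubles */
    return cores * clock_ghz * dp_lanes * flops_per_lane;
}

/* Hardware threads for a given number of threads per core
 * (244 threads / 61 cores = 4 on the slide). */
int hw_threads(int cores, int threads_per_core)
{
    return cores * threads_per_core;
}

/* Prints the estimate for the slide's configuration.
 * 1.238 GHz is an assumed clock; 2 = one fused multiply-add per lane per cycle. */
void print_peak_example(void)
{
    printf("%.1f GFLOPS peak, %d threads\n",
           peak_dp_gflops(61, 1.238, 512, 2), hw_threads(61, 4));
}
```

Calling `print_peak_example()` from a `main` function prints the estimate. A different clock or core count changes the result proportionally.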



## Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor Microarchitecture Overview



For illustration only.


#### **Intel<sup>®</sup> Xeon Phi<sup>™</sup> Coprocessor** Codename: Knights Corner - It is so much more



Restrictive architectures limit the ability of applications to use arbitrary nested parallelism, function calls and threading models




# Flexible Execution Models

Optimized Performance for all Workloads





# Manycore Processors – Example HPC Use Cases





For illustration only.

# **Highly Parallel Applications**



parallel processor (<1: Intel® Xeon® faster) – For illustration only

Efficient vectorization, threading, and parallel execution drive higher performance for suitable scalable applications
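
As a minimal sketch of combining threading and vectorization in a single loop, the function below uses the OpenMP 4.0 combined `parallel for simd` construct. The name `scaled_add` and its signature are hypothetical. If OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]
 * The iterations are independent, so the loop can be split across
 * threads and each thread's chunk can be executed with SIMD instructions. */
void scaled_add(size_t n, float a, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

Whether this pays off depends on the loop being long enough to amortize thread startup, and the arrays must not overlap.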



# Parallel Programming for Intel<sup>®</sup> Architecture (IA)

| Level       | Approach                                                                                              |
|-------------|-------------------------------------------------------------------------------------------------------|
| NODES       | Use Intel<sup>®</sup> MPI, Co-Array Fortran                                                           |
| CORES       | Use threads directly or e.g. via OpenMP*, pthreads<br>Use tasking, Intel® TBB / Cilk™ Plus            |
| VECTORS     | Intrinsics, auto-vectorization, vector-libraries<br>Language extensions for vector programming (SIMD) |
| BLOCKING    | Use caches to hide memory latency<br>Organize memory access for data reuse                            |
| DATA LAYOUT | Structure of arrays facilitates vector loads / stores, unit stride<br>Align data for vector accesses  |

#### Parallel programming to utilize the hardware resources, in an abstracted and portable way
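
The DATA LAYOUT and VECTORS rows can be illustrated with a small sketch that contrasts an array of structures with a structure of arrays. All type and function names, and the size `N_PARTICLES`, are hypothetical. The 64-byte alignment is chosen to match one 512-bit vector and uses the C11 `_Alignas` specifier.

```c
#include <assert.h>
#include <stddef.h>

#define N_PARTICLES 1024   /* hypothetical problem size */

/* Array of structures (AoS): the x values of consecutive particles are
 * three floats apart, so a loop over x alone is a strided access. */
struct particle_aos { float x, y, z; };

/* Structure of arrays (SoA): each coordinate is contiguous (unit stride)
 * and each array starts on a 64-byte boundary (one 512-bit vector). */
struct particles_soa {
    _Alignas(64) float x[N_PARTICLES];
    _Alignas(64) float y[N_PARTICLES];
    _Alignas(64) float z[N_PARTICLES];
};

/* AoS version: stride of sizeof(struct particle_aos) between elements. */
void shift_x_aos(struct particle_aos *p, size_t n, float dx)
{
    for (size_t i = 0; i < n; i++)
        p[i].x += dx;
}

/* SoA version: unit-stride loop, marked for vectorization. */
void shift_x_soa(struct particles_soa *p, size_t n, float dx)
{
    assert(n <= N_PARTICLES);
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        p->x[i] += dx;
}
```

Both functions compute the same result. The difference is only in how the data sits in memory, which determines whether the compiler can use contiguous vector loads and stores.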


#### More Cores. Wider Vectors. Performance Delivered. Intel® Parallel Studio XE and Intel® Cluster Studio XE





# **OpenMP 4.0 Heterogeneous Programming**

```c
#pragma omp target data device(0) map(alloc:tmp[:N]) map(to:input[:N]) map(from:res)
{
    /* First offloaded loop: fill tmp on the device. */
    #pragma omp target device(0)
    #pragma omp parallel for
    for (i=0; i<N; i++)
        tmp[i] = some_computation(input[i], i);

    /* tmp stays allocated on the device while the host continues. */
    do_some_other_stuff_on_host();

    /* Second offloaded loop: reduce tmp into res. */
    #pragma omp target device(0)
    #pragma omp parallel for reduction(+:res)
    for (i=0; i<N; i++)
        res += final_computation(tmp[i], i);
}
```
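
The fragment above omits declarations and the called functions. The following is a self-contained sketch of the same pattern, not the original code: `some_computation` and `final_computation` are replaced by hypothetical stand-ins, the wrapper `run_offload` is invented for illustration, and `res` is initialized and mapped `tofrom` (also on the second `target` construct) so that the reduction starts from a defined value under newer OpenMP versions as well. Without an offload device, or without OpenMP enabled, the loops run on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the functions named on the slide. */
#pragma omp declare target
static double some_computation(double x, int i)  { return 2.0 * x + i; }
static double final_computation(double t, int i) { return t - i; }
#pragma omp end declare target

/* Placeholder for independent host-side work. */
static void do_some_other_stuff_on_host(void) { }

/* Runs both offloaded loops and returns the reduced result.
 * tmp is scratch space of n elements; it lives on the device for the
 * whole target data region and is never copied back. */
double run_offload(const double *input, double *tmp, int n)
{
    assert(input != NULL && tmp != NULL && n > 0);
    double res = 0.0;

    #pragma omp target data device(0) map(alloc:tmp[:n]) map(to:input[:n]) map(tofrom:res)
    {
        #pragma omp target device(0)
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            tmp[i] = some_computation(input[i], i);

        do_some_other_stuff_on_host();

        #pragma omp target device(0) map(tofrom:res)
        #pragma omp parallel for reduction(+:res)
        for (int i = 0; i < n; i++)
            res += final_computation(tmp[i], i);
    }
    return res;
}
```

The point of the enclosing `target data` region is that `input` is transferred once and `tmp` never leaves the device, so the two `target` regions do not pay for repeated transfers.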



# Case Study: NWChem CCSD(T) Method



Edoardo Apra, Michael Klemm, and Karol Kowalski. Efficient Implementation of Many-body Quantum Chemical Methods on the Intel Xeon Phi Coprocessor. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, New Orleans, LA, November 2014. To appear.

Performance tests are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. System configuration: Atipa Visione vf442 server with two Intel Xeon E5-2670 8-core processors at 2.6 GHz (128 GB DDR3 with 1333 MHz, Scientific Linux release 6.5) and Intel C600 IOH, two Intel Xeon Phi coprocessors 5110P (GDDR5 with 3.6 GT/sec, driver v3.1.2-1, flash image/micro OS 2.1.02.0390, Intel Composer XE 14.0.1.106).


#### Next Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor Codename: Knights Landing



- Designed using Intel's cutting-edge **14nm process**
- Not bound by "offloading" bottlenecks: Standalone CPU or PCIe Coprocessor
- Leadership compute & memory bandwidth: Integrated On-Package Memory


All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

## Next Intel<sup>®</sup> Xeon Phi<sup>™</sup> Processor Codename: Knights Landing

Intel<sup>®</sup> Silvermont Arch. Enhanced for HPC

Integrated Fabric

Processor Package

**Compute:**

- Energy-efficient IA cores<sup>2</sup>
- Microarchitecture enhanced for HPC<sup>3</sup>
- **3x** Single Thread Performance vs Knights Corner<sup>4</sup>
- Intel Xeon Processor Binary Compatible<sup>5</sup>

**On-Package Memory:**

- Up to 16GB at launch
- 5X Bandwidth vs DDR4<sup>7</sup>
- 5x Power Efficiency<sup>6</sup>
- Jointly Developed with Micron Technology




Journey to **Exascale** 

# Assume Exascale Computing at 20MW ....

#### New Forms of Energy



**Space Exploration** 





**Medical Innovation** 



And many others ....





Today's #25 system in a rack!



100x the performance of today's phone at the same power

1 Gigaflop **20mW** 
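
The 20 mW figure follows directly from the 20 MW assumption. An exaflop is $10^{18}$ floating-point operations per second, so the required energy efficiency is

$$
\frac{10^{18}\ \text{FLOPS}}{20 \times 10^{6}\ \text{W}} = 5 \times 10^{10}\ \text{FLOPS/W} = 50\ \text{GFLOPS/W},
$$

and at that efficiency one gigaflop costs

$$
\frac{10^{9}\ \text{FLOPS}}{5 \times 10^{10}\ \text{FLOPS/W}} = 0.02\ \text{W} = 20\ \text{mW}.
$$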



For illustration and concept only.


# **HPC:** The Path to Exascale

Processors Intel® Xeon® Processor











# **HPC:** The Path to Exascale (cont.)


#### Networking



#### Reliability & Resiliency



#### Power Management






# Intel TeraScale Research Areas

## MANY-CORE COMPUTING



#### **Teraflops** of computing power

# STACKED Memory



#### **Terabytes** of memory bandwidth

# SILICON Photonics



#### **Terabits** of I/O throughput


Future vision, does not represent real products.


# The Power of Solutions: Big Data Example

# Sort 1TB of Data:



















# **PERFORMANCE** it's all about **SCALABLE** parallelism

# **THINK PARALLEL!**

# INTEL INNOVATION & LEADERSHIP FOR THE ROAD AHEAD

# Thank You.

