### ExaNeSt@FPGA

#### INAF ICT Workshop 2019

## David Goz

With L. Tornatore, G. Taffoni, S. Bertocco, A. Ragagnin, G. Murante and I. Coretti









## Next generation computing roadmap

Most of us rely on numerical codes to perform calculations.



Cosmological simulation of galaxy formation using the GADGET code (Springel 2005). Simulated disk galaxy in a cosmological environment at the present epoch (Goz et al. 2015). NGC 4414, a typical spiral galaxy in the constellation Coma Berenices, about 60 million light-years from Earth (credit: HST).

- HPC numerical simulations are among the most effective instruments for comparing observations with theoretical models;
- the new generation of observational facilities also requires high-performance data reduction and analysis tools.





#### Why Exa-scale?

"Crucial problems that we can only hope to address computationally require us to deliver effective computing power orders-of-magnitude greater than we can deploy today." (DOE's Office of Science, 2012)

"Exa-scale" is the next step up in scale that HPC needs to achieve in the coming years.

It is defined as a sustained performance of about $10^{18}$ flop/s at an energy efficiency of ~50 Gflop/s per watt.
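The two figures in this definition together imply a power envelope. As a rough back-of-the-envelope sketch (the function name is illustrative, and the 50 Gflop/W figure is read here as 50 Gflop/s per watt), dividing the target performance by the target efficiency gives the power such a machine would draw:

```python
# Back-of-the-envelope power budget implied by the exascale definition.
# Assumption: "~50 Gflop/W" is read as 50e9 flop/s delivered per watt.
SUSTAINED_FLOPS = 1e18            # flop/s
EFFICIENCY_FLOPS_PER_WATT = 50e9  # flop/s per watt


def power_budget_watts(flops, flops_per_watt):
    """Power needed to sustain `flops` at the given efficiency."""
    return flops / flops_per_watt


power_mw = power_budget_watts(SUSTAINED_FLOPS, EFFICIENCY_FLOPS_PER_WATT) / 1e6
print(f"Implied power budget: {power_mw:.0f} MW")
```

Under these assumptions the budget comes out at about 20 MW, which is why energy efficiency, and not only peak speed, shapes exascale design.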

There are deep consequences in the way we design, write, and optimize scientific codes.





# ExaNeSt European project

The Horizon 2020 ExaNeSt project aims to demonstrate the feasibility of an exascale HPC system based on European technology.

Who we are: the ExaNeSt consortium combines industrial and academic research expertise.

How we do it: following a co-design approach,

- applications drive the HW development and test it;
- applications are re-designed to develop new HPC SW able to exploit exascale HW.





ExaNeSt compute unit (QFDB):

- 4 Xilinx Zynq UltraScale+ FPGAs;
- 4 ARMv8 cores @ 1.5 GHz per FPGA;
- 16 GB of DDR4 memory per FPGA;
- one NVM SSD storage device.

The ExaNeSt Quad-FPGA Daughter Board (QFDB)

INAF ICT Workshop, Milano 2019 - David Goz



# ExaNeSt European project



ExaNeSt testbed in a liquid-cooled rack



#### Project information:

- web site: http://www.exanest.eu;
- duration: 01 Dec. 2015 to 31 May 2019;
- budget: cost €8.44 M, EU contribution €8.44 M.

#### ExaNeSt testbed prototype:

- the HPC testbed currently consists of 6 liquid-cooled blades, which contain a total of 24 QFDB boards;
- 96 nodes (FPGA), 384 ARM A53 CPUs (1536 64-bit A53 cores), 1.5 TB of DRAM, and 6 TB of SSD storage;
- the interconnection topology is a 3D torus, and the network interfaces feature ExaNeSt's own remote DMA engines with 1024 channels each, multiple mailbox queues, and resilience features;
- more blades will be added.
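In a 3D torus every node has six nearest neighbours, and the links wrap around at the edges of the grid. The sketch below illustrates that addressing scheme only. The 4 x 4 x 6 layout is a hypothetical arrangement chosen because it holds 96 nodes (24 QFDBs with 4 FPGAs each); the real dimensions of the testbed are not stated here.

```python
# Nearest neighbours in a 3D torus (illustrative sketch, not ExaNeSt code).
QFDB_BOARDS = 24
FPGAS_PER_QFDB = 4
NODES = QFDB_BOARDS * FPGAS_PER_QFDB  # 96 nodes, as quoted for the testbed

# Hypothetical torus dimensions: any (nx, ny, nz) with nx*ny*nz == NODES works.
DIMS = (4, 4, 6)


def torus_neighbours(coord, dims):
    """Return the six neighbours of `coord`, wrapping around each axis."""
    x, y, z = coord
    nx, ny, nz = dims
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


print(torus_neighbours((0, 0, 0), DIMS))
```

The modulo operation provides the wrap-around: the node at the origin is directly linked to the node at the far end of each axis, so no node sits on an "edge" of the network.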



#### Heterogeneous hardware

Compared with traditional CPUs, node-level heterogeneous architectures offer higher peak performance.






## Embedded & mobile hardware

Compared with traditional hardware, heterogeneous System-on-Chip (SoC) hardware is more energy-efficient and more cost-efficient.



- SoCs contrast with the motherboard-based PC architecture;
- an SoC integrates CPU, GPU and memory interfaces into a single chip;
- an SoC has reduced modularity and replaceability of components;
- energy efficiency is the main concern;
- ARM is the de facto SoC technology.









#### CPU vs GPU





### A high-performance problem solved in parallel







## Two types of parallelism

- Task parallelism: different people are performing different tasks at the same time. Work suitable for CPU!
- Data parallelism: different people are performing the same task at the same time, but on different equivalent and independent data. Work suitable for GPU!
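The two patterns can be contrasted in a few lines. This is a minimal sketch using Python's standard `concurrent.futures` module. The function names are hypothetical, and the thread pool only illustrates how the work is structured, not the speed-up a CPU or GPU would deliver.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical, unrelated tasks used to illustrate task parallelism.
def read_input():
    return [1.0, 2.0, 3.0, 4.0]


def load_config():
    return {"scale": 2.0}


# One operation applied to every element: data parallelism.
def square(x):
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: two different tasks are submitted at the same time.
    data_future = pool.submit(read_input)
    config_future = pool.submit(load_config)
    data = data_future.result()
    config = config_future.result()

    # Data parallelism: the same task runs on independent pieces of data.
    squares = list(pool.map(square, data))

print(squares)
```

In the first half each worker does something different. In the second half every worker runs the same function on its own element, which is the pattern that maps naturally onto the many identical cores of a GPU.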







#### What is a Field Programmable Gate Array (FPGA)?




An FPGA is a semiconductor device that can be programmed (i.e. it has no fixed architecture):

- the desired functionality of the FPGA can be (re)programmed by downloading a configuration into the device;
- flexible interconnect and a highly parallel, customizable architecture (both data parallelism and task parallelism);
- optimal power efficiency (3-4x better than a GPU);
- low-level programming is required for high performance;
- currently, double-precision arithmetic is resource-hungry and performs poorly.



### The INAF astrophysical codes in ExaNeSt

Hy-Nbody (D. Goz et al. 2019): a direct N-body code to simulate cluster dynamics and close encounters.
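A direct N-body code computes the force on each particle by summing over all other particles, so the cost grows as $N^2$. The following is a minimal sketch of that direct-summation kernel, not Hy-Nbody's actual implementation: the function name, the units with $G = 1$, and the optional `softening` parameter are assumptions made for illustration.

```python
import math


def accelerations(pos, mass, softening=0.0, G=1.0):
    """Direct-summation gravitational accelerations, O(N^2) pair loop.

    pos  : list of [x, y, z] positions
    mass : list of particle masses
    softening, G : illustrative parameters (assumptions of this sketch)
    """
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + softening ** 2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    return acc


# Two unit masses one unit of length apart attract each other equally.
acc = accelerations([[-0.5, 0.0, 0.0], [0.5, 0.0, 0.0]], [1.0, 1.0])
print(acc)
```

Every iteration of the inner loop is independent of the others, which is why this kernel is a natural candidate for data-parallel devices such as GPUs and FPGAs.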

**PINOCCHIO** (P. Monaco, T. Theuns & G. Taffoni, 2002): a fast code, based on Lagrangian perturbation theory, to generate catalogues of cosmological dark matter halos and their merger history.





GADGET (V. Springel 2005): an N-body and hydrodynamical code for large-scale, high-resolution numerical simulations of cosmic structure formation and evolution.







# Computing Platforms

| Platform    | Desktop                         | Firefly-RK3399           | ZedBoard                | QFDB (ExaNeSt)                |
|-------------|---------------------------------|--------------------------|-------------------------|-------------------------------|
| Board / SoC | ASUS P8B75-M LX                 | Rockchip RK3399          | Xilinx Zynq-7000 MPSoC  | Xilinx Zynq UltraScale+ MPSoC |
| CPU         | Intel <u>i7-3770×4</u>          | ARM <u>A72×2+A53×4</u>   | ARM A9×2                | ARM (A53×4)×4                 |
| GPU         | Nvidia GeForce <u>GTX-1080</u>  | ARM <u>Mali-T864</u>     | None                    | None                          |
| FPGA        | None                            | None                     | Zynq-7000               | <u>(Zynq-US+)</u>×4           |
| RAM         | 16GB DDR3                       | 4GB DDR3                 | 512MB DDR3              | 16GB×4 DDR4                   |





## INCAS (INtensive Clustered Arm-SoC)

(INAF technical report: S. Bertocco DOI: 10.20371/INAF/PUB/2018\_00004)



| Component       | Specification                                                        |
|-----------------|----------------------------------------------------------------------|
| Nodes available | 8                                                                    |
| SoC             | Rockchip RK3399                                                      |
| CPU/node        | Six-core ARM 64-bit (dual-core Cortex-A72 and quad-core Cortex-A53)  |
| GPU/node        | ARM Mali-T864 MP4 quad-core                                          |
| RAM/node        | 4GB dual-channel DDR3                                                |
| Network         | 1 Gbps Ethernet                                                      |
| OS              | Ubuntu 16.04 LTS                                                     |
| Compiler        | gcc version 7.3.0                                                    |
| MPI             | OpenMPI version 3.0.1                                                |
| OpenCL          | OpenCL version 2.2                                                   |
| Job scheduler   | SLURM version 17.11                                                  |

Cluster components.





# Computational performance [Hy-Nbody]



- GTX and UltraScale+ outperform the others;
- UltraScale+ ≈ 100 GFLOPS FP64.

Only GPUs benefit from EX-arithmetic:

- for GTX, $t_{DP}/t_{EX} \approx 20$;
- for Mali, $t_{DP}/t_{EX} \approx 2$.
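The slides do not define EX-arithmetic. Assuming it denotes extended precision emulated with pairs of single-precision numbers (often called "double-single"), the basic building block is an error-free addition that returns the rounded sum together with the rounding error. The sketch below emulates single precision in Python with the standard `struct` module. The helper names `f32` and `two_sum` are illustrative, and this is not taken from Hy-Nbody.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's error-free addition, with every operation rounded to single.

    Returns (s, err) where s is the single-precision sum and err is the
    part of a + b that was lost to rounding.
    """
    s = f32(a + b)
    bb = f32(s - a)
    err = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, err


small = 2.0 ** -30              # below single-precision resolution near 1.0
plain = f32(1.0 + small)        # plain single precision drops `small`
hi, lo = two_sum(1.0, small)    # the (hi, lo) pair keeps it
print(plain, hi, lo)
```

The pair `(hi, lo)` carries information that a single float cannot hold, at the cost of several single-precision operations per addition. That trade pays off only on hardware where single precision is much faster than double, which is consistent with the ratios quoted above for the GPUs.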





# Energy consumption [Hy-Nbody]







# Multi-node performance [Hy-Nbody]







# Multi-node performance [PINOCCHIO]







# Conclusions

- the use of heterogeneous computing in scientific research (not only HPC) appears to be inevitable;
- SoC technology is emerging as a promising alternative to "traditional" technologies for HPC;
- we will be forced to re-engineer our applications in order to exploit new exascale computing facilities (different devices, complex memory hierarchies);
- we will be forced to devise algorithms with high performance per watt.

# Future developments

- To assess the energy footprint of our applications using the whole ExaNeSt prototype and compare it with HPC resources.





# References

INAF Technical Reports:

- Goz D., Tornatore L., Bertocco S. and Taffoni G. DOI: 10.20371/INAF/PUB/2018\_00002;
- Goz D., Tornatore L., Bertocco S. and Taffoni G. DOI: 10.20371/INAF/PUB/2018\_00005;
- Goz D., Tornatore L., Bertocco S. and Taffoni G. DOI: 10.20371/INAF/PUB/2018\_00006;
- Bertocco S., Goz D., Tornatore L. and Taffoni G. DOI: 10.20371/INAF/PUB/2018\_00004.

Papers:

- Direct N-body code on low-power embedded ARM GPUs, Goz D., Bertocco S., Tornatore L., Taffoni G., Computing Conference 2019 proceedings;
- Low power high performance computing on Arm system-on-chip in Astrophysics, Taffoni G., Bertocco S., Coretti I., Goz D., Ragagnin A., Tornatore L., Springer series in "Advances in Intelligent Systems and Computing" 2019.










This work was carried out within the ExaNeSt (FET-HPC) project (grant no. 671553), funded by the European Union's Horizon 2020 research and innovation programme.



