### Making HPC more accessible: Effective HPC programming via domain specific abstractions

Nick Brown n.brown@epcc.ed.ac.uk

Emilien Bauer emilien.bauer@ed.ac.uk









### The challenge

 Writing parallel code that can exploit present-day supercomputers is extremely hard and requires highly specialist skills





 It is no longer tenable to directly leverage serial languages and add in our own parallelism (e.g. MPI, CUDA, vectorisation etc)







# Domain Specific Languages to the rescue!


(Figure: logos of existing DSLs and code generation frameworks, including CLAN, Saiph, PSyclone, ANTARÉ and NMODL; one is captioned "a language for fast, portable computation on images and tensors".)

- Raise the abstraction level so the programmer can provide a high level description of their algorithm that the compiler can then exploit to make tricky, low level decisions around parallelism
- *Languages* is a poor term, *abstractions* is far better



#### Breaking down silos

- The elephant in the room is that these are all heavily siloed and reinvent the wheel
  - Requires significant development effort from the DSL designers
  - Risk for users (e.g. will the DSL be maintained in the future?)
  - Challenges supporting new architectures



There is therefore a sweet spot in the middle, where we gain the best of both worlds

## Step in MLIR and LLVM

 LLVM is the ubiquitous compiler framework that has been around for over 20 years



 In addition to providing its own compilers, LLVM underpins the AMD, Intel and Arm compilers, as well as the Cray C/C++ compiler and AMD Xilinx's FPGA HLS technology.



 MLIR was open-sourced by Google in 2019 and has been part of the main LLVM repository since the end of that year



- At its core MLIR is a framework for developing different types of Intermediate Representations (IR) at different levels
- Numerous IR dialects and transformations are provided, which enable lowering between these levels
- Can add your own easily
- A big community has grown up
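To make the idea of dialects and lowering concrete, here is a minimal Python sketch. It does not use MLIR or xDSL: the `Op` class, the dialect names `toy_stencil` and `toy_arith`, and the function `lower_stencil_to_arith` are hypothetical names invented purely for illustration. It only shows the general shape of a lowering pass, where a high-level operation is rewritten into several lower-level ones and everything else is left alone.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """A toy IR operation: a dialect-qualified name plus its operands."""
    name: str
    operands: list = field(default_factory=list)


def lower_stencil_to_arith(ops):
    """Rewrite each high-level 'toy_stencil.sum3' op into two lower-level
    'toy_arith.add' ops; every other op is passed through unchanged."""
    lowered = []
    for op in ops:
        if op.name == "toy_stencil.sum3":
            left, centre, right = op.operands
            partial = Op("toy_arith.add", [left, centre])
            lowered.append(partial)
            # The second add consumes the result of the first one.
            lowered.append(Op("toy_arith.add", [partial, right]))
        else:
            lowered.append(op)
    return lowered


# Hypothetical high-level program: one three-point sum.
high_level = [Op("toy_stencil.sum3", ["%l", "%c", "%r"])]
low_level = lower_stencil_to_arith(high_level)
print([op.name for op in low_level])
```

A real MLIR or xDSL pass works on a much richer data structure (typed values, regions, attributes), but the principle of rewriting operations from one dialect into another is the same.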



#### **MLIR example lowering**



 But MLIR is written in C++ and uses the specialist TableGen configuration format for dialects, so it is esoteric and has a steep learning curve



#### xDSL: A Python toolkit for MLIR

 Python toolkit for MLIR that enables high productivity development of dialects and transformations



- Contains existing MLIR dialects & transformations and we are adding HPC focussed ones too
- Whole load of other things also, such as an MLIR interpreter and Python frontend



#### xDSL






- Makes experimenting with MLIR trivial
- Can go between xDSL and MLIR, leveraging transformations in both
- For our DSL purposes, this also means that a DSL can be a thin abstraction layer on top of xDSL, which provides a wealth of dialects and transformations that ultimately drive MLIR/LLVM


#### CIUK Theme: Productive supercomputing

### Making HPC More Accessible

1. For HPC developers, as they can more easily leverage supercomputers by using Domain Specific Languages
2. For DSL developers, as these are now a thin abstraction layer atop a common, well supported ecosystem



#### **Domain Specific Compilation**

- The Open Earth Compiler project from ETH Zurich used MLIR for domain specific compilation of stencil codes
- Successfully used MLIR to exploit high-level information and reach high throughput on GPUs










PSyclone:

- Climate simulation
- Discovers stencil code in Fortran
- Applies domain specific optimisations
- Generates MPI, OpenMP, OpenACC code

Devito:

- Seismic and fluid simulation, medical imagery
- Generates stencil code from Python PDEs
- Applies domain specific optimisations
- Generates MPI, OpenMP, OpenACC code



#### The broken silos



- Everything below the DSL layers is a reinvented wheel



#### The sweet spot





Sharing infrastructure:

- Implementation and maintenance cost is spread across projects
- Everyone gets all benefits
- Can still be driven by specific needs








#### A flexible abstraction

```
%input = stencil.load(%input_buffer) :
   (!field<[0,7]xf64>) -> !temp<?xf64>
%out = stencil.apply(%arg = %input : !temp<?xf64>)
   -> !temp<?xf64> {
   %l = stencil.access %arg[-1] : f64
   %c = stencil.access %arg[0] : f64
   %r = stencil.access %arg[1] : f64
   // %v = Some arbitrary computation
   stencil.return %v : f64
}
stencil.store %out to %target([1]:[6])
   : !temp<?xf64> to !field<[0,128]xf64>
```
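As a rough guide to what the IR above expresses, the following Python sketch performs the same load, apply and store sequence on plain lists. Several things here are assumptions rather than anything stated on the slide: the bounds are read as half-open intervals (so `[0,7]` holds indices 0 to 6 and `[1]:[6]` covers indices 1 to 5), the "arbitrary computation" is taken to be a simple three-point sum, and the target buffer is given the same size as the input for brevity. The names `apply_three_point_stencil` and `sum3` are hypothetical.

```python
def sum3(left, centre, right):
    """Stand-in for the 'arbitrary computation' in the stencil body."""
    return left + centre + right


def apply_three_point_stencil(source, target, lower, upper, compute):
    """For every index i in the half-open range [lower, upper), read the
    neighbours at offsets -1, 0 and +1 from `source` and write the computed
    value to `target[i]`. Cells outside the range are left untouched."""
    for i in range(lower, upper):
        target[i] = compute(source[i - 1], source[i], source[i + 1])
    return target


# Input field with indices 0..6 (assumed half-open bounds [0,7]).
input_buffer = [float(i * i) for i in range(7)]  # 0, 1, 4, 9, 16, 25, 36
target = [0.0] * len(input_buffer)

apply_three_point_stencil(input_buffer, target, 1, 6, sum3)
print(target)
```

Note that the loop order, tiling and parallelisation are not part of the description at all; that freedom is what the compiler exploits.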





#### **High-level distribution**





Halo exchange is a simple idea, let's *keep it* simple
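The idea can be sketched in a few lines of Python. This is a single-process simulation under stated assumptions, not the generated code: the neighbouring "ranks" are just lists in one process, the copies stand in for what would be MPI messages in a real distributed run, and the function names `split_with_halos`, `halo_exchange` and `sum3_interior` are hypothetical. The outer boundaries use a zero-valued halo.

```python
def split_with_halos(data, parts):
    """Split `data` into `parts` equal chunks, padding each chunk with one
    halo cell on either side (initialised to 0.0)."""
    size = len(data) // parts
    return [[0.0] + data[p * size:(p + 1) * size] + [0.0] for p in range(parts)]


def halo_exchange(chunks):
    """Fill each chunk's halo cells with the neighbouring chunk's edge
    interior value. In a real distributed code these copies are messages."""
    for p in range(len(chunks)):
        if p > 0:
            chunks[p][0] = chunks[p - 1][-2]   # left halo <- left neighbour
        if p < len(chunks) - 1:
            chunks[p][-1] = chunks[p + 1][1]   # right halo <- right neighbour


def sum3_interior(cells):
    """Three-point sum over the interior of a halo-padded list."""
    return [cells[i - 1] + cells[i] + cells[i + 1] for i in range(1, len(cells) - 1)]


data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Distributed version: split, exchange halos, compute locally, gather.
chunks = split_with_halos(data, parts=2)
halo_exchange(chunks)
distributed = [value for chunk in chunks for value in sum3_interior(chunk)]

# Serial reference with the same zero boundary.
serial = sum3_interior([0.0] + data + [0.0])
print(distributed == serial)
```

Because the stencil only reads one neighbour on each side, a halo of width one is enough for the distributed result to match the serial one.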








#### Performance of PSyclone & Devito

#### Single-node on ARCHER2





#### GPU on Cirrus (V100)

(Figure: throughput in GPts/s, Devito vs xDSL.)



#### Strong scaling on ARCHER2





Higher is better, PSyclone top row & Devito bottom row

#### Integration with Flang: Beyond DSLs

(Figure: throughput in MCells/s, higher is better, comparing the Cray compiler and Flang on the Gauss-268M, Gauss-536M, Gauss-2.15B, PW-268M, PW-536M and PW-1.25B benchmarks.)

- Flang's performance falls short of the Cray compiler for our stencil benchmarks (on a single core of ARCHER2, HPE Cray EX)
  - Our theory was that we could gain a performance improvement by combining it with domain specific optimisations










#### Auto-optimisation for new architectures



- Field Programmable Gate Arrays (FPGAs)
- RISC-V high core-count accelerator chip



- The algorithm layout on FPGAs is very different from its von Neumann counterpart
  - Requires significant experience, expertise and time to port codes to the architecture
  - Using our existing infrastructure and domain specific abstractions, can we automatically optimise algorithms for FPGAs?
    - So that a single, unchanged von Neumann version drives them?

#### Automatic optimisation for FPGAs

- AMD Xilinx already have an LLVM backend
  - We added a new High Level Synthesis (HLS) MLIR dialect that then lowers to IR compatible with AMD Xilinx's backend




- Developed transformations from the existing stencil dialect to this new HLS dialect
  - Everything else remains the same in the compiler pass
  - DSLs/languages don't need any knowledge of the target architecture

#### Automatic optimisation for FPGAs



- On an AMD Xilinx U280 FPGA
- For PW advection, our approach is between 90 and 100 times faster than DaCe
- For tracer advection, our approach is between 14 and 21 times faster than DaCe



#### Conclusions and next steps...

- We can't keep reinventing the wheel when it comes to compiler infrastructure for DSLs
- LLVM and MLIR are a strong alternative for sharing
  - We have developed the xDSL Python framework to lower the barrier to entry and offer key HPC components so that the ecosystem supports HPC workloads
- A lot of potential for bringing domain specific abstractions into existing languages, and we should be investing in Flang
- To date our focus has been on stencils; we are now generalising this to other patterns











Nick Brown

Emilien Bauer

Anton Lydike

- https://xdsl.dev
- https://github.com/xdslproject/xdsl
- https://xdsl.zulipchat.com/

