21st CDP Program

10 November 2024

Session 1: Invited Talks	Chair: J. Nelson Amaral (University of Alberta)
08:30-08:35	Welcome
08:35-09:20	Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs (slides) Xiao Zhang — Nvidia
09:20-10:00	Triton: an experience adapting OpenAI implementation on to the Intel Xe2 GPU architecture (slides) Ettore Tiotto — Intel
Session 2:	Chair: Mark Stoodley
13:00-13:30	Whole-Model Tuner for the IREE ML Compiler (slides) Jakub Kuderski, Bangtian Liu, Amily Wu, Max Dawkins — AMD
13:30-14:00	Exploring missed optimization opportunities in whole-program devirtualization and indirect call promotion (slides) Szymon Sobieszek, Ehsan Amiri, Congzhe Cao, Yangguang Li — Huawei
14:00-14:30	Compiler-Driven Performance Optimization for Neural Networks Klint Qinami — Meta
Session 3:	Chair: Ehsan Amiri
15:00-15:30	A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs Hossein Albakri, Da Ma, Kazem Cheshmi — McMaster University
15:30-16:00	High-Level Optimization of Abstract Data Types (slides pdf, slides pptx) Anthony Hunt, Emil Sekerinski — McMaster University
16:00-16:30	Creating Contexts for ZagSmalltalk (slides) Daniel Franklin, David Mason — Toronto Metropolitan University

Hexcute: A Compiler Framework for Automating Layout Synthesis in GPU Programs
Xiao Zhang — Nvidia

Abstract: Efficient GPU programming is crucial for achieving high performance in deep learning (DL) applications. The performance of GPU programs depends on how data is parallelized across threads and arranged within memory subsystems. The mapping functions describing tensors on GPUs are known as tensor layouts. Low-level programming frameworks, such as CUTLASS and Hidet, provide expressive layout abstractions, but they demand considerable programming effort to manually specify optimal layouts. High-level GPU programming languages, such as Triton, rely on compiler heuristics to generate dataflow, layouts, and pipelining strategies in GPU programs. However, the heuristics for dataflow and pipelining strategies are not generalizable to complex operators. To balance expressiveness and programmability, we propose Hexcute, a compiler framework that automates layout synthesis while providing explicit control over dataflow and pipelining. Hexcute formalizes layout synthesis as a constraint programming problem and solves it with a type-inference-based algorithm. This approach enables systematic exploration of optimal layouts and instructions.

Our evaluation shows that Hexcute matches the performance of libraries like cuBLAS and FlashAttention on GEMM, Attention, and their variants, while reducing the amount of code by 1.27x-7.94x compared to CUTLASS. For mixed-type mixture-of-experts (MoE) operators, Hexcute achieves an average speedup of 6.46x over Triton. In the end-to-end evaluations of vLLM, Hexcute delivers up to 2.60x speedup on DeepSeek-R1-AWQ and 2.04x on a Mamba-based model.
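As a rough illustration of what "tensor layout" means in the abstract above, the sketch below models a layout as a shape and stride pair that maps a logical coordinate to a linear memory offset, in the general style of the layout abstractions in CUTLASS. The `Layout` class and its method names are hypothetical and chosen for illustration; they are not Hexcute's API, and the sketch does not attempt the layout synthesis the talk describes.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Layout:
    """Hypothetical shape/stride layout: maps a coordinate to a linear offset."""
    shape: tuple
    stride: tuple

    def offset(self, coord):
        # The offset is the dot product of the coordinate with the strides.
        assert len(coord) == len(self.shape)
        assert all(0 <= c < s for c, s in zip(coord, self.shape))
        return sum(c * d for c, d in zip(coord, self.stride))

    def offsets(self):
        # Offsets of all coordinates, visited with the last dimension fastest.
        return [self.offset(c) for c in product(*(range(s) for s in self.shape))]


# Two layouts for the same logical 2x3 tensor.
row_major = Layout(shape=(2, 3), stride=(3, 1))
col_major = Layout(shape=(2, 3), stride=(1, 2))
```

Both layouts cover the same six memory locations, but in a different order, which is the kind of choice that affects how well neighbouring threads' accesses line up in memory.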

Triton: an experience adapting OpenAI implementation on to the Intel Xe2 GPU architecture
Ettore Tiotto — Intel

Abstract: Triton is a Pythonic DSL introduced by OpenAI that aims to simplify GPU programming for deep learning workloads without significantly sacrificing runtime performance. This talk will give an introduction to the Triton language, its programming model, and its associated compiler architecture. We will present our experience in retargeting the compiler backend to a significantly different GPU architecture, and discuss challenges and solutions to performance portability.

Whole-Model Tuner for the IREE ML Compiler
Jakub Kuderski, Bangtian Liu, Amily Wu, Max Dawkins — AMD

Abstract: SHARK Tuner is a whole-model tuner for the IREE (Intermediate Representation Execution Environment) ML compiler. The tuner takes an ML model compiled by IREE and searches for dispatch configuration parameters that yield the best performance across the whole model. We show why whole-model tuning is crucial to benefit from aggressive operation fusion and to account for power-limited chips. We used the tuner in our most recent MLPerf submissions and achieved a 15% end-to-end speedup on the SDXL image generation model on AMD MI300X and MI325X GPUs.

Exploring missed optimization opportunities in whole-program devirtualization and indirect call promotion
Szymon Sobieszek, Ehsan Amiri, Congzhe Cao, Yangguang Li — Huawei

Abstract: We present our ongoing work on whole-program devirtualization and indirect call promotion optimizations. We show how embedding more accurate information about class types in the IR through LLVM intrinsics lets us catch some of the missed devirtualization opportunities. Our experiments so far show hundreds to thousands of additional devirtualized callsites in common codebases, such as MySQL, ClickHouse, and Xalan. Moreover, replacing indirect calls with direct calls opens up possibilities for performance improvement, with extra inlining as one of the promising directions. Additionally, we show how fine-tuning hotness thresholds can lead to improvements from the indirect call promotion pass.
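For readers unfamiliar with indirect call promotion, the sketch below shows the general idea: when profiling shows one target dominating an indirect callsite, the call is rewritten as a guarded direct call with the original indirect call as a fallback. It is a conceptual model only; the function names, the profile format, and the threshold values are assumptions made for illustration and do not reflect the LLVM pass or the authors' changes.

```python
from collections import Counter


def square_area(s):
    return s * s


def circle_area(r):
    return 3.14159 * r * r


def pick_hot_target(profile, threshold):
    """Return the dominant target if its share of calls meets the threshold."""
    total = sum(profile.values())
    if total == 0:
        return None
    target, count = profile.most_common(1)[0]
    return target if count / total >= threshold else None


def make_promoted_call(hot):
    """Build a callsite that tests for the hot target before falling back."""
    def promoted(fn, x):
        if fn is hot:
            # Guarded direct call: a real compiler could now inline `hot` here.
            return hot(x)
        # Fallback: the original indirect call.
        return fn(x)
    return promoted


# Hypothetical profile: how often each target was seen at one callsite.
profile = Counter({square_area: 90, circle_area: 10})
hot = pick_hot_target(profile, threshold=0.8)
call = make_promoted_call(hot)
```

Raising the threshold to 0.95 in this example would leave the callsite unpromoted, which is the kind of trade-off that tuning a hotness threshold controls.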

Compiler-Driven Performance Optimization for Neural Networks
Klint Qinami — Meta

Abstract: We present compiler optimization techniques developed for MTIA's next-generation architecture, which delivers 3x performance improvement over the previous generation. Performance evaluation on production ranking and recommendation models demonstrates significant improvements in memory utilization and overall system efficiency. The techniques contribute to MTIA's 6x model serving throughput improvement and 1.5x performance-per-watt gains over the previous generation, enabling Meta to efficiently serve models ranging from low-complexity to high-complexity recommendation workloads with 10x-100x differences in model size. We describe a multi-stage compilation pipeline that leverages PyTorch's Inductor backend while introducing novel graph-level optimizations tailored for AI accelerators. Our approach addresses several key challenges: (1) tensor view elimination that converts explicit layout transformations into implicit tensor view manipulations, (2) memory-aware operator fusion strategies that consider both computational efficiency and memory hierarchy constraints, and (3) dynamic shape handling that maintains performance optimization paths despite runtime variability.

The compiler uses memory placement strategies that automatically partition tensors between fast on-chip SRAM and external DRAM based on access patterns, lifetime analysis, and fallback strategies. When SRAM capacity is exceeded, our spilling mechanisms intelligently migrate data while minimizing performance impact. We also employ scheduling and tiling optimizations that decompose large tensor operations into smaller blocks that fit within memory constraints while maximizing data reuse. Additionally, graph-level transformations simplify and canonicalize graphs, eliminate redundant operations, and support both vertical and horizontal fusions to improve compute density.
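The tiling idea mentioned above can be sketched in a few lines: pick a block size so that the working set fits in a fixed fast-memory budget, then compute the operation block by block. The budget model (one square tile each of A, B, and C), the helper names, and the 4-byte element size are assumptions for illustration; this is not the MTIA compiler's algorithm.

```python
from math import isqrt


def pick_tile(sram_bytes, elem_bytes=4):
    """Largest tile edge t with 3 * t * t * elem_bytes <= sram_bytes.

    Assumes one square tile each of A, B, and C is resident at a time.
    Falls back to 1 when even that does not fit the budget.
    """
    return max(1, isqrt(sram_bytes // (3 * elem_bytes)))


def tiled_matmul(A, B, tile):
    """Blocked matrix multiply over lists of lists; C = A x B."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # Within a block, the A and B tiles are reused across i and j.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = C[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += A[i][kk] * B[kk][j]
                        C[i][j] = acc
    return C


A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
C = tiled_matmul(A, B, pick_tile(sram_bytes=48))
```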

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs
Hossein Albakri, Da Ma, Kazem Cheshmi — McMaster University

Abstract: Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to slow-downs due to inefficient use of computing and memory resources. This paper proposes a new compiler transformation, enumerate-and-sparse-coarsen, that accelerates sparse matrix-matrix multiplication (SPMM) on GPU devices. The transformation increases data reuse in registers and caches while creating more balanced workloads for GPU computing resources. The transformation is tested on sparse neural networks in convolutional and transformer models. On an A100 GPU, with the number of columns of matrix B (bCols) in A x B = C ranging from 32 to 128, the transformation yields geometric mean speedups of 1.84x and 2.27x over the cuBLAS and cuSPARSE baselines, respectively.
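To make the irregular access pattern concrete, here is a minimal reference SPMM over a sparse A in compressed sparse row (CSR) form and a dense B. It is the plain, untransformed computation, not the enumerate-and-sparse-coarsen transformation, and the function and argument names are chosen for illustration.

```python
def spmm_csr(row_ptr, col_idx, vals, B, b_cols):
    """Compute C = A x B where A is CSR (row_ptr, col_idx, vals) and B is dense."""
    n_rows = len(row_ptr) - 1
    C = [[0.0] * b_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i + 1] - 1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            a = vals[k]
            # Irregular access: which row of B is read depends on the data.
            b_row = B[col_idx[k]]
            for j in range(b_cols):
                C[i][j] += a * b_row[j]
    return C


# A = [[1, 0, 2],
#      [0, 0, 3]] stored in CSR form.
row_ptr = [0, 2, 3]
col_idx = [0, 2, 2]
vals = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
C = spmm_csr(row_ptr, col_idx, vals, B, b_cols=2)
```

Rows with different numbers of nonzeros do different amounts of work in the inner loops, which is the load imbalance the abstract refers to.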

High-Level Optimization of Abstract Data Types
Anthony Hunt, Emil Sekerinski — McMaster University

Abstract: Modelling languages support mathematical collection data types like sets and relations with a rich set of operations. However, modern programming languages only use a fraction of this expressive notation, preferring implementation-specific types like arrays and classes. This paper proposes a new language and term rewriting system to generate efficient implementations of complex data type operations with predictable running time and memory consumption.

Creating Contexts for ZagSmalltalk
Daniel Franklin, David Mason — Toronto Metropolitan University

Abstract: Zag Smalltalk is a new runtime and compiler for the Smalltalk programming language, described in the paper [1]. This talk will examine the execution model of Zag Smalltalk involving the creation of call contexts and the related lookup of variables either in the stack or from the context. The stack contains the live data that a method is working with, including references to self, parameters, and locals [2]. Before a send can be initiated, a context must be created to contain the return address to which the calling method is to return. Along with the return address, the context encapsulates the self, parameters, local variables, and its caller’s stack.

ZagSmalltalk will delay the creation of a context until:

a write needs to be made to a local
a block closure is created that needs to reference the context
a message send is about to execute

To access the locals, parameters, and self in a program, the operations that read and write variables, such as push and pop, must access them through a two-part index. This index contains an offset into the stack and an index into the context. If no context exists, the offset is used to access the parameter or the self object. Alternatively, if a context exists or is created, the offset references the stack, and the index part is used to access the variable in the context.

Normally, a context would be created whenever a send is detected, but creating contexts is expensive, and executing a method may not execute the send if another path through the method is taken. Specifically, in ZagSmalltalk, a call to a primitive does not require a context unless an exception occurs. ZagSmalltalk also makes use of aggressive inlining, which may result in the send being removed. The talk will detail the context creation process in ZagSmalltalk.
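The two-part index and the delayed context creation described above can be modelled roughly as follows. This is a simplified sketch under stated assumptions, not Zag Smalltalk's actual data structures: the `Frame` class is hypothetical, the stack offset is assumed to count down from the top of the stack, and the context is assumed to hold self, then the parameters, then the locals.

```python
class Frame:
    """Hypothetical model of one method activation with a lazily created context."""

    def __init__(self, receiver, params, n_locals):
        # Before any context exists, self and the parameters live on the stack.
        self.stack = [receiver, *params]
        self.n_locals = n_locals
        self.context = None  # created only when first needed

    def ensure_context(self):
        if self.context is None:
            # Assumed layout: self, parameters, then empty local slots.
            self.context = list(self.stack) + [None] * self.n_locals
        return self.context

    def read(self, offset, index):
        """Read a variable through a two-part (stack offset, context index) pair."""
        if self.context is None:
            # No context yet: use the offset, counted from the top of the stack.
            return self.stack[-1 - offset]
        # Context exists: use the index part.
        return self.context[index]

    def write_local(self, index, value):
        # Writing a local is one of the events that forces context creation.
        self.ensure_context()[index] = value


frame = Frame("rcvr", ["a", "b"], n_locals=1)
self_before = frame.read(offset=2, index=0)   # served from the stack
had_context = frame.context is not None
frame.write_local(index=3, value=42)          # forces the context into existence
```

A method that only reads self and its parameters never pays for a context in this model, which mirrors the motivation given in the abstract.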

References

[1] Dave Mason (2022), Design Principles for a High-Performance Smalltalk, International Workshop on Smalltalk, https://api.semanticscholar.org/CorpusID:259124052

[2] J. E. B. Moss (1987), Managing stack frames in Smalltalk, SIGPLAN Not. 22, 7 (July 1987), 229–240. https://doi.org/10.1145/960114.29675