ML Compilers
Deep learning compilation frameworks and optimization techniques
Repositories (6)
apache/tvm
iree-org/iree
onnx/onnx
openxla/xla
pytorch/glow
triton-lang/triton
Papers (74)
Automatically Scheduling Halide Image Processing Pipelines
The Halide image processing language has proven to be an effective system for authoring high-performance image processing code. Halide programmers need only provide a high-level strategy for mapping an image processing pipeline to a parallel machine ...
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Large language models (LLMs) have transformed numerous AI applications. On-device LLM is becoming increasingly important: running LLMs locally on edge devices can reduce the cloud computing cost and protect users' privacy. However, the astronomical m...
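The sketch below is not AWQ itself (the paper searches for the per-channel scales rather than using a fixed exponent); it is only a minimal illustration of the activation-aware idea: scale salient input channels up before round-to-nearest quantization and fold the inverse scale into the activations. The names `rtn_quantize`, `activation_aware_quantize`, and `alpha` are illustrative, not from the paper.

```python
import numpy as np

def rtn_quantize(w, n_bits=4):
    """Plain round-to-nearest, per-output-channel symmetric quantization."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.maximum(np.abs(w).max(axis=1, keepdims=True), 1e-12) / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale  # dequantized weights

def activation_aware_quantize(w, act_samples, n_bits=4, alpha=0.5):
    """Toy activation-aware variant: scale each weight column by a factor s
    derived from average activation magnitude, quantize, then divide the
    activations by s at runtime so the product is unchanged."""
    s = np.abs(act_samples).mean(axis=0) ** alpha   # per-input-channel importance
    s = np.maximum(s, 1e-5)
    w_q = rtn_quantize(w * s, n_bits)               # protect salient channels
    return w_q, s                                   # inference: y = (x / s) @ w_q.T

# Tiny usage example with random data
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))                        # [out_features, in_features]
x = rng.normal(size=(32, 16)) * rng.uniform(0.1, 4.0, size=16)
w_q, s = activation_aware_quantize(w, x)
err_awq = np.abs((x / s) @ w_q.T - x @ w.T).mean()
err_rtn = np.abs(x @ rtn_quantize(w).T - x @ w.T).mean()
print(f"RTN error {err_rtn:.4f} vs activation-aware error {err_awq:.4f}")
```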
Bring Your Own Codegen to Deep Learning Compiler
Deep neural networks (DNNs) have been ubiquitously applied in many applications, and accelerators have emerged as an enabler to support the fast and efficient inference tasks of these applications. However, to achieve high model coverage with high per...
Convolutional Sequence to Sequence Learning
The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent mod...
Cortex: A Compiler for Recursive Deep Learning Models
Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low level kernel optimizations such as those found in vendor libraries. This approach often leaves significant perf...
Effective Approaches to Attention-based Neural Machine Translation
An attentional mechanism has lately been used to improve neural machine translation (NMT) by selectively focusing on parts of the source sentence during translation. However, there has been little work exploring useful architectures for attention-bas...
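A minimal numpy sketch of dot-product global attention in the spirit of this paper's "dot" score: one decoder state attends over all encoder states and the context vector is their softmax-weighted sum. Function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_dot_attention(h_t, H_s):
    """Dot-score global attention: score(h_t, h_s) = h_t . h_s,
    softmax over source positions, then a weighted sum of source states."""
    scores = H_s @ h_t                 # [src_len]
    weights = softmax(scores)          # attention distribution over source
    context = weights @ H_s            # context vector
    return context, weights

# Usage: one decoder state attending over 5 encoder states of width 8
rng = np.random.default_rng(0)
H_s = rng.normal(size=(5, 8))
h_t = rng.normal(size=8)
context, weights = global_dot_attention(h_t, H_s)
print(weights.round(3), context.shape)
```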
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to thei...
Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$
Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to achieve flexible and performant design and implementation of tensor layouts -- mappings between logical tensors and hardware res...
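The layouts in question are linear maps over index bits. The toy sketch below assumes nothing about Triton's actual implementation; it only shows the underlying idea: a binary matrix applied to the bit vector of a logical index, with AND as multiplication and XOR as addition over $\mathbb{F}_2$, as in a bank-conflict-avoiding swizzle.

```python
import numpy as np

def apply_f2_layout(index, M):
    """Map a logical index to a hardware index by multiplying its bit vector
    with a binary matrix M over F_2 (XOR is addition mod 2)."""
    n = M.shape[1]
    bits = (index >> np.arange(n)) & 1           # little-endian bit vector
    out_bits = (M @ bits) % 2                    # matrix-vector product mod 2
    return int((out_bits << np.arange(M.shape[0])).sum())

# A 4-bit example: identity layout plus an XOR swizzle that folds the two
# high index bits into the two low bits.
n = 4
M = np.eye(n, dtype=np.int64)
M[0, 2] = 1   # low output bit also depends on input bit 2
M[1, 3] = 1   # next output bit also depends on input bit 3

for i in range(8):
    print(i, "->", apply_f2_layout(i, M))
```

Because the matrix is invertible over $\mathbb{F}_2$, the mapping is a bijection, which is what makes such swizzles usable as layouts.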
LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation
We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, ...
More Iteration Space Tiling
Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages. Tiles become a natural candidate as the unit of work for parallel task scheduling. Synchronization between processors can be done between...
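A minimal Python sketch of the idea (not from the paper), using a blocked matrix multiply: each sub-block keeps a bounded working set, and each output tile is an independent unit of parallel work.

```python
import numpy as np

def matmul_tiled(A, B, tile=32):
    """Blocked matrix multiply: the (ti, tj, tk) loops walk over small
    sub-blocks so the working set stays cache-resident, and each (ti, tj)
    output tile could be handed to a worker as one task."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for ti in range(0, n, tile):
        for tj in range(0, m, tile):
            for tk in range(0, k, tile):
                C[ti:ti+tile, tj:tj+tile] += (
                    A[ti:ti+tile, tk:tk+tile] @ B[tk:tk+tile, tj:tj+tile]
                )
    return C

rng = np.random.default_rng(0)
A, B = rng.normal(size=(100, 70)), rng.normal(size=(70, 90))
assert np.allclose(matmul_tiled(A, B), A @ B)
```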
Neural Machine Translation by Jointly Learning to Align and Translate
Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize t...
Parallel Random Numbers: As Easy as 1, 2, 3
Most pseudorandom number generators (PRNGs) scale poorly to massively parallel high-performance computation because they are designed as sequentially dependent state transformations. We demonstrate that independent, keyed transformations of counters ...
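A toy illustration of the counter-based idea, not the paper's Philox or Threefry generators: each output is a keyed hash of its counter, so any element of the stream can be produced independently, in any order, on any worker.

```python
import hashlib
import numpy as np

def counter_random(key: int, counters: np.ndarray) -> np.ndarray:
    """Toy counter-based generator: every output is a keyed hash of its
    counter value, so there is no sequentially carried state."""
    out = np.empty(len(counters), dtype=np.float64)
    for idx, c in enumerate(counters):
        digest = hashlib.blake2b(
            int(c).to_bytes(8, "little"),
            key=int(key).to_bytes(8, "little"),
            digest_size=8,
        ).digest()
        # map 64 random bits to a float in [0, 1)
        out[idx] = int.from_bytes(digest, "little") / 2**64
    return out

# Any slice of the stream can be generated independently and the results agree:
a = counter_random(key=42, counters=np.arange(0, 8))
b = counter_random(key=42, counters=np.arange(4, 8))
assert np.array_equal(a[4:], b)
print(a.round(4))
```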
Recipes for Pre-training LLMs with MXFP8
Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats introduced in NVIDIA Blackwell generation of G...
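A sketch of the microscaling layout only, not the paper's pre-training recipe: one shared power-of-two scale per 32-element block, with int8 elements standing in for the FP8 element types so the example stays dependency-free.

```python
import numpy as np

BLOCK = 32  # MX formats share one scale across a small block of elements

def mx_quantize(x, block=BLOCK, elem_max=127.0):
    """Block-scaled quantization sketch: each block of `block` values gets a
    shared power-of-two scale; elements are stored in a narrow format
    (int8 here as a stand-in for FP8)."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # smallest power-of-two scale that fits the largest element in the block
    exp = np.ceil(np.log2(np.maximum(amax, 1e-30) / elem_max))
    scale = 2.0 ** exp                                   # one scale per block
    q = np.clip(np.round(x / scale), -elem_max, elem_max).astype(np.int8)
    return q, scale

def mx_dequantize(q, scale):
    return (q.astype(np.float64) * scale).reshape(-1)

rng = np.random.default_rng(0)
x = rng.normal(size=4096) * np.repeat(2.0 ** rng.integers(-8, 8, 128), 32)
q, scale = mx_quantize(x)
err = np.abs(mx_dequantize(q, scale) - x).max() / np.abs(x).max()
print(f"max abs error / global amax: {err:.2e}")
```

The per-block scale is what lets blocks with very different magnitudes coexist in one tensor without a single global scale washing out the small ones.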
Self-attention Does Not Need $O(n^2)$ Memory
We present a very simple algorithm for attention that requires $O(1)$ memory with respect to sequence length and an extension to self-attention that requires $O(\log n)$ memory. This is in contrast with the frequently stated belief that self-attentio...
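A simplified single-query numpy version of the chunked formulation: a running max, running normalizer, and running weighted sum are carried across key/value chunks, so the full length-n score vector is never materialized. This illustrates the idea rather than reproducing the paper's exact algorithm.

```python
import numpy as np

def attention_chunked(q, K, V, chunk=128):
    """Single-query attention over key/value chunks with a running max `m`,
    running normalizer `l`, and running weighted sum `acc`."""
    d = q.shape[0]
    m, l = -np.inf, 0.0
    acc = np.zeros(V.shape[1])
    for start in range(0, K.shape[0], chunk):
        Kc, Vc = K[start:start+chunk], V[start:start+chunk]
        s = (Kc @ q) / np.sqrt(d)            # scores for this chunk
        m_new = max(m, s.max())
        corr = np.exp(m - m_new)             # rescale earlier partial sums
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ Vc
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
n, d = 1000, 64
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = (K @ q) / np.sqrt(d)
ref = np.exp(scores - scores.max())
ref = (ref / ref.sum()) @ V
assert np.allclose(attention_chunked(q, K, V), ref)
```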
The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding
There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution...
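A small illustration of why the ragged representation matters, assuming nothing about CoRa's compiler machinery: the same batch stored padded to the longest sequence versus as flat values plus row offsets.

```python
import numpy as np

# A batch of sequences with different lengths.
seqs = [np.arange(n, dtype=np.float64) for n in (3, 7, 2, 5)]

# Dense approach: pad everything to the longest sequence.
max_len = max(len(s) for s in seqs)
padded = np.zeros((len(seqs), max_len))
for i, s in enumerate(seqs):
    padded[i, :len(s)] = s
# 4 * 7 = 28 slots stored and computed on, but only 17 hold real data.

# Ragged approach: flat values plus row offsets, no padding.
values = np.concatenate(seqs)                        # 17 real elements
offsets = np.cumsum([0] + [len(s) for s in seqs])    # [0, 3, 10, 12, 17]

def ragged_row_sums(values, offsets):
    """Per-row reduction that only touches real elements."""
    return np.array([values[offsets[i]:offsets[i+1]].sum()
                     for i in range(len(offsets) - 1)])

assert np.allclose(ragged_row_sums(values, offsets), padded.sum(axis=1))
```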
T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge
The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed p...
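A toy numpy version of the table-lookup idea for 1-bit weights, not the T-MAC kernel itself (the paper handles multi-bit weights via bit-planes and reuses each table across many output rows): partial sums over a group of activations are precomputed into a 16-entry table, and each packed group of weight bits becomes a single lookup instead of four multiply-adds.

```python
import numpy as np

G = 4  # weights handled in groups of 4 one-bit values

def lut_dot(w_bits, x):
    """Dot product with 1-bit weights via table lookup."""
    n = len(x)
    assert n % G == 0 and w_bits.shape == (n,)
    total = 0.0
    for g in range(0, n, G):
        xg = x[g:g+G]
        # table[idx] = sum of xg[j] over the bits set in idx (16 entries);
        # in a real kernel this table is built once per activation group
        # and reused across all output rows.
        idx_bits = (np.arange(16)[:, None] >> np.arange(G)) & 1
        table = idx_bits @ xg
        idx = int((w_bits[g:g+G] << np.arange(G)).sum())
        total += table[idx]
    return total

rng = np.random.default_rng(0)
x = rng.normal(size=16)
w = rng.integers(0, 2, size=16)        # 1-bit weights in {0, 1}
assert np.isclose(lut_dot(w, x), float(w @ x))
```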
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
In this report, we propose Triton-distributed, an extension of the existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for dis...
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
Improving the deployment efficiency of transformer-based language models has been challenging given their high computation and memory cost. While INT8 quantization has recently been shown to be effective in reducing both the memory cost and latency w...
UNIT: Unifying Tensorized Instruction Compilation
Because of the increasing demand for computation in DNNs, researchers develop both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to levera...