ML Compilers

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

2022

12 references

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the ad...

An Improved Algorithm for hypot(a,b)

Carlos F. Borges

2019

2 references

We develop a fast and accurate algorithm for evaluating $\sqrt{a^2+b^2}$ for two floating point numbers $a$ and $b$. Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we will de...
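
To illustrate why evaluating $\sqrt{a^2+b^2}$ needs care, here is a minimal Python sketch contrasting the direct formula with the classic rescaling trick. This is not the algorithm proposed in the paper (the abstract above is truncated before it is described); the function names `hypot_naive` and `hypot_scaled` are hypothetical and chosen only for this example.

```python
import math


def hypot_naive(a: float, b: float) -> float:
    """Direct formula; a*a or b*b can overflow or underflow."""
    return math.sqrt(a * a + b * b)


def hypot_scaled(a: float, b: float) -> float:
    """Classic rescaling: factor out the larger magnitude first.

    Illustrative only; NaN inputs are not handled specially.
    """
    a, b = abs(a), abs(b)
    big, small = max(a, b), min(a, b)
    if big == 0.0:
        return 0.0          # both inputs are zero
    if math.isinf(big):
        return math.inf     # avoid inf / inf below
    r = small / big         # 0 <= r <= 1, so r * r cannot overflow
    return big * math.sqrt(1.0 + r * r)


if __name__ == "__main__":
    print(hypot_naive(1e200, 1e200))   # intermediate squares overflow
    print(hypot_scaled(1e200, 1e200))  # stays finite
    print(math.hypot(1e200, 1e200))    # standard library reference
```

The rescaled form avoids overflow in the intermediate squares at the cost of an extra division; it is generally less accurate than a carefully designed routine, which is the kind of trade-off the paper sets out to improve.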

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

2015

17 references

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful...

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael...

2022

1 reference

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the prac...

Fast splittable pseudorandom number generators

G. Steele, D. Lea, Christine H. Flood

2014

2 references

We describe a new algorithm SplitMix for an object-oriented and splittable pseudorandom number generator (PRNG) that is quite fast: 9 64-bit arithmetic/logical operations per 64 bits generated. A conventional linear PRNG object provides a generate me...

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao

2023

3 references

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, a...

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

14 references

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 ...

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016

11 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights ...
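
The definition $x\Phi(x)$ can be evaluated directly with the error function, since $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The sketch below is a scalar illustration using only Python's standard library; the helper names `std_normal_cdf` and `gelu` are assumptions made for this example and are not taken from the paper.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Standard Gaussian CDF, Phi(x), expressed via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU for a scalar input: x * Phi(x)."""
    return x * std_normal_cdf(x)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"gelu({x:+.1f}) = {gelu(x):+.6f}")
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero, which is why it is often described as a smooth relative of ReLU.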

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, ...

2021

5 references

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute t...

HPTT: A High-Performance Tensor Transposition C++ Library

Paul Springer, Tong Su, Paolo Bientinesi

2017

1 reference

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which th...

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke

2019

4 references

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to th...

Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference

Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, Yida Wa...

2020

1 reference

Modern deep neural networks increasingly make use of features such as dynamic control flow, data structures and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a pre-determin...

Optimizing Winograd-Based Convolution with Tensor Cores.

Junhong Liu, Dongxu Yang, Junjie Lai

2021

1 reference

Convolution is one of the most time-consuming parts of convolutional neural networks (CNNs). State-of-the-art CNNs use small 3 × 3 filters. Recent work on Winograd convolution can reduce the computational complex...

Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake A. Hechtman, Dehao Chen, K. M...

2022

1 reference

Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the size of the...

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, D...

2017

6 references

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried ou...

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Felix Chern, Blake Hechtman, Andy Davis, Ruiqi Guo, David Majnemer, Sanjiv Kumar

2022

5 references

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivate...

ZeRO-Offload: Democratizing Billion-Scale Model Training

2021

1 reference

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training acce...

Repositories

iree-org/iree

openxla/xla
