🧠 ML Compilers

Deep learning compilation frameworks and optimization techniques

Repositories (6)

apache/tvm (17 papers)
iree-org/iree (7 papers)
onnx/onnx (21 papers)
openxla/xla (11 papers)
pytorch/glow (4 papers)
triton-lang/triton (25 papers)

Papers (74)

Showing 20 of 74 papers

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi
2022
16 references

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the ad...

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew T...
2017
3 references

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution t...
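
The paper's central recipe is the linear scaling rule (multiply the learning rate by k when the minibatch size is multiplied by k) together with a gradual warmup phase. A minimal sketch, with base constants taken from the paper's ImageNet setup; the function name and warmup shape here are illustrative:

```python
# Sketch of the linear scaling rule with gradual warmup.
# base_lr=0.1 and base_batch=256 follow the paper's ImageNet recipe;
# the 5-epoch linear warmup is likewise from the paper.

def learning_rate(epoch, batch_size, base_lr=0.1, base_batch=256,
                  warmup_epochs=5):
    """Linear-scaled learning rate with a linear warmup from base_lr."""
    target_lr = base_lr * batch_size / base_batch  # k-times batch -> k-times LR
    if epoch < warmup_epochs:
        # Ramp linearly from base_lr up to the scaled target.
        return base_lr + (target_lr - base_lr) * epoch / warmup_epochs
    return target_lr

print(learning_rate(epoch=0, batch_size=8192))   # warmup start: 0.1
print(learning_rate(epoch=10, batch_size=8192))  # full scaled LR: 3.2
```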

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...
2017
25 references

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose ...
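
For reference, a minimal NumPy sketch of the paper's core operation, scaled dot-product attention $\mathrm{softmax}(QK^T/\sqrt{d_k})V$; shapes and names are illustrative, and multi-head projection is omitted:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, the paper's core operation."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n_q, n_k) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

Q = np.random.randn(4, 64); K = np.random.randn(6, 64); V = np.random.randn(6, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 64)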

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy
2015
30 references

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful...
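
A minimal NumPy sketch of the train-time transform: per-feature statistics are computed over the minibatch, then a learned scale and shift are applied (the running statistics used at inference are omitted here):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then scale and shift."""
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta            # learned scale/shift restore capacity

x = np.random.randn(32, 8)                # (batch, features)
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.std(axis=0).round(2))
```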

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer
2019
7 references

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability ...
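
A minimal sketch of the paper's multi-query layout, in which all query heads attend through one shared key/value head, so K and V are loaded from memory once per decoding step; shapes and names are illustrative:

```python
import numpy as np

def multi_query_attention(Q, K, V):
    """Q: (heads, n, d); K, V: (n, d), a single shared key/value head."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                 # (heads, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                  # same K/V reused by every head

Q = np.random.randn(8, 16, 32)                    # 8 query heads
K = np.random.randn(16, 32); V = np.random.randn(16, 32)
print(multi_query_attention(Q, K, V).shape)       # (8, 16, 32)
```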

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao
2023
3 references

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, a...

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
2022
6 references

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to r...
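
The kernel itself is CUDA, but the underlying online-softmax tiling can be sketched in NumPy; this shows the algorithmic idea only (never materializing the full n-by-n score matrix), not the IO-aware implementation, and `block` and all names are illustrative:

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Online-softmax sketch: process K/V in blocks, keeping a running
    row max (m) and normalizer (l) so the full score matrix is never
    materialized."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)          # running row max
    l = np.zeros(n)                  # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        s = Q @ Kb.T / np.sqrt(d)                 # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

Q = np.random.randn(8, 16); K = np.random.randn(8, 16); V = np.random.randn(8, 16)
# Matches ordinary softmax attention up to floating-point error.
s = Q @ K.T / np.sqrt(16)
w = np.exp(s - s.max(1, keepdims=True))
ref = (w / w.sum(1, keepdims=True)) @ V
print(np.allclose(tiled_attention(Q, K, V), ref))   # True
```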

FP8 Formats for Deep Learning

Paulius Micikevicius, Dušan Stošić, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...
2022
24 references

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 ...
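
A sketch of the dynamic ranges implied by the two encodings. The E4M3 conventions (no infinities, a single all-ones mantissa pattern reserved for NaN, so the top exponent remains usable) follow the paper; E5M2 follows IEEE 754 conventions. The helper below is illustrative:

```python
def fp8_limits(exp_bits, man_bits, ieee_like):
    """Return (max normal value, min normal value) for an FP8 encoding."""
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        # Top exponent reserved for inf/NaN, as in IEEE 754 (E5M2).
        max_exp = (2 ** exp_bits - 2) - bias
        max_mantissa = 2 - 2 ** -man_bits
    else:
        # Paper's E4M3: top exponent usable; only mantissa=all-ones is NaN.
        max_exp = (2 ** exp_bits - 1) - bias
        max_mantissa = 2 - 2 * 2 ** -man_bits
    min_normal = 2.0 ** (1 - bias)
    return max_mantissa * 2.0 ** max_exp, min_normal

print(fp8_limits(4, 3, ieee_like=False))  # E4M3: (448.0, 0.015625)
print(fp8_limits(5, 2, ieee_like=True))   # E5M2: (57344.0, 6.103515625e-05)
```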

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel
2016
13 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights ...
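
A direct transcription of the stated formula, plus the tanh approximation also given in the paper:

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The paper's tanh approximation, common in deployed models."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

print(gelu(1.0), gelu_tanh(1.0))   # both approximately 0.841
```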

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
2023
13 references

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We ...
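
A sketch of the paper's checkpoint-conversion step, which mean-pools the key and value heads of an existing multi-head model into a smaller number of shared group heads before uptraining; shapes and names are illustrative:

```python
import numpy as np

def pool_kv_heads(kv, num_groups):
    """Convert per-head K/V projections (num_heads, ...) into num_groups
    shared heads by mean-pooling, as in the paper's conversion step."""
    num_heads = kv.shape[0]
    assert num_heads % num_groups == 0
    grouped = kv.reshape(num_groups, num_heads // num_groups, *kv.shape[1:])
    return grouped.mean(axis=1)

K_proj = np.random.randn(16, 64, 512)   # 16 key heads from a trained MHA model
print(pool_kv_heads(K_proj, num_groups=4).shape)  # (4, 64, 512): 4 shared heads
```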

Group Normalization

Yuxin Wu, Kaiming He
2018
10 references

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes...
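
A minimal sketch of the technique, assuming NCHW layout: channels are divided into groups and normalized within each example, so the statistics are independent of batch size:

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Normalize over channel groups within each example."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    mean = g.mean(axis=(2, 3, 4), keepdims=True)  # per-example, per-group stats
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return gamma * g.reshape(n, c, h, w) + beta

x = np.random.randn(2, 8, 4, 4)                   # (N, C, H, W)
y = group_norm(x, num_groups=4, gamma=np.ones((1, 8, 1, 1)),
               beta=np.zeros((1, 8, 1, 1)))
print(y.shape)   # (2, 8, 4, 4)
```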

Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky
2016
13 references

In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016). We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to s...

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
2016
8 references

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the s...
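
A minimal NumPy sketch: statistics are computed over the features of each example rather than over the batch, so the computation is identical at training and test time and independent of batch size:

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each example over its features, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mean) / np.sqrt(var + eps) + bias

x = np.random.randn(2, 8)                        # (batch, features)
y = layer_norm(x, gain=np.ones(8), bias=np.zeros(8))
print(y.mean(axis=-1).round(6), y.std(axis=-1).round(2))
```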

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke
2019
4 references

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to th...
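
One way to sketch the paper's direction in NumPy, simulating bfloat16 by mantissa truncation (real hardware rounds to nearest) and splitting an fp32 value into bfloat16 pieces so that a few low-precision multiplies with wider accumulation approach fp32 accuracy; the details below are illustrative:

```python
import numpy as np

def to_bf16(x):
    """Simulate bfloat16 by truncating fp32's low 16 mantissa bits
    (truncation, not round-to-nearest, keeps the sketch short)."""
    x = np.atleast_1d(np.asarray(x, dtype=np.float32))
    return (x.view(np.uint32) & np.uint32(0xFFFF0000)).view(np.float32)

def split(x):
    """Write an fp32 value as a sum of two bfloat16 pieces."""
    hi = to_bf16(x)
    lo = to_bf16(np.asarray(x, dtype=np.float32) - hi)
    return hi, lo

# The idea in miniature: three bfloat16 multiplies with fp32
# accumulation recover most of a full fp32 product's accuracy.
a = np.float32(3.14159265); b = np.float32(2.71828183)
a_hi, a_lo = split(a); b_hi, b_lo = split(b)
approx = a_hi * b_hi + a_hi * b_lo + a_lo * b_hi   # fp32 accumulation
print(float(a * b), float(approx[0]), float((a_hi * b_hi)[0]))
```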

On Multiplicative Integration with Recurrent Neural Networks

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan Salakhutdinov
2016
3 references

We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from different sources flows and is integrated in the computational build...
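
A minimal sketch of the building block: the conventional additive integration $\phi(Wx + Uh + b)$ is replaced by a Hadamard product $\phi((Wx) \odot (Uh) + b)$; the paper also gives a more general gated variant with extra bias vectors, omitted here:

```python
import numpy as np

def additive_step(x, h, W, U, b):
    """Conventional RNN building block: phi(Wx + Uh + b)."""
    return np.tanh(W @ x + U @ h + b)

def mi_step(x, h, W, U, b):
    """Multiplicative Integration: a Hadamard product (Wx) * (Uh)
    replaces the sum, gating each unit's input by its recurrent state."""
    return np.tanh((W @ x) * (U @ h) + b)

d = 8
x, h = np.random.randn(d), np.random.randn(d)
W, U, b = np.random.randn(d, d), np.random.randn(d, d), np.zeros(d)
print(additive_step(x, h, W, U, b).shape, mi_step(x, h, W, U, b).shape)
```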

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, D...
2017
6 references

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried ou...
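
A minimal sketch of the affine scheme such inference relies on, mapping real values to 8-bit integers through a scale and a zero point; the calibration constants below are illustrative:

```python
import numpy as np

def quantize(x, scale, zero_point):
    """Affine quantization: real x ~= scale * (q - zero_point)."""
    q = np.round(x / scale) + zero_point
    return np.clip(q, 0, 255).astype(np.uint8)

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.int32) - zero_point)

x = np.random.randn(8).astype(np.float32)
scale = (x.max() - x.min()) / 255.0        # map observed range onto [0, 255]
zero_point = int(round(-x.min() / scale))  # integer that represents real 0
q = quantize(x, scale, zero_point)
print(np.abs(dequantize(q, scale, zero_point) - x).max())  # small error
```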

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu
2021
5 references

Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to in...
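
A minimal NumPy sketch of the rotary embedding: pairs of dimensions are rotated by position-dependent angles, so dot products between rotated queries and keys depend only on relative position. This uses the split-halves pair layout common in implementations; the paper itself interleaves adjacent dimensions:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each (x1, x2) dimension pair of x by pos * freq."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) * 2.0 / d)   # per-pair frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.random.randn(64)
print(rope(q, pos=5).shape)   # (64,): the same vector, rotated
```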

Root Mean Square Layer Normalization

Biao Zhang, Rico Sennrich
2019
5 references

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. H...
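
A minimal sketch of the proposed simplification: LayerNorm's mean subtraction is dropped and activations are scaled by their root mean square only, with a learned gain:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-8):
    """RMSNorm: scale by the root mean square, with no re-centering."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.random.randn(2, 8)
print(rms_norm(x, gain=np.ones(8)).shape)   # (2, 8)
```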

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
2015
14 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a ...

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S....
2018
5 references

Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user p...