ML Compilers
Deep learning compilation frameworks and optimization techniques
Repositories (2)
Papers (17)
8-bit Numerical Formats for Deep Neural Networks
Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the ad...
An Improved Algorithm for hypot(a,b)
We develop a fast and accurate algorithm for evaluating $\sqrt{a^2+b^2}$ for two floating point numbers $a$ and $b$. Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we will de...
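To illustrate why $\sqrt{a^2+b^2}$ needs care in floating point, here is a minimal Python sketch. It is not the paper's algorithm: `hypot_naive` and `hypot_scaled` are hypothetical names, and the second function uses the classic rescaling trick (divide by the larger magnitude) only to show how intermediate overflow and underflow can be avoided. NaN inputs are not handled.

```python
import math


def hypot_naive(a: float, b: float) -> float:
    """Direct formula; a*a or b*b can overflow to inf or underflow to 0."""
    return math.sqrt(a * a + b * b)


def hypot_scaled(a: float, b: float) -> float:
    """Rescaled formula: big * sqrt(1 + (small/big)^2).

    The ratio small/big lies in [0, 1], so squaring it cannot overflow.
    """
    a, b = abs(a), abs(b)
    big, small = (a, b) if a >= b else (b, a)
    if big == 0.0:
        return 0.0          # both inputs are zero
    if math.isinf(big):
        return math.inf     # avoid inf/inf producing NaN
    ratio = small / big
    return big * math.sqrt(1.0 + ratio * ratio)


if __name__ == "__main__":
    for a, b in [(3.0, 4.0), (1e200, 1e200), (1e-200, 1e-200)]:
        print(a, b, hypot_naive(a, b), hypot_scaled(a, b), math.hypot(a, b))
```

The rescaled form trades an extra division for robustness; it is generally slightly less accurate than a carefully engineered library `hypot`, which is the gap the paper sets out to address.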
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful...
DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the prac...
Fast splittable pseudorandom number generators
We describe a new algorithm SplitMix for an object-oriented and splittable pseudorandom number generator (PRNG) that is quite fast: 9 64-bit arithmetic/logical operations per 64 bits generated. A conventional linear PRNG object provides a generate me...
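As a rough illustration of the "9 64-bit arithmetic/logical operations per 64 bits" figure, the following Python sketch implements a fixed-increment generator in the SplitMix style. The constants and shift amounts are assumptions taken from the widely circulated `splitmix64` reference code, not from the abstract above, and the class name `SplitMix64` is hypothetical. The `split` operation that makes the generator splittable is omitted.

```python
MASK64 = (1 << 64) - 1                 # emulate 64-bit wraparound in Python
GOLDEN_GAMMA = 0x9E3779B97F4A7C15      # assumed fixed increment (odd constant)


def mix64(z: int) -> int:
    """Bit mixer: 3 xors, 3 shifts, 2 multiplies (8 operations)."""
    z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
    return z ^ (z >> 31)


class SplitMix64:
    """Counter-based generator: advance the state by a constant, then mix it."""

    def __init__(self, seed: int) -> None:
        self.state = seed & MASK64

    def next(self) -> int:
        # One addition plus the 8 mixer operations gives 9 operations in total.
        self.state = (self.state + GOLDEN_GAMMA) & MASK64
        return mix64(self.state)


if __name__ == "__main__":
    gen = SplitMix64(seed=42)
    print([hex(gen.next()) for _ in range(3)])
```

Because the state is just a counter with a constant stride, two generators created from the same seed produce identical streams, which makes the sketch easy to test for reproducibility.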
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, a...
FP8 Formats for Deep Learning
FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 ...
Gaussian Error Linear Units (GELUs)
We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights ...
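The definition $x\Phi(x)$ can be evaluated directly with the error function, since $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$. The sketch below is a scalar Python illustration of that exact form; the function names `phi_cdf` and `gelu` are chosen here for illustration and do not correspond to any particular framework's API.

```python
import math


def phi_cdf(x: float) -> float:
    """Standard Gaussian CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact GELU: the input scaled by the probability mass below it."""
    return x * phi_cdf(x)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"gelu({x:+.1f}) = {gelu(x):+.6f}")
```

For large positive inputs the function approaches the identity, and for large negative inputs it approaches zero, so it behaves like a smooth variant of ReLU.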
GSPMD: General and Scalable Parallelization for ML Computation Graphs
We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute t...
HPTT: A High-Performance Tensor Transposition C++ Library
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which th...
Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations
In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to th...
Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference
Modern deep neural networks increasingly make use of features such as dynamic control flow, data structures and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a pre-determin...
Optimizing Winograd-Based Convolution with Tensor Cores
Convolution computing is one of the primary time-consuming parts of convolutional neural networks (CNNs). State-of-the-art convolutional neural networks use small, 3 × 3 filters. Recent work on Winograd convolution can reduce the computational complex...
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the size of the...
Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried ou...
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivate...
ZeRO-Offload: Democratizing Billion-Scale Model Training
Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training acce...