ML Compilers
Deep learning compilation frameworks and optimization techniques
Repositories (6)
apache/tvm
iree-org/iree
onnx/onnx
openxla/xla
pytorch/glow
triton-lang/triton
Papers (74)
An Improved Algorithm for hypot(a,b)
We develop a fast and accurate algorithm for evaluating $\sqrt{a^2+b^2}$ for two floating point numbers $a$ and $b$. Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we will de...
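As a quick illustration of why a dedicated library routine exists (this sketch is not from the paper), the naive formula overflows for large inputs while the classic scale-before-squaring form does not:

    import math

    def hypot_naive(a, b):
        # a * a already overflows to inf for a around 1e200, even though
        # the final result is perfectly representable.
        return math.sqrt(a * a + b * b)

    def hypot_scaled(a, b):
        # Factor out the larger magnitude before squaring to avoid
        # intermediate overflow/underflow.
        a, b = abs(a), abs(b)
        if a < b:
            a, b = b, a
        if a == 0.0:
            return 0.0
        return a * math.sqrt(1.0 + (b / a) ** 2)

    print(hypot_naive(3e200, 4e200))   # inf
    print(hypot_scaled(3e200, 4e200))  # 5e+200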
Ansor: Generating High-Performance Tensor Programs for Deep Learning
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep lea...
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that ...
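For reference, PReLU is a one-line change to ReLU: the negative-side slope becomes a learned parameter instead of being fixed at 0 (ReLU) or at a small constant (Leaky ReLU). A minimal NumPy sketch:

    import numpy as np

    def prelu(x, a):
        # f(x) = x if x > 0 else a * x; in the paper `a` is learned per channel.
        return np.where(x > 0, x, a * x)

    x = np.array([-2.0, -0.5, 0.0, 1.5])
    print(prelu(x, a=0.25))  # [-0.5 -0.125 0. 1.5]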
DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction
Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the prac...
Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key a...
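The mechanism behind this paper (StreamingLLM) is a KV-cache policy: keep a handful of initial "attention sink" tokens plus a sliding window of recent tokens, and evict everything in between. A framework-agnostic sketch of that index selection; the sink and window sizes here are illustrative:

    def streaming_cache_indices(seq_len, num_sinks=4, window=1024):
        # Keep the first `num_sinks` positions plus the most recent `window`
        # positions; the middle of the sequence is evicted from the KV cache.
        if seq_len <= num_sinks + window:
            return list(range(seq_len))
        return list(range(num_sinks)) + list(range(seq_len - window, seq_len))

    print(len(streaming_cache_indices(10_000)))  # 1028 cached positions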
Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks
Popular deep neural networks (DNNs) spend the majority of their execution time computing convolutions. The Winograd family of algorithms can greatly reduce the number of arithmetic operations required and is present in many DNN software frameworks. H...
Fast Algorithms for Convolutional Neural Networks
Deep convolutional neural networks take GPU days of compute time to train on large data sets. Pedestrian detection for self-driving cars requires very low latency. Image recognition for mobile phones is constrained by limited processing resources. Th...
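The arithmetic saving is easiest to see in the 1-D minimal-filtering case F(2,3): two outputs of a 3-tap filter cost 6 multiplications directly, but only 4 in the Winograd form. A small Python check against direct convolution, using the standard F(2,3) constants (a sketch, not the paper's tiled 2-D kernel):

    def winograd_f23(d, g):
        # d: 4 inputs, g: 3 filter taps -> 2 outputs using 4 multiplications.
        m1 = (d[0] - d[2]) * g[0]
        m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
        m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
        m4 = (d[1] - d[3]) * g[2]
        return [m1 + m2 + m3, m2 - m3 - m4]

    def direct(d, g):
        return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

    d, g = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0]
    print(winograd_f23(d, g), direct(d, g))  # both [4.5, 6.0]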
Fast splittable pseudorandom number generators
We describe a new algorithm SplitMix for an object-oriented and splittable pseudorandom number generator (PRNG) that is quite fast: 9 64-bit arithmetic/logical operations per 64 bits generated. A conventional linear PRNG object provides a generate me...
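A compact Python transcription of the SplitMix64 core: the state advances by a fixed odd "gamma" and each output passes through a MurmurHash3-style finalizer. The constants are the published ones, but this sketch omits the paper's split() operation:

    MASK64 = (1 << 64) - 1

    def splitmix64(seed):
        """Generate 64-bit outputs from a 64-bit seed."""
        state = seed & MASK64
        while True:
            state = (state + 0x9E3779B97F4A7C15) & MASK64   # fixed golden-ratio gamma
            z = state
            z = ((z ^ (z >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
            z = ((z ^ (z >> 27)) * 0x94D049BB133111EB) & MASK64
            yield z ^ (z >> 31)

    gen = splitmix64(42)
    print(hex(next(gen)), hex(next(gen)))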
Finding All the Elementary Circuits of a Directed Graph
An algorithm is presented which finds all the elementary circuits of a directed graph in time bounded by $O((n + e)(c + 1))$ and space bounded by $O(n + e)$, where there are n vertices, e edges and c elementary circuits in the graph. The algorithm re...
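This is Johnson's 1975 algorithm; networkx's simple_cycles is based on it, which makes it easy to see what "elementary circuits" means in practice (assuming networkx is installed):

    import networkx as nx

    # Two elementary circuits: 1 -> 2 -> 3 -> 1 and 2 -> 4 -> 2.
    G = nx.DiGraph([(1, 2), (2, 3), (3, 1), (2, 4), (4, 2)])
    print(list(nx.simple_cycles(G)))  # e.g. [[1, 2, 3], [2, 4]] (rotation may differ)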
GSPMD: General and Scalable Parallelization for ML Computation Graphs
We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute t...
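In JAX, which uses GSPMD/XLA as its partitioner, those "few annotations" are a device mesh plus per-array sharding specs; the partitioning of every downstream op is propagated by the compiler. A hedged sketch (assumes 8 accessible devices arranged as a 4x2 mesh):

    import jax
    import numpy as np
    from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

    # Requires 8 devices; the mesh axis names say how arrays may be split.
    mesh = Mesh(np.array(jax.devices()).reshape(4, 2), axis_names=("data", "model"))
    x = jax.device_put(np.ones((128, 64), np.float32),
                       NamedSharding(mesh, P("data", "model")))

    @jax.jit
    def f(x):
        # GSPMD propagates the sharding of `x` through the matmul and reduction.
        return (x @ x.T).sum()

    print(f(x))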
HPTT: A High-Performance Tensor Transposition C++ Library
Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which th...
Learning to Optimize Tensor Programs
We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learnin...
Merge Path - A Visually Intuitive Approach to Parallel Merging
Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-t...
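The key observation is that the point where the merge path crosses cross-diagonal d can be found by an independent binary search, so each core can locate and produce its own contiguous slice of the output with no coordination. A sequential Python sketch of that partitioning (illustrative, not the paper's parallel implementation):

    def merge_path_partition(a, b, diag):
        # Smallest i such that the first `diag` outputs take exactly i elements
        # from `a` (binary search along the cross-diagonal; ties go to `a`).
        lo, hi = max(0, diag - len(b)), min(diag, len(a))
        while lo < hi:
            i = (lo + hi) // 2
            if a[i] <= b[diag - i - 1]:
                lo = i + 1
            else:
                hi = i
        return lo

    def merge_segment(a, b, d0, d1):
        # Produce only output positions [d0, d1); safe to run per worker.
        i = merge_path_partition(a, b, d0)
        j = d0 - i
        out = []
        for _ in range(d1 - d0):
            if j >= len(b) or (i < len(a) and a[i] <= b[j]):
                out.append(a[i]); i += 1
            else:
                out.append(b[j]); j += 1
        return out

    a, b = [1, 3, 5, 7], [2, 4, 6, 8]
    # Two "workers", each producing half of the merged output independently.
    print(merge_segment(a, b, 0, 4) + merge_segment(a, b, 4, 8))  # [1..8]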
Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference
Modern deep neural networks increasingly make use of features such as dynamic control flow, data structures and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a pre-determin...
Optimizing Winograd-Based Convolution with Tensor Cores
Convolution computing is one of the primary time-consuming parts of convolutional neural networks (CNNs). State-of-the-art convolutional neural networks use small 3 × 3 filters. Recent work on Winograd convolution can reduce the computational complex...
Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models
Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the size of the...
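The decomposition idea in the title can be sketched abstractly: split a producer operation into chunks so the collective for one chunk runs while the next chunk is still being computed. The helpers below (matmul, all_reduce_async, wait) are placeholders for whatever kernels and asynchronous collectives the framework provides; this is a control-flow sketch, not the paper's compiler pass:

    def overlapped_matmul_allreduce(x_chunks, w, matmul, all_reduce_async, wait):
        # Issue each chunk's collective as soon as its compute finishes, so
        # communication for chunk k overlaps computation of chunk k+1.
        pending = []
        for xk in x_chunks:
            yk = matmul(xk, w)                    # compute chunk k
            pending.append(all_reduce_async(yk))  # non-blocking collective
        return [wait(h) for h in pending]         # gather results in order

    # Trivial single-process stand-ins just to exercise the control flow.
    out = overlapped_matmul_allreduce(
        x_chunks=[[1.0], [2.0]], w=3.0,
        matmul=lambda x, w: [v * w for v in x],
        all_reduce_async=lambda y: y,
        wait=lambda h: h)
    print(out)  # [[3.0], [6.0]]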
Recurrent Neural Network Regularization
We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper...
TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s
This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivate...
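The "peak FLOP/s" framing comes from casting distance computation as one large matrix multiplication: squared Euclidean distances reduce to a query-database dot product plus norms, after which only a top-k selection remains. A NumPy sketch of that reduction (not the paper's TPU kernel):

    import numpy as np

    def knn_via_matmul(queries, database, k):
        # ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2; the q.d term is one large
        # matrix multiplication, which is the part that runs near peak FLOP/s.
        dots = queries @ database.T
        d2 = (queries ** 2).sum(1, keepdims=True) - 2 * dots + (database ** 2).sum(1)
        return np.argpartition(d2, k, axis=1)[:, :k]  # unordered k nearest per query

    q, db = np.random.randn(8, 64), np.random.randn(1000, 64)
    print(knn_via_matmul(q, db, k=10).shape)  # (8, 10)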
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms --...
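For orientation, TVM's classic tensor-expression flow separates the definition of a computation from its schedule. The vector-add below uses the older te.create_schedule style from early TVM tutorials (newer releases move this flow to TensorIR/Relax), so treat it as a sketch:

    import tvm
    from tvm import te

    n = te.var("n")
    A = te.placeholder((n,), name="A")
    B = te.placeholder((n,), name="B")
    C = te.compute((n,), lambda i: A[i] + B[i], name="C")  # what to compute

    s = te.create_schedule(C.op)                 # how to compute it (loop structure)
    f = tvm.build(s, [A, B, C], target="llvm")   # lowered and compiled CPU kernel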
ZeRO-Offload: Democratizing Billion-Scale Model Training
Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training acce...