Papers Referenced in This Repository

17 papers · 11 files · 21 references

FP8 Formats for Deep Learning

Paulius Micikevicius, Dušan Stošić, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, et al.
2022
24 references

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). ...

5 references in code
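
A minimal sketch of the bit-level decoding behind these formats, assuming standard IEEE-754-style rules. The paper's E5M2 encoding follows these rules; its E4M3 encoding deviates by reclaiming most special-value bit patterns to extend the maximum normal value to 448, which this sketch does not model.

```python
def decode_fp8(bits: int, exp_bits: int, man_bits: int) -> float:
    """Decode one byte as a sign/exponent/mantissa float (IEEE-754 style)."""
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if (bits >> 7) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    if exp == 0:                        # subnormal: no implicit leading 1
        return sign * (man / 2 ** man_bits) * 2.0 ** (1 - bias)
    if exp == (1 << exp_bits) - 1:      # all-ones exponent: inf / NaN
        return float("nan") if man else sign * float("inf")
    return sign * (1 + man / 2 ** man_bits) * 2.0 ** (exp - bias)

# Largest finite E5M2 value: exponent field 11110, mantissa 11.
print(decode_fp8(0b0_11110_11, exp_bits=5, man_bits=2))  # 57344.0
```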

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Leyuan Wang, et al.
2018
3 references

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators ...

1 reference in code
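
For context, a sketch of the declare-then-schedule workflow TVM is built around, using its classic tensor-expression API (tvm.te). The API has shifted across releases (newer versions favor TensorIR), so treat this as illustrative rather than current best practice.

```python
import tvm
from tvm import te

# Declare *what* to compute, separately from *how* to compute it.
n = te.var("n")
A = te.placeholder((n,), name="A")
B = te.compute((n,), lambda i: A[i] * 2.0, name="B")

# The schedule applies machine-dependent optimizations without touching
# the definition above: split the loop, parallelize the outer part.
s = te.create_schedule(B.op)
xo, xi = s[B].split(B.op.axis[0], factor=64)
s[B].parallel(xo)

mod = tvm.build(s, [A, B], target="llvm")  # compile for the local CPU
```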

Learning to Optimize Tensor Programs

Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, Arvind Krishnamurthy
2018
1 reference

We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries ...

1 reference in code
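
The overall loop the paper describes can be summarized as: propose candidate implementations, rank them with a learned cost model, measure only the most promising on real hardware, and retrain the model on the measurements. A hypothetical skeleton (all function and parameter names are stand-ins, not the paper's API):

```python
def search(propose_candidates, cost_model, measure_on_hw, rounds=10, top_k=8):
    history = []  # (candidate, measured_runtime) pairs
    for _ in range(rounds):
        candidates = propose_candidates(n=1000)
        # Cheap ranking with the learned model; expensive hardware
        # measurement is reserved for the top-k predictions.
        ranked = sorted(candidates, key=cost_model.predict)
        for cand in ranked[:top_k]:
            history.append((cand, measure_on_hw(cand)))
        cost_model.fit(history)  # refine the model with real data
    return min(history, key=lambda pair: pair[1])[0]
```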

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Lianmin Zheng, Chengfan Jia, Minmin Sun, Zhao Wu, Cody Hao Yu, Ameer Haj-Ali, Yida Wang, Jun Yang, Danyang Zhuo, et al.
2020
1 reference

High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries ...

1 reference in code

Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference

Haichen Shen, Jared Roesch, Zhi Chen, Wei Chen, Yong Wu, Mu Li, Vin Sharma, Zachary Tatlock, Yida Wang
2020
2 references

Modern deep neural networks increasingly make use of features such as dynamic control flow, data structures and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a pre-determined model architecture and input data shapes -- assumptions ...

1 reference in code
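
One simple way to picture the dynamic-shape problem: compile kernels for a few shape buckets and dispatch at run time, padding inputs up to the nearest bucket. This is only an illustration of the problem space; Nimble itself compiles symbolic shape functions and executes on a VM rather than relying on fixed buckets.

```python
import bisect

BUCKETS = [32, 64, 128, 256, 512]                  # pre-compiled lengths
kernels = {b: f"kernel_len{b}" for b in BUCKETS}   # stand-ins for compiled fns

def dispatch(seq_len):
    """Pick the smallest compiled bucket that fits this input."""
    i = bisect.bisect_left(BUCKETS, seq_len)
    if i == len(BUCKETS):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket")
    return kernels[BUCKETS[i]]

print(dispatch(100))  # kernel_len128 -- pad 100 up to 128, run that kernel
```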

Cortex: A Compiler for Recursive Deep Learning Models

Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry
2020
1 reference

Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low-level kernel optimizations such as those found in vendor libraries. This approach often leaves significant performance on the table, especially for the case of recursive deep learning models. ...

1 reference in code
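
A toy example of why recursive models resist the usual two-step pipeline: control flow and the amount of work depend on each input's structure (here, a tree), so no static dataflow graph can be fixed ahead of time. The model below is a deliberately trivial stand-in for something like a TreeLSTM.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    value: float
    children: list = field(default_factory=list)

def encode(node, leaf_fn=lambda v: v, combine_fn=sum):
    """Bottom-up aggregation whose recursion depth follows the input."""
    if not node.children:
        return leaf_fn(node.value)
    return combine_fn(encode(c, leaf_fn, combine_fn) for c in node.children)

tree = Node(0.0, [Node(1.0), Node(0.0, [Node(2.0), Node(3.0)])])
print(encode(tree))  # 6.0
```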

UNIT: Unifying Tensorized Instruction Compilation

Jian Weng, Animesh Jain, Jie Wang, Leyuan Wang, Yida Wang, Tony Nowatzki
2021
1 reference

Because of the increasing demand for computation in DNNs, researchers develop both hardware and software mechanisms to reduce the compute and memory burden. A widely adopted approach is to use mixed precision data types. However, it is hard to leverage mixed precision without hardware support because ...

1 reference in code
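
The core idea of tensorized compilation, sketched with an emulated 4-way int8 dot-product intrinsic (in the spirit of instructions like NVIDIA's DP4A): tile the reduction loop by the intrinsic's width and replace each tile with one intrinsic call. The intrinsic here is emulated in NumPy; on real hardware it would be a single instruction.

```python
import numpy as np

def dot4_i8(a4, b4, acc):
    """Emulated intrinsic: 4-way int8 dot product with int32 accumulate."""
    return acc + int(np.dot(a4.astype(np.int32), b4.astype(np.int32)))

def dot_i8(a, b):
    """Reduction loop tiled by the intrinsic width and 'tensorized'."""
    assert len(a) % 4 == 0
    acc = 0
    for k in range(0, len(a), 4):
        acc = dot4_i8(a[k:k + 4], b[k:k + 4], acc)
    return acc

a = np.array([1, -2, 3, 4, 5, 6, 7, 8], dtype=np.int8)
b = np.ones(8, dtype=np.int8)
print(dot_i8(a, b))  # 32
```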

Bring Your Own Codegen to Deep Learning Compiler

Zhi Chen, Cody Hao Yu, Trevor Morris, Jorn Tuyls, Yi-Hsiang Lai, Jared Roesch, Elliott Delaye, Vin Sharma, et al.
2021
1 reference

Deep neural networks (DNNs) have been ubiquitously applied in many applications, and accelerators have emerged as an enabler to support the fast and efficient inference tasks of these applications. However, to achieve high model coverage with high performance, each accelerator vendor has to develop a ...

1 reference in code
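
A cartoon of the partitioning step at the heart of this approach: given a predicate for which operators an accelerator supports, carve a (topologically ordered) op sequence into maximal offloadable regions, leaving the rest for the default backend. The real pass works on TVM's Relay IR with annotate/merge/partition stages; this linear version only conveys the idea.

```python
SUPPORTED = {"conv2d", "add", "relu"}   # hypothetical accelerator op set

def partition(ops):
    regions, current, on_accel = [], [], None
    for op in ops:
        supported = op in SUPPORTED
        if supported != on_accel and current:   # support status flipped
            regions.append((on_accel, current))
            current = []
        current, on_accel = current + [op], supported
    if current:
        regions.append((on_accel, current))
    return regions

print(partition(["conv2d", "add", "relu", "softmax", "conv2d"]))
# [(True, ['conv2d', 'add', 'relu']), (False, ['softmax']), (True, ['conv2d'])]
```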

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

Pratik Fegade, Tianqi Chen, Phillip B. Gibbons, Todd C. Mowry
2021
1 reference

There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks ...

1 reference in code
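
The storage layout in question, in miniature: a ragged batch kept as one flat value buffer plus row offsets, instead of a dense buffer padded to the longest row. A NumPy sketch:

```python
import numpy as np

rows = [np.array([1., 2., 3.]), np.array([4.]), np.array([5., 6.])]

values = np.concatenate(rows)                       # [1. 2. 3. 4. 5. 6.]
offsets = np.cumsum([0] + [len(r) for r in rows])   # [0 3 4 6]

def row(i):
    """Slice row i out of the flat buffer via its offsets."""
    return values[offsets[i]:offsets[i + 1]]

print(row(1))  # [4.]
# The padded alternative needs a 3x3 buffer for these 6 values,
# wasting a third of the memory (and compute) on this small batch.
```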

Group Normalization

Yuxin Wu, Kaiming He
2018
10 references

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation. ...

1 reference in code
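
The normalization itself is a few lines: split the C channels into G groups and normalize each sample over its (channels-per-group, H, W) block, so the statistics never involve the batch dimension. A NumPy sketch following the paper's formulation (the learned per-channel scale and shift are omitted):

```python
import numpy as np

def group_norm(x, G, eps=1e-5):
    N, C, H, W = x.shape
    x = x.reshape(N, G, C // G, H, W)
    mean = x.mean(axis=(2, 3, 4), keepdims=True)   # per sample, per group
    var = x.var(axis=(2, 3, 4), keepdims=True)
    x = (x - mean) / np.sqrt(var + eps)
    return x.reshape(N, C, H, W)

out = group_norm(np.random.randn(2, 8, 4, 4), G=4)
print(out.shape)  # (2, 8, 4, 4); statistics independent of batch size
```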

Merge Path - A Visually Intuitive Approach to Parallel Merging

Oded Green, Saher Odeh, Yitzhak Birk
2014
1 reference

Merging two sorted arrays is a prominent building block for sorting and other functions. Its efficient parallelization requires balancing the load among compute cores, minimizing the extra work brought about by parallelization, and minimizing inter-thread synchronization requirements. Efficient use ...

1 reference in code
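
The load-balancing trick: on the merge grid, worker boundaries are diagonals i + j = d, and a binary search along each diagonal finds where the merge path crosses it. Each worker then merges an equal-sized slice with no inter-thread communication. A sketch of the diagonal search:

```python
def cross_diagonal(a, b, d):
    """Find how many of the first d merged elements come from a vs. b."""
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        if a[i] <= b[d - i - 1]:   # merge path still consuming from a
            lo = i + 1
        else:
            hi = i
    return lo, d - lo

a, b = [1, 3, 5, 7], [2, 4, 6, 8]
print([cross_diagonal(a, b, d) for d in (0, 4, 8)])
# [(0, 0), (2, 2), (4, 4)] -> two workers, each merging 4 elements
```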

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
2015
14 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, which ...

1 reference in code
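
A spatial transformer combines a localisation network, a grid generator, and a differentiable sampler; the sampler is the piece that makes the whole module trainable end to end. A minimal NumPy sketch of bilinear sampling at one (assumed in-bounds) transformed coordinate:

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Interpolate img at real-valued (x, y) from its 4 nearest pixels."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bot   # weights are differentiable in x, y

img = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(img, 1.5, 2.5))  # 11.5
```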

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
2015
9 references

Rectified activation units (rectifiers) are essential for state-of-the-art neural networks. In this work, we study rectifier neural networks for image classification from two aspects. First, we propose a Parametric Rectified Linear Unit (PReLU) that generalizes the traditional rectified unit. PReLU improves model fitting with nearly zero extra computational cost and little overfitting risk. ...

1 reference in code
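
Both contributions fit in a few lines. PReLU replaces ReLU's fixed zero slope for negative inputs with a learned slope a; the same paper's initialization ("He init") draws weights with standard deviation sqrt(2 / fan_in) so activation variance survives deep stacks of rectifiers. A NumPy sketch:

```python
import numpy as np

def prelu(y, a):
    """max(0, y) + a * min(0, y); a is learned jointly with the weights."""
    return np.where(y > 0, y, a * y)

print(prelu(np.array([-2.0, 3.0]), a=0.25))  # [-0.5  3. ]

# He initialization for a layer with fan_in inputs per unit,
# e.g. a 3x3 conv over 64 input channels:
fan_in = 3 * 3 * 64
W = np.random.randn(128, fan_in) * np.sqrt(2.0 / fan_in)
```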

Fast Algorithms for Convolutional Neural Networks

Andrew Lavin, Scott Gray
2015
2 references

Deep convolutional neural networks take GPU days of compute time to train on large data sets. Pedestrian detection for self driving cars requires very low latency. Image recognition for mobile phones is constrained by limited processing resources. The success of convolutional neural networks in these ...

1 reference in code
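
The smallest member of the family, F(2,3), shows where the savings come from: two outputs of a 3-tap 1-D convolution in 4 multiplications instead of 6 (the divisions by 2 belong to the filter transform and can be precomputed). A sketch, checked against direct computation:

```python
import numpy as np

def winograd_f23(d, g):
    """F(2,3): inputs d[0..3], filter g[0..2], two outputs, 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d, g = np.array([1., 2., 3., 4.]), np.array([2., 1., 3.])
print(winograd_f23(d, g))                      # [13. 19.]
print(np.convolve(d, g[::-1], mode="valid"))   # [13. 19.] (direct correlation)
```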

Error Analysis and Improving the Accuracy of Winograd Convolution for Deep Neural Networks

Barbara Barabasz, Andrew Anderson, Kirk M. Soodhalter, David Gregg
2018
1 reference

Popular deep neural networks (DNNs) spend the majority of their execution time computing convolutions. The Winograd family of algorithms can greatly reduce the number of arithmetic operations required and is present in many DNN software frameworks. However, the performance gain is at the expense of ...

1 reference in code

Finding All the Elementary Circuits of a Directed Graph

Donald B. Johnson
1975
1 reference

An algorithm is presented which finds all the elementary circuits of a directed graph in time bounded by $O((n + e)(c + 1))$ and space bounded by $O(n + e)$, where there are n vertices, e edges and c elementary circuits in the graph. The algorithm resembles algorithms by Tiernan and Tarjan, but is faster ...

1 reference in code
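
NetworkX's cycle enumerator is based on this algorithm (its documentation cites Johnson 1975), so the result is easy to demonstrate without reimplementing the paper:

```python
import networkx as nx

# 0 -> 1 -> 2 -> 0, plus a back edge 1 -> 0 and a self-loop at 2:
G = nx.DiGraph([(0, 1), (1, 2), (2, 0), (1, 0), (2, 2)])

# Three elementary circuits: [0, 1], [0, 1, 2], and [2]
# (starting vertex / iteration order may vary by version).
for cycle in nx.simple_cycles(G):
    print(cycle)
```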

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
2023
1 reference

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory. Secondly, ...

1 reference in code
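
The cache policy the paper proposes is simple to state: keep the first few tokens (the "attention sinks"; the paper finds four suffice) forever, keep a sliding window of the most recent tokens, and evict everything in between. A sketch with integers standing in for cached (K, V) pairs:

```python
from collections import deque

class SinkCache:
    def __init__(self, n_sink=4, window=1024):
        self.n_sink, self.sink = n_sink, []
        self.recent = deque(maxlen=window)  # old entries fall off the left

    def append(self, kv):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)            # sink tokens are never evicted
        else:
            self.recent.append(kv)

    def cache(self):
        return self.sink + list(self.recent)

c = SinkCache(n_sink=2, window=3)
for t in range(8):
    c.append(t)
print(c.cache())  # [0, 1, 5, 6, 7] -- sinks kept, middle evicted
```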