11 papers
11 files
17 references

Papers Referenced in This Repository

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy
2015
17 references

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriousl...

Show 3 references in code

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke
2019
4 references

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to c...

Show 2 references in code

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Felix Chern, Blake Hechtman, Andy Davis, Ruiqi Guo, David Majnemer, Sanjiv Kumar
2022
5 references

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model tha...

Show 2 references in code

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...
2022
14 references

FP8 is a natural progression for accelerating deep learning training inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bi...

Show 2 references in code

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel
2016
11 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by...

Show 2 references in code

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, ...
2021
5 references

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the ...

Show 1 reference in code

Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake A. Hechtman, Dehao Chen, K. M...
2022
1 reference

Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the size of these models. Intra-layer model parallelism is an app...

Show 1 reference in code

ZeRO-Offload: Democratizing Billion-Scale Model Training

2021
1 reference

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models wit...

Show 1 reference in code

HPTT: A High-Performance Tensor Transposition C++ Library

Paul Springer, Tong Su, Paolo Bientinesi
2017
1 reference

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutatio...

Show 1 reference in code

An Improved Algorithm for hypot(a,b)

Carlos F. Borges
2019
2 references

We develop a fast and accurate algorithm for evaluating $\sqrt{a^2+b^2}$ for two floating point numbers $a$ and $b$. Library functions that perform this computation are generally named {\tt hypot(a,b)}. We will compare four approaches that we will develop in this paper to the current resident librar...

Show 1 reference in code

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael...
2022
1 reference

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from d...

Show 1 reference in code
Link copied to clipboard!