openxla/xla - PaperGrep

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriousl...

Show 3 references in code

docs/operation_semantics.md:601

docs/operation_semantics.md:671

docs/operation_semantics.md:710

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke

2019

4 references

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to c...

Show 2 references in code

xla/backends/gpu/codegen/triton/transforms/stablehlo_lower_to_triton.cc:523

xla/backends/gpu/codegen/triton/transforms/stablehlo_lower_to_triton.cc:562

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Felix Chern, Blake Hechtman, Andy Davis, Ruiqi Guo, David Majnemer, Sanjiv Kumar

2022

5 references

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with a similar level of recall. The design of the proposed algorithm is motivated by an accurate accelerator performance model tha...

Show 2 references in code

xla/hlo/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc:2354

xla/hlo/translate/mhlo_to_hlo/mlir_hlo_to_hlo.cc:4085

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

14 references

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bi...

Show 2 references in code

xla/python/ifrt/dtype.cc:47

xla/python/ifrt/dtype.cc:93

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016

11 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by...

Show 2 references in code

xla/service/cpu/onednn_contraction_rewriter.cc:288

xla/service/gpu/transforms/gemm_rewriter.cc:761
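
The formula in the abstract above, $x\Phi(x)$, can be written out directly. The following is a minimal sketch of the exact (erf-based) form only; the function name `gelu` is chosen here for illustration and is not taken from the XLA files listed above, which may use a different formulation.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU, x * Phi(x), where Phi is the standard normal CDF.

    Illustrative helper, not XLA's implementation.
    """
    # Phi(x) expressed through the error function.
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


if __name__ == "__main__":
    for value in (-1.0, 0.0, 1.0):
        print(value, gelu(value))
```

Because $\Phi(x)$ approaches 1 for large positive $x$ and 0 for large negative $x$, the function behaves like the identity on the right and decays to zero on the left.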

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, ...

2021

5 references

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the ...

Show 1 reference in code

docs/gpu_architecture.md:80

Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models

Shibo Wang, Jinliang Wei, Amit Sabne, Andy Davis, Berkin Ilbeyi, Blake A. Hechtman, Dehao Chen, K. M...

2022

1 reference

Large deep learning models have shown great potential with state-of-the-art results in many tasks. However, running these large models is quite challenging on an accelerator (GPU or TPU) because the on-device memory is too limited for the size of these models. Intra-layer model parallelism is an app...

Show 1 reference in code

docs/gpu_architecture.md:85

ZeRO-Offload: Democratizing Billion-Scale Model Training

2021

1 reference

Large-scale model training has been a playing ground for a limited few requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models wit...

Show 1 reference in code

xla/core/host_offloading/README.md:8

HPTT: A High-Performance Tensor Transposition C++ Library

Paul Springer, Tong Su, Paolo Bientinesi

2017

1 reference

Recently we presented TTC, a domain-specific compiler for tensor transpositions. Despite the fact that the performance of the generated code is nearly optimal, due to its offline nature, TTC cannot be utilized in all the application codes in which the tensor sizes and the necessary tensor permutatio...

Show 1 reference in code

xla/pjrt/transpose.h:23

An Improved Algorithm for hypot(a,b)

Carlos F. Borges

2019

2 references

We develop a fast and accurate algorithm for evaluating $\sqrt{a^2+b^2}$ for two floating point numbers $a$ and $b$. Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we will develop in this paper to the current resident librar...

Show 1 reference in code

xla/service/elemental_ir_emitter.cc:1588
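
To illustrate why evaluating $\sqrt{a^2+b^2}$ needs care, the sketch below contrasts the direct formula with a textbook rescaled variant. This is not the algorithm proposed in the paper, nor the code in the XLA file listed above; the names `naive_hypot` and `scaled_hypot` are hypothetical, and special values (NaN, infinities) are deliberately not handled.

```python
import math


def naive_hypot(a: float, b: float) -> float:
    """Direct formula; a*a or b*b can overflow to infinity."""
    return math.sqrt(a * a + b * b)


def scaled_hypot(a: float, b: float) -> float:
    """Textbook rescaling: factor out the larger magnitude first.

    Avoids intermediate overflow for finite inputs; not the paper's method.
    """
    a, b = abs(a), abs(b)
    big, small = max(a, b), min(a, b)
    if big == 0.0:
        return 0.0
    ratio = small / big  # always in [0, 1], so ratio * ratio cannot overflow
    return big * math.sqrt(1.0 + ratio * ratio)


if __name__ == "__main__":
    print(naive_hypot(1e200, 1e200))   # intermediate product overflows
    print(scaled_hypot(1e200, 1e200))  # stays finite
    print(math.hypot(1e200, 1e200))    # standard library reference
```

The rescaled form trades overflow safety for an extra division, and it can lose some accuracy compared with a carefully designed routine, which is the kind of gap the paper addresses.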

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael...

2022

1 reference

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from d...

Show 1 reference in code

xla/service/gpu/transforms/collectives/all_reduce_splitter.h:63
