Papers
Browse academic papers referenced in production code
A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions
We report a closed-form expression for the Kullback-Leibler divergence between Cauchy distributions which involves the calculation of a novel definite integral. The formula shows that the Kullback-Leibler divergence between Cauchy densities is always...
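For reference, a minimal sketch of the closed form as we read it, for Cauchy densities with locations μ1, μ2 and scales γ1, γ2 (parameter names are ours):

```python
import math

def kl_cauchy(mu1, gamma1, mu2, gamma2):
    """KL divergence between Cauchy(mu1, gamma1) and Cauchy(mu2, gamma2):
    log(((gamma1 + gamma2)^2 + (mu1 - mu2)^2) / (4 * gamma1 * gamma2)).
    Note the expression is symmetric in the two distributions."""
    return math.log(((gamma1 + gamma2) ** 2 + (mu1 - mu2) ** 2)
                    / (4.0 * gamma1 * gamma2))

# Identical distributions give zero divergence.
assert kl_cauchy(0.0, 1.0, 0.0, 1.0) == 0.0
```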
An Improved Algorithm for hypot(a,b)
We develop a fast and accurate algorithm for evaluating $\sqrt{a^2+b^2}$ for two floating point numbers $a$ and $b$. Library functions that perform this computation are generally named hypot(a,b). We will compare four approaches that we will de...
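The pitfall is easy to reproduce: the naive formula overflows even when the true result is representable. A minimal sketch of the classic rescaling workaround (not one of the paper's improved algorithms):

```python
import math

def naive_hypot(a, b):
    # Overflows for large inputs: a*a exceeds the double range.
    return math.sqrt(a * a + b * b)

def scaled_hypot(a, b):
    # Classic workaround: factor out the larger magnitude before squaring.
    a, b = abs(a), abs(b)
    if a < b:
        a, b = b, a
    if a == 0.0:
        return 0.0
    return a * math.sqrt(1.0 + (b / a) ** 2)

x = 1e200
print(naive_hypot(x, x))   # inf (intermediate overflow)
print(scaled_hypot(x, x))  # ~1.4142135623730951e+200
```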
Base64 encoding and decoding at almost the speed of a memory copy
Many common document formats on the Internet are text-only such as email (MIME) and the Web (HTML, JavaScript, JSON and XML). To include images or executable code in these documents, we first encode them as text using base64. Standard base64 encoding...
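For orientation, standard base64 turns every 3 input bytes into 4 ASCII characters (about 33% overhead); the snippet below only illustrates the encoding itself, not the paper's SIMD-accelerated implementation:

```python
import base64

payload = bytes(range(6))             # arbitrary binary data
encoded = base64.b64encode(payload)   # 6 bytes -> 8 ASCII characters
decoded = base64.b64decode(encoded)

print(encoded)                        # b'AAECAwQF'
assert decoded == payload
assert len(encoded) == 4 * ((len(payload) + 2) // 3)
```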
Concurrent Hash Tables
Concurrent hash tables are among the most important concurrent data structures and are used in numerous applications. For some applications, it is common that hash table accesses dominate the execution time. To efficiently solve these problems in...
Faster Remainder by Direct Computation: Applications to Compilers and Software Libraries
On common processors, integer multiplication is many times faster than integer division. Dividing a numerator n by a divisor d is mathematically equivalent to multiplication by the inverse of the divisor (n / d = n x 1/d). If the divisor is known in ...
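A minimal sketch of the direct-remainder idea for 32-bit operands, emulating the 64-bit fixed-point arithmetic with Python integers; the constant follows the usual c = ceil(2**64 / d) construction (variable names are ours; see the paper for the exact preconditions):

```python
MASK64 = (1 << 64) - 1

def precompute(d):
    # c = ceil(2**64 / d), a fixed-point approximation of 1/d.
    assert 1 < d < 2 ** 32
    return MASK64 // d + 1

def fast_mod(n, c, d):
    # n mod d == ((c * n mod 2**64) * d) >> 64 for 32-bit n and d.
    assert 0 <= n < 2 ** 32
    lowbits = (c * n) & MASK64
    return (lowbits * d) >> 64

d = 95
c = precompute(d)
for n in (0, 1, 94, 95, 12345678, 2 ** 32 - 1):
    assert fast_mod(n, c, d) == n % d
```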
Generating Long Sequences with Sparse Transformers
Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce...
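To make the factorization concrete, here is a toy causal mask in the spirit of the strided pattern, where a position attends to a local window plus every stride-th earlier position (a rough illustration only, not the paper's kernels):

```python
import numpy as np

def strided_causal_mask(n, stride):
    """True where query i may attend to key j: causal, and either within a
    local window of `stride` or at an offset that is a multiple of `stride`."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    causal = j <= i
    local = (i - j) < stride
    strided = (i - j) % stride == 0
    return causal & (local | strided)

mask = strided_causal_mask(16, 4)
# With stride ~ sqrt(n), each row allows O(sqrt(n)) positions instead of O(n).
print(mask.sum(axis=1))
```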
Macro F1 and Macro F1
The 'macro F1' metric is frequently used to evaluate binary, multi-class and multi-label classification problems. Yet, we find that there exist two different formulas to calculate this quantity. In this note, we show that only under rare circumstance...
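The ambiguity is easy to demonstrate with toy numbers: averaging per-class F1 scores generally differs from computing F1 from the macro-averaged precision and recall (the counts below are made up):

```python
def f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0

# Toy per-class precision/recall for a 2-class problem.
precisions = [0.9, 0.3]
recalls    = [0.5, 0.8]

# Variant 1: average the per-class F1 scores.
macro_f1_a = sum(f1(p, r) for p, r in zip(precisions, recalls)) / 2

# Variant 2: F1 of the macro-averaged precision and recall.
macro_p = sum(precisions) / 2
macro_r = sum(recalls) / 2
macro_f1_b = f1(macro_p, macro_r)

print(macro_f1_a, macro_f1_b)  # ~0.540 vs ~0.624: not the same quantity
```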
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In t...
On the Variance of the Adaptive Learning Rate and Beyond
The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detai...
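As a reminder of what the heuristic looks like in practice, a minimal linear warmup multiplier (an illustration of warmup in general, not the rectified variant studied in the paper):

```python
def warmup_scale(step, warmup_steps=2000):
    """Linear learning-rate warmup: ramp the multiplier from 0 to 1 over
    `warmup_steps`, then hold it at 1; the base rate is scaled by this factor."""
    return min(1.0, step / warmup_steps)

for step in (1, 500, 2000, 10000):
    print(step, warmup_scale(step))
```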
Reliable and fast DWARF-based stack unwinding
Debug information, usually encoded in the DWARF format, is a hidden and obscure component of our computing infrastructure. Debug information is obviously used by debuggers, but it also plays a key role in program analysis tools, and, most surprisingl...
Ryū revisited: printf floating point conversion
Ryū Printf is a new algorithm to convert floating-point numbers to decimal strings according to the printf %f, %e, and %g formats: %f generates ‘full’ output (integer part of the input, dot, configurable number of digits), %e generates scientific out...
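As a quick reminder of the three formats, Python's %-formatting follows the same printf conventions:

```python
x = 1234.56789
print("%f" % x)   # 1234.567890   ('full' fixed-point output, 6 digits by default)
print("%e" % x)   # 1.234568e+03  (scientific notation)
print("%g" % x)   # 1234.57       (%e or %f depending on magnitude, trimmed)
```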
Stabilizing Transformers for Reinforcement Learning
Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP), achieving state...
SWALP: Stochastic Weight Averaging in Low-Precision Training
Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. SW...
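A minimal sketch of the averaging step, with a crude quantize() stub standing in for the low-precision format; the real method also modifies the learning-rate schedule and carries the average in higher precision:

```python
import numpy as np

def quantize(w, scale=2 ** -6):
    # Stand-in for a low-precision format: round to a coarse fixed-point grid.
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=4)                 # low-precision iterate (toy)
w_avg, n_avg = np.zeros_like(w), 0

for step in range(100):
    w = quantize(w - 0.1 * rng.normal(size=4))  # fake low-precision SGD step
    if step >= 50:                              # start averaging after a warmup phase
        w_avg = (w_avg * n_avg + w) / (n_avg + 1)
        n_avg += 1

print(w_avg)  # higher-precision running average of low-precision iterates
```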
The continuous Bernoulli: fixing a pervasive error in variational autoencoders
Variational autoencoders (VAE) have quickly become a central tool in machine learning, applicable to a broad range of data types and latent variable models. By far the most common first step, taken by seminal papers and by core software libraries ali...
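For context, the continuous Bernoulli density on [0, 1] has the Bernoulli-like form λ^x (1−λ)^(1−x) but needs a λ-dependent normalizing constant, which the usual VAE reconstruction loss omits; a sketch of the log-normalizer as we understand it (naming is ours):

```python
import numpy as np

def cont_bernoulli_log_norm(lam, eps=1e-6):
    """Log normalizing constant of the continuous Bernoulli:
    C(lam) = 2*atanh(1 - 2*lam) / (1 - 2*lam), with C(0.5) = 2."""
    lam = np.asarray(lam, dtype=float)
    near_half = np.abs(lam - 0.5) < eps
    safe = np.where(near_half, 0.25, lam)  # avoid 0/0 at lam = 0.5
    c = 2.0 * np.arctanh(1.0 - 2.0 * safe) / (1.0 - 2.0 * safe)
    return np.log(np.where(near_half, 2.0, c))

# Dropping this term means the Bernoulli-style reconstruction loss
# is not a normalized density on [0, 1].
print(cont_bernoulli_log_norm([0.1, 0.5, 0.9]))
```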
Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel commands to explicitly...
Trivializations for Gradient-Based Optimization on Manifolds
We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. We call these parametrizations "trivializations". We prove conditions under...
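A toy instance of the idea for the orthogonal group: an unconstrained square matrix A is mapped to an orthogonal Q via the matrix exponential of its skew-symmetric part, so a plain Euclidean optimizer can act on A (scipy is used purely for illustration):

```python
import numpy as np
from scipy.linalg import expm

def trivialize_orthogonal(A):
    """Map an unconstrained matrix A to an orthogonal matrix Q:
    skew-symmetrize, then take the matrix exponential."""
    S = A - A.T          # skew-symmetric: S.T == -S
    return expm(S)       # exp of a skew-symmetric matrix is orthogonal

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))             # free Euclidean parameters
Q = trivialize_orthogonal(A)
print(np.allclose(Q.T @ Q, np.eye(4)))  # True: the constraint holds by construction
```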
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited devi...
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-mom...
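A minimal numpy sketch of the factored second-moment idea: keep row and column statistics of the squared gradient instead of the full accumulator and reconstruct a rank-1 approximation (moving averages and update clipping omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.normal(size=(64, 32))           # gradient of an n x m weight matrix
V = G ** 2                              # what Adam-style methods accumulate in full

# Factored statistics: one vector per dimension instead of the full matrix.
row = V.sum(axis=1)                     # shape (n,)
col = V.sum(axis=0)                     # shape (m,)
V_hat = np.outer(row, col) / row.sum()  # rank-1 approximation of V

# Memory: n + m numbers instead of n * m.
print(V.size, row.size + col.size)
print(np.abs(V_hat - V).mean() / V.mean())  # relative error of the approximation
```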
A System for Massively Parallel Hyperparameter Tuning
Modern learning models are characterized by large hyperparameter spaces and long training times. These properties, coupled with the rise of parallel computing and the growing demand to productionize machine learning workloads, motivate the need to de...
Distributed Prioritized Experience Replay
We propose a distributed architecture for deep reinforcement learning at scale that enables agents to learn effectively from orders of magnitude more data than previously possible. The algorithm decouples acting from learning: the actors interact wi...
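A very rough single-process sketch of the decoupling, with toy stand-ins for the environment and the priorities; in the real system many actor processes feed a shared prioritized replay buffer that a separate learner samples from:

```python
import random

replay = []          # shared prioritized replay: list of (priority, transition)

def actor(actor_id, steps=200):
    """Each actor runs its own environment copy and attaches an initial
    priority to every transition it pushes into the shared replay."""
    for t in range(steps):
        transition = ("state", "action", random.random())   # toy transition
        priority = abs(transition[2]) + 1e-3                 # toy initial priority
        replay.append((priority, transition))

def learner(updates=50, batch_size=32):
    """The learner samples in proportion to priority and updates the network;
    refreshed priorities would normally be written back after each update."""
    for _ in range(updates):
        weights = [p for p, _ in replay]
        batch = random.choices(replay, weights=weights, k=batch_size)
        # ... compute TD errors on `batch`, update the network, refresh priorities ...

for a in range(4):   # in the real architecture these run as separate processes
    actor(a)
learner()
print(len(replay))   # transitions stay in the buffer; the learner samples by priority
```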