Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji
2020
2 references

Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent resear...
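
As a rough illustration of the two-level aggregation pattern described above (a sketch under assumed details, not the authors' algorithm as specified in the paper): each worker runs local SGD steps, a local server periodically averages its workers' models, and a global server averages the local servers' models. The toy quadratic loss, group layout, and period lengths are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(w, data):
    # Gradient of a toy quadratic loss 0.5 * ||w - data||^2 (placeholder objective).
    return w - data

# Hypothetical setup: 2 local servers, each serving 3 workers with their own data.
groups = [[rng.normal(loc=g, size=4) for _ in range(3)] for g in (-1.0, 1.0)]
w_global = np.zeros(4)
lr, local_steps, local_aggs = 0.1, 5, 2

for _ in range(10):                                # one global round per iteration
    group_models = []
    for data_per_worker in groups:                 # each group handled by one local server
        w_group = w_global.copy()
        for _ in range(local_aggs):                # local aggregations between global ones
            workers = [w_group.copy() for _ in data_per_worker]
            for _ in range(local_steps):           # local SGD steps on every worker
                workers = [w - lr * grad(w, d)
                           for w, d in zip(workers, data_per_worker)]
            w_group = np.mean(workers, axis=0)     # local server averages its workers
        group_models.append(w_group)
    w_global = np.mean(group_models, axis=0)       # global server averages local servers

print(w_global)                                    # drifts toward the mean of all data
```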

Efficient Memory Management for Deep Neural Net Inference

Yury Pisarchyk, Juhyun Lee
2020
2 references

While deep neural net inference was once considered a task for servers only, the latest advances in technology allow inference to be moved to mobile and embedded devices, which is desirable for reasons ranging from latency to privacy. These devices a...

Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization

Pranav Subramani, Nicholas Vadivelu, Gautam Kamath
2020
2 references

A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD), which may be as large as two orders of magnitude. We thoroughly dem...

Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well

Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste
2020
2 references

We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models compu...
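
A minimal sketch of the two-phase idea in the abstract, under assumed details (a toy logistic-regression objective, sequential loops standing in for parallel workers); it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2048, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

def grad(w, idx):
    # Gradient of the logistic loss on mini-batch `idx`.
    p = 1.0 / (1.0 + np.exp(-X[idx] @ w))
    return X[idx].T @ (p - y[idx]) / len(idx)

w = np.zeros(10)

# Phase 1: large-batch SGD to reach an approximate solution quickly.
for _ in range(200):
    w -= 0.5 * grad(w, rng.choice(len(X), size=1024, replace=False))

# Phase 2 (simplified): several independent small-batch refinements starting from w,
# run sequentially here as stand-ins for SWAP's parallel workers, then weight-averaged.
refined = []
for _ in range(4):
    w_k = w.copy()
    for _ in range(200):
        w_k -= 0.05 * grad(w_k, rng.choice(len(X), size=32, replace=False))
    refined.append(w_k)

w_swap = np.mean(refined, axis=0)   # the averaged weights serve as the final model
print(np.round(w_swap, 2))
```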

A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions

Frédéric Chyzak, Frank Nielsen
2019
2 references

We report a closed-form expression for the Kullback-Leibler divergence between Cauchy distributions which involves the calculation of a novel definite integral. The formula shows that the Kullback-Leibler divergence between Cauchy densities is always...
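
For reference, the closed form in question, for Cauchy densities with locations l1, l2 and scales s1, s2 > 0, can be written as below; the finiteness and the symmetry in the two parameter pairs mentioned in the abstract are immediate from the expression.

```latex
\[
  p_{l,s}(x) \;=\; \frac{s}{\pi\,\bigl(s^{2} + (x-l)^{2}\bigr)},
  \qquad
  \mathrm{KL}\bigl(p_{l_1,s_1} \,\|\, p_{l_2,s_2}\bigr)
  \;=\;
  \log \frac{(s_1 + s_2)^{2} + (l_1 - l_2)^{2}}{4\, s_1 s_2}.
\]
```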

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer
2019
2 references

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability ...
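
The "one write-head" variant (multi-query attention) keeps multiple query heads but shares a single key/value head across them, so only one K/V pair has to be cached per position during incremental decoding. Below is a rough NumPy sketch under assumed shapes, with causal masking omitted; it is not the paper's TensorFlow implementation.

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, Wo):
    """Multi-query attention: h query heads, one shared key/value head.

    x:  (seq, d_model)
    Wq: (h, d_model, d_head)   one projection per query head
    Wk: (d_model, d_head)      single shared key projection
    Wv: (d_model, d_head)      single shared value projection
    Wo: (h * d_head, d_model)
    """
    h, _, d_head = Wq.shape
    k = x @ Wk                                  # (seq, d_head), shared by all heads
    v = x @ Wv                                  # (seq, d_head), shared by all heads
    outs = []
    for i in range(h):
        q = x @ Wq[i]                           # (seq, d_head)
        scores = q @ k.T / np.sqrt(d_head)      # (seq, seq)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        outs.append(weights @ v)                # (seq, d_head)
    return np.concatenate(outs, axis=-1) @ Wo   # (seq, d_model)

# Example with assumed sizes; only the single K/V head would need caching.
rng = np.random.default_rng(0)
seq, d_model, h, d_head = 8, 32, 4, 8
out = multi_query_attention(
    rng.normal(size=(seq, d_model)),
    rng.normal(size=(h, d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(d_model, d_head)),
    rng.normal(size=(h * d_head, d_model)),
)
print(out.shape)  # (8, 32)
```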

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke
2019
2 references

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to th...
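
To make the "lower-precision multiply, higher-precision accumulate" pattern concrete, here is a small NumPy emulation that rounds the multiplicands to bfloat16 by truncating the low 16 bits of their float32 encoding and accumulates the products in float32. The truncation-based rounding is a simplification of what hardware does (typically round to nearest even).

```python
import numpy as np

def to_bfloat16(x):
    # Emulate bfloat16 by zeroing the low 16 bits of the float32 representation
    # (truncation; real hardware usually rounds to nearest even).
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

def bf16_dot_fp32_accumulate(a, b):
    # FMA-style kernel: bfloat16 multiplicands, float32 accumulation.
    a16, b16 = to_bfloat16(a), to_bfloat16(b)
    acc = np.float32(0.0)
    for x, y in zip(a16, b16):
        acc = np.float32(acc + np.float32(x) * np.float32(y))
    return acc

rng = np.random.default_rng(0)
a = rng.normal(size=1000).astype(np.float32)
b = rng.normal(size=1000).astype(np.float32)
print(bf16_dot_fp32_accumulate(a, b), float(a @ b))  # close, but not identical
```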

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro
2019
2 references

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In t...
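
As background on the tensor-parallel layout named in the title, the sketch below splits a transformer MLP block's first weight matrix by columns and its second by rows, so each "device" holds only a shard and a single sum (standing in for an all-reduce) recovers the serial result. Shapes, device count, and the ReLU (a stand-in for GeLU) are assumptions; this is an illustration, not the paper's GPU implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_dev = 16, 64, 4

x  = rng.normal(size=(8, d_model))          # (tokens, d_model), replicated on all devices
W1 = rng.normal(size=(d_model, d_ff))       # full weights, kept only to check the result
W2 = rng.normal(size=(d_ff, d_model))

# Shard W1 by columns and W2 by rows; "device" i keeps (W1_shards[i], W2_shards[i]).
W1_shards = np.split(W1, n_dev, axis=1)     # each (d_model, d_ff // n_dev)
W2_shards = np.split(W2, n_dev, axis=0)     # each (d_ff // n_dev, d_model)

relu = lambda t: np.maximum(t, 0.0)

# Each device computes its partial output independently; the nonlinearity can be
# applied locally because the column split keeps whole hidden units on one device.
partials = [relu(x @ w1) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]

# A single all-reduce (here simply a sum) combines the partial results.
y_parallel = np.sum(partials, axis=0)
y_serial   = relu(x @ W1) @ W2
print(np.allclose(y_parallel, y_serial))    # True
```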

On the Variance of the Adaptive Learning Rate and Beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han
2019
2 references

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detai...
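
For context, the warmup heuristic the abstract studies is a schedule that ramps the learning rate up from near zero over the first updates before the adaptive optimizer runs at full strength. A minimal linear-warmup sketch follows; the schedule shape and step counts are assumptions, and this illustrates only the heuristic being studied, not the paper's own proposal.

```python
def warmup_lr(step, base_lr=1e-3, warmup_steps=2000):
    # Linear warmup: scale the learning rate up over the first `warmup_steps`
    # updates, then hold it at `base_lr` (any decay schedule could follow).
    return base_lr * min(1.0, (step + 1) / warmup_steps)

# Effective learning rate that would be handed to an adaptive optimizer such as Adam.
for step in (0, 500, 1999, 5000):
    print(step, warmup_lr(step))
```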

Reliable and fast DWARF-based stack unwinding

T. Bastian, Stephen Kell, Francesco Zappa Nardelli
2019
2 references

Debug information, usually encoded in the DWARF format, is a hidden and obscure component of our computing infrastructure. Debug information is obviously used by debuggers, but it also plays a key role in program analysis tools, and, most surprisingl...

Root Mean Square Layer Normalization

Biao Zhang, Rico Sennrich
2019
2 references

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. H...
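
For contrast with LayerNorm's re-centering plus re-scaling, RMSNorm keeps only the re-scaling step: activations are divided by their root mean square and multiplied by a learned gain. A NumPy sketch with an assumed epsilon and no bias term:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-8):
    # RMSNorm: rescale by the root mean square of the features; unlike LayerNorm,
    # no mean is subtracted and no bias is added here.
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.default_rng(0).normal(size=(4, 16))   # (batch, features)
print(rms_norm(x, gain=np.ones(16)).shape)          # (4, 16)
```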

Ryū revisited: printf floating point conversion

Ulf Adams
2019
2 references

Ryū Printf is a new algorithm to convert floating-point numbers to decimal strings according to the printf %f, %e, and %g formats: %f generates 'full' output (integer part of the input, dot, configurable number of digits), %e generates scientific out...

SWALP : Stochastic Weight Averaging in Low-Precision Training

Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa
2019
2 references

Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. SW...
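
A toy rendering of the idea in the abstract, with assumed quantization and schedule details: run SGD whose iterates are stored in low precision (a crude fixed-point rounding below), while a separate higher-precision running average of those iterates serves as the final model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=512)

def quantize(w, bits=8, scale=4.0):
    # Crude fixed-point quantization standing in for low-precision weight storage.
    step = scale / (2 ** (bits - 1))
    return np.clip(np.round(w / step) * step, -scale, scale - step)

w = np.zeros(8)
w_avg, n_avg = np.zeros(8), 0
for t in range(2000):
    i = rng.integers(len(X), size=32)
    g = X[i].T @ (X[i] @ w - y[i]) / len(i)
    w = quantize(w - 0.01 * g)          # low-precision SGD iterate
    if t >= 1000:                       # start averaging after a burn-in phase
        n_avg += 1
        w_avg += (w - w_avg) / n_avg    # running average kept in higher precision

print(np.round(w_avg, 3))
```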

The continuous Bernoulli: fixing a pervasive error in variational autoencoders

Gabriel Loaiza-Ganem, John P. Cunningham
2019
2 references

Variational autoencoders (VAE) have quickly become a central tool in machine learning, applicable to a broad range of data types and latent variable models. By far the most common first step, taken by seminal papers and by core software libraries ali...
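
The error in question is treating [0,1]-valued data with a Bernoulli likelihood, whose density λ^x (1-λ)^(1-x) does not integrate to one over x ∈ [0,1]. The continuous Bernoulli adds the missing normalizing constant:

```latex
\[
  p(x \mid \lambda) \;=\; C(\lambda)\, \lambda^{x} (1-\lambda)^{1-x},
  \qquad x \in [0,1],\ \lambda \in (0,1),
\]
\[
  C(\lambda) \;=\;
  \begin{cases}
    \dfrac{2\tanh^{-1}(1-2\lambda)}{1-2\lambda}, & \lambda \neq \tfrac{1}{2},\\[1ex]
    2, & \lambda = \tfrac{1}{2}.
  \end{cases}
\]
```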

Trivializations for Gradient-Based Optimization on Manifolds

Mario Lezcano-Casado
2019
2 references

We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. We call these parametrizations "trivializations". We prove conditions under...
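
A standard instance of such a parametrization (used here as an assumed example, not necessarily one from the paper's experiments) is optimizing over orthogonal matrices by writing them as the matrix exponential of a skew-symmetric matrix, which turns the constrained problem into an unconstrained one over Euclidean parameters.

```python
import numpy as np
from scipy.linalg import expm

def trivialize(a):
    # Map an unconstrained square matrix `a` to an orthogonal matrix:
    # skew-symmetrize, then take the matrix exponential (a surjection onto SO(n)).
    # Gradient-based optimization would differentiate the loss through this map.
    return expm(a - a.T)

rng = np.random.default_rng(0)
a = rng.normal(size=(5, 5))              # free Euclidean parameters
Q = trivialize(a)                        # constrained point on the manifold
print(np.allclose(Q @ Q.T, np.eye(5)))   # True: Q is orthogonal
```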

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He
2019
2 references

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited devi...
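
A rough, communication-free illustration of one form of the partitioning idea (the Adam-style state, shard layout, and names are assumptions): instead of every data-parallel worker replicating the full optimizer state, each worker owns only its 1/N shard, and a gather would rebuild the full parameters afterwards.

```python
import numpy as np

n_params, n_workers = 12, 4
rng = np.random.default_rng(0)
params = rng.normal(size=n_params)

# ZeRO-style partitioning: worker i stores optimizer state only for its own shard,
# rather than every worker holding m and v for all parameters.
shards = np.array_split(np.arange(n_params), n_workers)
state = [{"m": np.zeros(len(s)), "v": np.zeros(len(s))} for s in shards]

def sharded_adam_step(params, grad, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # (Adam bias correction omitted for brevity.)
    new = params.copy()
    for idx, st in zip(shards, state):            # each worker updates its own shard
        g = grad[idx]
        st["m"] = b1 * st["m"] + (1 - b1) * g
        st["v"] = b2 * st["v"] + (1 - b2) * g * g
        new[idx] -= lr * st["m"] / (np.sqrt(st["v"]) + eps)
    return new                                    # an all-gather would rebuild full params

grad = rng.normal(size=n_params)
params = sharded_adam_step(params, grad)
per_worker_state = 2 * (n_params // n_workers)    # vs. 2 * n_params when replicated
print(per_worker_state)
```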

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer, Mitchell Stern
2018
2 references

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-mom...
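
The sublinear memory cost comes from not storing the full matrix of second-moment estimates: for a 2-D parameter, Adafactor keeps exponential moving averages of the row sums and column sums of the squared gradients and reconstructs a rank-one estimate from them. A NumPy sketch of that factored estimate, with an assumed decay rate and epsilon:

```python
import numpy as np

def factored_second_moment(R, C, grad, beta2=0.999, eps=1e-30):
    """Update factored second-moment statistics for a 2-D gradient.

    R: running row sums,    shape (n,)
    C: running column sums, shape (m,)
    Returns updated (R, C) and the rank-one estimate V_hat of E[grad**2].
    """
    sq = grad * grad + eps
    R = beta2 * R + (1 - beta2) * sq.sum(axis=1)     # per-row statistics
    C = beta2 * C + (1 - beta2) * sq.sum(axis=0)     # per-column statistics
    V_hat = np.outer(R, C) / R.sum()                 # (n, m) rebuilt from n + m numbers
    return R, C, V_hat

rng = np.random.default_rng(0)
n, m = 4, 6
R, C = np.zeros(n), np.zeros(m)
for _ in range(100):
    R, C, V_hat = factored_second_moment(R, C, rng.normal(size=(n, m)))

# The update then divides the gradient by sqrt(V_hat), as with full per-parameter
# second moments, while storing only n + m running statistics instead of n * m.
print(V_hat.shape)
```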

Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration

Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, Yi Yang
2018
2 references

Previous works utilized the "smaller-norm-less-important" criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two req...
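
For reference, the sketch below contrasts the norm-based criterion being analyzed (prune the smallest-norm filters) with a geometric-median-style criterion that instead prunes the filters closest to the others; the distance-sum rule used here is a common practical stand-in and not necessarily the paper's exact procedure. Shapes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical conv layer: 16 filters of shape (in_channels=8, k=3, k=3), flattened.
filters = rng.normal(size=(16, 8, 3, 3)).reshape(16, -1)

# Norm-based criterion ("smaller-norm-less-important"): prune smallest-norm filters.
norms = np.linalg.norm(filters, axis=1)
prune_by_norm = np.argsort(norms)[:4]

# Geometric-median-style criterion: prune the filters with the smallest total
# distance to the other filters, i.e. the most redundant / replaceable ones.
dists = np.linalg.norm(filters[:, None, :] - filters[None, :, :], axis=-1).sum(axis=1)
prune_by_gm = np.argsort(dists)[:4]

print(sorted(prune_by_norm), sorted(prune_by_gm))
```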

Group Normalization

Yuxin Wu, Kaiming He
2018
2 references

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN's error increases rapidly when the batch size becomes...
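
Since GN's defining step is normalizing over groups of channels, independently of the batch size, a small NumPy sketch follows; the group count, epsilon, and NCHW layout are assumptions.

```python
import numpy as np

def group_norm(x, gamma, beta, groups=8, eps=1e-5):
    # x: (N, C, H, W). Normalize each group of C // groups channels together with
    # the spatial dimensions, so the statistics do not depend on the batch size.
    n, c, h, w = x.shape
    xg = x.reshape(n, groups, c // groups, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    return xg.reshape(n, c, h, w) * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)

x = np.random.default_rng(0).normal(size=(2, 32, 5, 5))
y = group_norm(x, gamma=np.ones(32), beta=np.zeros(32))
print(y.shape)  # (2, 32, 5, 5)
```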