Papers
Browse academic papers referenced in production code
IR2VEC
We propose IR2VEC, a Concise and Scalable encoding infrastructure to represent programs as a distributed embedding in continuous space. This distributed embedding is obtained by combining representation learning methods with flow information to capture...
On Layer Normalization in the Transformer Architecture
The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization...
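The question the paper studies is where layer normalization sits relative to the residual connection. A minimal numpy sketch of the two orderings; `sublayer` stands in for attention or the feed-forward block, and names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature (last) dimension.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original ("Post-LN") Transformer: normalize AFTER the residual add.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # "Pre-LN" variant: normalize the sublayer input and keep the residual
    # path as an identity; the paper shows this keeps gradients well-scaled
    # at initialization, removing the need for a warm-up stage.
    return x + sublayer(layer_norm(x))
```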
Data-Free Quantization Through Weight Equalization and Bias Correction
We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization...
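The paper's weight-equalization step rescales per-channel ranges between consecutive layers, exploiting the positive homogeneity of ReLU (ReLU(x/s) = ReLU(x)/s for s > 0) so the network function is unchanged. A minimal numpy sketch for two fully connected layers; variable names are illustrative:

```python
import numpy as np

def equalize(w1, b1, w2, eps=1e-12):
    # w1: (out, in) weights of layer 1, b1 its bias, w2: (out2, out).
    r1 = np.abs(w1).max(axis=1)            # per-channel range of layer 1 outputs
    r2 = np.abs(w2).max(axis=0)            # per-channel range of layer 2 inputs
    s = np.sqrt(r1 / np.maximum(r2, eps))  # s_i = sqrt(r1_i / r2_i) equalizes both
    s = np.maximum(s, eps)
    # Scale layer 1 down and layer 2 up by the same per-channel factor;
    # with a ReLU in between, w2 @ relu(w1 @ x + b1) is unchanged.
    return w1 / s[:, None], b1 / s, w2 * s[None, :]
```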
Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations
In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to...
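A numpy sketch of the two building blocks involved: rounding float32 down to bfloat16, and splitting a float32 into a sum of bfloat16 terms so that products of the parts, accumulated in float32, recover close-to-fp32 accuracy. Function names are illustrative and NaN/Inf handling is ignored:

```python
import numpy as np

def to_bfloat16(x):
    # Keep the high 16 bits of the float32 encoding (8-bit exponent,
    # 7-bit mantissa) with round-to-nearest-even on the dropped bits.
    # Deliberately lets uint32 arithmetic wrap; NaN/Inf are not handled.
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    bias = ((bits >> 16) & 1) + 0x7FFF
    return ((bits + bias) & 0xFFFF0000).view(np.float32)

def split3(x):
    # Represent a float32 as b1 + b2 + b3 with each term a bfloat16;
    # multiplying term pairs on bfloat16 FMA units and accumulating
    # in float32 approximates a full-precision product.
    x = np.asarray(x, dtype=np.float32)
    b1 = to_bfloat16(x)
    b2 = to_bfloat16(x - b1)
    b3 = to_bfloat16(x - b1 - b2)
    return b1, b2, b3
```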
Mish: A Self Regularized Non-Monotonic Activation Function
We propose Mish, a novel self-regularized non-monotonic activation function which can be mathematically defined as f(x) = x·tanh(softplus(x)). As activation functions play a crucial role in the performance and training dynamics in neural networks...
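The definition is short enough to state directly; a numpy sketch using a numerically stable softplus:

```python
import numpy as np

def mish(x):
    # f(x) = x * tanh(softplus(x)); logaddexp(0, x) = log(1 + exp(x))
    # computed stably for large |x|.
    return x * np.tanh(np.logaddexp(0.0, x))
```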
Searching for MobileNetV3
We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search...
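The excerpt stops before the architectural details, but one concrete ingredient of the paper is the hard-swish activation, a cheap piecewise approximation to swish. A numpy sketch:

```python
import numpy as np

def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, used in MobileNetV3 as an
    # inexpensive stand-in for x * sigmoid(x).
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0
```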
Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks
We propose a method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-through estimator for threshold...
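A numpy sketch of the forward pass of such a quantizer: a uniform symmetric "fake quantize" whose threshold t = 2^⌈log2_t⌉ is constrained to a power of two, as the paper requires for cheap hardware scaling. Only the forward computation is shown (TQT trains log2_t through a straight-through estimator), and names are illustrative:

```python
import numpy as np

def fake_quantize(x, log2_t, bits=8):
    n = 2 ** (bits - 1)                     # e.g. 128 for int8
    t = 2.0 ** np.ceil(log2_t)              # power-of-two threshold
    s = t / n                               # quantization step size
    q = np.clip(np.round(x / s), -n, n - 1) # integer code in [-n, n-1]
    return q * s                            # dequantized value fed downstream
```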
Fast Random Integer Generation in an Interval
In simulations, probabilistic algorithms and statistical tests, we often generate random integers in an interval (e.g., [0,s)). For example, random integers in an interval are essential to the Fisher-Yates random shuffle. Consequently, popular languages...
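The paper's core algorithm replaces expensive modulo-based methods with a single wide multiplication, rejecting only in a small bias region. A Python sketch with 32-bit words (the wide multiply is simulated with Python's big integers):

```python
import random

def random_below(s):
    # Unbiased integer in [0, s), following the multiply-and-reject
    # scheme from the paper, with word size w = 32.
    assert 0 < s <= 1 << 32
    x = random.getrandbits(32)        # uniform in [0, 2**32)
    m = x * s                         # uniform multiples in [0, s * 2**32)
    low = m & 0xFFFFFFFF              # low 32 bits of the product
    if low < s:                       # only then can bias occur
        threshold = (1 << 32) % s     # size of the rejection region
        while low < threshold:        # redraw until outside it
            x = random.getrandbits(32)
            m = x * s
            low = m & 0xFFFFFFFF
    return m >> 32                    # high bits are the result
```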
GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent
In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall...
Quantizing deep convolutional networks for efficient inference: A whitepaper
We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8-bits of precision post-training produces...
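A numpy sketch of the per-channel weight scheme the excerpt describes: each output channel gets its own symmetric int8 scale (names are illustrative):

```python
import numpy as np

def quantize_weights_per_channel(w, bits=8):
    # w: weight tensor with output channels on axis 0.
    n = 2 ** (bits - 1) - 1                          # 127 for int8
    scale = np.abs(w.reshape(w.shape[0], -1)).max(axis=1) / n
    scale = np.maximum(scale, 1e-12)                 # guard all-zero channels
    shape = (-1,) + (1,) * (w.ndim - 1)              # broadcast scale per channel
    q = np.round(w / scale.reshape(shape)).astype(np.int8)
    return q, scale                                  # w ≈ q * scale, per channel
```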
Ryū: fast float-to-string conversion
We present Ryū, a new routine to convert binary floating point numbers to their decimal representations using only fixed-size integer operations, and prove its correctness. Ryū is simpler and approximately three times faster than the previously fastest...
Soft Actor-Critic Algorithms and Applications
Model-free deep reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision making and control tasks. However, these methods typically suffer from two major challenges: high sample complexity and...
Neural Optimizer Search with Reinforcement Learning
We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain specific language that describes a mathematical...
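The excerpt cuts off before naming results; one update rule this search discovered is PowerSign, which scales the gradient up when it agrees in sign with its running average and down when it disagrees. A numpy sketch (hyperparameter defaults are illustrative):

```python
import numpy as np

def powersign_step(w, g, m, lr=0.01, alpha=np.e, beta=0.9):
    # m is an exponential moving average of past gradients.
    m = beta * m + (1.0 - beta) * g
    # Exponent is +1 when g and m agree in sign, -1 when they disagree.
    w = w - lr * alpha ** (np.sign(g) * np.sign(m)) * g
    return w, m
```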
Proximal Policy Optimization Algorithms
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent. Whereas standard...
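The "surrogate" objective the abstract mentions is, in the paper's clipped form, L = E[min(r·A, clip(r, 1−ε, 1+ε)·A)], where r is the probability ratio between new and old policies and A the advantage estimate. A numpy sketch:

```python
import numpy as np

def ppo_clip_loss(ratio, adv, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); the objective is maximized,
    # so we return its negative as a loss to minimize.
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))
```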
Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units
Recently, convolutional neural networks (CNNs) have been used as a powerful tool to solve many problems of machine learning and computer vision. In this paper, we aim to provide insight on the property of convolutional neural networks, as well as a generic...
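The method the paper proposes is CReLU: concatenate ReLU applied to the input and to its negation, so both phases of a response survive the nonlinearity. A numpy sketch:

```python
import numpy as np

def crelu(x, axis=-1):
    # Doubles the channel count: [relu(x), relu(-x)].
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=axis)
```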
An Empirical Exploration of Recurrent Network Architectures
The Recurrent Neural Network (RNN) is an extremely powerful sequence model that is often difficult to train. The Long Short-Term Memory (LSTM) is a specific RNN architecture whose design makes it much easier to train...
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate...
Training Deep Networks with Structured Layers by Matrix Backpropagation
Deep neural network architectures have recently produced excellent results in a variety of areas in artificial intelligence and visual recognition, well surpassing traditional shallow architectures trained using hand-designed features. The power of deep...
A Parallel Space Saving Algorithm For Frequent Items and the Hurwitz zeta distribution
We present a message-passing based parallel version of the Space Saving algorithm designed to solve the k-majority problem. The algorithm determines in parallel frequent items, i.e., those whose frequency is greater than a given threshold, and is ...
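For reference, the sequential Space Saving algorithm the paper parallelizes keeps at most k counters and, on overflow, evicts the current minimum and lets the newcomer inherit its count. A Python sketch:

```python
def space_saving(stream, k):
    counters = {}                        # item -> estimated count
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # Replace the minimum-count item; the new item inherits its
            # count, so estimates overestimate by at most that value.
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters
```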