Papers - PaperGrep

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, ...

2021

4 references

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute t...

View Paper PDF

On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, ...

2020

4 references

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the opti...

View Paper PDF

Data-Free Quantization Through Weight Equalization and Bias Correction

Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

2019

4 references

We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quan...

View Paper PDF

Mish: A Self Regularized Non-Monotonic Activation Function

Diganta Misra

2019

4 references

We propose $\textit{Mish}$, a novel self-regularized non-monotonic activation function which can be mathematically defined as: $f(x)=x\tanh(softplus(x))$. As activation functions play a crucial role in the performance and training dynamics in neural ...

View Paper PDF

Searching for MobileNetV3

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun ...

2019

4 references

We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture sear...

View Paper PDF

Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

Sambhav R. Jain, Albert Gural, Michael Wu, Chris H. Dick

2019

4 references

We propose a method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-through estimator for threshol...

View Paper PDF

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya

2018

4 references

In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall...

View Paper PDF

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi

2018

4 references

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8-bits of precision post-training p...

View Paper PDF

SGDR: Stochastic Gradient Descent with Warm Restarts.

Ilya Loshchilov, Frank Hutter

2017

4 references

View Paper DOI

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, Ben Poole

2016

4 references

Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present a...

View Paper PDF

Device-to-device resource allocation in LTE-advanced networks by hybrid particle swarm optimization and genetic algorithm.

Shijie Sun, Kwang-Yul Kim, Oh-Soon Shin, Yoan Shin

2016

4 references

View Paper DOI

An Empirical Exploration of Recurrent Network Architectures.

Rafal Józefowicz, Wojciech Zaremba, Ilya Sutskever

2015

4 references

This document examines the OData protocol as a new service oriented approach for distributed IT architectures. The main features of OData were compared with properties of well-established solutions like: REST, DCOM and Java RMI. OData's protocol is p...

View Paper DOI

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter

2015

4 references

We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate...

View Paper PDF

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

2015

4 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a ...

View Paper PDF

Improving neural networks by preventing co-adaptation of feature detectors

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

2012

4 references

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This preve...

View Paper PDF

Iconic indexing for video search.

Andrew Graves

2006

4 references

View Paper DOI

A simple method for generating gamma variables

George Marsaglia, Wai Wan Tsang

2000

4 references

<jats:p> We offer a procedure for generating a gamma variate as the cube of a suitably scaled normal variate. It is fast and simple, assuming one has a fast way to generate normal variables. In brief: generate a normal variate ...

View Paper DOI

TPU-KNN: K Nearest Neighbor Search at Peak FLOP/s

Felix Chern, Blake Hechtman, Andy Davis, Ruiqi Guo, David Majnemer, Sanjiv Kumar

2022

3 references

This paper presents a novel nearest neighbor search algorithm achieving TPU (Google Tensor Processing Unit) peak performance, outperforming state-of-the-art GPU algorithms with similar level of recall. The design of the proposed algorithm is motivate...

View Paper PDF

A Short Note about Kinetics-600

Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, Andrew Zisserman

2018

3 references

We describe an extension of the DeepMind Kinetics human action dataset from 400 classes, each with at least 400 video clips, to 600 classes, each with at least 600 video clips. In order to scale up the dataset we changed the data collection process s...

View Paper PDF

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

2018

3 references

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations fro...

View Paper PDF