Showing 20 of 613 papers

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap
2023
41 references

Developing a general algorithm that learns to solve tasks across a wide range of applications has been a fundamental challenge in artificial intelligence. Although current reinforcement learning algorithms can be readily applied to tasks similar to w...

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy
2015
30 references

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful...
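
A minimal NumPy sketch of the batch-normalization transform the abstract describes (normalize each feature over the mini-batch, then apply a learned scale and shift); the function name, epsilon, and toy data are illustrative, not taken from the paper.

    import numpy as np

    def batch_norm(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the mini-batch, then scale and shift.

        x: (batch, features) activations; gamma, beta: (features,) learned params.
        """
        mean = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
        return gamma * x_hat + beta               # learned scale/shift restores expressiveness

    x = np.random.randn(32, 8)
    y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
    print(y.mean(axis=0).round(6), y.var(axis=0).round(3))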

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...
2017
25 references

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose ...
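
A minimal NumPy sketch of the scaled dot-product attention at the heart of the proposed attention-only architecture; the single-head setting, shapes, and names are simplifying assumptions, not the paper's full multi-head model.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                    # (n_queries, n_keys)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
        return weights @ V                                 # (n_queries, d_v)

    Q, K, V = (np.random.randn(4, 8) for _ in range(3))
    print(scaled_dot_product_attention(Q, K, V).shape)     # (4, 8)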

FP8 Formats for Deep Learning

Paulius Micikevicius, Dušan Stošić, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...
2022
24 references

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 ...

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba
2014
22 references

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little me...
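
A short NumPy sketch of the Adam update implied by the abstract: exponential moving averages of the gradient and its elementwise square, with bias correction. The default hyperparameters follow the paper's suggestions; the toy quadratic objective is illustrative.

    import numpy as np

    def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam update using bias-corrected first and second moment estimates."""
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad**2
        m_hat = m / (1 - beta1**t)              # correct for zero initialization
        v_hat = v / (1 - beta2**t)
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v

    theta = np.array([1.0, -2.0])
    m = v = np.zeros_like(theta)
    for t in range(1, 101):
        grad = 2 * theta                        # gradient of f(theta) = ||theta||^2
        theta, m, v = adam_step(theta, grad, m, v, t)
    print(theta)                                # moves toward the minimum at 0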

Fractional Max-Pooling

Benjamin Graham
2014
21 references

Convolutional networks almost always incorporate some form of spatial pooling, and very often it is α×α max-pooling with α=2. Max-pooling acts on the hidden layers of the network, reducing their size by an integer multiplicative fact...
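
A toy 1D NumPy sketch of the general idea only (not the paper's exact pseudorandom region-generation scheme): pooling regions whose sizes mix 1 and 2 so the output is smaller by a non-integer factor α with 1 < α < 2.

    import numpy as np

    def fractional_max_pool_1d(x, alpha=1.5, rng=np.random.default_rng(0)):
        """Toy fractional max-pooling: split x into consecutive regions of size 1 or 2,
        randomly interleaved, so the length shrinks by a non-integer factor ~alpha."""
        n_out = int(round(len(x) / alpha))
        n_two = len(x) - n_out                      # number of size-2 regions
        sizes = np.array([2] * n_two + [1] * (n_out - n_two))
        rng.shuffle(sizes)                          # random interleaving of 1s and 2s
        bounds = np.concatenate(([0], np.cumsum(sizes)))
        return np.array([x[bounds[i]:bounds[i + 1]].max() for i in range(n_out)])

    x = np.arange(12.0)
    print(fractional_max_pool_1d(x))                # 8 outputs from 12 inputs (alpha = 1.5)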

Mastering Atari with Discrete World Models.

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, Jimmy Ba
2020
20 references

Intelligent agents need to generalize from past experience to achieve goals in complex environments. World models facilitate such generalization and allow learning behaviors from imagined outcomes to increase sample-efficiency. While learning world models from image inputs has recently become feasi...

Gaussian processes for machine learning.

Carl Edward Rasmussen, Christopher K. I. Williams
2006
17 references

Gaussian processes (GPs) provide a principled, practical, probabilistic approach to learning in kernel machines. GPs have received increased attention in the machine learning community over the past decade, and this book provides a systematic and unified treatment of theoretical and practical aspec...

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi
2022
16 references

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the ad...

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu
2015
14 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a ...

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
2023
13 references

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We ...
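
A minimal NumPy sketch of the grouped-query idea summarized in the abstract: query heads are divided into groups and each group shares a single key/value head. Shapes, names, and the omission of masking and projections are simplifying assumptions.

    import numpy as np

    def grouped_query_attention(Q, K, V, n_groups):
        """Q: (n_q_heads, n, d); K, V: (n_kv_heads, n, d) with n_kv_heads == n_groups.
        Each group of n_q_heads // n_groups query heads shares one key/value head."""
        n_q_heads, n, d = Q.shape
        group_size = n_q_heads // n_groups
        out = np.empty_like(Q)
        for h in range(n_q_heads):
            k, v = K[h // group_size], V[h // group_size]    # shared K/V within the group
            scores = Q[h] @ k.T / np.sqrt(d)
            w = np.exp(scores - scores.max(-1, keepdims=True))
            w /= w.sum(-1, keepdims=True)
            out[h] = w @ v
        return out

    Q = np.random.randn(8, 5, 16)           # 8 query heads
    K = np.random.randn(2, 5, 16)           # only 2 key/value heads (2 groups)
    V = np.random.randn(2, 5, 16)
    print(grouped_query_attention(Q, K, V, n_groups=2).shape)   # (8, 5, 16)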

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel
2016
13 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights ...
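
A small NumPy sketch of the stated definition GELU(x) = x·Φ(x), computed exactly via the error function, along with the tanh approximation given in the paper; function names are illustrative.

    import math
    import numpy as np

    def gelu(x):
        """GELU(x) = x * Phi(x), with Phi the standard normal CDF (exact, via erf)."""
        phi = 0.5 * (1.0 + np.vectorize(math.erf)(x / math.sqrt(2.0)))
        return x * phi

    def gelu_tanh(x):
        """Tanh-based approximation of the same function."""
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

    x = np.linspace(-3, 3, 7)
    print(gelu(x).round(4))
    print(gelu_tanh(x).round(4))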

Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky
2016
13 references

In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016). We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to s...
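
A minimal NumPy sketch of instance normalization, the "small change" the abstract alludes to: statistics are computed per sample and per channel over spatial positions only. NCHW layout and the omission of learned affine parameters are simplifying assumptions.

    import numpy as np

    def instance_norm(x, eps=1e-5):
        """Normalize each (sample, channel) over its spatial dimensions only,
        so statistics never mix across images in the batch."""
        mean = x.mean(axis=(2, 3), keepdims=True)
        var = x.var(axis=(2, 3), keepdims=True)
        return (x - mean) / np.sqrt(var + eps)

    x = np.random.randn(4, 3, 16, 16)            # (batch, channels, height, width)
    y = instance_norm(x)
    print(abs(y.mean(axis=(2, 3))).max().round(6))   # ~0 for every sample/channel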

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions

Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp
2009
13 references

Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrate...
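
A minimal NumPy sketch of the basic randomized scheme this line of work analyzes: sample the range of A with a Gaussian test matrix, orthonormalize, then run a small exact SVD. The rank and oversampling values are illustrative choices.

    import numpy as np

    def randomized_svd(A, k, oversample=10, rng=np.random.default_rng(0)):
        """Randomized SVD: approximate the range of A from a random sketch,
        then factor the small projected matrix exactly."""
        m, n = A.shape
        Omega = rng.standard_normal((n, k + oversample))   # random test matrix
        Q, _ = np.linalg.qr(A @ Omega)                     # orthonormal basis for range(A @ Omega)
        B = Q.T @ A                                        # small (k+p) x n matrix
        U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
        return (Q @ U_b)[:, :k], s[:k], Vt[:k]

    A = np.random.randn(200, 20) @ np.random.randn(20, 300)   # rank <= 20
    U, s, Vt = randomized_svd(A, k=20)
    err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
    print(err)                                             # tiny: A is low rank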

Efficient Object Localization Using Convolutional Networks

Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christoph Bregler
2014
11 references

Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introd...

Number Parsing at a Gigabyte per Second

Daniel Lemire
2021
10 references

With disks and networks providing gigabytes per second, parsing decimal numbers from strings becomes a bottleneck. We consider the problem of parsing decimal numbers to the nearest binary floating-point value. The general problem requires variable-pr...

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks

Yuhang Li, Xin Dong, Wei Wang
2019
10 references

We propose Additive Powers-of-Two~(APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of ...
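
A toy NumPy sketch of the general idea only, not the paper's exact level construction or training procedure: quantization levels formed as sums of powers of two (which cluster near zero, matching bell-shaped weight distributions), followed by nearest-level rounding.

    import itertools
    import numpy as np

    def apot_levels(term_sets=((0, 2**-1, 2**-2, 2**-3), (0, 2**-4, 2**-5, 2**-6))):
        """Toy additive powers-of-two levels: each level is the sum of one entry
        from every term set, so all levels are sums of powers of two (or zero)."""
        return np.array(sorted({sum(c) for c in itertools.product(*term_sets)}))

    def quantize(x, levels):
        """Round every entry of x to its nearest quantization level."""
        idx = np.abs(x[..., None] - levels).argmin(axis=-1)
        return levels[idx]

    levels = apot_levels()
    w = np.random.rand(5) * levels.max()          # toy non-negative "weights"
    print(levels)
    print(w.round(3), "->", quantize(w, levels).round(3))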

Group Normalization

Yuxin Wu, Kaiming He
2018
10 references

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes...
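
A minimal NumPy sketch of group normalization as described: channels are split into groups and each (sample, group) is normalized with its own statistics, independent of the batch size. The group count and the omission of affine parameters are illustrative.

    import numpy as np

    def group_norm(x, num_groups, eps=1e-5):
        """Normalize over (channel group, H, W) within each sample, so the
        statistics do not depend on the batch dimension at all."""
        n, c, h, w = x.shape
        xg = x.reshape(n, num_groups, c // num_groups, h, w)
        mean = xg.mean(axis=(2, 3, 4), keepdims=True)
        var = xg.var(axis=(2, 3, 4), keepdims=True)
        return ((xg - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)

    x = np.random.randn(2, 8, 16, 16)             # works even for very small batches
    print(group_norm(x, num_groups=4).shape)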

Self-Normalizing Neural Networks.

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter
2017
10 references

Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow a...
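
A small NumPy sketch of the SELU activation the paper introduces, using its fixed alpha/scale constants, plus a toy check that stacking SELU layers keeps activations near zero mean and unit variance; the layer sizes and initialization below are illustrative assumptions.

    import numpy as np

    # Fixed constants from the SELU definition (not learned).
    ALPHA = 1.6732632423543772
    SCALE = 1.0507009873554805

    def selu(x):
        """Scaled exponential linear unit: scale * (x if x > 0 else alpha*(exp(x)-1))."""
        return SCALE * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1024, 256))
    for _ in range(20):
        W = rng.standard_normal((256, 256)) / np.sqrt(256)   # lecun-normal-style init
        x = selu(x @ W)
    print(x.mean().round(3), x.std().round(3))   # stays roughly at zero mean, unit variance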

SGDR: Stochastic Gradient Descent with Warm Restarts.

Ilya Loshchilov, Frank Hutter
2017
10 references

Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, w...
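
A short Python sketch of the warm-restart cosine schedule the paper proposes, eta = eta_min + 0.5*(eta_max - eta_min)*(1 + cos(pi*T_cur/T_i)), with the cycle length multiplied after each restart. The values of T_0, T_mult, and the learning-rate range are illustrative.

    import math

    def sgdr_lr(t, lr_min=1e-5, lr_max=0.1, T_0=10, T_mult=2):
        """Cosine-annealed learning rate with warm restarts; the cycle length T_i
        grows by a factor T_mult after each restart."""
        T_i, T_cur = T_0, t
        while T_cur >= T_i:                      # locate the current restart cycle
            T_cur -= T_i
            T_i *= T_mult
        return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * T_cur / T_i))

    for epoch in range(0, 35, 5):
        print(epoch, round(sgdr_lr(epoch), 4))   # lr jumps back to lr_max at epochs 10 and 30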