🤖

Machine Learning

Machine learning frameworks, algorithms, and training systems

Repositories

(7)

huggingface/transformers

microsoft/onnxruntime

mlflow/mlflow

pytorch/pytorch

ray-project/ray

scikit-learn/scikit-learn

tensorflow/tensorflow

Papers

(373)

Showing 20 of 373 papers

A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions

Frédéric Chyzak, Frank Nielsen

2019

2 references

We report a closed-form expression for the Kullback-Leibler divergence between Cauchy distributions which involves the calculation of a novel definite integral. The formula shows that the Kullback-Leibler divergence between Cauchy densities is always...

View Paper PDF DOI

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler

2012

6 references

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. ...

View Paper PDF DOI

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer, Mitchell Stern

2018

2 references

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-mom...

View Paper PDF DOI

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014

22 references

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little me...

View Paper PDF DOI

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

2018

6 references

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclic...

View Paper PDF DOI

Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zhen...

2023

4 references

Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different ...

View Paper PDF DOI

Cyclical Learning Rates for Training Neural Networks

William F. Dean, John Sandford-Smith

2015

2 references

It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need ...

View Paper PDF DOI

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2017

8 references

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is \emph{not} the case for adaptive gradient algorithms, such as Adam. While...

View Paper PDF DOI

Efficient Simulation of the von Mises Distribution

D. J. Best, N. I. Fisher

1979

2 references

View Paper DOI

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Bridgit Dao

2023

3 references

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, a...

View Paper PDF DOI

Generating Sequences With Recurrent Neural Networks

Alex Graves, Abdelrahman Mohamed, Geoffrey E. Hinton

2013

2 references

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discre...

View Paper PDF DOI

Mish: A Self Regularized Non-Monotonic Activation Function

Diganta Misra

2019

4 references

We propose $\textit{Mish}$, a novel self-regularized non-monotonic activation function which can be mathematically defined as: $f(x)=x\tanh(softplus(x))$. As activation functions play a crucial role in the performance and training dynamics in neural ...

View Paper PDF DOI

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yiming Qin, Wei-Xin Xu,...

2025

2 references

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: ...

View Paper PDF DOI

On the Variance of the Adaptive Learning Rate and Beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han

2019

2 references

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detai...

View Paper PDF DOI

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, Zbigniew Wojna

2015

6 references

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Althoug...

View Paper PDF DOI

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

2015

14 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a ...

View Paper PDF DOI

Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well

Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste

2020

2 references

We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models compu...

View Paper PDF DOI

Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

Leslie N. Smith, Nicholay Topin

2017

2 references

In this paper, we describe a phenomenon, which we named "super-convergence", where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-convergence is relevant to understanding why de...

View Paper PDF DOI

SWALP : Stochastic Weight Averaging in Low-Precision Training

Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa

2019

2 references

Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. SW...

View Paper PDF DOI

There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

Farhad Shokraneh, Clive E Adams, Mike Clarke, Laura Amato, Hilda Bastian, Elaine Beller, Jon Brassey...

2018

2 references

Presently the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. To understand consistency regularization, we co...

View Paper PDF DOI

Previous Page 2 of 19 Next