pytorch/pytorch - PaperGrep

105 papers

89 files

194 references

Paper References by File

▶ aten/src/ATen/native/Distributions.h

A simple method for generating gamma variables

George Marsaglia, Wai Wan Tsang

2000

4 references

We offer a procedure for generating a gamma variate as the cube of a suitably scaled normal variate. It is fast and simple, assuming one has a fast way to generate normal variable...

View Paper DOI View on GitHub

▶ aten/src/ATen/native/sparse/SparseMatMul.cpp

Sparse matrix multiplication package (SMMP)

R. Bank, C. Douglas

1993

2 references

View Paper PDF DOI View on GitHub

▶ benchmarks/fastrnns/cells.py

On Multiplicative Integration with Recurrent Neural Networks

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan Salakhutdinov

2016

2 references

View Paper PDF View on GitHub

▶ benchmarks/functional_autograd_benchmark/torchaudio_models.py

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve

2016

2 references

View Paper PDF View on GitHub

▶ benchmarks/functional_autograd_benchmark/torchvision_models.py

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

2015

8 references

View Paper PDF View on GitHub

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew T...

2017

2 references

View Paper PDF View on GitHub

▶ .claude/skills/docstring/SKILL.md

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J. Maddison, Andriy Mnih, Yee Whye Teh

2016

8 references

View Paper PDF View on GitHub

▶ docs/source/ddp_comm_hooks.md

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

2019

8 references

View Paper PDF View on GitHub

▶ docs/source/nn.functional.rst

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J. Maddison, Andriy Mnih, Yee Whye Teh

2016

8 references

View Paper PDF View on GitHub

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, Ben Poole

2016

4 references

View Paper PDF View on GitHub

▶ docs/source/notes/autograd.rst

The Complex Gradient Operator and the CR-Calculus

Ken Kreutz-Delgado

2009

2 references

View Paper PDF View on GitHub

▶ docs/source/notes/gradcheck.rst

On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming

Christoph Boeddeker, Patrick Hanebrink, Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach

2017

9 references

View Paper PDF View on GitHub

▶ docs/source/optim.md

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

2018

6 references

View Paper PDF View on GitHub

▶ docs/source/tensor_attributes.rst

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

169 citations

10 references

View Paper PDF View on GitHub

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

2022

10 references

View Paper PDF View on GitHub

▶ functorch/examples/dp_cifar10/README.md

Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization

Pranav Subramani, Nicholas Vadivelu, Gautam Kamath

2020

2 references

View Paper PDF View on GitHub

▶ functorch/examples/maml_omniglot/maml-omniglot-higher.py

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, Sergey Levine

2017

8 references

View Paper PDF View on GitHub

▶ functorch/examples/maml_omniglot/maml-omniglot-ptonly.py

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, Sergey Levine

2017

8 references

View Paper PDF View on GitHub

▶ functorch/examples/maml_omniglot/maml-omniglot-transforms.py

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, Sergey Levine

2017

8 references

View Paper PDF View on GitHub

▶ functorch/examples/maml_omniglot/README.md

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, Sergey Levine

2017

8 references

View Paper PDF View on GitHub

▶ SECURITY.md

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks

Shuli Jiang, Swanand Ravindra Kadhe, Yi Zhou, Ling Cai, Nathalie Baracaldo

2023

2 references

View Paper PDF View on GitHub

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Da...

2024

2 references

View Paper PDF View on GitHub

▶ torch/ao/nn/intrinsic/qat/modules/conv_fused.py

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi

2018

4 references

View Paper PDF View on GitHub

▶ torch/ao/nn/intrinsic/qat/modules/linear_fused.py

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi

2018

4 references

View Paper PDF View on GitHub

▶ torch/ao/pruning/_experimental/pruner/FPGM_pruner.py

Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration

Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, Yi Yang

2018

2 references

View Paper PDF View on GitHub

▶ torch/ao/pruning/scheduler/cubic_scheduler.py

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu, Suyog Gupta

2017

2 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/_correct_bias.py

Data-Free Quantization Through Weight Equalization and Bias Correction

Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

2019

4 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/_equalize.py

Data-Free Quantization Through Weight Equalization and Bias Correction

Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

2019

4 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/experimental/adaround_fake_quantize.py

Up or Down? Adaptive Rounding for Post-Training Quantization

Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

2020

6 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/experimental/adaround_loss.py

Up or Down? Adaptive Rounding for Post-Training Quantization

Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

2020

6 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/experimental/linear.py

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks

Yuhang Li, Xin Dong, Wei Wang

2019

10 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/experimental/observer.py

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks

Yuhang Li, Xin Dong, Wei Wang

2019

10 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/experimental/quantizer.py

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks

Yuhang Li, Xin Dong, Wei Wang

2019

10 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/observer.py

Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

Sambhav R. Jain, Albert Gural, Michael Wu, Chris H. Dick

2019

4 references

View Paper PDF View on GitHub

▶ torch/ao/quantization/utils.py

Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

Sambhav R. Jain, Albert Gural, Michael Wu, Chris H. Dick

2019

4 references

View Paper PDF View on GitHub

▶ torch/autograd/gradcheck.py

On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming

Christoph Boeddeker, Patrick Hanebrink, Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach

2017

9 references

View Paper PDF View on GitHub

▶ torch/csrc/api/include/torch/nn/modules/normalization.h

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

2016

6 references

View Paper PDF View on GitHub

▶ torch/csrc/autograd/FunctionsManual.cpp

The differentiation of pseudo-inverses and non-linear least squares problems whose variables separate

G. Golub, V. Pereyra

1972

2 references

View Paper DOI View on GitHub

On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming

Christoph Boeddeker, Patrick Hanebrink, Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach

2017

9 references

View Paper PDF View on GitHub

▶ torch/csrc/profiler/unwind/fde.h

Reliable and fast DWARF-based stack unwinding

T. Bastian, Stephen Kell, Francesco Zappa Nardelli

2019

2 references

View Paper PDF DOI View on GitHub

▶ torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

2019

8 references

View Paper PDF View on GitHub

▶ torch/distributed/algorithms/model_averaging/averagers.py

Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

2018

6 references

View Paper PDF View on GitHub

▶ torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py

Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji

2020

2 references

View Paper PDF View on GitHub

Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

2018

6 references

View Paper PDF View on GitHub

▶ torch/distributed/benchmarks/README.md

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil ...

2018

2 references

View Paper PDF View on GitHub

▶ torch/distributed/fsdp/fully_sharded_data_parallel.py

Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, Shibo Wang

2020

2 references

View Paper PDF View on GitHub

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya

2018

4 references

View Paper PDF View on GitHub

▶ torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py

DoRA: Weight-Decomposed Low-Rank Adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, M...

2024

2 references

View Paper PDF View on GitHub

▶ torch/distributed/optim/post_localSGD_optimizer.py

Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

2018

6 references

View Paper PDF View on GitHub

▶ torch/distributed/optim/zero_redundancy_optimizer.py

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

2019

2 references

View Paper PDF View on GitHub

▶ torch/distributed/pipelining/schedules.py

Breadth-First Pipeline Parallelism

Joel Lamy-Poirier

2022

2 references

View Paper PDF View on GitHub

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Ko...

2021

2 references

View Paper PDF View on GitHub

Zero Bubble Pipeline Parallelism

Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin

2023

4 references

View Paper PDF View on GitHub

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, C...

2024

2 references

View Paper PDF View on GitHub

▶ torch/distributed/tensor/parallel/style.py

Reducing Activation Recomputation in Large Transformer Models

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, B...

2022

2 references

View Paper PDF View on GitHub

▶ torch/distributed/tensor/README.md

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, ...

2021

4 references

View Paper PDF View on GitHub

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

2019

2 references

View Paper PDF View on GitHub

▶ torch/distributions/continuous_bernoulli.py

The continuous Bernoulli: fixing a pervasive error in variational autoencoders

Gabriel Loaiza-Ganem, John P. Cunningham

2019

2 references

View Paper PDF View on GitHub

▶ torch/distributions/__init__.py

TensorFlow Distributions

Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Pa...

2017

2 references

View Paper PDF View on GitHub

Gradient Estimation Using Stochastic Computation Graphs

John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

2015

2 references

View Paper PDF View on GitHub

▶ torch/distributions/kl.py

A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions

Frédéric Chyzak, Frank Nielsen

2019

2 references

We report a closed-form expression for the Kullback-Leibler divergence between Cauchy distributions which involves the calculation of a novel definite integral. The formula shows that the Kullback-Lei...

View Paper PDF View on GitHub

▶ torch/headeronly/util/Float8_e4m3fn.h

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

169 citations

10 references

View Paper PDF View on GitHub

▶ torch/headeronly/util/Float8_e4m3fnuz.h

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

2022

10 references

View Paper PDF View on GitHub

▶ torch/headeronly/util/Float8_e5m2fnuz.h

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

2022

10 references

View Paper PDF View on GitHub

▶ torch/headeronly/util/Float8_e5m2.h

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

169 citations

10 references

View Paper PDF View on GitHub

▶ torch/_inductor/fx_passes/efficient_conv_bn_eval.py

Efficient ConvBN Blocks for Transfer Learning and Beyond

Kaichao You, Guo Qin, Anchang Bao, Meng Cao, Ping Huang, Jiulong Shan, Mingsheng Long

2023

4 references

View Paper PDF View on GitHub

▶ torch/__init__.py

Comparison of a Complete Percutaneous versus Surgical Approach to Aortic Valve Replacement and Revascularization in Patients at Intermediate Surgical Risk: Results from the Randomized SURTAVI Trial.

L. Søndergaard, J. Popma, M. Reardon, N. Van Mieghem, G. Deeb, S. Kodali, I. George, Mathew R. Willi...

2019

2 references

View Paper PDF DOI View on GitHub

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke

2019

2 references

View Paper PDF View on GitHub

▶ torch/_lobpcg.py

A Case for a Biorthogonal Jacobi--Davidson Method: Restarting and Correction Equation

Andreas Stathopoulos

2002

2 references

View Paper DOI View on GitHub

Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method

Andrew V. Knyazev

2001

2 references

View Paper DOI View on GitHub

A robust and efficient implementation of LOBPCG

Jed A. Duersch, Meiyue Shao, Chao Yang, Ming Gu

2017

2 references

View Paper PDF DOI View on GitHub

▶ torch/_lowrank.py

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions

Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp

2009

12 references

View Paper PDF View on GitHub

▶ torch/nativert/executor/memory/GreedyBySize.cpp

Efficient Memory Management for Deep Neural Net Inference

Yury Pisarchyk, Juhyun Lee

2020

2 references

View Paper PDF View on GitHub

▶ torch/nn/functional.py

Fractional Max-Pooling

Benjamin Graham

2014

21 references

View Paper PDF View on GitHub

Language Modeling with Gated Convolutional Networks

Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier

2016

2 references

View Paper PDF View on GitHub

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016

9 references

View Paper PDF View on GitHub

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J. Maddison, Andriy Mnih, Yee Whye Teh

2016

8 references

View Paper PDF View on GitHub

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, Ben Poole

2016

4 references

View Paper PDF View on GitHub

Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

Stefan Elfwing, Eiji Uchibe, Kenji Doya

2017

5 references

View Paper PDF View on GitHub

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le

2017

6 references

View Paper PDF View on GitHub

Mish: A Self Regularized Non-Monotonic Activation Function

Diganta Misra

2019

4 references

View Paper PDF View on GitHub

Searching for MobileNetV3

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun ...

2019

4 references

View Paper PDF View on GitHub

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna

2015

6 references

View Paper PDF View on GitHub

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

2015

4 references

View Paper PDF View on GitHub

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao

2023

2 references

View Paper PDF View on GitHub

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

2023

4 references

View Paper PDF View on GitHub

▶ torch/nn/modules/activation.py

Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee

2016

3 references

View Paper PDF View on GitHub

Empirical Evaluation of Rectified Activations in Convolutional Network

Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li

2015

2 references

View Paper PDF View on GitHub

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016

9 references

View Paper PDF View on GitHub

Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

Stefan Elfwing, Eiji Uchibe, Kenji Doya

2017

5 references

View Paper PDF View on GitHub

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le

2017

6 references

View Paper PDF View on GitHub

Mish: A Self Regularized Non-Monotonic Activation Function

Diganta Misra

2019

4 references

View Paper PDF View on GitHub

Searching for MobileNetV3

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun ...

2019

4 references

View Paper PDF View on GitHub

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter

2015

4 references

View Paper PDF View on GitHub

Continuously Differentiable Exponential Linear Units

Jonathan T. Barron

2017

2 references

View Paper PDF View on GitHub

Self-Normalizing Neural Networks

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter

2017

10 references

View Paper PDF View on GitHub

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...

2017

15 references

View Paper PDF View on GitHub

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

2022

4 references

View Paper PDF View on GitHub

▶ torch/nn/modules/adaptive.py

Efficient softmax approximation for GPUs

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

2016

2 references

View Paper PDF View on GitHub

▶ torch/nn/modules/batchnorm.py

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

2015

11 references

View Paper PDF View on GitHub

▶ torch/nn/modules/dropout.py

Improving neural networks by preventing co-adaptation of feature detectors

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

2012

4 references

View Paper PDF View on GitHub

Efficient Object Localization Using Convolutional Networks

Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christopher Bregler

2014

11 references

View Paper PDF View on GitHub

Self-Normalizing Neural Networks

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter

2017

10 references

View Paper PDF View on GitHub

▶ torch/nn/modules/instancenorm.py

Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky

2016

6 references

View Paper PDF View on GitHub

▶ torch/nn/modules/loss.py

Estimating the mean and variance of the target probability distribution

D. Nix, A. Weigend

1994

2 references

View Paper DOI View on GitHub

Fast R-CNN

Ross Girshick

2015

2 references

View Paper PDF View on GitHub

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna

2015

6 references

View Paper PDF View on GitHub

▶ torch/nn/modules/normalization.py

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

2016

6 references

View Paper PDF View on GitHub

Group Normalization

Yuxin Wu, Kaiming He

2018

2 references

View Paper PDF View on GitHub

Root Mean Square Layer Normalization

Biao Zhang, Rico Sennrich

2019

2 references

View Paper PDF View on GitHub

▶ torch/nn/modules/pixelshuffle.py

Device-to-device resource allocation in LTE-advanced networks by hybrid particle swarm optimization and genetic algorithm.

Shijie Sun, Kwang-Yul Kim, Oh-Soon Shin, Yoan Shin

2016

4 references

View Paper DOI View on GitHub

Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueck...

2016

5 references

View Paper PDF View on GitHub

▶ torch/nn/modules/pooling.py

Fractional Max-Pooling

Benjamin Graham

2014

21 references

View Paper PDF View on GitHub

▶ torch/nn/modules/rnn.py

Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

Haşim Sak, Andrew Senior, Françoise Beaufays

2014

2 references

View Paper PDF View on GitHub

▶ torch/nn/modules/transformer.py

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...

2017

15 references

View Paper PDF View on GitHub

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

2022

4 references

View Paper PDF View on GitHub

On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, ...

2020

4 references

View Paper PDF View on GitHub

▶ torch/nn/utils/parametrizations.py

Trivializations for Gradient-Based Optimization on Manifolds

Mario Lezcano-Casado

2019

2 references

View Paper PDF View on GitHub

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Tim Salimans, Diederik P. Kingma

2016

6 references

View Paper PDF View on GitHub

Spectral Normalization for Generative Adversarial Networks

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida

2018

6 references

View Paper PDF View on GitHub

▶ torch/nn/utils/spectral_norm.py

Spectral Normalization for Generative Adversarial Networks

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida

2018

6 references

View Paper PDF View on GitHub

▶ torch/nn/utils/weight_norm.py

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Tim Salimans, Diederik P. Kingma

2016

6 references

View Paper PDF View on GitHub

▶ torch/onnx/ops/__init__.py

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu

2021

2 references

View Paper PDF View on GitHub

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...

2017

15 references

View Paper PDF View on GitHub

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

2023

4 references

View Paper PDF View on GitHub

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer

2019

2 references

View Paper PDF View on GitHub

▶ torch/optim/adadelta.py

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler

2012

6 references

View Paper PDF View on GitHub

▶ torch/optim/_adafactor.py

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer, Mitchell Stern

2018

2 references

View Paper PDF View on GitHub

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2017

8 references

View Paper PDF View on GitHub

▶ torch/optim/adamax.py

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014

17 references

View Paper PDF View on GitHub

▶ torch/optim/adam.py

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014

17 references

View Paper PDF View on GitHub

▶ torch/optim/adamw.py

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2017

8 references

View Paper PDF View on GitHub

▶ torch/optim/lr_scheduler.py

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov, Frank Hutter

2016

8 references

View Paper PDF View on GitHub

Cyclical Learning Rates for Training Neural Networks

Leslie N. Smith

2015

2 references

View Paper PDF View on GitHub

Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

Leslie N. Smith, Nicholay Topin

2017

2 references

View Paper PDF View on GitHub

▶ torch/optim/_muon.py

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, E...

2025

2 references

View Paper PDF View on GitHub

▶ torch/optim/nadam.py

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2017

8 references

View Paper PDF View on GitHub

▶ torch/optim/radam.py

On the Variance of the Adaptive Learning Rate and Beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han

2019

2 references

View Paper PDF View on GitHub

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2017

8 references

View Paper PDF View on GitHub

▶ torch/optim/rmsprop.py

Generating Sequences With Recurrent Neural Networks

Alex Graves

2013

2 references

View Paper PDF View on GitHub

▶ torch/optim/sparse_adam.py

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014

17 references

View Paper PDF View on GitHub

▶ torch/optim/swa_utils.py

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

2018

6 references

View Paper PDF View on GitHub

There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson

2018

2 references

View Paper PDF View on GitHub

SWALP : Stochastic Weight Averaging in Low-Precision Training

Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa

2019

2 references

View Paper PDF View on GitHub

Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well

Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste

2020

2 references

View Paper PDF View on GitHub

▶ torch/profiler/profiler.py

Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zhen...

2023

4 references

View Paper PDF View on GitHub

▶ torch/_refs/nn/functional/__init__.py

Self-Normalizing Neural Networks

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter

2017

10 references

View Paper PDF View on GitHub

▶ torch/signal/windows/windows.py

Some windows with very good sidelobe behavior

A. Nuttall

1981

2 references

View Paper PDF DOI View on GitHub

Papers Referenced in This Repository

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kais...

2017

15 references

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer...

View Paper PDF

Show 7 references in code

torch/nn/modules/activation.py:1093

torch/nn/modules/transformer.py:63

torch/nn/modules/transformer.py:324

torch/nn/modules/transformer.py:559

torch/nn/modules/transformer.py:664

torch/nn/modules/transformer.py:987

torch/onnx/ops/__init__.py:379

Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions

Nathan Halko, Per-Gunnar Martinsson, Joel A. Tropp

2009

12 references

Low-rank matrix approximations, such as the truncated singular value decomposition and the rank-revealing QR decomposition, play a central role in data analysis and scientific computing. This work surveys and extends recent research which demonstrates that randomization offers a powerful tool for pe...

View Paper PDF

Show 6 references in code

torch/_lowrank.py:57

torch/_lowrank.py:58

torch/_lowrank.py:135

torch/_lowrank.py:136

torch/_lowrank.py:245

torch/_lowrank.py:246

Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks

Yuhang Li, Xin Dong, Wei Wang

2019

10 references

We propose Additive Powers-of-Two (APoT) quantization, an efficient non-uniform quantization scheme for the bell-shaped and long-tailed distribution of weights and activations in neural networks. By constraining all quantization levels as the sum of Powers-of-Two terms, APoT quantization enjoys high...

View Paper PDF

Show 5 references in code

torch/ao/quantization/experimental/linear.py:86

torch/ao/quantization/experimental/observer.py:39

torch/ao/quantization/experimental/quantizer.py:35

torch/ao/quantization/experimental/quantizer.py:63

torch/ao/quantization/experimental/quantizer.py:90

The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables

Chris J. Maddison, Andriy Mnih, Yee Whye Teh

2016

8 references

The reparameterization trick enables optimizing large scale stochastic computation graphs via gradient descent. The essence of the trick is to refactor each stochastic node into a differentiable function of its parameters and a random variable with fixed distribution. After refactoring, the gradient...

View Paper PDF

Show 4 references in code

.claude/skills/docstring/SKILL.md:165

.claude/skills/docstring/SKILL.md:321

docs/source/nn.functional.rst:114

torch/nn/functional.py:2205

PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization

Thijs Vogels, Sai Praneeth Karimireddy, Martin Jaggi

2019

8 references

We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well or fail to achieve the target test accuracy. We propose a new low-rank gradient ...

View Paper PDF

Show 4 references in code

docs/source/ddp_comm_hooks.md:58

docs/source/ddp_comm_hooks.md:218

torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:346

torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py:660

On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming

Christoph Boeddeker, Patrick Hanebrink, Lukas Drude, Jahn Heymann, Reinhold Haeb-Umbach

2017

9 references

This report describes the computation of gradients by algorithmic differentiation for statistically optimum beamforming operations. Especially the derivation of complex-valued functions is a key component of this approach. Therefore the real-valued algorithmic differentiation is extended via the com...

View Paper PDF

Show 4 references in code

docs/source/notes/gradcheck.rst:90

torch/autograd/gradcheck.py:402

torch/csrc/autograd/FunctionsManual.cpp:3757

torch/csrc/autograd/FunctionsManual.cpp:3859

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

169 citations

10 references

FP8 is a natural progression for accelerating deep learning training inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bi...

View Paper PDF

Show 4 references in code

docs/source/tensor_attributes.rst:32

docs/source/tensor_attributes.rst:33

torch/headeronly/util/Float8_e4m3fn.h:14

torch/headeronly/util/Float8_e5m2.h:14

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

2022

10 references

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point repres...

View Paper PDF

Show 4 references in code

docs/source/tensor_attributes.rst:34

docs/source/tensor_attributes.rst:35

torch/headeronly/util/Float8_e4m3fnuz.h:17

torch/headeronly/util/Float8_e5m2fnuz.h:17

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

Chelsea Finn, Pieter Abbeel, Sergey Levine

2017

8 references

We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning. The goal of meta-learning is...

View Paper PDF

Show 4 references in code

functorch/examples/maml_omniglot/maml-omniglot-higher.py:21

functorch/examples/maml_omniglot/maml-omniglot-ptonly.py:21

functorch/examples/maml_omniglot/maml-omniglot-transforms.py:21

functorch/examples/maml_omniglot/README.md:3

Fractional Max-Pooling

Benjamin Graham

2014

21 references

Convolutional networks almost always incorporate some form of spatial pooling, and very often it is alpha times alpha max-pooling with alpha=2. Max-pooling acts on the hidden layers of the network, reducing their size by an integer multiplicative factor alpha. The amazing by-product of discarding 75%...

View Paper PDF

Show 4 references in code

torch/nn/functional.py:477

torch/nn/functional.py:596

torch/nn/modules/pooling.py:954

torch/nn/modules/pooling.py:1041

Self-Normalizing Neural Networks

Günter Klambauer, Thomas Unterthiner, Andreas Mayr, Sepp Hochreiter

2017

10 references

Deep Learning has revolutionized vision via convolutional neural networks (CNNs) and natural language processing via recurrent neural networks (RNNs). However, success stories of Deep Learning with standard feed-forward neural networks (FNNs) are rare. FNNs that perform well are typically shallow an...

View Paper PDF

Show 4 references in code

torch/nn/modules/activation.py:710

torch/nn/modules/dropout.py:262

torch/nn/modules/dropout.py:314

torch/_refs/nn/functional/__init__.py:123

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe, Christian Szegedy

2015

11 references

Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriousl...

View Paper PDF

Show 4 references in code

torch/nn/modules/batchnorm.py:284

torch/nn/modules/batchnorm.py:396

torch/nn/modules/batchnorm.py:507

torch/nn/modules/batchnorm.py:619

Efficient Object Localization Using Convolutional Networks

Jonathan Tompson, Ross Goroshin, Arjun Jain, Yann LeCun, Christopher Bregler

2014

11 references

Recent state-of-the-art performance on human-body pose estimation has been achieved with Deep Convolutional Networks (ConvNets). Traditional ConvNet architectures include pooling and sub-sampling layers which reduce computational requirements, introduce invariance and prevent over-training. These be...

View Paper PDF

Show 4 references in code

torch/nn/modules/dropout.py:114

torch/nn/modules/dropout.py:169

torch/nn/modules/dropout.py:217

torch/nn/modules/dropout.py:316

Decoupled Weight Decay Regularization

Ilya Loshchilov, Frank Hutter

2017

8 references

L$_2$ regularization and weight decay regularization are equivalent for standard stochastic gradient descent (when rescaled by the learning rate), but as we demonstrate this is not the case for adaptive gradient algorithms, such as Adam. While common implementations of these algorithms employ...

View Paper PDF

Show 4 references in code

torch/optim/_adafactor.py:325

torch/optim/adamw.py:122

torch/optim/nadam.py:275

torch/optim/radam.py:250

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

2015

8 references

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of l...

View Paper PDF

Show 3 references in code

benchmarks/functional_autograd_benchmark/torchvision_models.py:90

benchmarks/functional_autograd_benchmark/torchvision_models.py:285

benchmarks/functional_autograd_benchmark/torchvision_models.py:295

Averaging Weights Leads to Wider Optima and Better Generalization

Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, Andrew Gordon Wilson

2018

6 references

Deep neural networks are typically trained by optimizing a loss function with an SGD variant, in conjunction with a decaying learning rate, until convergence. We show that simple averaging of multiple points along the trajectory of SGD, with a cyclical or constant learning rate, leads to better gene...

View Paper PDF

Show 3 references in code

docs/source/optim.md:523

torch/optim/swa_utils.py:208

torch/optim/swa_utils.py:429

Up or Down? Adaptive Rounding for Post-Training Quantization

Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort

2020

6 references

When quantizing neural networks, assigning each floating-point weight to its nearest fixed-point value is the predominant approach. We find that, perhaps surprisingly, this is not the best we can do. In this paper, we propose AdaRound, a better weight-rounding mechanism for post-training quantizatio...

View Paper PDF

Show 3 references in code

torch/ao/quantization/experimental/adaround_fake_quantize.py:16

torch/ao/quantization/experimental/adaround_loss.py:13

torch/ao/quantization/experimental/adaround_loss.py:56

Don't Use Large Mini-Batches, Use Local SGD

Tao Lin, Sebastian U. Stich, Kumar Kshitij Patel, Martin Jaggi

2018

6 references

Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large bat...

View Paper PDF

Show 3 references in code

torch/distributed/algorithms/model_averaging/averagers.py:41

torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:25

torch/distributed/optim/post_localSGD_optimizer.py:10

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016

9 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by...

View Paper PDF

Show 3 references in code

torch/nn/functional.py:2019

torch/nn/functional.py:2382

torch/nn/modules/activation.py:442
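
The formula quoted in the abstract above, $x\Phi(x)$, can be written out directly. The following is a minimal sketch using only the Python standard library; the function name `gelu` is chosen here for illustration and is not PyTorch's implementation, which the referenced source lines should be consulted for.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU, x * Phi(x), where Phi is the standard normal CDF.

    Illustrative helper only (hypothetical name, scalar input).
    """
    # Standard normal CDF via the error function:
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return x * phi


if __name__ == "__main__":
    # Print a few sample values around zero.
    for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"gelu({x:+.1f}) = {gelu(x):+.6f}")
```

Because $\Phi(x)$ lies strictly between 0 and 1, the output is a smoothly scaled copy of the input rather than a hard gate on its sign.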

Instance Normalization: The Missing Ingredient for Fast Stylization

Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky

2016

6 references

In this paper we revisit the fast stylization method introduced in Ulyanov et al. (2016). We show how a small change in the stylization architecture results in a significant qualitative improvement in the generated images. The change is limited to swapping batch normalization with instance normaliz...

View Paper PDF

Show 3 references in code

torch/nn/modules/instancenorm.py:134

torch/nn/modules/instancenorm.py:249

torch/nn/modules/instancenorm.py:365

Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

Tim Salimans, Diederik P. Kingma

2016

6 references

We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the optimization problem and we speed up convergence of s...

View Paper PDF

Show 3 references in code

torch/nn/utils/parametrizations.py:352

torch/nn/utils/weight_norm.py:2

torch/nn/utils/weight_norm.py:102

Spectral Normalization for Generative Adversarial Networks

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, Yuichi Yoshida

2018

6 references

One of the challenges in the study of generative adversarial networks is the instability of its training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally ...

View Paper PDF

Show 3 references in code

torch/nn/utils/parametrizations.py:559

torch/nn/utils/spectral_norm.py:2

torch/nn/utils/spectral_norm.py:289

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014

17 references

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescal...

View Paper PDF

Show 3 references in code

torch/optim/adamax.py:220

torch/optim/adam.py:339

torch/optim/sparse_adam.py:188

A simple method for generating gamma variables

George Marsaglia, Wai Wan Tsang

2000

4 references

We offer a procedure for generating a gamma variate as the cube of a suitably scaled normal variate. It is fast and simple, assuming one has a fast way to generate normal variables. In brief: generate a normal variate x and a un...

View Paper DOI

Show 2 references in code

aten/src/ATen/native/Distributions.h:101

aten/src/ATen/native/Distributions.h:102

Categorical Reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, Ben Poole

2016

4 references

Categorical variables are a natural choice for representing discrete structure in the world. However, stochastic neural networks rarely use categorical latent variables due to the inability to backpropagate through samples. In this work, we present an efficient gradient estimator that replaces the n...

View Paper PDF

Show 2 references in code

docs/source/nn.functional.rst:115

torch/nn/functional.py:2207

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi

2018

4 references

We present an overview of techniques for quantizing convolutional neural networks for inference with integer weights and activations. Per-channel quantization of weights and per-layer quantization of activations to 8-bits of precision post-training produces classification accuracies within 2% of flo...

View Paper PDF

Show 2 references in code

torch/ao/nn/intrinsic/qat/modules/conv_fused.py:164

torch/ao/nn/intrinsic/qat/modules/linear_fused.py:96

Data-Free Quantization Through Weight Equalization and Bias Correction

Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling

2019

4 references

We introduce a data-free quantization method for deep neural networks that does not require fine-tuning or hyperparameter selection. It achieves near-original model performance on common computer vision architectures and tasks. 8-bit fixed-point quantization is essential for efficient inference on m...

View Paper PDF

Show 2 references in code

torch/ao/quantization/_correct_bias.py:108

torch/ao/quantization/_equalize.py:211

Trained Quantization Thresholds for Accurate and Efficient Fixed-Point Inference of Deep Neural Networks

Sambhav R. Jain, Albert Gural, Michael Wu, Chris H. Dick

2019

4 references

We propose a method of training quantization thresholds (TQT) for uniform symmetric quantizers using standard backpropagation and gradient descent. Contrary to prior work, we show that a careful analysis of the straight-through estimator for threshold gradients allows for a natural range-precision t...

View Paper PDF

Show 2 references in code

torch/ao/quantization/observer.py:337

torch/ao/quantization/utils.py:642

Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton

2016

6 references

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of train...

View Paper PDF

Show 2 references in code

torch/csrc/api/include/torch/nn/modules/normalization.h:51

torch/nn/modules/normalization.py:110

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

Jeff Daily, Abhinav Vishnu, Charles Siegel, Thomas Warfel, Vinay Amatya

2018

4 references

In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from $\Theta(\log(p))$ fo...

View Paper PDF

Show 2 references in code

torch/distributed/fsdp/fully_sharded_data_parallel.py:2016

torch/distributed/fsdp/fully_sharded_data_parallel.py:2027

Zero Bubble Pipeline Parallelism

Penghui Qi, Xinyi Wan, Guangxing Huang, Min Lin

2023

4 references

Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge, is the first to successfully achieve zero pipeline bubbles u...

View Paper PDF

Show 2 references in code

torch/distributed/pipelining/schedules.py:2625

torch/distributed/pipelining/schedules.py:2819

GSPMD: General and Scalable Parallelization for ML Computation Graphs

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, ...

2021

4 references

We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the ...

View Paper PDF

Show 2 references in code

torch/distributed/tensor/README.md:32

torch/distributed/tensor/README.md:153

Efficient ConvBN Blocks for Transfer Learning and Beyond

Kaichao You, Guo Qin, Anchang Bao, Meng Cao, Ping Huang, Jiulong Shan, Mingsheng Long

2023

4 references

Convolution-BatchNorm (ConvBN) blocks are integral components in various computer vision tasks and other domains. A ConvBN block can operate in three modes: Train, Eval, and Deploy. While the Train mode is indispensable for training models from scratch, the Eval mode is suitable for transfer learnin...

View Paper PDF

Show 2 references in code

torch/_inductor/fx_passes/efficient_conv_bn_eval.py:21

torch/_inductor/fx_passes/efficient_conv_bn_eval.py:91

Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

Stefan Elfwing, Eiji Uchibe, Kenji Doya

2017

5 references

In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:2385

torch/nn/modules/activation.py:445

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le

2017

6 references

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, ...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:2386

torch/nn/modules/activation.py:446

Mish: A Self Regularized Non-Monotonic Activation Function

Diganta Misra

2019

4 references

We propose Mish, a novel self-regularized non-monotonic activation function which can be mathematically defined as: $f(x)=x\tanh(\operatorname{softplus}(x))$. As activation functions play a crucial role in the performance and training dynamics in neural networks, we validated experimentally on several w...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:2407

torch/nn/modules/activation.py:492
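
The definition in the abstract above, x · tanh(softplus(x)) with softplus(x) = ln(1 + e^x), can be sketched in a few lines. This is a standard-library illustration, not PyTorch's implementation; the names `softplus` and `mish` are chosen here, and the overflow-safe rewrite of softplus is an addition for numerical stability rather than something stated in the abstract.

```python
import math


def softplus(x: float) -> float:
    """softplus(x) = ln(1 + exp(x)), rewritten to avoid overflow for large |x|."""
    # Identity: ln(1 + e^x) = max(x, 0) + ln(1 + e^(-|x|))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x)). Illustrative scalar helper."""
    return x * math.tanh(softplus(x))


if __name__ == "__main__":
    # Print a few sample values, including negative inputs.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"mish({x:+.1f}) = {mish(x):+.6f}")
```

For large positive inputs tanh(softplus(x)) saturates at 1, so the function approaches the identity, while negative inputs produce small negative outputs rather than exactly zero.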

Searching for MobileNetV3

Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun ...

2019

4 references

We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm an...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:2434

torch/nn/modules/activation.py:531

Rethinking the Inception Architecture for Computer Vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna

2015

6 references

Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014 very deep convolutional networks started to become mainstream, yielding substantial gains in various benchmarks. Although increased model size and computational cost tend...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:3456

torch/nn/modules/loss.py:1282

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

2015

4 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, whi...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:5157

torch/nn/functional.py:5261

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

2023

4 references

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi...

View Paper PDF

Show 2 references in code

torch/nn/functional.py:6085

torch/onnx/ops/__init__.py:380

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

2022

4 references

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not ach...

View Paper PDF

Show 2 references in code

torch/nn/modules/activation.py:1154

torch/nn/modules/transformer.py:736

Device-to-device resource allocation in LTE-advanced networks by hybrid particle swarm optimization and genetic algorithm.

Shijie Sun, Kwang-Yul Kim, Oh-Soon Shin, Yoan Shin

2016

4 references

View Paper DOI

Show 2 references in code

torch/nn/modules/pixelshuffle.py:21

torch/nn/modules/pixelshuffle.py:80

Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueck...

2016

5 references

Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a sin...

View Paper PDF

Show 2 references in code

torch/nn/modules/pixelshuffle.py:48

torch/nn/modules/pixelshuffle.py:107

On Layer Normalization in the Transformer Architecture

Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, ...

2020

4 references

The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. I...

View Paper PDF

Show 2 references in code

torch/nn/modules/transformer.py:942

torch/nn/modules/transformer.py:1126

SGDR: Stochastic Gradient Descent with Warm Restarts

Ilya Loshchilov, Frank Hutter

2016

8 references

Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial warm restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we...

View Paper PDF

Show 2 references in code

torch/optim/lr_scheduler.py:1386

torch/optim/lr_scheduler.py:2137

Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

Srinivas Sridharan, Taekyung Heo, Louis Feng, Zhaodong Wang, Matt Bergeron, Wenyin Fu, Shengbao Zhen...

2023

4 references

Benchmarking and co-design are essential for driving optimizations and innovation around ML models, ML software, and next-generation hardware. Full workload benchmarks, e.g. MLPerf, play an essential role in enabling fair comparison across different software and hardware stacks especially once syste...

View Paper PDF

Show 2 references in code

torch/profiler/profiler.py:119

torch/profiler/profiler.py:603

Sparse matrix multiplication package (SMMP)

R. Bank, C. Douglas

1993

2 references

View Paper PDF DOI

Show 1 reference in code

aten/src/ATen/native/sparse/SparseMatMul.cpp:32

On Multiplicative Integration with Recurrent Neural Networks

Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, Ruslan Salakhutdinov

2016

2 references

We introduce a general and simple structural design called Multiplicative Integration (MI) to improve recurrent neural networks (RNNs). MI changes the way in which information from different sources flows and is integrated in the computational building block of an RNN, while introducing almost no e...

View Paper PDF

Show 1 reference in code

benchmarks/fastrnns/cells.py:9

Wav2Letter: an End-to-End ConvNet-based Speech Recognition System

Ronan Collobert, Christian Puhrsch, Gabriel Synnaeve

2016

2 references

This paper presents a simple end-to-end model for speech recognition, combining a convolutional network based acoustic model and a graph decoding. It is trained to output letters, with transcribed speech, without the need for force alignment of phonemes. We introduce an automatic segmentation criter...

View Paper PDF

Show 1 reference in code

benchmarks/functional_autograd_benchmark/torchaudio_models.py:18

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew T...

2017

2 references

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a ...

View Paper PDF

Show 1 reference in code

benchmarks/functional_autograd_benchmark/torchvision_models.py:203

The Complex Gradient Operator and the CR-Calculus

Ken Kreutz-Delgado

2009

2 references

A thorough discussion and development of the calculus of real-valued functions of complex-valued vectors is given using the framework of the Wirtinger Calculus. The presented material is suitable for exposition in an introductory Electrical Engineering graduate level course on the use of complex gra...

View Paper PDF

Show 1 reference in code

docs/source/notes/autograd.rst:575

Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization

Pranav Subramani, Nicholas Vadivelu, Gautam Kamath

2020

2 references

A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD), which may be as large as two orders of magnitude. We thoroughly demonstrate that by exploiting powerful language prim...

View Paper PDF

Show 1 reference in code

functorch/examples/dp_cifar10/README.md:9

Forcing Generative Models to Degenerate Ones: The Power of Data Poisoning Attacks

Shuli Jiang, Swanand Ravindra Kadhe, Yi Zhou, Ling Cai, Nathalie Baracaldo

2023

2 references

Growing applications of large language models (LLMs) trained by a third party raise serious concerns on the security vulnerability of LLMs. It has been demonstrated that malicious actors can covertly exploit these vulnerabilities in LLMs through poisoning attacks aimed at generating undesirable outpu...

View Paper PDF

Show 1 reference in code

SECURITY.md:41

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Da...

2024

2 references

Humans are capable of strategically deceptive behavior: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. If an AI system learned such a deceptive strategy, could we detect it and remove it using current st...

View Paper PDF

Show 1 reference in code

SECURITY.md:42

Filter Pruning via Geometric Median for Deep Convolutional Neural Networks Acceleration

Yang He, Ping Liu, Ziwei Wang, Zhilan Hu, Yi Yang

2018

2 references

Previous works utilized the "smaller-norm-less-important" criterion to prune filters with smaller norm values in a convolutional neural network. In this paper, we analyze this norm-based criterion and point out that its effectiveness depends on two requirements that are not always met: (1) the norm de...

View Paper PDF

Show 1 reference in code

torch/ao/pruning/_experimental/pruner/FPGM_pruner.py:16

To prune, or not to prune: exploring the efficacy of pruning for model compression

Michael Zhu, Suyog Gupta

2017

2 references

Model pruning seeks to induce sparsity in a deep neural network's various connection matrices, thereby reducing the number of nonzero-valued parameters in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep networks at the cost of only a marginal loss in accuracy and achieve...

View Paper PDF

Show 1 reference in code

torch/ao/pruning/scheduler/cubic_scheduler.py:68

The differentiation of pseudo-inverses and non-linear least squares problems whose variables separate

G. Golub, V. Pereyra

1972

2 references

For given data $(t_i ,y_i ),i = 1, \cdots ,m$, we consider the least squares fit of nonlinear models of the form \[ \eta ({\bf a},{\boldsymbol \alpha} ;t) = \sum _{j = 1}^n {a_j \varphi _j ({\boldsymbol \alpha} ;t),\qquad {\bf a} \in \mathcal{R}^n ,\qquad {\boldsymbol \alpha} \in \mathcal{R}^k .} \]...

View Paper DOI

Show 1 reference in code

torch/csrc/autograd/FunctionsManual.cpp:2057

Reliable and fast DWARF-based stack unwinding

T. Bastian, Stephen Kell, Francesco Zappa Nardelli

2019

2 references

Debug information, usually encoded in the DWARF format, is a hidden and obscure component of our computing infrastructure. Debug information is obviously used by debuggers, but it also plays a key role in program analysis tools, and, most surprisingly, it can be relied upon by the runtime of high-le...

View Paper PDF DOI

Show 1 reference in code

torch/csrc/profiler/unwind/fde.h:34

Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD

Jiayi Wang, Shiqiang Wang, Rong-Rong Chen, Mingyue Ji

2020

2 references

Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent research efforts, the effect of local aggregation on glo...

View Paper PDF

Show 1 reference in code

torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py:20

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil ...

2018

2 references

PipeDream is a Deep Neural Network (DNN) training system for GPUs that parallelizes computation by pipelining execution across multiple machines. Its pipeline parallel computing model avoids the slowdowns faced by data-parallel training when large models and/or limited network bandwidth induce high c...

View Paper PDF

Show 1 reference in code

torch/distributed/benchmarks/README.md:11

Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training

Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Hongjun Choi, Blake Hechtman, Shibo Wang

2020

2 references

In data-parallel synchronous training of deep neural networks, different devices (replicas) run the same program with different partitions of the training batch, but weight update computation is repeated on all replicas, because the weights do not have a batch dimension to partition. This can be a b...

View Paper PDF

Show 1 reference in code

torch/distributed/fsdp/fully_sharded_data_parallel.py:120

DoRA: Weight-Decomposed Low-Rank Adaptation

Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, M...

2024

2 references

Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods and full fine-tuning (FT). In this work, we first in...

View Paper PDF

Show 1 reference in code

torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py:254

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

2019

2 references

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communicat...

View Paper PDF

Show 1 reference in code

torch/distributed/optim/zero_redundancy_optimizer.py:293

Breadth-First Pipeline Parallelism

Joel Lamy-Poirier

2022

2 references

We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utilization with a small batch size per GPU, and by maki...

View Paper PDF

Show 1 reference in code

torch/distributed/pipelining/schedules.py:2298

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Ko...

2021

2 references

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute opera...

View Paper PDF

Show 1 reference in code

torch/distributed/pipelining/schedules.py:2504

DeepSeek-V3 Technical Report

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, C...

2024

2 references

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly...

View Paper PDF

Show 1 reference in code

torch/distributed/pipelining/schedules.py:3005

Reducing Activation Recomputation in Large Transformer Models

Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, B...

2022

2 references

Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recomputation is commonly used to work around memory capa...

View Paper PDF

Show 1 reference in code

torch/distributed/tensor/parallel/style.py:331

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro

2019

2 references

Recent work in language modeling demonstrates that training large transformer models advances the state of the art in Natural Language Processing applications. However, very large models can be quite difficult to train due to memory constraints. In this work, we present our techniques for training v...

View Paper PDF

Show 1 reference in code

torch/distributed/tensor/README.md:171

The continuous Bernoulli: fixing a pervasive error in variational autoencoders

Gabriel Loaiza-Ganem, John P. Cunningham

2019

2 references

Variational autoencoders (VAE) have quickly become a central tool in machine learning, applicable to a broad range of data types and latent variable models. By far the most common first step, taken by seminal papers and by core software libraries alike, is to model MNIST data using a deep network pa...

View Paper PDF

Show 1 reference in code

torch/distributions/continuous_bernoulli.py:47

TensorFlow Distributions

Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Pa...

2017

2 references

The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Building on two basic abstractions, it offers flexible building blocks for probabilistic computation. Distributions provide fast, nume...

View Paper PDF

Show 1 reference in code

torch/distributions/__init__.py:8

Gradient Estimation Using Stochastic Computation Graphs

John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

2015

2 references

In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Estimating the gradient of this loss function, using...

View Paper PDF

Show 1 reference in code

torch/distributions/__init__.py:23

A closed-form formula for the Kullback-Leibler divergence between Cauchy distributions

Frédéric Chyzak, Frank Nielsen

2019

2 references

View Paper PDF

Show 1 reference in code

torch/distributions/kl.py:954

Comparison of a Complete Percutaneous versus Surgical Approach to Aortic Valve Replacement and Revascularization in Patients at Intermediate Surgical Risk: Results from the Randomized SURTAVI Trial.

L. Søndergaard, J. Popma, M. Reardon, N. Van Mieghem, G. Deeb, S. Kodali, I. George, Mathew R. Willi...

2019

2 references

BACKGROUND For patients with severe aortic stenosis (AS) and coronary artery disease (CAD), the completely percutaneous approach to aortic valve replacement and revascularization has not been compared to the standard surgical approach. METHODS The prospective SURTAVI trial enrolled intermediate-ri...

View Paper PDF DOI

Show 1 reference in code

torch/__init__.py:1604

Leveraging the bfloat16 Artificial Intelligence Datatype For Higher-Precision Computations

Greg Henry, Ping Tak Peter Tang, Alexander Heinecke

2019

2 references

In recent years fused-multiply-add (FMA) units with lower-precision multiplications and higher-precision accumulation have proven useful in machine learning/artificial intelligence applications, most notably in training deep neural networks due to their extreme computational intensity. Compared to c...

View Paper PDF

Show 1 reference in code

torch/__init__.py:1618

A Case for a Biorthogonal Jacobi-Davidson Method: Restarting and Correction Equation

Andreas Stathopoulos

2002

2 references

We propose a biorthogonal Jacobi-Davidson method (biJD), which can be viewed as an explicitly biorthogonalized, restarted Lanczos method, that uses the approximate solution of a correction equation to expand its basis. Through an elegant formulation, the algorithm allows for all the functionalities...

View Paper DOI

Show 1 reference in code

torch/_lobpcg.py:373

Toward the Optimal Preconditioned Eigensolver: Locally Optimal Block Preconditioned Conjugate Gradient Method

Andrew V. Knyazev

2001

2 references

We describe new algorithms of the locally optimal block preconditioned conjugate gradient (LOBPCG) method for symmetric eigenvalue problems, based on a local optimization of a three-term recurrence, and suggest several other new methods. To be able to compare numerically different methods in the cla...

View Paper DOI

Show 1 reference in code

torch/_lobpcg.py:489

A robust and efficient implementation of LOBPCG

Jed A. Duersch, Meiyue Shao, Chao Yang, Ming Gu

2017

2 references

Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) is widely used to compute eigenvalues of large sparse symmetric matrices. The algorithm can suffer from numerical instability if it is not implemented with care. This is especially problematic when the number of eigenpairs to be comput...

View Paper PDF DOI

Show 1 reference in code

torch/_lobpcg.py:504

Efficient Memory Management for Deep Neural Net Inference

Yury Pisarchyk, Juhyun Lee

2020

2 references

While deep neural net inference was considered a task for servers only, latest advances in technology allow the task of inference to be moved to mobile and embedded devices, desired for various reasons ranging from latency to privacy. These devices are not only limited by their compute power and bat...

View Paper PDF

Show 1 reference in code

torch/nativert/executor/memory/GreedyBySize.cpp:77

Language Modeling with Gated Convolutional Networks

Yann N. Dauphin, Angela Fan, Michael Auli, David Grangier

2016

2 references

The pre-dominant approach to language modeling to date is based on recurrent neural networks. Their success on this task is often linked to their ability to capture unbounded context. In this paper we develop a finite context approach through stacked convolutions, which can be more efficient since t...

View Paper PDF

Show 1 reference in code

torch/nn/functional.py:1747

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao

2023

2 references

Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the ma...

View Paper PDF

Show 1 reference in code

torch/nn/functional.py:6081

Understanding and Improving Convolutional Neural Networks via Concatenated Rectified Linear Units

Wenling Shang, Kihyuk Sohn, Diogo Almeida, Honglak Lee

2016

3 references

Recently, convolutional neural networks (CNNs) have been used as a powerful tool to solve many problems of machine learning and computer vision. In this paper, we aim to provide insight on the property of convolutional neural networks, as well as a generic method to improve the performance of many C...

View Paper PDF

Show 1 reference in code

torch/nn/modules/activation.py:126

Empirical Evaluation of Rectified Activations in Convolutional Network

Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li

2015

2 references

In this paper we investigate the performance of different types of rectified activation functions in convolutional neural network: standard rectified linear unit (ReLU), leaky rectified linear unit (Leaky ReLU), parametric rectified linear unit (PReLU) and a new randomized leaky rectified linear uni...

View Paper PDF

Show 1 reference in code

torch/nn/modules/activation.py:158

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arné Clevert, Thomas Unterthiner, Sepp Hochreiter

2015

4 references

We introduce the "exponential linear unit" (ELU) which speeds up learning in deep neural networks and leads to higher classification accuracies. Like rectified linear units (ReLUs), leaky ReLUs (LReLUs) and parametrized ReLUs (PReLUs), ELUs alleviate the vanishing gradient problem via the identity f...

View Paper PDF

Show 1 reference in code

torch/nn/modules/activation.py:577

Continuously Differentiable Exponential Linear Units

Jonathan T. Barron

2017

2 references

Exponential Linear Units (ELUs) are a useful rectifier for constructing deep learning architectures, as they may speed up and otherwise improve learning by virtue of not having vanishing gradients and by having mean activations near zero. However, the ELU activation as parametrized in [1] is not conti...

View Paper PDF

Show 1 reference in code

torch/nn/modules/activation.py:652

Efficient softmax approximation for GPUs

Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, Hervé Jégou

2016

2 references

We propose an approximate strategy to efficiently train neural network based language models over very large vocabularies. Our approach, called adaptive softmax, circumvents the linear dependency on the vocabulary size by exploiting the unbalanced word distribution to form clusters that explicitly m...

View Paper PDF

Show 1 reference in code

torch/nn/modules/adaptive.py:28

Improving neural networks by preventing co-adaptation of feature detectors

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, Ruslan R. Salakhutdinov

2012

4 references

When a large feedforward neural network is trained on a small training set, it typically performs poorly on held-out test data. This "overfitting" is greatly reduced by randomly omitting half of the feature detectors on each training case. This prevents complex co-adaptations in which a feature dete...

View Paper PDF

Show 1 reference in code

torch/nn/modules/dropout.py:66
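
The abstract above describes randomly omitting feature detectors on each training case. A minimal pure-Python sketch of that idea follows. The function name `dropout` and its parameters are hypothetical, chosen for illustration, and the rescaling of surviving values by 1/(1 - p) is the "inverted dropout" convention assumed here rather than something the abstract states.

```python
import random


def dropout(values, p=0.5, rng=None):
    """Zero each value with probability p and rescale the survivors.

    Assumption: survivors are divided by (1 - p) so the expected value of
    each output matches its input (the "inverted dropout" convention).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    rng = rng or random.Random()
    keep = 1.0 - p
    # rng.random() is uniform on [0, 1), so a value survives with probability keep.
    return [v / keep if rng.random() >= p else 0.0 for v in values]


# Training-time use: roughly half of the activations are omitted at p=0.5.
activations = [1.0] * 8
masked = dropout(activations, p=0.5, rng=random.Random(0))
```

At evaluation time the mask would simply be skipped, which is why the rescaling is applied during training in this sketch.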

Estimating the mean and variance of the target probability distribution

D. Nix, A. Weigend

1994

2 references

Introduces a method that estimates the mean and the variance of the probability distribution of the target as a function of the input, given an assumed target error-distribution model. Through the activation of an auxiliary output unit, this method provides a measure of the uncertainty of the usual ...

View Paper DOI

Show 1 reference in code

torch/nn/modules/loss.py:441

Fast R-CNN

Ross Girshick

2015

2 references

This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-CNN employs several innovations to improve training...

View Paper PDF

Show 1 reference in code

torch/nn/modules/loss.py:1041

Group Normalization

Yuxin Wu, Kaiming He

2018

2 references

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics es...

View Paper PDF

Show 1 reference in code

torch/nn/modules/normalization.py:244

Root Mean Square Layer Normalization

Biao Zhang, Rico Sennrich

2019

2 references

Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by L...

View Paper PDF

Show 1 reference in code

torch/nn/modules/normalization.py:337

Long Short-Term Memory Based Recurrent Neural Network Architectures for Large Vocabulary Speech Recognition

Haşim Sak, Andrew Senior, Françoise Beaufays

2014

2 references

Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture that has been designed to address the vanishing and exploding gradient problems of conventional RNNs. Unlike feedforward neural networks, RNNs have cyclic connections making them powerful for modeling sequences. They have...

View Paper PDF

Show 1 reference in code

torch/nn/modules/rnn.py:845

Trivializations for Gradient-Based Optimization on Manifolds

Mario Lezcano-Casado

2019

2 references

We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. We call these parametrizations "trivializations". We prove conditions under which a trivialization is sound in the context of...

View Paper PDF

Show 1 reference in code

torch/nn/utils/parametrizations.py:253

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, Yunfeng Liu

2021

2 references

Position encoding has recently been shown to be effective in the transformer architecture. It enables valuable supervision for dependency modeling between elements at different positions of the sequence. In this paper, we first investigate various methods to integrate positional information into the learning p...

View Paper PDF

Show 1 reference in code

torch/onnx/ops/__init__.py:297

Fast Transformer Decoding: One Write-Head is All You Need

Noam Shazeer

2019

2 references

Multi-head attention layers, as used in the Transformer neural sequence model, are a powerful alternative to RNNs for moving information across and between sequences. While training these layers is generally fast and simple, due to parallelizability across the length of the sequence, incremental inf...

View Paper PDF

Show 1 reference in code

torch/onnx/ops/__init__.py:381

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler

2012

6 references

We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning...

View Paper PDF

Show 1 reference in code

torch/optim/adadelta.py:239

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer, Mitchell Stern

2018

2 references

In several recently proposed stochastic optimization methods (e.g. RMSProp, Adam, Adadelta), parameter updates are scaled by the inverse square roots of exponential moving averages of squared past gradients. Maintaining these per-parameter second-moment estimators requires memory equal to the number...

View Paper PDF

Show 1 reference in code

torch/optim/_adafactor.py:323
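
The abstract above starts from the update rule shared by RMSProp, Adam, and Adadelta: scale each update by the inverse square root of an exponential moving average of squared past gradients. The sketch below shows that per-parameter baseline for a single scalar. It is not Adafactor's factored estimator, and the name `rms_scaled_step` and the default hyperparameters are assumptions made for illustration.

```python
import math


def rms_scaled_step(param, grad, avg_sq, lr=0.01, beta=0.9, eps=1e-8):
    """One update scaled by the inverse root of a running mean of grad**2.

    avg_sq is the second-moment estimate carried between steps; the abstract
    notes that storing one such value per parameter is the memory cost.
    """
    avg_sq = beta * avg_sq + (1.0 - beta) * grad * grad
    param = param - lr * grad / (math.sqrt(avg_sq) + eps)
    return param, avg_sq


# Minimise f(x) = x**2, whose gradient is 2 * x, starting from x = 1.0.
x, state = 1.0, 0.0
for _ in range(100):
    x, state = rms_scaled_step(x, 2.0 * x, state)
```

Because the step is normalised by the gradient's running magnitude, its size stays close to `lr` regardless of how large the raw gradient is.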

Cyclical Learning Rates for Training Neural Networks

Leslie N. Smith

2015

2 references

It is known that the learning rate is the most important hyper-parameter to tune for training deep neural networks. This paper describes a new method for setting the learning rate, named cyclical learning rates, which practically eliminates the need to experimentally find the best values and schedul...

View Paper PDF

Show 1 reference in code

torch/optim/lr_scheduler.py:1890

Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates

Leslie N. Smith, Nicholay Topin

2017

2 references

In this paper, we describe a phenomenon, which we named "super-convergence", where neural networks can be trained an order of magnitude faster than with standard training methods. The existence of super-convergence is relevant to understanding why deep networks generalize well. One of the key elemen...

View Paper PDF

Show 1 reference in code

torch/optim/lr_scheduler.py:2393

Muon is Scalable for LLM Training

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, E...

2025

2 references

Recently, the Muon optimizer based on matrix orthogonalization has demonstrated strong results in training small-scale language models, but the scalability to larger models has not been proven. We identify two crucial techniques for scaling up Muon: (1) adding weight decay and (2) carefully adjustin...

View Paper PDF

Show 1 reference in code

torch/optim/_muon.py:282

On the Variance of the Adaptive Learning Rate and Beyond

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han

2019

2 references

The learning rate warmup heuristic achieves remarkable success in stabilizing training, accelerating convergence and improving generalization for adaptive stochastic optimization algorithms like RMSprop and Adam. Here, we study its mechanism in detail. Pursuing the theory behind warmup, we identify...

View Paper PDF

Show 1 reference in code

torch/optim/radam.py:246

Generating Sequences With Recurrent Neural Networks

Alex Graves

2013

2 references

This paper shows how Long Short-term Memory recurrent neural networks can be used to generate complex sequences with long-range structure, simply by predicting one data point at a time. The approach is demonstrated for text (where the data are discrete) and online handwriting (where the data are rea...

View Paper PDF

Show 1 reference in code

torch/optim/rmsprop.py:238

There Are Many Consistent Explanations of Unlabeled Data: Why You Should Average

Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, Andrew Gordon Wilson

2018

2 references

Presently the most successful approaches to semi-supervised learning are based on consistency regularization, whereby a model is trained to be robust to small perturbations of its inputs and parameters. To understand consistency regularization, we conceptually explore how loss geometry interacts wit...

View Paper PDF

Show 1 reference in code

torch/optim/swa_utils.py:211

SWALP: Stochastic Weight Averaging in Low-Precision Training

Guandao Yang, Tianyi Zhang, Polina Kirichenko, Junwen Bai, Andrew Gordon Wilson, Christopher De Sa

2019

2 references

Low precision operations can provide scalability, memory savings, portability, and energy efficiency. This paper proposes SWALP, an approach to low precision training that averages low-precision SGD iterates with a modified learning rate schedule. SWALP is easy to implement and can match the perform...

View Paper PDF

Show 1 reference in code

torch/optim/swa_utils.py:213

Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well

Vipul Gupta, Santiago Akle Serrano, Dennis DeCoste

2020

2 references

We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models computed independently and in parallel. The resulting m...

View Paper PDF

Show 1 reference in code

torch/optim/swa_utils.py:216

Some windows with very good sidelobe behavior

A. Nuttall

1981

2 references

Some of the windows presented by Harris [1] are not correct in terms of their reported peak sidelobes and optimal behavior. We present corrected plots of Harris' windows and also derive additional windows with very good sidelobes and optimal behavior under several different constraints. The temporal...

View Paper PDF DOI

Show 1 reference in code

torch/signal/windows/windows.py:849
