Machine Learning
Machine learning frameworks, algorithms, and training systems
Repositories (7)
huggingface/transformers
microsoft/onnxruntime
mlflow/mlflow
pytorch/pytorch
ray-project/ray
scikit-learn/scikit-learn
tensorflow/tensorflow
Papers (373)
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
In data-parallel synchronous training of deep neural networks, different devices (replicas) run the same program with different partitions of the training batch, but weight update computation is repeated on all replicas, because the weights do not ha...
Breadth-First Pipeline Parallelism
We introduce Breadth-First Pipeline Parallelism, a novel training schedule which optimizes the combination of pipeline and data parallelism. Breadth-First Pipeline Parallelism lowers training time, cost and memory usage by combining a high GPU utiliz...
DeepSeek-V3 Technical Report
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) an...
Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD
Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent resear...
Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have led to key efficiency and scalability gains in recent years. However, progress faces a...
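A minimal single-process sketch of the local-SGD pattern the entry above argues for: each worker runs several SGD steps on its own data shard, and only then are the model weights averaged. The worker count, local-step count, and toy model are illustrative; a real deployment would use a distributed communication backend rather than an in-process loop.

```python
import copy
import torch

torch.manual_seed(0)
num_workers, local_steps = 4, 8
# Toy per-worker data shards for a tiny regression model.
shards = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(num_workers)]
global_model = torch.nn.Linear(10, 1)

for communication_round in range(5):
    worker_states = []
    for x, y in shards:
        model = copy.deepcopy(global_model)             # start from the current global weights
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(local_steps):                    # several local steps, no communication
            opt.zero_grad()
            torch.nn.functional.mse_loss(model(x), y).backward()
            opt.step()
        worker_states.append(model.state_dict())
    # Communication step: average the worker weights into the global model.
    averaged = {k: torch.stack([s[k] for s in worker_states]).mean(0)
                for k in worker_states[0]}
    global_model.load_state_dict(averaged)
```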
DoRA: Weight-Decomposed Low-Rank Adaptation
Among the widely used parameter-efficient fine-tuning (PEFT) methods, LoRA and its variants have gained considerable popularity because of avoiding additional inference costs. However, there still often exists an accuracy gap between these methods an...
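A rough sketch of the weight decomposition behind DoRA, assuming the usual formulation W' = m · (W0 + BA) / ||W0 + BA|| with a frozen base weight, a LoRA update BA, and a trainable per-column magnitude m; the class name, rank, and initialization scale are illustrative rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DoRALinear(nn.Module):
    """Frozen base weight plus a LoRA update, rescaled by a learned magnitude (sketch)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach(), requires_grad=False)  # W0, frozen
        self.bias = base.bias
        out_f, in_f = self.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)   # low-rank factors
        self.B = nn.Parameter(torch.zeros(out_f, rank))          # zero init => starts at W0
        # Magnitude initialized to the column-wise norm of W0, so the layer initially equals W0.
        self.m = nn.Parameter(self.weight.norm(dim=0, keepdim=True).detach().clone())

    def forward(self, x):
        merged = self.weight + self.B @ self.A                   # direction part: W0 + BA
        direction = merged / merged.norm(dim=0, keepdim=True)    # column-normalized
        return F.linear(x, self.m * direction, self.bias)

layer = DoRALinear(nn.Linear(16, 4))
print(layer(torch.randn(2, 16)).shape)   # torch.Size([2, 4])
```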
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a mu...
Estimating the mean and variance of the target probability distribution
Introduces a method that estimates the mean and the variance of the probability distribution of the target as a function of the input, given an assumed target error-distribution model. Through the activation of an auxiliary output unit, this method p...
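A compact sketch of the mechanism described above under a Gaussian error model: the network gets an auxiliary head for the log-variance, and both heads are trained with the Gaussian negative log-likelihood. Layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class MeanVarianceNet(nn.Module):
    """Predicts the conditional mean and (log-)variance of the target."""
    def __init__(self, in_dim=8, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.mean_head = nn.Linear(hidden, 1)
        self.logvar_head = nn.Linear(hidden, 1)   # auxiliary output unit for the variance

    def forward(self, x):
        h = self.body(x)
        return self.mean_head(h), self.logvar_head(h)

def gaussian_nll(mean, logvar, target):
    # 0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), averaged over the batch
    return 0.5 * (logvar + (target - mean) ** 2 / logvar.exp()).mean()

net = MeanVarianceNet()
x, y = torch.randn(64, 8), torch.randn(64, 1)
mean, logvar = net(x)
gaussian_nll(mean, logvar, y).backward()
```

PyTorch's built-in `nn.GaussianNLLLoss` implements essentially the same objective.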
Fast R-CNN
This paper proposes a Fast Region-based Convolutional Network method (Fast R-CNN) for object detection. Fast R-CNN builds on previous work to efficiently classify object proposals using deep convolutional networks. Compared to previous work, Fast R-C...
GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent
In this paper, we present GossipGraD - a gossip communication protocol based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall...
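A toy sketch of the gossip-style parameter mixing this family of methods relies on: instead of a global all-reduce, each worker averages with one rotating partner per step. The pairing schedule below is illustrative and is not GossipGraD's actual communication plan; the local gradient steps between gossip rounds are omitted.

```python
import torch

num_workers, dim = 8, 16
params = [torch.randn(dim) for _ in range(num_workers)]   # each worker's parameter vector

def gossip_round(params, step):
    offset = 1 << (step % 3)                # rotating partner pattern (illustrative)
    return [0.5 * (params[i] + params[(i + offset) % num_workers])
            for i in range(num_workers)]

for step in range(30):
    # (a real run would interleave local gradient steps here)
    params = gossip_round(params, step)

print(torch.stack(params).std(dim=0).max())   # workers drift toward consensus
```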
Gradient Estimation Using Stochastic Computation Graphs
In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world. Es...
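A tiny example of the estimation problem described above: the loss is an expectation over samples from a parameterized distribution, and the score-function (likelihood-ratio) surrogate loss, one of the estimators this framework covers, gives unbiased gradients through the sampling step. The distribution and cost below are illustrative.

```python
import torch

# Minimize E_{z ~ N(mu, 1)}[(z - 3)^2] over mu via the score-function estimator.
mu = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([mu], lr=0.05)

for step in range(200):
    dist = torch.distributions.Normal(mu, 1.0)
    z = dist.sample((256,))                   # sampling blocks the direct gradient path
    cost = (z - 3.0) ** 2
    # Surrogate loss: log-prob weighted by the detached cost; its gradient is an
    # unbiased estimate of d/dmu E[cost].
    surrogate = (dist.log_prob(z) * cost.detach()).mean()
    opt.zero_grad()
    surrogate.backward()
    opt.step()

print(mu.item())   # approaches 3
```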
PowerSGD: Practical Low-Rank Gradient Compression for Distributed Optimization
We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well or fail to achieve the target ...
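A single-matrix sketch of the rank-r compression step PowerSGD is built on, assuming one power iteration with an orthogonalized left factor; the all-reduce of the factors and the error-feedback buffer are only hinted at in comments. Shapes and the rank are illustrative.

```python
import torch

def powersgd_step(grad: torch.Tensor, q: torch.Tensor):
    """One power-iteration compression of grad (n x m) into factors p (n x r), q (m x r)."""
    p = grad @ q                       # n x r (all-reduced across workers in the real method)
    p, _ = torch.linalg.qr(p)          # orthogonalize p
    q_new = grad.t() @ p               # m x r (all-reduced as well)
    return p, q_new

n, m, rank = 256, 128, 4
grad = torch.randn(n, m)
q = torch.randn(m, rank)               # q is reused across iterations (warm start)

p, q = powersgd_step(grad, q)
approx = p @ q.t()                     # decompressed low-rank gradient
print(grad.numel(), p.numel() + q.numel())   # raw vs. transmitted element counts
residual = grad - approx               # error feedback carries this into the next step
```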
Reducing Activation Recomputation in Large Transformer Models
Training large transformer models is one of the most important computational challenges of modern AI. In this paper, we show how to significantly accelerate training of large transformer models by reducing activation recomputation. Activation recompu...
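For context, the baseline this work improves on is full activation checkpointing, which drops activations in the forward pass and recomputes them during backward; the generic PyTorch form is shown below. This is plain `torch.utils.checkpoint` usage, not the paper's selective-recomputation or sequence-parallel scheme.

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(8, 512, requires_grad=True)

# Activations inside `block` are not kept; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```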
Spectral Normalization for Generative Adversarial Networks
One of the challenges in the study of generative adversarial networks is the instability of their training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. O...
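A short usage sketch of the technique via PyTorch's built-in spectral-norm parametrization, which rescales the weight by its largest singular value estimated with power iteration; the layer shape is illustrative.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

# Discriminator output layer whose weight is divided by its spectral norm.
disc_layer = spectral_norm(nn.Linear(128, 1))

x = torch.randn(16, 128)
out = disc_layer(x)

# The effective weight now has spectral norm close to 1.
print(torch.linalg.matrix_norm(disc_layer.weight, ord=2))
```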
TensorFlow Distributions
The TensorFlow Distributions library implements a vision of probability theory adapted to the modern deep-learning paradigm of end-to-end differentiable computation. Building on two basic abstractions, it offers flexible building blocks for probabili...
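The library described above now lives on as the `tfp.distributions` module of TensorFlow Probability; below is a minimal sketch (assuming `tensorflow` and `tensorflow_probability` are installed) of its batched, differentiable distribution objects.

```python
import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions

# A batch of two Normal distributions; sample and log_prob broadcast over the batch.
dist = tfd.Normal(loc=[0.0, 1.0], scale=[1.0, 0.5])
samples = dist.sample(4)             # shape (4, 2)
log_probs = dist.log_prob(samples)   # shape (4, 2)

# Densities are differentiable, so they compose with end-to-end training.
loc = tf.Variable([0.0, 1.0])
with tf.GradientTape() as tape:
    nll = -tf.reduce_mean(tfd.Normal(loc=loc, scale=1.0).log_prob(samples))
grads = tape.gradient(nll, [loc])
```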
The continuous Bernoulli: fixing a pervasive error in variational autoencoders
Variational autoencoders (VAE) have quickly become a central tool in machine learning, applicable to a broad range of data types and latent variable models. By far the most common first step, taken by seminal papers and by core software libraries ali...
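A small sketch of the fix, assuming PyTorch's `torch.distributions.ContinuousBernoulli`: for [0,1]-valued (not binary) data, the properly normalized log-likelihood differs from the usual binary-cross-entropy reconstruction term by a normalizing constant that depends on the decoder output.

```python
import torch
import torch.nn.functional as F
from torch.distributions import ContinuousBernoulli

x = torch.rand(4, 784)         # pixel intensities in [0, 1], not binary
logits = torch.randn(4, 784)   # decoder outputs

# Common (unnormalized) VAE reconstruction term: negative binary cross-entropy.
bce_loglik = -F.binary_cross_entropy_with_logits(logits, x, reduction="none")

# Properly normalized continuous Bernoulli log-likelihood.
cb_loglik = ContinuousBernoulli(logits=logits).log_prob(x)

# The gap is the log normalizing constant, which depends on the decoder parameters.
print((cb_loglik - bce_loglik).mean())
```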
Trivializations for Gradient-Based Optimization on Manifolds
We introduce a framework to study the transformation of problems with manifold constraints into unconstrained problems through parametrizations in terms of a Euclidean space. We call these parametrizations "trivializations". We prove conditions under...
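A short sketch of the classic trivialization of the orthogonal group: parametrize an orthogonal matrix as the matrix exponential of a skew-symmetric matrix built from an unconstrained Euclidean parameter, so ordinary gradient descent stays on the manifold. PyTorch's `torch.nn.utils.parametrizations.orthogonal` packages the same idea; the hand-rolled version below, with an illustrative least-squares objective, is for clarity.

```python
import torch

n = 5
A = torch.randn(n, n, requires_grad=True)        # unconstrained Euclidean parameter
target = torch.linalg.qr(torch.randn(n, n)).Q    # some fixed orthogonal target
opt = torch.optim.Adam([A], lr=0.1)

for _ in range(300):
    skew = A - A.t()                   # skew-symmetric matrix
    Q = torch.matrix_exp(skew)         # exp of a skew-symmetric matrix is orthogonal
    loss = ((Q - target) ** 2).sum()
    opt.zero_grad()
    loss.backward()                    # gradients flow through the trivialization
    opt.step()

print(torch.dist(Q.t() @ Q, torch.eye(n)))   # ~0: the iterate stays orthogonal
```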
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
We present weight normalization: a reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction. By reparameterizing the weights in this way we improve the conditioning of the opt...
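A minimal sketch of the reparameterization w = g · v / ||v|| described above, hand-rolled for a linear layer, with PyTorch's built-in helper noted at the end; sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Decouple the direction v from the per-output length g.
out_f, in_f = 4, 16
v = nn.Parameter(torch.randn(out_f, in_f))
g = nn.Parameter(torch.ones(out_f, 1))

def weight_normed_linear(x):
    w = g * v / v.norm(dim=1, keepdim=True)   # w = g * v / ||v||
    return F.linear(x, w)

y = weight_normed_linear(torch.randn(2, in_f))

# Built-in equivalent:
from torch.nn.utils import weight_norm
layer = weight_norm(nn.Linear(in_f, out_f))   # adds weight_g and weight_v parameters
```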
Zero Bubble Pipeline Parallelism
Pipeline parallelism is one of the key components for large-scale distributed training, yet its efficiency suffers from pipeline bubbles which were deemed inevitable. In this work, we introduce a scheduling strategy that, to our knowledge, is the fir...
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited devi...
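A toy single-process sketch of the stage-1 idea: each data-parallel rank keeps optimizer state only for its shard of the parameters, updates that shard, and the full parameter vector is then reassembled (an all-gather in a real system). Everything below is simulated with plain tensors; hyperparameters are illustrative and bias correction is omitted.

```python
import torch

world_size = 4
params = torch.randn(1024)             # flat view of the model parameters
grads = torch.randn(1024)              # gradients, identical on every rank after all-reduce
param_shards = list(params.chunk(world_size))
grad_shards = grads.chunk(world_size)

# Each "rank" holds Adam state only for its own shard: 1/world_size of the optimizer memory.
opt_state = [{"m": torch.zeros_like(s), "v": torch.zeros_like(s)} for s in param_shards]

lr, beta1, beta2, eps = 1e-3, 0.9, 0.999, 1e-8
for rank in range(world_size):
    g, st = grad_shards[rank], opt_state[rank]
    st["m"] = beta1 * st["m"] + (1 - beta1) * g
    st["v"] = beta2 * st["v"] + (1 - beta2) * g * g
    param_shards[rank] = param_shards[rank] - lr * st["m"] / (st["v"].sqrt() + eps)

# "All-gather" step: every rank ends up with the fully updated parameters.
params = torch.cat(param_shards)
```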