Papers
Browse academic papers referenced in production code
Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a mu...
Faster-Than-Native Alternatives for x86 VP2INTERSECT Instructions
We present faster-than-native alternatives for the full AVX512-VP2INTERSECT instruction subset using basic AVX512F instructions. These alternatives compute only one of the output masks, which is sufficient for the typical case of computing the inters...
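As a rough illustration of the "one output mask" case the abstract mentions, here is a plain NumPy analogue; it shows only the semantics (which lanes of one input have a match in the other), not the AVX512F instruction sequence from the paper:

```python
# NumPy analogue of computing only one output mask: for the common case
# (e.g. intersecting sorted ID lists) you only need to know which lanes of
# `a` have a match in `b`, not the reverse mask that VP2INTERSECT also produces.
import numpy as np

def one_sided_match_mask(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Return a boolean mask: mask[i] is True iff a[i] occurs anywhere in b."""
    return (a[:, None] == b[None, :]).any(axis=1)

a = np.array([1, 4, 7, 9, 12, 15, 20, 31])   # one "register" of 8 elements
b = np.array([2, 4, 9, 10, 15, 22, 31, 40])
mask = one_sided_match_mask(a, b)
print(mask)     # [False  True False  True False  True False  True]
print(a[mask])  # the intersection seen from a's side: [ 4  9 15 31]
```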
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training th...
Going deeper with Image Transformers
Transformers have recently been adapted for large-scale image classification, achieving high scores and shaking up the long supremacy of convolutional neural networks. However, the optimization of image transformers has been little studied so far. In this...
LoRA: Low-Rank Adaptation of Large Language Models
An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, beco...
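The excerpt stops before the method itself; as a reminder of the low-rank idea the title refers to, here is a minimal NumPy sketch in which a frozen weight is adapted through a trainable rank-r product. Shapes, the scaling factor, and initialization details are illustrative rather than the paper's exact recipe:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 128, 4              # r << min(d_out, d_in)

W = rng.normal(size=(d_out, d_in))       # pretrained weight, kept frozen
A = rng.normal(size=(r, d_in)) * 0.01    # trainable low-rank factor
B = np.zeros((d_out, r))                 # starts at zero so the update starts at zero
alpha = 1.0                              # illustrative scaling factor

def lora_forward(x):
    # Effective weight is W + alpha * (B @ A); only A and B would receive gradients.
    return x @ (W + alpha * (B @ A)).T

x = rng.normal(size=(2, d_in))
print(lora_forward(x).shape)  # (2, 64)
```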
RLIBM-PROG: Progressive Polynomial Approximations for Fast Correctly Rounded Math Libraries
This paper presents a novel method for generating a single polynomial approximation that produces correctly rounded results for all inputs of an elementary function for multiple representations. The generated polynomial approximation has the nice pro...
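As a loose illustration of the "evaluate one polynomial, round once to the target representation" shape of such libraries (the coefficients below are placeholders, not ones generated by RLIBM-PROG):

```python
import numpy as np

# Hypothetical coefficients for an odd polynomial p(x) = c1*x + c3*x^3 + c5*x^5.
# RLIBM-style libraries generate coefficients so that the *rounded* result is
# correct for every input of the target formats; these values are placeholders.
coeffs = [0.999999, -0.166664, 0.008315]

def poly_eval(x: float) -> float:
    # Horner evaluation in double precision of c1*x + c3*x^3 + c5*x^5.
    x2 = x * x
    acc = coeffs[2]
    acc = acc * x2 + coeffs[1]
    acc = acc * x2 + coeffs[0]
    return acc * x

def rounded_result(x: float) -> np.float32:
    # Evaluate once in higher precision, then round once to the target format.
    return np.float32(poly_eval(x))

print(rounded_result(0.25))
```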
Self-attention Does Not Need $O(n^2)$ Memory
We present a very simple algorithm for attention that requires $O(1)$ memory with respect to sequence length and an extension to self-attention that requires $O(\log n)$ memory. This is in contrast with the frequently stated belief that self-attentio...
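The key trick behind this result is to accumulate the softmax numerator and denominator incrementally instead of materializing the full score matrix. A minimal single-query NumPy sketch of that accumulation (chunk size and shapes are illustrative, and this is not the paper's exact formulation):

```python
import numpy as np

def attention_constant_memory(q, K, V, chunk=4):
    """Attention for one query q against K, V without forming the full score vector.
    Keeps only a running max, a running weighted sum, and a running normalizer."""
    m = -np.inf                  # running max of scores (numerical stability)
    acc = np.zeros(V.shape[1])   # running sum of exp(score) * value
    denom = 0.0                  # running sum of exp(score)
    for start in range(0, K.shape[0], chunk):
        s = K[start:start + chunk] @ q                  # scores for this chunk
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        w = np.exp(s - m_new)
        acc = acc * scale + w @ V[start:start + chunk]
        denom = denom * scale + w.sum()
        m = m_new
    return acc / denom

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
ref = (lambda s: (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V)(K @ q)
print(np.allclose(attention_constant_memory(q, K, V), ref))  # True
```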
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers ...
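A minimal NumPy sketch of top-1 (switch-style) routing, where each token is dispatched to a single expert and scaled by its router probability; the dimensions and the missing pieces (capacity limits, load-balancing loss) are simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts = 6, 8, 4

tokens = rng.normal(size=(n_tokens, d_model))
router_w = rng.normal(size=(d_model, n_experts))           # router weights
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one weight matrix per expert

# Top-1 routing: each token goes to exactly one expert, scaled by its router probability.
logits = tokens @ router_w
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
choice = probs.argmax(axis=-1)

out = np.empty_like(tokens)
for e in range(n_experts):
    sel = choice == e
    if sel.any():
        out[sel] = (tokens[sel] @ expert_w[e]) * probs[sel, e][:, None]
print(choice, out.shape)
```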
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used ...
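A minimal NumPy sketch of the patch-embedding step the title alludes to: the image is cut into 16x16 patches, and each patch is flattened and linearly projected into a token. Sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 32; C = 3; P = 16                   # image size, channels, patch size
img = rng.normal(size=(H, W, C))

# Split the image into non-overlapping P x P patches and flatten each one,
# turning the image into a short "sentence" of patch tokens.
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)    # (num_patches, P*P*C) = (4, 768)

d_model = 64
proj = rng.normal(size=(P * P * C, d_model))  # learned linear patch embedding
tokens = patches @ proj                       # (4, 64): input sequence for a Transformer
print(patches.shape, tokens.shape)
```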
Automatic Cross-Replica Sharding of Weight Update in Data-Parallel Training
In data-parallel synchronous training of deep neural networks, different devices (replicas) run the same program with different partitions of the training batch, but weight update computation is repeated on all replicas, because the weights do not ha...
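A toy NumPy simulation of the sharded-update idea: instead of every replica repeating the identical full weight update, each replica updates only its shard of the weights and the shards are gathered afterwards. The reduce and gather steps below are stand-ins for the real collectives:

```python
import numpy as np

rng = np.random.default_rng(0)
n_replicas, n_params = 4, 12
weights = rng.normal(size=n_params)
grads = [rng.normal(size=n_params) for _ in range(n_replicas)]  # one gradient per replica
lr = 0.1

# Each replica owns a contiguous 1/n_replicas shard of the parameters.
shards = np.array_split(np.arange(n_params), n_replicas)
updated_shards = []
for r in range(n_replicas):
    idx = shards[r]
    # Reduce-scatter stand-in: only this replica's shard of the gradient is reduced here.
    g_shard = np.mean([g[idx] for g in grads], axis=0)
    updated_shards.append(weights[idx] - lr * g_shard)   # update computed once per shard
weights = np.concatenate(updated_shards)                 # all-gather stand-in
print(weights.shape)
```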
Conformer: Convolution-augmented Transformer for Speech Recognition
Recently, Transformer and Convolutional Neural Network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent Neural Networks (RNNs). Transformer models are good at capturing content-based global i...
Demystifying Why Local Aggregation Helps: Convergence Analysis of Hierarchical SGD
Hierarchical SGD (H-SGD) has emerged as a new distributed SGD algorithm for multi-level communication networks. In H-SGD, before each global aggregation, workers send their updated local models to local servers for aggregations. Despite recent resear...
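A toy sketch of the two-level aggregation pattern described here: workers run local steps, each local server averages its own workers, and a global average is taken across servers. The local-training stand-in and the aggregation schedule are simplifications:

```python
import numpy as np

rng = np.random.default_rng(0)
n_groups, workers_per_group, dim = 2, 3, 5
model = np.zeros(dim)

def local_sgd_steps(w, steps=4, lr=0.1):
    # Stand-in for a few local SGD steps on a worker's own data shard.
    for _ in range(steps):
        w = w - lr * (w - rng.normal(size=dim))
    return w

for global_round in range(3):
    group_models = []
    for g in range(n_groups):
        # Each worker trains locally, then its local server averages the group.
        workers = [local_sgd_steps(model.copy()) for _ in range(workers_per_group)]
        group_models.append(np.mean(workers, axis=0))   # local aggregation
    model = np.mean(group_models, axis=0)               # global aggregation
print(model)
```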
DNS Cache Poisoning Attack Reloaded: Revolutions with Side Channels
In this paper, we report a series of flaws in the software stack that leads to a strong revival of DNS cache poisoning --- a classic attack which is mitigated in practice with simple and effective randomization-based defenses such as randomized sourc...
Efficient Memory Management for Deep Neural Net Inference
While deep neural net inference was long considered a task for servers only, the latest advances in technology allow inference to be moved to mobile and embedded devices, which is desirable for reasons ranging from latency to privacy. These devices a...
Enabling Fast Differentially Private SGD via Just-in-Time Compilation and Vectorization
A common pain point in differentially private machine learning is the significant runtime overhead incurred when executing Differentially Private Stochastic Gradient Descent (DPSGD), which may be as large as two orders of magnitude. We thoroughly dem...
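A small NumPy sketch of the step that makes DP-SGD expensive, per-example gradient clipping, vectorized for a linear model instead of looping over examples one at a time; the hyperparameters are placeholders and this is not the paper's JIT-compiled pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 128, 10
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = np.zeros(d)
clip_norm, noise_mult, lr = 1.0, 1.1, 0.1

# One DP-SGD step for squared loss: per-example gradients, per-example clipping,
# then Gaussian noise added to the summed (clipped) gradient.
residual = X @ w - y
per_example_grads = residual[:, None] * X                    # (n, d), one gradient per example
norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_mult * clip_norm, size=d)
w = w - lr * noisy_sum / n
print(w)
```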
Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference
Modern deep neural networks increasingly make use of features such as dynamic control flow, data structures and dynamic tensor shapes. Existing deep learning systems focus on optimizing and executing static neural networks which assume a pre-determin...
Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits
Many of the recent triumphs in machine learning are dependent on well-tuned hyperparameters. This is particularly prominent in reinforcement learning (RL) where a small change in the configuration can lead to failure. Despite the importance of tuning...
Scalable range locks for scalable address spaces and beyond
Range locks are a synchronization construct designed to provide concurrent access to multiple threads (or processes) to disjoint parts of a shared resource. Originally conceived in the file system context, range locks are gaining increasing interest ...
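A minimal Python sketch of the range-lock interface: a thread blocks only if its requested range overlaps one that is already held. This is a simple condition-variable version, not the scalable design the paper proposes:

```python
import threading

class RangeLock:
    """Threads acquire [start, end) ranges and block only when their range
    overlaps one that is already held."""

    def __init__(self):
        self._cond = threading.Condition()
        self._held = []  # list of (start, end) currently locked ranges

    def acquire(self, start, end):
        with self._cond:
            while any(s < end and start < e for s, e in self._held):
                self._cond.wait()
            self._held.append((start, end))

    def release(self, start, end):
        with self._cond:
            self._held.remove((start, end))
            self._cond.notify_all()

lock = RangeLock()
lock.acquire(0, 100)
lock.acquire(100, 200)   # disjoint range: acquired without blocking
lock.release(0, 100)
lock.release(100, 200)
```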
Stochastic Weight Averaging in Parallel: Large-Batch Training that Generalizes Well
We propose Stochastic Weight Averaging in Parallel (SWAP), an algorithm to accelerate DNN training. Our algorithm uses large mini-batches to compute an approximate solution quickly and then refines it by averaging the weights of multiple models compu...
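A toy sketch of the two-phase structure described in the abstract: a quick large-batch phase produces an approximate solution, then several workers refine it independently and their weights are averaged. The "training" below is random perturbation, purely to show the averaging step:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_workers = 8, 4

# Phase 1 stand-in: a large-batch run yields a quick approximate solution.
w_approx = rng.normal(size=dim)

# Phase 2: each worker refines independently (small batches in the real algorithm),
# then the refined weights are averaged.
refined = [w_approx + 0.1 * rng.normal(size=dim) for _ in range(n_workers)]
w_final = np.mean(refined, axis=0)
print(w_final.shape)
```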
Test-Case Reduction via Test-Case Generation: Insights from the Hypothesis Reducer (Tool Insights Paper)
We describe internal test-case reduction, the method of test-case reduction employed by Hypothesis, a widely-used property-based testing library for Python. The key idea of internal test-case reduction is that instead of applying test-case reduction ...
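The excerpt cuts off mid-idea; the paper's point is that reduction is applied to the sequence of random choices that generated the failing example, so every shrunk candidate is valid by construction. A toy version of that idea (the generator, failure predicate, and shrink passes are illustrative, not Hypothesis's actual internals):

```python
# Shrink the recorded choice sequence that drives a generator, then re-run the
# generator, so every shrunk input is a valid generated value by construction.

def generate(choices):
    """Build a list of small ints from a sequence of raw choices."""
    return [c % 10 for c in choices]

def fails(test_case):
    """The 'bug': the test fails whenever the list contains a value >= 7."""
    return any(v >= 7 for v in test_case)

def reduce_choices(choices):
    current = list(choices)
    improved = True
    while improved:
        improved = False
        # Pass 1: try deleting individual choices.
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if fails(generate(candidate)):
                current, improved = candidate, True
                break
        # Pass 2: try replacing choices with smaller values.
        for i, c in enumerate(current):
            if c > 0:
                candidate = current[:i] + [0] + current[i + 1:]
                if fails(generate(candidate)):
                    current, improved = candidate, True
                    break
    return current

original = [35, 2, 18, 47, 9, 3]
shrunk = reduce_choices(original)
print(shrunk, generate(shrunk))   # a short choice list whose generated value still fails
```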