microsoft/onnxruntime

Denoising diffusion probabilistic models (DDPMs) have achieved high quality image generation without adversarial training, yet they require simulating a Markov chain for many steps to produce a sample. To accelerate sampling, we present denoising diffusion implicit models (DDIMs), a more efficient c...

4 references in code:

onnxruntime/python/tools/transformers/models/stable_diffusion/diffusion_schedulers.py:127

onnxruntime/python/tools/transformers/models/stable_diffusion/diffusion_schedulers.py:147

onnxruntime/python/tools/transformers/models/stable_diffusion/diffusion_schedulers.py:174

onnxruntime/python/tools/transformers/models/stable_diffusion/diffusion_schedulers.py:177

Spatial Transformer Networks

Max Jaderberg, Karen Simonyan, Andrew Zisserman, Koray Kavukcuoglu

2015

14 references

Convolutional Neural Networks define an exceptionally powerful class of models, but are still limited by the lack of ability to be spatially invariant to the input data in a computationally and parameter efficient manner. In this work we introduce a new learnable module, the Spatial Transformer, whi...

2 references in code:

docs/ContribOperators.md:2419

onnxruntime/core/graph/contrib_ops/contrib_defs.cc:987

Group Normalization

Yuxin Wu, Kaiming He

2018

10 references

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems --- BN's error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics es...

2 references in code:

docs/ContribOperators.md:2466

onnxruntime/core/graph/contrib_ops/diffusion_defs.cc:25

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer

2023

2 references

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Ran...

2 references in code:

docs/ContribOperators.md:2743

onnxruntime/core/graph/contrib_ops/contrib_defs.cc:3607

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

William Fedus, Barret Zoph, Noam Shazeer

2021

2 references

In deep learning, models typically reuse the same parameters for all inputs. Mixture of Experts (MoE) defies this and instead selects different parameters for each incoming example. The result is a sparsely-activated model -- with outrageous numbers of parameters -- but a constant computational cost...

2 references in code:

docs/ContribOperators.md:3079

onnxruntime/core/graph/contrib_ops/contrib_defs.cc:1387

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts

Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi...

2021

2 references

Scaling language models with more data, compute and parameters has driven significant progress in natural language processing. For example, thanks to scaling, GPT-3 was able to achieve strong results on in-context learning tasks. However, training these large dense models requires significant amount...

2 references in code:

docs/ContribOperators.md:3080

onnxruntime/core/graph/contrib_ops/contrib_defs.cc:1388

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree...

2024

2 references

We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench)...

2 references in code:

docs/ContribOperators.md:5796

onnxruntime/core/graph/contrib_ops/bert_defs.cc:1455

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever

2019

2 references

Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \sqrt{n})$. We also introduce a) a variation on architecture and initialization...

2 references in code:

docs/ContribOperators.md:5798

onnxruntime/core/graph/contrib_ops/bert_defs.cc:1457

FP8 Formats for Deep Learning

Paulius Micikevicius, Dušan Stošić, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwait...

2022

24 references

FP8 is a natural progression for accelerating deep learning training inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bi...

1 reference in code:

csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/OnnxMl.cs:4501

8-bit Numerical Formats for Deep Neural Networks

Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, Carlo Luschi

2022

16 references

Given the current trend of increasing size and complexity of machine learning architectures, it has become of critical importance to identify new approaches to improve the computational efficiency of model training. In this context, we address the advantages of floating-point over fixed-point repres...

1 reference in code:

csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/OnnxMl.cs:4502

Self-Attention with Relative Position Representations

Peter Shaw, Jakob Uszkoreit, Ashish Vaswani

2018

1 reference

Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structu...

1 reference in code:

docs/ContribOperators.md:5245

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio

2014

3 references

Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed re...

1 reference in code:

onnxruntime/contrib_ops/cpu/attnlstm/bahdanau_attention.h:14

Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models

Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee

2023

1 reference

Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small batch inference settings (e.g. mobile devices). Weight-only quantization can be a promisin...

1 reference in code:

onnxruntime/python/tools/quantization/matmul_nbits_quantizer.py:212

SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han

2022

1 reference

Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, existing methods cannot maintain accuracy and hardware efficiency at the same time. We propose SmoothQuant, a training-free, accuracy-prese...

1 reference in code:

onnxruntime/python/tools/transformers/models/llama/convert_to_onnx.py:701

Elucidating the Design Space of Diffusion-Based Generative Models

Tero Karras, Miika Aittala, Timo Aila, Samuli Laine

2022

1 reference

We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training ...

1 reference in code:

onnxruntime/python/tools/transformers/models/stable_diffusion/diffusion_schedulers.py:556

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

2022

6 references

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not ach...

1 reference in code:

onnxruntime/python/tools/transformers/models/stable_diffusion/README.md:5

Self-attention Does Not Need $O(n^2)$ Memory

Markus N. Rabe, Charles Staats

2021

2 references

We present a very simple algorithm for attention that requires $O(1)$ memory with respect to sequence length and an extension to self-attention that requires $O(\log n)$ memory. This is in contrast with the frequently stated belief that self-attention requires $O(n^2)$ memory. While the time complex...

1 reference in code:

onnxruntime/python/tools/transformers/models/stable_diffusion/README.md:6

Adam: A Method for Stochastic Optimization

Diederik P. Kingma, Jimmy Ba

2014

22 references

We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescal...

1 reference in code:

orttraining/orttraining/python/training/optim/fused_adam.py:62
