triton-lang/triton

Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code.

Riyadh Baghdadi, Jessica M. Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming ...

2019

2 references

This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel commands to explicitly manage the complexities that arise when targeting...

View Paper PDF DOI

Show 2 references in code

docs/programming-guide/chapter-1/introduction.rst:66

docs/programming-guide/chapter-2/related-work.rst:205

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S....

2018

5 references

Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing framework...

View Paper PDF DOI

Show 2 references in code

docs/programming-guide/chapter-1/introduction.rst:67

docs/programming-guide/chapter-2/related-work.rst:206

Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation.

T. Grosser, Armin Größlinger, C. Lengauer

2012

2 references

The polyhedral model for loop parallelization has proved to be an effective tool for advanced optimization and automatic parallelization of programs in higher-level languages. Yet, to integrate such optimizations seamlessly into production compilers, they must be performed on the compiler's internal...

View Paper DOI

Show 2 references in code

docs/programming-guide/chapter-2/related-work.rst:115

docs/programming-guide/chapter-2/related-work.rst:209

Automatically scheduling halide image processing pipelines.

Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan‐Kelley, Kayvon Fatahalian

2016

2 references

The Halide image processing language has proven to be an effective system for authoring high-performance image processing code. Halide programmers need only provide a high-level strategy for mapping an image processing pipeline to a parallel machine (a schedule ), and the Halide compiler carries out...

View Paper PDF DOI

Show 2 references in code

docs/programming-guide/chapter-2/related-work.rst:159

docs/programming-guide/chapter-2/related-work.rst:213

Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$

Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, P...

2025

2 references

Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to achieve flexible and performant design and implementation of tensor layouts -- mappings between logical tensors and hardware resources. The increasing complexity of DL algorithms...

View Paper PDF

Show 2 references in code

python/triton/experimental/gluon/language/_layouts.py:161

python/tutorials/gluon/02-layouts.py:849

Dropout: a simple way to prevent neural networks from overfitting.

Nitish Srivastava, Geoffrey E. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov

2014

2 references

View Paper DOI

Show 2 references in code

python/tutorials/04-low-memory-dropout.py:21

python/tutorials/04-low-memory-dropout.py:175

Parallel random numbers: as easy as 1, 2, 3.

John K. Salmon, Mark A. Moraes, Ron O. Dror, David E. Shaw

2011

2 references

Most pseudorandom number generators (PRNGs) scale poorly to massively parallel high-performance computation because they are designed as sequentially dependent state transformations. We demonstrate that independent, keyed transformations of counters produce a large alternative class of PRNGs with ex...

View Paper DOI

Show 2 references in code

python/tutorials/04-low-memory-dropout.py:105

python/tutorials/04-low-memory-dropout.py:174

Layer Normalization

Chengyong Si, Jianqiang Shen, Xuan Zou, Lei Wang, Qidi Wu

2016

8 references

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of train...

View Paper PDF DOI

Show 2 references in code

python/tutorials/05-layer-norm.py:19

python/tutorials/05-layer-norm.py:381

Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Ren...

2025

1 reference

In this report, we propose Triton-distributed, an extension of existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for distributed AI workloads, providing a good coverage o...

View Paper PDF

Show 1 reference in code

docs/meetups/11-05-2025/notes.md:21

Sequence to Sequence Learning with Neural Networks.

Ilya Sutskever, Oriol Vinyals, Quoc V. Le

2014

1 reference

View Paper

Show 1 reference in code

docs/programming-guide/chapter-1/introduction.rst:63

You Only Look Once: Unified, Real-Time Object Detection.

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi

2016

1 reference

View Paper DOI

Show 1 reference in code

docs/programming-guide/chapter-1/introduction.rst:64

Superhuman Accuracy on the SNEMI3D Connectomics Challenge.

Kisuk Lee, Jonathan Zung, Peter Li, Viren Jain, H. Sebastian Seung

2017

1 reference

View Paper

Show 1 reference in code

docs/programming-guide/chapter-1/introduction.rst:65

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Jonathan Ragan‐Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe

2013

1 reference

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance d...

View Paper DOI

Show 1 reference in code

docs/programming-guide/chapter-1/introduction.rst:68

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Ley...

2018

3 references

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and acce...

View Paper PDF

Show 1 reference in code

docs/programming-guide/chapter-1/introduction.rst:69

The Cache Performance and Optimizations of Blocked Algorithms.

Monica D. Lam, Edward Rothberg, Michael E. Wolf

1991

1 reference

article Free Access Share on The cache performance and optimizations of blocked algorithms Authors: Monica D. Lam View Profile , Edward E. Rothberg View Profile , Michael E. Wolf View Profile Authors Info & Claims ACM SIGOPS Operating Systems ReviewVolume 25Issue Special IssueApr. 1991 pp 63–74https...

View Paper DOI

Show 1 reference in code

docs/programming-guide/chapter-1/introduction.rst:70

LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation.

Chris Lattner, Vikram S. Adve

2004

1 reference

We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs...

View Paper PDF DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:200

More iteration space tiling.

Michael Wolfe

1989

1 reference

Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages. Tiles become a natural candidate as the unit of work for parallel task scheduling. Synchronization between processors can be done between tiles, reducing synchronization frequency (at som...

View Paper PDF DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:201

On the Complexity of Loop Fusion.

Alain Darte

1999

1 reference

Loop fusion is a program transformation that combines several loops into one. It is used in parallelizing compilers mainly for increasing the granularity of loops and for improving data reuse. The goal of this paper is to study, from a theoretical point of view, several variants of the loop fusion p...

View Paper DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:202

Automatic loop interchange.

John R. Allen, Ken Kennedy

1984

1 reference

Parallel and vector machines are becoming increasingly important to many computation intensive applications. Effectively utilizing such architectures, particularly from sequential languages such as Fortran, has demanded increasingly sophisticated compilers. In general, a compiler needs to significan...

View Paper PDF DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:203

Scanning Polyhedra with DO Loops.

Corinne Ancourt, François Irigoin

1991

1 reference

Article Scanning polyhedra with DO loops Share on Authors: Corinne Ancourt View Profile , François Irigoin View Profile Authors Info & Claims PPOPP '91: Proceedings of the third ACM SIGPLAN symposium on Principles and practice of parallel programmingApril 1991 Pages 39–50https://doi.org/10.1145/1096...

View Paper DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:204

Diesel: DSL for linear algebra and neural net computations on GPUs.

Venmugil Elango, Norm Rubin, M. Ravishankar, Hariharan Sandanagobalane, Vinod Grover

2018

1 reference

We present a domain specific language compiler, Diesel, for basic linear algebra and neural network computations, that accepts input expressions in an intuitive form and generates high performing code for GPUs. The current trend is to represent a neural network as a computation DAG, where each node ...

View Paper DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:207

MLIR: A Compiler Infrastructure for the End of Moore's Law.

Chris Lattner, Jacques A. Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen 0001, Ta...

2020

1 reference

View Paper

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:208

Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies.

Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, Olivier...

2006

1 reference

View Paper PDF DOI

Show 1 reference in code

docs/programming-guide/chapter-2/related-work.rst:211

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré

2022

6 references

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not ach...

View Paper PDF DOI

Show 1 reference in code

python/tutorials/06-fused-attention.py:11

Self-attention Does Not Need $O(n^2)$ Memory

Markus N. Rabe, Charles Staats

2021

2 references

We present a very simple algorithm for attention that requires $O(1)$ memory with respect to sequence length and an extension to self-attention that requires $O(\log n)$ memory. This is in contrast with the frequently stated belief that self-attention requires $O(n^2)$ memory. While the time complex...

View Paper PDF

Show 1 reference in code

python/tutorials/06-fused-attention.py:12

triton-lang/triton

Paper References by File

Papers Referenced in This Repository