25 papers
8 files
33 references

Papers Referenced in This Repository

Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code.

Riyadh Baghdadi, Jessica M. Ray, Malek Ben Romdhane, Emanuele Del Sozzo, Abdurrahman Akkas, Yunming ...
2019
2 references

This paper introduces Tiramisu, a polyhedral framework designed to generate high performance code for multiple platforms including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel commands to explicitly manage the complexities that arise when targeting...


Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya Goyal, Zachary DeVito, William S....
2018
5 references

Deep learning models with convolutional and recurrent networks are now ubiquitous and analyze massive amounts of audio, image, video, text and graph data, with applications in automatic translation, speech-to-text, scene understanding, ranking user preferences, ad placement, etc. Competing framework...


Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation.

Tobias Grosser, Armin Größlinger, Christian Lengauer
2012
2 references

The polyhedral model for loop parallelization has proved to be an effective tool for advanced optimization and automatic parallelization of programs in higher-level languages. Yet, to integrate such optimizations seamlessly into production compilers, they must be performed on the compiler's internal...


Automatically scheduling halide image processing pipelines.

Ravi Teja Mullapudi, Andrew Adams, Dillon Sharlet, Jonathan Ragan‐Kelley, Kayvon Fatahalian
2016
2 references

The Halide image processing language has proven to be an effective system for authoring high-performance image processing code. Halide programmers need only provide a high-level strategy for mapping an image processing pipeline to a parallel machine (a schedule), and the Halide compiler carries out...


Linear Layouts: Robust Code Generation of Efficient Tensor Computation Using $\mathbb{F}_2$

Keren Zhou, Mario Lezcano, Adam Goucher, Akhmed Rakhmati, Jeff Niu, Justin Lebar, Pawel Szczerbuk, P...
2025
2 references

Efficient tensor computation is a cornerstone of modern deep learning (DL) workloads, yet existing approaches struggle to achieve flexible and performant design and implementation of tensor layouts -- mappings between logical tensors and hardware resources. The increasing complexity of DL algorithms...
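To make the $\mathbb{F}_2$ framing concrete, here is a toy Python sketch (not the paper's implementation): a layout is a 0/1 matrix mapping the bits of a hardware index to the bits of a logical coordinate, applying it is a matrix-vector product mod 2, and composing layouts is matrix multiplication mod 2. The 4-bit swizzle matrix below is hypothetical.

```python
import numpy as np

def bits(x, n):
    # Little-endian bit vector of integer x, length n.
    return np.array([(x >> i) & 1 for i in range(n)], dtype=np.uint8)

def from_bits(v):
    return sum(int(b) << i for i, b in enumerate(v))

def apply_layout(L, idx, n_in):
    # Map a hardware index to a logical coordinate: y = L @ x over F2.
    return from_bits((L @ bits(idx, n_in)) % 2)

# Hypothetical 4-bit swizzle: each row of L lists the input bits whose
# XOR produces that output bit (row 1: coord bit 1 = idx bit 1 ^ idx bit 3).
L = np.array([[1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=np.uint8)

print([apply_layout(L, i, 4) for i in range(16)])
# Composing two layouts is just (A @ B) % 2, which is what makes these
# maps cheap to reason about, compose, and invert (when full rank).
```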


Dropout: a simple way to prevent neural networks from overfitting.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov
2014
2 references
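A minimal NumPy sketch of the technique, in the common inverted-dropout form (scaling at train time so inference needs no rescaling); the rate and shapes are illustrative.

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    # Inverted dropout: zero each unit with probability p during training
    # and scale survivors by 1/(1-p) so expected activations match eval.
    if not training or p == 0.0:
        return x
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p      # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones((2, 4))
print(dropout(x, rng=np.random.default_rng(0)))  # ~half zeros, rest 2.0
print(dropout(x, training=False))                # identity at inference
```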

Parallel random numbers: as easy as 1, 2, 3.

John K. Salmon, Mark A. Moraes, Ron O. Dror, David E. Shaw
2011
2 references

Most pseudorandom number generators (PRNGs) scale poorly to massively parallel high-performance computation because they are designed as sequentially dependent state transformations. We demonstrate that independent, keyed transformations of counters produce a large alternative class of PRNGs with ex...
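The counter-based idea is easy to sketch: replace sequential state updates with a keyed transformation of a counter, so any element of the stream can be computed independently and in parallel. The sketch below uses Python's keyed BLAKE2 hash as the transformation; the paper's actual generators (Philox, Threefry, ARS) use far cheaper block-cipher-style rounds.

```python
import hashlib
import struct

def counter_random(key: bytes, counter: int) -> float:
    # Stateless uniform [0, 1) sample: hash the counter under a key.
    # Any (key, counter) pair is independent -- no shared generator state.
    h = hashlib.blake2b(struct.pack("<Q", counter), key=key, digest_size=8)
    (u,) = struct.unpack("<Q", h.digest())
    return u / 2**64

key = b"stream-42"
# Parallel workers simply take disjoint counter ranges.
print([round(counter_random(key, c), 4) for c in range(5)])
```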


Layer Normalization

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
2016
8 references

Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization uses the distribution of the summed input to a neuron over a mini-batch of train...
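For reference, the normalization itself is a few lines: each sample is standardized over its own feature dimension (rather than over the batch, as in batch normalization), then rescaled by a learned gain and bias. A minimal NumPy sketch:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize each sample over its feature (last) axis -- independent of
    # batch size, unlike batch norm, so it works with batch size 1 and RNNs.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(size=(2, 8))      # (batch, features)
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=-1), y.var(axis=-1))                # ~0 and ~1 per sample
```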


Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler

Size Zheng, Wenlei Bao, Qi Hou, Xuegui Zheng, Jin Fang, Chenhui Huang, Tianqi Li, Haojie Duanmu, Ren...
2025
1 reference

In this report, we propose Triton-distributed, an extension of the existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for distributed AI workloads, providing good coverage o...


Sequence to Sequence Learning with Neural Networks.

Ilya Sutskever, Oriol Vinyals, Quoc V. Le
2014
1 reference

You Only Look Once: Unified, Real-Time Object Detection.

Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
2016
1 reference

Superhuman Accuracy on the SNEMI3D Connectomics Challenge.

Kisuk Lee, Jonathan Zung, Peter Li, Viren Jain, H. Sebastian Seung
2017
1 reference

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

Jonathan Ragan‐Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe
2013
1 reference

Image processing pipelines combine the challenges of stencil computations and stream programs. They are composed of large graphs of different stencil stages, as well as complex reductions, and stages with global or data-dependent access patterns. Because of their complex structure, the performance d...


TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Meghan Cowan, Haichen Shen, Ley...
2018
3 references

There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and acce...


The Cache Performance and Optimizations of Blocked Algorithms.

Monica D. Lam, Edward Rothberg, Michael E. Wolf
1991
1 reference

Published in ACM SIGOPS Operating Systems Review, Vol. 25, Special Issue, Apr. 1991, pp. 63–74.


LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation.

Chris Lattner, Vikram S. Adve
2004
1 reference

We describe LLVM (low level virtual machine), a compiler framework designed to support transparent, lifelong program analysis and transformation for arbitrary programs, by providing high-level information to compiler transformations at compile-time, link-time, run-time, and in idle time between runs...


More iteration space tiling.

Michael Wolfe
1989
1 reference

Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages. Tiles become a natural candidate as the unit of work for parallel task scheduling. Synchronization between processors can be done between tiles, reducing synchronization frequency (at som...
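A minimal Python illustration of the transformation (sizes are illustrative): the tiled sweep visits exactly the same iteration space, but in blocks that fit in cache and that can serve as units of parallel work.

```python
N, TILE = 8, 4   # tiny sizes so the two traversal orders can be compared

def sweep(visit):
    # Original: one row-major sweep of the N x N iteration space.
    for i in range(N):
        for j in range(N):
            visit(i, j)

def sweep_tiled(visit):
    # Tiled: outer loops enumerate TILE x TILE blocks; inner loops scan
    # within a block. Each block is a natural unit for task scheduling.
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for i in range(ii, min(ii + TILE, N)):
                for j in range(jj, min(jj + TILE, N)):
                    visit(i, j)

plain, tiled = [], []
sweep(lambda i, j: plain.append((i, j)))
sweep_tiled(lambda i, j: tiled.append((i, j)))
assert sorted(plain) == sorted(tiled)   # same iterations, different order
```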


On the Complexity of Loop Fusion.

Alain Darte
1999
1 reference

Loop fusion is a program transformation that combines several loops into one. It is used in parallelizing compilers mainly for increasing the granularity of loops and for improving data reuse. The goal of this paper is to study, from a theoretical point of view, several variants of the loop fusion p...
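The transformation itself is easy to show; the paper's subject is the harder question of when fusions are legal and which combination is optimal. A toy Python sketch:

```python
import numpy as np

n = 1 << 16
a = np.arange(n, dtype=np.float64)
b, c = np.empty(n), np.empty(n)

# Unfused: two passes over memory, with b materialized in between.
for i in range(n):
    b[i] = a[i] * 2.0
for i in range(n):
    c[i] = b[i] + 1.0

# Fused: one pass, better reuse, no intermediate traffic. Legal here
# because iteration i of the second loop reads only b[i], which
# iteration i of the first loop has already produced.
c_fused = np.empty(n)
for i in range(n):
    c_fused[i] = a[i] * 2.0 + 1.0

assert np.allclose(c, c_fused)
```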


Automatic loop interchange.

John R. Allen, Ken Kennedy
1984
1 reference

Parallel and vector machines are becoming increasingly important to many computation intensive applications. Effectively utilizing such architectures, particularly from sequential languages such as Fortran, has demanded increasingly sophisticated compilers. In general, a compiler needs to significan...
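A classic instance of the transformation: for a row-major array, interchanging the two loops turns large-stride column-order accesses into unit-stride row-order ones, and more generally lets a compiler move a vectorizable or parallel loop to the position where it pays off. Sketch:

```python
import numpy as np

A = np.zeros((512, 512))   # row-major (C order): each row is contiguous

# Before: column-major traversal, 512-element stride between accesses.
for j in range(A.shape[1]):
    for i in range(A.shape[0]):
        A[i, j] += 1.0

# After interchange: unit-stride inner loop, cache- and vector-friendly.
# Same set of iterations, same result -- legal because no iteration
# depends on any other.
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i, j] += 1.0
```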


Scanning Polyhedra with DO Loops.

Corinne Ancourt, François Irigoin
1991
1 reference

Published in PPoPP '91: Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, April 1991, pp. 39–50.


Diesel: DSL for linear algebra and neural net computations on GPUs.

Venmugil Elango, Norm Rubin, M. Ravishankar, Hariharan Sandanagobalane, Vinod Grover
2018
1 reference

We present a domain specific language compiler, Diesel, for basic linear algebra and neural network computations, that accepts input expressions in an intuitive form and generates high performing code for GPUs. The current trend is to represent a neural network as a computation DAG, where each node ...


MLIR: A Compiler Infrastructure for the End of Moore's Law.

Chris Lattner, Jacques A. Pienaar, Mehdi Amini, Uday Bondhugula, River Riddle, Albert Cohen, Ta...
2020
1 reference

Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies.

Sylvain Girbal, Nicolas Vasilache, Cédric Bastoul, Albert Cohen, David Parello, Marc Sigler, Olivier...
2006
1 reference

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
2022
6 references

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not ach...
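The IO-aware core is a blocked attention loop with an online softmax: per query it carries a running max and normalizer, so the n×n score matrix is never materialized. A minimal single-head NumPy sketch of that recurrence (block size illustrative; no masking, dropout, or the paper's SRAM-level tiling):

```python
import numpy as np

def attention_blocked(Q, K, V, block=64):
    # Exact softmax attention, one key/value block at a time. Extra memory
    # is O(n * block) for scores instead of O(n^2) for the full matrix.
    n, d = Q.shape
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)            # running row-max of the logits
    l = np.zeros(n)                    # running softmax normalizer
    for s in range(0, n, block):
        scores = Q @ K[s:s+block].T / np.sqrt(d)     # (n, block)
        m_new = np.maximum(m, scores.max(axis=1))
        alpha = np.exp(m - m_new)                    # rescale old state
        p = np.exp(scores - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ V[s:s+block]
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(128, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_blocked(Q, K, V), ref)
```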


Self-attention Does Not Need $O(n^2)$ Memory

Markus N. Rabe, Charles Staats
2021
2 references

We present a very simple algorithm for attention that requires $O(1)$ memory with respect to sequence length and an extension to self-attention that requires $O(\log n)$ memory. This is in contrast with the frequently stated belief that self-attention requires $O(n^2)$ memory. While the time complex...
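The constant-memory claim reduces to streaming the softmax for a single query: accumulate the exp-weighted value sum and the normalizer key by key, carrying a running max for numerical stability, and divide once at the end. A minimal sketch (names illustrative):

```python
import numpy as np

def attend_one_query(q, K, V):
    # O(1) extra memory in sequence length: a running max, a running
    # normalizer, and one accumulator -- never an n-vector of weights.
    m, l = -np.inf, 0.0
    acc = np.zeros_like(V[0])
    for k, v in zip(K, V):
        s = float(q @ k)
        m_new = max(m, s)
        alpha = np.exp(m - m_new)   # rescale earlier partial sums
        w = np.exp(s - m_new)
        l = l * alpha + w
        acc = acc * alpha + w * v
        m = m_new
    return acc / l

rng = np.random.default_rng(1)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
w = np.exp(q @ K.T - (q @ K.T).max())
assert np.allclose(attend_one_query(q, K, V), (w / w.sum()) @ V)
```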
