Papers
Browse academic papers referenced in production code
An Efficient Method for Generating Discrete Random Variables with General Distributions
article Free Access Share on An Efficient Method for Generating Discrete Random Variables with General Distributions Author: Alastair J. Walker Department of Electrical Engineering, University of Witwatersrand, 1 Jan Smuts Ave., Johnnesburg 2001, Sou...
The differentiation of pseudo-inverses and non-linear least squares problems whose variables separate
For given data $(t_i ,y_i ),i = 1, \cdots ,m$, we consider the least squares fit of nonlinear models of the form \[ \eta ({\bf a},{\boldsymbol \alpha} ;t) = \sum _{j = 1}^n {a_j \varphi _j ({\boldsymbol \alpha} ;t),\qquad {\bf a} \in \mathcal{R}^n ,\...
On Grouping for Maximum Homogeneity
Abstract Given a set of arbitrary numbers, what is a practical procedure for grouping them so that the variance within groups is minimized? An answer to this question, including a description of an automatic computer program, is given for problems up...
Evolutionary-scale prediction of atomic level protein structure with a language model
AbstractArtificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Character...
A Stop-the-World Debugger for Erlang (and the BEAM)
Erlang and the BEAM are remarkable for their tracing capabilities and the type of troubleshooting this enables on live production systems. At other stages of the development cycle, though, a traditional debugger is arguably more natural and convenien...
Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We pres...
Decoding the Molecular Language of Proteins with Evolla
Abstract Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and s...
Parallel Runtime Interface for Fortran (PRIF) Specification, Revision 0.6
This document specifies an interface to support the multi-image parallelism features of Fortran, named the Parallel Runtime Interface for Fortran (PRIF). PRIF is a solution in which a runtime library is primarily responsible for implementing coarray ...
Recipes for Pre-training LLMs with MXFP8
Using fewer bits to represent model parameters and related tensors during pre-training has become a required technique for improving GPU efficiency without sacrificing accuracy. Microscaling (MX) formats introduced in NVIDIA Blackwell generation of G...
Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation
Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. ...
Triton-distributed: Programming Overlapping Kernels on Distributed AI Systems with the Triton Compiler
In this report, we propose Triton-distributed, an extension of existing Triton compiler, to overcome the programming challenges in distributed AI systems. Triton-distributed is the first compiler that supports native overlapping optimizations for dis...
Compact Language Models via Pruning and Knowledge Distillation
Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-train...
Deriving Activation Functions Using Integration
Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a tr...
Lightweight, Modular Verification for WebAssembly-to-Native Instruction Selection
Language-level guarantees---like module runtime isolation for WebAssembly (Wasm)---are only as strong as the compiler that produces a final, native-machine-specific executable. The process of lowering language-level constructions to ISA-specific inst...
MOLPIPx: an end-to-end differentiable package for permutationally invariant polynomials in Python and Rust
In this work, we present MOLPIPx, a versatile library designed to seamlessly integrate Permutationally Invariant Polynomials (PIPs) with modern machine learning frameworks, enabling the efficient development of linear models, neural networks, and Gau...
OMPTBench – OpenMP Tool Interface Conformance Testing
OpenMP® is a highly relevant parallelization standard in high-performance computing and all major compiler vendors support it. The standard defines the OpenMP Tool Interface (OMPT) as a mechanism for third-party tools to obtain information on dedicat...
ompTest – Unit Testing with OMPT
OpenMP® is a widely used API in high-performance computing that enables parallelization on the host as well as offload work to an accelerator, such as a GPU. The OpenMP specification defines an OpenMP Tool Interface (OMPT), which allows a third-party...
Optimistic and Scalable Global Function Merging
Function merging is a pivotal technique for reducing code size by combining identical or similar functions into a single function. While prior research has extensively explored this technique, it has not been assessed in conjunction with function out...