huggingface/transformers

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, as it has to handle a large number of additional tokens in i...

Show 5 references in code

src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py:1694

src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py:2935

src/transformers/models/qwen3_vl/modeling_qwen3_vl.py:835

src/transformers/models/qwen3_vl/modular_qwen3_vl.py:729

src/transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py:938

On the Reliability of Watermarks for Large Language Models

John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando...

2023

4 references

As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial q...

Show 4 references in code

src/transformers/generation/configuration_utils.py:1272

src/transformers/generation/configuration_utils.py:1273

src/transformers/generation/logits_process.py:2534

src/transformers/generation/logits_process.py:2535

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai

2023

13 references

Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi...

Show 4 references in code

src/transformers/models/longcat_flash/configuration_longcat_flash.py:52

src/transformers/models/qwen3_next/configuration_qwen3_next.py:56

src/transformers/models/qwen3_vl_moe/configuration_qwen3_vl_moe.py:56

src/transformers/models/qwen3_vl_moe/modular_qwen3_vl_moe.py:78

Language models enable zero-shot prediction of the effects of mutations on protein function

Joshua Meier, Roshan Rao, Robert Verkuil, Jason Liu, Tom Sercu, Alexander Rives

2021

3 references

Modeling the effect of sequence variation on function is a fundamental problem for understanding and designing proteins. Since evolution encodes information about function into patterns in protein sequences, unsupervised models of variant effects can be learned from sequence data. The approach to da...

Show 3 references in code

docs/source/fr/index.md:110

docs/source/ja/index.md:107

docs/source/ko/index.md:100

Neural Networks Fail to Learn Periodic Functions and How to Fix It

Liu Ziyin, T. Hartwig, Masahito Ueda

2020

3 references

Previous literature offers limited clues on how to learn a periodic function using modern neural networks. We start with a study of the extrapolation properties of neural networks; we prove and demonstrate experimentally that the standard activation functions, such as ReLU, tanh, sigmoid, along wit...

Show 3 references in code

src/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:3155

src/transformers/models/qwen2_5_omni/modular_qwen2_5_omni.py:3313

src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py:3652

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Untert...

2020

2 references

While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional net...

Show 2 references in code

docs/source/en/model_doc/deit.md:29

docs/source/ja/model_doc/deit.md:23

Evolutionary-scale prediction of atomic level protein structure with a language model

Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Ver...

2 references

Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing ...

Show 2 references in code

docs/source/en/model_doc/esm.md:37

docs/source/ko/model_doc/esm.md:23

Conformer: Convolution-augmented Transformer for Speech Recognition

Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zh...

2020

2 references

Recently Transformer and Convolution neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent neural networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features eff...

Show 2 references in code

src/transformers/models/lasr/modeling_lasr.py:465

src/transformers/models/lasr/modular_lasr.py:442

Going deeper with Image Transformers

Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, Hervé Jégou

2021

2 references

Transformers have been recently adapted for large scale image classification, achieving high scores shaking up the long supremacy of convolutional neural networks. However the optimization of image transformers has been little studied so far. In this work, we build and optimize deeper transformer ne...

Show 2 references in code

src/transformers/models/mimi/modeling_mimi.py:491

src/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py:3467

Decoding the Molecular Language of Proteins with Evolla

Xibin Zhou, Chenchen Han, Yingqi Zhang, Jin Su, Kai Zhuang, Shiyu Jiang, Zichen Yuan, Wei Zheng, Fen...

2025

1 reference

Proteins, nature’s intricate molecular machines, are the products of billions of years of evolution and play fundamental roles in sustaining life. Yet, deciphering their molecular language - that is, understanding how protein sequences and structures encode and determine biological function...

Show 1 reference in code

docs/source/en/model_doc/evolla.md:22

Compact Language Models via Pruning and Knowledge Distillation

Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary...

2024

1 reference

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a fraction (<3%) of the original train...

Show 1 reference in code

docs/source/en/model_doc/nemotron.md:114

XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models

Davis Liang, Hila Gonen, Yuning Mao, Rui Hou, Naman Goyal, Marjan Ghazvininejad, Luke Zettlemoyer, M...

2023

1 reference

Large multilingual language models typically rely on a single vocabulary shared across 100+ languages. As these models have increased in parameter count and depth, vocabulary size has remained largely unchanged. This vocabulary bottleneck limits the representational capabilities of multilin...

Show 1 reference in code

docs/source/en/model_doc/xlm-v.md:28

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit,...

2025

1 reference

The recent hardware-accelerated microscaling 4-bit floating-point formats such as MXFP4 and NVFP4, supported on NVIDIA and AMD GPUs, promise to revolutionize large language model (LLM) inference. Yet, their practical benefits remain unproven. We present the first comprehensive study of MXFP4 and NVF...

Show 1 reference in code

docs/source/en/quantization/fp_quant.md:21

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

2016

13 references

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by...

Show 1 reference in code

src/transformers/activations.py:95
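
The formula quoted in the GELU abstract above, $x\Phi(x)$, can be sketched in a few lines. This is a minimal scalar illustration using only the Python standard library; it is not the code at src/transformers/activations.py:95, and the function names `standard_normal_cdf` and `gelu` are chosen here for illustration only.

```python
import math


def standard_normal_cdf(x: float) -> float:
    """Phi(x): cumulative distribution function of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu(x: float) -> float:
    """Exact (erf-based) GELU: the input weighted by Phi(x)."""
    return x * standard_normal_cdf(x)


if __name__ == "__main__":
    # Large positive inputs pass through almost unchanged, large negative
    # inputs are pushed toward zero, and inputs near zero are scaled smoothly.
    for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"gelu({value:+.1f}) = {gelu(value):+.6f}")
```

Unlike ReLU, which gates on the sign of the input, this function scales each input by the probability mass below it, so small negative inputs still produce a small negative output.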

Sigmoid-Weighted Linear Units for Neural Network Function Approximation in Reinforcement Learning

Stefan Elfwing, Eiji Uchibe, Kenji Doya

2017

6 references

In recent years, neural networks have enjoyed a renaissance as function approximators in reinforcement learning. Two decades after Tesauro's TD-Gammon achieved near top-level human performance in backgammon, the deep reinforcement learning algorithm DQN achieved human-level performance in many Atari...

Show 1 reference in code

src/transformers/activations.py:97

Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le

2017

7 references

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, ...

Show 1 reference in code

src/transformers/activations.py:98

Deriving Activation Functions Using Integration

Allen Hao Huang, Imanol Schlag

2024

1 reference

Our work proposes a novel approach to designing activation functions by focusing on their gradients and deriving the corresponding activation functions using integration. We introduce the Expanded Integral of the Exponential Linear Unit (xIELU), a trainable piecewise activation function derived by i...

Show 1 reference in code

src/transformers/activations.py:226

Exploiting independent filter bandwidth of human factor cepstral coefficients in automatic speech recognition

Mark D. Skowronski, John G. Harris

2004

1 reference

Mel frequency cepstral coefficients (MFCC) are the most widely used speech features in automatic speech recognition systems, primarily because the coefficients fit well with the assumptions used in hidden Markov models and because of the superior noise robustness of MFCC over alternative feature set...

Show 1 reference in code

src/transformers/audio_utils.py:478

Top-H Decoding: Adapting the Creativity and Coherence with Bounded Entropy in Text Generation

Erfan Baghaei Potraghloo, Seyedarmin Azizi, Souvik Kundu, Massoud Pedram

2025

1 reference

Large language models (LLMs), despite their impressive performance across a wide range of tasks, often struggle to balance two competing objectives in open-ended text generation: fostering diversity and creativity while preserving logical coherence. Existing truncated sampling techniques, including ...

Show 1 reference in code

src/transformers/generation/logits_process.py:595
