FP8 Formats for Deep Learning

Paulius Micikevicius, Dušan Stošić, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Kin Wai Michael Siu, Hao Wu
2022

Abstract

FP8 is a natural progression for accelerating deep learning training and inference beyond the 16-bit formats common in modern processors. In this paper we propose an 8-bit floating point (FP8) binary interchange format consisting of two encodings - E4M3 (4-bit exponent and 3-bit mantissa) and E5M2 (5-bit exponent and 2-bit mantissa). While E5M2 follows IEEE 754 conventions for representation of special values, E4M3's dynamic range is extended by not representing infinities and having only one mantissa bit-pattern for NaNs. We demonstrate the efficacy of the FP8 format on a variety of image and language tasks, effectively matching the result quality achieved by 16-bit training sessions. Our study covers the main modern neural network architectures - CNNs, RNNs, and Transformer-based models, leaving all the hyperparameters unchanged from the 16-bit baseline training sessions. Our training experiments include large, up to 175B parameter, language models. We also examine FP8 post-training quantization of language models trained using 16-bit formats that resisted fixed-point int8 quantization.
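
The two encodings differ only in how the byte is split between exponent and mantissa bits and in how special values are handled, so a small decoder makes the difference concrete. The sketch below is a minimal Python illustration under the paper's conventions, not the paper's or any library's reference implementation; the helper names decode_fp8, decode_e4m3, and decode_e5m2 are assumptions for this example. E5M2 keeps IEEE 754-style infinities and NaNs, while E4M3 drops infinities and reserves only the all-ones exponent with all-ones mantissa pattern for NaN, which extends its largest finite value to 448 (versus 57344 for E5M2).

```python
def decode_fp8(byte, exp_bits, man_bits, ieee_special):
    """Decode one FP8 byte into a Python float (illustrative sketch only)."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_field = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man_field = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1               # 7 for E4M3, 15 for E5M2
    all_ones_exp = (1 << exp_bits) - 1
    if ieee_special and exp_field == all_ones_exp:
        # E5M2 follows IEEE 754: all-ones exponent encodes inf (mantissa 0) or NaN
        return sign * float("inf") if man_field == 0 else float("nan")
    if not ieee_special and exp_field == all_ones_exp and man_field == (1 << man_bits) - 1:
        # E4M3: no infinities; only the all-ones mantissa pattern is NaN
        return float("nan")
    if exp_field == 0:                             # subnormal values
        return sign * (man_field / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1.0 + man_field / (1 << man_bits)) * 2.0 ** (exp_field - bias)

def decode_e4m3(byte):  # 4-bit exponent, 3-bit mantissa, extended finite range
    return decode_fp8(byte, exp_bits=4, man_bits=3, ieee_special=False)

def decode_e5m2(byte):  # 5-bit exponent, 2-bit mantissa, IEEE-style specials
    return decode_fp8(byte, exp_bits=5, man_bits=2, ieee_special=True)

assert decode_e4m3(0b0_1111_110) == 448.0          # largest finite E4M3 value
assert decode_e5m2(0b0_11110_11) == 57344.0        # largest finite E5M2 value
assert decode_e5m2(0b0_11111_00) == float("inf")   # E5M2 keeps infinities
```

Encoding in the other direction (rounding a wider float down to FP8) is the harder part in practice; that is roughly what the APFloat and PyTorch Float8_* headers listed below appear to implement for these two layouts.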

Code References (8 repositories, 20 references)

apache/tvm (3 files)
  src/support/scalars.h (2 references)
    // See https://arxiv.org/pdf/2209.05433.pdf
    // See https://arxiv.org/pdf/2209.05433.pdf
  src/tir/op/op.cc (2 references)
    // according to https://arxiv.org/pdf/2209.05433.pdf
    // according to https://arxiv.org/pdf/2209.05433.pdf
  src/tir/transforms/dtype_conversion.h (1 reference)
    // Reference: https://arxiv.org/abs/2209.05433
freebsd/freebsd-src (1 file)
  contrib/llvm-project/llvm/include/llvm/ADT/APFloat.h (2 references)
    // layout S1E5M2 as described in https://arxiv.org/abs/2209.05433.
    // bit layout S1E4M3 as described in https://arxiv.org/abs/2209.05433.
iree-org/iree (1 file)
  runtime/src/iree/base/internal/math.h (2 references)
    // F8E5M2 type, https://arxiv.org/abs/2209.05433
    // F8E4M3FN type, https://arxiv.org/abs/2209.05433. The paper doesn't use the FN
llvm/llvm-project (1 file)
  llvm/include/llvm/ADT/APFloat.h (2 references)
    // layout S1E5M2 as described in https://arxiv.org/abs/2209.05433.
    // bit layout S1E4M3 as described in https://arxiv.org/abs/2209.05433.
microsoft/onnxruntime (1 file)
  csharp/tools/Microsoft.ML.OnnxRuntime.PerfTool/OnnxMl.cs (1 reference)
    /// FP8 Formats for Deep Learning, https://arxiv.org/abs/2209.05433,
onnx/onnx (2 files)
  docs/docsgen/source/technical/float8.md (1 reference)
    [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433)
  docs/IR.md (1 reference)
    Floating Point Types|float16, float32, float64, bfloat16, float8e4m3fn, float8e5m2, float8e4m3fnuz, float8e5m2fnuz, float4e2m1|Values adhering to the IEEE 754-2008 standard representation of floating-point data or defined in papers [FP8 Formats for Deep Learning](https://arxiv.org/abs/2209.05433), [8-bit Numerical Formats for Deep Neural Networks](https://arxiv.org/abs/2206.02915), and the [Open Compute Project](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
openxla/xla (1 file)
  xla/python/ifrt/dtype.cc (2 references)
    // The following types are https://arxiv.org/abs/2209.05433
    // The following types are https://arxiv.org/abs/2209.05433
pytorch/pytorch (3 files)
  docs/source/tensor_attributes.rst (2 references)
    ``torch.float8_e4m3fn`` [shell]_, [1]_ 8-bit floating point, S-E-M 1-4-3, from https://arxiv.org/abs/2209.05433
    ``torch.float8_e5m2`` [shell]_ 8-bit floating point, S-E-M 1-5-2, from https://arxiv.org/abs/2209.05433
  torch/headeronly/util/Float8_e4m3fn.h (1 reference)
    /// Implementation based on the paper https://arxiv.org/pdf/2209.05433.pdf
  torch/headeronly/util/Float8_e5m2.h (1 reference)
    /// Implementation based on the paper https://arxiv.org/pdf/2209.05433.pdf