Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, Dmitry Kalenichenko
2017

Abstract

The rising popularity of intelligent mobile devices and the daunting computational cost of deep learning-based models call for efficient and accurate on-device inference schemes. We propose a quantization scheme that allows inference to be carried out using integer-only arithmetic, which can be implemented more efficiently than floating point inference on commonly available integer-only hardware. We also co-design a training procedure to preserve end-to-end model accuracy post quantization. As a result, the proposed quantization scheme improves the tradeoff between accuracy and on-device latency. The improvements are significant even on MobileNets, a model family known for run-time efficiency, and are demonstrated in ImageNet classification and COCO detection on popular CPUs.
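For concreteness, the scheme the abstract describes (section 2.1 of the paper) represents each real value r by an integer q through the affine map r = S(q - Z), where the scale S is a positive real and the zero point Z is an integer of the same type as q. Below is a minimal C++ sketch of that mapping; the struct and function names are illustrative, not taken from any of the repositories listed below.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine quantization parameters from the paper's equation (1):
//   r = S * (q - Z)
struct QuantParams {
  float scale;        // S: positive real scale
  int32_t zero_point; // Z: integer that represents real 0 exactly
};

// Quantize a real value to uint8, rounding to nearest and clamping.
uint8_t Quantize(float r, QuantParams p) {
  int32_t q =
      p.zero_point + static_cast<int32_t>(std::lround(r / p.scale));
  return static_cast<uint8_t>(std::clamp<int32_t>(q, 0, 255));
}

// Recover the (approximate) real value from its quantized form.
float Dequantize(uint8_t q, QuantParams p) {
  return p.scale * (static_cast<int32_t>(q) - p.zero_point);
}
```

Because Z is itself a quantized value, the real number 0 is always exactly representable, a property the paper relies on for zero padding.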

3 repositories
6 references

Code References

iree-org/iree (2 files)
  compiler/src/iree/compiler/GlobalOptimization/QuantizedConvToConv.cpp (2 matches, identical):
    // https://arxiv.org/abs/1712.05877.
  compiler/src/iree/compiler/GlobalOptimization/QuantizedMatmulToMatmul.cpp (1 match):
    // https://arxiv.org/abs/1712.05877.

llvm/llvm-project (1 file)
  mlir/docs/Quantization.md (1 match):
    [in this paper](https://arxiv.org/abs/1712.05877) with many extensions and

tensorflow/tensorflow (1 file)
  tensorflow/lite/toco/graph_transformations/ensure_uint8_weights_safe_for_fast_int8_kernels.cc (2 matches):
    // https://arxiv.org/abs/1712.05877
    // O(N^3) GEMM cost). See https://arxiv.org/pdf/1712.05877, section
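The two IREE passes above rest on the identity the paper derives in section 2.3: a matmul over zero-point-offset operands expands into one plain integer matmul plus cheap corrections built from row and column sums. Here is a self-contained C++ sketch of that expansion, with illustrative names; this is a sketch of the technique, not the IREE implementation.

```cpp
#include <cstdint>
#include <vector>

// For row-major uint8 operands A (MxK) and B (KxN) with zero points
// za and zb, the paper's expansion gives, per output element:
//   sum_k (A[i][k]-za)(B[k][j]-zb)
//     = (A*B)[i][j] - zb*rowsum(A,i) - za*colsum(B,j) + K*za*zb
// so only one plain integer matmul is needed, plus O(M*K + K*N) extras.
std::vector<int32_t> QuantizedMatmul(const std::vector<uint8_t>& a,
                                     const std::vector<uint8_t>& b,
                                     int m, int k, int n,
                                     int32_t za, int32_t zb) {
  std::vector<int32_t> acc(m * n, 0);
  // Plain integer matmul on the raw quantized values.
  for (int i = 0; i < m; ++i)
    for (int p = 0; p < k; ++p)
      for (int j = 0; j < n; ++j)
        acc[i * n + j] += int32_t(a[i * k + p]) * int32_t(b[p * n + j]);
  // Row sums of A and column sums of B for the zero-point corrections.
  std::vector<int32_t> rowsum(m, 0), colsum(n, 0);
  for (int i = 0; i < m; ++i)
    for (int p = 0; p < k; ++p) rowsum[i] += a[i * k + p];
  for (int p = 0; p < k; ++p)
    for (int j = 0; j < n; ++j) colsum[j] += b[p * n + j];
  // Apply the corrections: -zb*rowsum - za*colsum + K*za*zb.
  for (int i = 0; i < m; ++i)
    for (int j = 0; j < n; ++j)
      acc[i * n + j] += -zb * rowsum[i] - za * colsum[j] + k * za * zb;
  return acc;
}
```

The correction terms cost O(MK + KN), so the O(N^3) GEMM the TensorFlow comment above refers to is performed once, directly on the raw uint8 data.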