CAT: Compression-aware training for bandwidth reduction

Chaim Baskin, Brian Chmiel, Evgenii Zheltonozhskii, Ron Banner, Alex M. Bronstein, Avi Mendelson

Research output: Contribution to journal › Article › peer-review


Abstract

One major obstacle to the ubiquitous use of CNNs for inference is their relatively high memory bandwidth requirement, which can be the primary energy consumer and throughput bottleneck in hardware accelerators. Inspired by quantization-aware training approaches, we propose compression-aware training (CAT), a method that trains the model so that its weights and feature maps compress well during neural network deployment. Our method trains the model to produce low-entropy feature maps, enabling efficient compression at inference time using classical transform coding methods. CAT significantly improves the state-of-the-art results reported for quantization, evaluated on various vision and NLP tasks such as image classification (ImageNet), object detection (Pascal VOC), linguistic acceptability (CoLA), and textual entailment (MNLI). For example, on ResNet-18 with 5-bit quantization, we achieve near-baseline ImageNet accuracy with an average representation of only 1.5 bits per value. Moreover, we show that entropy reduction of weights and activations can be applied together, further improving bandwidth reduction. A reference implementation is available.
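The abstract describes training the network to produce low-entropy feature maps so that a classical entropy coder can compress them efficiently at inference time. The following is a minimal sketch, assuming a PyTorch setup, of how such an entropy regularizer could be combined with the task loss; the soft-binning estimate, the `temperature` and `lam` hyperparameters, and the helper names are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch (not the paper's exact implementation): a differentiable entropy
# penalty on quantized activations, added to the task loss so that training
# favors low-entropy feature maps that compress well with entropy coding.
import torch
import torch.nn.functional as F


def soft_entropy(x: torch.Tensor, num_bits: int = 5, temperature: float = 10.0) -> torch.Tensor:
    """Differentiable estimate of the entropy (bits per value) of uniformly
    quantized values of `x`. Soft assignments to quantization bins keep the
    estimate differentiable; the bin layout and temperature are illustrative."""
    num_bins = 2 ** num_bits
    x = x.flatten()
    # Normalize to [0, 1] and place uniform bin centers.
    x_min, x_max = x.min().detach(), x.max().detach()
    x = (x - x_min) / (x_max - x_min + 1e-8)
    centers = torch.linspace(0.0, 1.0, num_bins, device=x.device)
    # Soft one-hot assignment of each value to the bins.
    logits = -temperature * (x.unsqueeze(1) - centers.unsqueeze(0)).abs()
    assign = F.softmax(logits, dim=1)          # shape (N, num_bins)
    probs = assign.mean(dim=0)                 # empirical bin probabilities
    return -(probs * torch.log2(probs + 1e-12)).sum()


def cat_loss(logits, targets, activations, lam: float = 0.01):
    """Task loss plus an entropy penalty on a list of feature maps.
    `lam` trades accuracy against compressibility (value is illustrative)."""
    task = F.cross_entropy(logits, targets)
    reg = sum(soft_entropy(a) for a in activations)
    return task + lam * reg
```

In such a setup, `activations` would typically be collected with forward hooks on the layers whose feature maps are sent off-chip, and the hard-quantized values would be entropy coded (e.g., Huffman or arithmetic coding) at deployment time.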

Original language: English
Journal: Journal of Machine Learning Research
Volume: 22
State: Published - 1 Jan 2021
Externally published: Yes

Keywords

  • Custom hardware for deep learning
  • Deep learning
  • Efficient inference
  • Entropy encoding
  • Neural network compression

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence
