COMPRIZE: Assessing the Fusion of Quantization and Compression on DNN Hardware Accelerators

Source
Proceedings of the IEEE International Conference on VLSI Design
ISSN
1063-9667
Date Issued
2024-01-01
Author(s)
Patel, Vrajesh
Shah, Neel
Krishna, Aravind
Glint, Tom
Ronak, Abdul
Mekie, Joycee  
DOI
10.1109/VLSID60093.2024.00048
Abstract
The rapid advancement of complex Deep Neural Network (DNN) models, along with the availability of large amounts of training data, has created a huge demand for computational resources. When these novel workloads are offloaded to general-purpose computing cores, notable problems arise concerning memory utilization and power consumption. Consequently, a diverse array of methodologies has been investigated to confront these issues. Among these, DNN accelerator architectures have emerged as a prominent solution. In this work, we propose performing data-aware computation when inferencing various workloads, leading to a highly optimized DNN accelerator. Recent research advocates using the Approximate Fixed Point Posit (quantized) representation (ApproxPOS) for inferencing workloads to enhance system performance. We perform compression along with quantization for inferencing AlexNet, ResNet, and LeNet and show system-level benefits in terms of latency and energy. We selectively compress and decompress the inputs and outputs of each layer of the workload based on its sparsity. The model used for inferencing is trained in a quantization-aware manner by modifying the PyTorch framework. Our system-level analysis shows that, when performing data-aware computation for a fixed area, the proposed implementation on average consumes ∼15.4×, ∼11.6×, and ∼3.5× less energy for AlexNet, ResNet, and LeNet, respectively, and achieves on average a ∼2× speedup compared to the FP32 baseline. The area overhead due to the additional circuitry required for compression and decompression is negligible (within 0.5%), since it only requires an additional register and a counter. We demonstrate our work on the Simba architecture, and the approach is extendable to any other accelerator.
URI
https://d8.irins.org/handle/IITG2025/29135
Subjects
Compression | DNN Accelerators | Neural Networks | Quantization
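Illustrative note
The abstract describes two mechanisms: per-layer selective compression and decompression driven by activation sparsity, and quantization-aware training via a modified PyTorch. The sketch below is a hypothetical software illustration of the sparsity-driven compression decision paired with a simple zero run-length encoding (loosely mirroring the register-and-counter compressor mentioned in the abstract); the threshold, function names, and encoding are assumptions for illustration only, not the paper's actual hardware scheme.

import torch

# Hypothetical sketch: decide per layer whether a tensor is worth compressing,
# based on its sparsity (fraction of zero elements), and apply a simple
# zero run-length encoding. Threshold and encoding are illustrative, not
# taken from the paper.

SPARSITY_THRESHOLD = 0.5  # assumed cut-off, not from the paper

def sparsity(t: torch.Tensor) -> float:
    # Fraction of zero elements in the tensor.
    return (t == 0).float().mean().item()

def zero_run_length_encode(t: torch.Tensor):
    # Encode a flattened tensor as (value, run_length) pairs: runs of zeros
    # collapse into a single (0.0, count) pair, mimicking a counter-based
    # compressor; non-zero values are stored individually.
    encoded = []
    zero_run = 0
    for v in t.flatten().tolist():
        if v == 0:
            zero_run += 1
        else:
            if zero_run:
                encoded.append((0.0, zero_run))
                zero_run = 0
            encoded.append((v, 1))
    if zero_run:
        encoded.append((0.0, zero_run))
    return encoded

def maybe_compress(activation: torch.Tensor):
    # Compress a layer's activation only if it is sparse enough.
    if sparsity(activation) >= SPARSITY_THRESHOLD:
        return ("compressed", zero_run_length_encode(activation))
    return ("raw", activation)

if __name__ == "__main__":
    act = torch.relu(torch.randn(4, 4))  # ReLU outputs are typically ~50% zero
    tag, payload = maybe_compress(act)
    print(tag, payload if tag == "compressed" else payload.shape)

In this sketch, dense activations are passed through untouched, so the compression step only pays off on layers whose outputs are dominated by zeros, which is the intuition behind selecting compression per layer based on sparsity.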