
D-STACK: High Throughput DNN Inference by Effective Multiplexing and Spatio-Temporal Scheduling of GPUs

Source
IEEE Transactions on Cloud Computing
Date Issued
2024-01-01
Author(s)
Dhakal, Aditya
Kulkarni, Sameer G.  
Ramakrishnan, K. K.
DOI
10.1109/TCC.2024.3476210
Volume
12
Issue
4
Abstract
Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNNs). Providing inference services in the cloud can be resource intensive, and effectively utilizing accelerators in the cloud is important. Spatial multiplexing of the GPU, while limiting the GPU resources (GPU%) allotted to each DNN to the right amount, leads to higher GPU utilization and higher inference throughput. Right-sizing the GPU for each DNN, optimally batching requests to balance throughput and service level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs are still significant challenges. This article introduces a dynamic and fair spatio-temporal scheduler (D-STACK) for multiple DNNs to run in the GPU concurrently. We develop and validate a model that estimates the parallelism each DNN can utilize and a lightweight optimization formulation to find an efficient batch size for each DNN. Our holistic inference framework provides high throughput while meeting application SLOs. We compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6× improvement in GPU utilization and up to 4× improvement in inference throughput.
URI
https://d8.irins.org/handle/IITG2025/29205
Subjects
Datasets | gaze detection | neural networks | text tagging
