GSLICE: Controlled spatial sharing of GPUs for a scalable inference platform

Source
SoCC '20: Proceedings of the 2020 ACM Symposium on Cloud Computing
Date Issued
2020-10-12
Author(s)
Dhakal, Aditya
Kulkarni, Sameer G.  
Ramakrishnan, K. K.
DOI
10.1145/3419111.3421284
Abstract
The increasing demand for cloud-based inference services requires the use of Graphics Processing Units (GPUs). It is highly desirable to utilize GPUs efficiently by multiplexing different inference tasks on the GPU. Batched processing, CUDA streams and Multi-Process Service (MPS) help. However, we find that these are not adequate for achieving scalability by efficiently utilizing GPUs, and do not guarantee predictable performance. GSLICE addresses these challenges by incorporating a dynamic GPU resource allocation and management framework to maximize performance and resource utilization. We virtualize the GPU by apportioning the GPU resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. We develop self-learning and adaptive GPU resource allocation and batching schemes that account for network traffic characteristics, while also keeping inference latencies below service level objectives. GSLICE adapts quickly to the streaming data's workload intensity and the variability of GPU processing costs. GSLICE provides scalability of the GPU for IF processing through efficient and controlled spatial multiplexing, coupled with a GPU resource re-allocation scheme with near-zero (< 100µs) downtime. Compared to default MPS and TensorRT, GSLICE improves GPU utilization efficiency by 60-800% and achieves 2-13X improvement in aggregate throughput.
Publication link
https://dl.acm.org/doi/pdf/10.1145/3419111.3421284
URI
https://d8.irins.org/handle/IITG2025/23968
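
Illustration
The abstract describes virtualizing the GPU by apportioning its resources across Inference Functions (IFs) through controlled spatial multiplexing on top of MPS. As a rough sketch of that underlying mechanism, and not GSLICE's actual implementation (which re-allocates shares dynamically at runtime with adaptive batching), the Python snippet below launches one worker process per model and caps each MPS client's share of GPU execution resources via the standard CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable. The worker script name, model names, and percentage split are illustrative assumptions.

import os
import subprocess

# Minimal sketch of static spatial partitioning of one GPU across inference
# functions with NVIDIA MPS; assumes the MPS control daemon is already running.
WORKER = "inference_worker.py"                   # hypothetical per-IF worker script
GPU_SHARES = {"resnet50": 40, "vgg19": 40, "mobilenet": 20}  # % of SM resources (illustrative)

procs = []
for model, share in GPU_SHARES.items():
    env = os.environ.copy()
    # Cap this MPS client's share of the GPU's execution resources,
    # approximating the controlled spatial sharing described in the abstract.
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(share)
    procs.append(subprocess.Popen(["python3", WORKER, "--model", model], env=env))

for p in procs:
    p.wait()

A static split like this provides isolation between IFs but cannot track workload changes; GSLICE's contribution is adjusting such shares and batch sizes on the fly with near-zero downtime.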