Cost Efficient GPU Cluster Management for Training and Inference of Deep Learning

  • Dong Ki Kang
  • Ki Beom Lee
  • Young Chon Kim*
  • *Corresponding author for this work

    Research output: Contribution to journal, Journal article, peer-review

    Abstract

    Expanding GPU-based deep learning (DL) clusters accelerates AI services but also incurs significant energy consumption costs. In this paper, we propose a cost-efficient deep learning job allocation (CE-DLA) approach that minimizes the energy consumption cost of DL cluster operation while guaranteeing the performance requirements of user requests. First, we categorize DL jobs into two classes: training jobs and inference jobs. Through architecture-agnostic modeling, our CE-DLA approach performs a fine-grained mapping of heterogeneous DL jobs to GPU computing nodes. Second, we design an electricity price-aware DL job allocation scheme to minimize the energy consumption cost of the cluster. We show that, using a mixed-integer nonlinear programming (MINLP) formulation, our approach efficiently avoids running the GPU computing nodes during peak-rate time slots. We additionally integrate the dynamic right-sizing (DRS) method with our CE-DLA approach to minimize the energy consumption of idle nodes that have no running jobs. To investigate the realistic behavior of our approach, we measure the actual output of NVIDIA GPU devices running well-known deep neural network (DNN) models. Using real trace data of electricity prices, we show that the CE-DLA approach outperforms its competitors in terms of both energy consumption cost and DL job processing performance.
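The idea of electricity price-aware allocation can be illustrated with a toy example. The sketch below is not the authors' MINLP formulation; it is a simple greedy heuristic under stated assumptions: a deferrable training job can be paused and resumed between time slots, it needs a fixed number of slots before a deadline, and the node draws constant power while running. The function names, prices, and power figure are all hypothetical.

```python
from typing import List


def cheapest_slots(prices: List[float], slots_needed: int, deadline: int) -> List[int]:
    """Pick the `slots_needed` cheapest time slots among slots [0, deadline).

    Assumption: the job is preemptible, so its slots need not be contiguous.
    """
    if deadline > len(prices) or slots_needed > deadline:
        raise ValueError("job cannot finish before its deadline")
    # Rank candidate slots by price; ties go to the earlier slot.
    ranked = sorted(range(deadline), key=lambda t: (prices[t], t))
    return sorted(ranked[:slots_needed])


def energy_cost(prices: List[float], slots: List[int],
                power_kw: float, slot_hours: float = 1.0) -> float:
    """Energy cost of running at constant power during the chosen slots."""
    return sum(prices[t] * power_kw * slot_hours for t in slots)


# Hypothetical hourly prices in $/kWh, with a peak in slots 2 to 4.
prices = [0.10, 0.10, 0.25, 0.30, 0.25, 0.12]

aware = cheapest_slots(prices, slots_needed=3, deadline=6)
naive = [0, 1, 2]  # run immediately, ignoring prices

aware_cost = energy_cost(prices, aware, power_kw=2.0)
naive_cost = energy_cost(prices, naive, power_kw=2.0)
print(aware, aware_cost, naive_cost)
```

In this toy trace the price-aware schedule skips the peak-rate slots entirely, while the naive schedule runs through the first peak slot. A full formulation along the lines of the abstract would also have to account for inference latency requirements, heterogeneous nodes, and idle-node power, which this sketch ignores.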

    Original language: English
    Article number: 474
    Journal: Energies
    Volume: 15
    Issue number: 2
    DOIs
    State: Published - 2022.01.1

    Keywords

    • Deep learning
    • Electricity price
    • Energy consumption cost
    • GPU-based cluster
    • Inference
    • Job allocation
    • Training

    Quacquarelli Symonds(QS) Subject Topics

    • Mathematics
    • Engineering - Electrical & Electronic
    • Engineering - Petroleum
