EdgeAI
A cloud-edge-device collaborative edge intelligence approach.
Built on the combination of deep learning models and edge computing, edge intelligence has demonstrated clear advantages in service response time, data privacy protection, and elastic resource scaling.
As a result, intelligent applications have gradually permeated everyday life. However, as these applications evolve further, especially in mobile scenarios, and as the demand for fine-grained, personalized services grows, deep-learning-based services in edge networks face significant challenges in three respects:
- There is a mismatch between the limited computing and bandwidth resources of edge networks and the increasing demand for deep learning capabilities.
- The mobility and heterogeneity of terminal devices pose additional challenges to model inference and training.
- Mobile services require a transition from the classical supervised learning mode to an interactive learning mode, which places new demands and challenges on edge intelligence.
Focusing on these three challenges, we take deep reinforcement learning (DRL) as the representative decision-making model and proceed through the three main stages of deep learning: model deployment, model inference, and model training. The primary emphasis is on resource coordination and latency optimization for edge intelligence services.
Cost-Efficient Dispatch
This research focuses on the collaborative deployment of deep learning models in cloud-edge-device scenarios. It establishes offline and online formulations with corresponding theoretical bounds, aiming to minimize DNN model dispatch costs with provable guarantees.
To this end, in the cloud-edge-device scenario, the research formally analyzes the problem and proves its computational hardness. By caching and scheduling DNN models on edge servers, it derives an optimal model distribution algorithm, DTSharing, for the offline setting in which requests are known in advance. It then turns to the online setting with dynamically changing, unpredictable requests, proving a lower bound of 2 on the competitive ratio of any online algorithm, and designs and implements an online algorithm, OTSharing, with a guaranteed competitive ratio of 2.5. This approach not only reduces the communication and storage costs of model dispatch but also reveals the analytical relationship between network transmission costs and caching costs, laying a theoretical foundation for optimizing model deployment.
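To make the transmission-versus-caching trade-off concrete, the minimal sketch below implements a ski-rental-style caching rule for a single edge server: keep a model cached only while its accumulated storage cost stays below the cost of re-fetching it from the cloud. It is illustrative only and is not the published DTSharing or OTSharing algorithm; the class name, cost parameters, and time-slot model are assumptions.

```python
# Illustrative sketch only: a ski-rental-style caching rule for DNN models on an
# edge server, showing the transmission-vs-caching trade-off described above.
# NOT the published DTSharing/OTSharing algorithms; all names and parameters
# (trans_cost, cache_cost_per_slot) are assumptions made for exposition.

class EdgeModelCache:
    def __init__(self, trans_cost: float, cache_cost_per_slot: float):
        self.trans_cost = trans_cost                     # cost to fetch a model from the cloud
        self.cache_cost_per_slot = cache_cost_per_slot   # storage cost per time slot
        self.cached = {}                                 # model_id -> idle slots since last request

    def step(self) -> None:
        """Advance one time slot; evict models whose accumulated idle storage
        cost has reached the cost of re-fetching them from the cloud."""
        for model_id in list(self.cached):
            self.cached[model_id] += 1
            if self.cached[model_id] * self.cache_cost_per_slot >= self.trans_cost:
                del self.cached[model_id]                # cheaper to re-fetch later than to keep paying

    def serve(self, model_id: str) -> float:
        """Serve a request for a model; return the dispatch cost incurred."""
        if model_id in self.cached:
            self.cached[model_id] = 0                    # cache hit: reset idle counter, no transfer
            return 0.0
        self.cached[model_id] = 0                        # cache miss: fetch from cloud and start caching
        return self.trans_cost
```

Rent-or-buy rules of this flavor are a standard way to balance a recurring cost (caching) against a one-off cost (transmission) in online caching problems.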
Inference Acceleration
This research addresses the limited computational capability of end devices. It offloads part of each task to edge servers to reduce task latency within an inference acceleration model. By analyzing the computational and communication requirements of deep neural networks, it establishes a mathematical model for accelerating tasks by splitting DNNs across multiple devices. Drawing on game theory, it designs a distributed partition algorithm for dynamically changing scenarios that adaptively adjusts the model partition.
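The latency model behind split-point selection can be sketched as follows: profile each layer's compute time on the device and on the edge server, account for uploading the intermediate activation at the cut, and pick the cut that minimizes end-to-end latency. This is a single-device illustration of the formulation only; the layer profiles, activation sizes, and bandwidth are hypothetical, and the paper's game-based distributed partition algorithm is not reproduced here.

```python
# Illustrative sketch of split-point selection for device/edge DNN partitioning.
# The profiles and bandwidth are hypothetical inputs; this shows the latency
# model only, not the game-based distributed algorithm from the paper.

def best_split(device_ms, edge_ms, act_bytes, bandwidth_bps):
    """device_ms[i], edge_ms[i]: compute time (ms) of layer i on device / edge.
    act_bytes[i]: size of the tensor entering layer i (act_bytes[0] is the model
    input; len(act_bytes) == n + 1, with act_bytes[n] the final output).
    Returns (split, latency_ms): layers [0, split) run on the device,
    layers [split, n) run on the edge server."""
    n = len(device_ms)
    best_idx, best_latency = 0, float("inf")
    for split in range(n + 1):
        device_part = sum(device_ms[:split])
        edge_part = sum(edge_ms[split:])
        # If any layer runs on the edge, the tensor at the cut must be uploaded.
        transfer = (act_bytes[split] * 8 / bandwidth_bps * 1000) if split < n else 0.0
        total = device_part + transfer + edge_part
        if total < best_latency:
            best_idx, best_latency = split, total
    return best_idx, best_latency
```

In practice the same search is rerun whenever the measured bandwidth or server load changes, which is what makes adaptive re-partitioning possible in dynamic scenarios.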
Building on this, the research implements a framework called “Coknight” for accelerating DRL inference, providing an acceleration engine for this kind of interactive learning. While reducing single-device inference latency, the framework also lowers GPU occupancy on edge servers, enabling simultaneous acceleration for a large number of devices. Moreover, the partitioning method applies to other computational workloads, offering a new approach to task-offloading acceleration in edge computing.
Concurrent Inference and Training
This research addresses resource contention caused by the parallel execution of multiple tasks on edge servers. By modeling tasks as network flows, it designs a scheduling algorithm that optimizes overall task latency.
Given that edge servers play a dual role in edge-cloud collaborative scenarios, performing inference and training simultaneously, this research implements a parallel task scheduling framework named CoTraX. It models task scheduling as two correlated network flows and designs an adaptive training-flow control algorithm to optimize the overall latency of both task types. The framework provides modeling tools for scenarios in which multiple edge intelligence tasks run concurrently and offers a solution for the parallel scheduling of inference and training tasks.
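As a rough illustration of adaptive training-flow control, the sketch below throttles the share of GPU time granted to training whenever measured inference latency exceeds a target, and restores it when there is headroom. The latency target, step size, and rate bounds are assumptions for exposition; this is not the CoTraX scheduling algorithm itself.

```python
# Illustrative control-loop sketch: throttle the training flow on an edge server
# so that co-located inference stays within a latency target. Not the CoTraX
# algorithm; target, step, and bounds are assumed values for exposition.

def adjust_training_rate(rate, observed_infer_ms, target_infer_ms,
                         min_rate=0.1, max_rate=1.0, step=0.05):
    """Return an updated fraction of GPU time granted to the training flow."""
    if observed_infer_ms > target_infer_ms:
        rate -= step          # inference is suffering: yield resources to it
    else:
        rate += step          # headroom available: let training catch up
    return max(min_rate, min(max_rate, rate))

# Example: drive the controller with a stream of measured inference latencies.
rate = 0.5
for latency_ms in [18.0, 22.5, 25.1, 19.8, 17.2]:
    rate = adjust_training_rate(rate, latency_ms, target_infer_ms=20.0)
```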
References
[1] Dai, H., Wu, J., Wang, Y., Yen, J., Zhang, Y., & Xu, C. (2023). Cost-efficient sharing algorithms for DNN model serving in mobile edge networks. IEEE Transactions on Services Computing.
[2] Dai, H., Wu, J., Wang, Y., & Xu, C. (2022). Towards scalable and efficient Deep-RL in edge computing: A game-based partition approach. Journal of Parallel and Distributed Computing, 168, 108-119.