1. Introduction
Nowadays, there are thousands or even tens of thousands of satellites in Low Earth Orbit (LEO) [1,2]. LEO satellites operate at altitudes of 160 to 2000 km [3], and a large number of LEO satellites can form a constellation, creating an LEO satellite network that provides global coverage services for terrestrial users [4,5]. Such networks have been suggested as a key infrastructure for the upcoming 6G network and beyond [6,7,8,9]. However, the cost of constructing an LEO satellite constellation is enormous. The cost of a single LEO satellite is approximately USD 65,000; although this is nearly 10,000 times lower than the cost of a costly “exquisite” satellite (e.g., USD 650,000,000) [10], LEO satellite constellations built from a large number of LEO satellites still face huge economic costs. For example, one of the state-of-the-art LEO satellite networks, SpaceX’s Starlink, currently has over 2000 satellites in different LEO groups, and the Federal Communications Commission (FCC) has approved Starlink to bring that number up to 12,000 [4]. Moreover, there are many competitors in the industry of LEO satellite constellation construction; OneWeb and O3b are two other leading enterprises in this field [11,12]. Intense competition drives companies to create better satellite products, which will inevitably exacerbate the already significant economic pressure. To ensure the economic sustainability of LEO satellite constellation construction, an effective solution must be found to balance the revenue and expenditure of the satellite industry.
In today’s era, Internet of Things (IoT) technology [13] has experienced rapid development and popularization, with broad market prospects. By 2025, the number of connected IoT terminal devices is expected to reach 27 billion [14]. As a promising emerging technology, IoT has been widely applied in many scenarios, such as healthcare, transportation, smart cities, and smart homes, and it has achieved excellent application performance [15,16,17,18]. While people enjoy the convenient services of IoT, they also pay for IoT products. IoT offers a wide range of products, covers a wide range of fields, has a large user base, and can provide paid services. Therefore, integrating LEO satellites with IoT services can provide economic support for LEO satellite construction and achieve the economic sustainability of LEO satellite constellations.
Not only will the addition of IoT technology provide economic support for LEO satellite construction, but the global coverage of LEO satellite constellations also contributes to the popularization of IoT technology and products. LEO satellite networks can cover areas with difficult conditions, such as deserts and oceans, which traditional terrestrial networks cannot reach [19,20]. IoT terminals in many industries, such as transportation (maritime, highway, railway, aviation), maritime monitoring, and farming, are located in remote areas without cellular connections [21]; the assistance of LEO satellites can help them solve communication problems. In addition, the infrastructure of terrestrial networks is susceptible to natural disasters such as earthquakes and hurricanes, which can cause communication interruptions in severe cases [11]. Therefore, LEO satellite networks help to eliminate regional restrictions on the use of IoT technology and mitigate the impact of natural disasters on communication.
IoT products can provide economic support for LEO satellite construction, and LEO satellite networks can expand the coverage of IoT products. Beyond this complementary relationship, the integration of IoT and LEO satellite constellations is also a trend. Traditionally, owing to their powerful computing and storage capabilities, cloud servers have been the preferred choice for providing IoT services. To complete IoT services, cloud servers must receive raw data from terminal devices and then return services. However, the distance between cloud servers and terminal IoT equipment is relatively long and inevitably brings high latency, which may be unbearable for some latency-sensitive IoT applications. In addition, the growth in data volume brought about by the development of the Internet has posed new challenges to the capabilities of the communication network, placing even greater pressure on the cloud-based service model [22,23]. Recently, edge computing has provided a solution to this dilemma. Edge computing migrates services from the remote cloud to the network edge, closer to users. Therefore, applying the edge-computing paradigm to IoT can achieve shorter communication distances and faster services [24,25,26]. Edge servers are closer to users than cloud servers, so edge computing is expected to solve the high-latency challenge brought by cloud-based service provision [27,28].
Satellite Edge Computing (SEC) has been proposed as a promising new computing platform. It uses LEO satellites as edge servers, and there is considerable technical and theoretical support for its feasibility [5,10,29,30]. For example, in [19], the authors propose a multi-purpose satellite, iSat, which demonstrates the feasibility of configuring computing and storage resources on an LEO satellite. Moreover, the latency from terrestrial stations to visible LEO satellites can be reduced to 1–4 ms [30], which is quite friendly for latency-sensitive IoT applications.
Based on the above discussion, we envision a way for the IoT to support the economic sustainability of LEO satellite constellations: IoT products rely on LEO satellites to provide services. In this mode, users pay for IoT services and IoT product providers pay LEO satellite service providers, so the economic sustainability of LEO satellite constellation construction is guaranteed. However, one problem must be considered: both terrestrial edge servers and satellites serving as edge servers are resource-limited, with limited computing capabilities, making it difficult for them to handle complex tasks. Many IoT applications use Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), to provide Artificial Intelligence (AI) services such as image classification, object detection, and text processing [31,32,33]. Complex CNNs place significant demands on computing resources and generally cannot be executed by a single edge server independently. To address this challenge, edge-distributed inference is a popular solution, and how to implement distributed inference in the edge environment to shorten the inference latency is a research hotspot.
As far as we know, this article is the first work to use edge-distributed computing in the SEC scenario. However, in the traditional edge computing scenario, there are many studies on distributed inference that can provide a reference for our research [34,35,36,37,38,39,40,41,42,43,44]. To shorten the inference latency, many edge-distributed inference schemes have been proposed in recent years. Among these works, using model compression techniques, such as model pruning, to achieve distributed inference disrupts the structure of the original model and can have unpredictable impacts on inference accuracy [34,35,36]. Building a new deep model suitable for distributed inference requires retraining the model, increasing resource and time costs [44]. In contrast, the method of directly performing distributed partitioning and redeployment on the original model has been widely adopted, as it changes neither the parameters nor the structure of the original model and does not require retraining [37,38,39,40,41,42,43].
Motivation. We have noticed that most existing works on distributed partitioning of CNN models do not consider how to trade off communication latency against computing latency; however, this trade-off is important for shortening the inference latency, and making it well is tricky. In Section 2.3, we give a more detailed introduction to the trade-off.
Contributions. To shorten the latency of edge-distributed inference, in this article we propose EDIJP, an edge-distributed inference framework based on joint partitioning. In our framework, joint partitioning integrates model partitioning and data partitioning, and we use an iterative algorithm to trade off communication latency and computing latency. The main contributions of this article are summarized as follows:
We propose a joint partitioning scheme that effectively reduces the latency of edge inference.
We shorten the inference latency based on the trade-off between communication and computing in an edge-distributed inference scenario.
To trade off the communication latency and the computing latency, we design an iterative algorithm to gradually obtain distributed partitioning and deployment results.
We validate the method’s effectiveness using two of the most widely used CNN models, VGG16 [45] and AlexNet [46], and the CloudSim [47] simulation platform.
4. Joint Workload Partition Module
In this section, we introduce the internal logic by which JPM completes distributed partitioning and deployment, including linear programming modeling of the data partitioning problem, the initialization algorithm, and the iterative partition algorithm.
4.1. Problem Formulation
Under the paradigm of SEC, we assume that the distributed environment consists of M workers, which can be represented as $W = \{w_1, w_2, \ldots, w_M\}$. We use $C = \{c_1, c_2, \ldots, c_M\}$ to record the computing power of all workers, where the computing power of worker $w_i$ is numerically represented as $c_i$, $i \in \{1, 2, \ldots, M\}$. The out-bandwidth of the workers is represented by a set $B = \{b_1, b_2, \ldots, b_M\}$, where $b_i$ represents the out-bandwidth of worker $w_i$, $i \in \{1, 2, \ldots, M\}$.
We focus on a single sub-model that consists of N partially dependent layers, such as convolution layers and pooling layers. The network layers in the sub-model can be represented as $L = \{l_1, l_2, \ldots, l_N\}$. The set $F = \{f_1, f_2, \ldots, f_N\}$ contains the computing workload of all the network layers, where $f_j$ represents the computing workload of the layer $l_j$. We obtain each $f_j$ using the ptflops tool. As stipulated in Section 3.2, we use $H$ to represent the original input feature height of each network layer, and $H$ can describe the data transmission situation from $l_j$ to $l_{j+1}$.
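As an illustration of how the per-layer workloads $f_j$ can be obtained, the following is a minimal sketch using the ptflops package with a PyTorch VGG16 model; the specific model, input resolution, and post-processing are our own illustrative choices, not code from the article.

```python
# Minimal sketch: obtain the computing workload with ptflops (illustrative only).
import torchvision.models as models
from ptflops import get_model_complexity_info

model = models.vgg16(weights=None)              # VGG16, one of the evaluated models
macs, params = get_model_complexity_info(
    model, (3, 224, 224),                       # assumed input resolution
    as_strings=False,                           # raw numbers instead of "15.5 GMac"
    print_per_layer_stat=True)                  # also prints each layer's share
print(f"total MACs: {macs:.3e}, parameters: {params:.3e}")
```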
We use a data partition to split the output data of this sub-model. The data partitioning is represented by $X = \{x_1, x_2, \ldots, x_M\}$; $x_i$ is the proportion of the output data of worker $w_i$ to the original output data of the sub-model. To ensure that X is an effective data partition, X must comply with the following constraints, Formulas (9) and (10), ensuring that the sub-model output can be concatenated into the original output by all workers:

$$\sum_{i=1}^{M} x_i = 1, \qquad (9)$$

$$0 \le x_i \le 1, \quad i \in \{1, 2, \ldots, M\}. \qquad (10)$$
We note that the inference completion time for this sub-model on the worker $w_i$ is $t_i$, and $t_i$ can be calculated using the following Equation (11):

$$t_i = t_{r \to i} + t_i^{comp}, \qquad (11)$$

where $t_{r \to i}$ represents the transmission time of input data from the requester $r$ to worker $w_i$, and $t_i^{comp}$ represents the execution time at worker $w_i$. The inference completion time of the sub-model on a worker is equal to the time it takes for the worker to obtain the required input data plus the worker's execution time.
The transmission time from device $j$ to device $i$ can be computed as (12):

$$t_{j \to i} = \frac{d_{j \to i}}{b_j}, \qquad (12)$$

where $d_{j \to i}$ is the volume of data transmitted from $w_j$ to $w_i$. In a 64-bit system, a float32 number occupies 4 Bytes. Assuming the input feature's width and number of channels are $W$ and $C_{in}$, respectively, the calculation of $d_{j \to i}$ is shown in Equation (13):

$$d_{j \to i} = 4 \times x_i \times H \times W \times C_{in}. \qquad (13)$$
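As a numerical illustration with hypothetical values (not taken from the article): for an input feature with $H = W = 224$ and $C_{in} = 3$, a worker assigned $x_i = 0.25$ would receive $d_{j \to i} = 4 \times 0.25 \times 224 \times 224 \times 3 = 150{,}528$ Bytes, i.e., roughly 147 KB.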
The computation time of worker $w_i$ for the sub-model is denoted by $t_i^{comp}$; we obtain it using (14):

$$t_i^{comp} = \frac{x_i \sum_{j=1}^{N} f_j}{c_i}. \qquad (14)$$
Our goal is to minimize the sub-model's inference latency, which is equal to the latest time at which any worker completes the sub-model. Therefore, we build the optimization objective as (15):

$$\min_{X} \max_{i \in \{1, 2, \ldots, M\}} t_i \quad \text{s.t. constraints (9) and (10)}. \qquad (15)$$

The optimization problem (15) can be cast as a Linear Programming (LP) problem [49]. In this article, we use the Mosek tool to solve the linear programming problem in the Python environment.
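To make the formulation concrete, the following is a minimal sketch of how the min–max data-partition problem (15) can be linearized and solved; it uses SciPy's linprog rather than the Mosek toolkit employed in the article, and the per-worker latency model, the helper name solve_data_partition, and the numeric values are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of the data-partition LP (15): minimize the maximum per-worker
# completion time subject to constraints (9) and (10).
import numpy as np
from scipy.optimize import linprog

def solve_data_partition(workload, comp_power, out_bw_requester, h, w, ch):
    """Return the partition x minimizing the max per-worker completion time.

    workload        : total workload of the sub-model (e.g., FLOPs from ptflops)
    comp_power      : list of worker computing powers c_i (FLOPs/s)
    out_bw_requester: out-bandwidth of the requester (Bytes/s)
    h, w, ch        : height, width, channels of the sub-model input feature
    """
    m = len(comp_power)
    bytes_full = 4 * h * w * ch                 # float32 input feature volume
    # Assumed latency model: both transmission and computation scale with x_i.
    k = bytes_full / out_bw_requester + workload / np.asarray(comp_power)

    # Decision variables: [x_1, ..., x_M, t]; minimize t.
    c = np.zeros(m + 1)
    c[-1] = 1.0
    # k_i * x_i - t <= 0 for every worker i (linearized min-max).
    A_ub = np.hstack([np.diag(k), -np.ones((m, 1))])
    b_ub = np.zeros(m)
    # Partition proportions must sum to 1 (constraint (9)).
    A_eq = np.append(np.ones(m), 0.0).reshape(1, -1)
    b_eq = [1.0]
    bounds = [(0.0, 1.0)] * m + [(0.0, None)]   # constraint (10) and t >= 0

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]                 # partition x and latency t

# Example with three hypothetical workers (illustrative numbers only).
x, latency = solve_data_partition(workload=1.5e9,
                                  comp_power=[2e9, 1e9, 4e9],
                                  out_bw_requester=12.5e6,
                                  h=224, w=224, ch=3)
print(x, latency)
```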
4.2. Algorithm Design
To obtain a distributed partitioning and deployment solution, we adopt an initialization algorithm and an iterative partition algorithm. Using the initialization algorithm, we obtain the original partition for the feature extraction part and the execution worker for the classification part. Using the iterative partition algorithm, we further apply the joint partition strategy to the feature extraction part, striving for a shorter inference latency.
4.2.1. Initialization
When the original CNN is divided into multiple sub-models by model partitioning, each sub-model is executed in parallel on multiple workers due to data partitioning. The inference completion time of a sub-model on any one worker $w_i$ is calculated as shown in Formula (16):

$$t_i = \max_{j \in \{1, 2, \ldots, M\}} t_{j \to i} + t_i^{comp}, \qquad (16)$$

where $t_{j \to i}$ represents the transmission time of input data from worker $w_j$ to worker $w_i$, and $t_i^{comp}$ represents the execution time at worker $w_i$. Because all workers may transmit data to $w_i$ as input data for the sub-model in this round, we take the latest time at which the workers finish transmitting data to $w_i$ as the starting time for $w_i$ to perform sub-model inference.
As the CNN model can be divided into a feature extraction part and a classification part, we consider the feature extraction sub-model, denoted by $M_{fe}$, as the initial model that needs to be further partitioned, while the classification sub-model, denoted by $M_{cl}$, needs a worker $w_k$ to execute it. As shown in Algorithm 1, we first solve the optimization objective (15) for $M_{fe}$ and obtain the data partition for the output feature of $M_{fe}$. Then, we use a traversal method to obtain $k$, which is the index of the worker on which the classification sub-model is deployed.
Algorithm 1: Initialization Algorithm
Input: $\mathcal{M}$: CNN model;
$F$: computing workload set;
$M$: the number of workers; $B$: out-bandwidth set;
$C$: worker computing power set;
Output: $X_0$: the initial partition set;
$k$: the index of the worker that executes the classification sub-model.
1 Split $\mathcal{M}$ into the feature extraction sub-model $M_{fe}$ and the classification sub-model $M_{cl}$;
2 Solve problem (15) for $M_{fe}$; obtain the data partition result $X_0$ and the inference completion time of $M_{fe}$ on each worker;
3 Initialize the total inference latency as $T_0$ = Infinity;
4–10 Traverse all candidate workers for $M_{cl}$, compute the resulting total inference latency, and keep the worker with the minimum latency as $k$;
11 return $X_0$, $k$.
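A hedged Python sketch of the initialization step follows; it reuses the solve_data_partition helper from the sketch above, and the transmission and execution models as well as the shape parameters are our own simplifying assumptions, not the authors' implementation.

```python
# Sketch of Algorithm 1: partition the feature extraction part with LP (15),
# then try every worker as the host of the classification part and keep the
# one giving the lowest end-to-end latency.
import numpy as np

def transmission_time(volume_bytes, out_bw):
    """Eq. (12): data volume divided by the sender's out-bandwidth."""
    return volume_bytes / out_bw

def initialize(fe_workload, cls_workload, comp_power, out_bw,
               requester_bw, fe_in_shape, fe_out_shape):
    m = len(comp_power)
    h, w, ch = fe_in_shape
    # Step 2: data partition of the feature extraction sub-model via LP (15).
    x0, _ = solve_data_partition(fe_workload, comp_power, requester_bw, h, w, ch)
    # Completion time of the feature extraction part on each worker (Eq. (11)).
    in_bytes = 4 * h * w * ch
    t_fe = [transmission_time(x0[i] * in_bytes, requester_bw)
            + x0[i] * fe_workload / comp_power[i] for i in range(m)]
    oh, ow, och = fe_out_shape
    out_bytes = 4 * oh * ow * och
    # Steps 3-10: traverse candidate workers for the classification sub-model.
    best_latency, k = np.inf, -1
    for cand in range(m):
        # Worker `cand` starts once every partial output has arrived (Eq. (16)).
        arrive = max(t_fe[i] + (transmission_time(x0[i] * out_bytes, out_bw[i])
                                if i != cand else 0.0)
                     for i in range(m))
        total = arrive + cls_workload / comp_power[cand]
        if total < best_latency:
            best_latency, k = total, cand
    return x0, k
```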
4.2.2. Iterative Partition
After Algorithm 1, we obtain the initial data partition $X_0$ of $M_{fe}$. We further implement joint partitioning on $M_{fe}$ based on $X_0$ to obtain a shorter inference latency. We adopt an iterative procedure, Algorithm 2, to gradually obtain the final partition result.
Based on $X_0$ generated from Algorithm 1, we can give $M_{fe}$ an initial partition. We assume that the number of network layers of $M_{fe}$ is $N_1$. As shown in Algorithm 2, we traverse the network layers from layer $l_1$ to layer $l_{N_1}$, treating the layers from the start of the current sub-model to the current layer, which we name $l_{cur}$, as a sub-model for data partitioning, solving the optimization objective (15), and obtaining a new partition $x$ and a new total inference time $T$ using Equation (16). If the new inference time $T$ is smaller than $T_0$, $x$ is retained, and the layer number of $l_{cur}$ is also recorded in $P$. Otherwise, no operation is performed, and we continue to traverse forward. We perform the above steps until reaching layer $l_{N_1}$; then, the iteration ends, and the final partition result $X$ is obtained. We can deploy all the sub-models according to $X$, $P$, and $k$.
Algorithm 2: Iterative Partition Algorithm
Input: $\mathcal{M}$: CNN model;
$N_1$: the number of network layers of $M_{fe}$;
$F$: computation workload set;
$B$: out-bandwidth set;
$C$: computing power set;
$X_0$: the initial partition set from Algorithm 1;
$P$ = {}: the initial layer set used to contain all the end layer numbers of sub-models;
$T_0$: current inference latency;
$T$: record of the new inference latency.
Output: $X$, $P$
1–9 Traverse the layers of $M_{fe}$ from $l_1$ to $l_{N_1}$; for each candidate sub-model ending at the current layer, solve (15), compute the new latency $T$ with (16), and update $X$, $P$, and $T_0$ whenever $T < T_0$;
10 return $X$, $P$.
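The iterative traversal can be sketched as below; the boundary bookkeeping (restarting each candidate sub-model after the last accepted end layer) reflects our reading of Algorithm 2, and the callables solve_lp and submodel_latency stand in for the LP solve of (15) and the latency evaluation of (16).

```python
def iterative_partition(layers, x0, t0, solve_lp, submodel_latency):
    """Sketch of Algorithm 2 under our stated assumptions.

    layers             : per-layer workloads f_1..f_N1 of the feature extraction part
    x0, t0             : initial partition and latency from Algorithm 1
    solve_lp(s, e)     : data partition (15) for the candidate sub-model layers[s:e]
    submodel_latency(x): total inference latency of that candidate under Eq. (16)
    """
    partitions = [x0]      # X: partition of every accepted sub-model
    boundaries = []        # P: end-layer indices of accepted sub-models
    best_t = t0
    start = 0
    for cur in range(len(layers)):
        x = solve_lp(start, cur + 1)        # re-partition layers start..cur
        t = submodel_latency(x)
        if t < best_t:                      # keep the split only if it helps
            best_t = t
            partitions.append(x)
            boundaries.append(cur)
            start = cur + 1                 # next candidate begins after cur
    return partitions, boundaries

# Tiny illustrative usage with dummy callables (not real latency models):
parts, P = iterative_partition(layers=[1e8] * 13, x0=[0.5, 0.5], t0=1.0,
                               solve_lp=lambda s, e: [0.5, 0.5],
                               submodel_latency=lambda x: 0.9)
```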
4.2.3. Complexity
Linear programming has been studied extensively [49], and there are many algorithms for solving LP problems [50], including Input Sparsity Time algorithms [51] and Vaidya's 1989 algorithm [52]; their worst-case running times are polynomial in the number of constraints d, the number of variables n, and the number of bits L. In a 64-bit operating system, using the float32 form, the value of L is 32. For our optimization objective (15), the number of variables is the number of workers M, and the number of constraints is 2 + M, so the cost of solving the LP problem (15) can be viewed as polynomial in M. Generally, the time complexity of finding a maximum over n values is O(n), where n is the scale of the problem. The time complexities of the other operations in our algorithms, such as split, assignment, and insert, are all O(1). Therefore, the time complexity of Algorithm 1 is that of one LP solve plus an O(M) traversal over the workers, and the time complexity of Algorithm 2 is that of O(N_1) LP solves, where N_1 is the number of network layers in the feature extraction part and M is the number of workers. Overall, the time complexity of the method proposed in this article is dominated by Algorithm 2, i.e., O(N_1) solves of an LP with M variables and 2 + M constraints.
6. Related Work
In this section, we provide a theoretical analysis of the existing research on accelerating DNN inference via edge-distributed inference. To the best of our knowledge, there is no research on edge-distributed inference in SEC, so we analyze related works in terrestrial edge computing.
To accelerate DNN inference, one work achieves distributed DNN inference using pruning [44]. In [44], the authors use a class-aware pruning scheme [56] to trim the original DNN so that each new small DNN (SNN) covers only a portion of the original output categories. Using this principle, the authors obtain several SNNs that distinguish different output categories based on the original DNN; these are deployed on multiple edge servers and collaborate to produce inference results that fully cover the category range of the original output. However, changing the parameters and structure of the original model can easily have unpredictable impacts on inference accuracy.
There is also a method that proposes a loosely coupled CNN structure to fundamentally solve the problem of distributed deployment [42]. In [42], the authors design a new Loosely Coupled Structure (LCS) to adapt CNNs to distributed inference. However, this approach requires retraining new models, with different parameters for different purposes and different datasets, which consumes a great deal of computing resources and time.
Methods that partition DNNs in a distributed manner effectively avoid the challenges brought by the above two approaches. Distributed partitioning and deployment of DNNs can take place without changing the structure and parameters of the original model, and they do not require retraining the model [38,41,57,58,59]. In [41], the authors deploy the workload in a distributed manner based on network layer types: they use an input partition for convolution layers and a weight partition for fully connected layers, with Biased One-Dimensional Partition (BODP) and Modified Spectral Co-Clustering (MSCC) proposed as supporting techniques. In [38], the authors propose a Fused Tile Partitioning (FTP) method to parallelize the convolution operation; it divides the feature maps of each layer into small tiles in a grid fashion. Both [57,58] rely on Deep Reinforcement Learning (DRL). In [57], the authors search for an optimal partition location for each layer volume using the Layer Configuration-based Partition Scheme Search (LC-PSS). For the layer volume splitter, they model the split process as a Markov Decision Process (MDP) and use DRL to make optimal split decisions. In [58], the authors use DRL techniques to assist task allocation in edge-distributed inference: modeling the partition process as an MDP, the DRL agent uses the inference latency and layer configuration as the reward and state, and the optimal segmentation decisions for each layer volume are then made one by one. In [59], the authors use a host edge server to configure multiple secondary edge servers, and the overlapping zones of sub-tasks on the secondary edge servers are executed on the host edge server.
Since these distributed partitioning studies do not consider the trade-off between communication latency and computation latency, which is important for shortening the inference latency and tricky to achieve, we propose the EDIJP framework. Our framework is based on joint partitioning for distributed deployment. We model the data partition problem as an LP problem and design an iterative algorithm to achieve the trade-off between communication latency and computation latency, so that the inference latency can be shortened as much as possible.