Article

Long Short-Term Memory-Based Non-Uniform Coding Transmission Strategy for a 360-Degree Video

1 College of Computer and Information Engineering, Tianjin Normal University, Tianjin 300380, China
2 School of Software, Tiangong University, Tianjin 300387, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(16), 3281; https://doi.org/10.3390/electronics13163281
Submission received: 19 July 2024 / Revised: 5 August 2024 / Accepted: 16 August 2024 / Published: 19 August 2024

Abstract

This paper studies adaptive transmission of 360-degree video and proposes a non-uniform encoding transmission strategy based on long short-term memory (LSTM) networks. Our goal is to maximize the user’s video experience by dynamically dividing the 360-degree video into a varying number of tiles of different sizes and selecting a different bitrate for each tile, thereby reducing buffering events and video jitter. To determine the optimal number and size of tiles at the current moment, we construct a dual-layer stacked LSTM network model. This model predicts, in real time, the number, size, and bitrate of the tiles needed for the next moment of the 360-degree video based on the distance between the user’s eyes and the screen. In our experiments, we used an exhaustive algorithm to calculate the optimal tile division and bitrate selection scheme for a 360-degree video under different network conditions, and used this dataset to train our prediction model. Finally, by comparing with other state-of-the-art algorithms, we demonstrate the superiority of the proposed method.

1. Introduction

With the development of network technology, the proportion of video traffic in global mobile internet traffic keeps increasing. A 360-degree video offers users an immersive viewing experience and has become increasingly popular: users can obtain 360-degree video content around them by wearing VR headsets or holding mobile phones, and switch their viewing perspectives by turning their heads or rotating their devices. This characteristic of 360-degree video also results in a significantly larger encoding volume. For example, transmitting a 4K-resolution, 120 fps 360-degree video requires up to 4.2 Gbps of network bandwidth [1]. Therefore, 360-degree video transmission requires much higher network bandwidth than traditional 2D video transmission [2]. The spatial and temporal distributions of wireless resources in cellular networks are uneven, making it difficult to ensure a consistent user experience when watching videos. This issue becomes particularly severe when a large number of users simultaneously access the wireless cellular network, leading to unstable media experiences. Consequently, optimizing the transmission of 360-degree video over wireless networks presents a significant challenge.
The popular encoding method for 360-degree videos is spherical projection tiling. This method allows for a more flexible transmission of 360-degree video data. Since users can only see one area (viewport) of the 360-degree video at a time, the transmission volume of media data can be reduced by transmitting only the current viewport being watched by the user. However, this transmission method can cause video stuttering and buffering when users switch viewports, affecting their media experience. Another optimization approach is to reduce the quality of video in unviewed viewports, thus reducing the volume of media transmission, improving network utilization, and minimizing video stuttering caused by viewport switching [3]. A viewport-aware real-time 360-degree video streaming framework, Live360, has been proposed to optimize end-to-end video streaming. The framework prioritizes the content from 360-degree cameras based on users’ real-time viewing interests and uploads more attractive content at higher bitrates. The authors also redefine the Quality of Experience (QoE) metrics for 360-degree real-time video viewers and efficiently solve the optimization problem using dynamic programming [4]. A multimodal spatiotemporal attention transformer has been proposed to generate multiple viewpoint trajectories and their probabilities given a historical trajectory. The authors then propose an ABR algorithm based on multi-agent deep reinforcement learning (MADRL), utilizing multi-viewpoint prediction for 360-degree video streaming to maximize different QoE objectives under various network conditions [5].
Based on the above, it can be seen that tile-based encoding is the most popular 360-degree video encoding scheme that can achieve the aforementioned functionalities, reducing the overall transmission load and enhancing the user media experience. However, since each tile is encoded and decoded individually, a problem arises. When the tile size is small, the overall video encoding volume increases; when the tile size is too large, more stuttering and buffering may occur during viewport switching. Thus, the tile size directly affects the user’s Quality of Experience (QoE). Additionally, when users watch videos, their attention range is limited and changes with the distance between their eyes and the screen. Selecting the appropriate number and size of tiles, as well as the bitrate for each tile based on the distance between the user’s eyes and the screen, is a challenge that needs to be addressed.
To address the aforementioned challenges, this paper proposes a non-uniform three-dimensional encoding scheme. Specifically, the video is divided into tiles of different sizes, with the tile sizes being related to the user’s attention mechanism and the distance between the user’s eyes and the screen. Based on this, different quality levels of video are allocated to each tile. Subsequently, this paper proposes a 360-degree adaptive transmission architecture based on non-uniform three-dimensional encoding, introduces a new QoE evaluation parameter, and digitally models the video transmission process to summarize the optimization problems that need to be addressed. Finally, a dual-layer stacked LSTM network (DLS-LSTM) is proposed to predict the number of tiles, the size of each tile, and the bitrate of the video transmitted for each tile at the next moment based on the current network bandwidth, the distance between the user’s eyes and the screen, and the server’s 360-degree video data. This paper constructs and labels a dataset of users watching 360-degree videos, simulates the scenario of users watching 360-degree videos based on existing wireless cellular network data, and uses exhaustive algorithms to calculate the optimal tile segmentation scheme and tile bitrate allocation scheme. Based on the dataset, a model for tile segmentation and bitrate allocation is trained and obtained.
As illustrated in Figure 1, this paper presents the LSTM-based non-uniform encoding 360-degree video adaptive transmission algorithm, which collects external data from the user’s video-watching environment and calculates the number, size, and bitrate of tiles required for the next moment. In the proposed non-uniform three-dimensional encoding scheme, the central position is defined as the user’s attention focus, which has the most significant impact on the user’s Quality of Experience (QoE). This central area is mandatorily divided into one tile, as it does not involve viewpoint switching issues, thus reducing the number of tiles to save network bandwidth. Outside the user’s focal point, two levels of tiles are set, each having a different impact on the overall QoE. The influence of each part on the user’s QoE can be controlled through weight parameters, which can be defined by the user based on their subjective experience. The number and size of tiles in areas other than the central focus are determined by the algorithm. Since the video can be encoded into data of different quality levels, the bitrate of each tile can be determined after the number and size of tiles are confirmed. Therefore, 360-degree videos can be encoded on a plane by the size and number of tiles and spatially by different bitrate encoding. The ultimate optimization goal is to maximize the user’s video experience.
The main contributions of this paper are as follows:
  • Proposed a non-uniform three-dimensional encoding method that provides tiles of different sizes and bitrates to users in real-time.
  • Introduced a new evaluation method for 360-degree adaptive transmission quality and proposed an adaptive transmission architecture based on non-uniform three-dimensional encoding. Additionally, digitally modeled the adaptive transmission of 360-degree video to summarize the optimization problems.
  • Developed an improved LSTM-based prediction algorithm to select tiles of appropriate sizes and bitrates for users. The effectiveness of the algorithm was demonstrated through simulation experiments.

2. Related Works

With the rapid development of virtual reality (VR) technology, 360-degree video transmission technology has also gained widespread attention. A 360-degree video can provide users with an immersive viewing experience, but its large-scale application faces many technical challenges, including high bandwidth requirements, low-latency transmission, and quality optimization. In recent years, researchers have proposed many innovative solutions, covering areas such as multimodal spatiotemporal attention mechanisms, multi-agent deep reinforcement learning, edge computing, and 5G transmission technology. This paper reviews these studies, categorizing them into three areas: viewport prediction and transmission optimization, multi-agent and deep learning applications, and edge computing and network optimization.
The first category is viewport prediction and transmission optimization, which is crucial in 360-degree video transmission. An accurate prediction of the user’s viewport can effectively reduce bandwidth consumption, improve video transmission efficiency, and enhance user experience. Ye proposed a 360-degree video caching solution based on viewport reconstruction that improves viewer QoE through tile adaptive streaming. This solution designs a QoE-driven reconstruction trigger scheme that decides whether to reconstruct based on current caching information and network conditions [6]. Chen introduced a viewport-aware real-time 360-degree video streaming framework, Live360, to optimize end-to-end video streaming. Wang proposed a novel framework, CoLive, for predicting the FoV in 360-degree video live streaming. CoLive accelerates FoV prediction by offloading model training from the viewer to the edge and migrating saliency feature detection to the server side [7]. Wang also introduced a practical tile size-aware deep neural network model with a decoupled self-attention architecture that accurately and efficiently predicts the transmission time of video tiles. The authors also designed a reinforcement learning-driven online correction algorithm to robustly compensate for improper quality allocation due to saliency bias [8]. Wang proposed a synergistic spatiotemporal user-perceived viewport prediction scheme for optimal adaptive 360-degree video streaming. The authors used a user-perceived viewpoint prediction model that provides a white-box solution for FoV prediction [9]. Chen introduced a new context-aware method to “learn” users’ viewing behavior and the predictor’s capability online, prioritizing resource allocation to tiles with significant QoE contributions. The researchers developed a restricted Markov decision process and applied predictive control of the model to interpret the competition for resources between reactive and proactive transmissions and tile retransmissions [10]. A joint two-tier 360-degree video streaming system is proposed based on utilities. To better enhance the accuracy of utility evaluation, the authors propose a novel approach that combines image quality and prediction accuracy to assess the contribution of each tile, balancing longer buffer durations and efficient bandwidth usage. Then, using model predictive control, the optimal video bitrate is determined by dynamically selecting tiles based on their characteristics [11]. Guo studied the optimal wireless streaming of multi-quality tiled 360 virtual reality (VR) videos in a multi-input multi-output (MIMO)–orthogonal frequency division multiple access (OFDMA) system from a multi-antenna server to multiple single-antenna users. In scenarios without user transcoding, the authors jointly optimized beamforming and subcarrier, transmit power, and rate allocation to minimize the total transmit power [12]. Zhao et al. investigated the adaptive streaming of one or more tiled 360 videos in a multi-carrier wireless system from a multi-antenna base station (BS) to one or more single-antenna users. The authors aimed to maximize video quality while maintaining minimal rebuffering time through a group of pictures (GOPs) encoding rate adaptation and transmission adaptation for each (transmission) slot [13]. A new viewpoint prediction framework has been proposed [14]. The authors enhance the accuracy of viewpoint prediction by extracting salient features from 360° videos and combining them with feedback from predicted viewpoints. 
The authors of [15] proposed a multi-control adaptive 360° video streaming algorithm that enhances adaptability to user viewport switching through multiple auxiliary controls. The authors of [16] proposed a new multicast transmission scheme for 360° video, combining Non-Orthogonal Multiple Access and Scalable Video Coding technologies to enhance the user video experience. The authors of [17] studied transmission technology for multi-user viewing of 360-degree videos. They adopted Time-Division Multiple Access (TDMA) to optimize transmission time and power allocation, thereby reducing average transmission energy consumption. An online adaptive bitrate algorithm for 360° videos has been proposed, which schedules the transmission of 360-degree video tiles and adjusts tile bitrates in real-time to reduce rebuffering occurrences [18]. The authors of [19] studied the problem of proactive data packet discarding during transmission. They proposed a dynamic programming algorithm based on an attention transformer to solve this issue.
The second category is multi-agent and deep learning applications. Multi-agent and deep learning technologies have shown great potential in optimizing 360-degree video transmission strategies and enhancing user experience. Zeng introduced a multi-agent deep reinforcement learning-based joint edge caching and bitrate selection strategy for multi-category 360-degree video streaming. This strategy adopts different edge caching and bitrate selection strategies for different video categories to achieve fine-grained performance optimization and improve users’ average QoE [20]. Li proposed a QoE fairness-aware multi-user bitrate allocation algorithm. This algorithm trains bitrate allocation strategies using multi-agent reinforcement learning based on the user viewpoint trajectories and preferences for video quality, buffering time, and quality switching, thus reducing the disparities in QoE of the user [21]. Jin introduced a novel intelligent edge caching framework to optimize overall user QoE while considering fairness and long-term system cost. This framework utilizes edge computing technology to cache and compute near the user, reducing transmission delay and bandwidth consumption [22]. Ao proposed a transformer-based computer vision model for adaptive bitrate allocation in live 360-degree videos. This model leverages the powerful feature extraction capabilities of transformers to improve video transmission efficiency and quality [23]. The authors of [24] studied super-resolution technology in 360-degree video transmission. Through a single-frame and multi-frame joint network, they upscaled low-resolution 360-degree videos to high-resolution outputs.
The third category is edge computing and network optimization. Edge computing and network optimization play a crucial role in 360-degree video transmission by reducing latency, optimizing bandwidth usage, and enhancing transmission efficiency, thereby significantly improving the user experience. Huang proposed an end-to-end virtual reality telepresence system that transmits 6K 360-degree video via 5G millimeter-wave radio. This system combines 5G technology with mobile edge computing nodes, significantly reducing latency and achieving efficient gaze compression, reaching WiFi-like performance on modern cellular networks [25]. Kumar introduced an intelligent framework called Seer, which fully implements 360-degree video streaming in cellular networks by leveraging multi-access edge computing. This framework integrates field-of-view prediction schemes and bitrate adaptive strategies into the MEC platform, significantly enhancing the efficiency and stability of video transmission [26]. Park proposed a QUIC-based multipath cross-layer scheduling scheme for 360-degree video streaming, improving user experience in wireless networks [27]. Simiscuka proposed an adaptive scheme based on Open-RAN to enhance the quality of Opera 360-degree content distribution. This scheme optimizes the flexibility and efficiency of video transmission through an open radio access network architecture [28]. Hassan introduced a novel approach using federated learning (UVPFL) to predict the viewpoint based on the user profile in real-time 360-degree video streaming [29]. Luo proposed a practical neural-enhanced 360-degree video streaming framework. When transmitting 360-degree video frames, the video server also sends a lightweight neural network model called MaskedEncoder to the client. Upon receiving the model, the client can reconstruct the original 360-degree video frames and start playback, significantly reducing bandwidth consumption and providing robustness against packet loss [30]. Groth proposed a wavelet-based video codec designed specifically for VR displays, capable of the real-time playback of high-resolution 360-degree videos [31]. The authors of [32] investigated a caching-based transmission method for 360-degree videos, aiming to enhance transmission efficiency through edge-supported transcoding. The goal is to optimize computational and resource utilization at the edge. In [33], the authors investigated edge computing-based 360-degree video transmission technology. They modeled the transcoding and caching at the video edge as a Markov Decision Process (MDP) and proposed a novel deep reinforcement learning method to address the problems of caching replacement and resource allocation. The authors of [34] investigated issues in edge computing-based 360-degree video transmission and proposed a novel caching strategy. This strategy can enhance caching efficiency and reduce edge computing energy consumption without prior knowledge of video content popularity. A multi-agent dual-agent regularized critic algorithm based on adaptive learning was proposed. By setting caches and utilizing super-resolution technology, this algorithm enhances users’ video experience [35]. The authors of [36] proposed a multicast transmission method for 360-degree videos based on client-side computational resources to enhance transmission efficiency.
The aforementioned studies have made significant contributions to the transmission of 360-degree videos. In these studies, 360-degree videos are segmented into tiles of the same size for transmission. However, if the tiles are too small, encoding each tile separately can lead to a large amount of video encoding data; if the tiles are too large, this can affect the user’s viewpoint switching experience. Furthermore, the existing literature does not propose a non-uniform encoding scheme. Therefore, the real-time calculation of the appropriate number of tiles, their sizes, and corresponding video bitrates to balance encoding volume and viewpoint switching remains an urgent challenge.
To address this issue, this paper proposes an adaptive transmission method based on non-uniform 3D encoding. This method segments a 360-degree video into tiles of varying importance based on the human eye attention mechanism. A dual-layer stacked LSTM network is introduced to dynamically calculate the optimal number of tiles, their sizes, and corresponding video bitrates in real-time. This approach improves the user’s video experience and reduces buffering caused by viewpoint switching.

3. Architecture and Model

In this section, we describe the concept and system architecture of the proposed transmission method. First, we describe the system model and formulate the optimization problem. Then, a prediction method based on LSTM is proposed to solve the optimization problem presented in this paper.

3.1. Overview of the Proposed Delivery System

Figure 2 illustrates the LSTM-based non-uniform coding transmission architecture for a 360-degree video proposed in this paper. First, information about the 360-degree video (such as bitrate, resolution, frame rate, etc.), network bandwidth, and the user’s viewing environment (e.g., the distance between the user and the screen, which can be estimated through the viewing device’s sensors) is collected. An exhaustive algorithm is used to determine the optimal solution under these conditions, including the number and size of tiles and the corresponding video bitrates required for the 360-degree video given the current network bandwidth. These data are then used as training data and input into the dual-layer stacked LSTM network (DLS-LSTM) proposed in this paper for model training. The trained model serves as the core component of the transmission architecture proposed in this paper, selecting the appropriate number and size of tiles and the corresponding video bitrates for the user based on real-time information (360-degree video, network bandwidth and viewing environment). This paper does not set an algorithm to predict the user’s viewport switch because the proposed DLS-LSTM already takes the user’s viewport switch into account when predicting the tiles that need to be sent. During the training of the model proposed in this paper, the data used for training include data on user viewport switches.

3.2. System Model

This paper defines four areas for a 360-degree video. The first area is the attention area, which is the central region, calculated as shown in Equation (9). This area is encoded using a single tile. The second area is the region near the attention area, which we define as the neighbor area. The third area is the transition area. The fourth area is defined as the pre-transmission area. The user’s viewport includes the attention area, neighbor area, and transition area. The tiles in the pre-transmission area do not affect the user’s perceived video quality but do impact the number of video rebuffering events.
Assume a user is watching a segment of a 360-degree video. The video is divided into several segments, each of which is encoded into multiple video tiles. Each tile is individually encoded into multiple quality levels of media data. Many factors affect the user’s QoE. In this paper, the user’s QoE is defined as being composed of three parts: the quality of video transmission, video quality jitter (the number and degree of quality level switches), and the number of rebuffering events. Due to the particularities of 360-degree video playback, this paper calculates the user’s QoE only within the scope of the user’s viewing perspective.
The QoE of the user within a time slice can be obtained by the following formula:
$$QoE = Q - \alpha P_N - \beta R_{\mathrm{buf}},$$
where $Q$ represents the quality of video transmission. Generally, $Q$ can be obtained through subjective quality evaluation methods (MOS) or objective video evaluation metrics (PSNR or SSIM). Parameters $\alpha$ and $\beta$ control the impact of video quality jitter and video buffering, respectively, on the overall user experience. $P_N$ represents video quality jitter, and $R_{\mathrm{buf}}$ represents the number of video buffering events.
$$P_N = \frac{\sum_{i=1}^{m} PI_i}{m}.$$
In this formula, $i$ indexes the $i$-th video tile in the 360-degree video, and $PI_i$ represents the playback stability parameter of the $i$-th video tile.
$$PI = 1 - \frac{\sum_{f=0}^{A} \left( L_h - L_{A_f} \right) \times q(d)}{\sum_{f=0}^{A} L_h \times q(d)},$$
where $PI$ represents the playback stability of a video tile: the weighted sum of all bitrate switching steps over the transmitted video segments, normalized by the weighted maximum quality level received within the transmission time, is subtracted from 1. Parameter $A$ represents the number of video segments, $L_h$ represents the highest quality level of the media data during transmission, and $L_{A_f}$ represents the quality level of video segment $S_f$. The weight $q(d) = A - f$ assigns a larger negative gain to more recent bitrate switches.
$$Q = \sum_{k=1}^{m} \left( \epsilon_k \times Q_k \right),$$
where parameter $Q_k$ represents the video quality of the $k$-th video tile in the 360-degree video, and parameter $\epsilon_k$ is the user’s QoE adjustment parameter. The value of $\epsilon_k$ is related to the area in which the corresponding tile is located, as defined in this paper.
$$Q_k = F\!\left( B_k^{\epsilon} \right),$$
where the function $F$ describes the relationship between the selected video bitrate and the user’s QoE, and $B_k^{\epsilon}$ represents the bitrate of the $k$-th video tile at quality level $\epsilon$.
$$B_k^{\epsilon} = \max \left\{ v(e) \mid e = 1, \ldots, l, \; v(e) \le b_k \right\},$$
$$B_T = \sum_{\zeta=1}^{m} B_{\zeta},$$
where $v(e)$ represents the transmission bitrate of a video tile at quality level $e$, and $b_k$ is the network bandwidth allocated to video tile $k$. Parameter $B_T$ represents the total bitrate of the 360-degree video segments currently being transmitted, $B_{\zeta}$ represents the bitrate of the $\zeta$-th video tile, and $B_T \le \mathrm{Bandwidth}$, where Bandwidth is the bandwidth currently available for media transmission.
$$\mathrm{Bandwidth} = \begin{cases} \displaystyle \sum_{i=1}^{M} \frac{b_{\mu}^{i} \times t_s}{t_{\mu} - t_{\mu-1}}, & \text{for the first segment}, \\[2ex] \displaystyle \delta \times \sum_{i=1}^{M} \frac{b_{\mu-1}^{i} \times t_s}{t_{\mu-1} - t_{\mu-2}} + (1 - \delta) \times B^{*}, & \text{otherwise}, \end{cases}$$
where Bandwidth is the predicted network bandwidth, $b_{\mu}^{i}$ represents the bitrate of the $\mu$-th segment of the $i$-th tile, $t_{\mu}$ is the time the segment was received, $t_{\mu-1}$ is the time it was sent, and $t_s$ is the playback duration of the segment. $B^{*}$ is the available bandwidth estimated for the last segment, and $\delta$ is the weight given to the current bandwidth sample. Our model remains feasible even when other methods are used to estimate the available bandwidth.
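As a minimal illustration of Equation (8), the following Python sketch blends the throughput measured for the most recent segment with the previous estimate; the function name, argument names, and the default value of δ are assumptions chosen for illustration, not values taken from the paper.

```python
def predict_bandwidth(tile_bitrates, t_recv, t_sent, t_play, b_star=None, delta=0.7):
    """Predict the available bandwidth for the next segment (cf. Equation (8)).

    tile_bitrates : per-tile bitrates b_mu^i of the most recently downloaded segment
    t_recv, t_sent: receive and send timestamps of that segment
    t_play        : playback duration t_s of the segment
    b_star        : bandwidth estimate B* carried over from the previous segment
                    (None for the first segment)
    delta         : weight given to the newest throughput sample
    """
    # Throughput observed while downloading the most recent segment.
    throughput = sum(b * t_play / (t_recv - t_sent) for b in tile_bitrates)
    if b_star is None:              # first segment: no history to blend with
        return throughput
    # Otherwise blend the newest sample with the previous estimate.
    return delta * throughput + (1.0 - delta) * b_star
```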
The user’s attention area is related to the distance at which the user watches the video. Assume the user’s attention angle is $\theta$; the relationship between $\theta$, the attention area, and the viewing distance can be described by $\tan(\theta/2) = (p/2)/d$, where $d$ is the distance from the user’s eyes to the screen. The side length $p$ of the attention area can then be calculated as
$$p = 2d \tan\!\left(\frac{\theta}{2}\right).$$
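For illustration, with the attention angle $\theta = 25^{\circ}$ used in the experiments (Section 5.1) and an eye-to-screen distance of $d = 20$ cm, Equation (9) gives $p = 2 \times 0.20\,\mathrm{m} \times \tan(12.5^{\circ}) \approx 0.089\,\mathrm{m}$, i.e., the attention area is a square of roughly 8.9 cm on the screen.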
Therefore, the optimization problem is formulated as
$$\text{maximize} \quad QoE \qquad \text{subject to Equations (1)–(9)}.$$

4. Algorithm

This section primarily describes the LSTM-based non-uniform encoding 360-degree video adaptive transmission algorithm. The paper proposes an algorithm based on the dual-layer stacked LSTM network, and this section details the specific implementation steps of the algorithm.
This paper employs LSTM technology to optimize video transmission for several key reasons:
Firstly, LSTM networks excel in handling and predicting time series data. Given that video transmission involves a large continuous data stream, LSTM can capture the complex patterns of data changes over time, enabling more accurate predictions of potential variations during video transmission. Secondly, traditional neural networks face issues such as vanishing or exploding gradients when dealing with long-term dependencies. LSTM, with its specially designed memory units, effectively captures and maintains these long-term dependencies, which is crucial for considering network conditions and user viewpoint changes in video transmission. Thirdly, LSTM can dynamically adjust its output based on real-time input data, making it highly effective in real-time video transmission. It can adjust the number, size, and bitrate of video tiles according to current network bandwidth and user viewpoint switches, ensuring video quality and transmission efficiency. Lastly, LSTM’s design provides robust performance in handling uncertainty and noisy data. This adaptability is essential for coping with network bandwidth fluctuations and user behavior changes, thereby providing stable performance amidst various dynamic changes.
The user’s attention range is limited, so the most important area for the user’s video experience is the area of attention. Therefore, this paper considers the user’s attention model, where the user’s attention is defined as the center of the field of view, i.e., the direction and area the user is currently focusing on. This area has the greatest impact on the user’s QoE and is thus forcibly divided into a high-quality tile. The proposed algorithm identifies and quantifies the impact of each tile on the overall QoE, dynamically assigning weights to different tiles to determine which areas need higher encoding quality. The central area is usually the user’s primary focus and is forcibly segmented into a high-quality tile to ensure the best viewing experience. Adjacent tiles are divided into different quality levels based on the weights assigned by the attention mechanism. The quality of these tiles varies according to the user’s viewpoint changes and attention dispersion. The impact of each tile on the user’s QoE is controlled by dynamically calculated weight parameters from the proposed algorithm. These weight parameters can be optimized through user feedback and experimental data to best reflect the user’s viewing experience. Finally, the algorithm dynamically adjusts the number, size, and bitrate of tiles by incorporating the attention mechanism into the LSTM network. This dynamic adjustment mechanism ensures the best viewing experience even under changing network conditions and user viewpoint switches. Through these steps, this paper proposes a real-time tile quality allocation method that combines the attention mechanism and LSTM network to optimize the transmission and viewing experience of 360-degree videos. The specific implementation steps are as follows:
As shown in Figure 3, the collected data first undergo preprocessing for feature extraction and are then passed to the embedding layer for format conversion. Subsequently, the data are fed into two separate LSTM networks. The first LSTM network incorporates Dropout to randomly discard some neurons, thereby reducing the complexity of the neural network and preventing overfitting. After this, the data enter the concatenate layer, where the feature values are merged and then processed by a third LSTM network. This network further refines the feature values, providing a deeper analysis for the final result. Following the third LSTM network, an attention mechanism is introduced to dynamically adjust the weights of different parts, thereby modifying the input features. Finally, the data are input into the dense layer, where the sigmoid activation function is applied to predict the optimization target.
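For concreteness, the following Keras sketch mirrors the layer stack of Figure 3 (embedding projection, two parallel LSTMs with Dropout on the first, concatenation, a third LSTM, self-attention, and a sigmoid dense output). All layer widths, the sequence length, the dropout rate, and the number of outputs are illustrative assumptions rather than the configuration used in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

SEQ_LEN, N_FEATURES, N_OUTPUTS = 20, 8, 10   # illustrative sizes, not taken from the paper

def build_dls_lstm():
    # Input: a short history of features (bandwidth samples, viewing distance,
    # video information), one feature vector per time step.
    inputs = layers.Input(shape=(SEQ_LEN, N_FEATURES))
    # "Embedding" step: a dense projection of the numeric features into a common space.
    x = layers.TimeDistributed(layers.Dense(64, activation="relu"))(inputs)

    # Two parallel LSTM branches; the first applies Dropout to limit overfitting.
    branch_a = layers.LSTM(64, return_sequences=True)(x)
    branch_a = layers.Dropout(0.3)(branch_a)
    branch_b = layers.LSTM(64, return_sequences=True)(x)

    # Concatenate layer merges both branches; a third LSTM refines the features.
    merged = layers.Concatenate()([branch_a, branch_b])
    refined = layers.LSTM(64, return_sequences=True)(merged)

    # Self-attention over the time steps, then pooling to a single feature vector.
    attended = layers.Attention()([refined, refined])
    pooled = layers.GlobalAveragePooling1D()(attended)

    # Dense output with sigmoid activation, e.g. normalized tile counts/sizes/bitrates.
    outputs = layers.Dense(N_OUTPUTS, activation="sigmoid")(pooled)
    return Model(inputs, outputs)

model = build_dls_lstm()
model.summary()
```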

4.1. Generation of Training Data

This subsection proposes an exhaustive algorithm to calculate the optimal solution under different conditions (varying bandwidth; different videos; different numbers and sizes of tiles; and the corresponding bitrates for each tile). The exhaustive algorithm is slow in computation and challenging to apply in real-world environments. The purpose of proposing this exhaustive algorithm is to generate data for training.
The algorithm first initializes the four regions of the 360-degree video, the user’s eye-to-screen distance d, and the current network bandwidth. In this initialization, the attention area (AA) contains one tile, while the neighbor area (NA), transition area (TA), and pre-transmission area (PTA) all have zero tiles, as described in lines 1–5. Next, lines 6–20 use nested loops to iterate through all possible bitrate combinations for the AA, NA, TA, and PTA regions and through all combinations of tile numbers for the NA, TA, and PTA regions. For each combination, lines 7–14 check that the total bitrate does not exceed the bandwidth limit, calculate the Quality of Experience (QoE) of the combination, and store the QoE value. Finally, line 21 selects the maximum QoE value to determine the tile segmentation scheme for NA, TA, and PTA, as well as the bitrate for each video tile.
Algorithm 1, the Training Data Generation Algorithm (TDGA), generates training data by calculating the optimal video segmentation scheme and bitrate allocation under different network bandwidth and video quality conditions. The listing is given below, followed by a complexity analysis of the algorithm.
Algorithm 1: Training Data Generation Algorithm (TDGA)
  • Initialize the four regions of the 360-degree video.
  • Initialize the user’s eye-to-screen distance $d$ and the current network bandwidth $Bandwidth$.
  • The attention area (AA) consists of 1 tile.
  • Initialize the number of tiles: neighbor area (NA) = 0 tiles, transition area (TA) = 0 tiles, pre-transmission area (PTA) = 0 tiles.
  • The video can be divided into $V_l$ quality levels.
  • for  $bitrate_{AA} = 1$ to $V_l$  do
  •     for  $bitrate_{NA} = 1$ to $V_l$  do
  •         for  $bitrate_{TA} = 1$ to $V_l$  do
  •            for  $bitrate_{PTA} = 1$ to $V_l$  do
  •                for  $NA = 0$ to $X$ do
  •                    for  $TA = 0$ to $X$ do
  •                        for  $PTA = 0$ to $X$ do
  •                           if  $\sum_{i=1}^{NA} bitrate_{NA} + \sum_{i=1}^{TA} bitrate_{TA} + \sum_{i=1}^{PTA} bitrate_{PTA} + bitrate_{AA} \le Bandwidth$  then
  •                               Calculate $QoE_{NA,TA,PTA}$
  •                               $Bandwidth = Bandwidth - \sum_{i=1}^{NA} bitrate_{NA} - \sum_{i=1}^{TA} bitrate_{TA} - \sum_{i=1}^{PTA} bitrate_{PTA} - bitrate_{AA}$
  •                           end if
  •                           Store the value of $QoE_{NA,TA,PTA}$
  •                        end for
  •                    end for
  •                end for
  •            end for
  •         end for
  •     end for
  • end for
  • Calculate the maximum QoE to determine the tiling scheme values $NA$, $TA$, and $PTA$, as well as the bitrate for each video tile: $bitrate_{AA}$, $bitrate_{NA}$, $bitrate_{TA}$, and $bitrate_{PTA}$.
Outer loops: The algorithm contains a quadruple loop corresponding to the bitrates of AA, NA, TA, and PTA. Assuming the number of video quality levels is the same for each region, the complexity of the outer loops is $O(V_l^4)$.
Inner loops: For each bitrate combination, the algorithm uses three nested loops to traverse the tile counts of NA, TA, and PTA. Assuming the maximum number of tiles is the same for each region, the complexity of the inner loops is $O(X^3)$.
Total complexity: The total complexity of the algorithm is the product of the outer- and inner-loop complexities: $O(V_l^4) \times O(X^3) = O(V_l^4 X^3)$.
Assuming $V_l = 6$ and $X = 50$, and running this algorithm on an Intel Core i9-14980HX CPU, the total number of loop iterations is $V_l^4 \times X^3 = 162{,}000{,}000$. The clock speed of the Intel Core i9-14980HX is 5.6 GHz (i.e., it can execute approximately $5.6 \times 10^9$ operations per second). Assume that, in the worst case, one CPU cycle executes one instruction and that each operation (each instruction within the loops) takes an average time of $T_{\mathrm{once}}$:
$$T_{\mathrm{once}} = \frac{1}{5.6 \times 10^{9} \ \text{operations/second}} \approx 1.79 \times 10^{-10} \ \text{seconds/operation}.$$
Therefore, the time for one execution of the algorithm is
$$162{,}000{,}000 \ \text{operations} \times 1.79 \times 10^{-10} \ \text{seconds/operation} \approx 29 \ \text{seconds}.$$
The above complexity analysis shows that the real-time performance of the TDGA algorithm is not ideal, so it cannot be applied directly during 360-degree video transmission. However, TDGA computes the optimal solution to the problem, making it suitable for generating training data for the DLS-LSTM, and its execution time is acceptable for the offline generation of training data.
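A compact Python rendering of the exhaustive search in Algorithm 1 is sketched below. The helpers `bitrate_of` and `qoe_of` are hypothetical stand-ins for the bitrate ladder and the QoE model of Equations (1)–(7), and the loop structure (four bitrate loops times three tile-count loops) matches the $O(V_l^4 \cdot X^3)$ analysis above.

```python
from itertools import product

def tdga(bandwidth, quality_levels, max_tiles, bitrate_of, qoe_of):
    """Exhaustive search over the TDGA decision space (illustrative sketch).

    bandwidth      : currently available network bandwidth
    quality_levels : number of quality levels V_l
    max_tiles      : maximum tile count X per region
    bitrate_of     : hypothetical helper, (region, level) -> bitrate of one tile
    qoe_of         : hypothetical helper, (tile_counts, levels) -> QoE value
    """
    best_qoe, best_cfg = float("-inf"), None
    levels = range(1, quality_levels + 1)
    for l_aa, l_na, l_ta, l_pta in product(levels, repeat=4):               # bitrate loops
        for n_na, n_ta, n_pta in product(range(max_tiles + 1), repeat=3):   # tile-count loops
            total = (bitrate_of("AA", l_aa)
                     + n_na * bitrate_of("NA", l_na)
                     + n_ta * bitrate_of("TA", l_ta)
                     + n_pta * bitrate_of("PTA", l_pta))
            if total > bandwidth:                    # skip infeasible combinations
                continue
            tiles, lvls = (1, n_na, n_ta, n_pta), (l_aa, l_na, l_ta, l_pta)
            qoe = qoe_of(tiles, lvls)
            if qoe > best_qoe:                       # keep the best configuration seen
                best_qoe, best_cfg = qoe, (tiles, lvls)
    return best_qoe, best_cfg
```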

4.2. Train the Model

The loss function defined in this paper is
$$L = \frac{1}{N} \sum_{i=1}^{N} \left( Q_{\mathrm{pred},i} - Q_{\mathrm{true},i} \right)^2 + \lambda \sum_{j=1}^{M} \Delta Q_j,$$
where $Q_{\mathrm{pred},i}$ and $Q_{\mathrm{true},i}$ represent the predicted QoE value and the true QoE value for the $i$-th sample, respectively, $N$ denotes the total number of samples, $\lambda$ is the regularization parameter, and $\Delta Q_j$ represents the change in QoE for tile $j$.
The Adam optimization algorithm is used in this paper to ensure that the model converges quickly. The specific training steps are as follows:
Step 1: Input the preprocessed feature vectors into the embedding layer.
Step 2: The feature vectors enter the first LSTM layer for initial feature extraction, and Dropout is applied to prevent overfitting.
Step 3: The initially extracted feature values are fused in the concatenate layer and then fed into the second LSTM layer for further refinement.
Step 4: Introduce an attention mechanism to assign weights to different tiles, identify and quantify the impact of each tile on QoE, and predict the user’s future focus areas.
Step 5: The feature values processed by the attention mechanism are input into the dense layer, where the sigmoid activation function is used for the final prediction.
Step 6: Calculate the loss function, update the model parameters using the Adam optimization algorithm, and repeat the training until the model converges.
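A hedged sketch of Steps 1–6 is shown below, assuming the `build_dls_lstm()` model from the sketch after Figure 3 and training arrays `x_train`, `y_train`, `x_val`, `y_val` produced by TDGA. The regularization weight and the way the ΔQ term is approximated from adjacent outputs are illustrative choices, not the authors' exact loss implementation.

```python
import tensorflow as tf

LAMBDA = 0.01   # regularization weight lambda (illustrative value)

def qoe_loss(y_true, y_pred):
    # Mean squared error between predicted and true (TDGA-generated) QoE values.
    mse = tf.reduce_mean(tf.square(y_pred - y_true))
    # Stand-in for the Delta Q_j penalty: variation across adjacent tile outputs.
    delta_q = tf.reduce_mean(tf.reduce_sum(tf.abs(y_pred[:, 1:] - y_pred[:, :-1]), axis=-1))
    return mse + LAMBDA * delta_q

model = build_dls_lstm()   # model sketched after Figure 3 above
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss=qoe_loss)
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, batch_size=64)
```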

5. Simulation

5.1. Data Generation Experiment

For this experiment, we downloaded and segmented 200 360-degree video sequences [37]. The distance between the user’s eyes and the screen was randomly generated within the range of 10 cm to 30 cm, and the user’s visual focus angle was set to θ = 25°. This study generated 50,000 data points using the TDGA algorithm, with 45,000 data points used as training data and 5000 data points used as validation data.
This paper uses the coefficient of determination ( R 2 ) and the mean absolute error (MAE) to analyze the precision of the DLS-LSTM model estimation method. The accuracy of the DLS-LSTM model refers to the predicted QoE values of users under the transmission schemes. The formula for the coefficient of determination is given by Equation (14):
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( \hat{p}_i - p_i \right)^2}{\sum_{i=1}^{n} \left( p_i - \bar{p} \right)^2},$$
where $\hat{p}_i$ is the QoE value predicted for the $i$-th transmission scheme, $p_i$ is the true QoE value of the $i$-th transmission scheme, $n$ is the number of test samples, and $\bar{p}$ is the average of the true QoE values. Here, $R^2 \in [0, 1]$, with higher values indicating a more accurate estimate.
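With scikit-learn, the two accuracy measures can be computed directly; `y_true` (QoE values of the optimal schemes found by the TDGA search) and `y_pred` (QoE values under the schemes predicted by DLS-LSTM) are assumed to be aligned one-dimensional arrays.

```python
from sklearn.metrics import r2_score, mean_absolute_error

# y_true: QoE values of the optimal schemes computed by the TDGA search (ground truth)
# y_pred: QoE values under the transmission schemes predicted by the DLS-LSTM model
r2 = r2_score(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
print(f"R^2 = {r2:.3f}, MAE = {mae:.3f}")
```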
Averaged over the 5000 validation tests, the result is $R^2 = 0.913$. The accuracy of the proposed DLS-LSTM model is therefore sufficient to meet the computational requirements of real-time adaptive transmission schemes for 360-degree videos.

5.2. Simulation Experiment

To validate the proposed method, comprehensive simulation experiments were conducted. The specific simulation environment and parameters are as follows:
Video Data:
The 360-degree video sequences “King Kong”, “Lion”, “Spacewalk”, and “Everest” were used [37]. These video sequences are from publicly available 360-degree video datasets.
Network Environment:
We constructed a network simulation environment in NS3 (3.38) with the parameters shown in Table 1. In this simulated network environment, we modeled network conditions for three user mobility states: stationary user network (S-Network), slow-moving user network (SM-Network), and fast-moving user network (FM-Network). Background traffic was simulated using the ON/OFF model.
Encoding and Transmission Parameters:
The videos were encoded using JSVM (Joint Scalable Video Model) version 9.19 [38]. Each video can be divided into six layers with resolutions ranging from 2160P to 360P. Dynamic Adaptive Streaming over HTTP (DASH) technology was used to achieve transmission of different quality levels through layer extraction.
Evaluation Metrics:
Peak Signal-to-Noise Ratio (PSNR): a commonly used metric for evaluating image or video quality, primarily used to compare the differences between the original image or video and the compressed or processed version. Mean Structural Similarity Index (MSSIM): an index for evaluating image or video quality that aims to better reflect human visual perception; unlike PSNR, which is based on pixel value differences, SSIM captures structural information, brightness, and contrast. Cumulative Distribution Function (CDF): a concept from probability theory and statistics that describes the cumulative probability distribution of a random variable; in this paper, it is used to evaluate the distribution of video quality levels during transmission. Playback Stability: evaluates the variation in video quality during playback; in this paper, it is obtained by calculating the average change in the quality level of each tile. Quality of Experience (QoE): computed from the formula defined above, considering video quality, quality jitter, and the number of buffering events; QoE is normalized in this paper.
  • Efficiency: This metric is assessed using the objective quality indices PSNR and MSSIM. It measures video quality and transmission efficiency by comparing the received image, upsampled where necessary, with the original-resolution image (a small computation sketch is given after this list).
    The entire screen of the media player, with resolution $D_R$, is divided into $z$ regions:
    $$MSSIM(A_\nu, B_\nu) = \frac{1}{z} \sum_{k=1}^{z} SSIM(a_{\nu,k}, b_{\nu,k}),$$
    $$z = \frac{D_R}{\min \left\{ (sd_z \times sd_w)_1, (sd_z \times sd_w)_2, \ldots, (sd_z \times sd_w)_N \right\}},$$
    $$SSIM(a_{\nu,k}, b_{\nu,k}) = g(a_{\nu,k}, b_{\nu,k}),$$
    where $g$ is a function of $a_{\nu,k}$ and $b_{\nu,k}$; $a_{\nu,k}$ is the original $k$-th video tile and $b_{\nu,k}$ is the received $k$-th video tile. Moreover, there is a one-to-one correspondence between $b_{\nu,k}$ and $c_{\nu,k}$, where $c_{\nu,k}$ is the bitrate of the $k$-th tile.
  • Cumulative Distribution Function (CDF): This metric evaluates the distribution of transmitted video quality levels.
  • Playback Stability: This metric assesses the stability of video playback, using the playback stability parameter PI defined in Section 3.2.
  • QoE: The QoE, as defined by Equation (1), is normalized for comparability:
    $$QoE_{\mathrm{nor}} = \frac{QoE - QoE_{\mathrm{min}}}{QoE_{\mathrm{max}} - QoE_{\mathrm{min}}}.$$
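As referenced in the Efficiency item above, the following is a minimal sketch of the MSSIM computation over one frame, assuming the received tiles have already been upsampled to the original resolution (as stated in Section 5.3) and using scikit-image’s SSIM implementation as the function $g$.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mssim(original_tiles, received_tiles):
    """Mean SSIM over the z tiles of one frame (cf. the MSSIM formula above).

    Both arguments are lists of grayscale tile images (2-D uint8 arrays) of identical
    shape; received tiles are assumed to be upsampled to the original resolution first.
    """
    scores = [structural_similarity(a, b, data_range=255)
              for a, b in zip(original_tiles, received_tiles)]
    return float(np.mean(scores))
```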
Comparison algorithm:
To evaluate the performance of the proposed DLS-LSTM algorithm, we conducted simulation experiments and compared it with other algorithms. The first comparison algorithm, referred to as Base-M, sends only the minimum quality level video to ensure smooth video playback. The second comparison algorithm is TBRA [39], which dynamically divides panoramic video into different numbers of tiles. The third comparison algorithm is Option-2 [40], which enhances viewport switching sensitivity by differentiating a transition region in the user’s FoV movement direction based on uniform tiling. The fourth comparison algorithm is Live360 [4], which optimizes 360° transmission using dynamic programming.
Simulation Steps:
Data Generation: Generate training data using exhaustive algorithms and simulate user viewing behavior under different bandwidth and viewing environments. Model Training: Train a dual-layer stacked LSTM network model using the generated data, incorporating an attention mechanism to dynamically adjust the number, size, and bitrate of the tiles. Performance Evaluation: Compare the performance of the proposed method with existing methods across different evaluation metrics through simulation experiments.
Finally, a subjective evaluation experiment was conducted using the Mean Opinion Score (MOS) to assess user experience. Based on the transmission strategies computed by different algorithms under various network conditions, the videos were re-edited to align with these strategies. Twenty-two participants were recruited to evaluate the videos under identical conditions across different network scenarios. Each participant provided MOS ratings for different videos, with scores ranging from 0 to 6, where higher scores indicate a better subjective viewing experience. The final score was the average of all participants’ ratings.

5.3. Experimental Result

In this study, the PSNR and MSSIM values are calculated by comparing the received video with the original highest-quality video. If the resolution of the received video is different from the original, it is upsampled to match the original resolution before calculating these values.
Figures 4–7 show the PSNR values, MSSIM values, QoE values, and the CDF distribution of video quality levels for the different algorithms when users watch the videos ’King Kong’, ’Lion’, ’Spacewalk’, and ’Everest’, respectively, under various network conditions.
From these figures, it can be observed that the proposed DLS-LSTM method achieves the best results in terms of PSNR, MSSIM, and user QoE. The CDF distribution of video quality levels shows that the DLS-LSTM method delivers a higher proportion of high-quality video levels, resulting in better video transmission quality compared to the other methods. Among the three network conditions, the impact on network bandwidth is minimal when the user is stationary or moving at low speed, so the user’s video experience does not differ significantly between these two conditions for the four videos. However, when the user is moving at high speed, such as on a highway or high-speed train, the network bandwidth decreases and fluctuates more severely. Under these conditions, the proposed DLS-LSTM algorithm still performs well, because the formulated optimization problem accounts for the impact of network changes and video quality fluctuations on the user experience, leading to more reasonable transmission strategy choices. Moreover, the proposed method is versatile: in different video transmission scenarios, the DLS-LSTM can be retrained to suit each application by modifying the definition of QoE or the optimization conditions and objectives and obtaining training data through the exhaustive algorithm.
Table 2 shows the video quality fluctuation results for the different algorithms when users watch different videos under various network conditions. A PI value closer to 1 indicates more stable video transmission quality. It can be seen that the proposed DLS-LSTM algorithm demonstrates excellent quality stability. Naturally, the most stable video quality is achieved by the Base-M method, as it only transmits video data at the most basic quality level, resulting in no quality fluctuations.
Figure 8 shows the results of the subjective quality evaluation experiment conducted in this study. The experiment demonstrates that the proposed DLS-LSTM method achieves the highest MOS scores across all network conditions and for all videos viewed by users. This result also confirms the accuracy of the QoE definition proposed in this paper. The Base-M algorithm receives the lowest subjective evaluation from users because it only transmits video data at the lowest quality level.

6. Conclusions

This paper proposes an adaptive transmission method for 360-degree videos based on non-uniform three-dimensional encoding, aiming to mitigate the conflict between user viewing experience and network bandwidth usage present in traditional video transmission methods. By segmenting the video into tiles of varying sizes and allocating different quality levels to each tile based on the user’s attention mechanism and the distance between the user’s eyes and the screen, this paper optimizes the video transmission process. Utilizing a dual-layer stacked LSTM network model, this study dynamically predicts the number of tiles, tile sizes, and the bitrate for each tile, thereby achieving adaptive transmission. In simulation experiments, an exhaustive algorithm is proposed to generate training data for the DLS-LSTM network. The comparison results with other methods show that the proposed method significantly enhances the user viewing experience and reduces the buffering caused by viewpoint switching. Specifically, by dynamically adjusting the tile sizes and bitrates, the proposed method effectively reduces the usage of network bandwidth while ensuring video quality, thus improving network utilization. Additionally, the method demonstrates excellent performance in optimizing video transmission efficiency.
In conclusion, this research not only provides a new solution for 360-degree video transmission but also offers valuable insights for future multimedia transmission technology research. Future work will continue to optimize the performance of the model and validate its effectiveness in more practical scenarios to further enhance the user viewing experience and the efficiency of video transmission.

Author Contributions

Conceptualization, J.G. and J.Z.; methodology, J.G.; software, C.L. and Y.C.; validation, C.L. and Q.G.; formal analysis, X.L.; investigation, J.G.; resources, J.G.; data curation, W.F.; writing—original draft preparation, J.G.; writing—review and editing, J.G.; visualization, W.F.; supervision, J.Z.; project administration, X.L.; funding acquisition, J.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was substantially supported by the National Natural Science Foundation of China under Grant No. 62002263 and Tianjin Normal University Cybersecurity and Informatization Development Project No. 52WT2328.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available as they are related to our subsequent research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, S.; He, Y.; Zheng, X. Fovr: Attention-based vr streaming through bandwidth-limited wireless networks. In Proceedings of the 2019 16th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Boston, MA, USA, 10–13 June 2019; pp. 1–9. [Google Scholar]
  2. Du, J.; Yu, F.R.; Lu, G.; Wang, J.; Jiang, J.; Chu, X. MEC-Assisted Immersive VR Video Streaming Over Terahertz Wireless Networks: A Deep Reinforcement Learning Approach. IEEE Internet Things J. 2020, 7, 9517–9529. [Google Scholar] [CrossRef]
  3. Yang, Q.; Gao, W.; Li, C.; Wang, H.; Dai, W.; Zou, J.; Xiong, H.; Frossard, P. 360Spred: Saliency Prediction for 360-Degree Videos Based on 3D Separable Graph Convolutional Networks. IEEE Trans. Circuits Syst. Video Technol. 2024. [Google Scholar] [CrossRef]
  4. Chen, J.; Luo, Z.; Wang, Z.; Hu, M.; Wu, D. Live360: Viewport-Aware Transmission Optimization in Live 360-Degree Video Streaming. IEEE Trans. Broadcast. 2023, 69, 85–96. [Google Scholar] [CrossRef]
  5. Wang, H.; Long, Z.; Dong, H.; Saddik, A.E. MADRL-Based Rate Adaptation for 360 degree Video Streaming with Multi-Viewpoint Prediction. IEEE Internet Things J. 2024, 11, 26503–26517. [Google Scholar] [CrossRef]
  6. Ye, Z.; Li, Q.; Ma, X.; Zhao, D.; Jiang, Y.; Ma, L.; Yi, B.; Muntean, G.-M. VRCT: A Viewport Reconstruction-Based 360 degree Video Caching Solution for Tile-Adaptive Streaming. IEEE Trans. Broadcast. 2023, 69, 691–703. [Google Scholar] [CrossRef]
  7. Wang, M.; Chen, X.; Yang, X.; Peng, S.; Zhao, Y.; Xu, M.; Xu, C. CoLive: Edge-Assisted Clustered Learning Framework for Viewport Prediction in 360 degree Live Streaming. IEEE Trans. Multimed. 2024, 26, 5078–5091. [Google Scholar] [CrossRef]
  8. Wang, S.; Yang, S.; Su, H.; Zhao, C.; Xu, C.; Qian, F.; Wang, N.; Xu, Z. Robust Saliency-Driven Quality Adaptation for Mobile 360-Degree Video Streaming. IEEE Trans. Mob. Comput. 2024, 23, 1312–1329. [Google Scholar] [CrossRef]
  9. Wang, Y.; Li, J.; Li, Z.; Shang, S.; Liu, Y. Synergistic Temporal-Spatial User-Aware Viewport Prediction for Optimal Adaptive 360-Degree Video Streaming. IEEE Trans. Broadcast. 2024, 70, 453–467. [Google Scholar] [CrossRef]
  10. Chen, C.-Y.; Hsieh, H.-Y. Cross-Frame Resource Allocation with Context-Aware QoE Estimation for 360 degree Video Streaming in Wireless Virtual Reality. IEEE Trans. Wirel. Commun. 2023, 22, 7887–7901. [Google Scholar] [CrossRef]
  11. Li, Z.; Wang, Y.; Liu, Y.; Li, J.; Zhu, P. JUST360: Optimizing 360-Degree Video Streaming Systems with Joint Utility. IEEE Trans. Broadcast. 2024, 70, 468–481. [Google Scholar] [CrossRef]
  12. Guo, C.; Zhao, L.; Cui, Y.; Liu, Z.; Ng, D.W.K. Power-Efficient Wireless Streaming of Multi-Quality Tiled 360 VR Video in MIMO-OFDMA Systems. IEEE Trans. Wirel. Commun. 2021, 20, 5408–5422. [Google Scholar] [CrossRef]
  13. Zhao, L.; Cui, Y.; Liu, Z.; Zhang, Y.; Yang, S. Adaptive Streaming of 360 Videos with Perfect, Imperfect, and Unknown FoV Viewing Probabilities in Wireless Networks. IEEE Trans. Image Process. 2021, 30, 7744–7759. [Google Scholar] [CrossRef] [PubMed]
  14. Li, J.; Han, L.; Zhang, C.; Li, Q.; Liu, Z. Spherical Convolution Empowered Viewport Prediction in 360 Video Multicast with Limited FoV Feedback. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 1551–6857. [Google Scholar] [CrossRef]
  15. Xu, R.; Liu, C.; Hu, M.; Qian, S.; Zhang, Y.; Lin, T. OMMS: Multiple Control based Adaptive 360° Video Streaming. In Proceedings of the 15th ACM Multimedia Systems Conference, Bari, Italy, 15–18 April 2024; pp. 429–434. [Google Scholar] [CrossRef]
  16. Gao, N.; Liu, G.; Feng, M.; Hua, X.; Jiang, T. Non-orthogonal Multiple Access Enhanced Scalable 360-degree Video Multicast. IEEE Trans. Multimed. 2024, 1–16. [Google Scholar] [CrossRef]
  17. Guo, C.; Cui, Y.; Liu, Z. Optimal Multicast of Tiled 360 VR Video. IEEE Wirel. Commun. Lett. 2019, 8, 145–148. [Google Scholar] [CrossRef]
  18. Zeynali, A.; Hajiesmaili, M.H.; Sitaraman, R.K. BOLA360: Near-optimal View and Bitrate Adaptation for 360-degree Video Streaming. In Proceedings of the 15th ACM Multimedia Systems Conference, Bari, Italy, 15–18 April 2024; pp. 12–22. [Google Scholar] [CrossRef]
  19. Wang, H.; Dong, H.; Saddik, A.E. Tile-Weighted Rate-Distortion Optimized Packet Scheduling for 360° VR Video Streaming. IEEE Intell. Syst. 2024, 39, 60–72. [Google Scholar] [CrossRef]
  20. Zeng, J.; Zhou, X.; Li, K. MADRL-Based Joint Edge Caching and Bitrate Selection for Multicategory 360 degree Video Streaming. IEEE Internet Things J. 2024, 11, 584–596. [Google Scholar] [CrossRef]
  21. Li, Z.; Zhong, P.; Huang, J.; Gao, F.; Wang, J. Achieving QoE Fairness in Bitrate Allocation of 360 degree Video Streaming. IEEE Trans. Multimed. 2024, 26, 1169–1178. [Google Scholar] [CrossRef]
  22. Jin, Y.; Liu, J.; Wang, F.; Cui, S. Ebublio: Edge-Assisted Multiuser 360 degree Video Streaming. IEEE Internet Things J. 2023, 10, 15408–15419. [Google Scholar] [CrossRef]
  23. Ao, A.; Park, S. Applying Transformer-Based Computer Vision Models to Adaptive Bitrate Allocation for 360 degree Live Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  24. Liu, H.; Ma, W.; Ruan, Z.; Fang, C.; Shang, F.; Liu, Y.; Wang, L.; Wang, C.; Jiang, D. A single frame and multi-frame joint network for 360-degree panorama video super-resolution. Eng. Appl. Artif. Intell. 2024, 134, 108601. [Google Scholar] [CrossRef]
  25. Huang, X.; Riddell, J.; Xiao, R. Virtual Reality Telepresence: 360-Degree Video Streaming with Edge-Compute Assisted Static Foveated Compression. IEEE Trans. Vis. Comput. Graph. 2023, 29, 4525–4534. [Google Scholar] [CrossRef]
  26. Kumar, S.; Franklin, A.; Jin, J.; Dong, Y.-N. Seer: Learning-Based 360 degree Video Streaming for MEC-Equipped Cellular Networks. IEEE Trans. Netw. Sci. Eng. 2023, 10, 3308–3319. [Google Scholar] [CrossRef]
  27. Park, S.; Das, S.R. Cross-Layer Scheduling in QUIC and Multipath QUIC for 360-Degree Video Streaming. In Proceedings of the 2024 IEEE Wireless Communications and Networking Conference (WCNC), Dubai, United Arab Emirates, 21–24 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
  28. Simiscuka, A.A.; Togou, M.A.; Zorrilla, M.; Muntean, G.-M. 360-ADAPT: An Open-RAN-Based Adaptive Scheme for Quality Enhancement of Opera 360 degree Content Distribution. IEEE Trans. Green Commun. Netw. 2024. [Google Scholar] [CrossRef]
  29. Hassan, S.M.H.U.; Brennan, A.; Muntean, G.-M.; McManis, J. User Profile-Based Viewport Prediction Using Federated Learning in Real-Time 360-Degree Video Streaming. In Proceedings of the 2023 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), Beijing, China, 14–16 June 2023; pp. 1–7. [Google Scholar] [CrossRef]
  30. Luo, Z.; Chai, B.; Wang, Z.; Hu, M.; Wu, D. Masked360: Enabling Robust 360-Degree Video Streaming with Ultra Low Bandwidth Consumption. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2690–2699. [Google Scholar] [CrossRef] [PubMed]
  31. Groth, C.; Fricke, S.; Castillo, S.; Magnor, M. Wavelet-Based Fast Decoding of 360 degree Videos. IEEE Trans. Vis. Comput. Graph. 2023, 29, 2508–2516. [Google Scholar] [CrossRef] [PubMed]
  32. Xiao, H.; Xu, C.; Feng, Z.; Ding, R.; Yang, S.; Zhong, L.; Liang, J.; Muntean, G.-M. A Transcoding-Enabled 360° VR Video Caching and Delivery Framework for Edge-Enhanced Next-Generation Wireless Networks. IEEE J. Sel. Areas Commun. 2022, 40, 1615–1631. [Google Scholar] [CrossRef]
  33. Yang, T.; Tan, Z.; Xu, Y.; Cai, S. Collaborative Edge Caching and Transcoding for 360° Video Streaming Based on Deep Reinforcement Learning. IEEE Internet Things J. 2022, 9, 25551–25564. [Google Scholar] [CrossRef]
  34. Yu, Z.; Liu, J.; Liu, S.; Yang, Q. Co-Optimizing Latency and Energy with Learning Based 360° Video Edge Caching Policy. In Proceedings of the 2022 IEEE Wireless Communications and Networking Conference (WCNC), Austin, TX, USA, 10–13 April 2022; pp. 2262–2267. [Google Scholar]
  35. Zeng, J.; Zhou, X.; Li, K. Towards High-Quality Low-Latency 360° Video Streaming with Edge-Client Collaborative Caching and Super-Resolution. IEEE Internet Things J. 2024. [Google Scholar] [CrossRef]
  36. Long, K.; Cui, Y.; Ye, C.; Liu, Z. Optimal Wireless Streaming of Multi-Quality 360 VR Video By Exploiting Natural, Relative Smoothness-Enabled, and Transcoding-Enabled Multicast Opportunities. IEEE Trans. Multimed. 2021, 23, 3670–3683. [Google Scholar] [CrossRef]
  37. YouTube. 360-Degree Video. Available online: https://www.youtube.com/results?search_query=360video (accessed on 2 February 2024).
  38. JSVM 9 Software. Available online: https://vcgit.hhi.fraunhofer.de/jvet/jsvm (accessed on 8 January 2024).
  39. Zhang, L.; Suo, Y.; Wu, X.; Wang, F.; Chen, Y.; Cui, L.; Liu, J.; Ming, Z. TBRA: Tiling and bitrate adaptation for mobile 360-degree video streaming. In Proceedings of the 29th ACM International Conference on Multimedia, Virtual Event, 20–24 October 2021; pp. 4007–4015. [Google Scholar]
  40. Nguyen, D.V.; Tran, H.T.; Pham, A.T.; Thang, T.C. An optimal tile-based approach for viewport-adaptive 360-degree video streaming. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 29–42. [Google Scholar] [CrossRef]
Figure 1. Non-uniform three-dimensional encoding scheme.
Figure 2. Proposed delivery system.
Figure 3. DLS-LSTM model.
Figure 4. Algorithm performance for user-viewed streaming of the 'Kingkong' video under different network conditions. (a) PSNR (S-Network). (b) PSNR (SM-Network). (c) PSNR (FM-Network). (d) MSSIM (S-Network). (e) MSSIM (SM-Network). (f) MSSIM (FM-Network). (g) QoE (S-Network). (h) QoE (SM-Network). (i) QoE (FM-Network). (j) CDF (S-Network). (k) CDF (SM-Network). (l) CDF (FM-Network).
Figure 5. Algorithm performance for user-viewed streaming of the 'Lion' video under different network conditions. (a) PSNR (S-Network). (b) PSNR (SM-Network). (c) PSNR (FM-Network). (d) MSSIM (S-Network). (e) MSSIM (SM-Network). (f) MSSIM (FM-Network). (g) QoE (S-Network). (h) QoE (SM-Network). (i) QoE (FM-Network). (j) CDF (S-Network). (k) CDF (SM-Network). (l) CDF (FM-Network).
Figure 6. Algorithm performance for user-viewed streaming of the 'Spacewalk' video under different network conditions. (a) PSNR (S-Network). (b) PSNR (SM-Network). (c) PSNR (FM-Network). (d) MSSIM (S-Network). (e) MSSIM (SM-Network). (f) MSSIM (FM-Network). (g) QoE (S-Network). (h) QoE (SM-Network). (i) QoE (FM-Network). (j) CDF (S-Network). (k) CDF (SM-Network). (l) CDF (FM-Network).
Figure 7. Algorithm performance for user-viewed streaming of the 'Everest' video under different network conditions. (a) PSNR (S-Network). (b) PSNR (SM-Network). (c) PSNR (FM-Network). (d) MSSIM (S-Network). (e) MSSIM (SM-Network). (f) MSSIM (FM-Network). (g) QoE (S-Network). (h) QoE (SM-Network). (i) QoE (FM-Network). (j) CDF (S-Network). (k) CDF (SM-Network). (l) CDF (FM-Network).
Figure 8. MOS for watching different video streams under different network conditions. (a) MOS for watching 'Kingkong' under different network conditions. (b) MOS for watching 'Lion' under different network conditions. (c) MOS for watching 'Spacewalk' under different network conditions. (d) MOS for watching 'Everest' under different network conditions.
Table 1. Simulation parameters.

System bandwidth: 20 MHz
Number of RBs: 0–200
BS Tx power: 30 dBm
Subcarriers per RB: 12
Subcarrier spacing: 15 kHz
Bandwidth per RB: 180 kHz
End-to-end RTT: 100 ms
Path-loss model: COST 231 Hata (urban)
Fading model: Rayleigh fading
Antenna type: Omnidirectional
Doppler shift: 30 Hz
Thermal noise density: −174 dBm/Hz
Modulation/coding rate settings: M-QAM
TCP layer: TCP SACK
TCP receive window: 65,535 bytes
Distance from the base station: 2 km
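As a purely illustrative aid to reproducing the radio-link setup, the parameters in Table 1 can be collected into a small configuration object. The following Python sketch is our own assumption (the SimulationConfig class and its field names are hypothetical and not part of the simulator used in this work); it also verifies that the per-RB bandwidth equals the product of the 12 subcarriers per RB and the 15 kHz subcarrier spacing.

# Illustrative (assumed) encoding of the Table 1 simulation parameters.
# SimulationConfig and its field names are hypothetical.
from dataclasses import dataclass

@dataclass
class SimulationConfig:
    system_bandwidth_hz: float = 20e6      # 20 MHz
    max_resource_blocks: int = 200         # RB allocation varies from 0 to 200
    bs_tx_power_dbm: float = 30.0
    subcarriers_per_rb: int = 12
    subcarrier_spacing_hz: float = 15e3    # 15 kHz
    rb_bandwidth_hz: float = 180e3         # 180 kHz
    end_to_end_rtt_s: float = 0.100        # 100 ms
    doppler_shift_hz: float = 30.0
    thermal_noise_dbm_per_hz: float = -174.0
    tcp_receive_window_bytes: int = 65_535
    distance_from_bs_km: float = 2.0

cfg = SimulationConfig()

# Sanity check: per-RB bandwidth = subcarriers per RB x subcarrier spacing (12 x 15 kHz = 180 kHz).
assert cfg.subcarriers_per_rb * cfg.subcarrier_spacing_hz == cfg.rb_bandwidth_hz
print(f"Per-RB bandwidth: {cfg.rb_bandwidth_hz / 1e3:.0f} kHz")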
Table 2. Playback stability for different algorithms and network conditions.

Network condition   Algorithm   Kingkong   Lion     Spacewalk   Everest
S-Network           DLS-LSTM    0.8841     0.8735   0.8921      0.8658
S-Network           BASE-M      1          1        1           1
S-Network           TBRA        0.8712     0.8687   0.8154      0.7458
S-Network           Option-2    0.8667     0.8562   0.8265      0.7839
S-Network           LIVE360     0.8718     0.8721   0.8897      0.8511
SM-Network          DLS-LSTM    0.8964     0.9124   0.8847      0.8796
SM-Network          BASE-M      1          1        1           1
SM-Network          TBRA        0.8547     0.8254   0.8325      0.8413
SM-Network          Option-2    0.8695     0.8650   0.8521      0.8525
SM-Network          LIVE360     0.8764     0.8987   0.8819      0.8715
FM-Network          DLS-LSTM    0.8214     0.8241   0.8123      0.8054
FM-Network          BASE-M      1          1        1           1
FM-Network          TBRA        0.7560     0.7698   0.7658      0.7125
FM-Network          Option-2    0.7821     0.7811   0.7892      0.7654
FM-Network          LIVE360     0.8194     0.8213   0.8030      0.8014
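To make the comparison in Table 2 easier to scan, the per-video values can be averaged for each algorithm and network condition. The following Python snippet is an illustrative aggregation of the numbers already listed above (the variable names are ours); it simply reports the mean playback stability across the four test videos.

# Illustrative aggregation of the playback-stability values from Table 2.
stability = {
    "S-Network": {
        "DLS-LSTM": [0.8841, 0.8735, 0.8921, 0.8658],
        "BASE-M":   [1.0, 1.0, 1.0, 1.0],
        "TBRA":     [0.8712, 0.8687, 0.8154, 0.7458],
        "Option-2": [0.8667, 0.8562, 0.8265, 0.7839],
        "LIVE360":  [0.8718, 0.8721, 0.8897, 0.8511],
    },
    "SM-Network": {
        "DLS-LSTM": [0.8964, 0.9124, 0.8847, 0.8796],
        "BASE-M":   [1.0, 1.0, 1.0, 1.0],
        "TBRA":     [0.8547, 0.8254, 0.8325, 0.8413],
        "Option-2": [0.8695, 0.8650, 0.8521, 0.8525],
        "LIVE360":  [0.8764, 0.8987, 0.8819, 0.8715],
    },
    "FM-Network": {
        "DLS-LSTM": [0.8214, 0.8241, 0.8123, 0.8054],
        "BASE-M":   [1.0, 1.0, 1.0, 1.0],
        "TBRA":     [0.7560, 0.7698, 0.7658, 0.7125],
        "Option-2": [0.7821, 0.7811, 0.7892, 0.7654],
        "LIVE360":  [0.8194, 0.8213, 0.8030, 0.8014],
    },
}

# Mean playback stability across the four test videos for each algorithm.
for network, rows in stability.items():
    for algorithm, values in rows.items():
        print(f"{network:11s} {algorithm:9s} mean = {sum(values) / len(values):.4f}")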