Article

Anomaly Detection and Early Warning Model for Latency in Private 5G Networks

Jingyuan Han, Tao Liu, Jingye Ma, Yi Zhou, Xin Zeng and Ying Xu
1 China Telecom Research Institute, Shanghai 200120, China
2 Department of Information and Communication Engineering, Tongji University, Shanghai 200070, China
3 Jiangxi Telecom, China Telecom, Nanchang 330000, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(23), 12472; https://doi.org/10.3390/app122312472
Submission received: 21 October 2022 / Revised: 18 November 2022 / Accepted: 18 November 2022 / Published: 6 December 2022
(This article belongs to the Special Issue Federated and Transfer Learning Applications)

Abstract

Different from previous generations of communication technology, 5G has tailored several modes especially for industrial applications, such as Ultra-Reliable Low-Latency Communications (URLLC) and Massive Machine Type Communications (mMTC). Industrial private 5G networks demand high performance in terms of latency, bandwidth, and reliability, while their deployment environments are usually complicated, making network problems difficult to identify. This poses a challenge to the operation and maintenance (O&M) of private 5G networks: faults must be diagnosed or predicted quickly from high-dimensional network and service data to reduce their impact on services. This paper proposes the ConvAE-Latency model for anomaly detection, which enhances the correlation between target indicators and hidden features through multi-target learning. Meanwhile, transfer learning is applied for anomaly prediction in the proposed LstmAE-TL model to solve the problem of unbalanced samples. Based on the China Telecom data platform, the proposed models are deployed and tested in an Automated Guided Vehicles (AGVs) application scenario. The results show clear improvements over existing research.

1. Introduction

With the rapid development of wireless communication technology, the 5G era has arrived. Operator services are gradually shifting from traditional services for the general public to customized services for business customers. Compared with existing traditional networks, private 5G networks are characterized by diversification, refinement, and determinism. The business scenarios in 5G involve various industrial firms with more complex and diverse terminal modes, which significantly raises the business requirements on network performance indicators. According to relevant survey reports, a custom network for vehicle networking requires the latency of vehicle-to-vehicle (V2V) services to be less than 200 ms. A minimum latency of 30 ms is required for automated guided vehicles (AGVs) in a smart factory, whereas the smart grid application scenario is the most latency-sensitive, with a threshold of less than 15 ms. Failing to meet these thresholds can cause large-scale production downtime, business interruptions, serious economic losses, and even personnel safety risks.
In the actual production environment of existing networks, such massive customized communication systems face many challenges. Existing network operations and maintenance (O&M) methods rely mainly on accumulated manual experience, which cannot provide early warning of faults or proactively defend against their occurrence. At the same time, O&M based on manual experience alone has become significantly more difficult: the virtualization of 5G deployments means that multiple alarms can lie behind a single observed alarm, and refined management has caused monitoring data in the O&M system to grow explosively. Long anomaly localization cycles and difficult root cause tracing seriously affect productivity and customer experience. In the face of these factors, there is an urgent need for efficient, rapid, accurate, and low-cost approaches to meet the growing demand for digital O&M.
The concept of data-driven artificial intelligence for IT operations (AIOps) [1] was presented in 2016 and has become a trending research direction aiming to solve the low O&M efficiency caused by inexperience. It combines traditional O&M with artificial intelligence algorithms, big data, and other methods to improve the network's O&M capabilities and reduce human intervention, with the ultimate goal of achieving unmanned and fully automated O&M. In 2018, the white paper "Enterprise AIOps Implementation Recommendations" was jointly initiated and developed by the Efficient Operations and Maintenance community and the AIOps Standard Working Group [2].
According to the survey in [3], most existing intelligent O&M frameworks are based on machine learning methods and mainly cover three aspects: anomaly detection, anomaly localization, and abnormal early warning. Specific research is presented in the related work in Section 2. Most research on anomaly detection determines whether an abnormal state exists at the current moment by calculating the error between the current interval and the normal interval. Anomaly localization typically builds on mature algorithms such as association rule mining and decision trees, returning the data of the top-ranked dimensions by ranking the anomalous factors (or errors) of each dimension within the anomaly duration, or by reducing the complexity of the search space through algorithm design. However, existing anomaly localization studies mainly take a supervised classification perspective, which is at odds with the largely unlabeled data found in actual production environments. Abnormal early warning studies focus on regression prediction over time series or on matching the real-time data distribution against failure rules to obtain the probability of failure occurrence. A thorough review shows that progress in intelligent O&M remains largely confined to academic research; combining academic algorithms with industrial domain knowledge and applying them to actual industrial production scenarios remains difficult.
In the early stage of this study, we built a data monitoring platform for private 5G networks and found that abnormal AGV driving is strongly related to high latency. However, current O&M is based on manual experience or simple data analysis, while this latency-sensitive scenario requires strong timeliness. Therefore, based on the China Telecom network data platform, this study proposes anomaly detection and abnormal early warning models for the abnormal AGV driving scenario in private 5G networks; the proposed ConvAE-Latency model efficiently detects high latency. Building on it, the LstmAE-TL model is proposed to realize abnormal early warning at a 15 min interval using the characteristics of long short-term memory (LSTM). Transfer learning is used to solve the problem that the loss cannot converge during training of the early warning model owing to the small sample size. During data analysis, it was found that the percentage of abnormal samples is small because private 5G networks run smoothly. Considering that clustering so few samples cannot establish correlations within a class, we only studied anomaly detection and abnormal early warning; anomaly location was not considered.
The main contributions of this study are as follows:
  • Instead of simulation data [4], the training data of the proposed models come from the China Telecom network data platform, and the practicability of the ConvAE-Latency and LstmAE-TL models is verified.
  • Considering latency fluctuation as an important indicator for anomaly detection, the ConvAE-Latency model uses a latency fitting module to enhance the correlation between target indicators and hidden features, in contrast to other methods [5,6].
  • Transfer learning is applied to address the scarcity of abnormal data samples caused by the smooth running of private 5G networks; compared with related works [7,8], the LstmAE-TL model performs better.
The remainder of this paper is organized as follows: Section 2 introduces state-of-the-art background information and studies, Section 3 introduces the method used in this study, and Section 4 shows the dataset and experimental results. Finally, Section 5 concludes the paper and discusses future research directions.

2. Related Works

Owing to the growing demand for data-driven operations and maintenance, AIOps [1] has become a prominent research direction. It uses technologies such as big data analysis and machine learning to automate O&M management and make intelligent diagnostic decisions through specific policies and algorithms, replacing both the collaborative system of R&D and operations personnel and the expert-system-based automated O&M system to achieve faster, more accurate, and more efficient intelligent O&M. However, in real O&M scenarios, abnormal indicators are affected not by a single-dimensional key performance indicator (KPI) but by abnormal KPIs across multiple dimensions, and these jointly lead to abnormal occurrences. Moreover, monitoring data are large in volume, difficult to clean, and unlabeled with respect to anomalies; therefore, realizing fast and accurate intelligent O&M is not easy. With the development of complex and refined O&M requirements, anomaly detection, anomaly location, and abnormal early warning, as key technologies in the AIOps field, have become prominent research directions.

2.1. Anomaly Detection

Compared with traditional time-series-analysis-based anomaly monitoring models [9,10], most existing studies focus on machine-learning-algorithm-based approaches [6]. Shyu et al. [11] focused on the use of principal component analysis (PCA) for anomaly detection and constructed a prediction model to determine anomalies based on the principal and subprincipal components of normal instances. Liu et al. [12] considered the concept of isolation and proposed the isolation forest (IF) model. They considered that anomalies are sparser than other normal data points and can be isolated and divided into leaf nodes using few splits, making the path between the leaf node where the anomaly is located and the root node shorter.
The autoencoder network has also been applied to detect anomalous metrics using reconstruction errors [6,13,14]. Garg et al. [6] proposed an autoencoder-based anomaly detection and diagnosis method with a scoring algorithm that ranks root causes by their likelihood of occurrence. Zong et al. [13] proposed a method that combines a deep autoencoder (DAE) and a Gaussian mixture model (GMM). The method performs well on several types of public data and outperforms many existing anomaly detection methods; however, it does not consider the temporal correlation and severity of anomalies in real scenarios and is not very robust against noise. Zhang et al. [14] used graph neural networks for the anomaly detection of temporal data using a combination of convolutional layers, convolutional long short-term memory networks (Conv-LSTM), and an autoencoder. This algorithm can eliminate the effect of noise between mutually uncorrelated features; however, as the data size increases, the computation time and computational load become high. He et al. [15] trained normal data using a temporal convolutional network (TCN) and used Gaussian distribution fitting for anomaly detection. Munir et al. [16] proposed a method that uses a convolutional neural network (CNN) for time-series prediction and then uses the Euclidean distance between predicted and true values for anomaly detection. In addition, Yahoo, Skyline, Netflix, and others have conducted related studies. Yahoo proposed a single-indicator anomaly detection system that combines three modules: timing modeling, anomaly detection, and alerting [17]. The NetMan Lab of Tsinghua University, Alibaba, and Baidu have also conducted related studies, such as the Donut algorithm for periodic temporal data detection based on the variational autoencoder (VAE) model [18] and the Opprentice algorithm based on random forests [19].

2.2. Anomaly Location

After anomaly detection algorithms detect anomalies in multidimensional monitoring metrics at a certain time, we need to quickly and accurately find the specific dimension that causes them, that is, to perform anomaly location. Association rule mining [20] and decision trees [21] are two mature anomaly location methods in intelligent O&M. Association rule mining is a rule-based machine learning algorithm that aims to discern strong rules present in a dataset using a number of metrics. Once association rules are generated, they can be used to assist the correlation analysis of events (errors and alerts). Decision trees belong to supervised learning in machine learning and represent mappings between object attributes and object values. Each node in the tree represents a judgment condition on object attributes, branches represent objects that meet the conditions of the nodes, and the leaf nodes represent the judgment results to which the objects belong. Tsinghua University and Baidu Inc. proposed a root cause analysis algorithm called HotSpot in 2018 [22], which uses Monte Carlo tree search and a heuristic search algorithm to transform root cause analysis into a large-space search problem and locate the root causes of anomalies. Lin et al. proposed the iDice root cause analysis algorithm [23] for multidimensional monitoring metrics of problem-report counts, which reduces the workload of O&M staff by pruning the combinations of attribute values under each dimension with three pruning strategies.
In addition, Bhagwan et al. proposed the Adtributor algorithm [24] for the KPI of a website's advertising revenue. After detecting anomalies in the KPI, this algorithm uses the concepts of explanatory power and surprise to identify the set of most likely root causes. Pinpoint [25] uses cluster analysis to diagnose the root causes of faults in large, dynamic Internet services. Argus [26] aggregates performance metrics for groups of users with the same attributes using a hierarchical data structure and attempts to locate users experiencing performance faults.

2.3. Abnormal Early Warning

Compared with anomaly location, abnormal early warning is forward-looking: it predicts the operational state of the network or the faults of industrial equipment, which can improve maintenance efficiency and reduce losses. Current warning algorithms focus on traditional statistics-based logistic regression [27,28,29] as well as mathematical prediction techniques, among which fuzzy theory models [30] and grey models [31] are also widely used.
However, owing to the complexity of prediction scenarios and the explosive increase in data volume, researchers are more inclined toward data-driven intelligent prediction techniques [8,32,33,34,35]. Geng et al. [8] proposed an improved intelligent early warning method based on moving window sparse principal component analysis (MWSPCA) applicable to complex chemical processes. The sparse principal component analysis algorithm was used to build the initial warning model, which was then updated using moving windows to make the warning model better suited to time-varying data. Malhotra et al. [32] proposed an encoder-decoder model based on LSTM networks. This method models the time dependence of the time series through LSTM networks and achieves better generalization capability than traditional methods. Related studies [33,34] proposed a deep belief network fault location (DBN-FL) model that sets a series of fault rule identification templates based on historical fault data, integrated data analysis results, and expert experience; the probability of fault occurrence is obtained by matching the real-time data distribution with the fault rules. In [35], an implementation of abnormal early warning based on probability distribution and density estimation was proposed, which extracts data features, calculates the probability of their distribution, and then distinguishes fault data from normal data according to the probability value. In [36], the performance of logistic regression and machine learning methods was evaluated, and the results show that these methods effectively improve prediction accuracy compared with traditional approaches.

3. Proposed Methods

3.1. Framework

Private 5G networks are a key development direction for operators serving business customers, and compared with public users, private networks place more demanding requirements on various types of performance. Once faults occur, industrial production is significantly affected; therefore, timely and efficient troubleshooting is particularly important. This poses a challenge to the O&M of private 5G networks: faults must be diagnosed or predicted quickly from high-dimensional network and service data to reduce their impact on services.
Following this research direction, we investigated the limitations of private 5G network O&M and found that most business scenarios focus on latency-sensitive AGVs and bandwidth-sensitive video transmission. According to feedback from O&M staff, abnormal AGV driving behavior occurred several times. Considering that this is highly correlated with high latency, the O&M staff deployed probes in the private 5G network parks to obtain monitoring data. The results show that, when the AGVs work abnormally, the latency of the private 5G network is higher than the specified threshold.
For the latency-sensitive scenario in private 5G networks, this study proposes an intelligent O&M framework, as shown in Figure 1, including anomaly detection and early warning models. First, based on the wireless data of private 5G networks, an anomaly detection model (ConvAE-Latency) based on an autoencoder network is proposed to detect whether the network latency is high by calculating the reconstruction error. Furthermore, building on the ConvAE-Latency model, the LSTM-based LstmAE-TL model is proposed, which can predict abnormal behavior 15 min in advance. In addition, transfer learning is used in the training process of LstmAE-TL to solve the problems of slow learning and difficult convergence caused by the limited number of samples.

3.2. Anomaly Detection

Traditional O&M methods are based on periodic inspection and work-order assignment. They suffer from low efficiency and untimely resource placement; thus, such non-preventive O&M fails to meet the needs of private 5G networks. Using artificial intelligence, big data, and other new technologies to achieve active, rapid, and accurate detection is a new trend in the development of O&M.
Because the autoencoder is a neural network trained with unsupervised learning, it can reconstruct the original data from hidden features, and iterative training brings the reconstructed data closer to the original data. If the input data are anomalous, the distribution of the reconstruction error differs. Based on this, we propose an autoencoder-based anomaly detection model (ConvAE-Latency) for latency-sensitive scenarios in private 5G networks. The wireless data of the base station are used for training, while the probed latency is used for labeling to distinguish normal and abnormal samples. Notably, only normal samples are used for training, and the results show that the ConvAE-Latency model reconstructs normal data well while abnormal samples are detected.
The network architecture of the ConvAE-Latency model is shown in Figure 2. It consists of three parts: an encoder, a decoder, and a latency classification network. In the encoder, two convolutional layers are used for dimensionality reduction because of the large dimensionality of the wireless parameters on the base station side; the training algorithm for the ConvAE-Latency model is shown in Algorithm 1. The encoder maps an input vector $x$ to a hidden representation $z$ through an affine mapping followed by a nonlinearity, as shown in Equation (1). The decoding process shown in Equation (2) then maps the latent representation of the hidden layer back to the original input space as the reconstructed output through a nonlinear transformation. The reconstruction goal is given by Equation (3), the difference between the original input vector $x$ and the reconstructed output vector $\hat{x}$. Notably, the reconstruction-error distribution of abnormal samples should be as far from that of normal samples as possible:
$z = k_{\theta_E}\left(W_E x + b_E\right)$,  (1)
$\hat{x} = k_{\sigma_D}\left(W_D z + b_D\right)$,  (2)
$E_x = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^{2}$,  (3)
where $x = [x_1, x_2, \ldots, x_n]$, $x \in \mathbb{R}^{m}$, $m$ represents the wireless parameter dimension, and $n$ represents the number of samples. $W$ and $b$ are the weight matrix and bias vector of the neural network, respectively, and $k$ denotes the nonlinear activation function.
Considering latency as the target indicator in this scenario, its correlation with the hidden features $z$ needs to be enhanced. In addition to reconstructing the input, dense networks are used to construct the latency classification network, which ensures that $z$ is strongly correlated with latency, using the binary cross-entropy loss function shown in Equation (4):
$L = \frac{1}{N}\sum_{i} L_i = -\frac{1}{N}\sum_{i}\left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$,  (4)
where $y_i$ denotes the label of sample $x_i$, that is, the positive class is 1 and the negative class is 0, with 30 ms used as the classification threshold. $p_i$ denotes the probability that sample $i$ is judged to be in the positive class.
Algorithm 1 The ConvAE-Latency model training algorithm
Input: Dataset $x$, label $y$
Output: Encoder $f_{\theta}$, decoder $g_{\sigma}$, classification $u_{\phi}$
1: $\theta, \sigma, \phi \leftarrow$ initialization parameters;
2: repeat
3:  $z = f_{\theta}(x)$;
4:  $\hat{x} = g_{\sigma}(z)$;
5:  $E_x = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^{2}$;
6:  $L = \frac{1}{N}\sum_{i} L_i = -\frac{1}{N}\sum_{i}\left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right]$;
7:  $\theta, \sigma, \phi \leftarrow$ update parameters according to the combination of $E_x$ and $L$;
8: until convergence of parameters $\theta, \sigma, \phi$
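For concreteness, the sketch below shows one way the model and the training step of Algorithm 1 could be realized. It assumes PyTorch; the kernel sizes, channel counts, dense-layer widths, and the weight alpha combining the two losses are illustrative assumptions not given in the paper (only the zero-padded 12 × 15 input and the 32-dimensional hidden representation, described in Section 4.2, are fixed).

```python
# Minimal sketch of the ConvAE-Latency model and one training step (Algorithm 1),
# assuming PyTorch. Layer sizes and the loss weight `alpha` are illustrative
# assumptions; only the 12x15 padded input and the 32-dim code come from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvAELatency(nn.Module):
    def __init__(self, latent_dim=32, out_dim=179):
        super().__init__()
        self.encoder = nn.Sequential(            # two conv layers for dimensionality reduction
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 12 * 15, latent_dim),
        )
        self.decoder = nn.Linear(latent_dim, out_dim)   # map z back to the 179-dim input space
        self.latency_head = nn.Sequential(              # dense latency classification network
            nn.Linear(latent_dim, 16), nn.ReLU(),
            nn.Linear(16, 1), nn.Sigmoid(),
        )

    def forward(self, x_img):
        z = self.encoder(x_img)          # Eq. (1): hidden representation
        x_hat = self.decoder(z)          # Eq. (2): reconstruction
        p = self.latency_head(z)         # probability of high latency for Eq. (4)
        return x_hat, p

def train_step(model, optimizer, x_img, x_flat, y, alpha=1.0):
    """One update combining reconstruction error E_x and BCE loss L (Algorithm 1, line 7)."""
    x_hat, p = model(x_img)
    e_x = ((x_flat - x_hat) ** 2).sum(dim=1).mean()     # Eq. (3)
    bce = F.binary_cross_entropy(p.squeeze(1), y)       # Eq. (4), labels y in {0, 1}
    loss = e_x + alpha * bce                            # the weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```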
As shown in Algorithm 2 and Figure 3, the ConvAE-Latency model is implemented as follows:
Algorithm 2 The ConvAE-Latency model based anomaly detection algorithm
Input: Dataset $x$, threshold $k$
Output: The results of anomaly detection
1: Load the pre-trained ConvAE-Latency model obtained from Algorithm 1;
2: $\hat{x} = g_{\sigma}(f_{\theta}(x))$: calculate the reconstruction result;
3: $E_x = \sum_{i=1}^{n} \lVert x_i - \hat{x}_i \rVert^{2}$: calculate the reconstruction error;
4: if $E_x > k$ then
5:  $x$ is an anomaly;
6: else
7:  $x$ is not an anomaly;
8: end if
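A possible realization of the detection rule in Algorithm 2 is sketched below, reusing the ConvAELatency sketch above. How the threshold $k$ is chosen is not specified in the paper, so the percentile-based choice in the comment is purely an assumption.

```python
# Sketch of the thresholded detection rule in Algorithm 2, assuming PyTorch and
# the ConvAELatency sketch above. The percentile-based threshold is an assumption.
import torch

@torch.no_grad()
def reconstruction_error(model, x_img, x_flat):
    x_hat, _ = model(x_img)                      # latency head is ignored at test time
    return ((x_flat - x_hat) ** 2).sum(dim=1)    # per-sample E_x

@torch.no_grad()
def detect_anomalies(model, x_img, x_flat, k):
    return reconstruction_error(model, x_img, x_flat) > k   # True = anomaly

# e.g. pick k from the normal-sample training errors (assumption):
# k = torch.quantile(reconstruction_error(model, train_img, train_flat), 0.99)
```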

3.3. Abnormal Early Warning

3.3.1. Network Architecture

In the previous section, we proposed the ConvAE-Latency model, which can effectively shorten troubleshooting time compared with O&M methods based on manual experience. However, it only detects the current network status. To further enhance the intelligence of O&M in private 5G networks, abnormal early warning is desired: possible abnormalities are predicted before network faults occur, so staff can prepare preventive measures and maintenance in advance to avoid or reduce the losses caused by network faults.
LSTM has a memory function in the time dimension and is often used in prediction schemes based on historical information. Therefore, building on the ConvAE-Latency model, this study proposes a transfer-learning-based abnormal early warning model (LstmAE-TL), as shown in Figure 4. The model is divided into two parts: the LSTM-based prediction network and the ConvAE-Latency model. The specific structure is described as follows.
The first part of the proposed model consists of a three-layer LSTM network and dense networks; the input data are the same as those for the ConvAE-Latency model. Notably, the input data are obtained by a sliding window algorithm. The output of the first part is the prediction of the wireless data $\hat{w}$ for the next period, and the prediction error is given by Equation (5). Subsequently, $\hat{w}$ is fed into the previously trained ConvAE-Latency model to predict whether there is high latency, which is the final output. The ConvAE-Latency model is frozen and not trained here, so that its judgment better reflects the real situation and is not distorted by inaccurate predictions. It is worth noting that the latency fitting module of ConvAE-Latency is not used. The training algorithm for the LstmAE-TL model is shown in Algorithm 3:
$E_w = \sum_{i=1}^{n} \lVert w_i - \hat{w}_i \rVert^{2}$,  (5)
where $w = (w_{t-1}, w_{t-2}, \ldots, w_{t-5})$, $w \in \mathbb{R}^{m \times n}$, $m$ represents the wireless parameter dimension, and $n$ represents the number of samples.
Algorithm 3 The LstmAE-TL model training algorithm
Input: Normal dataset $w_{\mathrm{normal}}$, anomaly dataset $w_{\mathrm{anomaly}}$
Output: Prediction network $h_{\xi}$
1: Load the pre-trained ConvAE-Latency model obtained from Algorithm 1;
2: $\xi \leftarrow$ initialization parameters;
3: repeat
4:  $E_w = \sum_{i=1}^{n} \lVert w_i - h_{\xi}(w_i) \rVert^{2}$: calculate the prediction error using only $w_{\mathrm{normal}}$;
5:  $E_x = \sum_{i=1}^{n} \lVert h_{\xi}(w_i) - g_{\sigma}(f_{\theta}(h_{\xi}(w_i))) \rVert^{2}$: calculate the reconstruction error of the prediction result of $w_{\mathrm{normal}}$ with the ConvAE-Latency model;
6:  $\xi \leftarrow$ update parameters according to the combination of $E_w$ and $E_x$;
7: until convergence of parameters $\xi$
8: Freeze part of the parameters in $\xi$;
9: repeat
10:  $E_w = \sum_{i=1}^{n} \lVert w_i - h_{\xi}(w_i) \rVert^{2}$: calculate the prediction error using a mix of $w_{\mathrm{anomaly}}$ and part of $w_{\mathrm{normal}}$;
11:  $\xi \leftarrow$ update parameters according to $E_w$;
12: until convergence of parameters $\xi$
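The sketch below shows one way the prediction network and the first training stage of Algorithm 3 could look. It assumes PyTorch; the LSTM and dense-layer widths and the weight beta combining E_w and E_x are assumptions, and convae is a pre-trained ConvAE-Latency model whose parameters have been frozen as described above.

```python
# Sketch of the LstmAE-TL prediction network and the first stage of Algorithm 3,
# assuming PyTorch. Hidden sizes and the weight `beta` are assumptions; `convae`
# is a pre-trained ConvAELatency whose parameters are frozen (requires_grad_(False)).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LstmPredictor(nn.Module):
    """Three-layer LSTM plus dense layers: maps the five previous 15-minute
    samples (each 179-dimensional) to a prediction for the next period."""
    def __init__(self, in_dim=179, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=3, batch_first=True)
        self.dense = nn.Sequential(
            nn.Linear(hidden, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, w_seq):                  # w_seq: (batch, 5, 179)
        out, _ = self.lstm(w_seq)
        return self.dense(out[:, -1, :])       # predicted wireless data for the next period

def normal_training_step(predictor, convae, optimizer, w_seq, w_next, beta=1.0):
    """Combine prediction error E_w with the frozen autoencoder's reconstruction
    error E_x on the predicted sample (Algorithm 3, lines 4-6)."""
    w_hat = predictor(w_seq)
    e_w = ((w_next - w_hat) ** 2).sum(dim=1).mean()      # Eq. (5)
    w_img = F.pad(w_hat, (0, 1)).view(-1, 1, 12, 15)     # zero-pad 179 -> 12x15 input
    x_hat, _ = convae(w_img)                             # frozen ConvAE-Latency
    e_x = ((w_hat - x_hat) ** 2).sum(dim=1).mean()
    loss = e_w + beta * e_x                              # the weighting is an assumption
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```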
As shown in Algorithm 4 and Figure 5, the implementation process of the LstmAE-TL model is as follows:
Algorithm 4 The LstmAE-TL model based abnormal early warning algorithm
Input: Dataset $w$, threshold $k$
Output: The results of abnormal early warning
1: Load the pre-trained ConvAE-Latency model obtained from Algorithm 1;
2: Load the prediction network $h_{\xi}$;
3: $E_x = \sum_{i=1}^{n} \lVert h_{\xi}(w_i) - g_{\sigma}(f_{\theta}(h_{\xi}(w_i))) \rVert^{2}$: calculate the reconstruction error;
4: if $E_x > k$ then
5:  the predicted sample is an anomaly;
6: else
7:  the predicted sample is not an anomaly;
8: end if

3.3.2. Transfer Learning

For any system, abnormal states are far less probable than normal operation. Moreover, the application scenarios of private 5G networks require highly reliable services, so the network itself is highly reliable and its abnormal states are rare. Therefore, the number of anomalous samples is very small relative to normal samples, which makes the network difficult to train and the anomalous portion of the samples difficult to fit; this is usually referred to as few-shot learning.
For the few-shot learning of anomalous samples, this study used transfer learning to solve this problem. The idea of transfer learning comes from human analogy learning, which aims to apply features learned from a high-quality dataset to other datasets under similar domains. In the environment of this study, although there is a gap between abnormal sample features and normal sample features, both occur in the same network environment and their feature extraction processes are similar; therefore, transfer learning can be used to apply the feature extraction framework learned from normal samples to abnormal samples to improve the accuracy of abnormal sample prediction.
In the previous section, a three-layer dense network was proposed to extract temporal features from the wireless base station data; it is the part of the network shared by normal and anomalous samples and is trained with transfer learning, unlike the training method used for anomaly detection. First, the proposed model is trained with normal samples to obtain a prediction network with good performance. Then, the LSTM of LstmAE-TL is frozen, and the model is trained again with a small number of normal samples mixed with anomalous samples to achieve accurate prediction of the anomalous part.

3.3.3. Data Sensitivity Analysis

In essence, the proposed abnormal early warning model can be considered the joint composition of two independent networks, both trained on the same dataset. Because the dataset is small and of limited quality, each network produces large errors, and these errors superimpose on each other, seriously degrading model performance.
Moreover, because the two networks are trained separately, no information is shared between them. Even when each network's final error is small, the errors arise in different places, and these small errors are significantly magnified as data pass from one network to the other, which degrades performance.
To address these two problems, this study adds the loss function of the fixed-weight autoencoder as an additional criterion during the training of the LSTM network: the prediction results obtained by the LSTM must not only conform to the true distribution but also be correctly encoded and decoded by the autoencoder.
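Written out, this joint criterion amounts to minimizing the following objective during LSTM training; the relative weight $\lambda$ is an assumption, since the paper only states that the two terms are combined:

```latex
% Joint objective for the LSTM stage: the prediction must match the true data (E_w)
% and must also be correctly reconstructable by the frozen autoencoder (E_x).
% The relative weight \lambda is an assumption.
\mathcal{L}_{\mathrm{LSTM}}
  = \underbrace{\sum_{i=1}^{n} \lVert w_i - \hat{w}_i \rVert^{2}}_{E_w}
  + \lambda \, \underbrace{\sum_{i=1}^{n}
      \lVert \hat{w}_i - g_{\sigma}\!\left( f_{\theta}(\hat{w}_i) \right) \rVert^{2}}_{E_x}
```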

4. Analysis of Results

4.1. Data Set Description

The dataset was selected from the downlink network status data of a single DU under the exclusive base station of a private 5G network, covering 16 days with one data point every 15 min, for a total of 1536 data points. The dataset includes PRB, CQI, MCS, and other indicators, the entries of which are listed in Table 1. According to the pre-processing method, the data fall into two categories: normalized data, including the total number of PRBs and the total number of CQI reports, which are scaled to the 0–1 range using min–max normalization, and percentage data, including PRB utilization and retransmission rate, which are not normalized. The final dataset contains 179 dimensions.
It is worth noting that, to save resources, the base station sleeps for a period of time every day at regular intervals, and the data collected during the sleep period are blank.
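A minimal pre-processing sketch consistent with the description above is shown below, assuming pandas. The column lists are placeholders; count-type columns follow the min–max normalization described earlier, percentage columns are left unchanged, and blank sleep-period rows are dropped (these are removed before training, as noted in Section 4.3.1).

```python
# Pre-processing sketch for the dataset in Section 4.1, assuming pandas.
# Column lists are placeholders; count-type columns are min-max normalized to
# [0, 1], percentage columns are kept as-is, and blank sleep-period rows are dropped.
import pandas as pd

def preprocess(df: pd.DataFrame, count_cols: list, pct_cols: list) -> pd.DataFrame:
    df = df.dropna(how="all", subset=count_cols + pct_cols)   # remove sleep-period blanks
    out = df.copy()
    for col in count_cols:                                    # e.g. PRB.UsedDl, CQI.NumInTable
        lo, hi = df[col].min(), df[col].max()
        out[col] = (df[col] - lo) / (hi - lo) if hi > lo else 0.0
    # percentage columns (e.g. PRB utilization, retransmission rate) stay unnormalized
    return out
```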

4.2. Anomaly Detection Results

For the input of the autoencoder-based anomaly detection model, all the data were randomly divided into training and test sets in a ratio of 2:1, and zeros were added after the existing 179-dimensional KPIs, thus forming a 12 × 15 matrix, which was fed into the convolutional network and decoded again to 179 dimensions after being compressed and coded to 32 dimensions. Owing to the high dimensionality of the data, MSE was used for the loss function to avoid large local errors. The learning rate was reduced by half every 50 epochs during the training process.
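The training configuration described above can be expressed, for example, with a step learning-rate scheduler; the optimizer, initial learning rate, and epoch count in the sketch are assumptions, while the MSE objective and the halving every 50 epochs follow the text.

```python
# Training-loop sketch for the anomaly detection model, assuming PyTorch and the
# ConvAELatency/train_step sketches above. Adam, the 1e-3 initial rate, and 200
# epochs are assumptions; halving the rate every 50 epochs follows the text.
import torch

model = ConvAELatency()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

for epoch in range(200):
    for x_img, x_flat, y in train_loader:        # train_loader (assumed) yields 12x15 inputs
        train_step(model, optimizer, x_img, x_flat, y)
    scheduler.step()                             # halve the learning rate every 50 epochs
```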
Figure 6 and Figure 7 show the distribution curves and the cumulative distribution function (CDF) of reconstruction error after the autoencoder for abnormal and normal samples, respectively, where red and blue lines represent the abnormal and normal samples, respectively.
Figure 8 and Figure 9 show the distribution curves and the CDF of the reconstruction error without the latency fitting module; compared with this baseline, the ConvAE-Latency model works better.
The results show a significant difference between the error distributions of normal and abnormal samples, and the decoding error of abnormal samples is significantly higher than that of most normal samples; however, some normal samples also exhibit larger errors. This is because all data used to train the autoencoder are expected to be normal samples, yet latency is the only criterion used to screen them. Some screened "normal" samples may therefore be abnormal in ways that latency cannot characterize, and these "fake" normal samples produce large errors in the training results.

4.3. Abnormal Early Warning Results

For abnormal early warning, the LstmAE-TL model is proposed, and the training process was divided into two parts: normal training and transfer learning.

4.3.1. Normal Sample Training

The purpose of normal sample training is to achieve accurate prediction of the network for normal samples; therefore, the dataset of normal samples is selected and randomly divided into training and test sets at a ratio of 2:1. The input data are the data of the five time periods before the period to be predicted, and the output data are the 179-dimensional network state data of the current time period. The loss function of the proposed model consists of two parts: LSTM-based abnormal early warning and autoencoder-based anomaly detection. The training first requires the prediction results to match the real situation, such that the mean square error (MSE) between the prediction results and the real data is as small as possible. Second, the prediction results must be correctly reconstructed by the autoencoder; therefore, the prediction results are fed into the anomaly detection network, and the MSE between the reconstructed data and the predicted data is calculated again. The learning rate was reduced by half every 100 epochs during training.
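The sliding-window construction described above can be sketched as follows, assuming the pre-processed data are held in a NumPy array of shape (number of samples, 179):

```python
# Sliding-window construction: the five previous 15-minute samples form the
# input and the current 179-dimensional sample is the prediction target.
import numpy as np

def make_windows(data: np.ndarray, window: int = 5):
    inputs, targets = [], []
    for t in range(window, len(data)):
        inputs.append(data[t - window:t])   # previous five periods, shape (5, 179)
        targets.append(data[t])             # current period to be predicted
    return np.stack(inputs), np.stack(targets)
```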
Figure 10 shows the MSE of the prediction data based on the normal dataset; the results indicate that the error of the network increases periodically over time. Most of the error spikes occur at the beginning of the day, presumably because the base stations are just starting to work during these hours and their performance and service are unstable. Because the base stations are intermittently dormant to save energy, the data for dormant periods are set to zero. As mentioned in Section 4.1, these dormant-period data are removed during pre-processing, which makes the time series discontinuous around the dormant periods and therefore difficult to predict accurately.

4.3.2. Effectiveness of Transfer Learning

The purpose of transfer learning is to improve the prediction accuracy for abnormal samples. Based on the early warning network obtained above, this study freezes the three-layer LSTM network and subsequently trains only the three-layer dense network. Similarly, all abnormal samples were randomly divided into training and test sets at a ratio of 2:1. To increase the weight of the abnormal samples, each sample in the training set was repeated twice and then mixed with an equal number of randomly selected normal samples to form the complete training and test sets. Because the prediction results of anomalous samples are expected to produce large errors when input into the autoencoder (while the opposite is true for normal samples), the transfer learning process does not use the reconstruction error as a loss function and only considers the accuracy of the fitted prediction. The learning rate was set to one-tenth of the initial learning rate used in normal training and was likewise halved every 100 epochs.
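A sketch of this fine-tuning stage is given below, assuming PyTorch and the LstmPredictor sketch above. The index arrays and the initial learning rate are assumptions, while freezing the LSTM, repeating each abnormal sample twice, mixing in an equal number of normal samples, and halving the rate every 100 epochs follow the text.

```python
# Transfer-learning stage sketch, assuming PyTorch and the LstmPredictor above.
# `abnormal_idx` / `normal_idx` are assumed index arrays into the windowed dataset.
import numpy as np
import torch

for p in predictor.lstm.parameters():            # freeze the three-layer LSTM
    p.requires_grad = False

finetune_opt = torch.optim.Adam(predictor.dense.parameters(), lr=1e-4)  # 1/10 of the assumed 1e-3
finetune_sched = torch.optim.lr_scheduler.StepLR(finetune_opt, step_size=100, gamma=0.5)

abn_idx = np.repeat(abnormal_idx, 2)             # each abnormal (window, target) pair twice
norm_idx = np.random.choice(normal_idx, size=len(abn_idx), replace=False)
train_idx = np.random.permutation(np.concatenate([abn_idx, norm_idx]))
# ...iterate over (inputs[train_idx], targets[train_idx]) and update only the dense
# layers using the prediction error E_w, without the reconstruction term.
```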
The distribution curves and the CDF of the autoencoder reconstruction errors for abnormal and normal samples are shown in Figure 11 and Figure 12, where red and blue lines represent abnormal and normal samples, respectively. Meanwhile, Figure 13 and Figure 14 show the results of abnormal early warning without transfer learning.
The results show that the distributions of abnormal and normal samples differ, and the LstmAE-TL model works better than training without transfer learning. However, the distributions of some samples overlap, and the error of normal samples has increased: in transfer learning, to meet the prediction performance of abnormal samples, the prediction of some normal samples is inevitably sacrificed. In addition, the sampling interval of the data selected in this study was 15 min, and for some abnormal samples the abnormality may arise within 15 min of the current time period, so it is not yet reflected in the previous periods; such abnormal points lack temporal characteristics, and their prediction accuracy is poor.

5. Conclusions

To meet the requirements of high-quality operation and maintenance of private 5G networks, this paper proposes the ConvAE-Latency model based on the autoencoder and the LstmAE-TL model based on LSTM. This work is verified on China Telecom private 5G networks. Transfer learning is introduced in training to solve the problem of few fault samples. The results show that the two models achieve anomaly detection and prediction, and their performance is significantly improved compared with existing research. In the future, we plan to use the hidden features of the proposed models combined with clustering algorithms to achieve anomaly location.

Author Contributions

Methodology, J.H. and J.M.; software, J.M.; validation, J.H. and J.M.; investigation, Y.X.; writing—J.H.; writing—review and editing, Y.Z.; supervision, T.L. and Y.Z.; project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Network Research Program of China Telecom (Grant No. I202207) and the Research and Development of Atomic Capabilities for 5G Network Services and Operations (Grant No. T202210).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rijal, L.; Colomo-Palacios, R.; Sánchez-Gordón, M. AIOps: A multivocal literature review. In Artificial Intelligence for Cloud and Edge Computing; Springer: Cham, Switzerland, 2022; pp. 31–50.
  2. Hua, Y. A Systems Approach to Effective AIOps Implementation. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2021.
  3. Notaro, P.; Cardoso, J.; Gerndt, M. A Survey of AIOps Methods for Failure Management. ACM Trans. Intell. Syst. Technol. (TIST) 2021, 12, 1–45.
  4. Abusitta, A.; Silva de Carvalho, G.H.; Abdel Wahab, O.; Halabi, T.; Fung, B.; Al Mamoori, S. Deep Learning-Enabled Anomaly Detection for IoT Systems. Available online: https://ssrn.com/abstract=4258930 (accessed on 7 November 2022).
  5. Chen, Z.; Yeo, C.K.; Lee, B.S.; Lau, C.T. Autoencoder-based network anomaly detection. In Proceedings of the 2018 Wireless Telecommunications Symposium (WTS), Phoenix, AZ, USA, 17–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–5.
  6. Garg, A.; Zhang, W.; Samaran, J.; Savitha, R.; Foo, C.S. An evaluation of anomaly detection and diagnosis in multivariate time series. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 2508–2517.
  7. Sun, J.; Wang, J.; Hao, Z.; Zhu, M.; Sun, H.; Wei, M.; Dong, K. AC-LSTM: Anomaly State Perception of Infrared Point Targets Based on CNN+LSTM. Remote Sens. 2022, 14, 3221.
  8. Geng, Z.; Chen, N.; Han, Y.; Ma, B. An improved intelligent early warning method based on MWSPCA and its application in complex chemical processes. Can. J. Chem. Eng. 2020, 98, 1307–1318.
  9. Qi, J.; Chu, Y.; He, L. Iterative anomaly detection algorithm based on time series analysis. In Proceedings of the 2018 IEEE 15th International Conference on Mobile Ad Hoc and Sensor Systems (MASS), Chengdu, China, 9–12 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 548–552.
  10. Pena, E.H.; de Assis, M.V.; Proença, M.L. Anomaly detection using forecasting methods arima and hwds. In Proceedings of the 2013 32nd International Conference of the Chilean Computer Science Society (SCCC), Temuco, Chile, 11–15 November 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 63–66.
  11. Shyu, M.L.; Chen, S.C.; Sarinnapakorn, K.; Chang, L. A Novel Anomaly Detection Scheme Based on Principal Component Classifier; Technical Report; Department of Electrical and Computer Engineering, University of Miami: Coral Gables, FL, USA, 2003.
  12. Liu, F.T.; Ting, K.M.; Zhou, Z.H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 413–422.
  13. Zong, B.; Song, Q.; Min, M.R.; Cheng, W.; Lumezanu, C.; Cho, D.; Chen, H. Deep autoencoding gaussian mixture model for unsupervised anomaly detection. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018.
  14. Zhang, C.; Song, D.; Chen, Y.; Feng, X.; Lumezanu, C.; Cheng, W.; Ni, J.; Zong, B.; Chen, H.; Chawla, N.V. A deep neural network for unsupervised anomaly detection and diagnosis in multivariate time series data. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1409–1416.
  15. He, Y.; Zhao, J. Temporal convolutional networks for anomaly detection in time series. J. Phys. Conf. Ser. 2019, 1213, 042050.
  16. Munir, M.; Siddiqui, S.A.; Dengel, A.; Ahmed, S. DeepAnT: A deep learning approach for unsupervised anomaly detection in time series. IEEE Access 2018, 7, 1991–2005.
  17. Laptev, N.; Amizadeh, S.; Flint, I. Generic and scalable framework for automated time-series anomaly detection. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 1939–1947.
  18. Xu, H.; Chen, W.; Zhao, N.; Li, Z.; Bu, J.; Li, Z.; Liu, Y.; Zhao, Y.; Pei, D.; Feng, Y.; et al. Unsupervised anomaly detection via variational auto-encoder for seasonal kpis in web applications. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 187–196.
  19. Liu, D.; Zhao, Y.; Xu, H.; Sun, Y.; Pei, D.; Luo, J.; Jing, X.; Feng, M. Opprentice: Towards practical and automatic anomaly detection through machine learning. In Proceedings of the 2015 Internet Measurement Conference, Tokyo, Japan, 28–30 October 2015; pp. 211–224.
  20. Ahmed, F.; Erman, J.; Ge, Z.; Liu, A.X.; Wang, J.; Yan, H. Detecting and localizing end-to-end performance degradation for cellular data services. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, Portland, OR, USA, 15–19 June 2015; pp. 459–460.
  21. Nguyen, B.; Ge, Z.; Van der Merwe, J.; Yan, H.; Yates, J. Absence: Usage-based failure detection in mobile networks. In Proceedings of the 21st Annual International Conference on Mobile Computing and Networking, Paris, France, 7–11 September 2015; pp. 464–476.
  22. Sun, Y.; Zhao, Y.; Su, Y.; Liu, D.; Nie, X.; Meng, Y.; Cheng, S.; Pei, D.; Zhang, S.; Qu, X.; et al. Hotspot: Anomaly localization for additive kpis with multi-dimensional attributes. IEEE Access 2018, 6, 10909–10923.
  23. Lin, Q.; Lou, J.G.; Zhang, H.; Zhang, D. iDice: Problem identification for emerging issues. In Proceedings of the 38th International Conference on Software Engineering, Austin, TX, USA, 14–22 May 2016; pp. 214–224.
  24. Bhagwan, R.; Kumar, R.; Ramjee, R.; Varghese, G.; Mohapatra, S.; Manoharan, H.; Shah, P. Adtributor: Revenue debugging in advertising systems. In Proceedings of the 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), Seattle, WA, USA, 2–4 April 2014; pp. 43–55.
  25. Chen, M.Y.; Kiciman, E.; Fratkin, E.; Fox, A.; Brewer, E. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of the International Conference on Dependable Systems and Networks, Washington, DC, USA, 23–26 June 2002; IEEE: Piscataway, NJ, USA, 2002; pp. 595–604.
  26. Yan, H.; Flavel, A.; Ge, Z.; Gerber, A.; Massey, D.; Papadopoulos, C.; Shah, H.; Yates, J. Argus: End-to-end service anomaly detection and localization from an isp's point of view. In Proceedings of the 2012 Proceedings IEEE INFOCOM, Orlando, FL, USA, 25–30 March 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 2756–2760.
  27. Khoshgoftaar, T.M.; Gao, K.; Szabo, R.M. An application of zero-inflated poisson regression for software fault prediction. In Proceedings of the 12th International Symposium on Software Reliability Engineering, Hong Kong, China, 27–30 November 2001; IEEE: Piscataway, NJ, USA, 2001; pp. 66–73.
  28. Lessmann, S.; Baesens, B.; Mues, C.; Pietsch, S. Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Trans. Softw. Eng. 2008, 34, 485–496.
  29. Nagappan, N.; Ball, T. Static analysis tools as early indicators of pre-release defect density. In Proceedings of the 27th International Conference on Software Engineering, ICSE 2005, St. Louis, MO, USA, 15–21 May 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 580–586.
  30. Hu, L.Q.; He, C.F.; Cai, Z.Q.; Wen, L.; Ren, T. Track circuit fault prediction method based on grey theory and expert system. J. Vis. Commun. Image Represent. 2019, 58, 37–45.
  31. Wende, T.; Minggang, H.; Chuankun, L. Fault prediction based on dynamic model and grey time series model in chemical processes. Chin. J. Chem. Eng. 2014, 22, 643–650.
  32. Malhotra, P.; Vig, L.; Shroff, G.; Agarwal, P. Long short term memory networks for anomaly detection in time series. In Proceedings of the 23rd European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, ESANN 2015, Bruges, Belgium, 22–24 April 2015; Volume 89, pp. 89–94.
  33. Yu, A.; Yang, H.; Yao, Q.; Li, Y.; Guo, H.; Peng, T.; Li, H.; Zhang, J. Accurate fault location using deep belief network for optical fronthaul networks in 5G and beyond. IEEE Access 2019, 7, 77932–77943.
  34. Zhao, X.; Yang, H.; Guo, H.; Peng, T.; Zhang, J. Accurate fault location based on deep neural evolution network in optical networks for 5G and beyond. In Proceedings of the Optical Fiber Communication Conference, San Diego, CA, USA, 3–7 March 2019; p. M3J–5.
  35. Mulvey, D.; Foh, C.H.; Imran, M.A.; Tafazolli, R. Cell fault management using machine learning techniques. IEEE Access 2019, 7, 124514–124539.
  36. Malhotra, R. Comparative analysis of statistical and machine learning methods for predicting faulty modules. Appl. Soft Comput. 2014, 21, 286–297.
Figure 1. The framework of Intelligent O&M.
Figure 2. The network architecture of the ConvAE-Latency model.
Figure 3. Implementation process of the ConvAE-Latency model.
Figure 4. Network architecture of the LstmAE-TL model.
Figure 5. Implementation process of the LstmAE-TL model.
Figure 6. The reconstruction-error distribution for the ConvAE-Latency model.
Figure 7. The CDF of reconstruction error for the ConvAE-Latency model.
Figure 8. The reconstruction-error distribution without the latency fitting module.
Figure 9. The CDF of reconstruction error without the latency fitting module.
Figure 10. MSE of the prediction data based on the normal dataset.
Figure 11. The reconstruction-error distribution for the LstmAE-TL model.
Figure 12. The CDF of reconstruction error for the LstmAE-TL model.
Figure 13. The reconstruction-error distribution without transfer learning.
Figure 14. The CDF of reconstruction error without transfer learning.
Table 1. Sample wireless data.
Class ID | Meaning
PRB.UsedDl | Total number of downlink PRBs used
PRB.AvailDl | Total number of downlink PRBs available
MAC.TxBytesDl | Number of bits sent at the downlink MAC layer
MAC.TransIniTBDl | Initial TB count at the downlink MAC layer
HARQ.ReTransTBDl | Retransmission TB count
CQI.NumInTable | CQI report count
PRB.NumRbOfTableMCSDl | Number of PRBs used at each MCS level
PHYCell.MeanTxPower | Transmit power of the physical layer
CELL.PowerSaveStateTime | Host standby sleep time
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
