1. Introduction
The advent of fifth-generation (5G) technology, which is capable of integrated communication, presents an excellent opportunity to promote resource management that is informed, sustainable, and robust. This technology provides real-time data on the environmental impacts of different behaviors, opening new possibilities for in-depth analytics. The ability to monitor and control systems quickly has facilitated the creation of intelligent environments. In this context, the Internet of Things (IoT) is expected to emerge as a vital component of 5G applications, particularly in the sensing and monitoring domains [1,2]. The IoT is explicitly designed for massive machine-type communication (mMTC) applications [3], which emphasizes its significance in the evolving world of 5G technologies.
The IoT is crucial in various fields, ranging from environmental monitoring [4,5,6] to healthcare systems [7,8]. These networks offer a cost-effective means of sensing, collecting, and processing environmental information. However, the deployment of IoTs exposes them to potential data corruption resulting from factors such as device failure, signal interference, and adverse environmental conditions [9]. Consequently, reconstructing corrupted data has become crucial for maintaining the precision and reliability of information collected by IoTs in 5G networks.
A key consideration in addressing the challenges of data reconstruction is gathering real IoT data. Typically, most recently developed wireless sensor network (WSN) appliances utilize sensors with multiple sensing units to detect different variables such as temperature, pressure, O2, and NO2 [10]. Therefore, sensor nodes often collect data with multiple attributes, resulting in multivariate data. In addition, the collected data exhibit spatiotemporal and multivariate correlations. The correlations among these attributes can be leveraged to enhance data reconstruction performance. Despite this promising direction, only one work has delved into multivariate data reconstruction in the IoT [11], with limited success.
Using a tensor structure is beneficial for data reconstruction and demonstrates proficiency in harnessing spatiotemporal correlations, particularly for multivariate data. Consequently, traditional methods such as tensor completion (TC) and the tensor robust principal component analysis (TRPCA) [11,12] have been widely employed for both univariate and multivariate data reconstruction. However, these approaches have limitations. They often ignore differences in singular values, treating them uniformly. In the context of data reconstruction, this oversight may result in suboptimal performance.
In practice, the challenges associated with data reconstruction are often complex and involve various forms of corruption, including multiple types of noise, outliers, and missing values. However, previous studies mainly focused on individual types of corruption [11,13]. The corruption of IoT data can occur owing to various factors, including noise, outliers, and missing values. In this study, we consider two types of noise: Gaussian and impulsive. Gaussian noise results from hardware imperfections and computational limitations and follows a Gaussian distribution [14]. By contrast, impulsive noise appears suddenly and disrupts information more intensely [15]. It can be triggered by external factors, such as power surges, electromagnetic interference, and faults in sensor nodes, resulting in abrupt changes and extreme data values. Outliers are values that deviate significantly from the normal range. They indicate severe errors in measurements, such as sensor malfunctions, inaccurate readings, and unexpected events in the environment. Outliers are considered abnormal and can significantly affect the overall behavior of data [16]. Missing values are a common issue in IoT data and can occur for various reasons. Sensor nodes may fail to transmit data owing to hardware failures, network problems, or limited energy resources. Environmental conditions, such as signal attenuation or obstacles, can also lead to data loss during transmission [17]. Detecting and addressing these types of data corruption is crucial for ensuring the accuracy and reliability of information in IoTs, and practitioners must learn to handle the different types of corruption effectively to provide high-quality data for various applications. Because previous research focused only on single types of corruption, such as Gaussian noise, outliers, or missing values, addressing all types of data corruption jointly is essential to ensuring accurate data reconstruction.
This study proposes a new approach to addressing these challenges and makes significant contributions to the field. To the best of our knowledge, this is the first study to address complex corruption encompassing mixed noise, outliers, and missing values in multivariate sensing data within IoTs operating in 5G environments using a weighted robust tensor principal component analysis (WRTPCA). The primary contributions of this study are as follows:
This work presents an enhanced method for multivariate data reconstruction in 5G-operating IoTs. The TRPCA is combined with TC to enhance the method’s ability to handle missing values, noise, and outliers. This unique approach leverages correlations among multiple attributes to improve reconstruction performance.
This study introduces a weighted approach to TC and the TRPCA, offering a means of handling singular values. In contrast to traditional methods, the proposed approach uses weighted tensor singular value thresholding (WTSVT) to shrink singular values based on their importance, potentially boosting reconstruction accuracy.
The proposed approach effectively tackles complex types of corrupted data, such as mixture noise, outliers, and missing values, and stands out in comparison to other models.
This paper is organized as follows: Section 1 introduces the core concepts related to the study, while Section 2 surveys the existing literature and related works on IoTs, TC, and the TRPCA. Section 3 provides the necessary preliminaries, introducing tensors, TC, and the TRPCA in detail. The methodology is presented in Section 4, in which each step of the proposed model is thoroughly explained. An experimental validation of the model, including dataset descriptions, an understanding of the low-rank structure in the dataset, and a description of the experimental setup, is presented in Section 5. Section 6 concludes the paper, summarizes its contributions, and discusses potential directions for future research.
4. The Proposed Model
Many studies regarding IoT data corruption have not taken into account all the different types of corruption that can occur. In the proposed model, we use a combination of the WRTPCA and weighted tensor completion (WTC). The input data for this method comprise a tensor that has been damaged with missing values, mixed noise, and outliers. We apply WTSVT (Algorithm 2) in both WTC and the WRTPCA. WTC (Algorithm 3) produces a new tensor that has no missing values. This new tensor is then input into the WRTPCA (Algorithm 4). After applying the proposed model, we obtain a final tensor that is free of missing values, mixture noise, and outliers.
Figure 6 provides an overview of this proposed model. Before discussing the details of this method, we present the mathematical foundations of WTC and the WRTPCA.
To recover the missing-value tensor, Equation (17) can be solved to extract the low-rank structure from the incomplete tensor data using the alternating direction method of multipliers (ADMM) [24].
The WRTPCA optimization model, depicted in Equation (13), is frequently resolved by employing the ADMM [33]. The solution to the low-rank optimization problem can be expressed as follows:
Algorithm 2 Weighted tensor singular value thresholding method (WTSVT).
Input: tensor X of size n1 × n2 × n3, weight vector w, penalty factor rho
 1: X_f = fft(X, [], 3)
 2: for i = 1 to n3 do
 3:   [U_f^(i), D_f^(i), V_f^(i)] = SVD(X_f^(i))
 4:   d = diag(D_f^(i))
 5:   d = max(d − w / rho, 0)
 6:   D_f^(i) = diag(d)
 7: end for
 8: U = ifft(U_f, [], 3), D = ifft(D_f, [], 3), V = ifft(V_f, [], 3)
Output: orthogonal tensors U and V, and core tensor D
Weighted tensor singular value thresholding (WTSVT), with a threshold based on the weighted tensor nuclear norm (WTNN), employs t-SVD [24], which overcomes the drawbacks of using a fixed threshold for each singular value. Equation (18) for TC and Equation (21) in the TRPCA are solved using WTSVT. This method helps recover missing values in TC and minimizes noise and outliers in the TRPCA [33]. To maintain essential data components, WTSVT is used to optimize the low-rank tensor. WTSVT calculates the total weighted singular values over all the frontal slices of the tensor data, ensuring that greater singular values decrease less. The WTSVT operator used in Algorithms 3 and 4 is defined in Algorithm 2. The objective of Algorithm 2 is to break the input tensor down into three tensors, with the middle tensor being a low-rank core tensor. The inputs for WTSVT are the input tensor X, the weight vector w, and the penalty factor rho. The notation fft represents the fast Fourier transform used to compute the discrete Fourier transform (DFT) of a sequence; conversely, ifft represents the inverse fast Fourier transform, which essentially reverses the FFT process. First, the input tensor is transformed using fft along the third mode. Next, the loop of Algorithm 2 begins, and the following steps are applied to each frontal slice. The SVD breaks a matrix down into three separate matrices: a left singular matrix, a diagonal matrix of singular values, and a right singular matrix. The diag operation extracts the diagonal elements of the singular matrix as a vector, from which the weight vector, divided by rho, is subtracted. Subsequently, diag converts the shrunken vector back into a diagonal matrix; in this way, the weights shrink the diagonal elements of the singular matrix D. After all frontal slices have been processed, ifft is applied to the three components to obtain the results.
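To make the operator concrete, the following is a minimal Python/NumPy sketch of WTSVT, assuming a real three-way array, a per-singular-value weight vector `w`, and a penalty factor `rho` (the variable names are ours, not taken from the paper):

```python
import numpy as np

def wtsvt(X, w, rho):
    """Weighted tensor singular value thresholding (sketch of Algorithm 2).

    X   : real 3-way array of shape (n1, n2, n3)
    w   : weight vector, one weight per singular value, length min(n1, n2)
    rho : penalty factor; the per-value threshold is w / rho
    """
    n1, n2, n3 = X.shape
    Xf = np.fft.fft(X, axis=2)                 # DFT along the third mode
    Lf = np.zeros_like(Xf)
    for k in range(n3):                        # process each frontal slice
        U, s, Vh = np.linalg.svd(Xf[:, :, k], full_matrices=False)
        s_shr = np.maximum(s - w / rho, 0.0)   # weighted soft-thresholding
        Lf[:, :, k] = (U * s_shr) @ Vh         # reassemble the shrunken slice
    return np.real(np.fft.ifft(Lf, axis=2))    # back to the original domain
```

Assigning smaller weights to the leading singular values implements the principle stated above: the larger a singular value, the less it is shrunk.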
Algorithm 3 Weighted tensor completion.
Input: tensor M with missing values on the observed support Omega, weight vector w
Initialize: X = P_Omega(M), Z = X, Y = 0, rho, rho_max, mu, epsilon
while not converged do
  Update X = WTSVT(Z − Y / rho, w, rho)
  Update Z = X + Y / rho, subject to P_Omega(Z) = P_Omega(M)
  Update Y = Y + rho (X − Z)
  Update rho = min(mu rho, rho_max)
  Check the convergence criteria: ||X − Z||_inf ≤ epsilon or ||X_{k+1} − X_k||_inf ≤ epsilon or ||Z_{k+1} − Z_k||_inf ≤ epsilon
end while
Output: low-rank tensor Z without missing values
The solution to Equation (18) is obtained through WTSVT (Algorithm 2), and Equation (19) represents the least-squares projection constrained by the problem [42]. The Omega-indicator function maps the elements of the observed subset Omega to one and all other elements to zero. Following MATLAB R2021a notation, colon indexing signifies the tensor-to-vector conversion. To recover the missing values, the ADMM is implemented, the details of which are provided in Algorithm 3. The input to Algorithm 3 includes the tensor M with missing values and the weight vector w. First, the low-rank tensor, the auxiliary tensor, and the dual tensor are initialized from the observed entries of M and from zero tensors. Here, the low-rank tensor is the output without any missing values, the auxiliary tensor helps the algorithm enforce the constraint, and the dual tensor is the dual variable of the ADMM, updated using the augmented Lagrange penalty parameter. The step size is denoted as rho, and a tensor with all its elements set to zero is denoted as 0. The remaining parameters epsilon and mu are the stopping threshold of the algorithm and the degree of increase in the step size rho in each iteration, respectively. Starting with the loop in Algorithm 3, the low-rank tensor is updated using WTSVT from Algorithm 2. The auxiliary tensor is then updated while ensuring that its observed entries match those of M. The dual variable and the step size rho are updated afterward. The algorithm stops updating when the convergence criteria of Algorithm 3 fall below the threshold epsilon.
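The update loop of Algorithm 3 can be sketched as a standard ADMM for weighted nuclear-norm completion, with WTSVT as the low-rank proximal step. The following Python/NumPy sketch is illustrative only, not the paper's reference implementation; the parameter names and default values are our assumptions:

```python
import numpy as np

def wtsvt(X, w, rho):
    # Weighted tensor SVT (Algorithm 2, sketch): FFT along the third mode,
    # per-slice SVD, weighted shrinkage, then the inverse FFT.
    Xf = np.fft.fft(X, axis=2)
    Lf = np.zeros_like(Xf)
    for k in range(X.shape[2]):
        U, s, Vh = np.linalg.svd(Xf[:, :, k], full_matrices=False)
        Lf[:, :, k] = (U * np.maximum(s - w / rho, 0.0)) @ Vh
    return np.real(np.fft.ifft(Lf, axis=2))

def weighted_tensor_completion(M, mask, w, rho=1e-2, mu=1.1, rho_max=1e6,
                               eps=1e-7, max_iter=300):
    """ADMM sketch of Algorithm 3 (hypothetical defaults).

    M    : observed tensor (entries at missing positions are ignored)
    mask : boolean tensor, True where M is observed
    """
    X = np.where(mask, M, 0.0)     # low-rank variable
    Z = X.copy()                   # auxiliary variable carrying the constraint
    Y = np.zeros_like(X)           # ADMM dual variable
    for _ in range(max_iter):
        X = wtsvt(Z - Y / rho, w, rho)    # low-rank update via WTSVT
        Z = X + Y / rho
        Z[mask] = M[mask]                 # enforce the observed entries
        Y = Y + rho * (X - Z)             # dual ascent step
        rho = min(mu * rho, rho_max)      # grow the step size
        if np.abs(X - Z).max() <= eps:    # convergence check
            break
    return Z                              # completed tensor
```

Growing `rho` by a factor `mu` each iteration mirrors the step-size schedule described in the text and speeds up convergence.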
Algorithm 4 Weighted robust tensor principal component analysis integrated with weighted tensor completion.
Input: tensor M with mixed noise, outliers, and missing values; weight vector w
Initialize: L = 0, S = 0, Y = 0, rho, rho_max, mu, lambda, epsilon
Reconstruct the incomplete M by Algorithm 3
while not converged do
  Update L = WTSVT(M − S + Y / rho, w, rho)
  Update S = prox_{lambda/rho}(M − L + Y / rho)
  Update Y = Y + rho (M − L − S)
  Update rho = min(mu rho, rho_max)
  Check the convergence criteria: ||M − L − S||_inf ≤ epsilon or ||L_{k+1} − L_k||_inf ≤ epsilon or ||S_{k+1} − S_k||_inf ≤ epsilon
end while
Output: low-rank tensor L without missing values; sparse tensor S containing the mixture noise and outliers
To reduce multiple noises and outliers, Equation (21) can be solved using WTSVT (Algorithm 2), and the solution of Equation (22) is a proximal function that shrinks all values. The details of the process of reducing mixture noise and outliers and reconstructing missing values are given in Algorithm 4, which describes the WRTPCA integrated with WTC and uses the ADMM to extract a low-rank tensor without missing values and a sparse noise tensor from the damaged data tensor. This helps reconstruct missing data and reduce multiple noises and outliers. In Algorithm 4, we first initialize the components and parameters; a tensor with all its elements set to zero is denoted as 0. The low-rank tensor L represents the normal data without missing values, noise, and outliers, and the sparse tensor S captures the outliers and mixed noise. The ADMM updates the dual variable Y using the augmented Lagrange penalty parameter. The step size is denoted by rho, and the maximum step size is rho_max. The factor mu is the level of increase in the step size rho in each iteration; rho can be kept fixed, but increasing it at each iteration helps accelerate convergence. The parameters epsilon, lambda, and w are the threshold of the convergence condition, the balance parameter, and the weight vector, respectively. Subsequently, the missing values of the low-rank tensor are filled using Algorithm 3. The loop of Algorithm 4 then starts reducing the mixed noise and outliers and reconstructing the missing values. In each iteration, the low-rank tensor L is updated by applying the WTSVT operator defined in Algorithm 2. Next, the proximal operator of the l1 norm is applied to update the sparse tensor S; it shrinks every element of its argument toward zero by lambda/rho. The augmented Lagrange penalty parameter is then updated using the step size rho. The loop of the ADMM algorithm stops when the convergence conditions listed in Algorithm 4 fall below the threshold epsilon.
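The sparse-tensor update described above is the elementwise proximal operator of the l1 norm (soft thresholding); a minimal sketch, where `tau` plays the role of lambda/rho:

```python
import numpy as np

def soft_threshold(T, tau):
    """Elementwise l1 proximal operator: shrinks every entry toward zero by tau.

    In the WRTPCA loop this updates the sparse tensor holding the
    mixture noise and outliers.
    """
    return np.sign(T) * np.maximum(np.abs(T) - tau, 0.0)
```

Entries whose magnitude falls below `tau` become exactly zero, which is what makes the recovered noise-and-outlier tensor sparse.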
5. Experiments and Results
5.1. Dataset
The U.S. Climate Normals dataset (https://www.ncdc.noaa.gov/cdo-web/datasets, accessed on 16 June 2023) comprises weather and climate data from over 1100 stations across U.S. states and territories. It includes hourly, daily, monthly, seasonal, and annual readings of temperature, wind statistics, mean sea level pressure, dew point, and cloud cover. These records undergo thorough quality assurance assessments at the National Centers for Environmental Information (NCEI) of the National Oceanic and Atmospheric Administration (NOAA).
The NDBC-TAO dataset (https://tao.ndbc.noaa.gov/tao/data_download, accessed on 10 June 2023) was collected from sensors in the Tropical Pacific Ocean and sent to NOAA. These sensors measure different attributes, including sea surface temperature, wind speed, conductivity, sea level pressure, and salinity.
Details regarding the ground-truth data tensors for both the U.S. Climate Normals dataset and the NDBC-TAO dataset are shown in Table 3.
5.2. Low-Rank Structures and Correlation in Multi-Attribute Sensing Data
Through SVD [13], the low-rank structure and attribute interrelations of the data matrix can be confirmed, and t-SVD [22] is a potent resource for investigating the interplay among multiple attributes. The low-rank characteristics of both datasets are shown in Figure 7 and were obtained by evaluating the singular values of the block diagonal matrix. In t-SVD, the correlation among numerous attribute surfaces is determined by assessing the singular values of the core tensor's block-diagonal matrices. The initial singular values primarily capture the energy of the tensor data. As Figure 7 shows, each singular value has a different importance, and the larger singular values carry most of the main data information; therefore, it is necessary to shrink different singular values using different weights rather than a fixed weight.
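The singular values examined in Figure 7 can be obtained by decomposing each frontal slice in the Fourier domain; a brief sketch (our own helper, not code from the paper):

```python
import numpy as np

def tubal_singular_values(X):
    """Singular values of each Fourier-domain frontal slice (t-SVD view).

    Returns an array of shape (min(n1, n2), n3); column k holds the
    singular values of the k-th frontal slice of fft(X, axis=2).
    """
    Xf = np.fft.fft(X, axis=2)
    return np.stack([np.linalg.svd(Xf[:, :, k], compute_uv=False)
                     for k in range(X.shape[2])], axis=1)
```

A fast decay down each column indicates a low-rank structure that weighted shrinkage can exploit.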
5.3. Experiment Setup
This section outlines a series of experiments conducted to confirm the efficacy of WRTPCA_WTNN, the proposed method. Its performance was compared with that of three other methods: the TRPCA [22], WTC [24], and the TRPCA combined with TC using the weighted sum tensor nuclear norm (TRPCA-TNN) [11]. This study used two real-world IoT datasets, namely the U.S. Climate Normals and NDBC-TAO datasets, to analyze recovery under various conditions, including sparse mixed noise, outliers, and missing values. All methods were tuned in these experiments to achieve their best results.
All methods were carefully fine-tuned and tested repeatedly, 20 times in total, for all corrupted cases. The average values were calculated to obtain the results. Data corruption was characterized using four factors: Gaussian noise, impulsive noise, outliers, and missing values.
In extensive experiments, we assessed the performance of our model using different types of corrupted data. For Gaussian noise, we chose a zero mean and investigated ten variance levels ranging from 11 to 20 in increments of one. Impulsive noise was examined at ten percentage levels from 0.1 to 0.55, with a step size of 0.05. Outliers were studied at ten magnitudes (k) ranging from 6 to 15 with a step size of 1, each being k times the standard deviation of each attribute. Missing values were explored at ten ratio levels from 0.1 to 0.55, with a step size of 0.05. Table 4 presents detailed parameters of the four corruption types.
In each experimental case, we varied only one type of corruption across its levels and held the other parameters constant. The fixed values for the experiment were a Gaussian noise variance of 15, an impulsive noise percentage of 0.3, an outlier magnitude k of 15, and a missing ratio of 0.3. For instance, in the Gaussian noise experiment, we altered the Gaussian noise variance from 11 to 20 in steps of 1 while maintaining an impulsive noise percentage of 0.3, an outlier magnitude k of 15, and a missing ratio of 0.3.
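As an illustration, the four corruption types can be simulated as follows. This is a sketch with assumed conventions: the impulsive amplitude and the outlier fraction (5%) are hypothetical choices of ours, not values taken from the experiments:

```python
import numpy as np

def corrupt(X, var=15.0, p_impulse=0.3, k_outlier=15.0, p_missing=0.3, seed=0):
    """Apply the four corruption types (sketch, hypothetical conventions).

    var        : Gaussian noise variance
    p_impulse  : fraction of entries hit by impulsive noise
    k_outlier  : outlier magnitude, in multiples of the data's std
    p_missing  : fraction of entries removed (marked with NaN)
    """
    rng = np.random.default_rng(seed)
    Y = X + rng.normal(0.0, np.sqrt(var), X.shape)           # Gaussian noise
    imp = rng.random(X.shape) < p_impulse                    # impulsive hits
    Y[imp] += rng.choice([-1.0, 1.0], imp.sum()) * 10.0 * np.sqrt(var)
    out = rng.random(X.shape) < 0.05                         # hypothetical 5%
    Y[out] = X[out] + k_outlier * X.std()                    # k-sigma outliers
    miss = rng.random(X.shape) < p_missing
    Y[miss] = np.nan                                         # missing values
    return Y, ~miss                                          # data + observed mask
```

The returned boolean mask marks the observed entries, matching the input expected by a completion routine.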
Algorithms 3 and 4 were initialized with different WTSVT weight vectors. It was also necessary to select appropriate values for the step size, its growth factor, the maximum step size, and the stopping threshold, which depend on the characteristics of the sensing data; these parameters were fixed across all experiments. The experimental results showed promising outcomes.
The experimental results were averaged across 20 replicates. All simulations used MATLAB R2021a on an Intel(R) Core(TM) i7-10700K CPU 3.80 GHz (Intel, Santa Clara, CA, USA).
5.4. Metrics
Reconstruction accuracy was evaluated using the normalized mean absolute error (NMAE). The NMAE for each of the three features considered in the dataset was determined by comparing the lateral slices of the input data with the reconstructed low-rank tensor.
The comparison between the original tensor and the reconstructed low-rank tensor served as the foundation for the obtained results.
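The text does not spell out the NMAE formula; a common definition, assumed here, normalizes the total absolute error by the total absolute magnitude of the ground truth:

```python
import numpy as np

def nmae(X_true, X_hat, mask=None):
    """Normalized mean absolute error between the original and reconstructed
    tensors, optionally restricted to the entries in `mask`.

    Assumed definition (not quoted from the paper):
    sum(|X_true - X_hat|) / sum(|X_true|).
    """
    if mask is None:
        mask = np.ones(X_true.shape, dtype=bool)
    return np.abs(X_true[mask] - X_hat[mask]).sum() / np.abs(X_true[mask]).sum()
```

A per-attribute NMAE, as used in the figures, is obtained by passing the lateral slice corresponding to one attribute.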
5.5. Results and Analysis
Each subfigure represents the results of the experiments, each involving a specific type of data corruption for a particular attribute within the datasets. The horizontal axis shows the levels of data corruption types, and the vertical axis illustrates the NMAE values of the four models in each case.
For Gaussian and impulsive noise, Figure 8 and Figure 9 illustrate the comparison of our model with the others on the NDBC-TAO and U.S. Climate Normals datasets. The proposed model consistently outperformed the other methods. In particular, the NMAE values of our method on the two datasets consistently fell below 0.1 in most noise cases.
In addition, the proposed model can handle outliers. The NMAE values of all models in the outlier cases are shown in Figure 10 for the two datasets. The other models perform worse than the proposed model, and the results of our model on the two datasets are below 0.1 in most outlier cases.
Finally, missing values are also reconstructed by our model; the results of the proposed model and the three other models on the two datasets in the missing-value cases are depicted in Figure 11. The experiment shows that the proposed model achieves lower NMAE values than the WTC, TRPCA, and TRPCA-TNN methods and maintains an NMAE below 0.1 on the two datasets in most missing-value cases.
Although each figure shows a comparison of the methods for one type of corruption, the corrupted datasets in every experiment include a blend of noise, outliers, and missing values. The proposed model achieves the best outcomes because it considers all types of corruption by employing WTSVT as a weighted approach for the TRPCA and TC, which shrinks singular values based on their significance and thereby improves the capacity to reduce mixture noise and outliers and to reconstruct missing values. As indicated in Section 5.2, every singular value has distinct importance: the more significant the data expressed, the greater the singular value. Therefore, the distinct singular values must be reduced using varying weights rather than the fixed weights used in most studies. In contrast, the WTC method consistently underperforms in all corruption cases because it specializes in reconstructing missing values and performs poorly in reducing noise and outliers. The TRPCA method consistently outperforms the WTC method; however, it still requires consideration of the weights of the singular values to enhance its ability to handle mixed noise and outliers. In some cases with high missing-value ratios, such as ratios from 0.4 to 0.55 in the dew attribute of the U.S. Climate Normals dataset, the TRPCA performs worse than WTC. The effectiveness of the TRPCA in recovering missing values diminishes at higher missing-value ratios because it is specifically designed to address noise and outlier problems, not missing data; it remains helpful at small missing-value ratios because the extracted low-rank tensors can still recover the missing entries, albeit poorly. In addition, the TRPCA-TNN method performed better than the WTC and TRPCA methods but worse than the proposed model. This method combines the TRPCA, which specializes in noise and outliers, with TC solved using the weighted sum tensor nuclear norm, which specializes in missing values. Because the TRPCA step in the TRPCA-TNN method does not consider the weights of the singular values, its ability to reduce noise and outliers remains inferior to that of the proposed model; it is nonetheless superior to the plain TRPCA at reducing noise and outliers because, in addition to the TRPCA step, it includes a TC process that recovers missing values and incidentally reduces some noise and outliers, albeit with low accuracy. In the specialized recovery of missing values in the TRPCA-TNN method, TC using the weighted sum tensor nuclear norm performs worse than TC using the weighted tensor nuclear norm. Owing to these limitations, the TRPCA-TNN method performed worse than the proposed model.
Performance on the U.S. Climate Normals dataset was lower than on the NDBC-TAO dataset, likely because the attributes in the former are less correlated, as it contains many heterogeneous attributes, such as temperature, wind statistics, mean sea level pressure, dew point, and cloud cover. We chose only three features, matching the number of features in the NDBC-TAO dataset, to create the experimental dataset. On the NDBC-TAO dataset, the results for all scenarios and all models were consistently under an NMAE of 1.0, whereas on the U.S. Climate Normals dataset, the results of the other models in some complicated cases reached an NMAE of 5.0.
6. Conclusions
This study proposes a new technique for recovering multivariate data from 5G-based IoTs by combining the TRPCA and TC approaches. The proposed approach can handle missing values, multiple noises, and outliers. By using the correlations between multiple attributes and solving the problem on tensor data, this method improves system performance. Its uniqueness arises from the incorporation of WTSVT into TC and the TRPCA to handle singular values, which increases reconstruction accuracy by retaining essential components. This combination overcomes the limitations of other models, such as WTC, the TRPCA, and the TRPCA-TNN, and significantly improves the ability to reduce mixed noise and outliers and reconstruct missing values. The proposed method is highly resistant to data corruption, including mixed noise, outliers, and missing values, and its NMAE consistently outperformed that of the other methods across the NDBC-TAO and U.S. Climate Normals datasets. This method can be useful in various applications, particularly those related to the 5G-based IoT in real-world scenarios. Thus, these findings bridge the gap in the literature regarding better ways to handle extensive data analysis within the IoT environment.
Our objective in the near future is to automate the process of choosing the most appropriate weight for a particular dataset. This will allow us to utilize the most efficient and precise decision-making methods. Furthermore, we will explore situations in which there is a correlation between errors or noise, such as when one type of error is dependent on another. In addition, we will consider ways to reduce the complexity of the proposed algorithm.