1. Introduction
During the last decades, the Internet of Things (IoT) has emerged as a new paradigm composed of billions of communicating devices, and it has gained considerable attention in both the scientific community and industry. However, the inclusion of the IoT into the fifth generation cellular systems (5G) and their evolution still represents a formidable technical challenge due to the huge number of sensors and the amount of information they generate. Note that one of the main challenges of 5G is massive connectivity for Machine-Type Communications (MTC) and managing its coexistence, in an efficient and effective manner, with the high-rate continuous traffic generated by Human-Type Communications (HTC). An interesting proposal is Compressive Sensing (CS), which reduces the number of agents that are active at a given time slot while remaining able to recover the sensed data. In general, Wireless Sensor Networks (WSNs) consist of a large set of sensor nodes, which are self-organising and geographically distributed across the network. They are usually used to monitor various physical phenomena with a high resolution, for example in forests, under water, as well as in civilian and habitat application areas. Usually, these devices operate in an unattended mode and are unable to renew their batteries. Hence, energy efficiency is the main challenge for these networks, since it directly affects their lifetimes and, thus, their sustainability. In usual data gathering techniques, each sensor node takes measurements and sends data to the sink node via multi-hop transmission. If nodes face packet losses, due to collisions or buffer overflows, packets are retransmitted, which leads to a high sensing cost and heavy traffic, especially in large-scale networks. Indeed, reducing the number of transmitting source nodes, using techniques such as CS, is not only useful in reducing collisions, but also crucial for sensor nodes that need to sleep to prolong their lifetimes.
Recently, it has been shown that the integration of the Matrix Completion (MC) technique, viewed as an extension of CS, enhances several wireless network scenarios. If the received data matrix has a low-rank structure, then it can be recovered with high accuracy from the partially received elements [1]. Firstly, data are directly sensed in compressed form and the energy-intensive recovery algorithm is executed at the sink node. Hence, the computational complexity is moved from the sensor nodes to the sink, which suits resource-constrained devices well and significantly reduces their energy consumption. Secondly, because MC handles the data in its matrix form, it can fully capture the signal correlation in both the space and time dimensions and, hence, achieves a satisfactory interpolation quality with a higher compression rate (very few transmitted readings).
In some applications, especially densely deployed WSNs, the sensed data are, in general, highly correlated, and redundancy exists between the sensor nodes that belong to the same geographic area. These nodes can be arranged into a group or a cluster. Because they monitor the same targets or events, collecting raw data from all of the cluster members becomes inefficient and a waste of energy. Therefore, a sufficient subset of nodes can be selected from each group, according to certain criteria, to represent the whole network. These active nodes deliver their readings to the sink under a compression ratio that is guaranteed by MC theory, while the rest of the nodes remain silent and do not participate in the sensing operation. Thus, as an extension to [2], we carry on with a twofold compression technique that has been updated compared to [2].
First, we assume that part of the nodes do not sense the environment at all. We can consider that these sensors are sleeping or inactive for a long period, or that these nodes are absent. Specifically, in this paper, these notions relate only to the sensing activity, and all of the nodes remain connected in order to participate in data forwarding (here, a node is absent in the sense that its data reading is completely missing, and the sink node has to recover it correctly). The second compression level is that, at each time slot, only a subset of the active nodes, referred to as the transmitting ones, send their sensing data to the sink. Different from [
2], the nodes having the highest correlation with the other nodes, i.e., those that best represent the network, are selected as representative sensor nodes. Indeed, in order to be chosen as active nodes, they should be able to capture enough information about the others and the whole network. This strategy not only minimizes the energy cost and extends the network lifetime, but it also helps to avoid other problems, such as traffic congestion collapse [
3]. It is true that in [
4], the choice of the active nodes follows a deterministic metric. However, unlike [4], in this work, we explain, in detail, each building block of the introduced structured MC-based data gathering framework (the representative nodes selection process and the network clustering phase). Subsequently, we separately evaluate them in the numerical results section in order to illustrate the benefits of each building block of the proposed technique. Furthermore, in this paper, we propose a Multi-Gaussian signal model that makes it possible to reproduce a signal retaining the behaviour of given real-world data by adjusting the correlation parameters. For that reason, this model represents an effective alternative to real-world signals.
The application of the just mentioned atypical high-loss scenario leads to a significant number of empty rows in the signal data matrix (a row (resp. column) is called an empty row (resp. column) if and only if all of the values of the row (resp. column) are un-sampled), which completely disagrees with MC fundamentals. In fact, because MC approaches are based on the minimization of the matrix rank, they become useless when there is any empty row or empty column in the matrix. Indeed, MC techniques have been conceived to recover a matrix containing random missing elements [
5]. In the state of the art of MC-based algorithms, to the best of our knowledge, Ref. [6] is the only paper that dealt with the case where there is a small number of missing rows in the received data matrix, by applying a spatial pre-interpolation technique that recovers data from neighboring sensor nodes. However, as the number of active nodes decreases, we also face absent nodes whose neighbor sensors are themselves absent. Thus, this framework becomes unable to recover the data rows of these isolated sensor nodes. Hence, although this approach is interesting, it seems not well suited for the addressed scenario, and it fails to take into account the existence of isolated (IS) sensor nodes (absent nodes having all their neighbors absent). In this context, we develop our scheme, which, firstly, schedules the sampling pattern after efficiently identifying the different clusters and their representative nodes. Secondly, it treats the case of high compression ratios with a considerable number of inactive sensor nodes (empty rows) while using a sequence of three different interpolation techniques.
The proposed framework is also useful for another challenging scenario: when only a small number of sensors can be deployed in a spacious area. Indeed, either the sensor nodes are costly or the environment is so large that one must make do with a limited number of sensors. This may also concern harsh environments that are difficult to access, such as volcanoes and other troublesome environments, where the deployment of many sensor nodes is not practical and becomes expensive. However, in many applications, the amount of gathered data must be significant enough to be processed. The idea here is to place a relatively small number of spatially spaced sensor nodes to monitor the correlated field under a given compression ratio. These sensor nodes represent other sensor nodes that do not really exist. Particularly, the sensory data field is, most of the time, highly correlated and redundant between nearby sensor nodes, which makes it possible to estimate the readings at locations where the signal cannot be sensed.
The main contributions of the paper are summarized as follows:
We generate a synthetic space-time signal composed of different Gaussians, each of which represents a cluster of wireless nodes. Like typical WSN signal profiles, the generated signals are correlated in space and time, where the spatial and temporal correlation parameters and models differ from one Gaussian to another and can be adjusted separately.
For the sampling part, only a small subset of sensor nodes is selected to be active and report their readings. For each detected cluster, the active sensor node selection is achieved by considering a correlation criterion. Subsequently, for each time instant, we choose the transmitting sensor nodes with the same percentage from each cluster in order to ensure diversity in the transmitted data, notably for the high compression ratios.
For the reconstruction part, we propose using three different techniques to accurately rebuild the entire data matrix. In the first step, we fill the missing readings of the active sensor nodes by applying the MC. Subsequently, we carry on with the spatial pre-interpolation to handle a part of the empty rows, while adjusting the 1-hop topology matrix to the presence of the disjoint clusters in the monitored field. Finally, we recover the rows of the isolated (IS) sensor nodes using a minimization-problem interpolation-based technique with a spatial correlation matrix. In this paper, the third stage of the data recovery pattern has been re-investigated and improved to be more efficient when compared to the one used in papers [2,4], i.e., providing a lower data recovery error for the IS nodes. In the numerical results section, we evaluate the two techniques with respect to a tuning parameter, and we show that the proposed minimization-problem interpolation-based method significantly enhances the data recovery performance.
Through extensive simulations, we show that the proposed framework outperforms other existing techniques in the literature, especially when the number of inactive nodes increases.
The remainder of the paper is organized as follows. The next section discusses the related work, and
Section 3 provides a brief overview on the MC theory and introduces the problem formulation of the paper.
Section 4 presents the signal model that we used for the evaluation of our approach. In
Section 5, we introduce the efficient clustering method that we propose. In
Section 6, we present a strategy that selects the set of the representative sensor nodes.
Section 7 is dedicated to the data reconstruction framework. Before concluding the paper in
Section 9, we carry out, in
Section 8, extensive simulations in order to evaluate the performance of the proposed approach.
2. Related Work
Environmental WSN signal profiles exhibit both spatial and temporal dependency. Such structures generate redundancy and enable a succinct representation of the data using a number of coefficients that is much smaller than its actual dimension. One popular postulate of such low-dimensional structures is sparsity, that is, a signal can be represented with a few non-zero coefficients in an invertible, proper sparsifying domain [
7]. CS has been introduced as a good fit for such application in both the acquisition and reconstruction of the signal [
8]. With a number of measurements proportional to the sparsity level, CS enables the reliable reconstruction of the signal. Indeed, the latter can be encoded using a much lower sampling frequency than the traditional Nyquist one [
9,
10,
11]. In order to handle the under-determined linear systems, efficient convex relaxation and greedy pursuit-based solvers have been proposed, such as NESTA [
12], L1-MAGIC [
13], and orthogonal matching pursuit (OMP) [
14]. Over the past years, plenty of papers have addressed the data gathering problem in WSNs through the integration of CS theory, which has made appealing progress in reducing the network energy consumption [
15,
16,
17,
18,
19,
20,
21].
Originally, CS-based schemes were designed to sample and recover sparse vectors, and they were classified either as purely spatial approaches [
18,
19,
20,
21,
22] or as purely temporal ones [
23]. Despite the incorporation of the Kronecker CS framework, the standard resolution of CS is still formulated in vector form [
15,
16,
24,
25,
26]. Moreover, tools from linear algebra are still needed in order to reformulate the data matrix into vector form. Without the need to compute an adaptive sparsifying basis, MC has recently emerged using another type of structural sparsity (a low-rank matrix has singular values composing a sparse spectrum) [
27], which is the matrix low rank property [
1]. Because it treats the data matrix as a genuine matrix, MC can take advantage of the correlation in its two dimensions and capture more information. In [
28], the authors have found that the data reconstruction performance of the MC depends on the compression ratio. In our previous work [
29], we have illustrated that a simple MC-based approach requires only a small fraction of the sensor node readings. In [30], a state-of-the-art MC-based algorithm for compressive data gathering has combined the short-term stability with the low-rank feature. The considered feature was used not only to reduce the recovery error, but also to recover the possibly empty columns appearing in the received data matrix. The existence of the empty columns was possible, since the readings were forwarded according to a presence probability. Differently, Zhou et al. in [
31], have taken advantage of the temporal stability feature and an MC method based on Bayesian inference to interpolate the missing data. Furthermore, the authors, in [
32], addressed joint CS and MC. They used CS to compress the sensor node readings and then MC to recover the non-sampled or lost information. However, this approach has not been compared to other state-of-the-art approaches to show its real contribution. In addition, they did not take advantage of the space-time correlation of the signal as much as they should have, since they used standard compression and sparsifying matrices for the CS. Different from [
32], Wang et al. in [
33], explored the graph based transform sparsity of the sensed data and considered it as a penalty term in the resolution of the MC problem. Similarly, Ref. [
34] combined the sparsity and the low-rank feature in the decoding part, and, as in [
33], has used the alternating direction method of multipliers to solve the constrained optimization problem. However, the authors have focused on vector-valued signals when sampling. In [
35], the authors introduced an active sparse mobile crowd sensing approach that is based on MC, with the intention of reducing the data acquisition cost while inferring the non-sampled data readings. Because adaptability and efficiency are two very important issues in WSN data gathering, Ref. [
36] has proposed an adaptive and online data gathering scheme for weather data, purely based on MC requirements. In contrast to our proposed approach, this paper has addressed the sampling side differently. Indeed, they have focused on the sampled data locations in the received data matrix, whereas we have considered the sampled data locations in the network area.
The authors of [
6] have focused on the case of MC recovery in the presence of successive data missing or corruption, which is referred to as structure faults. Indeed, they have considered that successive data may be missing or corrupted due to channel fading or sensor node failures, which creates successive missing data on rows and/or on columns. However, treating a significant number of totally missing rows was out of the scope of their paper. In this paper, we investigate how to solve a challenging problem in WSNs: how to omit a considerable number of sensor nodes from the monitoring field and estimate their readings from the partially reported readings of the representative sensor nodes using an MC-based approach. It is worth mentioning that efficiently identifying the clusters, their representative nodes, as well as the data transmission schedule significantly affects the recovery accuracy.
4. Signal Model
In this section, we investigate the generation of a synthetic signal composed of different Gaussians, each of which represents a portion of the whole monitored geographic area. Because structure and redundancy in data are often synonymous with sparsity, which is analogous to low rank [
27], each portion of the signal is correlated in space and time, where the spatial correlation as well as the temporal correlation parameters differ from one Gaussian to another. These parameters can be separately adjusted because the corresponding functions are independent [
41].
The proposed signal model is inspired by [41], which introduced a way to reproduce a signal that retains the behavior of given real-world data by adjusting the correlation parameters. In their model, all of the generated samples of the whole signal are Gaussian random variables with zero mean and unit variance. However, in this paper, we consider heterogeneous fields that are divided into a number of regions. Each one is modeled by a specific Gaussian (mean, variance) and different correlation characteristics. The number of different Gaussians, as well as their distribution over the field, can be fixed or defined according to the kind of signal that one wants to reproduce. This method represents an effective alternative to real-world signals.
In order to generate the signal of interest, we consider a two-dimensional space domain, where x and y are the space coordinates. Consider that we have H different regions, each with its own space domain, whose union covers the whole monitored area. Likewise, we suppose that the time is slotted into T equal time slots. Without loss of generality, in Algorithm 1 we describe how to generate a correlated portion of the signal representing one region h, defined over the time domain and over the points of the plane that correspond to region h. The signal of the whole area is the combination of all the generated portions.
In order to obtain a spatially correlated signal, we apply to the signal to be generated a 2D filtering procedure using a specific correlation function. Among the numerous existing models in the literature, we generate the signal using Gaussian filtering, as used in ([15] Equation (2)), which can be controlled by a spatial correlation parameter (this corresponds to the Power Exponential model of [41] with the exponent equal to 2). The coloration of the signal with this correlation function has to be done in the frequency domain. Hence, before modeling the spatial correlation, a Fourier transformation is performed. Regarding the temporal correlation, the authors of [41] have used an autoregressive filter to enforce the temporal correlation in the signal model. Because the time is slotted into equal time slots, they only consider the one-step time correlation and use a single autoregressive coefficient.
Algorithm 1 Model for generating a portion of the signal

Input: the field to generate for region h, the temporal correlation parameter, the spatial correlation parameter, and the spatial correlation function computed in the frequency domain.
1: for t = 1 to T do
2:   if t = 1 then
3:     Initialize the current frame with an i.i.d. random Gaussian field.
4:   else
5:     Combine the previous frame with a new (0,1) i.i.d. random Gaussian noise frame through the one-step autoregressive coefficient.
6:   end if
7:   Transform the frame to the frequency domain (2D Fourier transform).
8:   Apply the spatial correlation function (2D filtering).
9:   Transform the filtered frame back to the space domain.
10:  Adjust the frame to the mean and variance of region h.
11: end for
Output: the space–time correlated signal portion of region h.
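To make the generation procedure concrete, the sketch below reproduces the structure of Algorithm 1 in Python under the stated assumptions: an AR(1) update for the one-step temporal correlation and a Gaussian filter applied in the frequency domain for the spatial correlation. The function name and its parameters (grid size, region mean, coefficients a and sigma_s) are illustrative and not the paper's notation.

```python
import numpy as np

def generate_region_signal(nx, ny, T, mean, a, sigma_s, rng=None):
    """Sketch of Algorithm 1: one space-time correlated signal portion.

    nx, ny  : spatial grid size of the region
    T       : number of time slots
    mean    : mean of the region's Gaussian
    a       : one-step temporal (autoregressive) correlation coefficient
    sigma_s : spatial correlation parameter of the Gaussian filter
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gaussian correlation function evaluated in the frequency domain
    fx = np.fft.fftfreq(nx)[:, None]
    fy = np.fft.fftfreq(ny)[None, :]
    H = np.exp(-2.0 * (np.pi * sigma_s) ** 2 * (fx ** 2 + fy ** 2))

    frames = np.empty((T, nx, ny))
    z_prev = rng.standard_normal((nx, ny))            # i.i.d. Gaussian start frame
    for t in range(T):
        if t == 0:
            z = z_prev
        else:
            w = rng.standard_normal((nx, ny))         # fresh i.i.d. Gaussian noise
            z = a * z_prev + np.sqrt(1.0 - a ** 2) * w  # AR(1) update in time
        # colour the frame in the frequency domain (2D spatial filtering)
        coloured = np.real(np.fft.ifft2(np.fft.fft2(z) * H))
        coloured /= coloured.std() + 1e-12            # keep unit variance
        frames[t] = mean + coloured                   # shift to the region's mean
        z_prev = z
    return frames
```

The whole-area signal is then obtained by generating one such portion per region and sampling the combined field at the N sensor locations over the T time slots.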
To start the signal generation process, at the first time slot, we define the initial frame to be an i.i.d. random Gaussian field. That is, for any specific position, the corresponding sample is a Gaussian random variable with the region's mean and unit variance. Algorithm 1 describes how to produce a portion of the whole signal. By construction, the generated signal is a three-dimensional (3D) matrix (two space dimensions and one time dimension). The data matrix of interest, X, denotes the two-dimensional (2D) signal that is discretized by the N sensor nodes along the T time slots.
Figure 2 illustrates an example of an area monitored by a set of sensor nodes. We can notice, through the colors, that this field is divided into three different regions, which are represented by three different Gaussians (H = 3).
5. Clusters Detection
In this section, we investigate the partition of the deployed sensor nodes into J clusters. The main reason for partitioning the nodes is to involve all of the detected clusters in the data sensing and transmission. In the conventional MC, it is well known that transmitting sensors are selected in a random way during the T time slots. This kind of selection can disregard sensors that belong to the small clusters, which deteriorates the recovery process. However, if we make all of the clusters contribute to the data transmission process, then we reinforce the diversity in the delivered data set. Therefore, for each t, according to a given compression ratio and using the same percentage, a set of sensor nodes is picked from each cluster to form the sampling and transmission schedule. It will be shown, in the simulation section, that taking the detected clusters into account during the sampling process significantly enhances the data recovery performance, especially for the high compression ratios. Indeed, our aim is to partition the sensor nodes into different clusters in such a way that we attempt to maximize the intra-cluster similarities and minimize the inter-cluster similarities. Such a successful grouping can be achieved while using the Normalized Spectral Clustering technique.
Usually, sensor nodes that are situated spatially close to each other have similar readings. Nevertheless, there are some cases where nearby nodes are separated by a certain barrier and have readings that are relatively different from each other. Consider the example of sensor nodes deployed in a city to monitor the air pollution, with a public garden located next to a road: the nearby nodes placed on the two sides of the border do not necessarily have similar readings. Therefore, to cluster the nodes, the sink relies on their delivered readings (at the initialization, we let all of the sensor nodes send their information during a short learning period) and considers the set of data vectors x_1, ..., x_N, which we want to partition into J clusters. The vector x_i, viewed as a data point whose dimension is the length of the learning period, holds the readings that are sent by sensor node i during the learning period. The spectral clustering technique performs data clustering by treating it as a graph partitioning problem, without setting any assumption on the clusters' form. It transforms the given set of data points into a weighted graph while using a symmetric similarity matrix A, where each vertex represents a data point x_i, and each edge between two vertices i and j carries the similarity A(i, j). It is recommended to use the Normalized Spectral Clustering, as mentioned above. Hence, we implemented the NJW algorithm [42] (the algorithm name, NJW, is attributed according to the authors' names, that is, Ng, Jordan, and Weiss), which is detailed in Algorithm 2.
Commonly, identifying the number of clusters
J in an optimal manner is the main concern of all clustering algorithms. Generally, with spectral clustering, we find the number
J by analyzing the Laplacian matrix eigenvalues that are computed using
A and according to the chosen clustering method. In this work, we choose to apply the eigengap heuristic [
43], which defines J by finding a drop in the magnitude of the Laplacian eigenvalues λ_1 ≤ λ_2 ≤ ... ≤ λ_N, sorted in increasing order. That is:

J = argmax_j (λ_(j+1) − λ_j).    (5)

The idea here is to pick the number J in such a way that all of the Laplacian eigenvalues λ_1, ..., λ_J are very small when compared to λ_(J+1), which marks a relatively large value.
Regarding the similarity matrix A, we opted for the Gaussian kernel to measure the similarity between the data points [42], where σ is a scaling parameter that controls the width of the neighborhoods:

A(i, j) = exp(−‖x_i − x_j‖² / (2σ²)) for i ≠ j, and A(i, i) = 0.    (6)

According to ([42] Theorem 2), an appropriate σ can be automatically fixed after repeatedly running the algorithm with a number of values and choosing the one that forms the least distorted partition in the spectral representation space. In order to determine the appropriate parameter σ, the authors of ([43] Section 8) provided several rules of thumb that are frequently used. As an example, the method that we have used states that σ can be chosen to be in the order of the mean distance of a point to its k-th nearest neighbor, where k is in the order of log(N) + 1.
Algorithm 2 The Ng, Jordan, and Weiss (NJW) Spectral Clustering algorithm

Input: The set of data vectors x_1, ..., x_N, and the number J of clusters to detect according to (5).
1: Calculate the similarity matrix A according to (6).
2: Calculate the degree matrix D, which is a diagonal matrix defined by D(i, i) = the sum over j of A(i, j).
3: Compute the normalized graph Laplacian matrix L = I − D^(−1/2) A D^(−1/2).
4: Perform the eigenvalue decomposition of L and find the J eigenvectors corresponding to the smallest eigenvalues, arranged in increasing order.
5: Form the matrix U by stacking the J eigenvectors in columns.
6: Normalize the rows of U to norm 1 in order to obtain the matrix Ũ.
7: Treat each row of Ũ as a data point in R^J, then partition these points into J subgroups using the k-means algorithm.
8: Attribute the original point x_i to cluster j if and only if row i of the matrix Ũ was attributed to cluster j.
Output: Clusters C_1, ..., C_J.
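A compact sketch of Algorithm 2, including the eigengap rule (5) and the σ rule of thumb, is given below. It assumes the learning-period readings are available as one row per node; the helper name njw_clustering is ours.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def njw_clustering(X_learn, J=None, sigma=None):
    """Sketch of Algorithm 2 (NJW) with the eigengap heuristic of (5).

    X_learn : (N, T_learn) readings sent during the learning period,
              one row per sensor node.
    """
    N = X_learn.shape[0]
    # pairwise Euclidean distances between the nodes' learning readings
    d = np.linalg.norm(X_learn[:, None, :] - X_learn[None, :, :], axis=-1)
    if sigma is None:
        # rule of thumb: mean distance of a point to its k-th nearest neighbour
        k = max(1, int(np.log(N)) + 1)
        sigma = np.mean(np.sort(d, axis=1)[:, k])
    A = np.exp(-d ** 2 / (2.0 * sigma ** 2))          # Gaussian kernel, Eq. (6)
    np.fill_diagonal(A, 0.0)
    deg = np.maximum(A.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_sym = np.eye(N) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    evals, evecs = np.linalg.eigh(L_sym)              # ascending eigenvalues
    if J is None:
        J = int(np.argmax(np.diff(evals[: min(N, 10)]))) + 1   # eigengap (5)
    U = evecs[:, :J]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)  # row-normalize
    _, labels = kmeans2(U, J, minit="++")             # k-means on the rows
    return labels, J
```

Applied to the learning readings of the example in Section 4, this procedure returns the cluster label of each node together with the detected number of clusters J.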
Figure 3 plots the sorted eigenvalues of the normalized Laplacian matrix that is computed from the generated signal of the example of
Section 4 while using the first four steps of the aforementioned clustering algorithm. Clearly, there is a relatively large gap between the 3rd and 4th eigenvalue of this trace. According to metric (
5), the data set contains three clusters, which matches the three Gaussian regions of the generated example.
6. Sampling Pattern
In this section, we determine how the correlation criterion can be used to select the representative sensor nodes and how the detected clusters are taken into account in the selection process as well as in the sensing and transmission schedule. Unlike our previous work [2], where the set of representative sensor nodes is randomly chosen, in this paper the representative nodes must hold enough information about the other nodes to be chosen as representatives of the network. Relying on the Enhanced Correlation Based Deterministic Node Selection (ECB-DNS) procedure, which was used in previous works [15,16], the active sensor node selection is achieved by considering the inter-spatial correlation, which is computed through the conditional variances of the sensor nodes. At each iteration, this technique selects the sensor node holding the maximum informative value with respect to the set of sensor nodes that are not selected yet, as expressed in (7).
In (7), the covariance terms are those between the reading of sensor i and the reading of sensor g, whereas the variance term is that of the reading of sensor g. It is noteworthy that the way this technique is exploited in our approach differs from that in [15]. According to their scenario, all of the N nodes contribute to the data sensing and transmission over the T time slots, while, in this approach, only a small number of nodes are selected to be active and to represent the J detected clusters. In order to cover all of the clusters, the set of representative nodes consists of the combination of J subsets, one per cluster, where the subset of cluster j includes the representative nodes picked from that cluster using the same shared percentage, as given in (8). In (8), if the resulting number is not an integer, we round it up to the nearest integer greater than or equal to that value. Here, the selection of the clusters' subsets of representative nodes is independent from one cluster to another. Hence, the set that appears in expression (7) is replaced by the set of sensor nodes of cluster j that are not yet selected. Thus, we obtain the per-cluster selection metric given in (9).
The selection process is the same for the J subsets. Thus, for each cluster, according to (9), at each iteration a sensor node is selected and moved from the set of not-yet-selected nodes of that cluster to the set of nodes already chosen during the previous iterations. Once a sensor node is added to the chosen set, the metric of the remaining sensors of the cluster should be recomputed in order to prepare the selection of the next sensor node. Here, by removing the selected node from the set of remaining nodes, we cancel its impact on the rest of the nodes in that set. Hence, the selection of the next sensor node is achieved as if the previously selected node did not exist in the network. The node selection process, and especially the manner in which we remove the correlation effect of a selected node, follows the steps that are outlined in Algorithm 3. For the initialization, we define the data matrix sent during the learning period, which we partition into J sub-matrices, where each sub-matrix holds the data sent by the nodes belonging to the corresponding cluster. Besides, we assume that the spatial correlation feature inherent in the learning data matrix reflects that in X. By analogy with [15], the computational complexity of selecting the representative nodes of a cluster is of the same order as that of the original ECB-DNS procedure. However, different from [15], where, in each time slot t, a new and different set of active transmitting source nodes has to be found using the node selection metric, in this work the selection of the representative nodes' set is performed only once, at the beginning of the sensing period T.
Given the example of Figure 1, we can note the existence of three detected clusters within the network. Given the chosen percentage of active nodes, the corresponding fraction of nodes is selected from each cluster to be active, so that each cluster contributes a number of representative sensors proportional to its size. Based on the correlation among the sensor nodes and using Algorithm 3, one representative subset is obtained per cluster.
Once the set of representative sensor nodes is defined, the sink focuses on the sensing and transmitting schedule by assigning m transmitting nodes to each time instant t. Obviously, these nodes are picked from the set of representative nodes. As stated in the previous section, in order to ensure the diversity in the delivered data, the m transmitting nodes are chosen in such a way that we randomly pick, with the same shared percentage, a number of nodes from the representative subset of each cluster. Likewise (8), the per-cluster number of transmitting nodes is given in (10). Let us focus again on the example of Figure 1: for each t, the same fraction of sensors from each representative subset is randomly designated to deliver their data to the sink. Because the number N of nodes in this example is very small, we end up with one transmitting sensor from each cluster for each t.
To conclude, rather than selecting the measurement locations in a purely random way, as usually done in the conventional MC method, in this section we presented how to intelligently assign transmitting sensor nodes that well represent the network, relying on their correlations.
Algorithm 3 A cluster representative sensor nodes selection process

Input: for cluster j, its learning data sub-matrix, the number of representative nodes to select, the set of not-yet-selected nodes (initially all of the cluster's nodes), and an empty set of chosen nodes.
1: for each representative node to select do
2:   if this is the first iteration then
3:     Compute the covariance matrix of the cluster's learning readings.
4:     According to (9) and using this covariance matrix, compute the metrics, then select the most informative node.
5:     Remove the reading of the selected node from the learning sub-matrix, and keep it aside as the reading of the last selected node.
6:     Following that removal, write the covariance matrix in block form, separating the covariance matrix of the remaining nodes, the covariance vector between the remaining nodes and the selected node, and the variance of the selected node.
7:   else
8:     Following the removal of the previously selected node, re-compute the conditional covariance matrix of the remaining nodes knowing the selected node, from the blocks identified in step 6.
9:     According to (9) and using this conditional covariance matrix, re-compute the metrics, then select the next node.
10:    Update the reference reading with the values of the newly selected node.
11:    Perform step 5 then step 6.
12:  end if
13:  Move the selected node from the set of not-yet-selected nodes to the set of chosen nodes.
14: end for
Output: the subset of representative nodes of cluster j.
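Because the metric (7)/(9) and the update equations are not reproduced in this extracted text, the sketch below instantiates Algorithm 3 with an assumed informative value (the sum of squared covariances with the remaining nodes, normalized by their variances) and the standard Gaussian conditioning (Schur complement) to cancel the influence of an already selected node. The function name and the metric are ours, not the paper's.

```python
import numpy as np

def select_cluster_representatives(X_cluster, n_select):
    """Sketch of Algorithm 3 for one cluster.

    X_cluster : (n_j, T_learn) learning readings of the cluster's nodes.
    n_select  : number of representative nodes to pick from this cluster,
                e.g. the ceiling of shared_percentage * n_j, as in (8).
    """
    n_j = X_cluster.shape[0]
    remaining = list(range(n_j))
    selected = []
    cov = np.cov(X_cluster)                       # covariance of the cluster
    for _ in range(n_select):
        # assumed informative value of each remaining node w.r.t. the others
        scores = []
        for i in remaining:
            others = [g for g in remaining if g != i]
            num = cov[i, others] ** 2
            den = np.maximum(np.diag(cov)[others], 1e-12)
            scores.append(np.sum(num / den))
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
        # condition the covariance on the selected node (Schur complement),
        # cancelling its influence on the remaining nodes (steps 5-8)
        s = cov[:, best][:, None]
        cov = cov - (s @ s.T) / max(cov[best, best], 1e-12)
    return selected
```

The value n_select passed for each cluster would follow the shared-percentage rule of (8), and the per-time-slot transmitting nodes are then drawn at random from the selected subsets with the same per-cluster fraction, as in (10).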
7. Reconstruction Pattern
After revealing in detail how to select the representative sensor nodes and how to schedule their participation in the data sensing and transmission, in this section we focus on how to approximate the entire data matrix X based on the limited amount of reported readings. Isolating inactive sensor nodes from the sampling and transmission schedule entails the existence of fully empty rows in the received data matrix, which impedes the MC technique, since it is then completely unable to estimate the original matrix. Therefore, the use of other complementary interpolation techniques becomes necessary. In this context, we develop a structured MC-based recovery algorithm that is able to ensure the reconstruction of the entire data matrix X.
Stage 1: obviously, it is not feasible to directly apply the MC technique in the presence of fully empty rows. Therefore, we have to remove these rows from M and keep only the matrix that contains the partially delivered readings of the representative sensor nodes; we carry on with the same removal on the corresponding sampling pattern. Subsequently, making use of the solution introduced in (4), or any other method proposed for the MC resolution, we fill the missing entries that correspond to the non-transmitted data readings of the representative sensor nodes. The threshold parameter roughly equals 100 times the largest singular value of the reduced matrix, as introduced in [39]. The output of this stage is the combination of the MC-based estimation and the directly observed data. Finally, we update this matrix by adding back the empty rows and placing them in their proper corresponding locations in M.
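As an illustration of Stage 1, the sketch below fills the missing entries of the active nodes' rows with a generic soft-impute (singular value thresholding) loop; this is a stand-in for the solver of (4)/[39], and the shrinkage threshold used here is a heuristic of ours rather than the τ rule quoted above.

```python
import numpy as np

def complete_active_rows(M_act, mask, n_iter=200, tau=None):
    """Stage 1 sketch: fill the partially observed rows of the active nodes.

    M_act : (m_a, T) matrix of the representative nodes, zeros where unobserved.
    mask  : boolean matrix of the same shape, True where an entry was received.
    """
    if tau is None:
        tau = 0.01 * np.linalg.norm(M_act, 2)   # heuristic shrinkage (sketch only)
    X = np.where(mask, M_act, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - tau, 0.0)            # soft-threshold the singular values
        X_low = (U * s) @ Vt                    # low-rank estimate
        X = np.where(mask, M_act, X_low)        # keep the observed entries fixed
    return X
```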
Stage 2: after filling the random missing readings, there remain the completely missing rows that correspond to the inactive sensor nodes. In this phase, we carry on with the spatial pre-interpolation technique of [6], which rebuilds the data of an empty row relying on the available data of the neighboring sensor nodes. To apply this method, they used an N × N binary symmetric matrix Y that they called a 1-hop topology matrix, where both the columns and the rows denote the sensor nodes. The sink assigns 1 to Y(i, j) and Y(j, i) if it finds that sensor node i and sensor node j are 1-hop neighbors. However, according to the nature of the signals that we consider, and to avoid untrustworthy data reconstruction, we consider that, even though two sensor nodes are geographically close to each other, if they do not belong to the same cluster, then they are not considered to be neighbors.
The number of active sensor nodes is very small when compared to the total number N, which means that the inactive sensor nodes constitute the preponderant portion of the network, as mentioned before. Consequently, there are several isolated (IS) nodes in the network, i.e., absent nodes having all of their neighbors absent. Hence, with the use of the stated topology matrix Y, this interpolation technique can achieve the data reconstruction only for the absent sensor nodes that have neighbors belonging to the set of representative nodes. We suppose that the network contains a certain number of IS sensor nodes. Subsequently, the resulting data matrix, obtained at the end of this stage, still holds the corresponding empty (all-zeros) rows to be recovered. For the detailed steps of the above interpolation method, the reader may refer to ([6] Section VI). (As for the complexity of the used spatial pre-interpolation, according to [6], it is estimated to be very low, since this technique is based on simple matrix multiplication with neighbor information.)
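The pre-interpolation of [6] is not reproduced here; the sketch below only illustrates the cluster-aware 1-hop topology matrix Y described above and a simple neighbour-averaging fill, used as a stand-in for that technique. It also returns the rows that remain empty, i.e., the IS nodes handled by Stage 3.

```python
import numpy as np

def spatial_pre_interpolation(X_hat, positions, labels, active, radius):
    """Stage 2 sketch: fill empty rows that have at least one active neighbour.

    X_hat     : (N, T) matrix after Stage 1 (empty rows are all zeros).
    positions : (N, 2) sensor coordinates.
    labels    : (N,) cluster index of each node.
    active    : (N,) boolean, True for representative (non-empty) rows.
    radius    : 1-hop communication radius.
    """
    dist = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    # cluster-aware 1-hop topology matrix: neighbours must share a cluster
    Y = (dist <= radius) & (labels[:, None] == labels[None, :])
    np.fill_diagonal(Y, False)

    X_out = X_hat.copy()
    still_empty = []                              # isolated (IS) nodes
    for i in np.where(~active)[0]:
        neigh = np.where(Y[i] & active)[0]        # active 1-hop, same-cluster
        if neigh.size > 0:
            X_out[i] = X_hat[neigh].mean(axis=0)  # simple neighbour averaging
        else:
            still_empty.append(i)
    return X_out, still_empty
```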
Stage 3: since the above interpolation technique is limited to recovering only a part of the total empty rows (those of the absent nodes), we resort to a second spatial interpolation to rebuild the remaining part of the empty rows (those of the IS nodes). Benefiting once again from the spatial dependency among the sensor nodes, we fill the remaining empty rows while using the minimization problem (11), in which the data-fit constraint is taken over the set of indexes of the non-IS nodes, i.e., the representative nodes and the absent ones, S represents the spatial constraint matrix, whose computation steps are detailed hereafter, two tuning parameters balance the terms of the objective, and the solution is the final reconstructed data matrix. It is noteworthy that the above proposed minimization-based interpolation technique has been updated when compared to the one of our previous works ([2] Equation (8)) and ([4] Equation (5)), and, through simulations, we found out that the updated minimization significantly enhances the data reconstruction quality of the IS nodes. Note that the resolution of this optimization problem can easily be accomplished using semidefinite programming (SDP). We opted for the CVX package [44], implemented in Matlab, as an advanced convex programming solver, in order to solve (11) and obtain the final estimate.
In this equation, the matrix S reflects our knowledge regarding the spatial structure inherent in the data, since it is computed based on the learning data matrix. This spatial matrix expresses the similarities between the sensor nodes' readings. Suitably, we use the Euclidean distance as a distance function in the data domain of the sensor nodes to model the similarity between the rows of the learning data matrix, whereby the smaller the distance between two rows, the closer they are. Below are the steps to obtain S:
1—We initiate these steps with an all-zeros matrix S.
2—The similarity between the rows is not evident, as the ordering of the sensor nodes' indexes is arbitrary. Thus, for each row i of the learning data matrix, we search for the set of indexes of the K closest rows to i.
3—Assuming that row i can be approximated through a linear combination of the rows of this set, we perform a linear regression to compute the corresponding weight vector, as expressed in (12).
4—Finally, we assign 1 to the diagonal entry S(i, i) and place the regression weights in the entries of row i of S that correspond to the K selected rows.
As soon as these steps have been carried out for all the rows i, we obtain the matrix S, with which we interpolate the data matrix, as in (11). (Here, since, for each row of the learning data matrix, we search for the set of the K closest rows using a simple Euclidean distance, the complexity of this search remains low. Moreover, performing the linear regression in (12) to compute the weight vector W is basically dominated by simple matrix multiplication and division operations, which keeps the complexity low.)
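The steps above can be written compactly as follows; the sign convention (storing the negated weights so that each row of S approximately annihilates the corresponding learning row) is an assumption on our part, since only steps 1-4 are stated.

```python
import numpy as np

def build_spatial_constraint_matrix(X_learn, K):
    """Sketch of the construction of the spatial constraint matrix S.

    X_learn : (N, T_learn) learning data matrix (one row per sensor node).
    K       : number of closest rows used for each node.
    """
    N = X_learn.shape[0]
    S = np.zeros((N, N))                                   # step 1
    dist = np.linalg.norm(X_learn[:, None, :] - X_learn[None, :, :], axis=-1)
    for i in range(N):
        order = np.argsort(dist[i])
        nearest = order[order != i][:K]                    # step 2: K closest rows
        # step 3: regress row i on its K closest rows (Eq. (12))
        B = X_learn[nearest].T                             # (T_learn, K)
        W, *_ = np.linalg.lstsq(B, X_learn[i], rcond=None)
        # step 4: place the coefficients in row i of S
        S[i, i] = 1.0
        S[i, nearest] = -W
    return S
```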
Now, there remains one last adjustment, that is, the scaling of the two tuning parameters of (11). These regularization parameters are introduced in order to establish a trade-off between a close fit to the observed data matrix and the intention of filling the remaining empty rows of the IS nodes using S. Through several simulations, we found that adjusting these parameters noticeably improves the reconstruction performance, and the values found are independent of the size of the matrix (N and T) as well as of the Gaussians' values composing the synthetic signal.
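Since Equation (11) itself is not reproduced in this extracted text, the following CVXPY sketch solves one plausible form of the Stage 3 problem: a least-squares fit to the non-IS rows, a nuclear-norm term weighted by the first tuning parameter, and a spatial-consistency penalty built from S weighted by the second one. The exact objective of (11) may differ, so treat this as an assumption-based illustration (the paper itself uses CVX in Matlab).

```python
import numpy as np
import cvxpy as cp

def stage3_interpolation(M_tilde, S, non_is_rows, lam1=1.0, lam2=10.0):
    """Assumed form of the Stage 3 minimization (see caveats above).

    M_tilde     : (N, T) matrix after Stages 1-2 (IS rows still all zeros)
    S           : (N, N) spatial constraint matrix
    non_is_rows : indexes of the representative and absent (non-IS) nodes
    """
    N, T = M_tilde.shape
    fit_mask = np.zeros((N, T))
    fit_mask[non_is_rows, :] = 1.0               # fit only the non-IS rows

    X = cp.Variable((N, T))
    objective = cp.Minimize(
        cp.sum_squares(cp.multiply(fit_mask, X - M_tilde))
        + lam1 * cp.normNuc(X)                   # low-rank prior
        + lam2 * cp.sum_squares(S @ X)           # spatial consistency via S
    )
    cp.Problem(objective).solve()
    return X.value
```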
Let us focus again on the example shown in Figure 1. The dotted lines refer to the neighborhood relation between sensors. As we can see, some of the absent sensors are each linked to at least one representative sensor. Thus, their data readings can easily be recovered through the spatial pre-interpolation method of Stage 2, whereas the data readings of the IS sensors are recovered thanks to the minimization (11) of Stage 3.
8. Numerical Results
In this section, we first evaluate our proposed structured approach with the variation of the tuning parameter of the minimization (11) of Stage 3, while fixing the other one, in order to measure the data reconstruction error ratio with respect to the different simulated values and to choose the appropriate value that gives the lowest data recovery error. Secondly, we compare the performance of our proposed structured approach, with the fixed tuning parameter, to that of a benchmark scheme, which was designed based on what was proposed in [6] and in line with our scenario's requirements. Indeed, at the end of their work, Xie et al. considered, in [6], that there is only a small number of empty rows in M (14 missing data rows in their setting). As we have already stated at the beginning of this paper, treating an important number of missing rows has not been the main focus of their work. Thus, their proposed approach has not taken the existence of the IS nodes in the network into account. In fact, they basically focused on the existence of successive missing or corrupted entries in the received data matrix M. However, to the best of our knowledge, this is the only paper that has treated a similar case using MC, and with which we can compare our approach in the first part of this section. Subsequently, in the second part, we try to separately evaluate the benefits of each building block of the proposed approach, namely the consideration of the detected clusters, the correlation-based selection of the representative nodes, and the third stage of the reconstruction pattern.
Making use of the generated signal of the example of
Section 4, we perform our structured approach over different scenarios to illustrate the impact of these aforementioned techniques on the interpolation accuracy of the data matrix. To measure the reconstruction error, we opted for the following metrics, where
X and X̂ represent, respectively, the initial raw data matrix and the reconstructed one:
1—NMAE_all: the Normalized Mean Absolute Error on all missing entries.
2—NMAE_p: the Normalized Mean Absolute Error on the partially missing entries, which correspond to the non-transmitted readings of the representative nodes; the corresponding set of indexes is found in the received data matrix M.
3—NMAE_e: the Normalized Mean Absolute Error on the missing entries of the fully empty rows, which correspond to the inactive sensor nodes' readings; the corresponding set of indexes is that of the empty rows found in the received data matrix M.
4—CR: the Compression Ratio, i.e., the ratio of the number of observed entries in M to the total number of entries.
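For reproducibility, the sketch below computes these four metrics from a boolean sampling mask; the exact normalization of the NMAEs (by the sum of the absolute true values over the relevant index set) is our assumption, as the formulas are not reproduced in this extracted text.

```python
import numpy as np

def nmae(X, X_hat, idx):
    """Normalized Mean Absolute Error over the boolean index set idx."""
    return np.sum(np.abs(X_hat[idx] - X[idx])) / np.sum(np.abs(X[idx]))

def error_metrics(X, X_hat, observed, active_rows):
    """observed    : boolean (N, T) mask of received entries
       active_rows : boolean (N,) mask of representative nodes"""
    missing = ~observed
    partially = missing & active_rows[:, None]      # non-transmitted readings
    empty = missing & ~active_rows[:, None]         # rows of inactive nodes
    return {
        "NMAE_all": nmae(X, X_hat, missing),
        "NMAE_p": nmae(X, X_hat, partially),
        "NMAE_e": nmae(X, X_hat, empty),
        "CR": observed.sum() / observed.size,       # compression ratio
    }
```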
We vary the number of representative sensor nodes from 10 to 80 and, for each given value, we vary the number of transmitting nodes per time slot from 10 to 80, in order to assess the proposed approach under different CRs. It is obvious that the range of the number of transmitting nodes depends on the number of representative nodes: the larger the latter, the wider the range that can be used. Note that we are mainly interested in the small values of both, since we are considering the high-loss scenarios. Specifically, we consider sensor nodes that are randomly distributed in a square observation area, and we monitor the WSN during T time slots.
To begin, we measure the data reconstruction error ratio of our proposed structured approach with the variation of the second regularization parameter of (11). To do so, we fix the first parameter to 1, and then we accordingly adjust the second one over a range of values starting from 1. Note that this setting has been kept during all of the simulations of this paper. Figure 4 shows the effect of this parameter on the data recovery performance of our approach: for a given sampling setting, we vary the parameter and, for each case, the reconstruction error is calculated. As we can note, the minimization (11) of Stage 3 typically performs better for one particular value of the parameter than for the others. For that reason, we retain this value and use it in all of the next experiments.
In Section 7, the proposed minimization-based interpolation technique (11) of Stage 3 has been investigated and then updated when compared to the one of our previous works ([2] Equation (8)) and ([4] Equation (5)), as we have mentioned. Figure 5 illustrates a performance comparison between the two methods for different sampling settings and with respect to the regularization parameter. As we can clearly notice through the simulations of Figure 5, for the different tested settings, the data recovery performance is highly improved with the proposed minimization-based interpolation technique of this paper compared to the one of our original papers.
In the third simulation, we implement a benchmark approach that is based on what was proposed in [6]. The sampling pattern of this approach consists in choosing the set of representative sensor nodes in a purely random way, which is exactly the same as randomly selecting the empty rows. Likewise, for each time instant t, m nodes are uniformly selected from this set to deliver their readings to the sink. Here, neither the selection of the representative sensors nor the selection of the transmitting ones takes the detected clusters into account. As for the reconstruction pattern, to obtain the final recovered data matrix, this approach performs the MC and then the spatial pre-interpolation. The temporal pre-interpolation was omitted, since we do not consider the existence of empty columns in the observed data matrix M (this is not an issue in our scenario, since, at every t, we ensure the transmission of m readings sensed at m different locations). In
Figure 6, we have measured the NMAE_all with respect to the variation of the number of representative nodes, for different numbers of transmitting nodes per time slot. Our approach distinctly outperforms the benchmark one across the entire tested range, as we can note from the plots. We are able to go up to a very large fraction of missing rows while maintaining an interesting reconstruction performance, whereas the benchmark technique yields a much higher NMAE_all.
Figure 7 and Figure 8 illustrate the 3-D bar graphs of, respectively, the NMAE_p and the NMAE_e values with the variation of the numbers of representative and transmitting nodes. For the convenience of comparison, we have used the NMAE_p and the NMAE_e in order to separate the error ratios and demonstrate the recovery performance enhancement that has been achieved by our proposed approach on, respectively, the partially and fully missing readings.
Note that the considered framework greatly reduces the overall network energy consumption, since we only use a small set of representative sensors for the data transmission. Furthermore, when compared to the benchmark approach, the proposed one can further improve the sensors' lifetime. In fact, for a given target reconstruction error, we compute the energy consumption during the T time instants for both compared approaches, depending on the number N of sensors. In this simulation, we consider that two nodes i and j can directly communicate with each other, without the need for relaying, only if the Euclidean distance between them is within some transmission radius that scales with the network size [21]. To route the data towards the sink node, we use the shortest path tree computed by the Dijkstra algorithm [16]. The following model is used in order to compute the energy consumption during data transmission [45]:
E_TX(L, d) = L · E_elec + L · ε_amp · d²,   E_RX(L) = L · E_elec,    (17)

where E_TX(L, d) and E_RX(L) represent, respectively, the amount of energy consumed by a specific node i to deliver or to receive an L-bit packet over a distance of length d. In (17), E_elec is the energy required by the transceiver circuitry at the sender or the receiver, and ε_amp is the energy consumed by the transmitter's amplifier. Regarding the parameter settings, the packet length L is set as in [15], E_elec = 50 nJ/bit and ε_amp = 100 pJ/bit/m² [45].
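With the first-order radio model of (17) and the quoted parameter values, the per-packet energy terms can be computed as below; accumulating tx_energy and rx_energy along each Dijkstra route towards the sink gives the totals plotted in Figure 9 (the function names are ours).

```python
E_ELEC = 50e-9       # J/bit: transceiver circuitry energy (50 nJ/bit)
EPS_AMP = 100e-12    # J/bit/m^2: transmitter amplifier energy (100 pJ/bit/m^2)

def tx_energy(L_bits, d):
    """Energy consumed to transmit an L-bit packet over a distance d, Eq. (17)."""
    return L_bits * E_ELEC + L_bits * EPS_AMP * d ** 2

def rx_energy(L_bits):
    """Energy consumed to receive an L-bit packet, Eq. (17)."""
    return L_bits * E_ELEC
```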
Figure 9 illustrates the energy consumption for the proposed framework as well as for the benchmark one. Indeed, our approach requires far fewer sensor node readings and, consequently, much less energy, in order to achieve the same reconstruction performance.
Let us now focus on the benefits of the cluster-aware selection. We show that taking the detected clusters into account during the representative node selection process, as well as during the assignment of the sensing and transmitting schedule, significantly improves the data recovery performance. Thus, we compare our approach to another one that proceeds regardless of the existence of the different clusters. In that case, the set of representative sensor nodes is selected according to (7) instead of (9), i.e., the spatial correlation criterion is still present during the node selection process; nevertheless, we do not have an equitable representation of the different regions that compose the whole network. Moreover, for each t, the m transmitting nodes are picked from the representative set in a purely random way to sense and then deliver their data readings, i.e., uniformly instead of according to (10). To recover the received data matrix, both algorithms apply the 3-stage reconstruction pattern of Section 7.
Figure 10 illustrates the 3-D bar graph of the NMAE_all values with the variation of the numbers of representative and transmitting nodes. This simulation shows how beneficial the clusters consideration is. The bars depict that our approach provides a considerable improvement in terms of NMAE_all when compared to the algorithm of comparison, especially for the high compression ratios, i.e., when the number of transmitting sensor nodes is very limited. Note that, without enforcing the involvement of all the clusters in the data sensing and transmission process, sensor nodes that belong to the small clusters could be totally ignored, which gravely deteriorates the recovery process.
In Figure 11 and Figure 12, we have measured, respectively, the NMAE_p and the NMAE_e with respect to the variation of the number of representative nodes, for different numbers of transmitting nodes. Figure 11 and Figure 12 highlight the effect of the introduced block on the recovery of, respectively, the representative nodes' and the inactive nodes' readings. Although both of the techniques apply the same MC resolution method, the NMAE_p of our approach is much lower than that of the benchmark. The NMAE_e also appears to be heavily affected, despite the fact that the clusters consideration, at the base, only targets the first stage of the reconstruction pattern, which is the MC resolution. For example, for several of the tested settings, we reach a clear improvement when we enforce the involvement of all the clusters in the data sensing and transmission.
The next scenario aims to prove the importance of carefully selecting the representative nodes. Making use of the spatial correlation in the selection process, as detailed in Algorithm 3, these nodes are selected under the criterion of best representing the whole network. We compare our algorithm to another one that selects its representative nodes randomly, in order to investigate the efficiency of the proposed selection process. However, in order to be comparable, this alternative takes the existing clusters into account when selecting its representative nodes. Hence, its set of representative nodes also consists of the combination of J subsets, where each subset includes representative nodes selected randomly from the corresponding cluster using the same shared percentage. Both algorithms design their sensing and transmitting schedules based on their selected sets of representative nodes, as described in Section 6 and according to (10). To recover the received data matrix, both of the performed algorithms apply the three-stage reconstruction pattern of Section 7.
Figure 13 and Figure 14 depict the results of this simulation. Figure 13 illustrates the overall reconstruction error. As we can see, when compared to the random selection process, the selection scheme of Algorithm 3 provides a considerable improvement in terms of reconstruction error for the high compression ratios. The gap between the two curves decreases as we increase the number of representative nodes, since we then decrease the probability of the two schemes choosing different sets. Let us focus on Figure 14, which highlights the NMAE_e to reveal the impact of our selection process on the reconstruction performance of the empty rows. Expectedly, we find that the NMAE_e is sensitive to the used selection method, which confirms the aforementioned hypothesis. That is, in order to guarantee an accurate reconstruction of the inactive nodes' missing data, great care must be taken when selecting the set of representative nodes.
The last simulation highlights the benefit of the third stage of the proposed reconstruction pattern. We compare our algorithm to one that only uses the first two stages of Section 7 to obtain its final recovered data matrix. Following the same logic as in the previous experiments, in order to be comparable, we use the sampling pattern of Section 6 with both of the simulated algorithms, which yields the same set of representative nodes and, consequently, the same set of inactive nodes. Noticeably, we can detect a considerable gap in terms of reconstruction error between the bars of Figure 15. This difference, observed for all of the tested settings, comes from the non-reconstructed readings of the IS nodes with the algorithm of comparison. Because we simulated the same network with the same sensor node neighborhoods, the set of IS nodes is the same for both of the compared algorithms. Figure 16, which depicts the NMAE_e for both approaches, illustrates that we can substantially reduce the reconstruction error of the empty rows, for all of the tested settings, when we apply the minimization (11). These results show that the number of IS nodes is large for a small number of representative nodes. Hence, adding a third interpolation technique, such as our proposed minimization (11), becomes heavily needed. Otherwise, we end up with a data matrix that is barely half built, or even less.