*Article* **Distributed Deep Fusion Predictor for a Multi-Sensor System Based on Causality Entropy**

**Xue-Bo Jin 1,2,\*, Xing-Hong Yu 1,2, Ting-Li Su 1,2,\*, Dan-Ni Yang 3, Yu-Ting Bai 1,2, Jian-Lei Kong 1,2 and Li Wang 1,2,\***


**Abstract:** Trend prediction based on sensor data in a multi-sensor system is an important topic. As the number of sensors increases, we can measure and store more and more data. However, this increase in data has not effectively improved prediction performance. This paper focuses on this problem and presents a distributed predictor that can overcome unrelated data and sensor noise. First, we define the causality entropy to calculate the measurements' causality. Then, the series causality coefficient (SCC) is proposed to select the measurements with high causality as the input data. To overcome the traditional deep learning network's over-fitting to sensor noise, the Bayesian method is used to obtain the weight distribution characteristics of the sub-predictor networks. A multi-layer perceptron (MLP) is constructed as the fusion layer to fuse the results from the different sub-predictors. Experiments on meteorological data from Beijing verify the effectiveness of the proposed method. The results show that the proposed predictor can effectively model the multi-sensor system's big measurement data to improve prediction performance.

**Keywords:** series causality analysis; Bayesian LSTM; multi-sensor system; meteorological data; big measurement data; deep fusion predictor

### **1. Introduction**

Measurements have been obtained and saved in many multi-sensor systems, such as mobile robots [1], unmanned aerial vehicles (UAVs) [2,3], smart agriculture [4,5], air quality monitoring systems [6,7], etc. It is very meaningful to analyze these data and understand and predict the information in the sensor system [8], for example the analysis and prediction of meteorological elements in precision agriculture or environmental management systems [9]. Furthermore, in terms of environmental governance, the prediction for air pollution sources such as PM2.5 has played an important role [10–13].

Recently, more measurements have been collected with the development of sensor technology. Therefore, in a multi-sensor system, big data analysis has become a new research area. These data have two characteristics: noisy and numerous [14]. For example, the collected and saved meteorological data are big data and include many variables, such as temperature, wind, rainfall, humidity, etc. Further, they are related to each other [15]. However, the correlation between each type of variable is different: some of them have a strong correlation, but some have a low correlation.

In general, more data can provide more information. For big data, deep learning can extract hidden information to make more accurate predictions [16]. Recent research has shown that the recurrent neural network (RNN) and its improved versions are widely used in regression prediction problems, offering better nonlinear modeling ability than classical regression methods.

**Citation:** Jin, X.-B.; Yu, X.-H.; Su, T.-L.; Yang, D.-N.; Bai, Y.-T.; Kong, J.-L.; Wang, L. Distributed Deep Fusion Predictor for a Multi-Sensor System Based on Causality Entropy. *Entropy* **2021**, *23*, 219. https://doi.org/10.3390/e23020219

Academic Editors: Quan Min Zhu, Giuseppe Fusco, Jing Na, Weicun Zhang and Ahmad Taher Azar. Received: 11 January 2021; Accepted: 7 February 2021; Published: 11 February 2021.

As the amount of data has grown, networks have become larger and more complex, and their training times have grown accordingly. To make matters worse, increasing the input data does not improve the prediction performance; on the contrary, it degrades it.

This paper focuses on how to use these big, noisy data in a multi-sensor system to improve prediction performance efficiently. Aiming at multi-sensor systems, we propose a causality entropy method for feature selection and construct a distributed multi-step-ahead prediction framework based on Bayesian deep learning theory. In this way, feature selection reduces the dimensionality of the high-dimensional data, and the problem of data noise affecting deep network training is alleviated. The rest of this paper is organized as follows: Section 2 summarizes current prediction models and describes the main contribution of this paper. Section 3 proposes a distributed deep learning network predictor, and Section 4 describes the experiments and results that verify the performance of our predictor. We draw conclusions in Section 5.

#### **2. Related Works**

#### *2.1. The Methods for Prediction*

Prediction analyzes historical data to estimate future trends. With the development of computer storage and sensor technology, the prediction of measurement data in multi-sensor systems has been widely applied in many fields and has become a hot research topic. Traditional prediction methods require prior knowledge of the data, such as exponential smoothing [17], the moving average (MA) [18], auto-regression (AR) [19], the auto-regressive integrated moving average (ARIMA) [20], etc. In practical systems, traditional prediction methods cannot obtain highly accurate prediction results due to the system's complexity.

For nonlinear input data, shallow machine learning methods obtain model parameters through training, such as support vector machines (SVMs) [21], the echo state network (ESN) [22], Boltzmann machines (BMs) [23], shallow artificial neural networks (ANNs) [24], generalized regression neural networks (GRNNs) [25], etc., which avoids the need for prior knowledge of the data. However, because of their simple structures, they cannot process large amounts of data.

As neural networks have grown deeper, the hidden information in massive data can be extracted to make more accurate predictions. The recurrent neural network (RNN) [26] and its improved versions, such as long short-term memory (LSTM) [27], are widely used for regression prediction problems, demonstrating superior nonlinear modeling capabilities. For example, the gated recurrent unit (GRU) network [28] and Bi-LSTM [29] were proposed to improve on LSTM. Furthermore, researchers [30,31] have combined the one-dimensional convolutional neural network (CNN) with LSTM to predict time series data. System identification provides the theory and methods for establishing mathematical models of dynamical systems [32–36], and some identification approaches can be used to establish prediction models and soft sensor models [37–42] for various application problems.

#### *2.2. The Method to Calculate Causality and Correlation*

Undoubtedly, deep neural networks are currently the best solution to the big data prediction problem of multi-sensor systems. However, we found that the network's predictive ability does not increase as the number of input variables increases. On the contrary, sometimes, the larger the amount of input measurement data from the multi-sensor system, the worse the prediction performs. This is contrary to what we have always believed: one advantage of deep learning networks is their comprehensive and robust learning capability for big data.

We believe one of the reasons is that the data contain too much low-relevance information; the increase in the amount of data lowers the ratio of useful information. The diluted information makes training the network's weights, and hence convergence, more difficult, so the prediction performance does not improve and may even degrade. Therefore, we think the data with a high correlation and strong causality with the target variable should be selected as the network's input, rather than simply increasing their number.

We now describe correlation-degree methods and discuss a causal correlation method suitable for the big measurement data of multi-sensor systems. The Pearson correlation coefficient (PCC) [43] and Spearman correlation coefficient [44,45] have been used for such problems. The former finds the linear relationship between two variables: for data whose features are continuous and follow a normal distribution, the linear relationship between two variables can be mined by the PCC [46]. Jing et al. [47] selected characteristic sub-sequences by the PCC to improve accuracy when forecasting photo-voltaic power output. Lin et al. [48] built a hybrid model framework using a stacking scheme of integrated learning with the PCC between different models. For the prediction problem, however, the PCC requires the prediction target variable to be known, so it cannot be applied to prediction.

Spearman's correlation coefficient is mainly used for problems involving ordinal data; it applies to two variables with an order relationship. Another correlation analysis method, the Kendall correlation coefficient [49], is suitable for ordinal variables or evenly spaced data that do not satisfy the assumption of a normal distribution. This method is usually calculated over a piece of sequence data, and it cannot obtain an effective correlation between the input and output for large amounts of data.

Contreras-Reyes et al. [50] used the frequency-domain Granger causality method to test the statistical significance of the causality between two time series and determine the direction of causality among the drivers of pelagic species' biological indicators. Since the Granger causality coefficient applies only to two stationary time series, its application is limited. Podobnik et al. [51] proposed a detrended cross-correlation analysis method to explore the correlation between two non-stationary series. This shows that effectively measuring the correlation between two variables can help analyze the change characteristics of one of them.

The current methods to calculate the correlation and causality rely on the predicted result and cannot be applied to the prediction problem of multi-sensor systems.

#### *2.3. The Bayesian Deep Learning Network*

The big data measured by sensors contain noise, which is another reason for the degraded prediction performance of deep learning networks. Traditional neural network training obtains fixed weights and biases, which are easily disturbed by noise [52,53]. On the one hand, the noise makes it difficult for the network to converge, that is, the network's loss remains large. On the other hand, if the noise is also learned as a fixed value until a small loss is obtained, overfitting results [54].

Suppose instead that we train the network to obtain weights and biases that express the distribution characteristics of the input data. In that case, the problem of overfitting is avoided. Based on the distribution characteristics of the weights and biases, the obtained neural network is an ensemble, and the output is likewise a group of predictions with distribution characteristics, improving the reliability of the prediction results. The Bayesian deep learning network arose from this research question [55]. Through Monte Carlo sampling, the Bayesian deep learning network runs the network several times, averages all the losses, and uses the average for backpropagation to obtain the distribution of the weights and biases [56].

The Bayesian method has been used in many application systems, such as indoor tracking [57], robot systems [58], etc. The Bayesian deep learning network has been applied in modeling with noisy data, and some results have been obtained. For example, Li et al. [59] integrated uncertainties by defining the Bayesian deep learning framework, in which a sequential Bayesian boosting algorithm is used to improve the estimation accuracy. Another example is [60], where a Bayesian framework was proposed to model the valence predictions.

#### *2.4. Innovation*

Aiming at the problem of improving prediction performance based on the huge amount of measurement data in a multi-sensor system, this paper provides a distributed deep prediction network. To resolve the contradiction between data volume and performance and the influence of data noise on prediction performance, the innovations of this paper are the following:

(1) A series causality entropy method is developed to select the related input data for the neural network. Compared with the PCC [48], Spearman correlation [45], and Kendall correlation coefficient [61] methods, this method does not depend on the prediction results and is suitable for prediction problems based on measurement data in multi-sensor systems.

(2) A distributed prediction framework is proposed, in which Bayesian training is used to suppress the noise impact of the data, and the predictions based on the selected input data are fused by a nonlinear fusion network. The proposed method outperforms the classical LSTM [27], GRU [28], CNN-LSTM [11], and conv-LSTM [30] predictors, etc., in prediction performance.

#### **3. Distributed Deep Fusion Predictor**

#### *3.1. Series Causality Entropy*

In a multi-sensor system, we can set up multiple sensors to obtain a variety of measurement data. For example, in the system given in Figure 1, we use four sensors to obtain four types of measurement data, and they will be used as candidate input data for the deep network. The prediction task is for Measurement 1, and we will predict its future trend.

Firstly, we will consider the method to select the input data for the networks. Obviously, the principle of selecting data is to select those measurement data that are most causal for the future trend of Measurement 1. As for the prediction problem, it can be defined as the series causality between the historical data and the future data.

We give the following definition about the causality entropy to calculate the measurement's causality between two data named *X* and *Y*:

$$CE(X, Y) = \frac{1}{N - 1} \sum\_{i=1}^{N} \left( \frac{X(i) - \overline{X}}{\sigma\_X} \right) \times \log \left( \frac{Y(i) - \overline{Y}}{\sigma\_Y} \right) \tag{1}$$

where *X* and *Y* are the means of *X*(*i*) and *Y*(*i*), *i* = 1, 2, . . ., *N*, respectively, and *σX* and *σY* are the standard deviations of *X*(*i*) and *Y*(*i*). Note that *CE*(*X*, *Y*) can be positive or negative: a positive value indicates that the two series are positively supporting; a negative value, negatively supporting.

This method of calculating correlation cannot be directly applied to prediction problems. First, both the measured data and their prediction are involved in the prediction problem, so we cannot calculate *CE* when the prediction is not yet available. Secondly, since the step length *I* of the measurement data differs from the predicted step length *J*, a single number of data points *N* as in Equation (1) cannot be used.

Therefore, we propose the following series causality coefficient (SCC) for the measurements in the multi-sensor system. Suppose the measured data are represented by *Xm*(*i*), where *m* = 1, 2, . . ., *M* is the number of the sensor providing the measurement and *i* = 1, 2, . . ., *I* is the step number of the historical data used for prediction. The target data to be predicted are represented by *Y*(*j*), where *j* = 1, 2, . . ., *J* is the step number of the prediction. We revise the coefficient of Equation (1) as follows.

$$S\_m = \frac{1}{K - 1} \sum\_{i=1, j=1}^{K} \left| \frac{X\_m(i) - \overline{X\_m}}{\sigma\_{X\_m}} \right| \times \log \left| \frac{Y(j) - \overline{Y}}{\sigma\_Y} \right| \tag{2}$$

where *m* = 1, 2, . . ., *M* is the number of the sensor providing the measurement, *K* = *min*(*I*, *J*), *Xm* and *Y* are the means of *Xm*(*i*) and *Y*(*j*), *i*, *j* = 1, 2, . . ., *K*, respectively, and *σXm* and *σY* are the standard deviations of *Xm* and *Y*. We can see that Equation (2) still contains the prediction *Y*(*j*), which is unknown. To eliminate *Y*(*j*) from Equation (2), we modify Equation (2) by normalization. The normalized SCC of each measurement can be obtained as follows.

$$\text{SCC}\_{m} = \frac{S\_{m}}{S\_{1} + S\_{2} + \dots + S\_{M}} = \frac{\sum\_{i=1}^{K} \left| \frac{X\_{m}(i) - \overline{X\_{m}}}{\sigma\_{X\_{m}}} \right|}{\sum\_{m=1}^{M} \sum\_{i=1}^{K} \left| \frac{X\_{m}(i) - \overline{X\_{m}}}{\sigma\_{X\_{m}}} \right|} \tag{3}$$

From Equation (3), we can conclude that the value of the *SCC* is between zero and one; the larger the *SCC*, the higher the causality. For example, a value of zero means that the feature is not useful for predicting the target variable. We can see that the SCC given by Equation (3) omits the calculation of the prediction *Y*(*j*).
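To make Equation (3) concrete, the following is a minimal sketch (the function name `scc` and the handling of constant series are our assumptions, not from the paper): each sensor's summed absolute z-score over the first *K* steps is normalized across sensors.

```python
import numpy as np

def scc(measurements, K):
    """Normalized series causality coefficient, a sketch of Eq. (3).

    `measurements` is a list of per-sensor series; K = min(I, J).
    The 1/(K-1) factor of Eq. (2) cancels in the normalization.
    """
    scores = []
    for x in measurements:
        x = np.asarray(x[:K], dtype=float)
        std = x.std(ddof=1)
        if std == 0.0:
            # A constant series (e.g., zero rainfall) carries no signal.
            scores.append(0.0)
        else:
            scores.append(np.abs((x - x.mean()) / std).sum())
    scores = np.array(scores)
    total = scores.sum()
    return scores / total if total > 0 else scores
```

By construction, the returned coefficients are non-negative and sum to one, matching the range stated after Equation (3).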

**Figure 1.** Relationship between the target variables to be predicted and the input variables.

We give the following example to illustrate the SCC obtained by Equation (3). Meteorological data are used, including temperature, wind direction, wind force, rainfall, and humidity, to predict the future temperature. We have five measurements, so according to Equation (3), *M* is five. We set *K* = 24; then, *SCCm* can be obtained. *X* is set to the five meteorological elements separately, and *Y* is the future temperature to be predicted. The result is shown in Table 1. To illustrate the differences among the *SCCm* clearly, we visualize them in Figure 2.


**Table 1.** The SCC between each measurement and the variable to be predicted.

It can be seen from Table 1 and Figure 2 that the causality between the historical temperature data and their future prediction is the largest, at 0.3673. Next is humidity: the SCC between historical humidity and future temperature is 0.3259. By comparison, the causal relationships of wind force and wind direction with the future temperature are smaller. The data also reflect no causal relationship between rainfall and temperature, for which we obtain an SCC of zero.

**Figure 2.** SCC between different measurements and predictions.

Further, we can draw the following conclusions. If all the data are used for training, the rainfall data can only slow the training's convergence and degrade the temperature prediction performance. Therefore, the rainfall data must be eliminated and cannot be used as input for network training and prediction. As for the wind force and wind direction data, because of their low causality, the performance improvement they bring even as input data is limited, while they increase the training time of the network. On the contrary, the humidity data have a high causal correlation with the future temperature. Therefore, using both the historical temperature and humidity data to predict the future temperature may achieve better performance than using temperature data alone. The experiments in Section 4 verify these points.

#### *3.2. Bayesian LSTM as the Sub-Predictor*

The LSTM cell used in this paper is composed of three gating units, i.e., the input gate, forget gate, and output gate. The calculation process is the following:

$$\begin{aligned} f\_t &= \sigma \left( W\_{fx} x\_t + W\_{fh} h\_{t-1} + b\_f \right) \\ i\_t &= \sigma \left( W\_{ix} x\_t + W\_{ih} h\_{t-1} + b\_i \right) \\ \tilde{c}\_t &= \tanh \left( W\_{cx} x\_t + W\_{ch} h\_{t-1} + b\_c \right) \\ c\_t &= f\_t \cdot c\_{t-1} + i\_t \cdot \tilde{c}\_t \\ o\_t &= \sigma \left( W\_{ox} x\_t + W\_{oh} h\_{t-1} + b\_o \right) \\ h\_t &= o\_t \cdot \tanh(c\_t) \end{aligned} \tag{4}$$

where *t* is the current moment to predict, *w* = [*Wfx*, *Wfh*, *Wix*, *Wih*, *Wcx*, *Wch*, *Wox*, *Woh*] are the weights, and *b* = [*bf*, *bi*, *bc*, *bo*] are the biases. *ct* is the cell state, and *ht* is the output of the LSTM cell. The cells can be arranged in several layers with different numbers of input and output cells depending on the numbers of input and output steps of the prediction. The structure of the network is shown in Figure 3. The input data *x* are the given data used to predict the future trend, where *x* = [*X*(1), *X*(2), . . ., *X*(*I*)] are the input data at each moment with *I* data points, and *xt* = [*Xt*(1), *Xt*(2), . . ., *Xt*(*I*)] are the input data at the current moment *t*. The output of the last layer can be set as the output of the LSTM network, denoted *y*. For the training process, we have *y* = [*Y*(1), *Y*(2), . . ., *Y*(*J*)], and at the current moment *t*, we have *yt* = [*Yt*(1), *Yt*(2), . . ., *Yt*(*J*)].
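As an illustration of Equation (4), a single LSTM cell step can be sketched in plain NumPy (the function name `lstm_step` and the dictionary layout of the weights are our own; deep learning frameworks implement this internally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM cell step following Eq. (4).

    W maps gate names ("fx", "fh", ..., "oh") to weight matrices;
    b maps "f", "i", "c", "o" to bias vectors.
    """
    f = sigmoid(W["fx"] @ x_t + W["fh"] @ h_prev + b["f"])        # forget gate
    i = sigmoid(W["ix"] @ x_t + W["ih"] @ h_prev + b["i"])        # input gate
    c_tilde = np.tanh(W["cx"] @ x_t + W["ch"] @ h_prev + b["c"])  # candidate state
    c = f * c_prev + i * c_tilde                                  # new cell state
    o = sigmoid(W["ox"] @ x_t + W["oh"] @ h_prev + b["o"])        # output gate
    h = o * np.tanh(c)                                            # cell output
    return h, c
```

Since the output gate lies in (0, 1) and tanh in (−1, 1), each component of *ht* is bounded in magnitude by one.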

**Figure 3.** LSTM cell and its networks.

In the normal LSTM network, the parameters, including all the weights and biases, are constants. The Bayesian LSTM instead treats each weight and bias as a random distribution rather than a fixed value: each parameter obtained by Bayesian LSTM training is a mean and variance describing the distribution of the weights and biases. The difference between the normal LSTM network and the Bayesian LSTM network is shown in Figure 4.

The LSTM neural network can be seen as a probabilistic model *P*(*y*|*x*, *θ*): it assigns a probability to each possible output *y* ∈ *Y* given an input *x* ∈ R*<sup>p</sup>*, using the set of parameters *θ* comprising the weights *w* and biases *b*, i.e., *θ* = [*w*, *b*]. We denote the training data *x* and *y* as *D*, i.e., *D* = [*x*, *y*].

**Figure 4.** The difference between the normal LSTM network and the Bayesian LSTM network. (**a**) The parameters in the LSTM; (**b**) the example of the parameters in the normal LSTM; (**c**) the example of the parameters in the Bayesian LSTM.

Given the training data *D*, Bayesian inference can be used to calculate the posterior distribution of the weights *P*(*θ* | *D*) [62]. This distribution gives the predictive distribution for unknown data: the predictive distribution for an input *x* is *P*(*y* | *x*) = *E<sub>P(θ|D)</sub>*[*P*(*y* | *x*, *θ*)]. However, it is still difficult to find *P*(*θ* | *D*) directly. A variational approximation to the Bayesian posterior on the weights is a feasible method. Variational learning finds the parameters (*µ*, *σ*) of a distribution on the weights *q*(*θ*|*µ*, *σ*) that minimizes the Kullback–Leibler (KL) divergence [63] from the true Bayesian posterior on the weights:

$$(\mu, \sigma)^\* = \arg\min\_{\mu, \sigma} \text{KL}[q(\theta|\mu, \sigma) || P(\theta|D)] \tag{5}$$

According to the Bayesian theory,

$$P(\theta|D) = \frac{P(D|\theta)P(\theta)}{P(D)}\tag{6}$$

and the definition of the Kullback–Leibler (KL) divergence, Equation (5) can be transformed to:

$$(\mu, \sigma)^\* = \arg\min\_{\mu, \sigma} \int q(\theta | \mu, \sigma) \log \frac{q(\theta | \mu, \sigma)}{P(\theta)P(D|\theta)} d\theta \tag{7}$$

Note that we discarded *P*(*D*) because it does not affect the optimized parameter solution. Then, the cost function is set as:

$$Loss = \int q(\theta|\mu, \sigma) \log \frac{q(\theta|\mu, \sigma)}{P(\theta)P(D|\theta)} d\theta \tag{8}$$

To keep the standard deviation non-negative, we set *σ* = log(1 + exp(*ρ*)). Let *ε* be zero-mean Gaussian white noise, i.e., *ε* ∼ N(0, 1). Then, we have *θ* = *µ* + log(1 + exp(*ρ*)) ⊗ *ε*, where ⊗ is point-wise multiplication. Further, noting that *q*(*θ*|*µ*, *ρ*) *dθ* = *q*(*ε*) *dε*, the derivatives of Equation (8) can be calculated as follows:

$$\frac{\partial}{\partial \mu} \text{Loss} = \frac{\partial}{\partial \mu} \int q(\theta | \mu, \rho) \log \frac{q(\theta | \mu, \rho)}{P(\theta)P(D|\theta)} d\theta \tag{9}$$

$$\frac{\partial}{\partial \rho} \text{Loss} = \frac{\partial}{\partial \rho} \int q(\theta | \mu, \rho) \log \frac{q(\theta | \mu, \rho)}{P(\theta)P(D|\theta)} d\theta \tag{10}$$

Then, as for Equation (9), we have:

$$\begin{split} \frac{\partial}{\partial\mu}\text{Loss} &= \frac{\partial}{\partial\mu}\int q(\theta|\mu,\,\rho)\log\frac{q(\theta|\mu,\,\rho)}{P(\theta)P(D|\theta)}d\theta \\ &= \frac{\partial}{\partial\mu}\int\log\frac{q(\theta|\mu,\,\rho)}{P(\theta)P(D|\theta)}\,q(\theta|\mu,\,\rho)d\theta \\ &= \frac{\partial}{\partial\mu}\int\log\frac{q(\theta|\mu,\,\rho)}{P(\theta)P(D|\theta)}\,q(\varepsilon)d\varepsilon \\ &= \frac{\partial}{\partial\mu}\log\frac{q(\theta|\mu,\,\rho)}{P(\theta)P(D|\theta)}\int\,q(\varepsilon)d\varepsilon \\ &= \frac{\partial}{\partial\mu}\log\frac{q(\theta|\mu,\,\rho)}{P(\theta)P(D|\theta)} \end{split} \tag{11}$$

Similarly, Equation (10) can be derived further as the following:

$$\frac{\partial}{\partial \rho} Loss = \frac{\partial}{\partial \rho} \log \frac{q(\theta | \mu, \rho)}{P(\theta)P(D|\theta)} \tag{12}$$

Denote that:

$$Loss = \log \frac{q(\theta | \mu, \rho)}{P(\theta)P(D | \theta)} = \log q(\theta | \mu, \rho) - \log P(\theta) - \log P(D | \theta)$$

then we have:

$$\begin{split} \frac{\partial}{\partial \mu} \text{Loss} &= \frac{\partial Loss}{\partial \theta} \frac{\partial \theta}{\partial \mu} + \frac{\partial Loss}{\partial \mu} \\ &= \frac{\partial Loss}{\partial \theta} + \frac{\partial Loss}{\partial \mu} \end{split} \tag{13}$$

$$\begin{split} \frac{\partial}{\partial \rho} \text{Loss} &= \frac{\partial \text{Loss}}{\partial \theta} \frac{\partial \theta}{\partial \rho} + \frac{\partial \text{Loss}}{\partial \rho} \\ &= \frac{\partial \text{Loss}}{\partial \theta} \frac{\varepsilon}{1 + \exp(-\rho)} + \frac{\partial \text{Loss}}{\partial \rho} \end{split} \tag{14}$$

Note that the *∂Loss*/*∂θ* term is shared by the gradients with respect to the mean and the standard deviation, and it is exactly the gradient found by the backpropagation algorithm on a normal LSTM network. Therefore, to learn the mean and standard deviation, we can calculate the gradient by backpropagation and then scale and translate it. We summarize the optimization process as seven steps in Table 2.
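As a hedged illustration of this scale-and-translate scheme, the following toy loop applies the updates of Equations (13) and (14) to a single scalar weight, with an assumed N(0, 1) prior and a Gaussian likelihood of known noise standard deviation 0.5 (the task and all constants are our own; a real Bayesian LSTM applies the same updates to every weight and bias):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
data = rng.normal(2.0, 0.5, size=200)   # toy observations of the unknown weight

mu, rho = 0.0, -3.0                     # variational parameters; sigma = log(1 + e^rho)
lr = 1e-4

for step in range(2000):
    eps = rng.standard_normal()         # sample eps ~ N(0, 1)
    sigma = np.log1p(np.exp(rho))       # sigma stays positive
    theta = mu + sigma * eps            # reparameterized weight sample

    # dLoss/dtheta for Loss = log q(theta) - log P(theta) - log P(D|theta)
    d_logq = -(theta - mu) / sigma**2           # d log q / d theta
    d_prior = -theta                            # d log N(theta; 0, 1) / d theta
    d_lik = np.sum(data - theta) / 0.5**2       # d log N(D; theta, 0.5) / d theta
    dL_dtheta = d_logq - d_prior - d_lik

    # Scale and translate the shared gradient (Eqs. (13) and (14))
    dL_dmu = dL_dtheta + (theta - mu) / sigma**2
    dL_drho = dL_dtheta * eps * sigmoid(rho) + ((eps**2 - 1.0) / sigma) * sigmoid(rho)

    # Gradient descent on the variational parameters
    mu -= lr * dL_dmu
    rho -= lr * dL_drho
```

After training, `mu` settles near the posterior mean of the toy weight (close to 2.0), while `sigma` contracts toward the posterior standard deviation.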


**Table 2.** The optimization process for the Bayesian LSTM networks.

#### *3.3. Model Framework*

We propose a distributed prediction model combining SCC and a deep learning network for the prediction problem. The proposed model framework is shown in Figure 5, and the model consists of three main components: selection nodes, sub-predictors, and fusion nodes.

The selection node calculates the series causality of the data sources and selects the variables related to the target data as the network input. For each selected input variable, a Bayesian LSTM sub-predictor is designed. Finally, we use the fusion node to fuse the prediction results of the multiple sub-predictors. An artificial neural network, the MLP, is used in the fusion node. The MLP is a fully connected combination of artificial neurons that applies a nonlinear activation function to model the relationship between the input and output.

#### **4. Experiments**

#### *4.1. Dataset*

Our experiments used a meteorological dataset from Shunyi District, Beijing, covering 2017 to 2019. The data were measured hourly at a meteorological station. The future temperature was chosen as the prediction target to test the proposed model. The dataset contains 1095 days, for a total of 26,280 samples, ensuring sufficient training data. We selected the first 90% of the data for training and the remaining 10% for testing.

#### *4.2. Experimental Setup*

A PC with an Intel Core i5-4200U CPU at 1.60 GHz and 6 GB of memory was used for the experiments. The default parameters in Keras and PyTorch were used for deep neural network initialization. We used ReLU as the activation function of the Bayesian LSTM layer and a linear activation function for the MLP layer.

We set up one Bayesian LSTM layer and one MLP layer, and each layer's size was set to 24. The Adam algorithm was used for supervised training, and the model was trained with mini-batch sampling. The model hyperparameters, such as the learning rate and batch size, were obtained from experiments and are presented in Table 3.

**Table 3.** Hyperparameters for the experiments.


**Figure 5.** Model framework.

The model's performance was evaluated by the following four metrics. The root-mean-squared error (RMSE):

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( y\_i - \hat{y}\_i \right)^2} \tag{15}$$

where *ŷi* is the prediction, *yi* is the ground truth, and *n* is the number of data points.

The mean-squared error (MSE) can reflect the value of the loss function of network convergence and is defined as:

$$MSE = \frac{1}{n} \sum\_{i=1}^{n} (y\_i - \hat{y}\_i)^2 \tag{16}$$

The mean absolute error (MAE) and Pearson correlation coefficient (R) between the prediction and reference were also explored in the experiments.

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} |y\_i - \hat{y}\_i| \tag{17}$$

$$R = \frac{\sum\_{i=1}^{n} (y\_i - \overline{y}\_i)(\hat{y}\_i - \overline{\hat{y}}\_i)}{\sqrt{\sum\_{i=1}^{n} (y\_i - \overline{y}\_i)^2 \sum\_{i=1}^{n} (\hat{y}\_i - \overline{\hat{y}}\_i)^2}} \tag{18}$$
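The four metrics of Equations (15)–(18) can be computed directly; the following small helper (the name `metrics` is our own) returns all four:

```python
import numpy as np

def metrics(y, y_hat):
    """RMSE, MSE, MAE, and Pearson R of Eqs. (15)-(18)."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    err = y - y_hat
    mse = np.mean(err**2)             # Eq. (16)
    rmse = np.sqrt(mse)               # Eq. (15)
    mae = np.mean(np.abs(err))        # Eq. (17)
    r = np.corrcoef(y, y_hat)[0, 1]   # Eq. (18), Pearson correlation
    return rmse, mse, mae, r
```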

#### *4.3. Case 1*

In this case, the Bayesian LSTM model's performance is verified and the causality evaluated by predicting the future temperature. We used the SCC to compare the correlations among the time series variables and selected the temperature and humidity as the distributed deep model's input data. We set the time step to 24 and produced a total of 24 prediction steps. The blue and red lines present the ground truth of the temperature and the model's predictive results, respectively. The RMSE of the prediction is 3.203.

Figure 6 shows the comparison of the measurement data (the ground truth) and the 24 step forward prediction results. There is a light red band above and below the red line, which is the variance of the Bayesian network's result. It can be seen that the predictive trend is close to the ground truth, and most of the forecast values are within the confidence interval.

**Figure 6.** The prediction results of the temperature. The top picture shows the prediction for the first 200 hours and is a part of the bottom picture, which shows the results for about 21 days. In the bottom picture, the sensor is out of order for two hours, during which the sensor measurement data are zero. However, the prediction result effectively overcomes the sensor's failure and gives a daily temperature trend consistent with the historical data.

In the actual measurement data, a sensor failure produces wrong measurement values in the prediction model's input data: for two hours, the sensor measurements are zero. Nevertheless, the prediction maintains the correct daily temperature trend, effectively overcoming the sensor failure.

#### *4.4. Case 2*

In this case, we calculated the causality of the four meteorological factors in the data set and selected the best data for the network model. Because the SCC between temperature and rainfall is zero, we did not consider the rainfall data in the prediction.

The data set used to predict the temperature contains four meteorological elements, i.e., historical temperature, humidity, wind force, and wind direction. We first considered two variables as the input of the network and found that the prediction performance differed across combinations; this performance was related to the SCC parameter. We then increased the number of inputs to three or four. The results show that as the sensor input data increased, the prediction performance did not improve, but decreased instead.

Table 4 and Figure 7 show the comparison results with two inputs. It can be seen from Table 4 that when historical temperature and humidity are set as the input, the best prediction performance is obtained, in which the RMSE, MSE, and MAE are 3.203, 10.260, and 2.000, respectively. With this combination, the RMSE, MSE, and MAE are all lower than with the other combinations of inputs, such as historical temperature with wind force or historical temperature with wind direction.

The larger the SCC, the stronger the causality of the data with respect to the target data. As shown in Table 1, the historical temperature data and humidity have the greatest correlation with the future temperature data. Therefore, using these two types of data, compared with historical temperature data alone as the input, significantly improves the prediction performance.

**Table 4.** Prediction performance with two inputs.


**Figure 7.** Comparison of prediction performance with two inputs. The input variables are historical temperature and humidity, historical temperature and wind force, and historical temperature and wind direction, respectively. We can find that when the inputs are the historical temperature and humidity, the least RMSE, MSE, and MAE and the largest R can be obtained.

Then, we increased the input variables one-by-one, adding humidity, wind force, and wind direction separately. The performance for different numbers of inputs is shown in Table 5 and Figure 8. We can see that with only historical temperature as the input data, the RMSE, MSE, and MAE were 3.508, 12.305, and 2.331, respectively. With two inputs, that is, historical temperature together with humidity, the minimum prediction RMSE of 3.203 was obtained; the MAE, MSE, and R were also the best. However, with three input data, the RMSE increased to 3.235, and with four input data, the RMSE remained higher, at 3.230. Therefore, the experiments show that more input data do not result in better prediction performance.

**Table 5.** Prediction performance with multiple inputs.


**Figure 8.** Comparison of prediction performance with multiple inputs. We can see that when two input variables are used, compared with one input variable, the RMSE, MSE, and MAE decrease and R increases, which shows that the performance is getting better. However, as the number of input variables increases, the performance becomes worse. For example, when the input variables are historical temperature, humidity, and wind force, the prediction performance worsens. Further, when we use the four input variables, the performance is the worst.

#### *4.5. Case 3*

In this case, we compared other deep network models with the method proposed in this paper. None of the baseline models included a feature selection process; they used all features as the network input. As shown in Table 6 and Figure 9, the RMSEs of LSTM [27], GRU [28], CNN-LSTM [11], conv-LSTM [30], and the proposed Bayesian LSTM were 3.714, 3.429, 3.630, 3.594, and 3.203, and the MSEs were 13.797, 11.759, 13.174, 12.915, and 10.260, respectively. The MAEs were 2.467, 2.137, 2.406, 2.344, and 2.000, respectively. Compared with LSTM and GRU, the RMSE of the proposed Bayesian LSTM decreased by 13.76% and 6.59%, the MSE decreased by 25.64% and 12.75%, and the MAE decreased by 18.93% and 6.41%, respectively. Compared with other hybrid models, such as CNN-LSTM and conv-LSTM, the Bayesian LSTM was also the best, obtaining the minimum RMSE of 3.203 and the least MAE of 2.000. Therefore, the Bayesian LSTM can better fit the data and had the best prediction performance.

**Table 6.** Prediction performance with different models.


**Figure 9.** Comparison of the prediction performance with different sub-predictors. We can find that the proposed model with the Bayesian LSTM is the best, obtaining the least RMSE, MSE, and MAE and the largest R.

#### **5. Conclusions**

This article focuses on multivariate noisy measurement data modeling and prediction and proposes a distributed deep Bayesian LSTM prediction network based on causality entropy. The performance of the model was verified on real weather data sets.

In a multi-sensor system, the actual data set is usually non-linear and noisy. Therefore, analyzing the correlation between measurements from a multi-sensor system is very important for prediction. We developed the SCC to analyze the original multidimensional variables and then selected the variables most causal for the target variable. The SCC can reduce the total amount of data entered into the network, thereby reducing the computational burden of the network. It also reduces errors caused by unnecessary inputs.

As is well known, neural networks have a strong ability to fit nonlinearity. However, we found that the measurement data from the multi-sensor system contain complex noise. We used the Bayesian LSTM to reduce the influence of this noise on the neural network: the network weights are sampled from their learned distribution, and the sampled predictions are averaged to obtain a more stable output.
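This sampling-and-averaging step can be illustrated with a minimal sketch (the interface below is hypothetical; `sample_predict` stands for one forward pass with weights drawn from the learned posterior):

```python
def bayesian_average(sample_predict, n_samples=50):
    """Monte Carlo averaging over weight samples: each call to sample_predict()
    returns one predicted sequence; the mean over draws is the point forecast and
    the variance gives a confidence band around it."""
    draws = [sample_predict() for _ in range(n_samples)]
    horizon = len(draws[0])
    mean = [sum(d[t] for d in draws) / n_samples for t in range(horizon)]
    var = [sum((d[t] - mean[t]) ** 2 for d in draws) / n_samples for t in range(horizon)]
    return mean, var
```

The variance term is what produces the light red band around the prediction in Figure 6.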

In future research, we can consider other causality analysis methods. We will also replace the MLP with other fusion methods to reduce the number of parameters the network model needs for fusing the results. The approaches proposed in this paper can be combined with other parameter estimation algorithms [32,64–67] to study the parameter identification problems of linear and nonlinear systems with different disturbances [68–72], to build soft sensor models and prediction models, and can be applied to other fields [73–77] such as signal processing and process control systems.

**Author Contributions:** Conceptualization, X.-B.J.; data curation, Y.-T.B. and J.-L.K.; formal analysis, T.-L.S. and Y.-T.B.; methodology, X.-H.Y.; software, X.-H.Y.; supervision, L.W.; validation, T.-L.S.; visualization, D.-N.Y.; writing, original draft, X.-H.Y.; writing, review and editing, X.-B.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the National Key Research and Development Program of China No. 2020YFC1606801, the National Natural Science Foundation of China Nos. 61903009 and 61903008, the Beijing Municipal Education Commission Nos. KM201910011010 and KM201810011005, the Young Teacher Research Foundation Project of BTBU No. QNJJ2020-26, the Defense Industrial Technology Development Program No. 6142006190201, and the Beijing excellent talent training support project for young top-notch team No. 2018000026833TD01.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Article* **A Modified FlowDroid Based on Chi-Square Test of Permissions**

**Hongzhaoning Kang, Gang Liu \* , Zhengping Wu, Yumin Tian and Lizhi Zhang**

School of Computer Science and Technology, Xidian University, Xi'an 710071, China; kanghzn@stu.xidian.edu.cn (H.K.); wuzhenping@stu.xidian.edu.cn (Z.W.); ymtian@mail.xidian.edu.cn (Y.T.); rnzhang@stu.xidian.edu.cn (L.Z.)

**\*** Correspondence: gliu@xidian.edu.cn

**Abstract:** Android devices are currently widely used in many fields, such as automatic control, embedded systems, the Internet of Things and so on. At the same time, Android applications (apps) always use multiple permissions, and permissions can be abused by malicious apps that disclose users' privacy or breach the secure storage of information. FlowDroid has been extensively studied as a novel and highly precise static taint analysis for Android applications. Aiming at the problems of complex detection and false alarms in FlowDroid, an improved static detection method based on feature permissions and risk rating is proposed. Firstly, the Chi-square test is used to extract permissions correlated with malicious apps, and mutual information is used to cluster the permissions to generate feature permission clusters. Secondly, a risk calculation method based on permissions and combinations of permissions is proposed to identify dangerous data flows. Experiments show that this method can significantly improve detection efficiency while maintaining the accuracy of dangerous data flow detection.

**Keywords:** automatic control; mutual information; static detection; Chi-square test; permission; Flow-Droid

**Citation:** Kang, H.; Liu, G.; Wu, Z.; Tian, Y.; Zhang, L. A Modified FlowDroid Based on Chi-Square Test of Permissions. *Entropy* **2021**, *23*, 174. https://doi.org/10.3390/e23020174

Academic Editor: Quan Min Zhu Received: 14 December 2020 Accepted: 27 January 2021 Published: 30 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

### **1. Introduction**

Google Android is a mobile operating system that is widely used in many fields [1,2]. With the development of the Internet of Things, Android quickly gained a large share of the market. At the same time, the number of malicious applications (apps) has increased significantly over the last few years [3]. According to a recent report from McAfee, over 1.6 million new examples of mobile malware were discovered in the first quarter of 2019 [4]. Therefore, the detection of Android malware with a high accuracy rate and high efficiency is an important issue.

Various approaches have been proposed in previous works with the intention of detecting Android malware. These approaches can be categorized into static analysis, dynamic analysis or hybrid analysis [5]. Dynamic analysis means that, in the process of running an application, the flow of privacy information and data is tracked and captured, and the malicious tendency of application behavior is analyzed and judged. Dynamic analysis can monitor and track the flow of private data in real time [6] and is not affected by code obfuscation, encryption and other factors. However, privacy leaks that are not triggered at runtime cannot be detected, and the low code coverage causes a high missing rate. At the same time, real-time operation results in greater resource consumption [7]. In the case of resource shortages on mobile devices, the system efficiency will be seriously affected. In contrast to dynamic analysis, static analysis is done without running an app. In static analysis, features such as permissions and API calls are extracted from the app source code by reverse engineering to analyze and infer suspicious behavior from an app and discover problems in different stages of the entire life cycle, verifying the security of an app at the source code level. As a highly influential static analysis tool for Android apps, FlowDroid [8] has the advantage of wide code coverage, and it can detect many malicious behaviors that cannot be detected by dynamic analysis. However, source-code-level analysis brings a large amount of irrelevant detection, leading to high false positives and low detection efficiency and decreasing the availability of these tools. In our experiments, for apps over 10 MB in size, FlowDroid reports timeouts and insufficient memory. Even for the two apps from the FlowDroid samples, the test takes nearly an hour.

Research and experiments show that app security threats are strongly correlated with some characteristic permissions [9]. When using FlowDroid, only those flow paths that actually cause privacy leakages need to be considered, which significantly reduces the scale of analysis and improves efficiency. This paper proposes a redundancy resolution method for FlowDroid, which can cluster the correlated permissions and calculate the risks of flow paths. This paper provides the following two contributions:


#### **2. Related Work**

The static analysis of Android malware relies on Java bytecode, which is extracted by disassembling an app. The manifest file is also a source of information for static analysis. Kirin [10] is a safety inspection scheme applied at app installation, which operates by defining security rules to identify dangerous permission combinations; the installation strategy is formulated based on the use of security rules as detection criteria. However, due to the small number of rules and the lack of representation of permission combinations, the detection efficiency and accuracy cannot be guaranteed. TrustDroid [11] provides two alternative detection modes, real-time detection on the mobile device side and static analysis on the server side, converting the data flow to a tree structure using the Jasmin middle code representation to generate a function call graph and preventing untrusted apps from leaking user privacy information. The resource consumption of TrustDroid is also extremely high. LeakMinder [12] analyzes the security of apps from third-party markets and decompiles Android application package (APK) files by reverse engineering. Based on predefined sources and sinks, LeakMinder generates a call graph and data flow diagram and finds possible privacy leak paths. However, implicit data leaks cannot be detected. Besides, the artificially designed sources and sinks are not particularly representative, which results in contingency and inaccuracy. Cen et al. [13] proposed the use of a probabilistic discriminative model based on regularized logistic regression for Android malware detection. The probabilistic discriminative model works well with permissions and achieves the best detection results by combining both decompiled source code and application permissions. Kang et al. [14] proposed a method that detects and classifies Android malware using static analysis combined with the attacker's information.
The effectiveness of Android malware detection is improved by integrating the attacker's information as a feature, and the method categorizes illegitimate applications into homogeneous classes. Song et al. [15] proposed an integrated static framework using a filtering technique consisting of four layers to identify and evaluate mobile malware on Android. Sun et al. [16] presented an approach that combines static logic structures and dynamic runtime information to detect Android malware. Behavior similarity is used for the classification of malware. The results showed that the approach is easy to implement and has low computational overheads. Rovelli et al. [17] presented a permission-based malware detection system that uses machine learning classifiers on behavioral patterns to distinguish inconspicuous applications. DAPASA [18] is an approach used to detect Android piggybacked apps through sensitive subgraph analysis. DAPASA generates a sensitive subgraph (SSG) to profile the most suspicious behavior of an app. Five features are constructed from the SSG to depict the invocation patterns. The five features are fed into machine learning algorithms to detect whether an app is piggybacked or benign. Talha et al. [19] presented a permission-based Android malware detection system consisting of three components, namely the central server, the Android client and the signature database, in which static analysis is used to categorize an Android application as normal or harmful. Li et al. [20] raised the issue of considering interaction terms across features for the discovery of malicious behavior patterns in Android applications and proposed a classifier for Android malware detection based on a factorization machine architecture.

FlowDroid [8] was proposed by Arzt et al. in 2013 and has been widely studied and applied in the field of Android static analysis. FlowDroid is a context-, flow-, field- and object-sensitive and lifecycle-aware static taint analysis tool for Android apps. To increase recall, FlowDroid creates a complete model of Android's app lifecycle. However, a large number of normal paths are also detected when the entire life cycle is analyzed, causing false positives and low efficiency. The main purpose of this paper is to improve the analysis efficiency and applicability of FlowDroid.

#### **3. Preliminaries**

#### *3.1. Android Permission*

The Android system is an extension based on Linux, which provides the permission mechanism [21]. The operations that apps can perform are specified to limit the software's ability to manipulate the system or other software. The Android permission mechanism requires developers to declare the permissions they need in AndroidManifest.xml and to obtain the user's consent during installation to access system resources and functional components by calling the related APIs. Android protects sensitive system and user information by restricting apps from accessing system resources with permissions other than those declared. Android uses a coarse-grained permission management mechanism and no longer reviews the running process after granting permissions; thus, malicious apps exploit users' ignorance of permissions and this coarse-grained management to access or even leak sensitive information. Android 8.0 provides 135 permissions and corresponding APIs to access system resources. In fact, only a small portion of permission usage can lead to sensitive information leakage. If permission calls unrelated to malicious behavior are accurately excluded from detection, the data paths that need to be detected in static analysis can be significantly reduced, thus reducing false alarms and improving detection efficiency.

#### *3.2. FlowDroid*

FlowDroid, based on Soot [22], works directly at the bytecode level and does not require access to an app's source code. It parses the APK file of an Android app, converts the Java code into Jimple middle code, simulates the life cycle of an Android app to handle callback functions and generates a call graph (CG) and inter-procedural control-flow graph (ICFG) [23,24] to trace taints (Figure 1). It uses the Interprocedural Finite Distributive Subsets (IFDS) framework to model data flow propagation and generates complete tainted data flow paths with the Heros framework [25]. Therefore, FlowDroid has very high requirements in terms of computing and memory resources. For Enriched1.apk (a sample application shown in [25]), a total of 46 nodes and 78 function call paths were generated (Figure 2 shows a partial call graph of these). The call graph composed of the data paths to be analyzed is very complex, and there is a large number of callbacks and callback relationships between functions, which leads to high time and resource costs.


**Figure 1.** Workflow of FlowDroid.

**Figure 2.** Partial function call graph of Enriched1.apk.

Our experiments show that FlowDroid usually reports timeout or out-of-memory errors for apps with a size larger than 10 MBytes. The reason for this is that an Android app often involves dozens of components at runtime, and the interaction between multiple components leads to hundreds of callback methods.

Although full-scale analysis can ensure high accuracy, it results in an unnecessary amount of analysis. FlowDroid should be improved in two aspects as follows:


#### *3.3. Mathematical Background*

The Chi-square test is a hypothesis testing method used to determine whether two variables are independent. For two discrete variables, it can be concluded whether there is a correlation between them by using the Chi-square test. The larger the Chi-square value, the greater the deviation between the two variables, the smaller the correlation between them and the stronger the independence. When the value of Chi-square reaches 0, that means the factors are exactly the same. The formula of the quaternary Chi-square test is as follows:

$$\chi^2 = \frac{N(AD - BC)^2}{(A+B)(C+D)(A+C)(B+D)}\tag{1}$$
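The statistic in Equation (1) can be evaluated directly from the four cell counts of the 2 × 2 contingency table; a minimal sketch (names are illustrative):

```python
def chi_square_2x2(a, b, c, d):
    """Quaternary (2x2) Chi-square statistic of Equation (1);
    a, b, c, d are the four cell counts and N = a + b + c + d."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # degenerate table: an empty row or column
        return 0.0
    return n * (a * d - b * c) ** 2 / denom
```

When the two variables are distributed identically across the table (AD = BC), the statistic is 0, matching the interpretation in the text above.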

For an abstract random variable, to remove its uncertainty, a certain amount of information needs to be used, and information entropy is a mathematical measure of this. The higher the information entropy, the more information needs to be introduced; the lower the information entropy, the less information is needed. The information entropy of *X* is defined as:


$$H(X) = -\sum\_{n} P(X) \log\_2 P(X) \tag{2}$$


In order to determine the influence of the information entropy between two variables, the information entropy of *X* can be obtained when *Y* appears, as shown in Equations (3) and (4):

$$H(X|Y) = -\sum\_{n} P(Y) \sum\_{n} P(X|Y) \log\_2 P(X|Y) \tag{3}$$

$$H(X,Y) = H(X) + H(Y|X) = H(Y) + H(X|Y) \tag{4}$$

where *H*(*X*|*Y*) is the conditional information entropy, *P*(*X*|*Y*) is the conditional probability and *H*(*X*,*Y*) is the joint information entropy. According to the above equations, the mutual information value of *X* and *Y* can be obtained as *I*(*X*;*Y*):

$$I(X;Y) = H(X) - H(X|Y) = \sum P(X,Y) \log\_2 \frac{P(X,Y)}{P(X)P(Y)} \tag{5}$$
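Equations (2)–(5) can be checked numerically from a joint distribution; the sketch below computes *I*(*X*;*Y*) from a table of joint probabilities (a hypothetical helper, not the paper's code):

```python
import math

def mutual_information(joint):
    """I(X;Y) = sum over x, y of P(x, y) * log2(P(x, y) / (P(x) P(y))) (Equation (5)).
    joint is a 2-D list whose entries are the joint probabilities P(X = x, Y = y)."""
    px = [sum(row) for row in joint]        # marginal P(X)
    py = [sum(col) for col in zip(*joint)]  # marginal P(Y)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:                     # 0 * log 0 is taken as 0
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi
```

For independent variables the value is 0; for two perfectly coupled binary variables it is 1 bit.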

#### **4. The Improved Detection Method**

FlowDroid analyzes all data paths, resulting in high false positives and high resource requirements. This paper presents a redundancy resolution method based on feature permissions and risk. The purpose is to exclude the large number of irrelevant paths (security paths) from static analysis. A lightweight FlowDroid, named Permission-based FlowDroid (PBFlowDroid), is proposed based on the above methods. Figure 3 shows the architecture of PBFlowDroid.

**Figure 3.** Workflow of PBFlowDroid.

Firstly, the Chi-square test is used to extract permissions related to malicious applications, and these permissions (malicious sensitive permissions) are classified into permission clusters by a clustering algorithm based on mutual information. Thus, the large number of permissions is reduced to a small number of permission clusters which are considered in static analysis. Secondly, different permissions or combinations of permissions bring different risks to user privacy or system security. To improve the accuracy of analysis, a risk assignment and calculation algorithm for single permissions or combinations of permissions is proposed. With these methods, all paths are given a risk value. Using the risk value of each path, the security of the taint data flow propagation path generated by control flow [26] and data flow can be determined, and the user is notified whether the taint data flow is a safe path, ensuring the accuracy of static analysis and improving the analysis efficiency.

#### *4.1. Permission Cluster Extraction*

In Android, each permission has the two states of "request" and "no request", which are independent of the number of requests. This scenario is suitable for the Chi-square test. In this study, the quaternary Chi-square test is used; the Chi-square of permission *p* is as follows:

$$\chi^2(p) = \frac{N\left(A\_p D\_p - B\_p C\_p\right)^2}{\left(A\_p + B\_p\right)\left(C\_p + D\_p\right)\left(A\_p + C\_p\right)\left(B\_p + D\_p\right)}\tag{6}$$

where *N* denotes the total number of app samples, which consists of *X* malicious apps and *Y* normal apps. For permission *p*, the numbers of requests by malicious apps and normal apps are counted as *A<sup>p</sup>* and *Bp*. The numbers of malicious apps and normal apps that do not request *p* are counted as *C<sup>p</sup>* = *X* − *A<sup>p</sup>* and *D<sup>p</sup>* = *Y* − *B<sup>p</sup>*, as Table 1 shows.

**Table 1.** Chi-square test distribution of permission *p*.


As for *χ*<sup>2</sup>, the Chi-square test provides a threshold checklist as a criterion of reliability. For each permission, the probability of it relating to a malicious application is obtained by referring to the Chi-square test threshold table [27]. The larger the probability, the more malicious applications tend to have the corresponding permission, while a value less than 0.5 indicates that the permission has almost no correlation with malicious applications. In our experiment, the top 20 permissions with a value greater than 0.5 are regarded as permissions with a high correlation with malicious applications, as shown in Table 2.

**Table 2.** Permissions clusters, Chi-square and risk assignment.


The Chi-square test selects the permissions to be investigated and significantly reduces the candidate paths for static analysis. Even with 20 permissions, the number of paths that can be associated with some apps is still quite large. In fact, permissions are not independent of each other. When one app applies for a certain permission, other permissions of the same type that are related to achieve a combined function are also applied, which leads to strong correlation between permissions. For example, "READ\_SMS" and "WRITE\_SMS" are often applied and used at the same time. In static analysis, if two permissions with high correlation are detected separately, multiple detection results will be generated. This may decrease the accuracy of detection. To solve this problem, we use the clustering algorithm to cluster the selected permissions so that each cluster is representative.

Permission is a kind of discrete feature information, and the similarity between permissions can be measured by mutual information based on information entropy. We use *Pm*(*X*) and *Pn*(*X*) to represent the probability that permission *X* is applied for by malicious apps and by normal apps, respectively. Then, the information entropy *H*(*X*) of permission *X* is as follows:

$$H(X) = -\left(P\_{\mathfrak{m}}(X)\log\_2 P\_{\mathfrak{m}}(X) + P\_{\mathfrak{n}}(X)\log\_2 P\_{\mathfrak{n}}(X)\right) \tag{7}$$

The mutual information value of permissions *X* and *Y* is calculated by Equation (5). In order to describe the similarity between permissions *X* and *Y* more intuitively, the correlation between the two permissions can be obtained by Equation (8):

$$\text{Cor}(X, Y) = 2 \times \left[ \frac{I(X, Y)}{H(X) + H(Y)} \right] \tag{8}$$

where the value of *Cor*(*X*,*Y*) lies in [0, 1]. A value of 0 means that permissions *X* and *Y* are completely unrelated; the larger the value, the greater the correlation between them.
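Equations (7) and (8) can be sketched as follows (hypothetical helper names; `i_xy` stands for the mutual information of Equation (5)):

```python
import math

def permission_entropy(p_m, p_n):
    """H(X) of a permission from its malicious/normal request probabilities (Equation (7))."""
    return -sum(p * math.log2(p) for p in (p_m, p_n) if p > 0)

def permission_correlation(i_xy, h_x, h_y):
    """Normalized correlation Cor(X, Y) of Equation (8); the result lies in [0, 1]."""
    return 2.0 * i_xy / (h_x + h_y)
```

The normalization by H(X) + H(Y) is what bounds the correlation to [0, 1] regardless of how uncertain the individual permissions are.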

In this paper, a clustering method based on mutual information is proposed to cluster the selected permissions (in our experiment, the 20 permissions shown in Table 2) to generate feature permission clusters (FPCs) with low similarity between clusters and high similarity within clusters and to further remove irrelevant detections in static analysis.

The steps of the clustering algorithm based on mutual information are as follows:

**Clustering Algorithm Based on Mutual Information:**

*Input: Permissions set Obtained by Chi-Square Test: S* = {*p*0, *p*<sup>1</sup> , · · · , *pn*} *Output: Cluster Sets: C* = {*c*0, *c*<sup>1</sup> , · · · , *cm*}, *m* ≤ *n*


After clustering, the permissions highly correlated with malicious apps were grouped into multiple clusters (seven clusters in our experiment, *c*<sub>0</sub>~*c*<sub>6</sub>, as shown in Table 2).

FPC extraction not only eliminates irrelevant paths caused by permissions unrelated to malicious applications but also eliminates the correlation within the permission cluster, further eliminating the redundancy in static detection and improving the detection efficiency.

#### *4.2. Risk Calculation*

App security threats are strongly correlated with certain characteristic permissions [9]. Operations corresponding to different permissions pose different threats to user privacy and system security [9]. For example, an operation using the network permission often transfers private data on the device to external addresses, involving interaction between user information and the outside world, so its threat degree is high; the permission for the device's local location only obtains the user's current status and does not interact with the outside world or affect device security, so its threat degree is low.

To quantify the risk level of different permission clusters, we assign each cluster a risk value; every permission in the same cluster has the same risk value. The choice of these values is not unique, and only three conditions need to be met:


where *m* + 1 is the number of cluster sets, and *k* divides the clusters into high-risk and low-risk clusters. In our experiment, *m* is 6 and we choose 4 as the value of *k*.

These constraints follow from the analytical calculation method. When assigning risk values to each cluster, we select a simple assignment that satisfies the above three conditions. It should be noted that the threshold value in Section 5.1 depends on the risk values; it is an empirical value obtained through experiments, and different risk value assignments will correspond to different threshold values. In our experiment, the risk value for each cluster is shown in Table 2.

When an app performs an operation or acquires a resource, it sometimes applies for more than one permission, which forms a combination of permissions. A combination of permissions may pose a greater threat to the system than a single permission [28]. In our experiment, we refer to [10] to calculate the risk value of an app, and the calculation rules are as follows:

*Rule 1*: For the single requested permissions, the risk value *R<sub>S</sub>* is the sum of the risk values of the individual permissions:

$$R\_S = \sum\_{i=1}^{M} R(p\_i) \tag{9}$$

where *R*(*p<sub>i</sub>*) is the risk value of permission *p<sub>i</sub>*, and *p<sub>i</sub>* is a single permission requested by the app.

*Rule 2*: For any requested combination of permissions *PC<sub>j</sub>* whose permissions belong to cluster *c<sub>j</sub>*, the risk value *R<sub>C</sub>*(*PC<sub>j</sub>*) is defined as:

$$R\_C(PC\_j) = \begin{cases} \prod R(p\_i), & p\_i \in c\_j \text{ and } j < k\\ \sum R(p\_i), & p\_i \in c\_j \text{ and } j \ge k \end{cases} \tag{10}$$

and the risk of the combined permissions is defined as the sum of the risk values of all permission combinations; that is,

$$R\_C = \sum\_{j=0}^{m} R\_C(PC\_j) \tag{11}$$

*Rule 3*: The risk value *R* of an app is defined as the logarithmic mean of the total risk value:

$$R = \frac{\log(R\_S + R\_C)}{M} \tag{12}$$

where *M* is the total number of requested single permissions and combinations of permissions.
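Rules 1–3 can be sketched as follows. The log base in Equation (12) is not specified in the text, so base 10 is assumed here; all function and parameter names are illustrative.

```python
import math

def app_risk(single_perms, perm_combos, risk, cluster_of, k=4):
    """Risk value R of an app per Equations (9)-(12).

    single_perms : individually requested permissions
    perm_combos  : permission combinations PC_j (lists of permissions),
                   each drawn from one cluster c_j
    risk         : dict permission -> risk value R(p_i)
    cluster_of   : dict permission -> cluster index j
    k            : clusters with index < k are high risk (Rule 2 uses a
                   product there, a sum otherwise)
    """
    # Rule 1: sum of risks over the single permissions (Equation (9))
    r_s = sum(risk[p] for p in single_perms)

    # Rule 2 + Equation (11): per-combination risk, summed over combinations
    r_c = 0.0
    for combo in perm_combos:
        j = cluster_of[combo[0]]
        vals = [risk[p] for p in combo]
        r_c += math.prod(vals) if j < k else sum(vals)

    # Rule 3 (Equation (12)): logarithmic mean over M, the total number
    # of requested single permissions and combinations
    m = len(single_perms) + len(perm_combos)
    return math.log10(r_s + r_c) / m
```

For instance, one single permission plus one high-risk combination whose risks multiply to a large value will push *R* above the threshold even when *M* is small, matching the intuition that malicious apps use few but riskier permissions.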

In general, a normal app provides multiple services to satisfy users' functional needs and uses several permissions, whereas a malicious app has simpler functions but uses permissions with higher risk values [29]. Therefore, we use the mean risk rather than the total risk to evaluate the risk of an app. An app whose *R* is less than the threshold can be preliminarily judged secure. For an app whose *R* exceeds the threshold, the mapping between the permissions and the corresponding APIs is constructed and added to the *Source* set, and the security decision on the taint data flow proceeds.

#### *4.3. Filtration of Taint Data Flow*

The function call graph generated by FlowDroid is very complex. Frequent calls between functions produce a large number of redundant detection paths, which makes further static analysis very costly. In this study, the data paths from native FlowDroid are further filtered based on the risk value.

On the basis of the definition of native FlowDroid, we define the following extra variables:


For the filtering method in PBFlowDroid, the criterion is whether the risk value of the path is higher than the threshold. Based on the risk calculation for the path in the call graph, the filtering method extracts the paths with a higher risk value to *LeakRoute*. In this way, we obtain all the pollution paths and identify the data paths with a high risk of privacy leakage. Because the filtering method reduces the data paths to be analyzed, not only is the analysis time shortened, but also the accuracy of pollution analysis can be improved due to the elimination of false positive paths.
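The filtering criterion above can be sketched as follows. The path representation and scoring rule here are hypothetical (this is not the actual FlowDroid or PBFlowDroid API): each path is a sequence of call-graph nodes annotated with the permissions they touch, and a path is kept in *LeakRoute* only if its summed permission risk exceeds the threshold.

```python
def leak_route(paths, perm_risk, threshold):
    """Keep only the high-risk taint paths (the LeakRoute set).

    paths     : list of paths; each path is a list of nodes, each node a
                dict with a "perms" list of permissions used at that node
    perm_risk : dict permission -> risk value
    threshold : risk value above which a path is retained
    """
    route = []
    for path in paths:
        score = sum(perm_risk.get(p, 0.0)
                    for node in path
                    for p in node.get("perms", []))
        if score > threshold:
            route.append(path)
    return route
```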

From the above, PBFlowDroid introduces a high-risk pollution path detection method and reduces the scale of the pollution paths to be analyzed. Furthermore, PBFlowDroid mitigates the high false-positive and false-negative rates of analysis in native FlowDroid.

**Figure 4.** Workflow of filtering method.

#### **5. Experiments**

This section tests the accuracy and efficiency of the proposed PBFlowDroid. The computer used was a Z600 WorkStation with an Intel(R) Xeon(R) E5540 @ 2.53 GHz CPU and 4.00 GB of physical memory. All tests were run on Windows 7 with Oracle's Java Runtime version 1.8 (64 bit). Android 6.0 with the Android-23 SDK was used in all experiments.

#### *5.1. Accuracy Experiments*

In this experiment, 500 normal apps from Google Play and 50 malicious apps from GitHub's Malicious Application Sample Library [5] were used as test samples. The risk value *R* of each successfully tested app was calculated. An app with an *R* value less than the threshold was recognized as a normal application; otherwise, it was identified as a malicious application. The risk threshold was an unknown parameter at the beginning of the experiment. Inspired by machine learning [30], we used one-tenth of the experimental data as a training set to obtain the risk threshold. The training set contained 50 randomly selected normal apps and five malicious apps. By calculating the risk value of all apps in the training set and selecting the threshold that minimizes the false positive rate, we obtained the risk value threshold. The remaining data were used as the validation set. In our experiments, we set 0.12 as the threshold value. Table 3 shows the results.
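The training-set threshold selection can be sketched as a simple scan over candidate cut points. Since the exact selection procedure is not detailed in the text, minimizing the total number of misclassifications is used here as a stand-in objective; the function name and data are illustrative.

```python
def select_threshold(normal_risks, malicious_risks):
    """Pick the risk threshold on the training set.

    Scans every observed risk value as a candidate cut point and returns
    the one that minimizes misclassifications, where an app with
    R >= threshold is flagged as malicious."""
    candidates = sorted(set(normal_risks) | set(malicious_risks))
    best_t, best_err = None, float("inf")
    for t in candidates:
        fp = sum(1 for r in normal_risks if r >= t)    # normal flagged malicious
        fn = sum(1 for r in malicious_risks if r < t)  # malicious missed
        if fp + fn < best_err:
            best_t, best_err = t, fp + fn
    return best_t
```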

**Table 3.** Result of testing.



In our experiments, 413 normal apps and 35 malicious apps were successfully decompiled. Among the 413 normal apps, 332 were correctly identified and the other 81 were misidentified as malicious apps; the detection rate for the normal sample was 80.4%, and the false alarm rate was 19.6%. In total, 27 of the 35 malicious apps were correctly identified; the omission ratio was 22.8% and the accuracy rate was 77.2%. Compared with the test results of native FlowDroid on DroidBench, where 30 malicious applications were detected out of 39 apps, for an accuracy of 76.9% [8], the proposed method guaranteed sufficient detection accuracy.

Table 4 gives, for some apps, the number of requested permissions and combinations *M*, the number of taint paths, and the risk value *R*. We can see that the *R* values of *FangTianxia* and *MeiPai* were 0.134 and 0.121, respectively; both exceed the threshold of 0.12, so they were reported as malicious apps.


**Table 4.** Detection results for some apps.

#### *5.2. Efficiency Experiments*

In PBFlowDroid, only data paths with a higher risk value are analyzed, which reduces the detection complexity. In this section, a comparative test against native FlowDroid in terms of detection time and memory consumption was performed. When we reproduced FlowDroid in our experimental environment, we found that it took more than 10<sup>4</sup> s to analyze some apps, and the number of paths produced by a completed analysis was generally between 10<sup>4</sup> and 10<sup>5</sup>. Because FlowDroid's time complexity differs from that of our method by orders of magnitude, we do not compare the two tools directly on those apps. Thus, 500 test samples were randomly selected from Google Play, with sizes ranging from 50 KB to 60 MB. For apps larger than 10 MB, FlowDroid reports a timeout or out-of-memory error. Table 5 shows the time and memory consumption of PBFlowDroid for eight apps.


**Table 5.** Time and memory consumption comparison (OOM: out of memory).

With the increase of the app size, the memory consumption increases significantly for both tools. For FlowDroid, memory is quickly exhausted, making the detection fail. For PBFlowDroid, memory consumption is kept within 4 GB and all tests were successful in our experiments.

Table 6 lists the classification algorithms supported by PUMA [31] and their results. However, machine learning algorithms, including the method proposed by PUMA, can only discriminate whether an app is malicious from its use of permissions; they cannot analyze how an app abuses permissions. In contrast, our method and FlowDroid can not only determine whether an app is malicious but also analyze its usage of permissions. Moreover, the results in Table 5 show that our analysis consumes fewer resources than FlowDroid.

**Table 6.** Android malware detection results for the different algorithms.


#### **6. Conclusions**

FlowDroid is a static taint analysis tool widely used for Android apps. However, the call graph generated by FlowDroid grows exponentially as the size of the app increases, which reduces its usability. Research shows that the security threat of an app mainly comes from its abuse of permissions, and not all permissions lead to leakage of sensitive information. This paper proposes a method to identify dangerous data paths; secure paths are filtered out of further analysis. In this way, the call graph is greatly simplified and the resource requirements of the analysis process are significantly reduced. In addition, we used the Chi-square test and mutual information to extract the correlated permissions and proposed a risk calculation method that considers permission combinations. In this way, a more accurate risk value serves as the criterion, reducing misjudgment. The experimental results show that the proposed method significantly reduces the complexity of detection while guaranteeing detection accuracy.

In our future work, communication between processes needs to be taken into account, and the assessment of communication risk is worth exploring; this will help address the collusion attack problem. Second, the risk value of a data path should be determined from APIs, application components, and other features [32] rather than permissions alone, to improve the accuracy of pollution path identification. Third, the distinction between small and large malware should also be considered; we will add some large malware to the test set in our next work to verify the applicability of PBFlowDroid to large malware.

**Author Contributions:** Conceptualization, G.L. and L.Z.; methodology, H.K.; software, Z.W.; validation, Z.W., L.Z. and H.K.; formal analysis, L.Z.; writing—original draft preparation, H.K.; writing—review and editing, L.Z.; supervision, G.L.; project administration, Y.T. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported in part by the Shaanxi Key R & D Program (Grant No. 2019ZDLGY13-01) and the Science and Technology Projects of Xi'an, China (Grant No. 201809170CX11JC12).

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

