**Preface to "Sensor Signal and Information Processing II"**

Smart sensors are revolutionizing system design in everything from sports cars to assembly lines. These new sensors have capabilities far beyond those of their predecessors: they not only measure parameters efficiently and precisely, but can also enhance and interpret those measurements, transforming raw data into truly useful information. The concept of the smart sensor was first introduced by NASA during spacecraft development, and the first products appeared in 1979. Smart sensors can automatically calibrate, compensate, and collect data. These capabilities give smart sensors high accuracy and resolution, high stability and reliability, and good adaptability; compared with traditional sensors, they offer a high performance-to-price ratio. Early smart sensors simply conditioned and converted the sensor's output signal before passing it to a microprocessor for processing. In the 1980s, smart sensor design centered on microprocessors, integrating the sensor signal conditioning circuit, microcomputer memory, and interface circuitry onto a single chip, giving the sensor a degree of artificial intelligence. In the 1990s, intelligent measurement technology was further improved, so that sensors could be miniaturized and gained self-diagnosis functions.

Fast-forwarding to 2020, sensor signal and information processing (SSIP) (https://www.mdpi.com/journal/sensors/special_issues/SSIP_II) has become an overarching field of research focusing on the mathematical foundations and practical applications of signal processing algorithms that learn, reason, and act. It bridges the boundary between theory and application, developing novel theoretically inspired methodologies targeting both longstanding and emergent signal processing applications. The core of SSIP lies in its use of nonlinear and non-Gaussian signal processing methodologies combined with convex and nonconvex optimization. SSIP encompasses new theoretical frameworks for statistical signal processing (e.g., deep learning, latent component analysis, tensor factorization, Bayesian methods) coupled with information theoretical learning, and novel developments in these areas specialized in the processing of a variety of signal modalities including audio, bio-signals, multiphysics signals, images, multispectral data, and video, among others. In recent years, many signal processing algorithms have incorporated some form of computational intelligence as part of their core framework for problem solving. These algorithms have the capacity to generalize, discover knowledge for themselves, and learn new information whenever unseen data are captured. The focus of this book is on a broad range of sensor, signal, and information processing topics, involving the introduction and development of new advanced theoretical and practical algorithms.

> **Wai Lok Woo, Bin Gao** *Special Issue Editors*

### *Article* **A Deep-Learning-Driven Light-Weight Phishing Detection Sensor**

**Bo Wei <sup>1,</sup>\*, Rebeen Ali Hamad <sup>1</sup>, Longzhi Yang <sup>1</sup>, Xuan He <sup>2,3</sup>, Hao Wang <sup>4</sup>, Bin Gao <sup>5</sup> and Wai Lok Woo <sup>1</sup>**


Received: 10 September 2019; Accepted: 28 September 2019; Published: 30 September 2019

**Abstract:** This paper designs an accurate and low-cost phishing detection sensor by exploring deep learning techniques. Phishing is a very common social engineering technique. The attackers try to deceive online users by mimicking a uniform resource locator (URL) and a webpage. Traditionally, phishing detection is largely based on manual reports from users. Machine learning techniques have recently been introduced for phishing detection. With the recent rapid development of deep learning techniques, many deep-learning-based recognition methods have also been explored to improve classification performance. This paper proposes a light-weight deep learning algorithm to detect the malicious URLs and enable a real-time and energy-saving phishing detection sensor. Experimental tests and comparisons have been conducted to verify the efficacy of the proposed method. According to the experiments, the true detection rate has been improved. This paper has also verified that the proposed method can run in an energy-saving embedded single board computer in real-time.

**Keywords:** phishing detection; cyber security; deep learning

### **1. Introduction**

A phishing website is a common social engineering method that mimics trusted uniform resource locators (URLs) and webpages to gather users' sensitive and confidential information, such as user names, passwords, credit card information, etc. Figure 1 shows one example of a phishing website imitating the popular website facebook.com: it replaces "oo" with the barely noticeable "00". The webpage looks exactly the same as the official Facebook page, but the phishing one captures victims' usernames and passwords and forwards them to attackers. The phishing website issue is becoming increasingly severe. According to the latest phishing activity trends report from the Anti-Phishing Working Group (APWG) [1], 138,328 phishing websites were reported in the fourth quarter of 2018. The report also indicates that detection is becoming more difficult because attackers use multiple redirection techniques to obscure the malicious URLs. There was a \$48-million financial loss due to phishing in the US in 2018, based only on the cases reported to the Federal Bureau of Investigation (FBI) [2].


**Figure 1.** One example of a phishing website imitating the popular website facebook.com.

As shown in Figure 2, malicious URL recognition is relatively easy for cyber security experts, since they have sufficient experience in the relevant areas. However, it is extremely difficult for normal users, who usually do not pay much attention when accessing a URL. Therefore, the research community takes advantage of the expert knowledge of cyber security and designs machine-based automatic phishing URL detection. The most popular method to detect a phishing website is the use of a phishing URL tank: the URLs in the tank are recognised as phishing URLs. Phishing URL tanks are maintained by antiphishing organisations to provide live antiphishing databases. There are several well-known antiphishing organisations, such as phishtank [3], Joewein [4], hphosts [5], Malware Domains List [6], etc. Due to the rapidly increasing number of phishing websites, antiphishing organisations require comprehensive contributions from the whole community. To maintain up-to-date phishing URL tanks, they need users, including individuals and organisations, to report phishing websites manually. The lists are fairly accurate because of this manual involvement, but the human effort introduces delay and extra maintenance labour costs. Such handcrafted list-based phishing website detection can effectively prevent further harm, but it may fail to warn users before a URL is reported by a user and placed in the phishing tank.

Conventional machine learning techniques have been introduced to the phishing website detection domain [7,8]. As shown in Figure 3, with the help of experts in cyber security, URLs and websites are first analysed to conduct feature selection from the malicious websites. Next, machine learning experts use the features, along with their labels, to construct a training set and take advantage of classical supervised machine learning algorithms to develop a phishing detection model. Many conventional methods, e.g., support vector machine (SVM), k-nearest neighbours algorithm (kNN), etc., have been explored to fully utilise these features. Deep learning is also incorporated in the phishing detection domain, motivated by its recent rapid development and many successful applications [9,10]. Different from classical machine learning methods involving an explicit handcrafted feature selection process, the machine learning experts can use the data directly without the knowledge from the cyber security experts (shown in Figure 3).

**Figure 2.** Difficulties to recognise malicious URLs for experts, normal users, and machines.

This paper designs and implements a light-weight phishing detection sensor. The paper proposes an innovative deep learning model to enable accurate and efficient phishing detection using URLs of websites. Different from previous research works, this research also investigates the feasibility of our proposed deep learning model in resource-constrained computing devices. Furthermore, this research work implements a prototype of a deep-learning-driven light-weight phishing detection sensor in one embedded single board computer, which shows the feasibility of the integration of our method into one wireless router. This work also uses a large volume of benign and malicious URLs to construct the training set and evaluate the efficacy of the proposed model.

In summary, the main contributions of this paper are:


This paper first introduces related work in Section 2. Section 3 shows the background and motivations of the proposed method. The proposed method is introduced in Section 4, and this paper evaluates the performance of the proposed model in Section 5. Section 6 shows the details of the implementation. Finally, Section 7 concludes our work.

### **2. Related Works**

The common method to detect phishing websites is the use of blacklists to include all the reported URLs of phishing websites. This method requires largely manual efforts from the whole community. As introduced in Section 1, the blacklists are mainly maintained by antiphishing organisations. Some popular antiphishing organisations are phishtank [3], Joewein [4], hphosts [5], Malware Domains List [6], etc. Whitelists can also be created to exclude the websites that users trust. Some methods are also proposed, aiming to facilitate the labelling process for list-based phishing website detection. Cao et al. proposed a method to automatically update the whitelist from the users' familiar websites [11]. Jain et al. designed a hyperlink-based phishing detection mechanism to update the whitelist [12]. Sharifi et al. took advantage of search engines to evaluate the legitimacy of websites and create an up-to-date blacklist accordingly [13].

Machine learning has been extensively used in the phishing detection domain. Phishing detection can be modelled as a supervised machine learning problem. A large number of phishing websites on blacklists can be analysed and researched by the cyber security and machine learning communities. Features from two main components of a website are commonly used for phishing detection. First, attackers usually imitate legitimate URLs to lure users into entering phishing websites, so researchers have focused on the analysis of URLs for phishing detection. In addition to URLs, the documents used to render a website, such as HTML, CSS, and JavaScript documents, are also explored for phishing detection. Amrutkar et al. used multiple features from the HTML, CSS, and JavaScript documents of websites to detect phishing contents [7]. That work also investigated website features on smart phones, aiming to realise real-time malicious website detection on mobile devices. Rule-based features from URLs were explored to detect phishing internet banking webpages [14]. Natural language processing techniques are explored in [15] to determine whether a URL is malicious, and the authors use seven traditional classifiers along with selected features from URLs to enable an antiphishing system. Zhang et al. [16] and Xiang et al. [8] proposed CANTINA and its augmented version CANTINA+, which extract features from the contents of websites and use multiple machine learning algorithms.

Recently, deep-learning-based methods have been introduced in the phishing website detection domain. Jiang et al. used convolutional neural network (CNN) techniques, a popular model in deep learning, to detect malicious URLs [17]. One deep learning model using word embedding and CNN has also been proposed to detect malicious URLs, file paths, and registry keys [9]. Le et al. proposed URLNet to use CNN for analysing both word-level features and character-level features for malicious URL detection [10]. Yang et al. applied multiple features for detecting phishing URLs [18]. The deep learning technique has also been introduced into phishing email detection [19].

Different from the previous works, this paper proposes a new deep learning model and further investigates the feasibility of enabling an energy-saving phishing website sensor with the integration of the deep learning model in a resource-constrained computing device.

### **3. Background and Motivations**

List-based phishing website detection is currently the most common method. The lists created by this method can offer labelled training sets, which are an essential prerequisite for the future use of machine-learning-based detection methods. Two URL lists are normally produced by list-based phishing website detection methods, i.e., a blacklist and a whitelist. Figure 4 shows the general mechanism of list-based phishing URL detection methods. The antiphishing companies use reports from the community to create one blacklist and one whitelist. Computing devices use these two lists to detect malicious websites. The whitelist contains the user-trusted URLs; in contrast, when a URL is on the blacklist, it is recognised as malicious. However, the list-based method faces an ongoing challenge: the detection of unknown URLs. It is difficult to classify an unknown URL that is not on any list, and the common policy is to treat it as benign. If a new malicious website uses such an unknown URL, the resulting false negative could harm users. Attackers take advantage of this loophole and keep changing the URLs of their phishing websites to ensure the new URLs are not on the blacklist.

**Figure 4.** The mechanism of list-based phishing URL detection methods.
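To make the policy concrete, the list lookup described above can be sketched as a simple membership test; the list entries below are hypothetical examples, not real reported URLs.

```python
def classify_url(url, blacklist, whitelist):
    """Classify a URL using the list-based policy described above.

    Blacklisted URLs are malicious, whitelisted URLs are benign, and
    unknown URLs fall back to benign -- the loophole that attackers
    exploit by rotating to fresh URLs.
    """
    if url in blacklist:
        return "malicious"
    if url in whitelist:
        return "benign"
    return "benign"  # unknown URL: default policy, potential false negative

blacklist = {"faceb00k.com/login"}  # hypothetical entries
whitelist = {"facebook.com"}

print(classify_url("faceb00k.com/login", blacklist, whitelist))   # malicious
print(classify_url("brand-new-phish.com", blacklist, whitelist))  # benign (missed)
```

The last call illustrates the false negative: a freshly registered phishing URL passes the check until someone reports it.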

Phishing website detection is modelled as a supervised machine learning problem. Components of websites, such as the URL, HTML, etc., are used as training data for building a model to conduct malicious website detection. Classifiers play a vital role in supervised machine learning methods. Several classical and popular supervised machine learning algorithms, such as kNN and SVM, have already been used for malicious website detection applications. Figure 5 shows the general process of a classical supervised-learning-based malicious website detection method. The feature selection process is an initial and essential step for these classical classifiers. Informative features can help improve detection performance, but excellent feature selection needs expert knowledge from a cyber security perspective. Furthermore, it is always difficult to decide the best features for a particular application, and feature selection may cause a drastic loss of valuable information.

**Figure 5.** The general process of a classical supervised learning-based malicious website detection method.

Recently, the use of deep learning has improved the performance of many applications in image processing [20], computer vision [21], acoustic classification [22], natural language processing [23], etc. Many research works also utilise deep learning techniques in malicious website detection. In addition to the significant performance improvement, deep learning has the advantage of being featureless. As shown in Figure 6, deep-learning-based methods do not require feature selection. The unprocessed data can be used to train a model without any extra effort, and deep learning algorithms select the best patterns for the final decision. Motivated by these facts, this paper also designs a deep learning model that uses unprocessed URLs to derive a deep-learning-based light-weight phishing detection sensor for inference.

**Figure 6.** Deep-learning-based malicious website detection.

To enable the light-weight phishing detection sensor, another question this paper addresses is "Can the proposed deep learning model be integrated into a resource-constrained computing device?" The paper aims to design a phishing detection sensor that achieves accurate and efficient phishing detection. With the designed system, it is not necessary to install antiphishing software on every single computing device and Internet of Things (IoT) device; only the designed sensor is required per household or office, placed between the devices and the router. The proposed model can also be implemented in the router directly due to its computational efficiency. To summarise, this paper implements a phishing detection prototype sensor integrating the proposed deep learning methods.

### **4. Method**

This section gives the details of the proposed deep-learning-based phishing URL detection method. Figure 7 shows an overview of the proposed method.

**Figure 7.** System structure.

The first step of the proposed method is data sanitisation. In this step, the common URL prefixes, such as http://, https://, and www, are deleted to prevent the URL presentation conventions of different datasets from affecting phishing URL recognition performance. Without pruning prefixes, the inconsistency of URL formats can easily degrade the quality of the model. For example, if all the URLs in some phishing URL dataset contain the http prefix, the trained model will falsely classify all URLs with the http prefix as phishing. The shorter representation also accelerates inference, which is a major consideration for resource-constrained devices.
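As an illustration, the sanitisation step might be implemented as follows; the prefix set (http://, https://, www) follows the paper, while the function itself is an assumed sketch.

```python
import re

def sanitise_url(url):
    """Strip common scheme and subdomain prefixes before tokenisation,
    so that the model never learns a spurious rule from the prefix."""
    url = re.sub(r"^https?://", "", url, flags=re.IGNORECASE)
    url = re.sub(r"^www\.", "", url, flags=re.IGNORECASE)
    return url

print(sanitise_url("https://www.facebook.com/login"))  # facebook.com/login
```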

A tokeniser is then used to vectorise each character in the URLs. Character-level tokenisation is used instead of word-level analysis because URLs often contain words without any natural-language meaning; more information is contained at the character level. Attackers also mimic the URLs of authentic websites by changing a few characters in ways that are hard to notice. For example, they may change facebook.com to faceb00k.com, replacing "oo" with "00". Character-level tokenisation helps capture this mimicry, improving the performance of malicious URL detection.
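A minimal character-level tokeniser along these lines might look as follows; the vocabulary construction and function name are illustrative assumptions, and a real system would build the vocabulary over the whole training corpus and pad sequences to a fixed length.

```python
def char_tokenise(url, char_index=None):
    """Map each character of a URL to an integer index (0 is reserved
    for padding). 'o' and '0' get distinct indices, which is exactly
    the signal that character-level tokenisation preserves."""
    if char_index is None:
        char_index = {}
    tokens = []
    for ch in url:
        if ch not in char_index:
            char_index[ch] = len(char_index) + 1
        tokens.append(char_index[ch])
    return tokens, char_index

tokens, vocab = char_tokenise("faceb00k.com")
```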

This paper proposes an innovative deep neural network for malicious URL detection. As shown in Figure 8, the proposed Deep Neural Network (DNN) model has the following layers: (1) embedding layer; (2) convolutional layers; (3) concatenation layer; (4) dropout layer; (5) dense layers; (6) sigmoid layer. Table 1 shows the configuration of each layer of the proposed model: the output dimension of the word embedding layer, the number of filters and kernel sizes of the convolutional layers, the rate of the dropout layer, and the number of units of the dense layers. The details for each type of layer are as follows.

**Figure 8.** Structure of the proposed DNN model.


**Table 1.** Architecture configuration of the proposed DNN model.

**Embedding layer:** The embedding layer is usually the first layer of a DNN for a Natural Language Processing (NLP) problem. Taking the tokenised indices as input, the embedding layer returns a vector for each character. Figures 9 and 10 show examples of simple one hot encoding and of the word embedding used here. Unlike one hot encoding, which uses a binary representation for each token, the coefficients of the vector returned by the embedding layer can indicate the relations among characters, which helps improve performance on NLP-related problems. The proposed network uses the word embedding configuration.
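The difference between the two encodings can be illustrated with a toy NumPy sketch; the vocabulary size and embedding dimension here are arbitrary illustrative values.

```python
import numpy as np

vocab_size, embed_dim = 5, 3  # toy sizes for illustration

# One hot: each character index becomes a sparse binary row.
one_hot = np.eye(vocab_size)

# Embedding: a dense matrix, here randomly initialised as it would be
# before training. After training, rows of related characters can end
# up close together, which one hot encoding cannot express.
rng = np.random.default_rng(0)
embedding = rng.normal(size=(vocab_size, embed_dim))

token = 2                # a character index produced by the tokeniser
sparse = one_hot[token]  # binary vector of length vocab_size
dense = embedding[token] # real-valued vector of length embed_dim
```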

**Convolutional layers:** Following the embedding layer, five convolutional layers are used. In each convolutional layer, a kernel, a.k.a. a convolutional filter, is applied to extract the most useful features and remove unnecessary information: element-wise multiplication and summation occur between the filter and the relevant part of the data, and the filter slides through the data to generate the features. Instead of the common sequential structure of convolutional neural networks, parallel convolutional layers are used; each layer considers one window size of consecutive characters and extracts features from them. The rectified linear unit (ReLU) activation function is applied after each convolutional layer. The output of each convolutional layer is then flattened and subsequently concatenated.

**Concatenation layer:** This layer concatenates the features from the previous layers for further processing. Rather than simply concatenating the outputs of the convolutional layers, the output of the embedding layer is also combined: the embedding layer output (without convolutional filtering) preserves the original content information, which can also be used to detect malicious URLs.

**Dropout layer:** The dropout layer is a regularisation technique used to prevent overfitting during the training phase [24]. Neurons are randomly selected and ignored by the dropout layer during training: the ignored neurons are temporarily removed on the forward pass, and their weights are not updated on the backward pass.

**Dense layers:** A dense layer is a fully connected feedforward layer that gives the proposed model more capability to extract informative features. Following the dropout layer, three dense layers are used to analyse the patterns from the concatenation layer. A ReLU activation function also follows each dense layer.

**Sigmoid layer:** The sigmoid function is used in this layer to determine the malicious URLs. The range of the output from a sigmoid function is between 0 and 1, which is used in the final layer of the proposed model to show the prediction probability.
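Putting the layers together, the described architecture could be sketched in Keras roughly as follows. The sequence length, vocabulary size, embedding dimension, filter counts, kernel sizes, dropout rate, and dense-layer sizes are placeholder assumptions, since Table 1 is not reproduced here.

```python
from tensorflow.keras import layers, models

max_len, vocab_size, embed_dim = 200, 100, 32  # illustrative placeholders

inp = layers.Input(shape=(max_len,))
emb = layers.Embedding(vocab_size, embed_dim)(inp)

# Parallel convolutional branches, one per character-window size,
# each followed by ReLU and flattening.
branches = [layers.Flatten()(layers.Conv1D(64, k, activation="relu")(emb))
            for k in (2, 3, 4, 5, 6)]

# Concatenate the branches together with the raw (unfiltered) embedding output.
merged = layers.Concatenate()(branches + [layers.Flatten()(emb)])
x = layers.Dropout(0.5)(merged)
for units in (128, 64, 32):                     # three dense layers with ReLU
    x = layers.Dense(units, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)  # phishing probability

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

The sigmoid output is a probability in (0, 1), thresholded (typically at 0.5) to yield the final malicious/benign decision.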

**Figure 9.** Example of one hot encoding.


**Figure 10.** Example of word embedding.

### **5. Evaluation**

This section discusses the performance of the proposed model. As discussed, the configuration of the proposed model is shown in Table 1. A PC with a Graphics Processing Unit (GPU) was used to train and evaluate the model: an Intel Core i7 8-core 3.60 GHz CPU, 16 GB memory, and an Nvidia GeForce GTX 1060 6 GB GPU. A total of 1,523,966 URLs were used, of which 999,996 were legitimate URLs and 523,970 were phishing URLs. The legitimate URLs are from the Alexa top 1 million sites list [25], and the phishing URLs are from hphosts [5], Joewein [4], malwaredomains [26], and phishtank [3]. Repeated URLs were removed before constructing the dataset. The dataset was randomly split into a training set and a test set, with 10% of the instances used for testing. The true detection rate was used as the accuracy metric, i.e., the ratio between the number of correctly detected instances and the total number of instances.
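The metric can be stated as a one-line computation; the predictions and labels below are made-up toy values for illustration.

```python
def true_detection_rate(predictions, labels):
    """Ratio of correctly classified instances to all instances,
    i.e. overall accuracy, as used in the evaluation."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

print(true_detection_rate([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```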

The proposed model achieves an 86.630% true detection rate. Many deep-learning-based URL detection models use similar structures but differ in their numbers of dense layers and convolutional layers. Therefore, in the following, the effects of the dense layers, the convolutional layers, and the concatenation of the output from the embedding layer are discussed.

First, the effect of the dense layers is shown in Table 2. It is expected that increasing the number of dense layers can improve performance, so we first investigate the effect of the number of dense layers. The proposed model has 4 dense layers (3 dense layers plus the sigmoid layer). Table 2 shows that the true detection rate gradually increases with the number of dense layers: with 1, 2, and 3 dense layers, the true detection rates are 86.537%, 86.538%, and 86.542%, respectively, while the proposed method achieves 86.630%.

**Table 2.** The effect of the dense layers.

| Dense Layers | True Detection Rate |
|---|---|
| 1 | 86.537% |
| 2 | 86.538% |
| 3 | 86.542% |
| 4 (proposed) | 86.630% |
Table 3 shows the effect of the convolutional layers, where a similar observation holds: increasing the number of convolutional layers helps improve the performance of URL-based phishing detection. When using 1, 2, 3, and 4 convolutional layers, the deep learning model achieves true detection rates of 85.401%, 85.832%, 86.169%, and 86.439%, respectively; the proposed method's rate is the highest at 86.630%.

**Table 3.** The effect of the convolutional layers.

| Convolutional Layers | True Detection Rate |
|---|---|
| 1 | 85.401% |
| 2 | 85.832% |
| 3 | 86.169% |
| 4 | 86.439% |
| 5 (proposed) | 86.630% |
In this paper, we also propose to concatenate the output from the word embedding layer so that the dense layers also receive the unprocessed information. This strategy also improves performance, as shown in Table 4: without the concatenation of the embedding layer's output, the true detection rate drops from 86.630% to 83.472%.

**Table 4.** The effect of the concatenation of the output from the embedding layer.

| Configuration | True Detection Rate |
|---|---|
| With embedding-layer concatenation (proposed) | 86.630% |
| Without embedding-layer concatenation | 83.472% |
### **6. Prototype Implementation**

The proposed method is implemented by integrating the deep-learning-based model into a resource-constrained device. In this work, a Raspberry Pi 3 B+ was chosen to implement our prototype. The Raspberry Pi 3 B+ has a quad-core 1.4 GHz 64-bit CPU with 1 GB RAM. It is powered by a 5 V input or a battery and has various Input/Output (IO) ports, such as 4 USB 2.0 ports, a 40-pin general-purpose input/output (GPIO) header, and a Camera Serial Interface (CSI) port. The Raspberry Pi 3 B+ also supports common network interfaces, such as 2.4 GHz and 5 GHz IEEE 802.11 wireless and Ethernet. This abundance of network interfaces makes the Raspberry Pi 3 B+ a good candidate for simulating a router.

Figure 11 shows the implementation process of the whole system. We first use the labelled URLs to train the DNN model on powerful computing devices, such as GPU servers, rack servers, or cloud servers. The trained DNN model is then transferred to the intelligent WiFi router, which acts as the phishing URL sensor. When the intelligent WiFi router receives a URL request, it conducts phishing detection using the integrated DNN model before querying a domain name system (DNS) server. When the URL is recognised as malicious, the smart WiFi router raises an alarm to the user and blocks the user's access to that URL.

**Figure 11.** The implementation process.

To evaluate the efficiency of the proposed model, we measured the computational time of each step of the proposed method on the Raspberry Pi. Table 5 shows the execution times of data sanitisation, tokenisation, and DNN inference. A total of 10 trials were executed, and we calculated the mean execution time of each step. Inference costs 105 ms per URL request, which occupies most of the running time, while data sanitisation and tokenisation take no more than 1 ms. In total, the proposed phishing detection method takes approximately 110 ms to evaluate each URL request, which enables real-time malicious URL detection.

**Table 5.** Execution time in the prototype.
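The measurement procedure itself is straightforward; a sketch of the 10-trial mean timing is shown below, with a placeholder step function standing in for sanitisation, tokenisation, or inference.

```python
import time

def mean_execution_ms(step, arg, trials=10):
    """Mean wall-clock time of one pipeline step over several trials,
    mirroring the 10-trial measurement described in the text."""
    start = time.perf_counter()
    for _ in range(trials):
        step(arg)
    return (time.perf_counter() - start) * 1000 / trials

# Placeholder step; in the prototype this would be sanitisation,
# tokenisation, or DNN inference on one URL.
elapsed = mean_execution_ms(lambda url: url.lower(), "https://example.org/login")
```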


A word-level word embedding method is used alongside character-level word embedding in [10]. To compare against that work and show the efficiency of the proposed model, a deep learning model was implemented using both word-level and character-level word embeddings. The same convolutional layers (as shown in Figure 8) were applied to the outputs of the two word embedding layers, and the two convolutional outputs were concatenated for further processing by the dropout, dense, and sigmoid layers to make a decision. This deep learning model was evaluated on the Raspberry Pi, but an out-of-memory (OOM) error occurred when running it; in other words, there was not sufficient memory on the resource-constrained Raspberry Pi to execute the model with both word-level and character-level word embeddings. This further confirms the efficiency of the proposed light-weight model. To compare performance, the proposed model and the combined word-level and character-level model were both executed on the PC. The DNN inference time of the proposed model is 67 ms, while the combined model needs 96 ms; the proposed model thus reduces the execution time by about 30%.

### **7. Conclusions**

This paper proposed a multispatial convolutional neural network to enable an accurate and efficient phishing detection sensor. Extensive evaluations were conducted to show the performance of the proposed method, whose true detection rate reaches 86.63%. A prototype using a Raspberry Pi was also implemented to enable real-time phishing URL detection. With the proposed method, the execution time is reduced by 30%, and real-time detection is realised on a resource-constrained device.

In the future, a webpage-content-based phishing detection model using deep learning can be proposed and implemented in a resource-constrained sensor as well.

**Author Contributions:** Conceptualization, B.W. and W.L.W.; methodology, B.W., H.W., and R.A.H.; validation, B.W., X.H. and B.G.; writing–original draft preparation, B.W. and R.A.H.; writing–review, X.H., L.Y. and W.L.W.; editing, W.L.W., B.G., L.Y. and H.W.; visualisation, B.W., X.H., L.Y. and H.W.

**Funding:** This research was funded by the National Natural Science Foundation of China (NSFC) (Grant No. 61971093 and No. 61771121), the Open Program of Neusoft Research of Intelligent Healthcare Technology, Co. Ltd. (Grant No. NRIHTOP1802), the Research and Application for Key Technologies of IoT Oriented to Smart Cities (cstc2018jszx-cyztzx0081), and the Innovation and Application for Smart Test of Supply and Demand Integration (cstc2018jszx-cyzd0404).

**Conflicts of Interest:** The authors declare no conflict of interest.

### **Abbreviations**

The following abbreviations are used in this manuscript:


### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Non-Linear Filtering Algorithm Based on Alpha-Divergence Minimization**

### **Yarong Luo, Chi Guo \*, Jiansheng Zheng and Shengyong You**

Global Navigation Satellite System Research Center, Wuhan University, Wuhan 430079, China; yarongluo@whu.edu.cn (Y.L.); zjs@whu.edu.cn (J.Z.); shengyongyou@whu.edu.cn (S.Y.)

**\*** Correspondence: guochi@whu.edu.cn

Received: 25 August 2018; Accepted: 16 September 2018; Published: 24 September 2018

**Abstract:** A non-linear filtering algorithm based on the alpha-divergence is proposed, which uses an exponential family distribution to approximate the actual state distribution and the alpha-divergence to measure the degree of approximation between the two distributions; it thus provides more choices for similarity measurement, by adjusting the value of *α*, during the update of the state equation and the measurement equation in non-linear dynamic systems. Firstly, an *α*-mixed probability density function that satisfies the normalization condition is defined, and the properties of its mean and variance are analyzed when the probability density functions *p*(*x*) and *q*(*x*) are one-dimensional normal distributions. Secondly, a sufficient condition for the alpha-divergence to take its minimum value is proven: when *α* ≥ 1, the expectations of the natural statistical vector of the exponential family distribution are equal to those of the *α*-mixed probability state density function. Finally, the conclusion is applied to non-linear filtering, and a non-linear filtering algorithm based on alpha-divergence minimization is proposed, providing more non-linear processing strategies for non-linear filtering. The algorithm's validity is verified by experimental results, and a better filtering effect is achieved for non-linear filtering by adjusting the value of *α*.

**Keywords:** alpha-divergence; Kullback–Leibler divergence; non-linear filtering; exponential family distribution

### **1. Introduction**

The analysis and design of non-linear filtering algorithms are of enormous significance because non-linear dynamic stochastic systems are widely used in practice, for example in navigation systems [1] and simultaneous localization and mapping [2]. Because the state model and the measurement model are non-linear, the state variables and the observation variables of the system no longer satisfy the Gaussian distribution, and the probability density of the state propagated through the non-linear functions becomes difficult to represent. In order to solve this problem, deterministic sampling (such as the unscented Kalman filter and the cubature Kalman filter) and random sampling (such as the particle filter) are adopted to approximate the probability density distribution, that is to say, to replace the actual state distribution density function by a hypothetical one [3].

In order to measure the similarity between the hypothetical state distribution density function and the actual one, we need to select a measurement method to ensure the effectiveness of the above methods. The alpha-divergence, proposed by S. Amari, is used to measure the deviation between data distributions *p*(*x*) and *q*(*x*) [4], and it can be used to measure the similarity between the hypothetical state distribution density function and the actual one in non-linear filtering. Compared with the Kullback–Leibler divergence (the KL divergence), the alpha-divergence provides more choices for measuring this similarity. Therefore, we use the alpha-divergence as the criterion for measuring the similarity between the two distribution functions. Indeed, adjusting the value of the parameter *α* yields different properties of the similarity measure. The choice of *α* characterizes different learning principles, in the sense that the model distribution is more inclusive (*α* → ∞) or more exclusive (*α* → −∞) [5]. Such flexibility enables *α*-based methods to outperform KL-based methods when the value of *α* is properly selected. The higher the similarity of the two probability distributions *p*(*x*) and *q*(*x*), the smaller the value of the alpha-divergence. It can then be proven that, in a specific range of values, *q*(*x*) fully represents the properties of *p*(*x*) when the alpha-divergence is minimal.

Because the posterior distribution of non-linear filtering is difficult to solve, given that the posterior probability distribution is *p*(*x*), we can use the probability distribution *q*(*x*) to approximate the posterior probability distribution *p*(*x*) of non-linear filtering. The approximate distribution *q*(*x*) is expected to be a distribution with a finite moment vector. This in turn means that a good choice for the approximate distribution is from the exponential family distribution, which is a practically convenient and widely-used unified family of distributions on finite dimensional Euclidean spaces.

The main contributions of this article include:


### **2. Related Work**

It has become a common method to apply various divergence measures to optimization and filtering, among which the KL divergence, as the only invariant flat divergence, has been the most commonly studied [6]. The KL divergence is used to measure the error of the Gaussian approximation process, and it is applied in the distributed update process of Kalman filtering [7]. The proposal distribution of the particle filter algorithm is regenerated using the KL divergence after incorporating the latest measurement values, so that the new proposal distribution approaches the actual posterior distribution [8]. Martin et al. proposed the Kullback–Leibler-divergence-based differential evolution Markov chain filter for the global localization of mobile robots in challenging environments [9], where the KL divergence is the basis of the cost function to be minimized. The work in [3] provides a better measurement method for estimating the posterior distribution by applying KL minimization to the prediction and update steps of the filtering algorithm, but it only provides a proof for KL divergence minimization. The similarity of the posterior probability distributions between adjacent sensors in the distributed cubature Kalman filter is measured by minimizing the KL divergence, and good simulation results are achieved in a collaborative space target tracking task [10].

As a special case of the alpha-divergence, the KL divergence is easy to calculate, but it provides only one measurement method; studies of more general divergences and their applications have therefore received considerable attention. A discrete probability distribution of minimum Chi-square divergence has been established [11]. The Chi-square divergence has been taken as a new criterion for image thresholding segmentation, obtaining better segmentation results than the KL divergence [12,13]. It has been proven that alpha-divergence minimization is equivalent to the *α*-integration of stochastic models, and this has been applied to multiple-expert decision-making systems [6]. Amari et al. [14] also proved that the alpha-divergence is the only divergence family that belongs to both the f-divergences and the Bregman divergences, so it has information monotonicity, a geometric structure induced by the Fisher metric, and a dually flat geometric structure. Gultekin et al. [15] proposed using Monte Carlo integration to optimize the alpha-divergence minimization objective, but did not prove the alpha-divergence minimization. In [16], the application of alpha-divergence minimization to approximate inference was systematically analyzed; different values of *α* move the algorithm between the variational Bayesian algorithm and the expectation propagation algorithm. As a special case of the alpha-divergence (*α* = 2*q* − 1), *q*-entropy [17,18] has been widely used in the field of physics. Li et al. [19] proposed a new class of variational inference methods using a variant of the alpha-divergence called the Rényi divergence and applied it to variational auto-encoders and Bayesian neural networks. More on the theory and applications of the alpha-divergence can be found in [20,21].
Although the theory and applications of the alpha-divergence have been widely studied, we focus on providing a theory that completes the alpha-divergence minimization and on applying it to non-linear filtering.

### **3. Background Work**

In Section 3.1, we provide the framework of the non-linear filtering. Then, we introduce the alpha-divergence in Section 3.2, which contains many types of divergence as special cases.

### *3.1. Non-Linear Filtering*

The actual systems studied in filtering are usually non-linear and non-Gaussian. Non-linear filtering refers to methods that solve, online and in real time, the optimal estimation problem for the state variables of a dynamic system from the system observations.

The state space model of non-linear systems with additive Gaussian white noise is:

$$\mathbf{x}\_{k} = f(\mathbf{x}\_{k-1}) + w\_{k-1} \tag{1}$$

where $x\_k \in R^n$ is the system state vector that needs to be estimated, and $w\_k$ is zero-mean Gaussian white noise with variance $E[w\_k w\_k^T] = Q\_k$. Equation (1) describes the state transition $p(x\_k|x\_{k-1})$ of the system.

The random observation model of the state vector is:

$$z\_k = h(x\_k) + v\_k \tag{2}$$

where $z\_k \in R^m$ is the system measurement, and $v\_k$ is zero-mean Gaussian white noise with variance $E[v\_k v\_k^T] = R\_k$. Suppose that $w\_k$ and $v\_k$ are independent of each other and that the observed values $z\_k$ are conditionally independent given the state variables $x\_k$.
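As a concrete illustration of the state-space model (1) and (2), the following sketch simulates one scalar trajectory. The specific choices of $f$, $h$, $Q$, and $R$ are illustrative assumptions (a growth model often used as a non-linear filtering benchmark), not taken from this paper.

```python
import numpy as np

# One simulated trajectory of the model (1)-(2); f, h, Q, R are illustrative.
rng = np.random.default_rng(0)

def f(x):                       # non-linear state transition, Equation (1)
    return 0.5 * x + 25 * x / (1 + x**2)

def h(x):                       # non-linear measurement function, Equation (2)
    return x**2 / 20

Q, R = 1.0, 1.0                 # process / measurement noise variances
T = 50
x = np.zeros(T)                 # states x_k
z = np.zeros(T)                 # measurements z_k
x[0] = 0.1
for k in range(1, T):
    x[k] = f(x[k-1]) + rng.normal(0.0, np.sqrt(Q))   # x_k = f(x_{k-1}) + w_{k-1}
    z[k] = h(x[k]) + rng.normal(0.0, np.sqrt(R))     # z_k = h(x_k) + v_k
```

The quadratic measurement function makes the sign of the state ambiguous from a single measurement, which is exactly the kind of non-Gaussian posterior the algorithms below target.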

The entire probability state space is represented by the generative model shown in Figure 1. $x\_k$ is the system state; $z\_k$ is the observational variable, and the purpose is to estimate the value of the state $x\_k$. The Bayesian filter is a general method for solving state estimation: it computes the posterior distribution $p(x\_k|z\_{1:k})$, and its recursive solution consists of prediction steps and update steps.

Under the Bayesian optimal filter framework, the system state equation determines that the conditional transition probability of the current state is a Gaussian distribution:

$$p(\mathbf{x}\_k | \mathbf{x}\_{k-1}, z\_{1:k-1}) = N(\mathbf{x}\_k | f(\mathbf{x}\_{k-1}), Q\_k) \tag{3}$$

If the prediction distribution of the system can be obtained from Chapman–Kolmogorov, the prior probability is:

$$p(\mathbf{x}\_k|z\_{1:k-1}) = \int p(\mathbf{x}\_k|\mathbf{x}\_{k-1}, z\_{1:k-1}) p(\mathbf{x}\_{k-1}|z\_{1:k-1}) d\mathbf{x}\_{k-1} \tag{4}$$

When there is a measurement input, the system measurement update equation determines that the measurement likelihood transfer probability of the current state obeys a Gaussian distribution:

$$p(z\_k | \mathbf{x}\_k, z\_{1:k-1}) = N(z\_k | h(\mathbf{x}\_k), \mathbf{R}\_k) \tag{5}$$

According to Bayes' rule, the obtained posterior probability is:

$$p(\mathbf{x}\_k|z\_{1:k}) = \frac{p(z\_k|\mathbf{x}\_k, z\_{1:k-1})p(\mathbf{x}\_k|z\_{1:k-1})}{p(z\_k|z\_{1:k-1})}\tag{6}$$

where $p(z\_k|z\_{1:k-1})$ is the normalizing factor, defined as follows:

$$p(z\_k|z\_{1:k-1}) = \int p(z\_k|\mathbf{x}\_k, z\_{1:k-1}) p(\mathbf{x}\_k|z\_{1:k-1}) d\mathbf{x}\_k \tag{7}$$

Unlike the Kalman filter framework, the Bayesian filter framework does not demand that the update structure be linear, so it can use non-linear update steps.

In the non-linear filtering problem, the posterior distribution $p(x\_k|z\_{1:k})$ often cannot be solved analytically. Our purpose is to use a distribution $q(x)$ to approximate the posterior distribution $p(x\_k|z\_{1:k})$ that has no analytical solution. Here, we use the alpha-divergence to measure the similarity between the two, and we propose a method that directly minimizes the alpha-divergence without adding any additional approximations.

**Figure 1.** Hidden Markov Model (HMM).
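The prediction step (4) and update step (6)–(7) can be carried out by brute force on a discretized state grid, which is a useful reference implementation for checking approximate filters. The scalar model below ($f(x) = 0.9x$, $h(x) = x^2/5$) and all parameter values are illustrative assumptions, not from the paper.

```python
import numpy as np

# Brute-force grid implementation of the Bayesian recursion: prediction via
# Chapman-Kolmogorov (4), update via Bayes' rule (6) with normalizer (7).
rng = np.random.default_rng(1)
grid = np.linspace(-10.0, 10.0, 401)
dx = grid[1] - grid[0]
Q, R = 0.5, 0.5

def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2.0*var)) / np.sqrt(2.0*np.pi*var)

post = gauss(grid, 0.0, 1.0)                 # initial belief p(x_0)
x_true = 1.0
for k in range(20):
    x_true = 0.9*x_true + rng.normal(0.0, np.sqrt(Q))
    z = x_true**2/5.0 + rng.normal(0.0, np.sqrt(R))
    # prediction step, Equation (4): p(x_k | z_{1:k-1})
    trans = gauss(grid[:, None], 0.9*grid[None, :], Q)   # p(x_k | x_{k-1})
    prior = trans @ post * dx
    # update step, Equations (5)-(6)
    lik = gauss(z, grid**2/5.0, R)                       # p(z_k | x_k)
    post = lik * prior
    post /= post.sum() * dx                              # normalizing factor (7)
```

The grid filter is exact up to discretization error, but its cost grows exponentially with the state dimension, which is why the approximations discussed in this paper are needed.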

### *3.2. The Alpha-Divergence*

The KL divergence is commonly used in similarity measures, but we will generalize it to the alpha-divergence. The alpha-divergence is a parametric family of divergence functions, including several well-known divergence measures as special cases, and it gives us more flexibility in approximation [20].

**Definition 1.** *Let us consider two unnormalized distributions p*(*x*) *and q*(*x*) *with respect to a random variable x. The alpha-divergence is defined by:*

$$D\_{\alpha}[p||q] = \frac{1}{\alpha(1-\alpha)} \int \alpha p(\mathbf{x}) + (1-\alpha)q(\mathbf{x}) - p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} d\mathbf{x} \tag{8}$$

*where α* ∈ *R, and D<sup>α</sup> is defined by continuity at α* = 0 *and α* = 1*.*

The alpha-divergence meets the following two properties:


Note that the term $\int [\alpha p(\mathbf{x}) + (1-\alpha)q(\mathbf{x})] d\mathbf{x}$ integrates to one when $p(\mathbf{x})$ and $q(\mathbf{x})$ are normalized distributions, i.e., $\int p(\mathbf{x})d\mathbf{x} = \int q(\mathbf{x})d\mathbf{x} = 1$. The alpha-divergence in (8) then reduces to:

$$D\_{\alpha}[p||q] = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} d\mathbf{x}\right)\tag{9}$$

In general, we can get another equivalent expression of the alpha-divergence when we set *β* = 2*α* − 1:

$$D\_{\beta}[p||q] = \frac{4}{1-\beta^{2}} \int \frac{1+\beta}{2} p(\mathbf{x}) + \frac{1-\beta}{2} q(\mathbf{x}) - p(\mathbf{x})^{\frac{1+\beta}{2}} q(\mathbf{x})^{\frac{1-\beta}{2}} d\mathbf{x} \tag{10}$$

Alpha-divergence includes several special cases such as the KL divergence, the Hellinger divergence and *χ*<sup>2</sup> divergence (Pearson's distance), which are summarized below.

• As *α* approaches one, Equation (8) takes the indeterminate form $\frac{0}{0}$, and it specializes to the KL divergence from *q*(*x*) to *p*(*x*) when L'Hôpital's rule is applied:

$$\begin{split} \lim\_{\alpha \to 1} D\_{\alpha}[p||q] &= \lim\_{\alpha \to 1} \frac{1}{\alpha(1-\alpha)} \int \alpha p(\mathbf{x}) + (1-\alpha)q(\mathbf{x}) - p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} d\mathbf{x} \\ &= \lim\_{\alpha \to 1} \frac{1}{1-2\alpha} \int p(\mathbf{x}) - q(\mathbf{x}) - p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} \log(p(\mathbf{x})) + p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} \log(q(\mathbf{x})) d\mathbf{x} \\ &= \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})} - p(\mathbf{x}) + q(\mathbf{x}) d\mathbf{x} = KL[p||q] \end{split} \tag{11}$$

When *p*(*x*) and *q*(*x*) are normalized distributions, the KL divergence is expressed as:

$$KL[p||q] = \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})} d\mathbf{x} \tag{12}$$

• As *α* approaches zero, Equation (8) again takes the indeterminate form $\frac{0}{0}$, and it specializes to the dual form of the KL divergence, *KL*[*q*||*p*], when L'Hôpital's rule is applied:

$$\begin{split} \lim\_{\alpha \to 0} D\_{\alpha}[p||q] &= \lim\_{\alpha \to 0} \frac{1}{\alpha(1-\alpha)} \int \alpha p(\mathbf{x}) + (1-\alpha)q(\mathbf{x}) - p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} d\mathbf{x} \\ &= \lim\_{\alpha \to 0} \frac{1}{1-2\alpha} \int p(\mathbf{x}) - q(\mathbf{x}) - p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} \log(p(\mathbf{x})) + p(\mathbf{x})^{\alpha} q(\mathbf{x})^{1-\alpha} \log(q(\mathbf{x})) d\mathbf{x} \\ &= \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} + p(\mathbf{x}) - q(\mathbf{x}) d\mathbf{x} = KL[q||p] \end{split} \tag{13}$$

When *p*(*x*) and *q*(*x*) are normalized distributions, the dual form of the KL divergence is expressed as:

$$KL[q||p] = \int q(\mathbf{x}) \log \frac{q(\mathbf{x})}{p(\mathbf{x})} d\mathbf{x} \tag{14}$$

• When $\alpha = \frac{1}{2}$, the alpha-divergence specializes to the Hellinger divergence, the only symmetric (self-dual) member of the alpha-divergence family:

$$D\_{\frac{1}{2}}[p||q] = 2\int (p(\mathbf{x}) + q(\mathbf{x}) - 2p(\mathbf{x})^{\frac{1}{2}}q(\mathbf{x})^{\frac{1}{2}})d\mathbf{x} = 2\int (\sqrt{p(\mathbf{x})} - \sqrt{q(\mathbf{x})})^2 d\mathbf{x} = 4Hel^2[p||q] \tag{15}$$

where $Hel[p||q] = \sqrt{\frac{1}{2}\int(\sqrt{p(\mathbf{x})} - \sqrt{q(\mathbf{x})})^2 d\mathbf{x}}$ is the Hellinger distance, i.e., $\frac{1}{\sqrt{2}}$ times the Euclidean distance between the square roots of the two random distributions; it satisfies the fundamental properties of a distance measure and is a valid distance metric.

• When *α* = 2, the alpha-divergence reduces to the *χ*²-divergence:

$$\begin{split} D\_2[p||q] &= \frac{-1}{2} (\int 2p(\mathbf{x}) - q(\mathbf{x}) - \frac{p(\mathbf{x})^2}{q(\mathbf{x})} d\mathbf{x}) \\ &= \frac{1}{2} (\int \frac{p(\mathbf{x})^2 + q(\mathbf{x})^2 - 2p(\mathbf{x})q(\mathbf{x})}{q(\mathbf{x})} d\mathbf{x}) = \frac{1}{2} \int \frac{(p(\mathbf{x}) - q(\mathbf{x}))^2}{q(\mathbf{x})} d\mathbf{x} \end{split} \tag{16}$$

In the later experiment, we will adjust the value of *α* to optimize the distribution similarity measurement.
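The special cases above can be checked numerically. The sketch below evaluates Equation (8) on a grid for two normalized 1-D Gaussians (illustrative parameters) and compares it against the KL divergence (12), its dual form (14), the Hellinger form (15), and the *χ*² form (16).

```python
import numpy as np

# Numerical check of the special cases of the alpha-divergence (8) for two
# normalized 1-D Gaussians p and q with illustrative parameters.
x = np.linspace(-20.0, 20.0, 200001)
dx = x[1] - x[0]

def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2.0*var)) / np.sqrt(2.0*np.pi*var)

p = gauss(x, 0.0, 1.0)
q = gauss(x, 1.0, 2.0)

def D_alpha(p, q, a):
    # Equation (8); for normalized p and q this also matches Equation (9).
    return (a*p + (1-a)*q - p**a * q**(1-a)).sum()*dx / (a*(1-a))

kl_pq = (p*np.log(p/q)).sum()*dx               # Equation (12)
kl_qp = (q*np.log(q/p)).sum()*dx               # Equation (14)
hel2  = 0.5*((np.sqrt(p) - np.sqrt(q))**2).sum()*dx
chi2  = 0.5*((p - q)**2/q).sum()*dx            # Equation (16)

assert abs(D_alpha(p, q, 1 - 1e-6) - kl_pq) < 1e-3   # alpha -> 1
assert abs(D_alpha(p, q, 1e-6) - kl_qp) < 1e-3       # alpha -> 0
assert abs(D_alpha(p, q, 0.5) - 4*hel2) < 1e-6       # alpha = 1/2, Equation (15)
assert abs(D_alpha(p, q, 2.0) - chi2) < 1e-6         # alpha = 2, Equation (16)
```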

### **4. Non-Linear Filtering Based on the Alpha-Divergence**

We first define an *α*-mixed probability density function, which will be used in the non-linear filtering based on alpha-divergence minimization. Then, we show that a sufficient condition for the alpha-divergence minimization is that *α* ≥ 1 and the expected value of the natural statistical vector of *q*(*x*) is equal to the expected value of the natural statistical vector of the *α*-mixed probability density function. Finally, we apply this sufficient condition to the non-linear measurement update steps to solve the non-linear filtering problem.

### *4.1. The α-Mixed Probability Density Function*

We first give a definition of a normalized probability density function called the *α*-mixed probability density function, which is expressed as *pα*(*x*).

**Definition 2.** *We define an α-mixed probability density function:*

$$p\_a(\mathbf{x}) = \frac{p(\mathbf{x})^a q(\mathbf{x})^{(1-a)}}{\int p(\mathbf{x})^a q(\mathbf{x})^{(1-a)} d\mathbf{x}} \tag{17}$$

We can prove that when both $p(x)$ and $q(x)$ are univariate normal distributions, $p\_{\alpha}(x)$ is still a Gaussian probability density function.

Suppose that $p(x) \sim N(\mu\_p, \sigma\_p^2)$ and $q(x) \sim N(\mu\_q, \sigma\_q^2)$, so the probability density functions can be expressed as follows:

$$p(\mathbf{x}) = \frac{1}{\sqrt{2\pi}\sigma\_p} \exp\left\{-\frac{(\mathbf{x}-\boldsymbol{\mu}\_p)^2}{2\sigma\_p^2}\right\} \quad \text{and} \quad q(\mathbf{x}) = \frac{1}{\sqrt{2\pi}\sigma\_q} \exp\left\{-\frac{(\mathbf{x}-\boldsymbol{\mu}\_q)^2}{2\sigma\_q^2}\right\} \tag{18}$$

Then we can combine these two functions with parameter *α*:

$$\begin{split} p(\mathbf{x})^a q(\mathbf{x})^{(1-a)} &= (2\pi \sigma\_p^2)^{-\frac{a}{2}} (2\pi \sigma\_q^2)^{-\frac{1-a}{2}} \exp\left\{ -\frac{a(\mathbf{x}-\boldsymbol{\mu}\_p)^2 \sigma\_q^2 + (1-a)(\mathbf{x}-\boldsymbol{\mu}\_q)^2 \sigma\_p^2}{2\sigma\_p^2 \sigma\_q^2} \right\} \\ &= \frac{S\_a}{\sqrt{2\pi} \sigma\_a} \exp\left\{ -\frac{(\mathbf{x}-\boldsymbol{\mu}\_a)^2}{2\sigma\_a^2} \right\} \end{split} \tag{19}$$

where $\mu\_{\alpha} = \frac{\alpha\mu\_p\sigma\_q^2 + (1-\alpha)\mu\_q\sigma\_p^2}{\alpha\sigma\_q^2 + (1-\alpha)\sigma\_p^2}$ is the mean of the *α*-mixed probability density function; $\sigma\_{\alpha}^2 = \frac{\sigma\_q^2\sigma\_p^2}{\alpha\sigma\_q^2 + (1-\alpha)\sigma\_p^2}$ (which can be rewritten as $\frac{1}{\sigma\_{\alpha}^2} = \alpha\frac{1}{\sigma\_p^2} + (1-\alpha)\frac{1}{\sigma\_q^2}$) is its variance; and $S\_{\alpha}$ is a scale factor, with the following expression:

$$\begin{split} S\_{a} &= (2\pi\sigma\_{a}^{2})^{\frac{1}{2}} (2\pi\sigma\_{p}^{2})^{-\frac{a}{2}} (2\pi\sigma\_{q}^{2})^{-\frac{1-a}{2}} \exp\left\{ -\frac{a(1-a)(\mu\_{p}-\mu\_{q})^{2}}{2[a\sigma\_{q}^{2}+(1-a)\sigma\_{p}^{2}]} \right\} \\ &= (2\pi\sigma\_{a}^{2})^{\frac{a+1-a}{2}} (2\pi\sigma\_{p}^{2})^{-\frac{a}{2}} (2\pi\sigma\_{q}^{2})^{-\frac{1-a}{2}} \exp\left\{ -\frac{a(1-a)(\mu\_{p}-\mu\_{q})^{2}}{2[a\sigma\_{q}^{2}+(1-a)\sigma\_{p}^{2}]} \right\} \\ &= (\frac{\sigma\_{q}^{2}}{a\sigma\_{q}^{2}+(1-a)\sigma\_{p}^{2}})^{\frac{a}{2}} (\frac{\sigma\_{p}^{2}}{a\sigma\_{q}^{2}+(1-a)\sigma\_{p}^{2}})^{\frac{1-a}{2}} \exp\left\{ -\frac{a(1-a)(\mu\_{p}-\mu\_{q})^{2}}{2[a\sigma\_{q}^{2}+(1-a)\sigma\_{p}^{2}]} \right\} \end{split} \tag{20}$$

Therefore, $p\_{\alpha}(x)$ is a normalized probability density function, satisfying the normalization condition $\int p\_{\alpha}(x)dx = 1$. In particular, the normalized weighted geometric mean of two Gaussian densities is still a Gaussian density, which brings great convenience to the representation of the probability distributions in the filtering problem below.

At the same time, the variance of $p\_{\alpha}(x)$ is $\sigma\_{\alpha}^2$, which must be greater than zero. From its denominator we know that when $\sigma\_q^2 \ge \sigma\_p^2$, *α* can take any value on the real axis; when $\sigma\_q^2 < \sigma\_p^2$, the scope of *α* is $\alpha < \frac{\sigma\_p^2}{\sigma\_p^2 - \sigma\_q^2}$. It is then easy to see that the closer $\sigma\_p^2$ is to $\sigma\_q^2$, the greater the range of admissible values of *α*.

In addition, the influence of the means and variances of the two distributions on the mean and variance of the *α*-mixed probability density function can be analyzed, to facilitate the solution of the algorithm later. As for the variance: when $\sigma\_q^2 > \sigma\_p^2$, $\sigma\_{\alpha}^2$ decreases as *α* increases; when $\sigma\_q^2 = \sigma\_p^2$, it can be concluded that $\sigma\_{\alpha}^2 = \sigma\_q^2 = \sigma\_p^2$; when $\sigma\_q^2 < \sigma\_p^2$, $\sigma\_{\alpha}^2$ increases as *α* increases. As for the mean: when $\sigma\_q^2 = \sigma\_p^2$, $\mu\_{\alpha} = (\mu\_p - \mu\_q)\alpha + \mu\_q$; if $\sigma\_q^2 \neq \sigma\_p^2$, $\mu\_{\alpha} = \frac{\mu\_p\sigma\_q^2 - \mu\_q\sigma\_p^2}{\sigma\_q^2 - \sigma\_p^2} + \frac{(\mu\_q - \mu\_p)\sigma\_q^2\sigma\_p^2}{(\sigma\_q^2 - \sigma\_p^2)^2\alpha + (\sigma\_q^2 - \sigma\_p^2)\sigma\_p^2}$. In either case, if $\mu\_p > \mu\_q$, then $\mu\_{\alpha}$ increases as *α* increases; if $\mu\_p < \mu\_q$, then $\mu\_{\alpha}$ decreases as *α* increases. A summary of these properties is shown in Table 1.

**Table 1.** The monotonicity of the mean $\mu\_{\alpha}$ and the variance $\sigma\_{\alpha}^2$ of the *α*-mixed probability density function.


The monotonicity of the mean $\mu\_{\alpha}$ and the variance $\sigma\_{\alpha}^2$ with respect to *α* is shown in Figure 2.

It is clear that when $\mu\_p < \mu\_q$ and $\sigma\_q^2 > \sigma\_p^2$, both $\mu\_{\alpha}$ and $\sigma\_{\alpha}^2$ decrease as *α* increases; when $\mu\_p < \mu\_q$ and $\sigma\_q^2 < \sigma\_p^2$, $\mu\_{\alpha}$ decreases and $\sigma\_{\alpha}^2$ increases; when $\mu\_p > \mu\_q$ and $\sigma\_q^2 > \sigma\_p^2$, $\mu\_{\alpha}$ increases and $\sigma\_{\alpha}^2$ decreases; and when $\mu\_p > \mu\_q$ and $\sigma\_q^2 < \sigma\_p^2$, both $\mu\_{\alpha}$ and $\sigma\_{\alpha}^2$ increase as *α* increases.

When *α* ∈ (0, 1), the *α*-mixed probability density function interpolates between $p(x)$ and $q(x)$, so its mean and variance both lie between those of $p(x)$ and $q(x)$, as shown in Figure 2, and its curve also lies between them.

The above analysis will be used in the algorithm implementation of the sufficient condition in the non-linear filtering algorithm.

**Figure 2.** The monotonicity of the mean $\mu\_{\alpha}$ and the variance $\sigma\_{\alpha}^2$ with respect to *α*.
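The closed forms for $\mu\_{\alpha}$ and $\sigma\_{\alpha}^2$ above can be verified numerically: build $p\_{\alpha}(x)$ from Equation (17) on a grid and compare its empirical mean and variance against the formulas. The parameter values below are illustrative.

```python
import numpy as np

# Numerical verification of Equations (17)-(19): for Gaussian p and q, the
# alpha-mixed density is again Gaussian with the stated mu_alpha, sigma_alpha^2.
mu_p, var_p = 0.0, 1.0
mu_q, var_q = 2.0, 3.0
a = 1.5                                  # any alpha with a*var_q + (1-a)*var_p > 0

den   = a*var_q + (1-a)*var_p
mu_a  = (a*mu_p*var_q + (1-a)*mu_q*var_p) / den    # closed-form mean
var_a = var_p*var_q / den                           # closed-form variance

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]
def gauss(x, mu, var):
    return np.exp(-(x - mu)**2 / (2.0*var)) / np.sqrt(2.0*np.pi*var)

pa = gauss(x, mu_p, var_p)**a * gauss(x, mu_q, var_q)**(1-a)
pa /= pa.sum()*dx                        # normalization as in Equation (17)

mean_num = (x*pa).sum()*dx
var_num  = ((x - mean_num)**2 * pa).sum()*dx
assert abs(mean_num - mu_a) < 1e-5
assert abs(var_num - var_a) < 1e-5
```

Note that $a = 1.5$ lies outside (0, 1), so $p\_{\alpha}$ here is an extrapolation rather than an interpolation, yet it remains a valid Gaussian because the denominator condition holds.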

### *4.2. The Alpha-Divergence Minimization*

In alpha-divergence minimization, either the posterior distribution itself or the maximization of the posterior distribution is complex to compute, so an approximate distribution $q(x)$ with good representational ability is often used to approximate the true posterior distribution $p(x)$; the closer the approximation, the better. Here, we restrict the approximate distribution $q(x)$ to be an exponential family distribution, denoted $p\_e(x)$, with good properties, defined as follows:

$$p\_e(\mathbf{x}) = h(\mathbf{x}) \exp\left\{\phi^T(\theta)u(\mathbf{x}) + g(\phi(\theta))\right\} \tag{21}$$

Here, *θ* is the parameter set of the probability density function; $h(\mathbf{x})$ and $g(\phi(\theta))$ are known functions; $\phi(\theta)$ is the vector of natural parameters; $u(\mathbf{x})$ is the natural statistical vector. $u(\mathbf{x})$ contains enough information to completely express the state variable $\mathbf{x}$ in the exponential family distribution, and $\phi(\theta)$ is the coefficient vector that combines $u(\mathbf{x})$ based on the parameter set *θ*.
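As a concrete instance of Equation (21), a 1-D Gaussian $N(\mu, \sigma^2)$ is an exponential family distribution with $h(x) = 1/\sqrt{2\pi}$, natural statistical vector $u(x) = (x, x^2)$, natural parameters $\phi = (\mu/\sigma^2, -1/(2\sigma^2))$, and $g(\phi) = -\mu^2/(2\sigma^2) - \frac{1}{2}\log\sigma^2$. The sketch below checks this parameterization pointwise against the usual Gaussian density:

```python
import numpy as np

# A 1-D Gaussian written in the exponential family form (21):
# p_e(x) = h(x) * exp(phi^T u(x) + g(phi)), with the mapping stated above.
mu, var = 1.5, 0.8
phi = np.array([mu/var, -1.0/(2.0*var)])   # natural parameter vector phi(theta)
g = -mu**2/(2.0*var) - 0.5*np.log(var)     # normalizer g(phi(theta))

def p_e(x):
    u = np.array([x, x**2])                # natural statistical vector u(x)
    return np.exp(phi @ u + g) / np.sqrt(2.0*np.pi)   # h(x) = 1/sqrt(2*pi)

def gauss(x):
    return np.exp(-(x - mu)**2/(2.0*var)) / np.sqrt(2.0*np.pi*var)

for t in (-2.0, 0.0, 1.5, 3.0):
    assert abs(p_e(t) - gauss(t)) < 1e-12
```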

In non-linear filtering, let the exponential family distribution be $p\_e(\mathbf{x})$ and let $p(\mathbf{x})$ be an arbitrary density; we use $p\_e(\mathbf{x})$ to approximate $p(\mathbf{x})$, measuring the degree of approximation by the alpha-divergence. The alpha-divergence of $p(\mathbf{x})$ relative to $p\_e(\mathbf{x})$ is therefore obtained, defined as:

$$\begin{split} J = D\_{\alpha}[p||p\_e] &= \frac{1}{\alpha(1-\alpha)}\left[1 - \int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} d\mathbf{x}\right] \\ &= \frac{1}{\alpha(1-\alpha)} \left\{ 1 - \int p(\mathbf{x})^{\alpha} \left[h(\mathbf{x}) \exp\left(\phi^{T}(\theta) u(\mathbf{x}) + g(\phi(\theta))\right)\right]^{1-\alpha} d\mathbf{x}\right\} \end{split} \tag{22}$$

We state and prove in Theorem 1 that the alpha-divergence between the exponential family distribution and the probability density function of an arbitrary state variable is minimized when *α* ≥ 1 and the expected value of the natural statistical vector in the exponential family distribution is equal to the expected value of the natural statistical vector in the *α*-mixed probability density function. In Corollary 1, with *α* = 1, the equivalent condition is obtained in the case of *KL*[*p*||*q*]. In Corollary 2, we obtain the specialization of the exponential family distribution to the Gaussian probability density function.

**Theorem 1.** *The alpha-divergence between the exponential family distribution and the known state probability density function takes its minimum value if α* ≥ 1 *and the expected value of the natural statistical vector in the exponential family distribution is equal to the expected value of the natural statistical vector in the α-mixed probability density function, that is:*

$$E\_{p\_e(\mathbf{x})}\{u(\mathbf{x})\} = E\_{p\_{\alpha}(\mathbf{x})}\{u(\mathbf{x})\} \tag{23}$$

**Proof of Theorem 1.** Sufficient conditions for the minimization of J are that the first and second derivatives satisfy the following:

$$\frac{\partial J}{\partial \phi(\theta)} = 0 \quad \text{and} \quad \frac{\partial^2 J}{\partial \phi(\theta)^2} > 0 \tag{24}$$

First, we differentiate Equation (22) with respect to $\phi(\theta)$; the outcome is:

$$\begin{split} \frac{\partial J}{\partial \phi(\theta)} &= \frac{-1}{\alpha(1-\alpha)} \int p(\mathbf{x})^{\alpha} (1-\alpha) p\_e(\mathbf{x})^{-\alpha} p\_e(\mathbf{x}) \left\{ u(\mathbf{x}) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} \right\} d\mathbf{x} \\ &= \frac{-1}{\alpha} \int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} \left\{ u(\mathbf{x}) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} \right\} d\mathbf{x} \\ &= -\frac{1}{\alpha} \int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} u(\mathbf{x}) d\mathbf{x} - \frac{1}{\alpha} \int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} d\mathbf{x} \end{split} \tag{25}$$

Let the above equation be equal to zero, then:

$$\frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} = -\int \frac{p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha}}{\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} d\mathbf{x}} u(\mathbf{x}) d\mathbf{x} = -\int p\_{\alpha}(\mathbf{x}) u(\mathbf{x}) d\mathbf{x} \tag{26}$$

In addition, since *pe*(*x*) is a probability density function, it satisfies the normalization condition:

$$\int p\_e(\mathbf{x})d\mathbf{x} = \int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x}) + g(\phi(\theta))\right\}d\mathbf{x} = 1\tag{27}$$

Differentiating the above equation with respect to $\phi(\theta)$, the outcome is:

$$\frac{\partial}{\partial \phi(\theta)} \int p\_e(\mathbf{x}) d\mathbf{x} = \int p\_e(\mathbf{x}) u(\mathbf{x}) d\mathbf{x} + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} = 0 \tag{28}$$

Equation (23) can then be obtained from Equations (26) and (28); this is the condition for the existence of a stationary point of J.

To ensure that Equation (24) minimizes Equation (22), which means that the stationary point is also a minimum point, we also need to prove that the second derivative satisfies its condition. Differentiating Equation (25) with respect to $\phi(\theta)$, the outcome is:

$$\begin{split} \frac{\partial^2 J}{\partial \phi(\theta)^2} &= \frac{\partial}{\partial \phi(\theta)}\left[-\frac{1}{\alpha}\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} u(\mathbf{x}) d\mathbf{x} - \frac{1}{\alpha}\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} d\mathbf{x}\right] \\ &= -\frac{1-\alpha}{\alpha}\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha}\left[u(\mathbf{x}) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)}\right]^2 d\mathbf{x} - \frac{1}{\alpha}\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha}\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2} d\mathbf{x} \\ &= -\frac{1}{\alpha}\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2}\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha} d\mathbf{x} + \frac{\alpha-1}{\alpha}\int p(\mathbf{x})^{\alpha} p\_e(\mathbf{x})^{1-\alpha}\left[u(\mathbf{x}) + \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)}\right]^2 d\mathbf{x} \end{split} \tag{29}$$

For the first term, it is easy to prove that $\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2} < 0$, as follows. From Equation (21) and the normalization condition (27), it can be known that:

$$g(\phi(\theta)) = -\log \int h(\mathbf{x}) \exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} d\mathbf{x} \tag{30}$$

The gradient of Equation (30) with respect to the natural parameter vector is as follows:

$$\begin{split} \frac{\partial g(\phi(\theta))}{\partial \phi(\theta)} &= -\int \frac{h(\mathbf{x}) \exp\left\{\phi^{T}(\theta)u(\mathbf{x})\right\}}{\int h(\mathbf{x}) \exp\left\{\phi^{T}(\theta)u(\mathbf{x})\right\} d\mathbf{x}}\, u(\mathbf{x}) d\mathbf{x} \\ &= -\int h(\mathbf{x}) \exp\left\{\phi^{T}(\theta)u(\mathbf{x}) + g(\phi(\theta))\right\} u(\mathbf{x}) d\mathbf{x} = -\int p\_e(\mathbf{x}) u(\mathbf{x}) d\mathbf{x} \end{split} \tag{31}$$

Then, consider the matrix formed by its second derivative with respect to the natural parameter vector:

$$\begin{split} \frac{\partial^2 g(\phi(\theta))}{\partial \phi\_i(\theta)\partial \phi\_j(\theta)} &= -\frac{\partial}{\partial \phi\_j(\theta)} \int \frac{h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\}}{\int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} d\mathbf{x}}\, u\_i(\mathbf{x}) d\mathbf{x} \\ &= -\frac{\int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} u\_i(\mathbf{x}) u\_j(\mathbf{x}) d\mathbf{x}}{\int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} d\mathbf{x}} + \frac{\int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} u\_i(\mathbf{x}) d\mathbf{x} \int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} u\_j(\mathbf{x}) d\mathbf{x}}{\left(\int h(\mathbf{x})\exp\left\{\phi^T(\theta)u(\mathbf{x})\right\} d\mathbf{x}\right)^2} \\ &= -\left(\int p\_e(\mathbf{x}) u\_i(\mathbf{x}) u\_j(\mathbf{x}) d\mathbf{x} - \int p\_e(\mathbf{x}) u\_i(\mathbf{x}) d\mathbf{x} \int p\_e(\mathbf{x}) u\_j(\mathbf{x}) d\mathbf{x}\right) \end{split} \tag{32}$$

According to the definition of the covariance matrix, the content in the bracket is the covariance matrix of the sufficient statistic $u(\mathbf{x})$ with respect to the exponential family probability density function $p_e(\mathbf{x})$; for an arbitrary probability density $p_e(\mathbf{x})$, this covariance matrix is positive definite, so $\frac{\partial^2 g(\phi(\theta))}{\partial \phi(\theta)^2} < 0$. Hence, when $\alpha > 0$, the first item is greater than zero.

The integral in the second item is a second moment, so for $\alpha \geq 1$ or $\alpha < 0$, the second item is greater than zero.

To sum up, when $\alpha \geq 1$, $\frac{\partial^2 J}{\partial \phi(\theta)^2} > 0$.

**Corollary 1.** *(See Theorem 1 of [3] for more details.) When α* = 1*, pα*(*x*) = *p*(*x*)*, and Dα*[*p*||*q*] *reduces to KL*[*p*||*q*]*. The above theorem then holds for KL*[*p*||*q*]*, and the approximate distribution is obtained by minimizing the KL divergence; this also proves that the stationary point at which the first derivative of the KL divergence equals zero satisfies the condition that the second derivative is greater than zero. The corresponding expectation propagation update is as follows:*

$$E\_{q(\mathbf{x})}\left\{\boldsymbol{u}(\mathbf{x})\right\} = E\_{p(\mathbf{x})}\left\{\boldsymbol{u}(\mathbf{x})\right\} \tag{33}$$

**Corollary 2.** *(See Corollary 1.1 of [3] for more details.) When the exponential family distribution is specialized to the Gaussian probability density function, its sufficient statistic is u*(*x*) = (*x*, *x*<sup>2</sup>)*. We use the mean and variance of the Gaussian probability density function, and the expectation in the corresponding propagation algorithm can be calculated by moment matching, so the first and second moments are defined as follows:*

$$\mathbf{m} = E_{p(\mathbf{x})} \left\{ \mathbf{x} \right\} \quad \text{and} \quad \mathbf{M} = E_{p(\mathbf{x})} \left\{ \mathbf{x} \mathbf{x}^T \right\} \tag{34}$$

*The corresponding second central moment is defined as follows:*

$$P = M - mm^T = E\_{p(\mathbf{x})} \left\{ (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T \right\} \tag{35}$$
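The moment matching of Equations (34) and (35) can be illustrated with a short sample-based sketch (our own illustration, not code from the paper; the function name and the empirical estimator are assumptions):

```python
import numpy as np

def moment_match(samples):
    """Estimate the first moment m, the second moment M, and the
    second central moment P = M - m m^T (Eqs. (34)-(35)) from samples."""
    x = np.asarray(samples, dtype=float)               # shape (N, d)
    m = x.mean(axis=0)                                 # m = E{x}
    M = (x[:, :, None] * x[:, None, :]).mean(axis=0)   # M = E{x x^T}
    P = M - np.outer(m, m)                             # P = E{(x-m)(x-m)^T}
    return m, M, P
```

With exact expectations replaced by sample averages, `P` is the (biased) empirical covariance matrix.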

The complexity of Theorem 1 lies in the fact that both sides of Equation (23) depend on the probability distribution *q*(*x*) at the same time. A *q*(*x*) that satisfies the condition can be obtained by repeatedly and iteratively updating *q*(*x*). The specific process is shown in Algorithm 1:

### **Algorithm 1** Approximation of the true probability distribution *p*(*x*).

**Input:** Target distribution parameter of *p*(*x*); damping factor *ε* ∈ (0, 1); divergence parameter *α* ∈ [1, +∞); initialization value of *q*(*x*)

**Output:** The exponential family probability function *q*(*x*)


$$q(\mathbf{x}) = \frac{q'(\mathbf{x})^{\varepsilon}\, q(\mathbf{x})^{1-\varepsilon}}{\int q'(\mathbf{x})^{\varepsilon}\, q(\mathbf{x})^{1-\varepsilon}\, d\mathbf{x}} \tag{36}$$


6: **end while**

In the above algorithm, we need to pay attention to the following two problems: giving an initial value of *q*(*x*) and selecting the damping factor. As for the first problem, according to the analysis of the *α*-mixed probability density function in Section 4.1, when $\sigma_q^2 < \sigma_p^2$, the value range of *α* is $\alpha < \frac{\sigma_p^2}{\sigma_p^2 - \sigma_q^2}$. Although the value of *α* is greater than one, the value range of *α* is limited when $\sigma_p^2$ is unknown in the initial state; when $\sigma_q^2 \geq \sigma_p^2$, *α* can take any value on the whole real number axis, so we can choose a relatively large initial value, making $\sigma_q^2 \geq \sigma_p^2$ and $\mu_q > \mu_p$. When the value of *α* is greater than one, the mean of the *α*-mixed probability density function will decrease, and the variance will also decrease, as shown in the upper left of Figure 2.

As for the second problem, when *α* ∈ (0, 1), the *α*-mixed probability density function is an interpolation of *p*(*x*) and *q*(*x*), according to the analysis in Section 4.1. A value range of (0, 1) for the damping factor *ε* is quite reasonable, because the two probability density functions are interpolated when *ε* lies in (0, 1), and the new probability density function lies between the two. According to Equation (36), the smaller *ε* is, the closer the new *q*(*x*) is to the old *q*(*x*); the larger *ε* is, the closer the new *q*(*x*) is to *q*′(*x*). The mean and the variance of *q*′(*x*) are smaller than those of the real *p*(*x*), according to the analysis of the first problem. Then, we continue to combine the new *q*(*x*) with *p*(*x*) to form an *α*-mixed probability density function. Similarly, the mean and the variance of the new *q*(*x*) are larger than those of *p*(*x*), so the value of *ε* we choose should be as close as possible to one.

After the above two problems are addressed, the convergence of the algorithm can be guaranteed, and we can obtain a *q*(*x*) that meets the conditions. It can be known from Theorem 1 that the approximation *q*(*x*) of *p*(*x*) converges to this minimum point after repeated iterative updates.
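For exponential families, the geometric mixture in Equation (36) has a convenient closed form: it corresponds to a convex combination of natural parameters. A minimal sketch for the one-dimensional Gaussian case (our own illustration under that standard identity; all function names are hypothetical):

```python
import numpy as np

def gauss_to_natural(mean, var):
    # Natural parameters of N(mean, var): (mean/var, -1/(2 var))
    return np.array([mean / var, -0.5 / var])

def natural_to_gauss(eta):
    # Invert the map above back to (mean, var)
    var = -0.5 / eta[1]
    return eta[0] * var, var

def damped_update(q_old, q_new, eps):
    """Damped update q'(x)^eps * q(x)^(1-eps) of Eq. (36); for an
    exponential family this mixes the natural parameters linearly."""
    eta = eps * gauss_to_natural(*q_new) + (1 - eps) * gauss_to_natural(*q_old)
    return natural_to_gauss(eta)
```

For example, damping `q_old = N(0, 1)` toward `q_new = N(0, 2)` with `eps = 0.5` yields a Gaussian whose precision is the average of the two precisions.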

### *4.3. Non-Linear Filtering Algorithm Based on the Alpha-Divergence*

In the process of non-linear filtering, assuming that the prior and posterior probability density functions satisfy the Assumed Density Filter (ADF), we define the prior parameter as $\theta_k^- = (m_k^-, P_k^-)$, with the corresponding prior distribution $q(\mathbf{x}_k; \theta_k^-)$, and the posterior parameter as $\theta_k^+ = (m_k^+, P_k^+)$, with the corresponding posterior distribution $q(\mathbf{x}_k; \theta_k^+)$.

The prediction of the state can be expressed as follows:

$$p(\mathbf{x}_k | z_{1:k-1}) = \int p(\mathbf{x}_k | \mathbf{x}_{k-1}, z_{1:k-1})\, q(\mathbf{x}_{k-1}; \boldsymbol{\theta}_{k-1}^{+})\, d\mathbf{x}_{k-1} \tag{37a}$$

$$\boldsymbol{\theta}_{k}^{-} = \arg\min_{\theta} D_{\alpha} \left[ p(\mathbf{x}_{k} | z_{1:k-1}) \,\big\|\, q(\mathbf{x}_{k}; \boldsymbol{\theta}) \right] \tag{37b}$$

The corresponding first moment about the origin, $f(\mathbf{x}_{k-1}) = \int \mathbf{x}_k\, p(\mathbf{x}_k|\mathbf{x}_{k-1}, z_{1:k-1})\, d\mathbf{x}_k$, of $p(\mathbf{x}_k|z_{1:k-1})$ can be obtained from Equation (37a).

By Corollary 2, when the alpha-divergence is simplified to the KL divergence, the corresponding mean value and variance are:

$$m_k^- = \int f(\mathbf{x}_{k-1})\, q(\mathbf{x}_{k-1}; \boldsymbol{\theta}_{k-1}^+)\, d\mathbf{x}_{k-1} \tag{38a}$$

$$P_k^- = \int f(\mathbf{x}_{k-1}) f(\mathbf{x}_{k-1})^T N(\mathbf{x}_{k-1}|\boldsymbol{m}_{k-1}^+, \boldsymbol{P}_{k-1}^+)\, d\mathbf{x}_{k-1} - \boldsymbol{m}_k^- \boldsymbol{m}_k^{-T} + Q_k \tag{38b}$$

Here, the prior distribution $q(\mathbf{x}_k; \theta_k^-)$ can be obtained.
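For a scalar state, the prediction moments of Equation (38) can be approximated by Monte Carlo integration, as in the following sketch (our own illustration, not the paper's implementation; `f` plays the role of the transition mean $f(\mathbf{x}_{k-1})$):

```python
import numpy as np

def predict(m_prev, P_prev, f, Q, n_samples=10000, rng=None):
    """Monte Carlo sketch of Eq. (38), scalar case: propagate the
    posterior N(m_prev, P_prev) through the transition mean f,
    moment-match, and add the process-noise variance Q."""
    rng = np.random.default_rng(rng)
    x = rng.normal(m_prev, np.sqrt(P_prev), size=n_samples)
    fx = f(x)
    m_pred = fx.mean()                              # Eq. (38a)
    P_pred = (fx ** 2).mean() - m_pred ** 2 + Q     # Eq. (38b)
    return m_pred, P_pred
```

For a linear transition the result matches the exact Kalman prediction up to Monte Carlo error.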

Similarly, the update steps of the filter can be expressed as follows:

$$p(\mathbf{x}_k|z_{1:k}) = \frac{p(z_k|\mathbf{x}_{k}, z_{1:k-1})\, q(\mathbf{x}_k; \theta_k^{-})}{\int p(z_k|\mathbf{x}_{k}, z_{1:k-1})\, q(\mathbf{x}_k; \theta_k^{-})\, d\mathbf{x}_k} \tag{39a}$$

$$\theta_k^+ = \arg\min_{\theta} D_{\alpha}\left[p(\mathbf{x}_k|z_{1:k})\,\big\|\,q(\mathbf{x}_k;\theta)\right] \tag{39b}$$

It is clear according to Theorem 1:

$$\begin{split} E_{q(\mathbf{x}_{k};\boldsymbol{\theta}_{k}^{+})}\{u(\mathbf{x})\} &= E_{p_{\alpha}(\mathbf{x})}\left\{u(\mathbf{x})\right\} = \int p_{\alpha}(\mathbf{x})u(\mathbf{x})d\mathbf{x} = \int u(\mathbf{x}) \frac{p(\mathbf{x}_{k}|z_{1:k})^{\alpha}\, q(\mathbf{x}_{k};\boldsymbol{\theta}_{k}^{+})^{1-\alpha}}{\pi(\mathbf{x})} \pi(\mathbf{x})\, d\mathbf{x} \\ &\approx \sum_{i=1}^{N} u(\mathbf{x}^{i}) \frac{[p(z_{k}|\mathbf{x}_{k}^{i},z_{1:k-1})\, q(\mathbf{x}_{k}^{i};\boldsymbol{\theta}_{k}^{-})]^{\alpha}\, q(\mathbf{x}_{k}^{i};\boldsymbol{\theta}_{k}^{+})^{1-\alpha} / \pi(\mathbf{x}^{i})}{\sum_{j} [p(z_{k}|\mathbf{x}_{k}^{j},z_{1:k-1})\, q(\mathbf{x}_{k}^{j};\boldsymbol{\theta}_{k}^{-})]^{\alpha}\, q(\mathbf{x}_{k}^{j};\boldsymbol{\theta}_{k}^{+})^{1-\alpha} / \pi(\mathbf{x}^{j})} \end{split} \tag{40}$$

Here, $\mathbf{x}^i \overset{iid}{\sim} \pi_t(\mathbf{x}_t)$, $i = 1, \cdots, N$, where $\pi_t$ is the proposal distribution. We choose the proposal distribution as the prior distribution $q(\mathbf{x}_k; \theta_k^-)$. We define $w^i = [p(z_k|\mathbf{x}_k^i, z_{1:k-1})\, q(\mathbf{x}_k^i; \theta_k^-)]^{\alpha}\, q(\mathbf{x}_k^i; \theta_k^+)^{1-\alpha} / \pi(\mathbf{x}^i)$ and $W = \sum_j w^j$, so:

$$E_{q(\mathbf{x};\theta_k^+)}\left\{\boldsymbol{u}(\mathbf{x})\right\} \approx \frac{1}{W} \sum_{i=1}^N w^i \boldsymbol{u}(\mathbf{x}^i) \tag{41}$$

An approximate calculation of the mean value and the variance for $q(\mathbf{x}_k; \theta_k^+)$ is conducted:

$$m\_k^+ = \frac{1}{W} \sum\_{i=1}^N w^i x^i \tag{42a}$$

$$P_k^+ = \frac{1}{W} \sum_{i=1}^N w^i (\mathbf{x}^i - m_k^+)(\mathbf{x}^i - m_k^+)^T \tag{42b}$$

Since Equation (40) contains $q(\mathbf{x}_k; \theta_k^+)$ on both sides, we must use Algorithm 1 to conduct the iterative calculation to obtain the desired posterior distribution $q(\mathbf{x}_k; \theta_k^+)$.

If *α* = 1, the above steps can be reduced to a simpler filtering algorithm, as shown in [3].

In this process, we do not evaluate the integral in the denominator of Equation (39a); instead, we use the Monte Carlo integration strategy proposed in [15], as shown in Equation (40). We also do not need to conduct resampling, which greatly reduces the computation.
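The self-normalized weighting of Equations (41) and (42) can be sketched as follows for a scalar state (our own illustration; the log-domain normalization is a standard numerical-stability device, not something specified in the paper):

```python
import numpy as np

def weighted_moments(x, log_w):
    """Normalize importance weights w^i / W (Eq. (41)) and compute the
    weighted mean and variance of Eq. (42) for a scalar state."""
    w = np.exp(log_w - np.max(log_w))   # subtract max before exp for stability
    w /= w.sum()                        # w^i / W
    m = np.sum(w * x)                   # Eq. (42a)
    P = np.sum(w * (x - m) ** 2)        # Eq. (42b)
    return m, P
```

With uniform weights, this reduces to the ordinary sample mean and (biased) sample variance.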

### **5. Simulations and Analysis**

According to Theorem 1, when *α* ≥ 1, the non-linear filtering method we propose is theoretically feasible. In the simulation experiments, the algorithm is validated for different values of *α* ≥ 1. We name our proposed method AKF and compare it with traditional non-linear filtering methods such as the EKF and UKF.

We choose the Univariate Nonstationary Growth Model (UNGM) [22] to analyze the performance of the proposed method. The system state equation is:

$$\mathbf{x}(k) = 0.5\mathbf{x}(k-1) + \frac{2.5\mathbf{x}(k-1)}{1 + \mathbf{x}^2(k-1)} + 8\cos(1.2(k-1)) + w(k) \tag{43}$$

The observation equation is:

$$y(k) = \frac{x^2(k)}{20} + v(k)\tag{44}$$

The equation of state is a non-linear equation including a fractional relation, a square relation, and a trigonometric function relation. *w*(*k*) is the process noise with a mean of zero and a variance of Q. The relationship between the observed signal *y*(*k*) and the state *x*(*k*) in the measurement equation is also non-linear. *v*(*k*) is the observation noise with a mean of zero and a variance of R. Therefore, this system is a typical system with non-linear states and observations, and this model has become the basic model for verifying non-linear filtering algorithms [22,23].

In the experiment, we set Q = 10, R = 1 and set the initial state as *p*(*x*(1)) = *N*(*x*(1); 0, 1).
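For reference, the UNGM of Equations (43) and (44) can be simulated with a short script (our own sketch; the function name and defaults are hypothetical):

```python
import numpy as np

def simulate_ungm(T=50, Q=10.0, R=1.0, x0=0.0, rng=None):
    """Simulate the UNGM state (Eq. (43)) and observation (Eq. (44))."""
    rng = np.random.default_rng(rng)
    x = np.empty(T)
    y = np.empty(T)
    x_prev = x0
    for k in range(1, T + 1):
        # State transition with process noise w(k) ~ N(0, Q)
        x_k = (0.5 * x_prev + 2.5 * x_prev / (1 + x_prev ** 2)
               + 8 * np.cos(1.2 * (k - 1)) + rng.normal(0, np.sqrt(Q)))
        # Observation with measurement noise v(k) ~ N(0, R)
        y[k - 1] = x_k ** 2 / 20 + rng.normal(0, np.sqrt(R))
        x[k - 1] = x_k
        x_prev = x_k
    return x, y
```

In the noise-free case with x(0) = 0, the first state is 8 cos(0) = 8, which is a quick sanity check of the implementation.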

First, we simulate the system. When *α* ≥ 1, any value of *α* is suitable for the experiments; here, we take *α* = 2, and the entire simulation time is T = 50. The result of the state estimation is shown in Figure 3, and it can be seen that the non-linear filtering method we propose is feasible; the state value can be estimated well during the whole process, and its performance is superior to the EKF and UKF in some cases.

**Figure 3.** State estimation comparison of different non-linear filtering methods.

Second, in order to measure the accuracy of the state estimation, we calculate the absolute value of the difference between the real state value and the estimated state value at each moment; thus, the absolute deviation of the state estimation at each moment is obtained, namely:

$$RMS(k) = |x\_{real}(k) - x\_{estimated}(k)|\tag{45}$$

As shown in Figure 4, the error of our proposed algorithm remains relatively small even where the absolute deviation of the other methods is relatively large. It can be seen that our proposed method performs better than the other non-linear methods.

**Figure 4.** RMS comparison at different times.

In order to measure the overall level of error, we performed multiple simulation experiments. The average error of each experiment is defined as:

$$RMSE = \frac{1}{T} \sum_{k=1}^{T} RMS(k) \tag{46}$$

The experimental results are shown in Table 2. When the estimates over the T time steps are averaged, the mean error of AKF is the smallest in each experiment, which indicates the effectiveness of the algorithm; the filtering accuracy of the algorithm is better than that of the other two methods under the same conditions. Because the UNGM is strongly non-linear and we set the state noise variance to 10, which is quite large, the performance differences between the EKF, UKF, and AKF are rather small.


**Table 2.** Average errors of experiments.

Then, we analyze the influence of the initial value on the filtering results by modifying the value of the process noise. As can be seen from Table 3, the performance of AKF becomes more and more similar to that of the EKF/UKF as Q becomes smaller.


**Table 3.** Influence of the variance Q of state equation noise on experimental error.

In the end, we analyze the performance of the whole non-linear filtering algorithm by adjusting the value of *α* over 20 experiments. In order to reduce the influence of the initial value on the experimental results, we take Q = 0.1 and then average the errors of the 20 experiments. The result is shown in Figure 5. We can see that the error grows as *α* grows in this example, as the noise is relatively small.

**Figure 5.** The error changes as *α* changes.

### **6. Conclusions**

We have first defined the *α*-mixed probability density function and analyzed the monotonicity of the mean and the variance under different *α* values. Secondly, the sufficient conditions for *α* to find the minimum value have been proven, which provides more methods for measuring the distribution similarity in non-linear filtering. Finally, a non-linear filtering algorithm based on alpha-divergence minimization has been proposed by applying the above two points to non-linear filtering. Moreover, we have verified the validity of the algorithm on the one-dimensional UNGM.

Although the filtering algorithm is effective, the alpha-divergence is a direct extension of the KL divergence. In a further study, we can try to verify whether minimizing the alpha-divergence is physically equivalent to minimizing the KL divergence. The algorithm should be applied to more practical applications to prove its effectiveness. Meanwhile, more sophisticated particle filtering techniques, such as [24,25], could make the algorithm more efficient. Furthermore, the alpha-divergence method described above applies to uni-modal approximations, but more attention should be paid to multi-modal distributions, which are more difficult and more common in practical systems. Finally, it is worth designing a strategy to automatically learn appropriate *α* values.

**Author Contributions:** Y.L. and C.G. conceived of and designed the method and performed the experiment. Y.L. and S.Y. participated in the experiment and analyzed the data. Y.L. and C.G. wrote the paper. S.Y. revised the paper. J.Z. guided and supervised the overall process.

**Funding:** This research was supported by a grant from the National Key Research and Development Program of China (2016YFB0501801).

**Acknowledgments:** The authors thank Dingyou Ma of Tsinghua University for his help.

**Conflicts of Interest:** The authors declare no conflict of interest.

### **References**


© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **A Weak Selection Stochastic Gradient Matching Pursuit Algorithm**

### **Liquan Zhao <sup>1,\*</sup>, Yunfeng Hu <sup>1</sup> and Yanfei Jia <sup>2</sup>**


Received: 20 April 2019; Accepted: 15 May 2019; Published: 21 May 2019

**Abstract:** In the existing stochastic gradient matching pursuit algorithm, the preliminary atomic set includes atoms that do not fully match the original signal, which weakens the reconstruction capability and increases the computational complexity. To solve these two problems, a new method is proposed. Firstly, a weak selection threshold method is proposed to select the atoms that best match the original signal. If an absolute gradient coefficient is greater than the product of the maximum absolute gradient coefficient and a threshold set according to the experiments, then the atom corresponding to that absolute gradient coefficient is selected as a preliminary atom. Secondly, if the scale of the current candidate atomic set equals that of the previous support atomic set, the loop is exited; otherwise, the loop continues. Finally, before the transition estimation of the original signal is calculated, we determine whether the number of columns of the candidate atomic set is smaller than the number of rows of the measurement matrix. If this condition is satisfied, the current candidate atomic set is regarded as the support atomic set and the loop continues; otherwise, the loop is exited. The simulation results show that the proposed method has better reconstruction performance than the stochastic gradient algorithms when the original signal is a one-dimensional sparse signal, a two-dimensional image signal, or a low-rank matrix signal.

**Keywords:** compressed sensing; low rank matrix; stochastic gradient; weak selection method; reliability verification strategy; reconstruction performance

### **1. Introduction**

Compressed sensing (CS) [1–4] has been receiving considerable attention. The main premise of CS theory is that the reconstruction of a high-dimensional sparse (or compressive) original signal from a low-dimensional linear measurement vector under the measurement matrix should satisfy the restricted isometry property (RIP) [5]. At present, CS is divided into the following three core aspects: Sparse representation of the signal, nonrelated linear measurements, and signal reconstruction. The sparse representation of the signal is used as the design basis for the over-complete dictionary [6,7] with the capability of sparse representation, such as the discrete cosine transform (DCT), wavelet transform (WT), and Fourier transform (FT). These transforms are used for the sparse representation of the signal and achieve a good effect. Nonrelated linear measurement is used to design the measurement matrix [8] that satisfies the RIP condition. The commonly used measurement matrices include the Gaussian random matrix, the Bernoulli random matrix, and the partial Hadamard matrix. In this study, we focused mainly on the signal reconstruction.

Signal reconstruction methods can be divided into two categories: Those based on the minimized *l*1-norm problem, and the greedy pursuit algorithm based on the minimized *l*0-norm problem. Those in the first category include methods such as the basis pursuit (BP) [9] algorithm and its optimization algorithm, the gradient projection for sparse reconstruction algorithm (GPSR) [10], the iterative threshold (IT) [11], the interior point method [12], and the Bergman iteration (BT) [13] method. These algorithms are generally used to solve the convex optimization problems. The convex optimization algorithms have a better reconstruction performance and theoretical performance guarantees; however, they are sensitive to noise and usually suffer from heavy computational complexity when processing large signal reconstruction problems. The second category includes methods such as the matching pursuit (MP) [14], orthogonal matching pursuit (OMP) [15], regularized OMP (ROMP) [16], and stage-wise OMP (StOMP) [17]. These algorithms offer much faster running times than the convex optimization methods, but they lack comparable strong reconstruction guarantees. Greedy pursuit algorithms, such as subspace pursuit (SP) [18], compressive sampling matching pursuit (CoSaMP) [19,20], and iterative hard threshold (IHT) [21] algorithms, have faster running times and essentially the same reconstruction guarantees, but these algorithms are only suitable for one-dimensional (1D) signals in compressed sensing.

Several algorithms that are deemed suitable for 1D and multidimensional signals have been proposed. Ding et al. [22] and Rantzer et al. [23] proposed the forward selection method for sparse signal and low rank matrix reconstruction problems. The algorithm iteratively selects each nonzero element or each rank-one matrix. Wassell et al. [24] proposed a more general sparse basis based on previous studies [22,23]. Liu et al. [25] proposed the forward-backward method, where the atoms can be completely added or removed from the set. Bresler et al. [26] extended this algorithm beyond the quadratic loss studied in Liu et al. [25]. Soltani et al. [27] proposed an improved CoSaMP algorithm for a more general form of objective function. Bahmani et al. [28] used the gradient matching pursuit (GradMP) algorithm to solve the reconstruction problem of large-scale class signals with sparsity constraints based on the CoSaMP algorithm. However, for large-scale class signal reconstruction problems, the GradMP algorithm needs to compute the full gradient of the objective function, which greatly increases the computation cost of the algorithm. Therefore, Needell et al. [29] proposed a stochastic version of the GradMP algorithm that was called the StoGradMP algorithm. Compared with the GradMP algorithm, the StoGradMP algorithm randomly selects an index and computes its associated gradient at each iteration. This operation is extremely effective for large-scale signal recovery problems.

Although the StoGradMP algorithm effectively reduces the computational cost of the algorithm, its reconstruction capability still needs improvement. In the StoGradMP algorithm, the atomic selection method of the fixed number (namely, selecting 2*K* atoms to complete the expansion of the preliminary atomic set at each round of iterations) leads to a preliminary atomic set of the existing atoms that cannot be fully matched with the original signal. When these atoms are added to the candidate atomic set, the accuracy of the least square solution and the inaccuracy of the support atomic set are affected, which then weakens the reconstruction capability of the signal and increases the computational complexity of the StoGradMP algorithm. Therefore, in this study, we created a weak selection threshold method to select the atoms that best match the original signal, thereby completing the expansion of the preliminary atomic set with a more flexible atom selection. This method improves the reconstruction performance of the algorithm. The combination of the two reliability guarantee methods ensures the correctness and effectiveness of the proposed algorithm, identifies the support atomic set, and calculates the transition estimation of the original signal. Finally, we established different original signal environments to verify the reconstruction performance of the proposed method.

The layout of this paper is as follows. Section 2 introduces the CS theory for signal reconstruction and low-rank matrix reconstruction. The StoGradMP algorithm is described in Section 3. The proposed method, with the weak selection threshold method and the reliability verification strategy of the stochastic gradient algorithm, are outlined in Section 4. The simulation results and the discussion are provided in Section 5, and the conclusion is drawn in Section 6.

### **2. Compressed Sensing Theory**

CS theory supposes that *x* is an *n*-length signal. It is said to be a *K*-sparse signal (or compressive) if *x* can be well approximated using *K* coefficients under some nonrelated linear measurements. According to CS theory, such a signal can be acquired by the following linear random projection:

$$u = \Phi \mathbf{x} + \varepsilon \tag{1}$$

where $\Phi \in R^{m \times n}$, $u \in R^{m \times 1}$ ($m \ll n$), and $\varepsilon \in R^{m \times 1}$ are the measurement matrix [30], the observation vector, and the noise signal, respectively; *u* contains nearly all the information of the sparse signal *x*. According to Equation (1), the dimensionality of *u* is much lower than the dimensionality of *x*. This problem is underdetermined, which shows that Equation (1) has an infinite number of solutions, so it is difficult to reconstruct the sparse signal vector *x* from *u*. However, according to the literature [5,31], a sufficient condition for exact signal reconstruction is that the sensing matrix Φ should satisfy the RIP condition. The RIP condition is described in Definition 1.

**Definition 1.** *For each integer K* = 1, 2, ..., *define the restricted isometry constant* $\delta_K$ *of the sensing matrix* Φ *as the smallest number such that Equation (2) holds for all K-sparse signal vectors* $\mathbf{x} \in R^{n \times 1}$ *with* $\|\mathbf{x}\|_0 = K$*.*

$$\left(1-\delta_K\right)\|\mathbf{x}\|_{2}^{2}\leq\|\Phi\mathbf{x}\|_{2}^{2}\leq\left(1+\delta_K\right)\|\mathbf{x}\|_{2}^{2}\tag{2}$$

Assuming that the original signal *x* is sparse in compressed sensing, then *x* can be reconstructed by solving the following optimization problem:

$$\min\_{\mathbf{x}\in\mathbb{R}^{n\times 1}} \frac{1}{2m} \|\boldsymbol{u} - \Phi\boldsymbol{x}\|\_{2}^{2} \text{ subject to } \|\boldsymbol{x}\|\_{0} \le K \tag{3}$$

where *m* is the number of measurements and $\|\cdot\|_2^2$ denotes the squared 2-norm of the residual vector. Here, *K* controls the sparsity of the solutions to Equation (3).
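A toy instance of the measurement model in Equation (1) with a *K*-sparse signal can be generated as follows (our own sketch; the function name and the $1/\sqrt{m}$ scaling of the Gaussian matrix are common conventions, not specified in the paper):

```python
import numpy as np

def make_cs_problem(n=256, m=64, K=8, rng=None):
    """Build a toy compressed-sensing instance for Eq. (1):
    a K-sparse signal x, a Gaussian measurement matrix Phi,
    and noise-free observations u = Phi x."""
    rng = np.random.default_rng(rng)
    x = np.zeros(n)
    support = rng.choice(n, size=K, replace=False)  # random sparse support
    x[support] = rng.normal(size=K)                 # nonzero coefficients
    Phi = rng.normal(size=(m, n)) / np.sqrt(m)      # RIP-friendly scaling
    u = Phi @ x
    return Phi, x, u
```

Here m ≪ n, so recovering `x` from `u` is the underdetermined problem that Equation (3) addresses via the sparsity constraint.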

The low-rank matrix reconstruction problem can be similarly formulated. We obtain the observation vector *ui*, which can be described as

$$u_i = \langle \Phi_i, X \rangle + \varepsilon_i \tag{4}$$
where *i* = 1, 2, ..., *m*; the size of each measurement matrix $\Phi_i$ is $n_1 \times n_2$; the unknown signal matrix $X \in R^{n_1 \times n_2}$ is assumed to be a low rank matrix; and $\varepsilon_i$ is the measurement noise. According to Equation (4), the matrix *X* can be reconstructed by solving the following optimization:

$$\min_{X \in R^{n_1 \times n_2}} \frac{1}{2m} \|u - \Phi X\|_2^2 \text{ subject to } \mathrm{rank}(X) \le R \tag{5}$$

where *m* is the number of the measurements; *u*, Φ, and *X* are the observation signal, measurement matrix, and low-rank matrix signal, respectively; and *R* controls the rank level of the solution to Equation (5).

To analyze Equations (3) and (5), we first define a more general notion of sparsity. Given the sparse basis $\Psi = \{\psi_1, \psi_2, \ldots, \psi_n\}$, which consists of the vectors $\psi_i$, the signal can be represented as

$$\mathbf{x} = \sum_{i=1}^{n} \alpha_i \psi_i = \Psi \alpha \tag{6}$$

where $\alpha_i = \langle \mathbf{x}, \psi_i \rangle = \psi_i^T \mathbf{x}$ is the projection coefficient of the original sparse signal *x*. The signal *x* is sparse with respect to the sparse basis Ψ if the number of nonzero entries is much lower than the length of the signal *x*; that is, $K \ll n$. The sparse basis Ψ can be explained respectively:


This notion is sufficiently general to address several important sparse models, such as the group sparsity and low ranks [28,32]. Therefore, we can describe Equations (3) and (5) using Equations (7) and (8), respectively:

$$\min_{\mathbf{x}} \underbrace{\frac{1}{M}\sum_{i=1}^{M}f_{i}(\mathbf{x})}_{F(\mathbf{x})} \text{ subject to } \|\mathbf{x}\|_{0,\Psi}\leq K \tag{7}$$

$$\min_{X} \underbrace{\frac{1}{M} \sum_{i=1}^{M} f_{i}(X)}_{F(X)} \text{ subject to } \mathrm{rank}(X)_{\Psi} \le R \tag{8}$$

where $f_i(\mathbf{x})$ is a smooth function, which can be non-convex; $\|\mathbf{x}\|_{0,\Psi}$ controls the sparsity level of the signal; $f_i(X)$ is also a smooth function with respect to the low rank matrix, which can be non-convex; and $\mathrm{rank}(X)_{\Psi}$ determines the rank level of the low rank matrix *X*. In particular, $\|\mathbf{x}\|_{0,\Psi}$ is the smallest number of atoms in Ψ such that the original signal *x* can be described by

$$\|\mathbf{x}\|_{0,\Psi} = \min \left\{ K : \mathbf{x} = \sum_{i \in E} \alpha_i \psi_{i},\ |E| = K \right\} \tag{9}$$

where |*E*| denotes the cardinality of the index set *E*, i.e., the number of nonzero entries of the original signal *x*.

According to Equations (7) and (8), the reconstruction problem of the sparse signal and the low rank matrix need to be separately explained.

For the sparse signal reconstruction, the sparse basis Ψ consists of *n* basic vectors, each of size *n*, in Euclidean space. This problem can be regarded as a special case of Equation (7), where $f_i(\mathbf{x}) = (u_i - \langle \phi_i, \mathbf{x} \rangle)^2$ and *M* = *m*. In this case, we need to decompose the observation signal *u* into non-overlapping vectors $u_{b_i}$ of size *b*. The matrix $\Phi_{b_i}$ is a sub-matrix of size $b \times n$, which consists of partial row vectors of the measurement matrix Φ.

According to Equations (3) and (7), the smooth function is $F(\mathbf{x}) = \frac{1}{2m} \|u - \Phi \mathbf{x}\|_2^2$. Therefore, the smooth function $F(\mathbf{x})$ can be written as

$$F(\mathbf{x}) = \frac{1}{M} \sum\_{i=1}^{M} \frac{1}{2b} \|\boldsymbol{u}\_{b\_i} - \boldsymbol{\Phi}\_{b\_i}\mathbf{x}\|\_2^2 = \frac{1}{M} \sum\_{i=1}^{M} f\_i(\mathbf{x}) \tag{10}$$

where *M* = *m*/*b*, the number of sub-matrices, is an integer. Consequently, each sub-function $f_i(\mathbf{x})$ can be treated as $f_i(\mathbf{x}) = \frac{1}{2b} \|u_{b_i} - \Phi_{b_i}\mathbf{x}\|_2^2$. In this case, each sub-function $f_i(\mathbf{x})$ accounts for a collection of *b* observations, rather than only one observation. Thus, randomly splitting the smooth function $F(\mathbf{x})$ into multiple sub-functions $f_i(\mathbf{x})$ and blocking the measurement matrix Φ into multiple sub-matrices $\Phi_{b_i}$ benefits the computation of the stochastic gradient in stochastic gradient methods.
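The block decomposition of Equation (10) and the gradient of a single sub-function $f_i(\mathbf{x})$ can be sketched as follows (our own illustration; the paper does not give this code, and the function name is hypothetical):

```python
import numpy as np

def stochastic_grad(Phi, u, x, i, b):
    """Gradient of one block sub-function of Eq. (10):
    f_i(x) = (1/2b) ||u_{b_i} - Phi_{b_i} x||_2^2,
    where block i covers b consecutive rows of Phi and entries of u."""
    rows = slice(i * b, (i + 1) * b)
    Phi_b, u_b = Phi[rows], u[rows]
    # grad f_i(x) = (1/b) Phi_{b_i}^T (Phi_{b_i} x - u_{b_i})
    return Phi_b.T @ (Phi_b @ x - u_b) / b
```

A stochastic gradient method then draws a random block index `i` at each iteration instead of computing the full gradient of F(x).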

For the low-rank matrix reconstruction problem, according to the explanation provided in (2) of this section, we know that the sparse basis Ψ consists of infinitely many unit-norm rank-one matrices. According to Equations (5) and (8), the smooth function can be represented as $f_i(X) = (u_i - \langle \Phi_i, X \rangle)^2$. Therefore, the smooth function $F(X)$ can be written as

$$\begin{split} F(X) &= \frac{1}{M} \sum_{i=1}^{M} f_i(X) = \frac{1}{M} \sum_{i=1}^{M} \left( \frac{1}{2b} \sum_{j=(i-1) b + 1}^{ib} \left( u_j - \left\langle \Phi_{j}, X \right\rangle \right)^2 \right) \\ &\triangleq \frac{1}{M} \sum_{i=1}^{M} \frac{1}{2b} \|u_{b_i} - \Phi_{b_i}(X)\|_2^2 \end{split} \tag{11}$$

where *M* = *m*/*b* is the number of block matrix in the sensing matrix Φ and *M* is an integer. Similarly, each function *fi*(*X*) accounts for a collection of observations *ubi* of size *b*, rather than only one observation.

### **3. StoGradMP Algorithm**

The CoSaMP [19] algorithm has become popular for reconstructing sparse or compressible signals from their linear non-adaptive measurements. According to the relevant literature, the CoSaMP algorithm is fast for small-scale, low-dimensional signals, but for large-scale, high-dimensional signals, its reconstruction accuracy and robustness are poor. To address these shortcomings, Bahmani et al. [28] generalized the idea of the CoSaMP algorithm and proposed the gradient matching pursuit (GradMP) algorithm to solve the reconstruction problem for large-scale signals with sparsity constraints. However, for large-scale signals, the GradMP algorithm needs to compute the full gradient of the objective function *F*(*x*), which greatly increases its computational cost. Therefore, Needell et al. proposed a stochastic version of the GradMP algorithm called StoGradMP [29], which does not need to compute the full gradient of *F*(*x*). Instead, at each iteration, an index *i* ∈ [*M*] is randomly selected and the gradient of the associated sub-function *fi*(*x*) is computed. This operation is effective for large-scale signal recovery problems, where full gradient computation is often prohibitively expensive. To better analyze the StoGradMP algorithm, its block diagram is shown in Figure 1.

**Figure 1.** Block diagram of the StoGradMP algorithm.

The StoGradMP algorithm is described in Algorithm 1, where the steps at each iteration are shown below. Because the reconstruction processes for the sparse signal and the low-rank matrix are almost identical, for simplicity we denote both original signals by *w* in the subsequent explanations.

**Randomize process**: Randomly determine an index *ik* with probability *p*(*ik*), where *k* is the loop index and *ik* ∈ [*M*]. Then, form its associated block matrix Φ<sub>*bik*</sub> and sub-function.

**Signal proxy**: Compute the gradient *gk* of the smooth function, where *gk* is an *n* × 1 vector. For the low-rank matrix, *gk* is an *n* × *n* matrix.

**Identify**: In compressed sensing, the absolute values of the gradient are sorted in descending order and the 2*K* largest are selected. Then, the atomic indices of the block sensing matrix corresponding to these coefficients are found, forming a preliminary atomic set *Tk* at the *k*-th iteration. In low-rank matrix reconstruction, the best rank-2*R* approximation to *gk* is obtained by keeping the top 2*R* singular values in the singular value decomposition (SVD).

**Merge:** Establish the candidate atomic set Γ*<sup>k</sup>* at the *k*-th iteration, which consists of the preliminary atomic set *Tk* at the current iteration and the support atomic set Λ*<sup>k</sup>* at the previous iteration.

**Estimate:** Calculate the transition signal *ak* at the current iteration, which is obtained using a sub-optimization method. This is a least squares problem for both the compressed sensing and low-rank matrix reconstruction problems. In compressed sensing, *ak* is an *n* × 1 vector, whereas in matrix recovery, *ak* is an *n* × *n* matrix.

**Prune:** Sorting the absolute values of the transition signal *ak* in descending order, the *K* largest components are selected, and the atomic indices of the candidate atomic set corresponding to these components are obtained. The support atomic set Λ*<sup>k</sup>* is thus constructed at the current iteration; it is a subset of the candidate atomic set, Λ*<sup>k</sup>* ⊆ Γ*k*. Similarly, in matrix reconstruction, the best rank-*R* approximation to *ak* is obtained by retaining the top *R* singular values in the SVD.

**Update:** Update the current approximate estimation of the original signal, *wk* = *ak*<sub>Λ</sub>, where Λ = Λ*k*. The positions of the nonzero entries of the estimate *wk* are determined by the indices of the support atomic set. *wk* is the final estimation at the *k*-th iteration, and *w* represents the original signal, which may be either the sparse signal or the low-rank matrix.

**Check:** When the *l*<sub>2</sub>-norm of the current residual of the estimation signal *wk* is smaller than the tolerance error *tol_alg*, or the loop index *k* is greater than the maximum number of iterations (maxIter), the algorithm halts and outputs the final approximation *w*ˆ = *wk*. Otherwise, the algorithm continues iterating until the halting condition is met.

The entire procedure is as shown in Algorithm 1.

#### **Algorithm 1.** StoGradMP Algorithm

**Input:** *K*, *u*, Φ, *p*(*i*), *b*, *tol_alg*, maxIter

**Output:** an approximation estimation signal *w*ˆ = *wk*

**Initialize:** *w*ˆ = 0, *k* = 0, Λ*<sup>k</sup>* = 0, *Tk* = 0, Γ*<sup>k</sup>* = 0, *M*

**Repeat:**

1. *k* = *k* + 1 (loop index)
2. Select *ik* with probability *p*(*ik*) (randomize)
3. *gk* = ∇*fik*(*wk*) (form signal proxy)
4. *Tk* = supp<sub>2*K*</sub>(*gk*) (identify 2*K* components)
5. Γ*<sup>k</sup>* = *Tk* ∪ Λ*k*−<sup>1</sup> (merge to form the candidate set)
6. *ak* = Φ<sup>+</sup><sub>Γ*k*</sub>*u* (transition estimation using the least squares method)
7. Λ = supp<sub>*K*</sub>(|*ak*|) (prune to obtain the support atomic set)
8. *wk* = *ak*<sub>Λ</sub> (final signal estimation)
9. *r* = *u* − Φ*wk* (update the current residual)

**Until** the halting iteration condition is true
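A minimal Python sketch of Algorithm 1 for the sparse case follows; the function name `sto_grad_mp`, the uniform block probabilities, and the fixed random seed are our illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sto_grad_mp(u, Phi, K, b, tol=1e-7, max_iter=500):
    """Sketch of StoGradMP (Algorithm 1) for K-sparse recovery with
    block size b and uniform block probabilities p(i) = 1/M."""
    m, n = Phi.shape
    M = m // b
    w = np.zeros(n)
    support = np.array([], dtype=int)
    rng = np.random.default_rng(0)                 # fixed seed for reproducibility
    for _ in range(max_iter * M):
        i = rng.integers(M)                        # randomize: pick block i_k
        rows = slice(i * b, (i + 1) * b)
        g = -2.0 * Phi[rows].T @ (u[rows] - Phi[rows] @ w)  # signal proxy
        T = np.argsort(np.abs(g))[-2 * K:]         # identify the 2K largest
        Gamma = np.union1d(T, support)             # merge with previous support
        a = np.zeros(n)
        a[Gamma] = np.linalg.lstsq(Phi[:, Gamma], u, rcond=None)[0]  # estimate
        support = np.argsort(np.abs(a))[-K:]       # prune to the K largest
        w = np.zeros(n)
        w[support] = a[support]                    # update the estimate
        if np.linalg.norm(u - Phi @ w) <= tol:     # check the halting condition
            return w
    return w
```

With an orthogonal measurement matrix and a single block, the proxy equals the full gradient and the sketch recovers a sparse signal in one pass; general Gaussian matrices need more iterations.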

### **4. Proposed Algorithm**

The StoGradMP algorithm takes the sparsity of the original signal as known information and uses it to expand the preliminary atomic set in the preliminary stage of the algorithm. The StoGradMP algorithm determines the 2*K* most relevant atoms in the preliminary stage of each iteration, and these atoms form a preliminary atomic set. Here, *K* represents the numerical value of the sparsity level or rank level, a fixed number greater than zero. This fixed atomic selection can add weakly relevant and incorrect atoms to the preliminary atomic set, which reduces the accuracy and speed of the reconstruction algorithm, thereby degrading its reconstruction performance. To solve this problem, we used a weak selection threshold strategy to expand the preliminary atomic set in the preliminary stage of the algorithm.

The entire process of the proposed algorithm is described here. First, according to Equations (10) and (11) in Section 2, we select the index *ik* with probability *p*(*ik*). This step randomizes the measurement matrix Φ to obtain a stochastic block matrix Φ<sub>*bik*</sub>, which is expressed by

$$I = \text{ceil}(rand \times nb) \tag{12}$$

$$b\_{i\_k} = b \times (I - 1) + 1 : b \times I \tag{13}$$

where *nb* is the number of block matrices according to Equation (10), which is equal to *nb* = floor(*m*/*b*); here, *M* = *nb*. *b* is the number of rows of the block matrix, equal to *b* = min(*m*, *K*). When the original signal is a sparse signal, *K* represents the numerical value of the sparsity level; when the original signal is a low-rank matrix, *K* is the numerical value of the rank level. *bik* represents the randomly determined row indices of the measurement matrix, so the block matrix Φ<sub>*bik*</sub> is also randomly selected. Then, the stochastic sub-function *fi*(*w*) is computed. Here, *w* follows the notation of Section 3, representing either the sparse original signal or the low-rank matrix. According to Equations (10) and (11), the sub-function *fi*(*w*) is expressed as

$$f\_{i\_k}(\boldsymbol{w}\_{k-1}) = \frac{1}{2b} \left\| \boldsymbol{u}\_{b\_{i\_k}} - \boldsymbol{\Phi}\_{b\_{i\_k}} \boldsymbol{w}\_{k-1} \right\|\_2^2 \tag{14}$$

where *k* is the loop index, and *ubi <sup>k</sup>* and <sup>Φ</sup>*bi <sup>k</sup>* are the *<sup>i</sup>*-th block observation signal and the *<sup>i</sup>*-th block matrix at *k* iteration, respectively. From Equations (12)–(14), we know that the sub-function *fi*(*w*) is also stochastically determined, and that *fi*(*w*) belongs to *F*(*w*).

When the block matrix Φ<sub>*bik*</sub> and the stochastic sub-function *fi*(*w*) are obtained, the gradient of the sub-function *fi*(*w*) is calculated, which is expressed as

$$g\_k = \nabla f\_i(\boldsymbol{w}\_{k-1}) \tag{15}$$

where *gk* is the gradient of the sub-function *fi*(*w*) at the *k*-th iteration, *wk*−<sup>1</sup> is the final estimation of the original signal at the previous iteration, and ∇(·) denotes the gradient of the sub-function *fi*(*w*). Combining Equation (13) with Equation (14), the gradient *gk* can be expressed as

$$g\_k = -2 \times \boldsymbol{\Phi}\_{b\_{i\_k}}^T \left( \boldsymbol{u}\_{b\_{i\_k}} - \boldsymbol{\Phi}\_{b\_{i\_k}} \boldsymbol{w}\_{k-1} \right) \tag{16}$$

where (.) *<sup>T</sup>* represents the transpose operation of the matrix.
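The gradient of Equation (16) is a single matrix-vector expression; a sketch, with `block_gradient` as an illustrative name:

```python
import numpy as np

def block_gradient(Phi_b, u_b, w):
    """Stochastic gradient of the selected sub-function, Equation (16):
    g_k = -2 * Phi_b^T (u_b - Phi_b w)."""
    return -2.0 * Phi_b.T @ (u_b - Phi_b @ w)
```

At an exact solution the block residual vanishes, so the gradient is zero.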

According to Equation (16), the smaller the absolute value of a gradient coefficient, the worse the match between the corresponding atom and the original signal. In the StoGradMP algorithm, a fixed number 2*K* of the largest gradient coefficients is selected from the gradient vector *gk* to determine the atomic indices of the block matrix and form the preliminary atomic set. In some iterations, the selected coefficients may therefore include some small gradient coefficients, which reduces reconstruction performance and increases computational complexity. Therefore, to improve the reconstruction performance of the StoGradMP algorithm, we used the weak selection threshold method to expand the preliminary atomic set *Tk*. This process can be described as

$$\gamma = \max\left( |g\_k| \right) \tag{17}$$

$$T\_k = \text{supp}\_{\kappa \times \gamma} \left( \left| g\_k \right| \right) \tag{18}$$

where γ is the maximum value of the absolute gradient vector *gk* at the *k*-th iteration, κ ∈ [0.1, 1.0] is the threshold, and supp<sub>κ×γ</sub>(·) represents the preliminary atomic set that satisfies the weak selection threshold condition. The gradients corresponding to the preliminary atoms have absolute values greater than κ × γ. If the threshold is greater than 1, then κ × γ exceeds all absolute gradient values and the atomic set is empty, causing the weak selection threshold method to fail. A too-small threshold κ increases the number of erroneous atoms in the preliminary atomic set. In low-rank matrix reconstruction, the best rank approximation to *gk* is obtained by retaining the singular values greater than the weak selection threshold in the SVD, where γ is the maximum singular value. The preliminary atomic set consists of the singular vectors that satisfy the weak selection threshold condition.
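The weak selection rule of Equations (17) and (18) can be sketched as follows; we assume the boundary case |*g*| = κ × γ is kept, so the maximum-magnitude coefficient always survives (`weak_selection` is an illustrative name):

```python
import numpy as np

def weak_selection(g, kappa):
    """Preliminary atomic set T_k via the weak selection threshold,
    Equations (17) and (18): keep indices with |g_i| >= kappa * max|g|."""
    gamma = np.max(np.abs(g))                           # Equation (17)
    return np.flatnonzero(np.abs(g) >= kappa * gamma)   # Equation (18)
```

Unlike the fixed 2*K* selection, the size of the returned index set varies with the gradient profile at each iteration.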

After selecting the preliminary atomic set, we use it together with the previous support atomic set Λ*k*−<sup>1</sup> to form the current candidate atomic set, which can be expressed as:

$$
\Gamma\_k = T\_k \cup \Lambda\_{k-1} \tag{19}
$$

where Γ*k*, *Tk*, and Λ*k*−<sup>1</sup> denote the candidate atomic index set, the preliminary atomic index set, and the support atomic index set, respectively.

After the current candidate atomic set is constructed, to ensure the correctness and effectiveness of the proposed method, we add reliability guarantee method 1 to the proposed algorithm, that is, if

$$\text{supp}|\Lambda\_{k-1}| = = \text{supp}|\Gamma\_k|\tag{20}$$

is true, where supp|Γ*k*| and supp|Λ*k*−1| represent the size (or scale) of the current candidate atomic index set Γ*<sup>k</sup>* and the previous support atomic index set Λ*k*−1, respectively, then the method was unable to select new atoms from the block matrix to add to the candidate atomic set. In this case, the loop is exited and the estimated value of the original signal is output. We added a sub-condition to this judgment to prevent the proposed method from exiting the loop in the first iteration, since both the candidate atomic set and the support atomic set are empty in the first iteration. This sub-condition can be expressed as follows: if *k* == 1, then the estimated signal *w*ˆ is set to 0.

Although the weak selection threshold method improves the correlation of the preliminary atomic set and increases the flexibility of atom selection, when the threshold is too small, the number of columns of the candidate atomic matrix may exceed its number of rows. In that case, the transition estimation of the original signal cannot be obtained using the least squares method, because the least squares method requires that the number of rows of the atomic matrix be greater than the number of columns. Therefore, before solving the least squares problem, we must ensure that this condition holds, so we developed reliability guarantee method 2, that is, if

$$\text{supp}|\Gamma\_k| \le m \tag{21}$$

then,

$$
\Lambda = \Gamma\_k \tag{22}
$$

$$
\Phi\_{\Lambda} = \Phi(\text{:}, \Lambda) \tag{23}
$$

where supp|Γ*k*| represents the number of columns of the candidate atomic matrix at the *k*-th iteration. If this condition is satisfied, we regard the candidate atomic index set Γ*<sup>k</sup>* as the current support atomic index set Λ, and the atoms corresponding to Λ are used to construct the current support atomic set ΦΛ. Conversely, if the condition is not satisfied (the number of rows is smaller than the number of columns), then the matrix Φ<sub>Λ</sub><sup>*T*</sup>Φ<sub>Λ</sub> is not invertible. If this occurs, we exit the loop and let *w*ˆ = 0.

Next, we used the least squares method to solve the sub-optimization problem, which can be described as:

$$a\_k = \Phi\_\Lambda^+ u \tag{24}$$

where *ak* is the transition estimation of the original signal, *u* is the observation signal, and Φ<sub>Λ</sub><sup>+</sup> represents the pseudo-inverse of the support atomic set Φ<sub>Λ</sub>. To better analyze the role of reliability guarantee method 2, Equation (24) can be written as:

$$a\_k = \left(\Phi\_\Lambda^T \times \Phi\_\Lambda\right)^{-1} \times \Phi\_\Lambda^T \times u \tag{25}$$

where Φ<sub>Λ</sub><sup>*T*</sup> and (Φ<sub>Λ</sub><sup>*T*</sup>Φ<sub>Λ</sub>)<sup>−1</sup> represent the transpose of the matrix Φ<sub>Λ</sub> and the inverse of the matrix Φ<sub>Λ</sub><sup>*T*</sup>Φ<sub>Λ</sub>, respectively. In combination with Equations (21) and (24), we can ensure that the matrix Φ<sub>Λ</sub><sup>*T*</sup>Φ<sub>Λ</sub> is invertible.
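Reliability guarantee method 2 together with the least-squares estimate of Equations (21)–(25) can be sketched as follows; `guarded_lstsq` is an illustrative name, and returning `None` stands for exiting the loop with *w*ˆ = 0:

```python
import numpy as np

def guarded_lstsq(Phi, u, candidate):
    """Reliability guarantee method 2 plus the least-squares step,
    Equations (21)-(25): solve only when the candidate set has at most m
    atoms, so that Phi_Lambda^T Phi_Lambda is (generically) invertible."""
    m = Phi.shape[0]
    if len(candidate) > m:                           # Equation (21) fails
        return None, None                            # caller exits with w_hat = 0
    Phi_L = Phi[:, candidate]                        # Equation (23)
    a = np.linalg.lstsq(Phi_L, u, rcond=None)[0]     # Equation (24): a_k = pinv(Phi_L) u
    return candidate, a
```

Using `lstsq` rather than forming (Φ<sub>Λ</sub><sup>*T*</sup>Φ<sub>Λ</sub>)<sup>−1</sup> explicitly is numerically safer but computes the same estimate when the guarantee holds.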

Based on Equations (22)–(24), we observed that the support atomic set is obtained using reliability guarantee method 2 and the candidate atomic set. If the reliability guarantee method 2 is true, then the current candidate atomic set can be regarded as the support atomic set. This operation is used to obtain the final support for the signal estimation. Next, we updated the current residual and final estimation of the original signal, which is expressed as

$$r\_c = u - \Phi\_{\Lambda} a\_k \tag{26}$$

$$w\_k = a\_{k\Lambda} \tag{27}$$

where *wk* is the final estimation of the original signal at the *k*-th iteration, *ak*<sup>Λ</sup> is the reconstruction signal corresponding to the support atomic index set Λ, and *rc* is the current residual.

Finally, for the different original signals, we created different stop iteration conditions if

$$\|r\_c\|\_l \le \text{tol\\_alg} \;\text{ or }\; k \ge \text{maxIter} \tag{28}$$

is true, where *tol_alg* is the tolerance error of the algorithm iteration and maxIter is the maximum number of iterations. Specifically, if the original signal is a sparse signal, *l* = 2, that is, the *l*<sub>2</sub>-norm of the residual vector is used, and we set *tol_alg* and maxIter to 1 × 10<sup>−7</sup> and 500 × *M*, respectively. When the original signal is a low-rank matrix, the current residual is a matrix and the error is obtained via a Frobenius norm operation on the error matrix; here, *l* = *F*, and we set *tol_alg* and maxIter to 1 × 10<sup>−7</sup> and 300 × *M*, respectively. According to Equation (28), when the stop condition is satisfied, the algorithm stops iterating and outputs the final estimation of the original signal, *w*ˆ = *wk*. Otherwise, the iteration continues, with the current estimate carried forward for the gradient computation of the next iteration, *wk*+<sup>1</sup> = *wk*, until the stop condition is true. To better analyze the proposed algorithm, its block diagram is shown in Figure 2.

**Figure 2.** Block diagram of the proposed method.

The entire procedure is shown in Algorithm 2.

**Algorithm 2:** Proposed algorithm.

**Input:** Φ, *u*, *p*(*i*), *b*, *tol_alg*, maxIter, κ

**Output:** an approximation estimation signal *w*ˆ = *wk*

**Initialize:** *w*ˆ = 0, *k* = 0, Λ*<sup>k</sup>* = 0, *Tk* = 0, Γ*<sup>k</sup>* = 0, *M*

**Repeat:**

1. *k* = *k* + 1 (loop index)
2. Select *ik* from [*M*] with probability *p*(*ik*) (randomize)
3. *gk* = ∇*fik*(*wk*−<sup>1</sup>) (form signal proxy)
4. γ = max(|*gk*|) (determine the maximum gradient value)
5. *Tk* = supp<sub>κ×γ</sub>(|*gk*|) (weak selection method to identify the preliminary atomic set)
6. Γ*<sup>k</sup>* = *Tk* ∪ Λ*k*−<sup>1</sup> (merge to form the candidate set)
7. **Reliability guarantee method 1:** if supp|Λ*k*−1| == supp|Γ*k*|: if *k* == 1, set *w*ˆ = 0; exit the loop
8. **Reliability guarantee method 2:** if supp|Γ*k*| ≤ *m*: Λ = Γ*<sup>k</sup>*, ΦΛ = Φ(:, Λ) (identify the support atomic set); else: if *k* == 1, set *w*ˆ = 0; exit the loop
9. *ak* = Φ<sup>+</sup><sub>Λ</sub>*u* (transition estimation by the least squares method)
10. *r* = *u* − ΦΛ*ak* (update the current residual)
11. *wk* = *ak*<sub>Λ</sub> (final signal estimation)

**Until** the halting iteration condition is true
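A minimal Python sketch of Algorithm 2 for the sparse case follows; the function name, uniform block probabilities *p*(*i*) = 1/*M*, and the fixed seed are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def proposed_algorithm(u, Phi, b, kappa=0.6, tol=1e-7, max_iter=500):
    """Sketch of Algorithm 2 for sparse recovery: StoGradMP with the weak
    selection threshold and the two reliability guarantees."""
    m, n = Phi.shape
    M = m // b
    w = np.zeros(n)
    support = np.array([], dtype=int)
    rng = np.random.default_rng(0)                 # fixed seed for reproducibility
    for k in range(1, max_iter * M + 1):
        i = rng.integers(M)                        # randomize: pick block i_k
        rows = slice(i * b, (i + 1) * b)
        g = -2.0 * Phi[rows].T @ (u[rows] - Phi[rows] @ w)        # Equation (16)
        T = np.flatnonzero(np.abs(g) >= kappa * np.abs(g).max())  # Equations (17)-(18)
        Gamma = np.union1d(T, support)             # Equation (19)
        if len(Gamma) == len(support):             # reliability guarantee method 1
            return np.zeros(n) if k == 1 else w
        if len(Gamma) > m:                         # reliability guarantee method 2
            return np.zeros(n) if k == 1 else w
        a = np.zeros(n)
        a[Gamma] = np.linalg.lstsq(Phi[:, Gamma], u, rcond=None)[0]  # Equation (24)
        support = Gamma                            # Equation (22)
        w = a                                      # Equation (27)
        if np.linalg.norm(u - Phi @ w) <= tol:     # Equations (26) and (28)
            return w
    return w
```

Note that, unlike Algorithm 1, there is no prune step: the support atomic set is the whole candidate set, so its size is controlled only by the threshold κ and guarantee method 2.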

### **5. Discussion**

We analyzed the following simulation experiments: 1D sparse signal reconstruction, low-rank matrix reconstruction, and 2D image signal reconstruction. The reported reconstruction performance is an average over 200 simulation runs on a computer with a quad-core, 64-bit processor and 4 GB of memory.

### *5.1. 1D Sparse Signal Reconstruction Experiment*

In this experiment, we used a random *K*-sparse signal as the original signal. The measurement matrix was randomly generated with a Gaussian distribution. We set the range of weak selection thresholds κ to [0.2, 0.4, 0.6, 0.8]. The recovery error and iteration stop error of all the algorithms were set to 1 × 10<sup>−6</sup> and 1 × 10<sup>−7</sup>, respectively; these errors were obtained by taking the *l*<sub>2</sub>-norm of the error vector. The maximum number of iterations maxIter was set to 500 × *M*.

Figure 3 compares the reconstruction percentage of the proposed algorithm to the different thresholds. Figure 3 shows that when the threshold was 0.6, the reconstruction percentage of the proposed algorithm was the highest compared to the other thresholds under the same measurements and sparsity levels.

Figure 3 shows that the reconstruction percentage of the proposed algorithm was 100% for all of the sparsity and threshold levels when the number of measurements was greater than 160. Therefore, we set the range of measurements to [160, 250] to compare the average running time of the proposed algorithm at different weak selection thresholds, as shown in Figure 4. Figure 4 shows that the average running time of the proposed algorithm was the shortest for the different sparsity levels when the threshold was 0.8, followed by 0.6, with very small differences between the two. Based on the analysis of Figures 3 and 4, we set the default weak selection threshold to 0.6.

**Figure 3.** Reconstruction percentage of the proposed method at different weak selection thresholds (*n* = 256, *K* = 12, 16, 20, 24, κ = 0.2, 0.4, 0.6, 0.8, *m* = 20 : 5 : 160, Gaussian signal).

**Figure 4.** Average running time of the proposed method with different weak selection thresholds (*n* = 256, *K* = 12, 16, 20, 24, κ = 0.2, 0.4, 0.6, 0.8, *m* = 160 : 5 : 250, Gaussian signal).

Figure 5 compares the reconstruction percentages of the proposed algorithm and the StoGradMP algorithm. We set the sparsity level to *K* ∈ [12, 16, 20, 24] and the weak selection threshold to 0.6. Figure 5 shows that when the sparsity level was 12, the reconstruction percentages of the proposed algorithm and the StoGradMP algorithm were nearly identical for all the measurements. When *K* = 16, 20, or 24, the reconstruction percentage of the proposed algorithm was higher than that of the StoGradMP algorithm, and when the sparsity level was 24, the difference between the two algorithms was the largest. Therefore, we conclude that as the sparsity level increases, the difference between the reconstruction percentages increases further. This means that, in sparse signal reconstruction, the proposed method is more suitable than the StoGradMP algorithm in larger-sparsity environments.

**Figure 5.** Reconstruction percentages of the StoGradMP and proposed algorithms (*n* = 256, κ = 0.6, *K* = 12, 16, 20, 24, *m* = 20 : 5 : 90, Gaussian signal).

Figure 6 compares the reconstruction percentage of the proposed algorithm with those of the StoGradMP and StoIHT algorithms. In Figure 5, the measurement interval was set to five; to show more detail, we set the interval to two in Figure 6. Figure 6 shows that when 46 ≤ *m* < 50, the reconstruction percentages of the methods were 0%, meaning they could not complete the reconstruction. When 50 ≤ *m* ≤ 76, the reconstruction percentage of the proposed method ranged from 0.2% to 97.6%, while those of the StoGradMP and StoIHT algorithms were still 0%. When 72 ≤ *m* ≤ 78, the reconstruction percentage of the StoGradMP algorithm increased from 0.6% to 95%. When 78 ≤ *m* ≤ 120, the reconstruction percentages of the proposed and StoGradMP algorithms were nearly 100%, but the StoIHT algorithm was still unable to complete the reconstruction. When 120 ≤ *m* ≤ 142, the reconstruction percentage of the StoIHT algorithm increased from 0% to 100%. When *m* ≥ 142, all the reconstruction algorithms achieved full reconstruction. This demonstrates that the proposed method provides better reconstruction performance than the others.

Figure 7 compares the average running times. Figure 6 shows that the reconstruction percentage was 100% for all the reconstruction algorithms when the number of measurements was greater than 150. Therefore, in this simulation, we set the range of measurements to [150, 250]. Figure 7 shows that the proposed method has a shorter running time than the StoGradMP algorithm. Although the StoIHT algorithm had a shorter running time than the other algorithms, it required more measurements to achieve the same reconstruction percentage.

Figure 8 compares the reconstruction percentages of the proposed algorithm and the prior improved algorithm (IStoGradMP) [33]. Both algorithms reconstruct the signal in an unknown-sparsity environment. The main differences between the proposed algorithm and the IStoGradMP algorithm are as follows: (1) in the preliminary atomic stage, the proposed algorithm uses the atomic matching strategy to obtain the preliminary atomic set, whereas the IStoGradMP algorithm evaluates and adjusts the estimated sparsity of the original signal to obtain the preliminary atomic set; (2) the scales of the preliminary atomic set and the support atomic set are unfixed at each iteration in the proposed algorithm, whereas they are fixed at each iteration in the IStoGradMP algorithm; and (3) the support atomic set is determined by the candidate atomic set and the reliability guarantee method in the proposed method, whereas it is determined by pruning the candidate atomic set in the IStoGradMP algorithm. We also previously proposed an improved StoGradMP algorithm based on the soft-threshold method [34]; however, that algorithm requires the sparsity information of the original signal to be known, whereas the proposed method and the IStoGradMP algorithm [33] can reconstruct the signal without knowing the sparsity information. Based on the above comparative analysis, in this section we compare only the IStoGradMP algorithm and the proposed method in an unknown-sparsity environment.

**Figure 6.** Reconstruction percentages of the different algorithms (*n* = 256, *K* = 24, κ = 0.6, *m* = 46 : 2 : 150, Gaussian signal).

**Figure 7.** Average running times of the different algorithms (*n* = 256, *K* = 24, κ = 0.6, *m* = 150 : 5 : 250, Gaussian signal).

Figure 8 shows that for arbitrary measurement values, the reconstruction percentage of the proposed algorithm was higher than that of the IStoGradMP algorithm. However, we discovered that as the sparsity level increased, the gap in the reconstruction percentage between the proposed algorithm and the IStoGradMP algorithm gradually reduced. This means that the proposed method is more suitable than the IStoGradMP algorithm under smaller sparsity environments.

**Figure 8.** Reconstruction percentages of the proposed algorithm and IStoGradMP algorithm (*n* = 256, κ = 0.6, *K* = 12, 16, 20, 24, *s* = 1, 5, 10, 15, *m* = 30 : 5 : 150, Gaussian signal).

Figure 9 compares the average running times of the proposed method and the IStoGradMP method under different sparsity-level conditions. Figure 8 shows that when the number of measurements was greater than 150, the reconstruction percentages of the proposed method and the IStoGradMP algorithm were 100% for all the sparsity levels. Therefore, we set the range of the number of measurements to [150, 250]. Figure 9 shows that when the threshold of the proposed method was set to 0.6, its average running time was less than that of the IStoGradMP algorithm. This means that the computational complexity of the proposed algorithm was lower than that of the IStoGradMP algorithm; that is, the proposed method was faster under full reconstruction conditions.

**Figure 9.** Average running times of the proposed algorithm and IStoGradMP algorithm (*n* = 256, κ = 0.6, *K* = 12, 16, 20, 24, *s* = 1, 5, 10, 15, *m* = 150 : 5 : 250, δ*<sup>K</sup>* = 0.1, Gaussian signal).

Based on the analysis of Figures 8 and 9, we conclude that for a smaller sparsity level environment, the proposed algorithm with a proper weak selection threshold has better reconstruction performance as well as a lower computational complexity than the IStoGradMP algorithm.

### *5.2. Low-Rank Matrix Reconstruction Experiment*

In this experiment, we used a random matrix with a low-rank property as the original signal. We set the rank level of the matrix to *R* = 1, 3. The size of the low-rank matrix was 10 × 10. The measurement matrix was randomly generated with a Gaussian distribution. The recovery error and iteration halt error of all the algorithms were set to 1 × 10<sup>−6</sup> and 1 × 10<sup>−7</sup>, respectively; these errors were obtained via a Frobenius norm operation on the respective error matrix. The maximum number of iterations maxIter was set to 300 × *M*.

Figure 10 compares the reconstruction percentage of the proposed method at different weak selection thresholds. Figure 10 shows that the reconstruction percentage was highest when the threshold was 0.2, across the different rank levels of the matrix.

**Figure 10.** Reconstruction percentage of the proposed method with different weak selection thresholds (*d* = 100, *R* = 1, 3, κ = 0.2, 0.4, 0.6, 0.8, *m* = 15 : 5 : 140, random low rank matrix).

Figure 11 compares the average running times of the proposed method at different thresholds. Figure 11 shows that the smaller the weak selection threshold, the shorter the average running time, which means that a larger threshold increases the computational complexity of the proposed method. Thus, in the subsequent simulations, unless otherwise specified, the default weak selection threshold was set to 0.2.

Figure 12 compares the reconstruction percentages of the different measurements of the proposed and StoGradMP algorithms at different rank levels. We observed that the proposed method had a better reconstruction percentage than the StoGradMP algorithm at the different rank levels.

Figure 13 compares the reconstruction percentage of the proposed algorithm with those of the StoGradMP and StoIHT algorithms for different measurements. In Figure 12, we set the measurement interval to five; to show details, we set the interval to two in Figure 13. Figure 13 shows that when 20 ≤ *m* < 22, the reconstruction percentage of all the algorithms was 0%. When 22 ≤ *m* ≤ 34, the reconstruction percentages of the StoIHT and proposed algorithms began to increase from 0% to 88.2% and from 0.6% to 87%, respectively, while the StoGradMP algorithm still struggled to complete the signal reconstruction. When 32 ≤ *m* ≤ 58, the reconstruction percentage of the proposed algorithm ranged approximately from 87% to 100%, that of the StoIHT algorithm increased from 88.2% to 99.8%, and that of the StoGradMP algorithm increased from 0.2% to 95.4%. In this measurement range, the reconstruction percentage of the proposed algorithm was higher than those of the other algorithms. When *m* ≥ 58, almost all the reconstruction algorithms achieved a high-probability reconstruction. Therefore, we conclude that at a lower rank level of the matrix, the reconstruction percentages of the proposed method and the StoIHT method were almost the same, and both were higher than that of the StoGradMP algorithm.

**Figure 11.** Average running times of the proposed method at different weak selection thresholds (*d* = 100, *R* = 1, 3, κ = 0.2, 0.4, 0.6, 0.8, *m* = 200 : 5 : 300, random low-rank matrix).

**Figure 12.** Reconstruction percentages of the StoGradMP and proposed algorithms (*d* = 100, *R* = 1, 3, κ = 0.2, *m* = 15 : 5 : 120, random low-rank matrix).

**Figure 13.** Reconstruction percentages of the different methods (*d* = 100, *R* = 1, κ = 0.2, *m* = 20 : 2 : 70, randomly low-rank matrix).

Figure 14 compares the average running times of the different algorithms. Figure 13 shows that when the number of measurements was more than 80, the reconstruction percentage of all the algorithms was 100%. Therefore, to better analyze the computational complexity of the different algorithms, we set the range of measurements to [80, 200] in the simulation. Figure 14 shows that the proposed algorithm had the shortest running time, followed by the StoIHT and StoGradMP algorithms. We observed that as the number of measurements increased, the running time of the StoIHT algorithm also increased, whereas the average running times of the proposed and StoGradMP algorithms tended to decrease and then remain stable.

**Figure 14.** Average running times of the different algorithms (*d* = 100, *R* = 1, κ = 0.2, *m* = 80 : 5 : 200, randomly low-rank matrix).

Based on the above analysis, we conclude that the proposed algorithm with a weak selection threshold achieves better reconstruction performance than the StoGradMP algorithm, as well as lower computational complexity than the other algorithms.

### *5.3. 2D Image Signal Reconstruction Experiment*

In this subsection, we used six 256 × 256 test images as the original images and treated each image as a 2D signal. The test images were Baboon, Boat, Cameraman, Fruits, Lena (human portrait), and Peppers. The sparse basis was a 256 × 256 wavelet basis with sparse representation capability. The measurement matrix was randomly generated with a Gaussian distribution, and its size was 153 × 256. The sparsity was set to 51. The iteration halt error of the algorithm was set to 1 × 10<sup>−7</sup>, the maximum number of iterations maxIter was set to 30, and the weak selection threshold was set to κ = 0.6.
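As a rough illustration of this setup, the sketch below builds a 153 × 256 Gaussian measurement matrix and measures a 256 × 256 image column by column. The random image stands in for the test images, and the 1/√*m* column scaling is a common compressed-sensing convention, not a detail specified in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 256  # image side length
m = 153  # number of measurements per column

# Gaussian measurement matrix (entries scaled by 1/sqrt(m), a common CS convention)
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, N))

# stand-in for one of the 256 x 256 test images (random, hypothetical data)
image = rng.random((N, N))

# measure the image column by column: Y = Phi X
Y = Phi @ image
print(Y.shape)  # (153, 256)
```

Each column of `Y` is the compressed measurement of the corresponding image column; the reconstruction algorithms then recover the image from `Y`, `Phi`, and the wavelet basis.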

We used the peak signal to noise ratio (PSNR) as an indicator to evaluate the reconstruction quality, which could be expressed as:

$$MSE = \frac{1}{M \times N} \sum\_{i=0}^{M-1} \sum\_{j=0}^{N-1} \left| \hat{x}(i,j) - x(i,j) \right|^2 \tag{29}$$

$$PSNR = 10 \times \log\_{10} \left( \frac{MAX\_{\hat{x}}^2}{MSE} \right) = 20 \times \log\_{10} \left( \frac{MAX\_{\hat{x}}}{\sqrt{MSE}} \right) \tag{30}$$

where *M* = *N* = 256; *x*ˆ(*i*, *j*) and *x*(*i*, *j*) represent the reconstructed value and the original value at the corresponding position, respectively; *MSE* is the mean square error; and *MAX*<sub>*x*ˆ</sub> represents the maximum possible pixel value of the image. In this paper, each sample point is represented by eight bits, so *MAX*<sub>*x*ˆ</sub> = 255. The larger the PSNR, the higher the quality of the reconstructed image.
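Equations (29) and (30) translate directly into code. The following minimal sketch (the function name `psnr` is our own, not from the paper) evaluates the PSNR of a reconstruction against an 8-bit original:

```python
import numpy as np

def psnr(original, reconstructed, max_val=255.0):
    """PSNR per Equations (29) and (30): 10*log10(MAX^2 / MSE)."""
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean(np.abs(original - reconstructed) ** 2)  # Equation (29)
    if mse == 0:
        return np.inf  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)  # Equation (30)

# example: 8-bit image with a uniform error of one gray level -> MSE = 1
x = np.full((256, 256), 100.0)
x_hat = x + 1.0
print(round(psnr(x, x_hat), 2))  # 20*log10(255) ≈ 48.13 dB
```

With MSE = 1 the two forms of Equation (30) coincide, which is a convenient sanity check for an implementation.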

Figure 15 shows the original images. Figures 16 and 17 show the images reconstructed using the StoGradMP and proposed algorithms, respectively. Comparing the reconstructed images with the original images, we observed that both methods successfully reconstructed the original images.

**Figure 15.** Original images (**a**) Baboon image, (**b**) Boat image, (**c**) Cameraman image, (**d**) Fruits image, (**e**) Lena image, (**f**) Peppers image.

**Figure 16.** Reconstructed images using the StoGradMP algorithm (**a**) Reconstructed Baboon image with PSNR = 16.5754 dB; (**b**) Reconstructed Boat image with PSNR = 20.9987 dB; (**c**) Reconstructed Cameraman image with PSNR = 21.8698 dB; (**d**) Reconstructed Fruits image with PSNR = 23.5299 dB; (**e**) Reconstructed Lena image with PSNR = 25.5532 dB; (**f**) Reconstructed Peppers image with PSNR = 23.7168 dB.

**Figure 17.** Reconstructed images using the proposed algorithm (**a**) Reconstructed Baboon image with PSNR = 19.4619 dB; (**b**) Reconstructed Boat image with PSNR = 24.4034 dB; (**c**) Reconstructed Cameraman image with PSNR = 28.8105 dB; (**d**) Reconstructed Fruits image with PSNR = 27.0217 dB; (**e**) Reconstructed Lena image with PSNR = 28.8224 dB; (**f**) Reconstructed Peppers image with PSNR = 27.2669 dB.

Table 1 compares the average PSNR of the StoGradMP and proposed algorithms for the different test images. Table 1 shows that the average PSNR of the proposed algorithm was higher than that of the StoGradMP algorithm for every test image, by roughly 3–4 dB. This shows that the reconstructed image quality of the proposed algorithm was better than that of the StoGradMP algorithm.


**Table 1.** Comparison of the average peak signal to noise ratios (PSNR) of the StoGradMP and proposed algorithms for the different test images.

Table 2 compares the average running times of the StoGradMP and proposed algorithms for the different test images. From Table 2, the average running time of the StoGradMP algorithm was longer than that of the proposed method for every test image, by a factor of more than two. This means that the proposed method had lower computational complexity than the StoGradMP algorithm when reconstructing images.


**Table 2.** Comparison of the average runtimes of the StoGradMP and proposed algorithms for the different test images.

Based on the above analysis, the proposed method achieves both better reconstruction performance and lower computational complexity than the StoGradMP algorithm for the different test images.

### **6. Conclusions**

In this paper, a novel stochastic gradient matching pursuit algorithm based on a weak selection threshold was proposed. The algorithm uses the weak selection threshold method to select the atoms that best match the original signal from the block sensing matrix and thereby expands the preliminary atomic set. It adopts two reliability guarantee methods to identify the support atomic set and to calculate the transition estimation of the original signal, ensuring the correctness and effectiveness of the proposed algorithm. The proposed algorithm not only eliminates the dependence on prior knowledge of the sparsity of the original signal, but also increases the flexibility of the atomic selection process while improving atomic reliability. Therefore, it enhances both the reconstruction accuracy and the reconstruction efficiency.
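The atom selection step described above can be sketched as follows. This is our own minimal illustration (the function name and the example gradient vector are hypothetical): it keeps every index whose stochastic gradient magnitude is at least κ times the largest magnitude, which is what makes the selection "weak" rather than picking a fixed number of atoms:

```python
import numpy as np

def weak_select(grad, kappa):
    """Weak selection threshold: keep indices whose gradient
    magnitude is at least kappa times the maximum magnitude."""
    mags = np.abs(grad)
    return np.flatnonzero(mags >= kappa * mags.max())

# hypothetical stochastic gradient over five atoms
g = np.array([0.9, -0.05, 0.5, 0.31, -1.0])
print(weak_select(g, 0.3).tolist())  # [0, 2, 3, 4]
print(weak_select(g, 1.0).tolist())  # [4] -- only the strongest atom survives
```

Note how the number of selected atoms adapts to the gradient itself: a smaller κ admits more candidate atoms into the preliminary set, while κ = 1 degenerates to selecting only the single best-matching atom.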

Our series of simulation results showed that the proposed method has better reconstruction performance and lower computational complexity than the other algorithms. Future research should consider using the proposed method to process large-scale array signals, such as wireless communication, radar, and sonar signals, to enhance the useful signal, suppress noise interference, reduce the burden on sensor devices, and ensure fast real-time transmission of the array signal. In our proposed method, the weak selection threshold is determined by a preset threshold parameter and the maximum stochastic gradient. The optimal threshold differs for different types of signals, which affects the reconstruction performance. Our future work will consider methods to adapt the threshold to the signal.

**Author Contributions:** Conceptualization, formal analysis, investigation, and writing the original draft was done by L.Z. and Y.H. Experimental tests were done by Y.H. and Y.J. All authors have read and approved the final manuscript.

**Funding:** This work was supported by the National Natural Science Foundation of China (61271115) and the Foundation of Jilin Educational Committee (JJKH20180336KJ).

**Conflicts of Interest:** The authors declare no conflict of interest.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
