**Data Science Measuring Uncertainties**

Printed Edition of the Special Issue Published in *Entropy*

Edited by

Carlos Alberto De Bragança Pereira, Adriano Polpo and Agatha Rodrigues

www.mdpi.com/journal/entropy

## **Data Science: Measuring Uncertainties**


Editors

**Carlos Alberto De Bragança Pereira, Adriano Polpo and Agatha Rodrigues**

MDPI • Basel • Beijing • Wuhan • Barcelona • Belgrade • Manchester • Tokyo • Cluj • Tianjin

*Editors*

Carlos Alberto De Bragança Pereira University of Sao Paulo Brazil

Adriano Polpo University of Western Australia Australia

Agatha Rodrigues Federal University of Espirito Santo Brazil

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *Entropy* (ISSN 1099-4300) (available at: https://www.mdpi.com/journal/entropy/special_issues/data_science_uncertainties).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-0792-7 (Hbk) ISBN 978-3-0365-0793-4 (PDF)**

© 2021 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.


## **About the Editors**

**Carlos Alberto De Bragança Pereira** received his B.Sc. degree in statistics from the National School of Statistical Science, Brazil, in 1968, his M.Sc. degree in statistics from the University of Sao Paulo (USP), Brazil, in 1971, and his Ph.D. degree in statistics from Florida State University, USA, in 1980. He was Director of the Institute of Mathematics and Statistics, USP, from 1994 to 1998, Head of the Department of Statistics in three separate appointments, and Director of the Bioinformatics Scientific Center, USP, from 2006 to 2009. He is currently Senior Professor with the Department of Statistics, USP, and served as Visiting Professor with the Federal University of Mato Grosso do Sul from 2018 to 2020. He has authored or coauthored more than 200 papers and seven books. He was President of the Brazilian Statistical Society and an elected member of the International Statistical Institute. He has edited three Special Issues of *Entropy* in 2018–2020.

**Adriano Polpo** is a statistician who likes to develop new methods and to study the statistical theory underlying practical approaches; to do so properly, he considers it necessary to work with real challenges. In search of such challenges, he was introduced to many different problems in statistics and collaborated with many researchers in other fields. Adriano graduated from the State University of Campinas, Brazil, in 2001, obtained his Ph.D. from the University of Sao Paulo, Brazil, in 2005, and undertook a postdoctoral sabbatical at Florida State University, USA, from 2011 to 2012. He was Associate Professor at the Federal University of Sao Carlos from 2006 to 2018, and Head of the Department of Statistics from 2013 to 2015. He was Elected Secretary (2011–2012), Elected President (2013–2014), and Member of the Board of Directors (2015–2016) of ISBrA (the Brazilian Chapter of ISBA, the International Society for Bayesian Analysis). Since 2019, Adriano has been Associate Professor at the University of Western Australia and Chief Investigator of the Australian Research Council Training Centre for Transforming Maintenance through Data Science. He has been an investigator of the Brazilian Obstetric Observatory project, funded by the Bill & Melinda Gates Foundation, since 2021. He has mainly been working with reliability/survival analysis, regression models for count data, and Bayesian nonparametric methods. He has also been working with statistical hypothesis testing. His research agenda aims at providing solutions to many different problems, from functional analysis of human gait, studies of the impact of copper nanoparticles on algae (carbon farming), and latency to treatment in OCD patients, to methods for the proper use of statistical hypothesis tests. He has served as chair of several scientific events and authored or coauthored more than 40 papers and three books. He has edited four Special Issues of *Entropy* in 2018–2020.

**Agatha Rodrigues** received her B.Sc. degree in statistics from the Federal University of Sao Carlos, Brazil, in 2010, and M.Sc. and Ph.D. degrees in statistics from the University of Sao Paulo, Brazil, in 2013 and 2018, respectively. Currently, she works as Professor in the Statistics Department at the Federal University of Espírito Santo, Brazil. She has authored or coauthored more than 20 papers. Her current research interests include biostatistics, reliability, and survival analysis. She has edited one Special Issue of *Entropy* (2020). Dr. Rodrigues is also the principal investigator of the Brazilian Obstetric Observatory project, funded by the Bill & Melinda Gates Foundation.

### *Editorial* **Data Science: Measuring Uncertainties**

#### **Carlos Alberto de Bragança Pereira 1,\*,†, Adriano Polpo 2,† and Agatha Sacramento Rodrigues 3,†**


Received: 26 November 2020; Accepted: 15 December 2020; Published: 20 December 2020

With the increase in data processing and storage capacity, a large amount of data has become available. Data without analysis have little value, so the demand for data analysis increases daily; one consequence is the large number of jobs and published articles appearing in this area.

Data science has emerged as a multidisciplinary field to support data-driven activities, integrating and developing ideas, methods and processes to extract information from data. There are methods built from different areas of knowledge: Statistics, Computer Science, Mathematics, Physics, Information Science and Engineering, among others. This mixture of areas gave rise to what we call Data Science.

New solutions to new problems have been proposed rapidly for large volumes of data. Current and future challenges require greater care in creating new solutions that satisfy the rationality of each type of problem. Labels such as Big Data, Data Science, Machine Learning, Statistical Learning and Artificial Intelligence are demanding more sophistication in their foundations and in the way they are being applied. This point highlights the importance of building the foundations of Data Science.

This Special Issue is dedicated to solutions of, and discussions on, measuring uncertainties in data analysis problems. The twelve articles in this edition discuss data science problems, consider the reasoning behind the proposed solutions, and illustrate how to apply them to real or simulated datasets.

As stated earlier, multidisciplinarity is an important feature of data science, and this is clearly presented in this Special Issue. Ref. [1] proposes a new method for modelling problems and a data-clustering framework, and ref. [2] considers the estimation of the probability density function. In terms of the stochastic process, ref. [3] considers the fundamental properties of Tensor Markov Fields. Under a Bayesian paradigm of Statistical Inference, ref. [4] proposes a solution to classification problems.

Time series is one of the most prominent areas in data science, and some of the articles published here propose solutions with practical motivations in this area [5–8]. As mentioned before, this Special Issue encouraged articles on the foundations of measuring uncertainty [9–12].

The first article of this Special Issue was published on 30 October 2019, and the last on 26 October 2020. The articles are briefly discussed below, in order of the date of submission.

Due to their flexibility in treating heterogeneous populations, mixture models have been increasingly considered in modelling problems, and they provide a better cluster interpretation under a data-clustering framework [13].

In traditional solutions from the literature, the results of the mixture model fit are highly dependent on the number of components fixed a priori. Thus, selecting an incorrect number of mixture components may cause non-convergence of the algorithm and/or poor exploration of the clusterings [1].

Ref. [1] is the first published article in this issue. The authors propose an integrated approach that jointly selects the number of clusters and estimates the parameters of interest, without needing to specify (fix) the number of components. The authors developed the ISEM (integrated stochastic expectation maximisation) algorithm where the allocation probabilities depend on the number of clusters, and they are independent of the number of components of the mixture model.

In addition to the theoretical development and the evaluation of the proposed algorithm through simulation studies, the authors analyse two datasets. The first contains the velocities (in km/s) of 82 galaxies from six well-separated conic sections of an unfilled survey of the Corona Borealis region; this is the well-known Galaxy dataset. The second dataset contains an acidity index measured on a sample of 155 lakes in central-north Wisconsin.

In the context of probability density function (pdf) estimation, ref. [2] presents a wide range of applications, exemplifying its ubiquitous importance in data analysis. The authors discuss the need to develop universal measures that quantify error and uncertainty, enabling comparisons across distribution classes, and they establish a robust, distribution-free method that makes estimates rapidly while quantifying the error of each estimate.

The authors consider a high-throughput, non-parametric maximum entropy method that employs a log-likelihood scoring function to characterise uncertainty in trial probability density estimates through a scaled quantile residual (SQR). This work builds on [14]. The SQR for the true probability density has universal, sample-size-invariant properties equivalent to those of sampled uniform random data (SURD).

Several alternative scoring functions based on the SQR were considered, and their sensitivity in quantifying the quality of a pdf estimate was compared. A scoring function must exhibit distribution-free and sample-size-invariant properties so that it can be applied to any random sample of a continuous random variable. Notably, all the scoring functions presented in the article exhibit the desirable properties, with similar or greater efficacy than the Anderson–Darling scoring function, and all are useful for assessing the quality of density estimates.

They present a numerical study to explore different types of measures for SQR quality. The initial emphasis was on constructing sensitive quality measures that are universal and sample size invariant. These scoring functions based on SQR properties can be applied to quantifying the "goodness of fit" of a pdf estimate created by any methodology, without knowledge of the true pdf.
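The SQR construction described above can be sketched in a few lines. This is a hedged reading of [14]: it assumes the scaled quantile residual compares the sorted probability-integral transform of the sample against the expected uniform order statistics, scaled by √(n+1); the scoring functions of the article are built on top of such residuals.

```python
import numpy as np
from math import erf, sqrt

def scaled_quantile_residual(sample, cdf):
    """Sketch of the SQR: compare the sorted probability-integral
    transform u = cdf(sample) with the expected uniform order
    statistics k/(n+1), scaled by sqrt(n+1).  If cdf is the true
    CDF, u behaves like sampled uniform random data (SURD) and the
    residuals fluctuate around zero at a sample-size-invariant scale."""
    u = np.sort(cdf(np.asarray(sample, dtype=float)))
    n = len(u)
    expected = np.arange(1, n + 1) / (n + 1.0)   # E[u_(k)] for uniforms
    return np.sqrt(n + 1.0) * (u - expected)

def normal_cdf(z):
    # Standard normal CDF via the error function (no SciPy needed).
    return 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))

rng = np.random.default_rng(0)
x = rng.normal(size=500)
sqr = scaled_quantile_residual(x, normal_cdf)   # small, centred residuals
```

Under the correct trial CDF, these residuals stay small and centred regardless of sample size, which is what makes SQR-based scores comparable across distribution classes.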

The effectiveness of each scoring function is evaluated using receiver operating characteristic (ROC) curves to identify the most discriminating scoring function, by comparing overall performance characteristics during density estimation across a diverse test set of known probability distributions.

Integer-valued time series are relevant to many fields of knowledge, and an extensive number of models have been proposed, such as the first-order integer-valued autoregressive (INAR(1)) model. Ref. [5] considered a hierarchical Bayesian version of the INAR(p) model with variable innovation rates, clustered according to a Pitman–Yor process placed at the top of the model hierarchy.
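The binomial-thinning mechanism underlying INAR(1) can be illustrated with a short simulation; the Poisson innovations and the starting value here are illustrative assumptions, not taken from ref. [5], which places a Pitman–Yor prior over the innovation rates.

```python
import numpy as np

def simulate_inar1(alpha, lam, n, seed=0):
    """Simulate X_t = alpha ∘ X_{t-1} + eps_t, where '∘' is binomial
    thinning (each count in X_{t-1} survives independently with
    probability alpha) and eps_t ~ Poisson(lam).  Ref. [5] builds a
    hierarchical Bayesian INAR(p) model on top of this mechanism;
    the innovation distribution and start value here are illustrative."""
    rng = np.random.default_rng(seed)
    x = np.empty(n, dtype=np.int64)
    x[0] = rng.poisson(lam / (1.0 - alpha))        # near the stationary mean
    for t in range(1, n):
        survivors = rng.binomial(x[t - 1], alpha)  # binomial thinning
        x[t] = survivors + rng.poisson(lam)        # plus new arrivals
    return x

series = simulate_inar1(alpha=0.5, lam=2.0, n=200)
# Stationary mean is lam / (1 - alpha) = 4 for these settings.
```

Thinning guarantees the series stays integer-valued and non-negative, which is exactly the property that makes INAR models suitable for count data such as yearly earthquake counts.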

Using the full conditional distributions of the innovation rates, they inspected the behaviour of the model when concentrating or spreading the mass of the Pitman–Yor base measure. They then presented a graphical criterion that identifies an elbow in the posterior expectation of the number of clusters as the hyperparameters of the base measure vary. The authors investigated prior sensitivity and found ways to control the hyperparameters in order to achieve robust results. A significant contribution is this graphical criterion, which guides the specification of the hyperparameters of the Pitman–Yor process base measure.

Besides the theoretical development, the proposed graphical criterion was evaluated on simulated data. Considering a time series of yearly worldwide earthquake events of substantial magnitude (equal to or greater than 7 points on the Richter scale) from 1900 to 2018, they compared the forecasting performance of their model against the original INAR(p) model.

Ref. [6] considered the problem of model fit and model forecasting in time series. For that, the authors considered singular spectrum analysis (SSA), a powerful non-parametric technique that decomposes the original time series into a set of interpretable components, such as trend, seasonal, and noise components. They proposed robust SSA algorithms by replacing the standard least-squares singular value decomposition (SVD) with robust SVD algorithms, one based on the L1 norm and another on the Huber function. A forecasting strategy was then presented for the robust SSA algorithms, based on the linear recurrent SSA forecasting algorithm.
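For orientation, the basic (non-robust) SSA decomposition that ref. [6] robustifies can be sketched as follows; the window length and the grouping of singular triples below are illustrative choices, not the authors' settings.

```python
import numpy as np

def ssa_decompose(y, L, groups):
    """Basic SSA with the standard least-squares SVD (ref. [6]
    replaces this SVD with robust L1/Huber variants): embed the
    series in an L x K trajectory matrix, take the SVD, and rebuild
    each group of singular triples by diagonal averaging."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    K = N - L + 1
    X = np.column_stack([y[i:i + L] for i in range(K)])  # trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = []
    for idx in groups:
        Xg = (U[:, idx] * s[idx]) @ Vt[idx, :]           # grouped low-rank piece
        comp = np.zeros(N)
        counts = np.zeros(N)
        for j in range(K):                               # diagonal averaging
            comp[j:j + L] += Xg[:, j]
            counts[j:j + L] += 1.0
        components.append(comp / counts)
    return components

t = np.arange(200)
y = 0.05 * t + np.sin(2 * np.pi * t / 12)                # trend + seasonality
trend, seasonal = ssa_decompose(y, L=24, groups=[[0], [1, 2]])
```

Grouping all singular triples reproduces the original series exactly; the robust variants change only how the low-rank factors are fitted, not this embedding/averaging skeleton.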

Considering a simulation example and time-series data from investment funds, the algorithms were compared to other versions of the SSA algorithm and to classical ARIMA. The comparisons considered computational time and accuracy for model fit and model forecasting.

Ref. [9] presented a discussion of hypothetical judgment and measures to evaluate it, exemplified with the diagnosis of Coronavirus Disease (COVID-19) infection. Their purposes are (1) to distinguish channel confirmation measures, which are compatible with the likelihood ratio, from prediction confirmation measures, which can be used to assess probability predictions, and (2) to use a prediction confirmation measure to eliminate the Raven Paradox and to explain that confirmation and falsification may be compatible.

They consider the measure *F*, one of the few confirmation measures possessing the desirable properties identified by many authors: symmetries and asymmetries, normalisation, and monotonicity. The measure *b*∗, the degree of belief optimised with a sampling distribution, was also considered as a confirmation measure; it is similar to the measure *F* and possesses the same desirable properties.
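As an illustration of a likelihood-ratio-based confirmation measure, the sketch below uses the Kemeny–Oppenheim form F = [P(e|h) − P(e|¬h)] / [P(e|h) + P(e|¬h)], which satisfies the normalisation and monotonicity properties mentioned above; treating this as exactly the *F* of ref. [9] is an assumption here.

```python
def confirmation_F(p_e_given_h, p_e_given_not_h):
    """Kemeny–Oppenheim-style confirmation measure: a normalised,
    monotone function of the likelihood ratio, ranging over [-1, 1].
    F = 1 when the evidence is possible only under h, F = -1 when it
    is possible only under not-h, and F = 0 when e is uninformative."""
    num = p_e_given_h - p_e_given_not_h
    den = p_e_given_h + p_e_given_not_h
    return num / den if den > 0 else 0.0

# A diagnostic test with sensitivity 0.9 and false-positive rate 0.1
# strongly confirms the infection hypothesis:
confirmation_F(0.9, 0.1)   # → 0.8
```

Because it depends only on the likelihood ratio, such a measure can rank test results without stating the probability of infection itself, which is precisely the limitation the editorial notes next.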

Using the diagnosis of COVID-19 infection, they show that only measures that are functions of the likelihood ratio, such as *F* and *b*∗, can help to diagnose the infection or choose a better result that can be accepted by the medical community. However, the measures *F* and *b*∗ do not reflect the probability of infection, and it is still difficult to eliminate the Raven Paradox using *F* or *b*∗.

The measures *F* and *b*∗ indicate how good a hypothesis test of means is compared to probability predictions. Hence, the authors proposed a measure *c*∗ that indicates how good a probability prediction is: *c*∗ is called the prediction confirmation measure, while *b*∗ is the channel confirmation measure. The measure *c*∗ accords with the Nicod criterion and undermines the Equivalence Condition, and hence can be used to eliminate the Raven Paradox.

Ref. [3] presented the definitions and properties of Tensor Markov Fields (random Markov fields over tensor spaces). The author shows that Tensor Markov Fields are indeed Gibbs fields whenever strictly positive probability measures are considered. It is also proved how this class of Markov fields can be built from statistical dependency structures inferred on information-theoretical grounds over empirical data. The author discusses how Tensor Markov Fields can be useful for mathematical modelling and data analysis due to their intrinsic simplicity and generality.

Ref. [4] proposed a variational approximation for probit regression models with intrinsic priors to deal with classification problems. The authors' motivations for combining intrinsic prior methodology and variational inference include: automatically generating a family of non-informative priors; applying intrinsic priors to inference problems; the flat tails of intrinsic priors, which prevent finite-sample inconsistency; and, for inference problems with large datasets, the fact that variational approximation methods are much faster than MCMC-based methods.

The proposed method is applied to the LendingClub dataset (https://www.lendingclub.com). LendingClub is a peer-to-peer lending platform that enables borrowers to create unsecured personal loans between \$1000 and \$40,000. Investors can search and browse the loan listings on the LendingClub website and select loans they want to invest in; information about the borrower, the amount of the loan, the loan grade, and the loan purpose is provided to them. The variable loan status (paid-off or charged-off) is the target variable, and [4] considers a set of predictive covariates, such as loan term in months, employment length in years, and annual income, among others.

Ref. [10] constructed a decision-making model based on intuitionistic fuzzy cross-entropy and a comprehensive grey correlation analysis algorithm. Their motivation is that, although intuitionistic fuzzy distance measurement is an effective method for studying multi-attribute emergency decision-making (MAEDM) problems, the traditional intuitionistic fuzzy distance measurement method cannot accurately reflect the difference between membership and non-membership data, so it easily causes information confusion.

The intuitionistic fuzzy cross-entropy distance measurement method was introduced; it not only retains the integrity of the decision information but also directly reflects the differences between intuitionistic fuzzy data. Focusing on the weight problem in MAEDM, the authors analysed and compared known and unknown attribute weights, which significantly improved the reliability and stability of the decision-making results. The intuitionistic fuzzy cross-entropy and grey correlation analysis algorithm were applied to emergency decision-making problems such as the location ranking of shelters in earthquake disaster areas, significantly reducing the risk of decision-making. The validity of the proposed method was verified by comparing the traditional intuitionistic fuzzy distance with the intuitionistic fuzzy cross-entropy.
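One widely used intuitionistic fuzzy cross-entropy is the Vlachos–Sergiadis form sketched below, which operates directly on (membership, non-membership) pairs; whether ref. [10] adopts exactly this variant is an assumption, and the "ideal"/"shelter" data are hypothetical.

```python
from math import log

def ifs_cross_entropy(A, B):
    """Intuitionistic fuzzy cross-entropy in the Vlachos–Sergiadis
    form (assumed here; ref. [10] may use a close variant).  A and B
    are lists of (membership, non-membership) pairs, one pair per
    attribute, with the convention 0 * log(0/x) = 0."""
    def term(a, b):
        return a * log(2.0 * a / (a + b)) if a > 0 else 0.0
    ce = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(A, B):
        ce += term(mu_a, mu_b) + term(nu_a, nu_b)
    return ce

def symmetric_ifs_divergence(A, B):
    # Symmetrised version, suitable for ranking alternatives against
    # an ideal solution in MAEDM problems.
    return ifs_cross_entropy(A, B) + ifs_cross_entropy(B, A)

ideal = [(0.9, 0.1), (0.8, 0.1)]       # hypothetical ideal alternative
shelter = [(0.6, 0.3), (0.7, 0.2)]     # hypothetical candidate shelter
symmetric_ifs_divergence(ideal, shelter)
```

Unlike a plain distance on membership values alone, this divergence weights membership and non-membership separately, which is the property the authors exploit to avoid information confusion.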

The authors highlight that the proposed method applies to emergency decision-making problems with certain subjective preferences. In addition to the theoretical approach, and to highlight the importance of dealing with disasters, the authors took the Wenchuan Earthquake of 12 May 2008 as a case study, constructing and solving the ranking problem of shelters.

Motivated by time series problems, ref. [7] reviewed the shortcomings of unit root and cointegration tests. They proposed a Bayesian approach based on the Full Bayesian Significance Test (FBST), a procedure designed to test a sharp or precise hypothesis.

The importance of studying this is justified by the fact that, to perform statistical inference, one should be able to assess whether a time series presents deterministic or stochastic trends. For univariate analysis, one way to detect stochastic trends is to test whether the series has unit roots (unit root tests). For multivariate studies, it is important to determine whether there are stationary linear relationships between the series, i.e., whether they cointegrate (cointegration tests).

The Augmented Dickey–Fuller test is one of the most popular tests for assessing whether a time series described by an autoregressive model has a stochastic trend, i.e., a unit root. When searching for long-term relationships between multiple series, it is crucial to know whether there are stationary linear combinations of these series, i.e., whether the series are cointegrated. One of the most used cointegration tests is the maximum eigenvalue test.
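The unit-root idea behind the Dickey–Fuller test can be conveyed with a plain regression of Δy_t on y_{t−1}. This toy sketch omits the augmentation lags and the non-standard critical values that a real test requires, and the simulated series are illustrative.

```python
import numpy as np

def dickey_fuller_tstat(y):
    """t-ratio of gamma in the Dickey–Fuller regression
    Δy_t = c + gamma * y_{t-1} + e_t.  Under a unit root gamma = 0;
    strongly negative statistics (below the DF critical values,
    about -2.86 at the 5% level with a constant) favour stationarity.
    A toy version: augmentation lags and proper critical values
    are omitted."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])
    beta = np.linalg.lstsq(X, dy, rcond=None)[0]
    resid = dy - X @ beta
    s2 = resid @ resid / (len(dy) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return float(beta[1] / np.sqrt(cov[1, 1]))

rng = np.random.default_rng(42)
walk = np.cumsum(rng.normal(size=500))     # unit root: t-stat near zero
ar = rng.normal(size=500)
for t in range(1, 500):
    ar[t] += 0.5 * ar[t - 1]               # stationary AR(1)
t_walk = dickey_fuller_tstat(walk)
t_ar = dickey_fuller_tstat(ar)             # strongly negative
```

The stationary series yields a statistic far below any plausible critical value, while the random walk does not; production code should use a full ADF implementation with proper critical values.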

Besides proposing the method based on the FBST, the authors also compared its performance with the most used frequentist alternatives. They showed that the FBST works considerably well even when one uses improper priors, a choice that may preclude the derivation of Bayes Factors, a standard Bayesian procedure in hypothesis testing.

Ref. [11] considered the Kalman filter together with the Rényi entropy, which was employed to measure the uncertainty of the multivariate Gaussian probability density function. The authors proposed calculating the temporal derivative of the Rényi entropy of the Kalman filter's mean square error matrix, which provides the optimal recursive solution mathematically and is minimised to obtain the Kalman filter gain.

One of the findings of this manuscript was that, from a physical point of view, the continuous Kalman filter approaches a steady state when the temporal derivative of the Rényi entropy equals zero, which means that the Rényi entropy remains stable.
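The Rényi entropy of a Gaussian has the closed form H_α = ½ ln((2π)^d |P|) + (d/2) ln α / (α − 1), so the stabilisation described above is easy to reproduce with a discrete-time scalar Kalman filter (the continuous-time setting of ref. [11] differs; the system parameters below are illustrative assumptions).

```python
import numpy as np

def renyi_entropy_gaussian(P, alpha=2.0):
    """Rényi entropy of a d-dimensional Gaussian with covariance P:
    H_alpha = 0.5*ln((2*pi)^d * det(P)) + (d/2)*ln(alpha)/(alpha - 1)."""
    P = np.atleast_2d(np.asarray(P, dtype=float))
    d = P.shape[0]
    return float(0.5 * np.log((2 * np.pi) ** d * np.linalg.det(P))
                 + 0.5 * d * np.log(alpha) / (alpha - 1.0))

# Illustrative scalar system x_{t+1} = a*x_t + w,  y_t = x_t + v.
a, Q, R = 0.9, 0.2, 1.0
P = 5.0                                   # initial error variance
entropies = []
for _ in range(200):
    P_pred = a * P * a + Q                # time update
    K = P_pred / (P_pred + R)             # Kalman gain
    P = (1.0 - K) * P_pred                # measurement update
    entropies.append(renyi_entropy_gaussian(P))
# As the Riccati recursion converges, the entropy's temporal
# difference goes to zero: the filter has reached steady state.
```

Because the entropy is a monotone function of the error covariance, its derivative vanishing is equivalent to the Riccati recursion reaching its fixed point, which is the paper's steady-state statement.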

A numerical experiment of falling-body tracking in noisy conditions with radar, using the unscented Kalman filter, and a practical experiment of loosely coupled integration are provided to demonstrate the effectiveness of these statements and to show that the Rényi entropy indeed stays stable when the system becomes steady.

Knowledge of future values and stock market trends has attracted the attention of researchers, investors, financial experts, and brokers. Ref. [8] proposed a stock trend prediction model that combines the cloud model, Heikin–Ashi candlesticks, and fuzzy time series in a unified model.

By incorporating probability and fuzzy set theories, the cloud model can perform the required transformation between qualitative concepts and quantitative data. The degree of certainty associated with candlestick patterns can be calculated through repeated assessments employing the normal cloud model. A hybrid weighting method, comprising the fuzzy time series and Heikin–Ashi candlesticks, was employed to determine the weights of the indicators in the multi-criteria decision-making process. The cloud model constructs fuzzy membership functions to deal effectively with the uncertainty and vagueness of the historical stock data and to predict the next open, high, low, and close prices for the stock.
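Heikin–Ashi candlesticks are a deterministic smoothing of raw OHLC data, so they are easy to sketch; the seeding of the first candle's open as (open + close)/2 is the usual convention and an assumption here.

```python
def heikin_ashi(candles):
    """Convert raw OHLC candles to Heikin–Ashi candles (the smoothed
    representation combined with the cloud model in ref. [8]).
    `candles` is a list of (open, high, low, close) tuples; the
    first HA open is seeded as (open + close) / 2 by convention."""
    ha = []
    for i, (o, h, l, c) in enumerate(candles):
        ha_close = (o + h + l + c) / 4.0
        ha_open = (o + c) / 2.0 if i == 0 else (ha[-1][0] + ha[-1][3]) / 2.0
        ha_high = max(h, ha_open, ha_close)
        ha_low = min(l, ha_open, ha_close)
        ha.append((ha_open, ha_high, ha_low, ha_close))
    return ha

heikin_ashi([(10, 12, 9, 11), (11, 13, 10, 12)])
# → [(10.5, 12, 9, 10.5), (10.5, 13, 10, 11.5)]
```

Averaging each candle with its predecessor filters out single-period noise, which is why the smoothed candles make trend patterns easier to score with the cloud model.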

The objective of the proposed model is to handle qualitative forecasting, not only quantitative forecasting. The experimental results demonstrate the feasibility and high forecasting accuracy of the constructed model.

Ref. [12] uses the maximum entropy principle to provide an equation to calculate the Lagrange multipliers. Accordingly, an equation was developed to predict the bank profile shape of threshold channels.

The relation between the entropy parameter and the hydraulic and geometric characteristics of the channels was evaluated. The Entropy-based Design Model of Threshold Channels (EDMTC), for estimating the shape of bank profiles and the channel dimensions, was designed based on the maximum entropy principle in combination with a Gene Expression Programming regression model.

The results indicate that the entropy model is capable of predicting the bank profile shape trend with acceptable error. The proposed EDMTC can be used in threshold channel design and for cases when the channel characteristics are unknown.

It is our understanding that this Special Issue contributes to increasing knowledge in the data science field by fostering discussions of measuring uncertainties in data analysis problems. Discussing the foundations and theoretical aspects of the methods is essential to avoid the use of black-box procedures, as is demonstrating the methods on real-problem data. Theory and application are both important to the development of data science.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **An Integrated Approach for Making Inference on the Number of Clusters in a Mixture Model**

**Erlandson Ferreira Saraiva 1,\*, Adriano Kamimura Suzuki 2, Luis Aparecido Milan <sup>3</sup> and Carlos Alberto de Bragança Pereira 1,4**


Received: 23 September 2019; Accepted: 26 October 2019; Published: 30 October 2019

**Abstract:** This paper presents an integrated approach for the estimation of the parameters of a mixture model in the context of data clustering. The method is designed to estimate the unknown number of clusters from observed data. For this, we marginalize out the weights to obtain allocation probabilities that depend on the number of clusters but not on the number of components of the mixture model. As an alternative to the stochastic expectation maximization (SEM) algorithm, we propose the integrated stochastic expectation maximization (ISEM) algorithm, which, in contrast to SEM, does not require the a priori specification of the number of components of the mixture. Using this algorithm, one estimates the parameters associated with the clusters having at least two observations via local maximization of the likelihood function. In addition, at each iteration of the algorithm, there is a positive probability of a new cluster being created by a single observation. Using simulated datasets, we compare the performance of the ISEM algorithm against both the SEM and reversible jump (RJ) algorithms. The obtained results show that ISEM outperforms the SEM and RJ algorithms. We also compare the performance of the three algorithms on two real datasets.

**Keywords:** model-based clustering; mixture model; EM algorithm; integrated approach

#### **1. Introduction**

Recently, there has been increasing interest in modeling using mixture models, mainly due to their flexibility for treating heterogeneous populations. Under a data-clustering framework, this model has the advantage of being probabilistic, so the obtained clusters have a better interpretation from a statistical point of view [1]. This contrasts with usual methods, such as k-means or hierarchical clustering, in which clusters are not statistically based, as discussed by [2].

From a frequentist viewpoint, the standard method to obtain the maximum likelihood estimates for the parameters of a mixture model is the Expectation Maximization (EM) algorithm [3]. However, to use this algorithm, the number of components *k* of the mixture model needs to be known a priori. As the resulting model is highly dependent on the choice of this value, the main question is how to set the *k* value. Selecting an erroneous *k* value may cause non-convergence of the algorithm and/or poor exploration of the clusterings. In addition, depending on the chosen *k* value, we may have empty components, for which there are no maximum likelihood estimates.

An approach frequently used to determine the best *k* value among a fixed set of values is the stochastic version of the EM algorithm (SEM) combined with a model selection criterion, such as the Akaike information criterion (AIC) [4,5] or the BIC [6]. In this approach, models are fitted for a set of predefined *k* values, and the best model is the one with the smallest AIC or BIC value.
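This selection recipe amounts to penalising each model's maximised log-likelihood by its number of free parameters. A minimal sketch, where the 3k − 1 parameter count assumes a univariate normal mixture (k means, k variances, k − 1 free weights) and the log-likelihood values are hypothetical:

```python
from math import log

def aic(loglik, n_params):
    # Akaike information criterion: 2p - 2*loglik (smaller is better).
    return 2 * n_params - 2 * loglik

def bic(loglik, n_params, n_obs):
    # Bayesian information criterion: p*ln(n) - 2*loglik.
    return n_params * log(n_obs) - 2 * loglik

def select_k(logliks_by_k, n_obs):
    """Pick the k with the smallest BIC.  The 3k - 1 parameter count
    assumes a univariate normal mixture: k means, k variances, and
    k - 1 free weights."""
    return min(logliks_by_k,
               key=lambda k: bic(logliks_by_k[k], 3 * k - 1, n_obs))

# Hypothetical maximised log-likelihoods from SEM fits:
select_k({1: -500.0, 2: -480.0, 3: -478.0}, n_obs=100)   # → 2
```

The example illustrates the typical outcome: the small gain in log-likelihood from k = 2 to k = 3 does not pay for three extra parameters under the BIC penalty.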

However, as discussed by [7], fitting several models over a predefined set of values for the number of clusters and comparing them using a model selection criterion is neither practical nor efficient. It is therefore desirable to have an efficient algorithm that determines the optimal number of clusters together with the estimation of the parameters of each mixture component. In this scenario, a Bayesian approach has been successfully applied using the Markov chain Monte Carlo (MCMC) algorithm with reversible jumps, described by [8] in the context of univariate normal mixture models. On the other hand, a difficulty often encountered in implementing a reversible jump (RJ) algorithm is the construction of efficient transition proposals that lead to a reasonable acceptance rate.

Following the line of MCMC algorithms, [9] proposes a split–merge MCMC procedure for the conjugate Dirichlet process mixture model using a restricted Gibbs sampling scan to determine a split proposal, where the number of scans (a tuning parameter) must be fixed in advance by the user, and [10] extends this method to a nonconjugate Dirichlet process mixture model. [11] proposes a data-driven split-and-merge approach, in which the number of clusters is updated according to the creation of a new component based on a single observation, using a split–merge strategy developed from the Kullback–Leibler divergence. A difficulty in implementing this algorithm is obtaining the mathematical expression for the Kullback–Leibler divergence, which does not always have a known analytical form. In addition, the sequential allocation used in the split–merge strategy of these three works may make the algorithms slow when the sample size is large, and the computational implementation of these methods is not simple.

The present work proposes an integrated approach that, in a joint way, selects the number of clusters and estimates the parameters of interest. With this approach, the mixture weights are integrated out to obtain allocation probabilities that depend on the number of clusters (nonempty components) but do not depend on the number of components *k*. In addition, considering *k* tending to infinity, this procedure introduces a positive probability of a new cluster being created by a single observation. When a new cluster is created, the parameters associated with it are generated from its posterior distribution. We then developed the ISEM (integrated stochastic expectation maximization) algorithm to estimate the parameters of interest. This algorithm configures a setting for latent allocation variables **c** according to allocation probabilities, and then the cluster parameters are updated conditionally on **c** as follows: for clusters with at least two observations, the parameter values are the maximum likelihood estimates; for the clusters with only one observation, the parameter values are generated from their posterior distribution.
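The flavour of allocation probabilities obtained after integrating out the weights can be conveyed with a Chinese-restaurant-process-style sketch. This illustrates only the general shape in the k → ∞ limit (existing clusters attract in proportion to their size, a new cluster opens with probability proportional to a concentration parameter γ); it is not the authors' exact expressions.

```python
def allocation_probs(cluster_sizes, gamma):
    """Chinese-restaurant-process-style allocation probabilities
    obtained when mixture weights are integrated out and k tends to
    infinity: an existing cluster j attracts the observation with
    probability proportional to its size n_j, and a brand-new
    singleton cluster opens with probability proportional to the
    concentration parameter gamma.  A sketch of the general shape
    only, not ref. [1]'s exact formulas."""
    total = float(sum(cluster_sizes)) + gamma
    probs = [n_j / total for n_j in cluster_sizes]
    probs.append(gamma / total)     # probability of creating a new cluster
    return probs

allocation_probs([5, 3], gamma=2.0)   # → [0.5, 0.3, 0.2]
```

The key point carried over to ISEM is the last entry: at every iteration there is positive probability that a single observation opens a new cluster, so the number of clusters need not be fixed in advance.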

In order to illustrate the computational implementation of the method and verify its performance, we have considered a specific model in which data are generated from mixtures of univariate normal distributions. This model allows us to avoid the label-switching problem by labeling the components in increasing order of the component means, as done by [8,11–13], among others. We emphasize, however, that our algorithm is not restricted to this particular model. For instance, in the multivariate case, we may label the components according to the eigenvalues of the current covariance matrix, as done by [14]. A detailed discussion of the multivariate case will be given in a future paper.

We also compare the performance of ISEM with both the SEM and RJ algorithms. The criteria used to compare the methods are the estimated probability of the number of clusters, convergence of the sampled values, mixing, autocorrelation, and computational time. We also apply the three algorithms to two real datasets: the first is the well-known Galaxy data, and the second is a dataset on acidity.

The remainder of the paper is organized as follows. Section 2 describes the mixture model and the estimation process based on the SEM algorithm. Section 3 develops the integrated approach and describes the ISEM algorithm. Section 4 shows how we applied the algorithm to simulated datasets in order to assess its performance. Section 5 describes the application of the three algorithms to two real datasets. Section 6 presents our final remarks. Additional details are given in the Supplementary Material, referred to as "SM" in this paper. Table 1 presents the main notation used throughout the article.

**Table 1.** Main mathematical notation used throughout the paper.


#### **2. Mixture Model and SEM Algorithm**

Let **y** = (*y*<sub>1</sub>, ... , *y<sub>n</sub>*) be a vector of independent observations from a mixture model with *k* components, i.e.,

$$f(y\_i|\mathbf{w}, \theta\_k, k) = \sum\_{j=1}^{k} w\_j f(y\_i|\theta\_j), \tag{1}$$

where *f*(*y<sub>i</sub>*|*θ<sub>j</sub>*) is the density of a parametric family of distributions with parameter *θ<sub>j</sub>* (scalar or vector), ***θ**<sub>k</sub>* = (*θ*<sub>1</sub>, ... , *θ<sub>k</sub>*) collects the component parameters, and **w** = (*w*<sub>1</sub>, ... , *w<sub>k</sub>*), with *w<sub>j</sub>* > 0 and ∑<sup>*k*</sup><sub>*j*=1</sub> *w<sub>j</sub>* = 1, are the component weights.

The log-likelihood function for (*θk*, **w**) is given by

$$l(\boldsymbol{\theta}\_k, \mathbf{w} | \mathbf{y}, k) = \log \left\{ \prod\_{i=1}^n \left[ \sum\_{j=1}^k w\_j f(y\_i | \boldsymbol{\theta}\_j) \right] \right\} = \sum\_{i=1}^n \log \left\{ \left[ \sum\_{j=1}^k w\_j f(y\_i | \boldsymbol{\theta}\_j) \right] \right\}.$$

The notation *l*(*θ<sub>k</sub>*, **w**|**y**, *k*) follows Casella and Berger (2002).

The usual procedure for obtaining the maximum likelihood estimators consists of taking the partial derivatives of *l*(*θ<sub>k</sub>*, **w**|**y**) with respect to *θ<sub>j</sub>* and setting the result equal to zero, i.e.,

$$\frac{d l(\boldsymbol{\theta}\_{k}, \mathbf{w} | \mathbf{y})}{d \theta\_{j}} = \sum\_{i=1}^{n} \frac{w\_{j} f(y\_{i} | \theta\_{j})}{\sum\_{j=1}^{k} w\_{j} f(y\_{i} | \theta\_{j})} \frac{d \log \left[ f(y\_{i} | \theta\_{j}) \right]}{d \theta\_{j}} = 0,\tag{2}$$

for *j* = 1, . . . , *k*.
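As a concrete illustration of the objective behind Equation (2), the following minimal Python sketch evaluates the mixture log-likelihood for a univariate normal mixture (the function name and the mean/variance parametrization of the normal density are ours, not from the paper):

```python
import math

def mixture_loglik(y, w, mu, sigma2):
    """Mixture log-likelihood l(theta_k, w | y, k):
    sum_i log( sum_j w_j * N(y_i | mu_j, sigma2_j) )."""
    ll = 0.0
    for yi in y:
        dens = sum(
            wj * math.exp(-(yi - mj) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
            for wj, mj, s2 in zip(w, mu, sigma2)
        )
        ll += math.log(dens)
    return ll
```

A one-component mixture reduces to the ordinary normal log-likelihood, which gives a quick sanity check of the implementation.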

Note that in (2) the maximization procedure is a weighted maximization of the log-likelihood function, with each observation *y<sub>i</sub>* having a weight associated with component *j* given by

$$w\_{ij}^\* = \frac{w\_j f(y\_i|\theta\_j)}{\sum\_{j=1}^k w\_j f(y\_i|\theta\_j)},\tag{3}$$

for *i* = 1, ... , *n* and *j* = 1, ... , *k*. However, these weights depend on the parameters that we are trying to estimate, so we cannot obtain a closed-form expression that allows direct maximization of the log-likelihood function. For this reason, the mixture problem is reformulated as a complete-data problem [12,15].
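The weights of Equation (3) can be sketched directly in Python for a univariate normal mixture (a minimal illustration with names of our choosing; each row of the result is the vector of weights *w*<sup>∗</sup><sub>*ij*</sub> for one observation):

```python
import math

def responsibilities(y, w, mu, sigma2):
    """Weights w*_ij of Eq. (3): normalized contribution of component j
    to the mixture density evaluated at y_i."""
    out = []
    for yi in y:
        num = [wj * math.exp(-(yi - mj) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
               for wj, mj, s2 in zip(w, mu, sigma2)]
        tot = sum(num)
        out.append([v / tot for v in num])
    return out
```

By construction each row sums to one, and an observation equidistant from two equally weighted components receives weight 1/2 from each.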

#### *Complete-Data Formulation*

Associate with each observation *y<sub>i</sub>* an unobserved latent indicator variable *c<sub>i</sub>*, such that if *c<sub>i</sub>* = *j*, then *y<sub>i</sub>* comes from component *j*, for *i* = 1, ... , *n* and *j* = 1, ... , *k*. The probability of *c<sub>i</sub>* = *j* is *w<sub>j</sub>*, i.e., *P*(*c<sub>i</sub>* = *j*|**w**, *k*) = *w<sub>j</sub>*. Letting *n<sub>j</sub>* be the number of observations from component *j* (i.e., the number of *c<sub>i</sub>*'s equal to *j*), the joint probability for **c** = (*c*<sub>1</sub>, ... , *c<sub>n</sub>*) given **w** and *k* is

$$
\pi(\mathbf{c}|\mathbf{w},k) = \prod\_{j=1}^{k} w\_j^{n\_j}.\tag{4}
$$

The distribution of the occupation numbers *n*<sub>1</sub>, ... , *n<sub>k</sub>* (the numbers of observations assigned to each component) is multinomial: (*n*<sub>1</sub>, ... , *n<sub>k</sub>*|*n*, **w**) ∼ *Multinomial*(*n*, **w**), where *n* = *n*<sub>1</sub> + ... + *n<sub>k</sub>*.

Thus, under this augmented framework, we have that


$$l(\boldsymbol{\theta}\_k, \mathbf{w}|\mathbf{y},\mathbf{c}) = \log \left\{ \prod\_{j=1}^{k} w\_j^{n\_j} L(\boldsymbol{\theta}\_j|\mathbf{y}) \right\} = \sum\_{j=1}^{k} \left[ n\_j \log(w\_j) + l(\boldsymbol{\theta}\_j|\mathbf{y}) \right],$$

where *l*(*θ<sub>j</sub>*|**y**) = log *L*(*θ<sub>j</sub>*|**y**) is the log-likelihood function for component *j*, for *j* = 1, ... , *k*. Thus, the estimation of the *k* component parameters reduces to *k* independent estimation problems. For example, for a normal mixture model, the maximum likelihood estimate of the component parameters *θ<sub>j</sub>* = (*μ<sub>j</sub>*, *σ*<sup>2</sup><sub>*j*</sub>) is *θ*ˆ<sub>*j*</sub> = (*μ*ˆ<sub>*j*</sub>, *σ*ˆ<sup>2</sup><sub>*j*</sub>) = (*y*¯<sub>*j*</sub>, *s*<sup>2</sup><sub>*j*</sub>), where *y*¯<sub>*j*</sub> and *s*<sup>2</sup><sub>*j*</sub> are, respectively, the mean and variance of the observations allocated to component *j*, for *j* = 1, ... , *k*.
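The per-component maximum likelihood step can be sketched as follows (a minimal Python illustration of the normal-mixture case; the function name is ours, and labels in `c` are assumed to run over 1, ... , *k*):

```python
def component_mles(y, c, k):
    """Given allocations c, compute the MLE (mean, variance) of each
    normal component independently, as in the complete-data formulation."""
    est = {}
    for j in range(1, k + 1):
        yj = [yi for yi, ci in zip(y, c) if ci == j]
        m = sum(yj) / len(yj)                              # ybar_j
        v = sum((yi - m) ** 2 for yi in yj) / len(yj)      # s2_j (MLE, divisor n_j)
        est[j] = (m, v)
    return est
```

Note that the sketch assumes every component is nonempty; the empty-component issue is precisely the practical problem discussed at the end of this section.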

From this complete-data formulation, the estimation procedure is given by an iterative process with two steps. In the first one, the allocation indicator variables are updated conditional on component parameters, and in the subsequent step, the component parameters are updated conditional on configuration of the allocation indicator variables.

The usual algorithm used to implement these two steps is the EM algorithm [3]. The stochastic version of the EM algorithm (SEM) can be implemented according to Algorithm 1.

#### **Algorithm 1** SEM Algorithm

1: Initialize the algorithm with a configuration **c**<sup>(0)</sup> = (*c*<sup>(0)</sup><sub>1</sub>, ... , *c*<sup>(0)</sup><sub>*n*</sub>) for the allocation indicator variables.


8: Do *s* = *s* + 1 and return to step (3).
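Since the intermediate steps of the listing above are not fully reproduced here, the following Python sketch illustrates one SEM iteration under our own reading of the algorithm (stochastic allocation by the weights of Eq. (3), followed by re-estimation; function and variable names are ours):

```python
import math
import random

def sem_step(y, w, mu, sigma2, rng=random.Random(0)):
    """One SEM iteration for a univariate normal mixture:
    (a) draw c_i from the allocation probabilities of Eq. (3);
    (b) re-estimate w_j, mu_j, sigma2_j from the allocated observations."""
    n, k = len(y), len(w)
    c = []
    for yi in y:
        p = [wj * math.exp(-(yi - mj) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)
             for wj, mj, s2 in zip(w, mu, sigma2)]
        tot = sum(p)
        c.append(rng.choices(range(k), weights=[v / tot for v in p])[0])
    for j in range(k):
        yj = [yi for yi, ci in zip(y, c) if ci == j]
        if yj:  # an empty component keeps its previous parameter values
            w[j] = len(yj) / n
            mu[j] = sum(yj) / len(yj)
            v = sum((x - mu[j]) ** 2 for x in yj) / len(yj)
            sigma2[j] = v if v > 0 else sigma2[j]
    return c, w, mu, sigma2
```

The guard on empty components is our own workaround; as discussed next, empty components are one of the practical problems of the SEM algorithm.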

Although it is simple to implement computationally, the SEM algorithm may present some practical problems. As discussed by [16], the algorithm may converge slowly. For this reason, some authors, such as [17,18], discuss how to choose starting values that speed up convergence. In addition, [15] discusses the non-existence of a global maximum estimator.

Moreover, in this algorithm the value of *k* must be known in advance. When *k* is unknown, the best value is chosen by fitting a set of models for a set of predefined *k* values and comparing them according to the AIC [4,5] or BIC [6] criteria. Furthermore, for a sample of size *n* and a fixed *k*, there is a positive probability, given by (1 − *w<sub>j</sub>*)<sup>*n*</sup> > 0, that the *j*-th component has no observations allocated to it in an iteration of the algorithm. In this case, we have an empty component, and the maximum likelihood estimates cannot be calculated for it. Thus, in order to avoid the practical problems presented by the SEM algorithm, we propose an integrated approach.

#### **3. Integrated Approach**

We start our integrated approach by linking data clustering to a mixture model. For this, consider a sampling process from a heterogeneous population that is subdivided into *k* sub-populations. It is then natural to assume that the sampling process consists of the following steps:


for *i* = 1, . . . , *n* and *j* = 1, . . . , *k*, where *n* is the sample size.

Let (*Y<sub>i</sub>*, *c<sub>i</sub>*) be a sample unit, where *c<sub>i</sub>* is an allocation indicator variable that assumes a value in the set {1, ... , *k*} with probabilities {*w*<sub>1</sub>, ... , *w<sub>k</sub>*}, respectively. Thus, assuming that sub-population *j* is modeled by a probability distribution *F*(*θ<sub>j</sub>*) indexed by the parameter *θ<sub>j</sub>* (scalar or vector), we have that

$$(Y\_i|c\_i = j, \theta\_j) \sim F(\theta\_j) \quad \text{and} \quad P(c\_i = j|\mathbf{w}) = w\_j,$$

for *i* = 1, . . . , *n* and *j* = 1, . . . , *k*.

However, in clustering problems, the *ci*'s values are non-observable. Thus, the probability of *ci* = *j* is *wj*, and the marginal probability density function for *Yi* = *yi* is given by Equation (1).

In addition, since the model in Equation (1) is a population model, there exists a non-null probability (1 − *w<sub>j</sub>*)<sup>*n*</sup> that the *j*-th component is empty. Thus, the number of clusters (i.e., non-empty components) may be smaller than the number of components *k*. As seen in the description of the SEM algorithm, the number of clusters is defined by the configuration of the latent allocation variables **c**; hereafter, we denote the number of clusters by *k*<sub>**c**</sub>, with *k*<sub>**c**</sub> ≤ *k*.

Since the interest lies in the configuration of **c**, let us marginalize out the weights of the mixture model. Integrating density (4) with respect to the *Dirichlet*(*γ*/*k*, ... , *γ*/*k*) prior distribution of the weights, denoted by (*w*<sub>1</sub>, ... , *w<sub>k</sub>*)|*k*, *γ* ∼ *Dirichlet*(*γ*/*k*, ... , *γ*/*k*), the joint probability for **c** is given by (see Appendix 3 of the SM)

$$\pi(\mathbf{c}|\gamma,k) = \frac{\Gamma(\gamma)}{\Gamma(n+\gamma)} \prod\_{j=1}^{k} \frac{\Gamma(n\_j + \frac{\gamma}{k})}{\Gamma\left(\frac{\gamma}{k}\right)}.\tag{5}$$

Similarly, the conditional probability for *c<sub>i</sub>* = *j* given **c**<sub>−*i*</sub> = (*c*<sub>1</sub>, ... , *c*<sub>*i*−1</sub>, *c*<sub>*i*+1</sub>, ... , *c<sub>n</sub>*) is given by

$$\pi(c\_i = j | \mathbf{c}\_{-i}, \gamma, k) = \frac{n\_{j,-i} + \frac{\gamma}{k}}{n + \gamma - 1},\tag{6}$$

where *nj*,−*<sup>i</sup>* is the number of observations allocated to the *j*-th component, excluding the *i*-th observation, for *i* = 1, . . . , *n* and *j* = 1, . . . , *k*.

As the main interest is in *k*<sub>**c**</sub> and not *k*, we remove *k* from Equation (6) by letting *k* tend to infinity. In this limit, the probability becomes

$$
\pi(c\_i = j | \mathbf{c}\_{-i}, \gamma) = \frac{n\_{j,-i}}{n + \gamma - 1},\tag{7}
$$

when *nj*,−*<sup>i</sup>* > 0, for *i* = 1, ... , *n* and *j* = 1, ... , *k***c**, where *k***<sup>c</sup>** is the number of clusters defined by configuration **c**. In addition, we now have a probability of the *i*-th observation being allocated to one of the other infinite components, which is given by

$$
\pi(c\_i = j^\* | \mathbf{c}\_{-i}, \gamma) = \frac{\gamma}{n + \gamma - 1},
\tag{8}
$$

for *j*<sup>∗</sup> ∉ {1, ... , *k*<sub>**c**</sub>}. This is the probability of observation *y<sub>i</sub>* creating a new cluster, for *i* = 1, ... , *n*. The probabilities in (7) and (8) are equivalent to the update probabilities of a Dirichlet process mixture model; see, for example, [19–21].
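The allocation probabilities of Equations (7) and (8) can be sketched in a few lines of Python (an illustrative helper of our own; `counts_minus_i` maps each occupied cluster label to *n*<sub>*j*,−*i*</sub>):

```python
def allocation_probs(counts_minus_i, n, gamma):
    """Prior allocation probabilities of Eqs. (7)-(8):
    an occupied cluster j gets n_{j,-i} / (n + gamma - 1);
    a brand-new cluster gets gamma / (n + gamma - 1)."""
    denom = n + gamma - 1.0
    probs = {j: nj / denom for j, nj in counts_minus_i.items() if nj > 0}
    probs["new"] = gamma / denom
    return probs
```

Since the *n*<sub>*j*,−*i*</sub> sum to *n* − 1, the probabilities over occupied clusters plus the new-cluster mass always sum to one, which the sketch makes easy to check numerically.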

Given *y<sub>i</sub>*, the conditional probability for *c<sub>i</sub>* = *j*, such that *n*<sub>*j*,−*i*</sub> > 0, is

$$
\pi\_{ij} = \pi(c\_i = j | y\_i, \theta\_j, \mathbf{c}\_{-i}, \gamma) = \frac{n\_{j,-i}}{n + \gamma - 1} f(y\_i|\theta\_j),\tag{9}
$$

for *i* = 1, ... , *n* and *j* = 1, ... , *k*<sub>**c**<sub>−*i*</sub></sub>, where *k*<sub>**c**<sub>−*i*</sub></sub> is the number of clusters excluding the *i*-th observation. At this point, it is important to note that if an observation *y<sub>i</sub>* is allocated to a component *j*, *c<sub>i</sub>* = *j*, and *n<sub>j</sub>* > 1, then *n*<sub>*j*,−*i*</sub> ≥ 1 and *k*<sub>**c**<sub>−*i*</sub></sub> = *k*<sub>**c**</sub>. But if *c<sub>i</sub>* = *j* and *n<sub>j</sub>* = 1, then *n*<sub>*j*,−*i*</sub> = 0 and *k*<sub>**c**<sub>−*i*</sub></sub> = *k*<sub>**c**</sub> − 1.

To define the conditional probability of the *i*-th observation creating a new cluster *j*<sup>∗</sup> = *k*<sub>**c**<sub>−*i*</sub></sub> + 1, we integrate the parameters out for this case, so that the resulting probability does not depend on the parameter value *θ*<sub>*j*<sup>∗</sup></sub>. Thus, the conditional posterior probability for *c<sub>i</sub>* = *j*<sup>∗</sup> is

$$
\pi\_{ij^\*} = \pi(c\_i = j^\* | y\_i, \mathbf{c}\_{-i}, \gamma) = \frac{\gamma}{n + \gamma - 1} \mathbf{I}(y\_i), \tag{10}
$$

where **I**(*y<sub>i</sub>*) = ∫ *f*(*y<sub>i</sub>*|*θ*<sub>*j*<sup>∗</sup></sub>)*π*(*θ*<sub>*j*<sup>∗</sup></sub>)*dθ*<sub>*j*<sup>∗</sup></sub> and *π*(*θ*<sub>*j*<sup>∗</sup></sub>) is the density of the prior distribution for *θ*<sub>*j*<sup>∗</sup></sub>, for *i* = 1, ... , *n*.

As is known from the literature, the likelihood function of a mixture model is non-identifiable, i.e., any permutation of the component labels leads to the same likelihood function (see, for example, [8,11,22–24]). Thus, in order to obtain identifiability, we assume that *μ*<sub>1</sub>, ... , *μ*<sub>*k*<sub>**c**</sub></sub> are the cluster means and that *μ*<sub>1</sub> < ... < *μ*<sub>*k*<sub>**c**</sub></sub>. This does not prevent the algorithm described in the next section from being applied with another labeling criterion. Additional discussion of label switching can be found in [22,23].

#### *3.1. Integrated SEM Algorithm*

Using probabilities given in Equations (9) and (10), we update the allocation indicator variables according to Algorithm 2.

Conditional on a configuration **c**, we have *k*<sub>**c**</sub> clusters, so we update the parameters of interest according to Algorithm 3. We then join Algorithms 2 and 3 to obtain Algorithm 4.

After the *S* iterations, we discard the first *B* iterations as burn-in. We also thin the chain with "jumps" of size *h*, i.e., only one draw in every *h* is kept, yielding a sub-sequence of size *H* = (*S* − *B*)/*h* used for inference. Denote this sub-sequence by S(*H*).

Let *N*<sub>*k*<sub>**c**</sub></sub>(*j*) be the number of times that *k*<sub>**c**</sub> = *j* in S(*H*), for *j* ∈ {1, ... , *K<sub>max</sub>*}, where *K<sub>max</sub>* is the maximum *k*<sub>**c**</sub> value sampled over the iterations. Then *P*˜(*k*<sub>**c**</sub> = *j*) = *N*<sub>*k*<sub>**c**</sub></sub>(*j*)/*H* is the estimated probability of *k*<sub>**c**</sub> = *j*. We then consider

$$\tilde{k}\_{\mathbf{c}} = \operatorname\*{arg\max}\_{1 \le j \le K\_{\max}} \tilde{P}(k\_{\mathbf{c}} = j)$$

as the estimate of the number of clusters *k*<sub>**c**</sub>.
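The post-processing just described (burn-in, thinning, and mode of the thinned draws) can be sketched as follows (a minimal Python illustration with names of our choosing; the slice approximates *H* = (*S* − *B*)/*h* up to rounding):

```python
from collections import Counter

def estimate_kc(kc_draws, burn_in, thin):
    """Estimate k_c as the mode of the thinned post-burn-in sub-sequence S(H),
    returning the mode and the estimated probabilities P~(k_c = j)."""
    sub = kc_draws[burn_in::thin]          # keep one draw in every `thin`
    H = len(sub)
    probs = {j: cnt / H for j, cnt in Counter(sub).items()}
    kc_hat = max(probs, key=probs.get)
    return kc_hat, probs
```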

Appendix 1 of the SM presents the mathematical expressions used to determine a configuration for **c** and the estimates of the cluster parameters, conditional on the estimate *k*˜<sub>**c**</sub>.

#### **Algorithm 2** Updating **c**

5: If *Z<sub>ij</sub>* = 1, for *j* ∈ {1, ... , *k*<sub>**c**<sub>−*i*</sub></sub>}, set *c<sub>i</sub>* = *j* and do *n<sub>j</sub>* = *n*<sub>*j*,−*i*</sub> + 1;

8: If *μ*<sub>*j*<sup>∗</sup></sub> = max<sub>1≤*j*≤*k*<sub>**c**</sub></sub> *μ<sub>j</sub>*, then do *j*<sup>∗</sup> = *k*<sub>**c**</sub> and keep all other cluster labels;

9: If *μ<sub>j</sub>* < *μ*<sub>*j*<sup>∗</sup></sub> < *μ*<sub>*j*+1</sub>, for some *j* ∈ {1, ... , *k*<sub>**c**</sub>}, then do *j*<sup>∗</sup> = *j* + 1 and relabel all other clusters with label ≥ *j* + 1 by adding 1.

#### **Algorithm 3** Updating cluster parameters


#### **Algorithm 4** ISEM Algorithm


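The combination of Algorithms 2 and 3 can be sketched as one ISEM sweep in Python. This is a simplified reading of our own (names are ours; `marg` evaluates **I**(*y<sub>i</sub>*) of Eq. (10), `draw_new` samples from the posteriors of Eqs. (12)–(13), and the mean-ordering relabeling step is omitted for brevity):

```python
import math
import random

def norm_pdf(y, mu, s2):
    return math.exp(-(y - mu) ** 2 / (2.0 * s2)) / math.sqrt(2.0 * math.pi * s2)

def isem_sweep(y, c, theta, gamma, marg, draw_new, rng):
    """One ISEM sweep: reallocate each observation via Eqs. (9)-(10), then
    refit cluster parameters (MLE if n_j >= 2, posterior draw if n_j == 1).
    c: list of integer labels; theta: {label: (mu, sigma2)}."""
    n = len(y)
    for i in range(n):
        c[i] = None                                   # remove y_i from its cluster
        counts = {}
        for lab in c:
            if lab is not None:
                counts[lab] = counts.get(lab, 0) + 1
        theta = {lab: theta[lab] for lab in theta if lab in counts}
        labs = sorted(counts)
        wts = [counts[l] / (n + gamma - 1.0) * norm_pdf(y[i], *theta[l]) for l in labs]
        wts.append(gamma / (n + gamma - 1.0) * marg(y[i]))   # new-cluster mass, Eq. (10)
        tot = sum(wts)
        pick = rng.choices(range(len(wts)), weights=[v / tot for v in wts])[0]
        if pick < len(labs):
            c[i] = labs[pick]
        else:                                         # y_i opens a new cluster
            new = max(theta, default=0) + 1
            theta[new] = draw_new(y[i])
            c[i] = new
    # refit: MLE for clusters with >= 2 points, posterior draw for singletons
    for lab in set(c):
        pts = [yi for yi, ci in zip(y, c) if ci == lab]
        if len(pts) >= 2:
            m = sum(pts) / len(pts)
            v = sum((p - m) ** 2 for p in pts) / len(pts)
            theta[lab] = (m, max(v, 1e-8))            # small floor avoids s2 = 0
        else:
            theta[lab] = draw_new(pts[0])
    return c, theta
```

In a full run, this sweep is repeated for *S* iterations and the sampled *k*<sub>**c**</sub> values are post-processed as described above.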
#### **4. Simulation Study**

In this section, we describe the results of a simulation study carried out to verify the performance of the proposed algorithm. To generate the artificial datasets, we considered univariate normal mixture models, setting the number of clusters and parameter values according to the values specified in Table 2 and fixing the sample size at *n* = 200.


**Table 2.** Number of clusters and parameter values used for simulating the datasets.

The procedure for generating the datasets is given by the following steps:

(i) For *i* = 1, ... , *n*, generate *U<sub>i</sub>* ∼ U(0, 1); if ∑<sup>*j*−1</sup><sub>*l*=0</sub> *w<sub>l</sub>* < *u<sub>i</sub>* ≤ ∑<sup>*j*</sup><sub>*l*=0</sub> *w<sub>l</sub>*, generate *Y<sub>i</sub>* ∼ N(*μ<sub>j</sub>*, *σ*<sup>2</sup><sub>*j*</sub>), with parameter values fixed according to Table 2, for *w*<sub>0</sub> = 0 and *j* = 1, ... , *k*<sub>**c**</sub>.

(ii) In order to record which component each observation is generated from, we define *G* = (*G*<sub>1</sub>, ... , *G<sub>n</sub>*) such that *G<sub>i</sub>* = *j* if *Y<sub>i</sub>* ∼ N(*μ<sub>j</sub>*, *σ*<sup>2</sup><sub>*j*</sub>), for *i* = 1, ... , *n* and *j* = 1, ... , *k*<sub>**c**</sub>.
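Steps (i) and (ii) can be sketched as follows (a minimal Python illustration; the function name, default seed, and the floating-point guard in the cumulative sum are ours):

```python
import random

def generate_mixture(n, w, mu, sigma2, rng=None):
    """Steps (i)-(ii): draw U_i ~ U(0,1), pick component j by the cumulative
    weights (with w_0 = 0), draw Y_i ~ N(mu_j, sigma2_j); G_i records j."""
    rng = rng or random.Random(42)
    y, G = [], []
    for _ in range(n):
        u, cum, j = rng.random(), 0.0, 0
        for jj, wj in enumerate(w, start=1):
            cum += wj
            if u <= cum:
                j = jj
                break
        else:
            j = len(w)   # guard against round-off when the weights sum to ~1
        y.append(rng.gauss(mu[j - 1], sigma2[j - 1] ** 0.5))
        G.append(j)
    return y, G
```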

Having generated the datasets, we need to define the probability of creating a new cluster and the posterior distribution for *θ*<sub>*j*<sup>∗</sup></sub> = (*μ*<sub>*j*<sup>∗</sup></sub>, *σ*<sup>2</sup><sub>*j*<sup>∗</sup></sub>) given *y<sub>i</sub>*, for *i* = 1, ... , *n*. For this, consider the following conjugate prior distributions for the component parameters *θ<sub>j</sub>* = (*μ<sub>j</sub>*, *σ*<sup>2</sup><sub>*j*</sub>):

$$
\mu\_j|\sigma\_j^2, \mu\_0, \lambda \sim \mathcal{N}\left(\mu\_0, \frac{\sigma\_j^2}{\lambda}\right) \quad \text{and} \quad \sigma\_j^{-2}|\alpha, \beta \sim \Gamma(\alpha, \beta),
$$

where *μ*<sub>0</sub>, *λ*, *α*, and *β* are hyperparameters. The parametrization of the gamma distribution is such that the mean is *α*/*β* and the variance is *α*/*β*<sup>2</sup>.

Following [11,24], we define the hyperparameter values as follows. Let *R* be the observed range of the data and *ε* its midpoint. Then, we set *μ*<sub>0</sub> = *ε* and *E*(*σ*<sup>−2</sup><sub>*j*</sub>) = *R*<sup>−2</sup>, which gives *β* = *αR*<sup>2</sup>, and we fix *α* = 1. In addition, to obtain a prior distribution with a large variance, we fix *λ* = 10<sup>−2</sup>, and for the hyperparameter *γ* we take *γ* = 0.1.
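These data-driven defaults can be collected in a small helper (an illustrative sketch with names of our choosing):

```python
def default_hyperparams(y, alpha=1.0, lam=1e-2, gamma=0.1):
    """Hyperparameter choices of the paper: mu0 is the midpoint of the
    observed range R, and E(sigma_j^{-2}) = alpha/beta = R^{-2}
    implies beta = alpha * R**2."""
    R = max(y) - min(y)
    mu0 = (max(y) + min(y)) / 2.0
    beta = alpha * R ** 2
    return mu0, lam, alpha, beta, gamma
```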

Thus, the probability of creating a new cluster is given by Equation (10), in which

$$\mathbf{I}(y\_i) = \left[\frac{\lambda}{2\beta\pi(1+\lambda)}\right]^{\frac{1}{2}} \frac{\Gamma(\alpha+1)}{\Gamma(\alpha)} \left[1 + \frac{y\_i^2 + \lambda\mu\_0^2}{2\beta} - \frac{(y\_i + \lambda\mu\_0)^2}{2\beta(1+\lambda)}\right]^{-(\alpha+\frac{1}{2})},\tag{11}$$

and *j*<sup>∗</sup> = *k*<sub>**c**</sub> + 1, for *i* = 1, ... , *n*.
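Equation (11) is straightforward to evaluate numerically. The sketch below is our own transcription (the function name is ours; note that the bracketed term equals 1 + *λ*(*y<sub>i</sub>* − *μ*<sub>0</sub>)²/(2*β*(1 + *λ*)), so **I**(*y<sub>i</sub>*) is symmetric about *μ*<sub>0</sub> and decreasing in |*y<sub>i</sub>* − *μ*<sub>0</sub>|):

```python
import math

def marginal_I(yi, mu0, lam, alpha, beta):
    """I(y_i) of Eq. (11): prior predictive density of a single observation
    under the conjugate normal-gamma prior."""
    const = math.sqrt(lam / (2.0 * beta * math.pi * (1.0 + lam)))
    ratio = math.gamma(alpha + 1.0) / math.gamma(alpha)   # equals alpha
    bracket = 1.0 + (yi ** 2 + lam * mu0 ** 2) / (2.0 * beta) \
                  - (yi + lam * mu0) ** 2 / (2.0 * beta * (1.0 + lam))
    return const * ratio * bracket ** (-(alpha + 0.5))
```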

When a new cluster is created, the new parameter values *θ*<sub>*j*<sup>∗</sup></sub> = (*μ*<sub>*j*<sup>∗</sup></sub>, *σ*<sup>2</sup><sub>*j*<sup>∗</sup></sub>) are generated from the following conditional posterior distributions:

$$
\mu\_{j^\*} | \sigma\_{j^\*}^2, y\_i, \mathbf{c}, \mu\_0, \lambda \sim \mathcal{N}\left(\frac{y\_i + \lambda \mu\_0}{1 + \lambda}, \frac{\sigma\_{j^\*}^2}{1 + \lambda}\right) \tag{12}
$$

and

$$\sigma\_{j^\*}^{-2} | y\_i, \mathbf{c}, \alpha, \beta \sim \Gamma \left( \alpha + 1, \beta + \frac{y\_i^2 + \lambda \mu\_0^2}{2} - \frac{(y\_i + \lambda \mu\_0)^2}{2(1 + \lambda)} \right), \tag{13}$$

for *j*<sup>∗</sup> = *k*<sub>**c**<sub>−*i*</sub></sub> + 1.
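Sampling from Equations (12)–(13) can be sketched as follows (names ours; note that Python's `random.gammavariate` is parametrized by shape and *scale*, so we pass the reciprocal of the rate parameter of Eq. (13)):

```python
import random

def draw_new_cluster(yi, mu0, lam, alpha, beta, rng=None):
    """Draw (mu_{j*}, sigma2_{j*}) for a cluster created by the single
    observation y_i, following Eqs. (12)-(13)."""
    rng = rng or random.Random(0)
    b_post = beta + (yi ** 2 + lam * mu0 ** 2) / 2.0 \
                  - (yi + lam * mu0) ** 2 / (2.0 * (1.0 + lam))
    prec = rng.gammavariate(alpha + 1.0, 1.0 / b_post)   # sigma^{-2}; scale = 1/rate
    s2 = 1.0 / prec
    mu = rng.gauss((yi + lam * mu0) / (1.0 + lam), (s2 / (1.0 + lam)) ** 0.5)
    return mu, s2
```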

We ran the ISEM algorithm with *S* = 55,000, *B* = 5000, and *h* = 10, which yields a sub-sequence S(*H*) of size 5000 for inference. The algorithm was initialized with *k*<sub>**c**</sub> = 1 and parameter values *μ*<sub>1</sub> = *y*¯ and *σ*<sup>2</sup><sub>1</sub> = *s*<sup>2</sup>, the sample mean and variance of the generated dataset, respectively.

We also applied to the generated datasets the SEM algorithm, as described in Section 2, and the RJ algorithm as proposed by [8]. To choose the number of clusters with the SEM algorithm, we used the AIC and BIC model selection criteria. The SEM algorithm was initialized with a configuration **c**<sup>(0)</sup> obtained via the *k*-means algorithm [25], and as stopping criterion we set the threshold *ε* = 0.001. For the RJ algorithm, we used the same number of iterations, burn-in, and thinning value as for the ISEM algorithm.

In order to compare the three algorithms in terms of estimation of the number of clusters, we consider *M* = 500 simulated datasets. Table 3 shows the proportion of times that the ISEM and RJ algorithms put the highest estimated probability on the *k*<sub>**c**</sub> values presented. This table also shows the proportion of times that the AIC and BIC indicated each *k*<sub>**c**</sub> value as the best among the tested values. The values highlighted in bold are the proportions on the true *k*<sub>**c**</sub> value. As one can note, the ISEM shows a better performance, i.e., a higher proportion on the true *k*<sub>**c**</sub> value than the other two algorithms, especially in relation to the SEM algorithm with selection of *k*<sub>**c**</sub> via the AIC and BIC. The results also show that the AIC and BIC model selection criteria have a low success rate, with a proportion on the true *k*<sub>**c**</sub> value smaller than 0.50.


**Table 3.** Proportion of times the algorithms chose the *k***c** values as the number of clusters.

#### *4.1. Results from a Single Simulated Data Set*

We also analyze the results from a single dataset selected at random from the *M* = 500 generated datasets in each scenario *A*<sub>1</sub> to *A*<sub>4</sub>. We then discuss the convergence of the ISEM and RJ algorithms based on the samples generated across iterations, using graphical tools. In general, these tools show whether the simulated chain stabilizes in some sense and provide useful feedback about convergence [26].

Table 4 shows the estimated probabilities of *k*<sub>**c**</sub> obtained with ISEM and RJ and the AIC and BIC values from the SEM algorithm for the selected dataset. In this table, the values highlighted in bold are the highest estimated probabilities and the smallest AIC and BIC values. As we can note, the ISEM algorithm assigns the maximum probability to the true *k*<sub>**c**</sub> value in all four simulated cases.

The RJ algorithm puts the highest probability on the true *k*<sub>**c**</sub> value for datasets *A*<sub>1</sub> and *A*<sub>2</sub>. However, this probability is smaller than that estimated by ISEM, indicating higher precision for the ISEM algorithm. For datasets *A*<sub>3</sub> and *A*<sub>4</sub>, RJ attributes maximum probability to the wrong values, *k*<sub>**c**</sub> = 5 and *k*<sub>**c**</sub> = 6, respectively. Moreover, the probabilities estimated by RJ do not single out one *k*<sub>**c**</sub> value as the best, since different *k*<sub>**c**</sub> values receive similar probabilities. For example, for dataset *A*<sub>2</sub>, the maximum is at *k*<sub>**c**</sub> = 3 with *P*(*k*<sub>**c**</sub> = 3|·) = 0.3836, but one can argue that the estimated probabilities favor *k*<sub>**c**</sub> = 3 or *k*<sub>**c**</sub> = 4. For dataset *A*<sub>3</sub>, there is similar support for *k*<sub>**c**</sub> between 4 and 7, and for *A*<sub>4</sub> between 5 and 7.

Analogously to ISEM and RJ, the AIC and BIC model selection criteria indicate the true *k*<sub>**c**</sub> value as the best for datasets *A*<sub>1</sub> and *A*<sub>2</sub>. For dataset *A*<sub>3</sub>, similarly to RJ, the AIC indicates the wrong value *k*<sub>**c**</sub> = 5 as the best, while the BIC indicates the true *k*<sub>**c**</sub> value. For dataset *A*<sub>4</sub>, both the AIC and the BIC indicate the wrong value *k*<sub>**c**</sub> = 6 as the best model.

#### *4.2. An Empirical Check of the Convergence*

We now empirically check the convergence of the sequence of probabilities for *k*<sub>**c**</sub> across iterations, the capacity to move among different *k*<sub>**c**</sub> values over the iterations, and the estimated autocorrelation function (ACF) for the ISEM and RJ algorithms.


**Table 4.** Estimated probability for *k***c**.

Figure 1a,d,g,j presents the probability of *k*<sub>**c**</sub> over the iterations for the four simulated datasets. For better visualization, we plot only the three highest *P*(*k*<sub>**c**</sub>|·) estimates. From these figures, it can be seen that the *S* iterations and the burn-in value *B* used were adequate to achieve stability for *P*(*k*<sub>**c**</sub>|·). In addition, Figure 1b,e,h,k shows that the ISEM algorithm mixes well over *k*<sub>**c**</sub>, i.e., it "visits" mixture models with different values of *k*<sub>**c**</sub> across iterations. As shown by Figure 1c,f,i,l, the sampled *k*<sub>**c**</sub> values also do not present a significant autocorrelation function (ACF). Thus, based on these graphical tools, there is no evidence against the convergence of the values generated by the ISEM algorithm.

Figure 2 shows the performance of the RJ algorithm. The probabilities of *k*<sub>**c**</sub> present satisfactory stability, the sampled *k*<sub>**c**</sub> values mix satisfactorily, and the estimated autocorrelation is non-significant. However, as can be noted in Figure 2, the probabilities of the number of clusters do not single out one *k*<sub>**c**</sub> value to be chosen as the best, as ISEM does. This may happen because the performance of RJ depends on the choice of transition functions to make "good" jumps, and a transition function that is adequate for one dataset may not be for another. As the ISEM algorithm does not require the specification of transition functions to propose a change of the *k*<sub>**c**</sub> value, these results show that ISEM may be an effective alternative to the RJ and SEM algorithms for the joint estimation of *k*<sub>**c**</sub> and the cluster parameters of a mixture model.

Figure 1 in Appendix 2 of the SM shows the generated values for datasets *A*<sub>1</sub> to *A*<sub>4</sub>, together with the clusters identified by the ISEM algorithm. As can be seen, the clusters are satisfactorily identified by the proposed algorithm.

We also compare the ISEM and RJ algorithms in terms of CPU time. The simulations were run on a MacBook Pro with a 2.5 GHz dual-core Intel Core i5 and 4 GB of DDR3 memory. Table 5 summarizes the per-iteration times of the ISEM and RJ algorithms; the column denoted s.d. presents the standard deviations. For dataset *A*<sub>1</sub>, the average time RJ takes to run one iteration is 1.8491 times that of ISEM. For datasets *A*<sub>2</sub>, *A*<sub>3</sub>, and *A*<sub>4</sub>, this ratio is 1.8175, 2.3239, and 1.8932, respectively. These results show a better performance of the ISEM algorithm. The higher iteration times of the RJ algorithm are mainly due to the split–merge step used to improve the mixing of the Markov chain with respect to the number of clusters.

**Figure 1.** Performance of the ISEM algorithm across iterations.

**Figure 2.** Performance of the RJ algorithm across iterations.

The results from these simulated datasets show that the ISEM algorithm may be an effective alternative to the RJ and SEM algorithms for data clustering in situations where the number of clusters is an unknown quantity.


**Table 5.** Times of the iterations, in seconds.

#### **5. Application**

The three algorithms are now applied to two real datasets. The first refers to the velocity, in km/s, of *n* = 82 galaxies from 6 well-separated conic sections of an unfilled survey of the Corona Borealis region. This dataset, known in the literature as the Galaxy data, has already been analyzed by [8,13,22,27], among others, and is available in the R software. The second dataset refers to an acidity index measured in a sample of *n* = 155 lakes in central-north Wisconsin; it was downloaded from https://people.maths.bris.ac.uk/~mapjg/mixdata.

For the application of the ISEM and RJ algorithms, we consider the same values *S* = 55,000, *B* = 5000, and *h* = 10. Table 6 shows the estimated probabilities for *k*<sub>**c**</sub> obtained with ISEM and RJ and the AIC and BIC values from the SEM algorithm for each dataset. The maximum probabilities from ISEM and RJ and the minimum AIC and BIC values are highlighted in bold.


**Table 6.** Estimated probabilities for *k***c**, real datasets.

For the Galaxy dataset, the ISEM and RJ algorithms put the highest probability on *k*<sub>**c**</sub> = 3 and *k*<sub>**c**</sub> = 5, respectively. However, as in the simulation study, the probabilities estimated by RJ do not single out one *k*<sub>**c**</sub> value as the best; for this dataset, they indicate a *k*<sub>**c**</sub> value between 3 and 7. The AIC and BIC also indicate *k*<sub>**c**</sub> = 5 as the best value. For the Acidity dataset, ISEM, AIC, and BIC indicate *k*<sub>**c**</sub> = 2 as the best value, while the probabilities estimated by RJ attribute similar values to *k*<sub>**c**</sub> = 3 and *k*<sub>**c**</sub> = 4.

Figures 3 and 4 show the performance of the ISEM and RJ algorithms across iterations for the Galaxy and Acidity datasets. The values sampled by the ISEM algorithm present satisfactory stability of the estimated probability across iterations, mix well among different *k*<sub>**c**</sub> values, and present no significant autocorrelation; that is, there is no evidence against the convergence of the chain generated by the ISEM algorithm. For RJ, the sampled values mix well and do not present significant autocorrelation. However, although the values sampled by RJ present stability for *P*(*k*<sub>**c**</sub>), the estimated probabilities do not single out one *k*<sub>**c**</sub> value to be chosen as the best, as ISEM does, which suggests that RJ needs to run for a larger number of iterations. Hence, for both real datasets, ISEM presents faster convergence than the RJ algorithm.

Table 7 summarizes the per-iteration times of the ISEM and RJ algorithms. For the Galaxy data, the average time ISEM takes to run an iteration is 0.0053 s, while the average time for RJ is 0.0098 s; that is, RJ takes on average 1.8491 times longer per iteration than ISEM. For the Acidity data, the average iteration times of ISEM and RJ are 0.0085 and 0.0180 s, respectively, so RJ takes on average 2.2118 times longer per iteration. Similarly to the results from the simulation study, ISEM presents better results, i.e., a shorter time to run the iterations.

**Figure 3.** Performance of the ISEM and RJ algorithms for the Galaxy data.

**Figure 4.** Performance of the RJ algorithm across iterations for the Acidity data.

**Table 7.** Iteration times in seconds.


#### **6. Final Remarks**

This article discusses how to estimate the parameters of a mixture model in the context of data clustering. We propose an alternative to the SEM algorithm, called ISEM, developed through an integrated approach that allows *k*<sub>**c**</sub> to be estimated jointly with the other parameters of interest. In the ISEM algorithm, the allocation probabilities depend on the number of clusters *k*<sub>**c**</sub> and are independent of the number of components *k* of the mixture model.

In addition, there exists a positive probability of a new cluster being created by a single observation. This is an advantage of the algorithm, because a new cluster can be created without the need to specify transition functions. The cluster parameters are updated according to the number of allocated observations: for clusters with at least two observations, the parameter values are set to the maximum likelihood estimates; for a cluster with just one observation, the parameter values are generated from the posterior distribution.

In order to assess the performance of the ISEM algorithm, we developed a simulation study considering four scenarios with artificial data generated from Gaussian mixture models. Each scenario was replicated *M* = 500 times, and the proportion of times that ISEM put the highest probability on the true *k*<sub>**c**</sub> value was recorded. We applied the same procedure to the SEM algorithm, choosing the number of clusters *k*<sub>**c**</sub> via the AIC and BIC, and to the RJ algorithm. The three algorithms were then compared in terms of the proportion of times the true *k*<sub>**c**</sub> value was selected as the best value. The results show that the ISEM algorithm outperforms the RJ and SEM algorithms. Moreover, they also show that the AIC and BIC model selection criteria should not be used to determine the number of clusters in a mixture model, due to their low success rate.

We also compared the performance of ISEM and RJ in terms of empirical convergence of the sequence of generated values, using graphical tools. For this, we selected at random an artificial dataset from each scenario and plotted the probability estimates for *k*<sub>c</sub> across iterations, the generated *k*<sub>c</sub> values, and the estimated autocorrelation of the sampled values (see Figures 1 and 2). Again, the results show a better performance for the ISEM algorithm: while ISEM presents satisfactory stability for the probability of *k*<sub>c</sub> and clearly singles out the true *k*<sub>c</sub> as the best value, the probabilities estimated by RJ do not single out any value of *k*<sub>c</sub>.

To illustrate the practical use of the proposed algorithm and compare its performance with the SEM and RJ algorithms, we applied the three algorithms to two real datasets: the Galaxy and Acidity datasets. For the Galaxy dataset, ISEM indicates *k*<sub>c</sub> = 3 with probability *P*(*k*<sub>c</sub> = 3 | ·) = 0.7024, while the RJ algorithm, the AIC, and the BIC indicate *k*<sub>c</sub> = 5. However, as shown in Figure 3d, the RJ algorithm again does not single out a value of *k*<sub>c</sub>, whereas ISEM singles out *k*<sub>c</sub> = 3, and the values generated across iterations present satisfactory stability. For the Acidity dataset, ISEM, the AIC, and the BIC indicate *k*<sub>c</sub> = 2 as the best value, while RJ attributes similar probabilities to *k*<sub>c</sub> = 3 and *k*<sub>c</sub> = 4.

As mentioned in the Introduction, the generalization of the proposed algorithm to the multivariate case is the next step of our research. The simulation study and the application were carried out in R, and the computational code can be obtained by emailing the authors.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/1099-4300/21/11/1063/s1.

**Author Contributions:** E.F.S. and C.A.d.B.P. developed the whole theoretical part of the research. E.F.S., A.K.S. and L.A.M. developed the simulation studies and real data application.

**Funding:** This research was funded by Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), grant number 308776/2014-3.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Universal Sample Size Invariant Measures for Uncertainty Quantification in Density Estimation**

#### **Jenny Farmer <sup>1</sup>, Zach Merino <sup>1</sup>, Alexander Gray <sup>1</sup> and Donald Jacobs <sup>1,2,\*</sup>**


Received: 7 October 2019; Accepted: 8 November 2019; Published: 15 November 2019

**Abstract:** Previously, we developed a high throughput non-parametric maximum entropy method (PLOS ONE, 13(5): e0196937, 2018) that employs a log-likelihood scoring function to characterize uncertainty in trial probability density estimates through a scaled quantile residual (SQR). The SQR for the true probability density has universal sample size invariant properties equivalent to sampled uniform random data (SURD). Alternative scoring functions are considered that include the Anderson-Darling test. Scoring function effectiveness is evaluated using receiver operator characteristics to quantify efficacy in discriminating SURD from decoy-SURD, and by comparing overall performance characteristics during density estimation across a diverse test set of known probability distributions.

**Keywords:** density estimation; distribution free; non-parametric statistical test; decoy distributions; size invariance; scaled quantile residual; maximum entropy method; scoring function; outlier detection; overfitting detection

#### **1. Introduction**

The rapid and accurate estimate of the probability density function (pdf) for a random variable is important in many different fields and areas of research [1–6]. For example, accurate high throughput pdf estimation is sought in bioinformatics screening applications and in high frequency trading to evaluate profit/loss risks. In the era of big data, data analytics and machine learning, it has never been more important to strive for automated high-quality pdf estimation. Of course, there are numerous other traditional areas of low throughput applications where pdf estimation is also of great importance, such as damage detection in engineering [7], isotope analysis in archaeology [8], econometric data analysis in economics [9], and particle discrimination in high energy physics [10]. The wide range of applications for pdf estimation exemplifies its ubiquitous importance in data analysis. However, a continuing objective regarding pdf estimation is to establish a robust distribution free method to make estimates rapidly while quantifying error in an estimate. To this end, it is necessary to develop universal measures to quantify error and uncertainties to enable comparisons across distribution classes. To illustrate the need for universality, the pdf and cumulative distribution function (cdf) for four distinctly different distributions are shown in Figure 1a,b. Comparing the four cases of pdf and cdf over the same sample range, it is apparent that the data are distributed very differently.

**Figure 1.** Examples of four distribution types in the form of (**a**) pdf and corresponding (**b**) cdf.

The process of estimating the pdf for a given sample of data is an inverse problem. Due to fluctuations in a sample of random data, many pdf estimates will be able to model the data sample well. If additional smoothness criteria are imposed, many proposed pdf estimates can be filtered out. Nevertheless, a pdf estimate will carry intrinsic uncertainty along with it. The development of a scoring function to measure uncertainty in a pdf estimate without knowing the form of the true pdf is indispensable in high throughput applications, where human domain expertise cannot be applied to inspect every proposed solution for validity. Moreover, it is desirable to remove subjective bias from human (or artificial intelligence) intervention. Automation can be achieved by employing a scoring function that measures over-fitting and under-fitting quantitatively, based solely on mathematical properties. The ultimate limit is set by statistical resolution, which depends on sample size.

Solving the inverse problem becomes a matter of optimizing a scoring function, which breaks down into two parts: first, developing a suitable measure that resists under- and over-fitting to the sampled data, which is the focus of this paper; and second, developing an efficient algorithm to optimize the score while adaptively constructing a non-parametric pdf. The second part is accomplished by an algorithm involving a non-parametric maximum entropy method (NMEM) that was recently developed by JF and DJ [11] and implemented as the "PDFestimator." Similar to a traditional parametric maximum entropy method (MEM), NMEM employs Lagrange multipliers as coefficients of orthogonal functions within a generalized Fourier series. The non-parametric aspect of the process derives from employing a data driven scoring function to select an appropriate number of orthogonal functions, as their Lagrange multipliers are optimized to accurately represent the complexity of the data sample that ultimately determines the features of the pdf. The resolution of features that can be uncovered without over-fitting naturally depends on the sample size.

Some important results in statistics [12] that are critical to obtain universality in a scoring function are summarized here. For a univariate continuous random variable, $X$, the cdf is given by $F_X(x)$, which is a monotonically increasing function of $x$ and, irrespective of the domain, the range of $F_X(x)$ is the interval (0, 1). A new random variable, $R$, that spans the interval (0, 1) is obtained through the mapping $r = F_X(x)$. The cdf for the random variable $R$ can be determined as follows,

$$F(r) = P(R \le r) = P(F_X(x) \le r) = P(X \le F_X^{-1}(r)) = F_X(F_X^{-1}(r)) = r \tag{1}$$

Since the pdf for the random variable $R$ is given by $f(r) = dF(r)/dr = 1$, it follows that $R$ has a uniform pdf on the interval (0, 1). Furthermore, due to the monotonically increasing property of $F_X(x)$, it follows that a sort-ordered set of $N$ random numbers $\{x_k\}_N$ maps to the transformed set of random numbers $\{r_k\}_N$ in a 1 to 1 fashion, where $k$ is a labeling index that runs from 1 to $N$. In particular, for indices $k' > k$, it is the case that $r_{k'} \ge r_k$. The 1 to 1 mapping that takes $X \to R$ has important implications for assessing the quality of a pdf estimate. The universal nature of this approach is that, for a given sample of random data and no a priori knowledge of the underlying functional form of the true pdf, an evaluation can be made of the transformed data.
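The probability integral transform derived above is easy to check numerically. A minimal sketch, in which the Gaussian example and its parameters are illustrative rather than taken from the paper:

```python
import math
import numpy as np

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Gaussian cdf F_X, written via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

rng = np.random.default_rng(0)
x = np.sort(rng.normal(2.0, 3.0, size=5000))        # sort-ordered sample {x_k}
r = np.array([normal_cdf(v, 2.0, 3.0) for v in x])  # r_k = F_X(x_k), Eq. (1)

# The monotone map preserves sort order, and r is ~uniform on (0, 1):
assert np.all(np.diff(r) >= 0)
assert abs(r.mean() - 0.5) < 0.05                   # uniform mean is 1/2
```

Applying a wrong cdf (e.g., `normal_cdf(v, 0.0, 1.0)` here) produces transformed data that visibly deviate from uniformity, which is the deviation the scoring functions below quantify.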

Given a high-quality pdf estimate from an estimation method, $\hat{f}_X(x)$, the corresponding estimated cdf, $\hat{F}_X(x)$, will exhibit sampled uniform random data (SURD). Conversely, for a given sample from the true pdf, a poor trial estimate, $\hat{f}_X(x)$, will yield transformed random variables that deviate from SURD. The objective of this work is to consider a variety of measures that can be used as a scoring function to quantify the uncertainty in how close the estimate $\hat{f}_X(x)$ is to the true pdf, based on how closely the sort order statistics of $\hat{F}_X(\{x_k\})$ match the sort order statistics of SURD. The powerful concept of using sort order statistics to quantify the quality of density estimates [13] will be leveraged to construct universal scoring functions that are sample size invariant.

The strategy employed in the NMEM is to iteratively perturb a trial cdf and evaluate it with a scoring function. By means of a random search using adaptive perturbations, the trial cdf with the best score is tracked until the score reaches a threshold where optimization terminates. At this point, the trial cdf is within an acceptable tolerance to the true cdf and constitutes the pdf estimate. Different outcomes are possible since the method is based on a random fitness-selection process to solve an inverse problem. The role of the scoring function in the NMEM includes defining the objective target for optimizing the Lagrange multipliers, providing stopping criteria for adding orthogonal functions in the generalized Fourier series expansion and marking a point of diminishing returns where further optimizing the Lagrange multipliers results in over-fitting to the data. Simply put, the scoring function provides a means to quantify the quality of the NMEM density estimate. Optimizing the scoring function in NMEM differs from traditional MEM approaches that minimize error in estimates based on moments of the sampled data. Note that the universality of the scoring function eliminates problems with heavy tailed distributions that have divergent moments. Nevertheless, Lagrange multipliers are determined based on solving a well defined extremum problem in both cases.

Before tackling how to evaluate the efficacy of scoring functions, a brief description is given here of how the quality of a pdf estimate can be assessed without knowing the true pdf. Visualizing a quantile-quantile plot (QQ-plot) is a common approach to determining whether two random samples come from the same pdf. Given a set of $N$ sort-ordered random variables $\{x_k\}_N$ that are monotonically increasing, along with a cdf estimate, the corresponding empirical quantiles are determined by the mapping $\{r_k\}_N = \hat{F}_X(\{x_k\}_N)$ as described above. It is not necessary to have a second data set to compare. As described previously [11], the empirical quantile can be plotted on the y-axis versus the theoretical average quantile for the true pdf on the x-axis. From single order statistics (SOS), the expectation value of $r_k$ is given by $\mu_k = k/(N+1)$ for $k = 1, 2, \ldots, N$, which gives the mean quantile. Figure 2a illustrates the QQ-plot for the distributions shown in Figure 1. The benefit of the QQ-plot is that it is a universal measure. Unfortunately, for large sample sizes, the plot is no longer informative because all curves approach a perfect straight line as random fluctuations decrease with increasing sample size. A quantile residual (QR) allows deviations from the mean quantile to be readily visualized when one sample size is considered. However, as illustrated in Figure 2b, the residuals in a QR-plot decrease as sample size increases. Hence, the quantile residual is not sample size invariant.

The QR-plot is scaled [11] in such a way as to make the scaled quantile residual (SQR) sample size invariant. From SOS, the standard deviation of the empirical quantile about the mean quantile is well-known to be $\sigma_k = \sqrt{\mu_k(1-\mu_k)}/\sqrt{N+2}$, where $k$ is the sort order index. Interestingly, all fluctuations, regardless of the value of the mean quantile, scale with sample size as $1/\sqrt{N+2}$. Sample size invariance is achieved by defining SQR as $\sqrt{N+2}\,(r_k - \mu_k)$ and, when plotted against $\mu_k$, one obtains an SQR-plot. Figure 2c shows an SQR-plot for three different sample sizes for each of the four distributions considered in Figure 1. It is convenient to define contour lines using the formula $s_f\sqrt{\mu(1-\mu)}$, where the scale factor, $s_f$, can be adjusted to control how frequently points on the SQR-plot will fall within a given contour. In particular, 99% of the time the SQR points will fall within the boundaries of the oval when bounded by $\pm 2.58\sqrt{\mu(1-\mu)}$. Scale factors of 1.65, 1.96, 2.58 and 3.40 lead to 90%, 95%, 99% and 99.9% of SQR points falling within the oval, based on numerical simulation. Interestingly, the scale factors of 1.65, 1.96, 2.58 and 3.40 respectively correspond to the $z$-values of a Gaussian distribution at the 90%, 95%, 99% and 99.9% confidence levels.

**Figure 2.** For each of the four distributions shown in Figure 1 and for sample sizes *N* = 50, 500, 5000 shown in all panels with same distinct colors, an empirical quantity is plotted as a function of the theoretical average quantile. The panels show (**a**) QQ-plot, (**b**) QR-plot and (**c**) SQR-plot. Only the SQR-plot is sample size invariant. As an illustration of universality in all panels, any of the colored lines could represent any one of the four distributions.

The SQR-plot provides a distribution free visualization tool to assess the quality of a cdf estimate in three ways. First, when the SQR falls appreciably within the oval that encloses 99% of the residual, it is not possible to reject the null hypothesis. Second, when the SQR exhibits non-random patterns, this is an indication of systematic error introduced by the estimator method. Finally, when the SQR has suppressed random fluctuations such that it is close to 0 for an extended interval, this indicates that the pdf estimate is over-fitting to the sample data. In general, over-fitting is hard to quantify [14]. As the graphical abstract shows, it is possible to plot the SQR against the original random variable *x* instead of the mean quantile. Doing this deforms the oval or "lemon drop" shape of the SQR-plot but it directly shows where problems in the estimate are locally occurring in relation to the pdf estimate. The aim of this paper is to quantify these salient features of an SQR-plot using a scoring function.
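A minimal sketch of the SQR construction and the contour check described above; the sample sizes and the 2.58 scale factor follow the text, while everything else is an illustrative choice:

```python
import numpy as np

def sqr(r_sorted):
    """Scaled quantile residual: SQR_k = sqrt(N+2)*(r_k - mu_k), mu_k = k/(N+1)."""
    n = len(r_sorted)
    mu = np.arange(1, n + 1) / (n + 1.0)
    return mu, np.sqrt(n + 2.0) * (np.asarray(r_sorted) - mu)

rng = np.random.default_rng(1)
for n in (50, 500, 5000):                        # same scale at every sample size
    mu, s = sqr(np.sort(rng.uniform(size=n)))    # true SURD
    contour = 2.58 * np.sqrt(mu * (1.0 - mu))    # 99% contour from the text
    frac_inside = np.mean(np.abs(s) <= contour)  # typically close to 0.99
    print(n, round(float(frac_inside), 2))
```

Feeding in quantiles from a poor cdf estimate instead of true SURD drives the residual outside the contour, which is the first of the three diagnostics listed above.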

This work was motivated by the concern that different scoring functions will likely perform differently in terms of speed and accuracy in NMEM. The scoring function that was initially considered was constructed from the natural logarithm of the product of probabilities for each transformed random variable, given by $\hat{F}_X(\{x_k\})$. This log-likelihood scoring function provides one way to measure the quality of a proposed cdf. Interestingly, the log-likelihood scoring function has a mathematical structure similar to the commonly employed Anderson-Darling (AD) test [15,16]. As such, the current study considers several alternative scoring functions that use SQR and compares how sensitive they are in quantifying the quality of a pdf estimate. Other types of information measures that use cumulative relative entropy [17] or residual cumulative Kullback–Leibler information [18,19] are possible. However, these alternatives are outside the scope of this study, which focuses on leveraging SQR properties. The scoring function must exhibit distribution free and sample size invariant properties so that it can be applied to any sample of random data of a continuous variable, and also to sub-partitions of the data when employed in the PDFestimator. It is worth noting that all the scoring functions presented in this paper exhibit desirable properties with similar or greater efficacy than the AD scoring function, and all are useful for assessing the quality of density estimates.

In the remainder of this paper, a numerical study is presented to explore different types of measures for SQR quality. The initial emphasis is on constructing sensitive quality measures that are universal and sample size invariant. These scoring functions based on SQR properties can be applied to quantify the accuracy (or "goodness of fit") of a pdf estimate created by any methodology, without knowledge of the true pdf. The SQR is readily calculated from the cdf, which is obtained by integrating the pdf. To determine which scoring function best distinguishes between good and poor cdf estimates, the concept of decoy SURD is introduced. Once decoys are generated, Receiver Operator Characteristics (ROC) are employed to identify the most discriminating scoring function [17]. In addition to ROC evaluation, the performance of the PDFestimator with different plugged-in scoring functions is evaluated. This benchmark is important because the scoring function is expected to affect the rate of convergence toward a satisfactory pdf estimate using the NMEM approach. After discussing the significance of the results, several conclusions are drawn from an extensive body of experiments.

#### **2. Results**

#### *2.1. Sample Size Invariant Scoring Functions*

Seven scoring functions are defined in Table 1. At the moment, the input to these scoring functions is SURD of sample size $N$. Specifically, $N$ random numbers are independently and identically drawn uniformly on the interval (0, 1) and then sort ordered to give SOS, represented by the set $\{r_k\}_N$ where $0 < r_k \le r_{k+1} < 1$ for all $k$. For sample size $N$, a scoring function of type $t$ is evaluated as $S_t(\{r_k\}_N)$, which defines a new random variable that is simply denoted as $S_t(N)$. A scoring function is scale invariant if the probability density for $S_t(N)$ is independent of sample size, which typically holds only for large $N$. However, finite size corrections are made for each scoring function and are listed in Table 1. In all cases, the finite size corrections are empirically determined based on numerical simulation to achieve approximate scale invariance for $N \ge 9$. All reported coefficients carry an uncertainty of 3 in the last significant figure, denoted as in 0.406(3) or 11.32(3).


**Table 1.** Scoring function definitions and finite size corrections.

As defined in Table 1, the proposed scoring functions include the relevant part of the Anderson-Darling (AD) measure [15], denoted as $S_{AD}$, and the quasi log-likelihood formula [11], denoted as $S_{LL}$. Note that $S_{LL} = \log\left[\prod_k p_k(r_k)\right]$, where $p_k(r_k)$ is the exact pdf corresponding to a beta distribution that describes the random variable $r_k$ as derived from SOS [13]. The quasi log-likelihood is not an exact log-likelihood; rather, $S_{LL}$ corresponds to a mean field approximation where correlations between the random variables $\{r_k\}_N$ are neglected. Another scoring function is defined as $S_{VAR} = \langle z_k^2 \rangle$, where $z_k = (r_k - \mu_k)/\sigma_k$. As mentioned in the Introduction, $\mu_k = \langle r_k \rangle = k/(N+1)$ is the mean quantile of the $k$-th random variable and $\sigma_k = \sqrt{\mu_k(1-\mu_k)}/\sqrt{N+2}$ is the standard deviation of the $k$-th random variable about its mean. Essentially, $S_{VAR}$ is the mean variance of a "z-value" for SOS.
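Under the definitions above, $S_{VAR}$ and an AD-style score can be sketched as follows; the standard Anderson-Darling statistic for data on (0, 1) is assumed here as the "relevant part" of $S_{AD}$, whereas the paper's exact forms and finite size corrections are those of Table 1:

```python
import numpy as np

def sos_moments(n):
    """Mean and standard deviation of the k-th sorted uniform variate (SOS)."""
    k = np.arange(1, n + 1)
    mu = k / (n + 1.0)
    sigma = np.sqrt(mu * (1.0 - mu)) / np.sqrt(n + 2.0)
    return mu, sigma

def s_var(r_sorted):
    """S_VAR = mean of z_k^2, with z_k = (r_k - mu_k)/sigma_k."""
    mu, sigma = sos_moments(len(r_sorted))
    return float(np.mean(((np.asarray(r_sorted) - mu) / sigma) ** 2))

def s_ad(r_sorted):
    """Standard Anderson-Darling statistic for sorted data on (0, 1)."""
    r = np.asarray(r_sorted)
    n = len(r)
    k = np.arange(1, n + 1)
    return float(-n - np.mean((2 * k - 1) * (np.log(r) + np.log(1.0 - r[::-1]))))

rng = np.random.default_rng(3)
scores = [s_var(np.sort(rng.uniform(size=1000))) for _ in range(200)]
print(round(float(np.mean(scores)), 2))   # concentrates near 1 for true SURD
```

Since $\mathrm{Var}(r_k) = \mu_k(1-\mu_k)/(N+2) = \sigma_k^2$ exactly, the expectation of $S_{VAR}$ under SURD is 1, which the simulated average reflects.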

Despite sharing a similar mathematical form, the $S_{AD}$ and $S_{LL}$ scoring functions are not the same, even in the limit $N \to \infty$. At face value, these functions look very different. However, after shifting the origin of these functions to their natural reference points and scaling $S_{LL}$ by a factor of $-2$, which was empirically determined to obtain data collapse, these two measures are remarkably similar. To demonstrate this, let $S'_{AD} \equiv S_{AD}(\{r_k\}) - S_{AD}(\{\mu_k\})$ and $S'_{LL} \equiv -2\left[S_{LL}(\{r_k\}) - S_{LL}(\{\mu_k\})\right]$. The natural reference points $S^o_{AD}$ and $S^o_{LL}$ are respectively defined as $S_{AD}$ and $S_{LL}$ evaluated at the mean quantiles. Figure 3a,d show that the pdf for $S'_{AD}$ and the pdf for $S'_{LL}$ are approximately sample size invariant and markedly similar. Interestingly, $S'_{AD}$ has superior sample size invariance because it reaches its asymptotic limit extremely fast, as reported almost 60 years ago [20].

To improve or create a scale invariant scoring function, finite size corrections are incorporated by transforming $S_t(N)$ to a z-value. For all score types, $Z_t = (S_t - \mu_t)/\sigma_t$, where $\mu_t$ is the average of $S_t$ and $\sigma_t$ is the standard deviation of $S_t$ about its mean. All shifts and scale factors used to transform $S_t(N) \to Z_t(N)$ are given in Table 1. Figure 3a,d,g show that, after finite size corrections, the pdfs for the three scoring functions $Z_{AD}$, $Z_{LL}$ and $Z_{VAR}$ exhibit excellent scale invariance. Furthermore, the pdfs for these scoring functions fall on top of one another in a massive data collapse (data not shown), indicating that they share the same pdf for all practical purposes. It is worth noting that, because this is a numerical study, there is uncertainty in the formulas that define the corrections for finite sample size. As can be clearly seen in Figure 3a, the AD measures display the most impressive data collapse even before finite size corrections are applied. Indeed, the observed data collapse from numerical simulation is tighter than the intrinsic uncertainties in the correction for finite size samples. In contrast, the log-likelihood measure has the most dispersion in its data collapse before finite size corrections are applied; in this case, the finite sample size corrections greatly improve the data collapse.

The most surprising result is that this numerical study demonstrates that *ZVAR* shares the same pdf as *ZAD*. This result is surprising because both *ZAD* and *ZLL* involve linear combinations of logarithms, while *ZVAR* has no logarithms. However, it is not surprising that *ZVAR* has good scaling properties because the function is defined in terms of the scaled variable, otherwise called the z-value. The transformation to the z-value naively sets the mean to the origin and normalizes the variance. As such, it would be somewhat surprising if *ZVAR* did not exhibit data collapse as a function of the z-value. Given that *ZVAR* scales, it is expected that generalized moments of the z-value variable will exhibit data collapse and also exhibit sample size invariance.

From a practical standpoint, it is computationally faster to work with $Z_{VAR}$. Therefore, additional scoring functions defined as $S_p = \langle |z_k|^p \rangle^{1/p}$ for $p = \frac{1}{2}, 1, 2, 3, 4$ were considered. Note that $S_2$ is the standard deviation of $z_k$ and, after finite size corrections are applied, $S_p \to Z_p$. The cases $p = \frac{1}{2}$ and $p = 4$ are listed in Table 1 and exhibit scale invariance, as shown in Figure 3b,e respectively. The $p = 1, 2, 3$ cases (data not shown) are similar and straddle the limiting cases smoothly. It is worth mentioning that the natural reference at the mean quantile is zero for $S_{VAR}$ and $S_p$.

By exploring SURD for additional patterns, it was observed that two disjoint blocks of the same size can be compared using double order statistics (DOS). Among all random variables $\{r_k\}_N$, the indices that span from $k^1_o$ to $k^1_f$ define block 1 and the indices that span from $k^2_o$ to $k^2_f$ define block 2. Without loss of generality, block 2 is taken to be to the right of block 1, such that $k^1_o < k^1_f < k^2_o < k^2_f$. With $m$ random variables in both blocks, the $m-1$ differences given by $\delta_k = r_{k+1} - r_k$ are used in the scoring function $S^{2,1}_{LR} = \langle \log(\delta^2_k/\delta^1_k) \rangle$, which simplifies to $S^{2,1}_{LR} = \langle \log(\delta^2_k) \rangle - \langle \log(\delta^1_k) \rangle$. Importantly, $\langle \log(\delta^j_k) \rangle$ is calculated for all disjoint blocks at once. By partitioning all random variables into equal blocks of indices, the mean log-ratio is calculated rapidly for all pairs of blocks. For any size block and for any pair of blocks, $S^{(i,j)}_{LR}$ exhibits strong scale invariance, as shown in Figure 3h. Over a hundred diverse cases are shown as gray lines. Interestingly, the pdf for $S^{(i,j)}_{LR}$ is essentially a normal distribution, shown as a red line.

**Figure 3.** Illustration of sample size invariance in the probability density function for various scoring functions. The sample sizes selected in panels (**a**,**b**,**d**,**e**,**g**,**h**) to show data collapse include *N* = 9, 11, 12, 14, 17, 20, 24, 33, 49, 95, 110, 124, 142, 166, 199, 249, 332, 497, 990, 1500, 2015, 3298, 5505, 8838, 14,467, 23,684, 38,771, 63,471, 103,905, 272,389, 750,000, 1,000,000, 2,000,000. The sample sizes selected in panels (**c**,**f**,**i**) include *N* = 10, 50, 200, 1000, 5000, 20,000, 100,000. In panel (**h**), the results from each system size along with all partitions made within each system size (a total of 111 cases) are plotted as light gray lines. The red line shows the result of a normal distribution, indicating that the scaling is well described by a normal distribution. All other details are described in the text.

Because $S^{(i,j)}_{LR}$ is localized to a pair of blocks, to cover the entire SQR-plot a new scoring function is constructed by taking the root mean square of all distinct pairs of $S^{(i,j)}_{LR}$. For a calculation time proportional to sample size, the size of a block is set proportional to $\sqrt{N}$, which necessarily makes the number of blocks, $N_b$, also proportional to $\sqrt{N}$. The pdf for $RMS_{LR}$ is nearly sample size invariant, as shown in Figure 3i. From Table 1, it appears that the finite size corrections for $RMS_{LR}$ are complicated. However, as will be discussed below, scale invariance should be preserved for sub-samples of the data, called partitions; it turns out that only $RMS_{LR}$ requires special attention to make partitions scale, where $N_p$ is the number of data points being sub-sampled. Finally, the absolute values of the measures $Z_{VAR}$ and $Z_4$ are respectively shown in Figure 3c,f. Note that taking the absolute value of a measure that is scale invariant will remain scale invariant.
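The block log-ratio scores and their root mean square can be sketched as follows; details such as the use of `array_split` for nearly equal blocks are illustrative choices, not the paper's implementation:

```python
import numpy as np

def block_mean_logs(r_sorted, n_blocks):
    """<log(delta_k)> within each of n_blocks equal blocks, computed once."""
    logd = np.log(np.diff(r_sorted))   # log spacings, delta_k = r_{k+1} - r_k
    return np.array([b.mean() for b in np.array_split(logd, n_blocks)])

def s_lr(r_sorted, n_blocks):
    """All pairwise scores S_LR^(j,i) = <log d^j> - <log d^i>, block j right of i."""
    m = block_mean_logs(r_sorted, n_blocks)
    i, j = np.triu_indices(n_blocks, k=1)
    return m[j] - m[i]

def rms_lr(r_sorted):
    """Root mean square over all distinct block pairs, with ~sqrt(N) blocks."""
    n_blocks = max(2, int(round(np.sqrt(len(r_sorted)))))
    return float(np.sqrt(np.mean(s_lr(r_sorted, n_blocks) ** 2)))

rng = np.random.default_rng(5)
print(round(rms_lr(np.sort(rng.uniform(size=4000))), 2))  # small for true SURD
```

Computing the per-block mean log spacings once and then differencing them for every pair realizes the "calculated for all disjoint blocks at once" shortcut described above.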

#### *2.2. Redundant and Complementary Information*

Since the pdfs of different scoring functions may be similar or the same, the next question addressed is how different measures compare when applied to the same SURD. For sample size $N$, SURD is generated using numerical simulation and each measure is evaluated per realization of $\{r_k\}_N$. For 100,000 random trials per $N$, a 1 to 1 comparison is made between $Z_a(N)$ and $Z_b(N)$ with $a \neq b$. Note that, by definition, $Z_t(N)$ has a mean of zero and a standard deviation of 1. For reasons that will become clear below, absolute values are taken of the scoring functions. Despite the pdfs for $|Z_{VAR}|$, $|Z_{LL}|$ and $|Z_{AD}|$ being practically identical for all sample sizes, scatter plots indicate that the scores are not identical on a 1 to 1 basis. Figure 4a,b plot $|Z_{VAR}|$ and $|Z_{LL}|$ against $|Z_{AD}|$, respectively. Although there is always a tight linear correlation, there is more scatter in the comparison at smaller sample sizes. As $N \to \infty$, the different scores converge to the same value, although the approach to the asymptotic limit differs for each measure. These differences have important implications for application to density estimation, as discussed below.

**Figure 4.** Examples of pairwise comparisons of different measures through scatter plots. (**a**,**b**) show that the |*ZAD*| measure is statistically the same as the |*ZLL*| and |*ZVAR*| measures. (**c**) Shows mild differences between |*Z*4| and |*ZVAR*|. (**d**) Shows that the information content between *RMSLR* and |*ZVAR*| is very different.

The scatter plot of $|Z_4|$ versus $|Z_{VAR}|$ in Figure 4c shows that these two measures characterize SURD in fundamentally different ways, due to the strong deviation of $|Z_4|$ relative to $|Z_{VAR}|$ with modest statistical scatter. The greatest non-linear deviation between the two scores occurs at large values of $|Z_{VAR}|$, corresponding to outliers in SURD. The scatter plot of $RMS_{LR}$ versus $|Z_{VAR}|$ in Figure 4d shows strong random scatter with no discernible deterministic dependence. Hence, $RMS_{LR}$ and $|Z_{VAR}|$ measure different SURD characteristics. Yet, despite their conspicuous differences, the pdfs for $|Z_4|$ and $RMS_{LR}$ are qualitatively similar, as shown in Figure 3f,i, respectively.

As demonstrated by the scatter plots, the various scoring functions characterize SURD in different or similar ways relative to one another. Note that combining measures with complementary properties can potentially lead to a more sensitive measure. Through reductive analysis, a composite score (CS) is proposed as:

$$\mathrm{CS} = \left| Z_{VAR} + 0.666 \right| + \left[ \max\left( 2.5,\ \left| Z_4 \right|,\ \mathrm{RMS}_{LR} \right) - 2.5 \right] \tag{2}$$

In constructing CS, the most probable score for $Z_{VAR}$, near 0.666, is used as a baseline. Contributions are then added from outliers in either $|Z_4|$ or $\mathrm{RMS}_{LR}$, whichever is larger. The last term does not modify the score when no outlier is detected; otherwise, the contribution to CS increases continuously from zero just above the threshold for outlier detection.
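Equation (2) translates directly into code; a minimal sketch:

```python
def composite_score(z_var, z4, rms_lr):
    """Composite score per Eq. (2): a baseline term |Z_VAR + 0.666| plus an
    outlier term that stays at zero unless |Z_4| or RMS_LR exceeds 2.5."""
    return abs(z_var + 0.666) + (max(2.5, abs(z4), rms_lr) - 2.5)

print(composite_score(-0.666, 1.0, 1.0))   # -> 0.0: no outlier term triggered
print(composite_score(-0.666, 3.0, 1.0))   # -> 0.5: the |Z_4| outlier contributes
```

The `max(2.5, ...)` clamp is what keeps the outlier term at exactly zero below the detection threshold, matching the description above.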

#### *2.3. Partition Size Invariance*

A critical part of the algorithm in the PDFestimator [11] is that the input data sample is partitioned into hierarchical sub-samples by powers of 2 when *N* > 1025. Consequently, the employed scoring function should be sample size invariant for all partitions. Invariance with respect to partition size, *Np*, is satisfied by all scoring functions described in this work, as exemplified in Figure 5 for three of the most distinct measures. Furthermore, for any realization of SURD of size *N*, all partitions within it have essentially the same score, independent of the type of scoring function.

**Figure 5.** |*ZVAR*|, |*Z*4| and *CS* illustrate the three most distinct measures considered. Data collapse based on the probability density for different measures is demonstrated for *N* = 10, 50, 200, 1000, 5000, 20,000, 100,000 in addition to *Np* = 1025, 2049, 4097, 8193, 16,385, 32,769, 65,537. A different color is used for each sample size.

A necessary requirement for all the scoring functions is that sub-sampling must be uniformly distributed over the data. It is worth noting that *SAD* (and its corresponding *ZAD*) is particularly sensitive to the way the uniform sub-sampling is performed within a partition. Due to the form of the *SAD* equation, it is critical that the selected points are symmetric about the center index in the sort ordering. The number of samples used in a partition is always odd, of the form *Np* = 1 + 2*<sup>n</sup>*. Thus, the median point is included, and for each index selected for the sub-sample below the median, a corresponding mirror image index above the median is selected. For example, if there are 17 indices in the full sample, the indices 1, 4, 9, 14, 17 have the required mirror symmetry. The other scoring functions are not sensitive to breaking mirror symmetry.
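The mirror-symmetric selection described above can be sketched as follows (an illustrative Python helper, not the PDFestimator's actual implementation; 1-based indices are used to match the text):

```python
def mirror_symmetric_indices(n, below):
    """Build a sub-sample index set symmetric about the median of a
    sort-ordered sample of odd size n (1-based indices).

    `below` lists the chosen indices strictly below the median; each is
    mirrored to n + 1 - i above the median, and the median itself is kept.
    """
    assert n % 2 == 1, "partition sizes are odd, of the form 1 + 2^n"
    median = (n + 1) // 2
    assert all(i < median for i in below)
    mirrored = [n + 1 - i for i in reversed(below)]
    return below + [median] + mirrored

# Reproduces the example from the text: 17 indices -> 1, 4, 9, 14, 17.
print(mirror_symmetric_indices(17, [1, 4]))  # [1, 4, 9, 14, 17]
```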

#### *2.4. Decoy SURD*

For the purpose of quantifying how well a scoring function discriminates between true SURD and random data that is not SURD, a controlled decoy-SURD (dSURD) is generated. Let {*r<sup>o</sup><sub>k</sub>*} define SURD and let {*r<sup>d</sup><sub>k</sub>*} define dSURD. As described in detail in Section 4.4, a decoy cdf, *Fd*(*r*), is constructed to facilitate the 1 to 1 mapping {*r<sup>d</sup><sub>k</sub>*} = *Fd*({*r<sup>o</sup><sub>k</sub>*}). If *Fd*(*r*) = *r*, then the output is identical to the input. A decoy-SURD is controlled by adding a perturbation of the form *Fd*(*r*) = *r* + Δ(*r*). By choosing various functional forms for the perturbation and controlling its amplitude, it is a simple matter to create a broad spectrum of decoys that range from impossible to markedly obvious to detect at any specified sample size.

In Figure 6, the middle row shows the decoy cdf resulting from the perturbations shown along the top row. This is an example of a moderately hard dSURD because, by eye, the decoy cdf looks close to a perfect straight line. To make it clear that dSURD is indeed different from SURD, the pdf for each case is shown along the bottom row. For a sufficiently large sample size, statistical resolution will be good enough to resolve these small perturbations, but for smaller sample sizes the perturbation will not be detectable. To demonstrate how statistical resolution increases with larger sample sizes, Figure 7 shows SQR-plots for SURD and its corresponding dSURD for sample sizes of 1000, 5000, 20,000 and 100,000. These three cases are examples of localized perturbations.

**Figure 6.** Top row shows three examples of localized perturbations for moderately difficult decoys. Center row shows the corresponding cdf. Bottom row shows the pdf, where the cyan horizontal line highlights the probability density function (pdf) for sampled uniform random data (SURD).

**Figure 7.** Progression of scaled quantile residual (SQR)-plots for moderately difficult localized decoys as sample size increases.


Three additional perturbations of an extended type are shown in Figure 8 using the same layout. The last column plots the perturbation, cdf and pdf as dashed red lines because "reduced fluctuation" is a special type of perturbation that is also explained in Section 4.4. As the name implies, fluctuations are suppressed, representing a scenario where a pdf estimate over-fits the data. Figure 9 shows the SQR-plots for SURD and its corresponding dSURD of the extended type for sample sizes of 1000, 5000, 20,000 and 100,000. Note that the reduced fluctuation perturbation is equally detectable at any sample size because fluctuations are suppressed by a fixed proportion in relation to true SURD.

By comparing measures applied to dSURD and SURD, it can be expected that the more sensitive scoring function is one that detects a given perturbation at smaller sample sizes compared to other scoring functions. It is also expected that a certain scoring function will be able to detect certain types of perturbations more readily than other types of perturbations. As such, it is likely impossible to find a perfect scoring function that performs best on all decoy types all the time. Nevertheless, for a given diverse set of dSURD examples, the best overall performing scoring functions with the greatest sensitivity or selectivity can be deduced using receiver operator characteristics.

#### *2.5. Receiver Operator Characteristics*


Receiver operator characteristics (ROC) are calculated based on simulation data involving 10,000 trials of SURD over a broad range of *N* samples, and for each SURD, many dSURD mappings are generated for each of the six decoy types shown above. Results are exemplified in Figure 10, showing ROC curves for three different sample sizes and six different decoy types. ROC curves quantify the efficacy of a scoring function in discriminating SURD from dSURD. Figure 10 shows representative results for moderately difficult decoys. As a point of reference, easy, moderate and hard decoys are aimed at requiring about 1000, 10,000 and 100,000 samples to have sufficient statistical resolution to notice dSURD just barely by eye (e.g., see Figures 7 and 9). Only the decoy that reduces fluctuations using a fixed scale factor has the same difficulty for detection independent of sample size.

**Figure 8.** Extended perturbations for moderately difficult decoys. The cyan horizontal line shown on the bottom panels defines the pdf for SURD. The red dashed lines represent suppression of fluctuations.

**Figure 9.** Progression of SQR-plots for moderately difficult extended decoys as sample size increases.


**Figure 10.** The qualitative features of receiver operator characteristic (ROC) curves are shown for sample sizes of 5000, 20,000 and 100,000 along the top, middle and bottom rows. The left, middle and right columns correspond to the |*ZVAR*|, |*Z*4| and *CS* scoring functions. The (y-axis, x-axis) corresponds to the fraction of true (positives, negatives) having a range from 0 to 1. Each ROC curve compares 6 different decoy types.

It is common practice to quantify ROC curves by their area under the curve (AUC). Table 2 gives the AUC values for all cases shown in Figure 10. The ROC curves and the results listed in Table 2 clearly show that *CS* detects decoys better than the other measures. Of course, informed by reductive analysis, this result was intended by design during the construction of *CS* given in Equation (2). In summary, it is generally found that the *ZAD*, *ZLL*, *ZVAR*, |*ZAD*|, |*ZLL*|, |*ZVAR*|, *Z*4, |*Z*4| and *CS* scoring functions are all good measures for distinguishing SURD from easy-to-detect dSURD. However, it is always possible to create decoy SURD that will go undetected by any measure (e.g., Figure 10).


**Table 2.** Area under the ROC curves shown in Figure 10.

In general, *ZAD*, *ZLL* and *ZVAR* share similar ROC curves, and |*ZVAR*| and |*Z*4| have similar ROC curves. The most sensitive scoring function is *CS*. The reason *Zt* and |*Zt*| are considered as two separate cases is now easily explained. First note that *Zt* has a mean of zero and a standard deviation of 1. For a decoy type of "reduced fluctuations" that mimics an over-fitting scenario, the ROC curve becomes inverted for any type of measure, *Zt*. However, the inversion problem is eliminated when considering |*Zt*| because both over-fitting and under-fitting are detected when |*Zt*| is large. Finally, only the combined score, *CS*, readily detects very localized perturbations due to its *RMSLR* component.

#### *2.6. PDF Estimation Performance*

Figure 11 summarizes the comparative statistics for failure rates. The bar plots in Figure 11a report averages across distributions and random samples for cumulative ranges of sample sizes. As expected, the failure rate increases with sample size. For all scoring methods, average failure rates are typically on the order of 10% for sample sizes less than one million. Failure rate averages are lowest for |*Z*4| and |*ZLL*|, a trend that holds across sample sizes. The associated box plots in Figure 11b more clearly demonstrate the computational advantage of |*Z*4| and |*ZLL*| over the other scoring methods. All scoring methods have between 50 and 60 outliers, but |*Z*4| and |*ZLL*| have virtually no failures outside of these extreme values.

**Figure 11.** (**a**) Cumulative averages of failure rates across four sample size ranges; (**b**) distribution of failure rates for each scoring method. Box plots show interquartile ranges, and whiskers represent the range of the data excluding outliers, which are shown as red crosses.

For computational time and Kullback-Leibler (KL) divergence [21], or simply KL, care must be taken to ensure a fair comparison, accounting for failure rates. Thus, a subset of the data is considered for these measurements. Of the 275 test sets (25 distributions at 11 sample sizes), 230 of these contain at least 10 successes out of the 100 trials, across all five scoring methods. The remaining 45 tests contain failure rates greater than 90% for at least one scoring method and are eliminated from further comparison, ensuring an equitable comparison across successful distributions and sample sizes. The results are shown in Figure 12.

Computational time comparisons prove to be the most challenging to pin down, due to wide variations between distributions, sample sizes and random trials. However, Figure 12a demonstrates a clear advantage in the average computational time for |*Z*4|, across all sample sizes. Once again, the number of outliers, which are compressed for clarity in Figure 12b, is roughly the same across the five scoring methods. However, |*ZAD*| has a higher range of typical runtimes, as well as higher averages in the smallest sample sizes. The KL-divergence comparisons shown in Figure 12c,d are less variable between scoring methods. A lower divergence between the estimate and the known reference distribution suggests a better estimate is being made. Figure 12c shows a decreasing KL-divergence with increasing sample size for all scoring methods, which demonstrates expected convergence, albeit with diminishing returns for larger sample sizes. Notably, |*ZAD*| produces slightly lower KL-divergence on average, compared to the other methods.

**Figure 12.** Comparative statistics between five scoring methods, averaged over successful solutions. Cumulative averages for (**a**) performance time across four sample size ranges and (**c**) Kullback-Leibler divergence [21]. Panels (**b**,**d**) show box plots for the respective data shown in panels (**a**,**c**). Box plots show interquartile ranges, and whiskers represent the range of the data excluding outliers, which are shown as red crosses.

#### **3. Discussion**

Each of the five scoring methods has been evaluated within the PDFestimator, applied to the same distribution test set, in terms of scalability, sensitivity, failure rate and KL-divergence. Each of the proposed measures has strengths and weaknesses in different areas. The |*ZAD*| measure produces the most accurate scaling and the lowest KL-divergence. The *CS* measure shows the greatest sensitivity for detecting small deviations from SURD. The |*ZLL*| method, although not a clear winner in any particular area, performs notably well in all tests. These results suggest a possible trade-off between a lower KL-divergence and longer computational time with the |*ZAD*| scoring method. However, the slight benefit of a lower KL-divergence is arguably not worth the computational cost, particularly when also considering the higher failure rate. In contrast, the significantly low failure rate and fast performance times are strong arguments in favor of |*Z*4| as the preferred scoring method. However, this result holds only when the score of a sensitive measure is minimized while the threshold to terminate is based on a less sensitive measure (see Section 4.7 in methods for details).

Qualitative analysis is used to elucidate why |*Z*4| minimization is the best overall performer. The pdf and SQR for hundreds of different estimates were compared visually and robust trends were observed between the |*ZVAR*| and |*Z*4| methods. Figure 13a is a representative example, showing the density estimates for the Burr distribution at 100,000 samples. Although both estimates were terminated at the same quality level, the smooth curve found for |*Z*4| would be subjectively judged superior. However, there is nothing inherently or measurably incorrect about the small wiggles in the |*ZVAR*| estimate. Note that no smoothness conditions are enforced in the PDFestimator.

The SQR-plot, shown in Figure 13b, is especially insightful in evaluating the differences in this example. The Burr distribution is deceptively difficult to estimate accurately due to a heavy tail on the right. Both |*ZVAR*| and |*Z*4| fall mostly within the expected range, except for the sharp peak to the right corresponding to the long tail. Although the peak is more pronounced for |*ZVAR*|, the more relevant point in this example is the shape of the entire SQR-plot. SQR for |*ZVAR*| contains scaled residuals close to zero, behavior virtually never observed in true SURD. Hence, this corresponds to over-fitting. This contrast in the SQR-plot between |*ZVAR*| and |*Z*4| is generally true with the following explanation.

The |*Z*4| scoring method uses the same threshold scoring as |*ZVAR*|, but simultaneously seeks to minimize the variance from average, thus highly penalizing outliers to the expected z-score. The |*ZVAR*| method, by contrast, tends to over-fit some areas of the distribution of high density, attempting to compensate for areas of relatively low density where it deviates significantly. This often results in longer run times, many unnecessary Lagrange multipliers, less smooth estimates and unrealistic SQR-plots, as the NMEM algorithm attempts to improve inappropriately. For example, in the test shown in Figure 13, the number of Lagrange multipliers required for the |*ZVAR*| estimate was 141, whereas |*Z*4| required only 19. Therefore, it is easy to see why |*ZVAR*| took much longer to complete. This phenomenon is a general trend but it is exacerbated in cases where there are large sample sizes on distributions that have a combination of sharp peaks and heavy tails.

**Figure 13.** (**a**) Two density estimates are compared based on two different scoring functions. (**b**) The corresponding SQR-plots for each density estimate are shown. By eye, both density estimates look exceptionally good, but the SQR-plot has a strong peak representing error in the extreme tail of the distribution. The degree of error depends on the scoring function, but both scoring functions give qualitatively the same results.

A surprising null result of this work is that the *CS* measure, custom designed to have the greatest overall sensitivity and selectivity, failed to be the best overall performer in practice when invoked in the PDFestimator. Although more investigation is required, all comparative results taken together suggest that the *CS* scoring function is the most sensitive but is over-designed for the capability of the random search optimization method currently employed in the PDFestimator. In the progression of improvements on pdf estimation, the results from the initial PDFestimator suggested that a more sensitive scoring function would improve performance. With that aim, more sensitive scoring functions have been determined and performance of the PDFestimator substantially improved. However, it appears the opposite is now true, requiring a shift in attention to optimize the optimizer, with access to a battery of available scoring functions. In preparation, another work (ZM, JF, DJ) optimizes the overall scheme by dividing the data into smaller blocks, which gives much greater speed and higher accuracy, while taking advantage of parallelization.

#### **4. Methods**

MATLAB 2019a (MathWorks, Natick, MA, USA) and the density estimation program "PDFestimator" were used to generate all the data presented in this work. The PDFestimator is a C++ program that JF and DJ developed as previously reported [11], which includes the original Java program as supporting material. Upgrades to the PDFestimator are continuously being made on the BioMolecular Physics Group (BMPG) GitHub website, available online at https://github.com/BioMolecularPhysicsGroup-UNCC/PDF-Estimator, where the source code is freely available, including a MATLAB interface to the C++ program. An older C++ version is also available in R at https://cran.r-project.org/web/packages/PDFEstimator/index.html. The version on the public GitHub website is the most recent stable version that has been well tested.

#### *4.1. Generating SURD and Scoring Function Evaluation*

MATLAB was employed in numerical simulations to generate SURD. For a sample size *N*, the sort ordered sequence of numbers {*rk*}*<sup>N</sup>* was used to evaluate each scoring function being considered. The same realization of SURD was assigned multiple scores to facilitate subsequent cross correlations.

#### *4.2. Method for Partitioning Data*

As previously explained in detail [11], sample sizes of *N* > 1025 were partitioned in the PDFestimator to achieve rapid calculations. The lowest and highest random numbers in the set {*rk*}*<sup>N</sup>* define the boundaries of each partition. The random number closest to the median was also included. Partitions have an odd number of random numbers due to the recursive process of adding one additional random number between each pair of previously selected random numbers in the current partition. Partition sizes follow the pattern 3, 5, 9, 17, 33, ..., 1 + 2*<sup>n</sup>*. A desired property of scoring functions is that they should maintain size invariance for all partitions. Scores for each measure were tracked for all partitions of size 1025 and greater, including the full data set, which is the last partition. For example, with *N* = 100,000 the scores for partitions of size *Np* = 1025, 2049, 4097, 8193, 16,385, 32,769, 65,537, 100,000 were calculated. Scores from different partitions were cross-correlated in scatter plots.
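The hierarchical pattern of partition sizes can be sketched as follows (illustrative Python; `partition_sizes` is a hypothetical helper, not part of the PDFestimator):

```python
def partition_sizes(n):
    """Hierarchical partition sizes 3, 5, 9, 17, ..., 1 + 2^k, capped by
    the full sample size n, which is always the last partition."""
    sizes, k = [], 1
    while 1 + 2 ** k < n:
        sizes.append(1 + 2 ** k)
        k += 1
    sizes.append(n)  # the full data set is the final "partition"
    return sizes

# Reproduces the example from the text for N = 100,000.
print(partition_sizes(100_000)[-8:])
# [1025, 2049, 4097, 8193, 16385, 32769, 65537, 100000]
```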

#### *4.3. Finite Size Corrections*

For each partition of size *Np*, including the last partition of size *N*, the scores were transformed to obtain data collapse. For all practical purposes, finite size corrections were successfully achieved by shifting the average of a score to zero and normalizing the data by the standard deviation of the raw score. That is to say, the score *St*(*Np*) for *Np* samples in the p-th partition was a random variable. This score was transformed to a Z-value through the procedure *Zt*(*Np*) = [*St*(*Np*) − *μt*(*Np*, *N*)]/*σt*(*Np*, *N*). Operationally, tens of thousands of random sequences of SURD were generated for each scoring function type to empirically estimate *μt*(*Np*, *N*) and *σt*(*Np*, *N*). Note that *μt*(*Np*, *N*) and *σt*(*Np*, *N*) were obtained using basic fitting tools in the MATLAB graphics interface, and these are reported in Table 1.

#### *4.4. Decoy Generation*

For each decoy, the sort ordered sequence of numbers {*r<sup>o</sup><sub>k</sub>*}*<sub>N</sub>* defining SURD was transformed into decoy-SURD, denoted as dSURD. This was accomplished by creating a model decoy cdf, *Fd*(*r*). A new set of sort ordered random numbers was created by the 1 to 1 mapping {*r<sup>d</sup><sub>k</sub>*}*<sub>N</sub>* = *Fd*({*r<sup>o</sup><sub>k</sub>*}*<sub>N</sub>*), yielding one dSURD realization per SURD realization. Different decoys were generated based on different types of perturbations, which must meet certain criteria. Let Δ(*r*) represent a perturbation to SURD, such that

$$F_d(r) = r + \Delta(r) \tag{3}$$

For the perturbation to be valid, the pdf given by *fd*(*r*) = *dFd*(*r*)/*dr* must satisfy *fd*(*r*) ≥ 0, which implies 1 + Δ′(*r*) ≥ 0. The boundary conditions Δ(0) = Δ(1) = 0 must also be imposed. With these conditions satisfied, a wide variety of decoys could be generated. Four types of decoys were created using this approach, listed in the first four rows of Table 3. In this approach, the amplitude of the perturbation is a parameter. A decoy that is marginally difficult to detect at sample size *Nd* has max(|Δ|) = 1/√*Nd*. It is challenging to discriminate between SURD and dSURD for *N* < *Nd*, and markedly easy when *N*/*Nd* ≫ 1.
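A minimal sketch of this decoy construction, assuming a hypothetical sinusoidal perturbation (the paper's actual perturbation shapes are those shown in Figures 6 and 8):

```python
import math

def make_decoy(surd_sorted, delta, n_grid=10_001):
    """Map SURD into dSURD via the decoy cdf F_d(r) = r + Delta(r).

    Validity is checked numerically: boundary conditions
    Delta(0) = Delta(1) = 0, and monotonicity of F_d (equivalent to
    1 + Delta'(r) >= 0), so that the decoy pdf is non-negative.
    """
    assert abs(delta(0.0)) < 1e-9 and abs(delta(1.0)) < 1e-9
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    fd = [r + delta(r) for r in grid]
    assert all(b >= a for a, b in zip(fd, fd[1:])), "F_d must be non-decreasing"
    return [r + delta(r) for r in surd_sorted]

# A smooth (hypothetical) perturbation whose amplitude targets a detection
# difficulty of N_d samples via max|Delta| = 1 / sqrt(N_d).
n_d = 10_000  # a "moderate" decoy
amp = 1.0 / math.sqrt(n_d)
delta = lambda r: amp * math.sin(2.0 * math.pi * r)

decoy = make_decoy([0.1, 0.25, 0.5, 0.9], delta)
```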

Two additional types of decoys were also generated. First, *Fd*(*r*) is set to a beta distribution cdf, denoted as *Fβ*(*r*|*α*, *β*), so that the perturbation is given by Δ(*r*) = *Fβ*(*r*|*α*, *β*) − *r*. The *α* and *β* parameters were adjusted to tune detection difficulty by systematically searching for pairs of *α* and *β* on a high resolution square grid to find when max(Δ) was at a level consistent with the targeted sample size, *Nd*. Second, a decoy can be defined by uniformly reducing fluctuations according to *r<sup>d</sup><sub>k</sub>* = *r<sup>o</sup><sub>k</sub>* − *p*(*r<sup>o</sup><sub>k</sub>* − *μ<sub>k</sub>*), where *μ<sub>k</sub>* = *k*/(*N* + 1). When *p* = 0 the decoy is the same as SURD, but as *p* → 1 the decoy retains no fluctuations. In this sense, this decoy type mimics extreme over-fitting, where *p* controls how much the fluctuations are reduced.
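The reduced-fluctuation decoy can be sketched as follows (illustrative Python, written with the convention implied by the text, namely that *p* = 1 removes all fluctuations, i.e., the fluctuation about *μ<sub>k</sub>* is attenuated by the factor 1 − *p*):

```python
def reduced_fluctuation_decoy(surd_sorted, p):
    """Uniformly suppress fluctuations about the expected positions
    mu_k = k/(N+1) by a fixed fraction p: p = 0 reproduces SURD, while
    p -> 1 removes all fluctuations (mimicking extreme over-fitting)."""
    n = len(surd_sorted)
    return [(1.0 - p) * r + p * (k / (n + 1))
            for k, r in enumerate(surd_sorted, 1)]

sample = [0.1, 0.3, 0.4, 0.8]
print(reduced_fluctuation_decoy(sample, 0.0))  # unchanged: SURD itself
print(reduced_fluctuation_decoy(sample, 1.0))  # [0.2, 0.4, 0.6, 0.8]
```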


**Table 3.** Decoy type summary.

#### *4.5. ROC Curves*

All ROC curves were generated according to the definition that the fraction of true positives (FTP) were plotted on the y-axis versus the fraction of false positives (FFP) plotted on the x-axis [22]. Note that alternative definitions for ROC are possible. To calculate FTP and FFP, a threshold score must be specified. If a score is below this threshold, the sort ordered sequence of numbers is predicted to be SURD. Conversely, if a score exceeds the threshold, the prediction is not SURD. As such, there are four possible outcomes. First, true SURD can be predicted as SURD or not, respectively, corresponding to a true positive (TP) or a false negative (FN). Second, dSURD can be predicted as SURD or not, respectively corresponding to a false positive (FP) or true negative (TN). All possible outcomes are tallied, such that FTP = TP/(TP + FN) and FFP = FP/(FP + TN). For a given threshold value, this calculation determines one point on the ROC curve. By considering a continuous range of possible thresholds, the entire ROC curve is constructed.
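The threshold sweep described above can be sketched as follows (illustrative Python; the AUC used in Section 2.5 is included via the trapezoid rule):

```python
def roc_points(surd_scores, decoy_scores):
    """One (FFP, FTP) point per threshold: a score at or below the
    threshold predicts SURD (a positive). FTP = TP/(TP + FN) goes on
    the y-axis and FFP = FP/(FP + TN) on the x-axis."""
    pts = [(0.0, 0.0)]
    for t in sorted(set(surd_scores) | set(decoy_scores)):
        ftp = sum(s <= t for s in surd_scores) / len(surd_scores)
        ffp = sum(s <= t for s in decoy_scores) / len(decoy_scores)
        pts.append((ffp, ftp))
    return pts

def auc(points):
    """Area under the ROC curve (AUC) by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Perfectly separated scores (SURD always scoring lower) give AUC = 1.
pts = roc_points([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
print(round(auc(pts), 6))  # 1.0
```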

Procedurally, the data used to calculate the fractions of true and false positives that come from numerical simulations in MATLAB comprised 10,000 random SURD and dSURD pairs for sample sizes, *N* = 10, 50, 200, 1000, 5000, 20,000 and 100,000. About 60 different types of decoys were considered with diverse sets of parameters.

#### *4.6. Distribution Test Set*

To benchmark the effect of a scoring function on the performance of the PDFestimator, a diverse collection of distributions was selected; these are listed in Table 4. A MATLAB script was created that utilizes built-in statistical distribution functions to generate random samples of specified size. The random samples were subsequently processed by the PDFestimator to estimate the pdf, which could then be compared against the exact, known pdf. The set of distributions covers a range of monomodal distributions representing many types of features, including sharp peaks, heavy tails and multiple resolution scales. Some mixture models were also included that combine difficult distributions to create a greater challenge.

#### *4.7. PDF Estimation Method*

Each alternative scoring function, {|*ZAD*|, |*ZLL*|, |*ZVAR*|, |*Z*4|, *CS*}, was implemented in the PDFestimator and evaluated separately. Factors confounding performance comparisons include sample size, distribution type, selection of key factors to evaluate and consistency across multiple trials. To provide a quantitative synopsis of the strengths and weaknesses of the proposed scoring methods, large numbers of trials were conducted on the distribution test set listed in Table 4. The distribution test set increases atypical failures amongst the estimates because it is necessary to consider extreme scenarios to identify breaking points in each of the scoring methods. Nevertheless, easier distributions, such as Gaussian, uniform and exponential, were included, since good performance of an estimator on challenging cases should not come at the expense of performance on easier distributions.

As an inverse problem, density estimation applied to multiple random samples of the same size for any given distribution will generally produce variation amongst the estimates. For small samples, the pdf estimate must resist over-fitting, whereas large sample sizes create computational challenges that must trade between speed and accuracy. To monitor these issues, a large range of sample sizes were tested, each with 100 trials of an independently generated input sample data set. Specifically, 100 random samples were generated for each of the 25 distributions, for each of the following 11 sample sizes with *N* = 10, 50, 100, 500, 1000, 5000, 10,000, 50,000, 100,000, 500,000, 1,000,000. This produced a total of 27,500 test cases, each of which were estimated using five scoring methods. Statistics were collected and averaged over each of the 100 random sample sets.

Three key quantities were calculated for a quantitative comparison of the scoring methods: failure rate, computational time and Kullback-Leibler (KL) divergence [21]. It was found that the KL-divergence was not sensitive to the different scoring functions. Alternative information measures [23,24] could be considered in future work. Failure rate is expressed as a fraction of failures out of 100 random samples. The KL-divergence measures the difference between the estimate and the known reference distribution. Computational times and KL-divergences were averaged only over successful solutions and thus were not impacted by failures. A failure is automatically flagged by the PDFestimator when a score does not reach a minimum threshold.
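A discrete-grid sketch of the KL-divergence comparison (illustrative Python; the grid densities below are made-up toy values, not the paper's test distributions):

```python
import math

def kl_divergence(p, q, dx):
    """Discrete approximation of the Kullback-Leibler divergence
    KL(p || q) = integral p(x) log(p(x)/q(x)) dx between a known reference
    density p and an estimated density q sampled on a common grid."""
    return sum(pi * math.log(pi / qi) * dx
               for pi, qi in zip(p, q) if pi > 0.0 and qi > 0.0)

# Identical densities give zero divergence; a mismatch gives a positive value.
reference = [1.0] * 10            # uniform density on [0, 1], dx = 0.1
estimate = [0.8] * 5 + [1.2] * 5  # a crude, deliberately biased estimate
print(kl_divergence(reference, reference, 0.1))           # 0.0
print(round(kl_divergence(reference, estimate, 0.1), 4))  # 0.0204
```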

During an initial testing phase, it was found that the measures *Zt* and |*Zt*| for *t* = AD, LL, and VAR all worked successfully, which is not surprising considering the original measure, *ZLL*, works markedly well. However, for the more sensitive measures, *Z*4, |*Z*4| and *CS*, the PDFestimator failed consistently because the score rarely reached its target threshold, at least within a reasonable time. Therefore, a hybrid method was developed that minimizes a sensitive measure as usual, but the |*ZVAR*| measure was invoked to determine when to terminate. In tests of |*Zt*| for *t* = AD, LL or VAR, these measures were optimized and were simultaneously used as a stopping condition with a threshold of 0.66 corresponding to the 40% level in the cdf, which was the same level used previously [11]. All these measures have the same pdf and cdf, and thus the same threshold value. This threshold was used for |*ZVAR*| as a stopping condition when different scoring functions are minimized.
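The hybrid scheme can be sketched abstractly (illustrative Python; the three callables are hypothetical stand-ins for the PDFestimator's random-search optimizer and scoring code):

```python
def hybrid_minimize(propose, sensitive, zvar, threshold=0.66, max_iter=10_000):
    """Hybrid scheme sketch: minimize a sensitive measure (e.g. |Z_4|),
    but terminate as soon as the less sensitive |Z_VAR| score of the
    current best estimate falls below its threshold (0.66, the 40%
    level of its cdf)."""
    best = propose(None)              # initial candidate estimate
    best_score = sensitive(best)
    for _ in range(max_iter):
        if zvar(best) <= threshold:   # stopping condition on |Z_VAR|
            break
        candidate = propose(best)
        score = sensitive(candidate)
        if score < best_score:        # keep only improving candidates
            best, best_score = candidate, score
    return best

# Toy run: candidates are numbers halved each step; both scores are |x|.
result = hybrid_minimize(lambda x: 5.0 if x is None else x / 2.0, abs, abs)
print(result)  # 0.625
```

The point of the design is that the sensitive measure steers the search, while the well-calibrated, less sensitive |*ZVAR*| threshold decides when the estimate is good enough to stop.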


**Table 4.** List of distribution types and corresponding parameters used to generate random data samples. Parameter and variable names correspond to the labeling scheme of MATLAB. For mixture distributions, subscripts indicate the distribution used to create the mixture with ordinal numbering,

#### **5. Conclusions**

Several conclusions can be drawn from the large body of results presented. (1) The scaled quantile residual (SQR) is instrumental in assessing the quality of a pdf by means of visual inspection. The advantage of an SQR-plot over a traditional QQ-plot is that the displayed information is not only universal (distribution free) but, importantly, sample size invariant; (2) It is possible to construct myriad scoring functions that are universal and sample size invariant based on quantitatively characterizing SQR. In particular, various measures can be developed based on mathematical properties of single order statistics (SOS) and/or double order statistics (DOS); (3) Finite size corrections can generally be applied to scoring functions so that their asymptotic properties can be utilized for finite size samples, as low as *N* = 9; (4) Surprisingly, the scoring functions based on the Anderson-Darling test, the quasi log-likelihood of SOS and the variance of the SOS z-score, when applied to sampled uniform random data (SURD), share identical pdfs for their scores for all practical purposes. Moreover, the scores are invariant across sample size and for different size partitions that sub-sample the input data; (5) The concept of decoy-SURD (dSURD) is introduced and a few methods are given for creating it. The purpose of dSURD is to quantify the sensitivity and selectivity of a proposed scoring function using receiver operator characteristics (ROC) or other means, such as machine learning. The usefulness of dSURD in quantifying uncertainty in density estimation parallels the use of decoys in the field of protein structure prediction. That is, better scoring functions can be developed by focusing on how they discriminate between true SURD and dSURD; (6) Implementing a more sensitive scoring function in a method that estimates a pdf from random sampled data does not necessarily imply the process of estimation will be improved.
There are many confounding factors that determine the ultimate performance characteristics of an algorithm for density estimation, since speed and accuracy need to be balanced for a practical software tool; (7) Minimizing either the *Z*<sup>4</sup> or |*Z*4| scores greatly improved the performance of the PDFestimator, a C++ program for univariate density estimation, compared to the initially used scoring function *ZLL*.

In closing, a few research directions that can stem from this work are highlighted. Interestingly, the mean log ratio of nearest neighbor differences in sort ordered SURD, when taken from two disjoint subsets, is normally distributed (at least to a very good approximation). Unaware of an existing proof of this result, the empirical result suggests that a proof should be sought given that the literature contains many works that derive the pdf for ratios of random numbers that are distributed in a specific way. The results presented here can be applied to the problem of constructing a more sensitive distribution free "test for goodness of fit." Essentially, this was the main objective that was addressed but here the emphasis was on how to better quantify uncertainty for the process of estimating a pdf for a random sample of data. Going forward, the universal sample size invariant measures developed here can be employed to test the similarity of two random samples of data.

**Author Contributions:** D.J. formalized the project objectives, proposed most measures and all decoy types. D.J. wrote the MATLAB code to scale all measures for SURD, generate decoy-SURD and discriminate SURD from decoy-SURD using ROC curves. Z.M. selected the distribution test set, wrote the MATLAB code to generate the random samples, and performed preliminary tests on the PDFestimator. A.G. evaluated ROC curves and scatter plots for the proposed scoring functions applied to SURD and decoy-SURD across sample sizes. J.F. motivated the initial work by reductive data analysis using methods of data collapse and scaling. J.F. modified the PDFestimator as needed, performed all simulations involving the PDFestimator, and performed all data analysis regarding comparative density estimation performance. D.J., J.F. and Z.M. wrote the paper.

**Funding:** This research received no external funding.

**Acknowledgments:** We thank Michael Grabchak for several discussions that helped direct this work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Prior Sensitivity Analysis in a Semi-Parametric Integer-Valued Time Series Model**

**Helton Graziadei <sup>1,</sup>\*, Antonio Lijoi <sup>2</sup>, Hedibert F. Lopes <sup>3</sup>, Paulo C. Marques F. <sup>3</sup> and Igor Prünster <sup>2</sup>**


Received: 26 November 2019; Accepted: 3 January 2020; Published: 6 January 2020

**Abstract:** We examine issues of prior sensitivity in a semi-parametric hierarchical extension of the INAR(*p*) model with innovation rates clustered according to a Pitman–Yor process placed at the top of the model hierarchy. Our main finding is a graphical criterion that guides the specification of the hyperparameters of the Pitman–Yor process base measure. We show how the discount and concentration parameters interact with the chosen base measure to yield a gain in terms of the robustness of the inferential results. The forecasting performance of the model is exemplified in the analysis of a time series of worldwide earthquake events, for which the new model outperforms the original INAR(*p*) model.

**Keywords:** time series of counts; Bayesian hierarchical modeling; Bayesian nonparametrics; Pitman–Yor process; prior sensitivity; clustering; Bayesian forecasting

#### **1. Introduction**

Integer-valued time series are relevant to many fields of knowledge, ranging from finance and econometrics to ecology and meteorology. An extensive number of models for this kind of data has been proposed since the introduction of the INAR(1) model in the pioneering works of McKenzie [1] and Al-Osh and Alzaid [2] (see also the book by Weiss [3]). A higher-order INAR(*p*) model was considered in the work of Du and Li [4].

In this paper, we generalize the Bayesian version of the INAR(*p*) model studied by Neal and Kypraios [5]. In our model, the innovation rates are allowed to vary through time, with the distribution of the innovation rates being modeled hierarchically by means of a Pitman–Yor process [6]. In this way, we account for potential heterogeneity in the innovation rates as the process evolves through time, and this feature is automatically incorporated in the Bayesian forecasting capabilities of the model.

The semi-parametric form of the model demands a robustness analysis of our inferential conclusions as we vary the hyperparameters of the Pitman–Yor process. We investigate this prior sensitivity issue carefully and find ways to control the hyperparameters in order to achieve robust results.

This paper is organized as follows. In Section 2, we construct a generalized INAR(*p*) model with variable innovation rates. The likelihood function of the generalized model is derived and a data augmentation scheme is developed, which gives a specification of the model in terms of conditional distributions. This data augmented representation of the model enables the derivation in Section 4 of full conditional distributions in simple analytical form, which are essential for the stochastic simulations in Section 5. Section 3 reviews the main properties of the Pitman–Yor process which are used to define the PY-INAR(*p*) model in Section 4, including its clustering properties. In building the PY-INAR(*p*) model, we propose a form for the prior distribution of the vector of thinning parameters which improves on the choice made for the Bayesian INAR(*p*) model studied in [5]. In Section 5, we investigate the robustness of the inference with respect to changes in the Pitman–Yor process hyperparameters. Using the full conditional distributions of the innovation rates derived in Section 4, we inspect the behavior of the model as we concentrate or spread the mass of the Pitman–Yor base measure. This leads us to a graphical criterion that identifies an elbow in the posterior expectation of the number of clusters as we vary the hyperparameters of the base measure. Once we have control over the base measure, we study its interaction with the concentration and discount hyperparameters, showing how to make choices that yield robust results. In the course of this development, we use geometrical tools to inspect the clustering of the innovation rates produced by the model. Section 6 puts the graphical criterion to work for simulated data. In Section 7, using a time series of worldwide earthquake events, we finish the paper by comparing the forecasting performance of the PY-INAR(*p*) model against the original INAR(*p*) model, with favorable results.

#### **2. A Generalization of the INAR(***p***) Model**

We begin by generalizing the original INAR(*p*) model of Du and Li [4] as follows.

Let {*Y<sub>t</sub>*}<sub>*t*≥1</sub> be an integer-valued time series, and, for some integer *p* ≥ 1, let the *innovations* {*Z<sub>t</sub>*}<sub>*t*≥*p*+1</sub>, given positive parameters {*λ<sub>t</sub>*}<sub>*t*≥*p*+1</sub>, be a sequence of conditionally independent Poisson(*λ<sub>t</sub>*) random variables. For a given vector of parameters *α* = (*α*<sub>1</sub>, ... , *α<sub>p</sub>*) ∈ [0, 1]<sup>*p*</sup>, let F<sub>*i*</sub> = {*B<sub>ij</sub>*(*t*) : *j* ≥ 0, *t* ≥ 2} be a family of conditionally independent and identically distributed Bernoulli(*α<sub>i</sub>*) random variables. For *i* ≠ *k*, suppose that F<sub>*i*</sub> and F<sub>*k*</sub> are conditionally independent, given *α*. Furthermore, assume that the innovations {*Z<sub>t</sub>*}<sub>*t*≥*p*+1</sub> and the families F<sub>1</sub>, ... , F<sub>*p*</sub> are conditionally independent, given *α* and *λ*. The generalized INAR(*p*) model is defined by the functional relation

$$Y_t = \alpha_1 \circ Y_{t-1} + \cdots + \alpha_p \circ Y_{t-p} + Z_t,$$

for *t* ≥ *p* + 1, in which ◦ denotes the binomial thinning operator, defined by *α<sub>i</sub>* ◦ *Y<sub>t−i</sub>* = ∑<sub>*j*=1</sub><sup>*Y<sub>t−i</sub>*</sup> *B<sub>ij</sub>*(*t*), if *Y<sub>t−i</sub>* > 0, and *α<sub>i</sub>* ◦ *Y<sub>t−i</sub>* = 0, if *Y<sub>t−i</sub>* = 0. In the homogeneous case, when all the *λ<sub>t</sub>*'s are assumed to be equal, we recover the original INAR(*p*) model.
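As an illustration of the thinning mechanism, the model can be simulated directly from its defining relation. The following Python sketch (the function names `thin` and `simulate_inar` are ours, not from the paper) draws a generalized INAR(*p*) path for a user-supplied sequence of innovation rates:

```python
import numpy as np

rng = np.random.default_rng(0)

def thin(alpha, y):
    """Binomial thinning: alpha ∘ y is a sum of y Bernoulli(alpha) draws."""
    return rng.binomial(y, alpha) if y > 0 else 0

def simulate_inar(alpha, lam, T, y0=None):
    """Simulate Y_t = alpha_1∘Y_{t-1} + ... + alpha_p∘Y_{t-p} + Z_t,
    with Z_t ~ Poisson(lam[t]); lam may vary with t (the generalized model)."""
    p = len(alpha)
    y = list(y0) if y0 is not None else [rng.poisson(lam[0]) for _ in range(p)]
    for t in range(p, T):
        y.append(sum(thin(alpha[i], y[t - 1 - i]) for i in range(p))
                 + rng.poisson(lam[t]))
    return np.array(y)

# homogeneous INAR(1) with alpha = 0.15 and constant rate 5
y = simulate_inar([0.15], np.full(1000, 5.0), 1000)
```

In the non-explosive regime the path fluctuates around the stationary mean λ/(1 − ∑α<sub>i</sub>).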

When *p* = 1, this model can be interpreted as specifying a birth-and-death process in which, at epoch *t*, the number of cases *Y<sub>t</sub>* equals the new cases *Z<sub>t</sub>* plus the cases that survived from the previous epoch; the role of the binomial thinning operator is to remove a random number of the *Y*<sub>*t*−1</sub> cases present at the previous epoch *t* − 1 (see [7] for an interpretation of the order-*p* case as a birth-and-death process with immigration).

Let *y* = (*y*<sub>1</sub>, ... , *y<sub>T</sub>*) denote the values of an observed time series. For simplicity, we assume that *Y*<sub>1</sub> = *y*<sub>1</sub>, ... , *Y<sub>p</sub>* = *y<sub>p</sub>* with probability one. The joint distribution of *Y*<sub>1</sub>, ... , *Y<sub>T</sub>*, given parameters *α* and *λ* = (*λ*<sub>*p*+1</sub>, ... , *λ<sub>T</sub>*), can be factored as

$$\Pr\{Y_1 = y_1, \dots, Y_T = y_T \mid \alpha, \lambda\} = \prod_{t=p+1}^T \Pr\{Y_t = y_t \mid Y_{t-1} = y_{t-1}, \dots, Y_{t-p} = y_{t-p}, \alpha, \lambda_t\}.$$

Since, with probability one, *α<sub>i</sub>* ◦ *Y<sub>t−i</sub>* ≤ *Y<sub>t−i</sub>* and *Z<sub>t</sub>* ≥ 0, the likelihood function of the generalized INAR(*p*) model is given by

$$L_y(\alpha, \lambda) = \prod_{t=p+1}^{T} \sum_{m_{1,t}=0}^{\min\{y_t,\, y_{t-1}\}} \cdots \sum_{m_{p,t}=0}^{\min\{y_t - \sum_{j=1}^{p-1} m_{j,t},\, y_{t-p}\}} \left(\prod_{i=1}^{p} \binom{y_{t-i}}{m_{i,t}} \alpha_i^{m_{i,t}} (1 - \alpha_i)^{y_{t-i} - m_{i,t}}\right) \times \left(\frac{e^{-\lambda_t} \lambda_t^{y_t - \sum_{j=1}^p m_{j,t}}}{(y_t - \sum_{j=1}^p m_{j,t})!}\right).$$
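For *p* = 1, the likelihood above reduces, for each epoch, to a finite convolution over a single latent maturation. A minimal Python sketch (assuming SciPy is available; the function name `loglik_inar1` is ours):

```python
import numpy as np
from scipy.stats import binom, poisson

def loglik_inar1(alpha, lam, y):
    """Log-likelihood of the generalized INAR(1) model: for each t, sum the
    binomial-thinning pmf against the Poisson innovation pmf over the latent
    maturation m in 0..min(y_t, y_{t-1})."""
    ll = 0.0
    for t in range(1, len(y)):
        m = np.arange(min(y[t], y[t - 1]) + 1)
        terms = binom.pmf(m, y[t - 1], alpha) * poisson.pmf(y[t] - m, lam[t])
        ll += np.log(terms.sum())
    return ll
```

A quick sanity check: with *α* = 0 the model degenerates to independent Poisson counts, so the log-likelihood must coincide with the sum of Poisson log-pmfs.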

For some epoch *t* and *i* = 1, ... , *p*, suppose that we could observe the values of the latent *maturations* *M<sub>i,t</sub>*. Postulate that *M<sub>i,t</sub>* | *Y<sub>t−i</sub>* = *y<sub>t−i</sub>*, *α<sub>i</sub>* ∼ Binomial(*y<sub>t−i</sub>*, *α<sub>i</sub>*), so that the conditional probability function of *M<sub>i,t</sub>* is given by

$$\begin{aligned} p(m_{i,t} \mid y_{t-i}, \alpha_i) &= \Pr\{M_{i,t} = m_{i,t} \mid Y_{t-i} = y_{t-i}, \alpha_i\} \\ &= \binom{y_{t-i}}{m_{i,t}} \alpha_i^{m_{i,t}} (1 - \alpha_i)^{y_{t-i} - m_{i,t}} \, \mathbb{I}_{\{0, \dots, y_{t-i}\}}(m_{i,t}). \end{aligned}$$

Furthermore, suppose that

$$\begin{aligned} p(y_t \mid m_{1,t}, \dots, m_{p,t}, \lambda_t) &= \Pr\{Y_t = y_t \mid M_{1,t} = m_{1,t}, \dots, M_{p,t} = m_{p,t}, \lambda_t\} \\ &= \frac{e^{-\lambda_t} \lambda_t^{y_t - \sum_{j=1}^p m_{j,t}}}{(y_t - \sum_{j=1}^p m_{j,t})!}\, \mathbb{I}_{\{\sum_{j=1}^p m_{j,t},\, \sum_{j=1}^p m_{j,t} + 1,\, \dots\}}(y_t). \end{aligned}$$

Using the law of total probability and the product rule, we have that

$$\begin{aligned} p(y_t \mid y_{t-1}, \dots, y_{t-p}, \alpha, \lambda_t) &= \sum_{m_{1,t}=0}^{y_{t-1}} \cdots \sum_{m_{p,t}=0}^{y_{t-p}} p(y_t, m_{1,t}, \dots, m_{p,t} \mid y_{t-1}, \dots, y_{t-p}, \alpha, \lambda_t) \\ &= \sum_{m_{1,t}=0}^{y_{t-1}} \cdots \sum_{m_{p,t}=0}^{y_{t-p}} p(y_t \mid m_{1,t}, \dots, m_{p,t}, \lambda_t) \times \prod_{i=1}^p p(m_{i,t} \mid y_{t-i}, \alpha_i). \end{aligned}$$

Since

$$\begin{aligned} \mathbb{I}_{\{\sum_{j=1}^p m_{j,t},\, \sum_{j=1}^p m_{j,t} + 1,\, \dots\}}(y_t) &= \mathbb{I}_{\{0, \dots, y_t\}}\left(\sum_{j=1}^p m_{j,t}\right) \\ &= \mathbb{I}_{\{0, \dots, y_t\}}(m_{1,t}) \times \cdots \times \mathbb{I}_{\{0, \dots, y_t - \sum_{j=1}^{p-1} m_{j,t}\}}(m_{p,t}) \end{aligned}$$

and

$$\mathbb{I}_{\{\sum_{j=1}^{p} m_{j,t},\, \sum_{j=1}^{p} m_{j,t} + 1,\, \dots\}}(y_t) \times \mathbb{I}_{\{0, \dots, y_{t-i}\}}(m_{i,t}) = \mathbb{I}_{\{0, 1, \dots, \min\{y_t - \sum_{j \neq i} m_{j,t},\, y_{t-i}\}\}}(m_{i,t}),$$

we recover the original likelihood of the generalized INAR(*p*) model, showing that the introduction of the latent maturations *M<sub>i,t</sub>* with the specified distributions is a valid data augmentation scheme (see [8,9] for a general discussion of data augmentation techniques).

In the next section, we review the needed definitions and properties of the Pitman–Yor process.

#### **3. Pitman–Yor Process**

Let the random probability measure G ∼ DP(*τ*, *G*<sub>0</sub>) be a Dirichlet process [10–12] with concentration parameter *τ* and base measure *G*<sub>0</sub>. If the random variables *X*<sub>1</sub>, ... , *X<sub>n</sub>*, given G = *G*, are conditionally independent and identically distributed as *G*, then it follows that

$$\Pr\{X_{n+1}\in B \mid X_1 = x_1, \dots, X_n = x_n\} = \frac{\tau}{\tau + n}\, G_0(B) + \frac{1}{\tau + n} \sum_{i=1}^n I_B(x_i),$$

for every Borel set *B*. If we imagine the sequential generation of the *X<sub>i</sub>*'s, for *i* = 1, ... , *n*, this expression shows that a value is either generated anew from *G*<sub>0</sub> with probability proportional to *τ*, or repeats one of the previously generated values with probability proportional to its multiplicity. Therefore, almost surely, realizations of a Dirichlet process are discrete probability measures, possibly with denumerably infinite support, depending on the nature of *G*<sub>0</sub>. Moreover, this data-generating process, known as the Pólya–Blackwell–MacQueen urn, implies that the *X<sub>i</sub>*'s are "softly clustered", in the sense that, in one realization of the process, the elements of a subset of the *X<sub>i</sub>*'s may have exactly the same value.
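The urn scheme can be simulated directly from the predictive rule above. A small Python sketch (the helper `dp_urn_sample` is illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

def dp_urn_sample(n, tau, g0):
    """Draw X_1,...,X_n from the Pólya–Blackwell–MacQueen urn: X_{i+1} is a
    fresh draw from G0 with probability tau/(tau+i), otherwise a uniformly
    chosen previous value (i.e., a repeat proportional to multiplicity)."""
    xs = [g0()]
    for i in range(1, n):
        if rng.random() < tau / (tau + i):
            xs.append(g0())                  # new value from the base measure
        else:
            xs.append(xs[rng.integers(i)])   # repeat an existing value
    return np.array(xs)

x = dp_urn_sample(500, tau=2.0, g0=lambda: rng.gamma(2.0, 1.0))
```

The ties produced by the repeat branch are exactly the "soft clusters" described above: the number of distinct values grows only logarithmically in *n*.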

The Pitman–Yor process [6] is a generalization of the Dirichlet process which results in a more flexible model. Essentially, the Pitman–Yor process modifies the probabilities of the Pólya–Blackwell–MacQueen urn by introducing a new parameter, so that the posterior predictive probability becomes

$$\Pr\{X_{n+1}\in B \mid X_1 = x_1, \dots, X_n = x_n\} = \frac{\tau + k\sigma}{\tau + n}\, G_0(B) + \frac{1}{\tau + n} \sum_{i=1}^n \left(1 - \frac{\sigma}{n_i}\right) I_B(x_i),$$

in which 0 ≤ *σ* < 1 is the discount parameter, *τ* > −*σ*, *k* is the number of distinct elements in {*X*<sub>1</sub>, ... , *X<sub>n</sub>*}, and *n<sub>i</sub>* is the number of elements in {*X*<sub>1</sub>, ... , *X<sub>n</sub>*} which are equal to *X<sub>i</sub>*, for *i* = 1, ... , *n*. It is well known that E[G(*B*)] = *G*<sub>0</sub>(*B*) and

$$\operatorname{Var}[\mathbb{G}(B)] = \left(\frac{1-\sigma}{\tau+1}\right) G_0(B) \left(1 - G_0(B)\right),$$

for every Borel set *B*. Hence, G is centered on the base probability measure *G*<sub>0</sub>, while *τ* and *σ* control the concentration of G around *G*<sub>0</sub>. We use the notation G ∼ PY(*τ*, *σ*, *G*<sub>0</sub>). When *σ* = 0, we recover the Dirichlet process as a special case. The PY process is also defined for *σ* < 0 and *τ* = |*σ*|*m*, for some positive integer *m*. For our purposes, it is enough to consider the case of non-negative *σ*.
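The generalized urn is equally easy to simulate: track the distinct values and their counts, and draw the next value from the Pitman–Yor predictive. A Python sketch (`py_urn_sample` is our name):

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(2)

def py_urn_sample(n, tau, sigma, g0):
    """Sequential draws from the Pitman–Yor predictive: a fresh value from G0
    with probability (tau + k*sigma)/(tau + i); an existing distinct value
    with count n_j with probability (n_j - sigma)/(tau + i)."""
    xs = [g0()]
    counts = Counter(xs)
    for i in range(1, n):
        k = len(counts)
        vals = list(counts)
        probs = [(counts[v] - sigma) / (tau + i) for v in vals]
        probs.append((tau + k * sigma) / (tau + i))  # probabilities sum to 1
        j = rng.choice(k + 1, p=np.array(probs))
        x = g0() if j == k else vals[j]
        xs.append(x)
        counts[x] += 1
    return np.array(xs)

xs = py_urn_sample(400, tau=1.0, sigma=0.5, g0=lambda: rng.normal())
```

Compared with the Dirichlet case (*σ* = 0), a positive discount makes the number of distinct values grow polynomially, of order *n*<sup>*σ*</sup>, rather than logarithmically.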

Pitman [6] derived the distribution of the number of clusters *K* (the number of distinct *Xi*'s), conditionally on both the concentration parameter *τ* and the discount parameter *σ*, as

$$\Pr\{K=k \mid \tau, \sigma\} = \frac{\prod_{i=1}^{k-1} (\tau + i\sigma)}{\sigma^k \, (\tau + 1)_{n-1}} \times \mathscr{C}(n, k; \sigma),$$

in which (*x*)<sub>*n*</sub> = Γ(*x* + *n*)/Γ(*x*) is the rising factorial and 𝒞(*n*, *k*; *σ*) is the generalized factorial coefficient [13].

In the next section, we use a Pitman–Yor process to model the distribution of the innovation rates in the generalized INAR(*p*) model.

#### **4. PY-INAR(***p***) Model**

The PY-INAR(*p*) model is a hierarchical extension of the generalized INAR(*p*) model defined in Section 2. Given a random measure G ∼ PY(*τ*, *σ*, *G*<sub>0</sub>), in which *G*<sub>0</sub> is a Gamma(*a*<sub>0</sub>, *b*<sub>0</sub>) distribution, let the innovation rates *λ*<sub>*p*+1</sub>, ... , *λ<sub>T</sub>* be conditionally independent and identically distributed with distribution Pr{*λ<sub>t</sub>* ∈ *B* | G = *G*} = *G*(*B*).

To complete the PY-INAR(*p*) model, we need to specify the form of the prior distribution for the vector of thinning parameters *α* = (*α*<sub>1</sub>, ... , *α<sub>p</sub>*). By comparison with standard results from the theory of the AR(*p*) model [14], Du and Li [4] found that in the INAR(*p*) model the constraint ∑<sub>*i*=1</sub><sup>*p*</sup> *α<sub>i</sub>* < 1 must be fulfilled to guarantee the non-explosiveness of the process. In their Bayesian analysis of the INAR(*p*) model, Neal and Kypraios [5] considered independent beta distributions for the *α<sub>i</sub>*'s. Unfortunately, this choice is problematic. For example, in the particular case when the *α<sub>i</sub>*'s have independent uniform distributions, it is possible to show that Pr{∑<sub>*i*=1</sub><sup>*p*</sup> *α<sub>i</sub>* < 1} = 1/*p*!, implying that we would be concentrating most of the prior mass on the explosive region even for moderate values of the model order *p*. We circumvent this problem using a prior distribution for *α* that places all of its mass on the nonexplosive region and still allows us to derive the full conditional distributions of the *α<sub>i</sub>*'s in simple closed form. Specifically, we take the prior distribution of *α* to be a Dirichlet distribution with hyperparameters (*a*<sub>1</sub>, ... , *a<sub>p</sub>*; *a*<sub>*p*+1</sub>), and corresponding density

$$\pi(\alpha) = \frac{\Gamma\left(\sum_{i=1}^{p+1} a_i\right)}{\prod_{i=1}^{p+1} \Gamma(a_i)} \prod_{i=1}^{p+1} \alpha_i^{a_i - 1},$$

in which *a<sub>i</sub>* > 0, for *i* = 1, ... , *p* + 1, and *α*<sub>*p*+1</sub> = 1 − ∑<sub>*i*=1</sub><sup>*p*</sup> *α<sub>i</sub>*.
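The claim Pr{∑ *α<sub>i</sub>* < 1} = 1/*p*! under independent uniform priors, and the fact that a Dirichlet draw automatically satisfies the non-explosive constraint, are easy to verify by Monte Carlo; a Python sketch:

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(3)

# Monte Carlo check of Pr{sum of p iid U(0,1) variates < 1} = 1/p!
for p in (2, 3, 4):
    u = rng.random((200_000, p))
    est = (u.sum(axis=1) < 1).mean()
    assert abs(est - 1 / factorial(p)) < 0.01

# By contrast, taking (alpha_1,...,alpha_p, alpha_{p+1}) ~ Dirichlet and
# keeping the first p coordinates always satisfies sum(alpha) < 1:
alpha = rng.dirichlet(np.ones(4))[:3]   # p = 3, all a_i = 1
assert alpha.sum() < 1
```

For *p* = 4 the uniform prior already puts only 1/24 ≈ 4% of its mass on the non-explosive region, which is the problem the Dirichlet prior avoids.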

Let *m* = {*m<sub>i,t</sub>* : *i* = 1, ... , *p*, *t* = *p* + 1, ... , *T*} denote the set of all maturations, and let *μ*<sub>G</sub> be the distribution of G. Our strategy to derive the full conditional distributions of the model parameters and latent variables is to consider the marginal distribution

$$\begin{aligned} p(y, m, \alpha, \lambda) &= \int p(y, m, \alpha, \lambda \mid G) \, d\mu_{\mathbb{G}}(G) \\ &= \left\{ \prod_{t=p+1}^T p(y_t \mid m_{1,t}, \dots, m_{p,t}, \lambda_t) \prod_{i=1}^p p(m_{i,t} \mid y_{t-i}, \alpha_i) \right\} \times \pi(\alpha) \times \int \prod_{t=p+1}^T p(\lambda_t \mid G) \, d\mu_{\mathbb{G}}(G). \end{aligned}$$

From this expression, using the results in Section 3, the derivation of the full conditional distributions is straightforward. In the following expressions, the symbol ∝ denotes proportionality up to a suitable normalization factor, and the label "all others" designates the observed counts *y* and all the other latent variables and model parameters, with the exception of the one under consideration.

Let *λ*<sub>∖*t*</sub> denote the set {*λ*<sub>*p*+1</sub>, ... , *λ<sub>T</sub>*} with the element *λ<sub>t</sub>* removed. Then, for *t* = *p* + 1, ... , *T*, we have

$$\lambda_t \mid \text{all others} \sim w_t \times \operatorname{Gamma}(y_t - m_t + a_0,\, b_0 + 1) + \sum_{r \neq t} \left(1 - \frac{\sigma}{n_r}\right) \lambda_r^{y_t - m_t} e^{-\lambda_r}\, \delta_{\{\lambda_r\}},$$

in which the weight

$$w_t = \frac{(\tau + k_{\setminus t}\, \sigma) \times b_0^{a_0} \times \Gamma(y_t - m_t + a_0)}{\Gamma(a_0) \times (b_0 + 1)^{y_t - m_t + a_0}},$$

*m<sub>t</sub>* = ∑<sub>*i*=1</sub><sup>*p*</sup> *m<sub>i,t</sub>*, *n<sub>r</sub>* is the number of elements in *λ*<sub>∖*t*</sub> which are equal to *λ<sub>r</sub>*, and *k*<sub>∖*t*</sub> is the number of distinct elements in *λ*<sub>∖*t*</sub>. In this mixture, we suppressed the normalization constant that makes all the weights add up to one.

Making the choice *ap*+<sup>1</sup> = 1, we have

$$\alpha_i \mid \text{all others} \sim \operatorname{TBeta}\left(a_i + \sum_{t=p+1}^T m_{i,t},\; 1 + \sum_{t=p+1}^T (y_{t-i} - m_{i,t}),\; 1 - \sum_{j \neq i} \alpha_j\right),$$

for *i* = 1, ... , *p*, in which TBeta denotes the right-truncated Beta distribution with support (0, 1 − ∑<sub>*j*≠*i*</sub> *α<sub>j</sub>*).
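Sampling from such a right-truncated Beta is conveniently done by inversion: rescale a uniform draw to [0, *F*(*u*)], where *F* is the Beta CDF and *u* the truncation point, and apply the quantile function. A hedged sketch (assuming SciPy; `sample_truncated_beta` is our name):

```python
import numpy as np
from scipy.stats import beta

def sample_truncated_beta(a, b, upper, rng):
    """Draw from Beta(a, b) right-truncated to (0, upper) by inverting the
    CDF: U ~ Uniform(0, F(upper)), then return F^{-1}(U)."""
    u = rng.random() * beta.cdf(upper, a, b)
    return beta.ppf(u, a, b)
```

Every draw lands below the truncation point by construction, so the simplex constraint on *α* is preserved at each Gibbs step.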

For the latent maturations, we find

$$\begin{aligned} p(m_{i,t} \mid \text{all others}) &\propto \frac{1}{m_{i,t}!\, (y_t - \sum_{j=1}^p m_{j,t})!\, (y_{t-i} - m_{i,t})!} \left(\frac{\alpha_i}{\lambda_t (1 - \alpha_i)}\right)^{m_{i,t}} \\ &\quad \times\, \mathbb{I}_{\{0, 1, \dots, \min\{y_t - \sum_{j \neq i} m_{j,t},\, y_{t-i}\}\}}(m_{i,t}). \end{aligned}$$

To explore the posterior distribution of the model, we build a Gibbs sampler [15] using these full conditional distributions. Escobar and West [16] showed, in a similar context, that we can improve mixing by simultaneously resampling the values of all *λ<sub>t</sub>*'s inside the same cluster at the end of each iteration of the Gibbs sampler. Letting (*λ*<sup>∗</sup><sub>1</sub>, ... , *λ*<sup>∗</sup><sub>*k*</sub>) be the *k* unique values among (*λ*<sub>*p*+1</sub>, ... , *λ<sub>T</sub>*), define the number of occupants of cluster *j* by *ν<sub>j</sub>* = ∑<sub>*t*=*p*+1</sub><sup>*T*</sup> 𝕀<sub>{*λ*<sup>∗</sup><sub>*j*</sub>}</sub>(*λ<sub>t</sub>*), for *j* = 1, ... , *k*. It follows that

$$\lambda_j^* \mid \text{all others} \sim \operatorname{Gamma}\left(a_0 + \sum_{t=p+1}^T \left(y_t - \sum_{i=1}^p m_{i,t}\right) \mathbb{I}_{\{\lambda_j^*\}}(\lambda_t),\; b_0 + \nu_j\right),$$

for *j* = 1, ... , *k*. At the end of each iteration of the Gibbs sampler, we update the values of all *λ<sub>t</sub>*'s inside each cluster by the corresponding *λ*<sup>∗</sup><sub>*j*</sub> drawn from this distribution.

#### **5. Prior Sensitivity**

As is often the case for Bayesian models with nonparametric components, choosing prior parameters for the PY-INAR(*p*) model which yield robustness of the posterior distribution is nontrivial [17].

The first aspect to be considered is the fact that the base measure *G*<sub>0</sub> plays a crucial role in determining the posterior distribution of the number of clusters *K*. This can be seen directly by inspecting the form of the full conditional distributions derived in Section 4. Recalling that *G*<sub>0</sub> is a gamma distribution with mean *a*<sub>0</sub>/*b*<sub>0</sub> and variance *a*<sub>0</sub>/*b*<sub>0</sub><sup>2</sup>, from the full conditional distribution of *λ<sub>t</sub>* one may note that the probability of generating, on each iteration of the Gibbs sampler, a value for *λ<sub>t</sub>* anew from *G*<sub>0</sub> is proportional to

$$\frac{(\tau + k_{\setminus t}\, \sigma) \times b_0^{a_0} \times \Gamma(y_t - m_t + a_0)}{\Gamma(a_0)\, (b_0 + 1)^{y_t - m_t + a_0}}.$$

Therefore, supposing that all the other terms are fixed, if we concentrate the mass of *G*<sub>0</sub> around zero by letting *b*<sub>0</sub> → ∞, this probability decreases to zero. This is not problematic, because we would hardly want to make such a drastic choice for *G*<sub>0</sub>. The behavior in the other direction is more revealing: taking *b*<sub>0</sub> ↓ 0, in order to spread the mass of *G*<sub>0</sub>, also makes this probability tend to zero. Due to this behavior, we need a criterion for choosing the hyperparameters of the base measure which avoids these extreme cases.

In our analysis, it is convenient to have a single hyperparameter regulating how the mass of *G*<sub>0</sub> is spread over its support. For a given *λ*<sub>max</sub> > 0, we find numerically the values of *a*<sub>0</sub> and *b*<sub>0</sub> which minimize the Kullback–Leibler divergence between *G*<sub>0</sub> and a uniform distribution on the interval [0, *λ*<sub>max</sub>]. This divergence can be computed explicitly as

$$-\log \lambda_{\max} - a_0 \log b_0 + \log \Gamma(a_0) - (a_0 - 1)(\log \lambda_{\max} - 1) + \frac{b_0 \lambda_{\max}}{2}.$$

In this new parameterization, our goal is to make a sensible choice for *λ*<sub>max</sub>. It is worth emphasizing that this procedure does not truncate the support of *G*<sub>0</sub>; the uniform distribution on the interval [0, *λ*<sub>max</sub>] is only used as a reference for the choice of the base measure hyperparameters *a*<sub>0</sub> and *b*<sub>0</sub>.
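Minimizing this divergence over (*a*<sub>0</sub>, *b*<sub>0</sub>) is a routine numerical exercise; a Python sketch using SciPy (the function name `base_measure_params` is ours). Note that the first-order condition in *b*<sub>0</sub> gives *b*<sub>0</sub> = 2*a*<sub>0</sub>/*λ*<sub>max</sub>, which serves as a check on the optimizer:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def base_measure_params(lam_max):
    """Find (a0, b0) of the Gamma base measure minimizing the closed-form
    Kullback-Leibler divergence from Uniform[0, lam_max] given in the text."""
    def kl(theta):
        a0, b0 = theta
        return (-np.log(lam_max) - a0 * np.log(b0) + gammaln(a0)
                - (a0 - 1) * (np.log(lam_max) - 1) + b0 * lam_max / 2)
    res = minimize(kl, x0=[1.0, 1.0], bounds=[(1e-6, None), (1e-6, None)])
    return res.x

a0, b0 = base_measure_params(10.0)
```

The optimal *a*<sub>0</sub> does not depend on *λ*<sub>max</sub>; only *b*<sub>0</sub> rescales, which is consistent with *λ*<sub>max</sub> acting as a pure spread parameter.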

Our proposal to choose *λ*<sub>max</sub> goes as follows. We fix some value 0 ≤ *σ* < 1 for the discount parameter and choose an integer *k*<sub>0</sub> as the prior expectation of the number of clusters *K*, which, using the results at the end of Section 3, can be computed explicitly as

$$\mathrm{E}[K] = \begin{cases} \tau \left( \psi(\tau + T - p) - \psi(\tau) \right), & \text{if } \sigma = 0; \\[4pt] \dfrac{(\tau + \sigma)_{T-p}}{\sigma \, (\tau + 1)_{T-p-1}} - \dfrac{\tau}{\sigma}, & \text{if } \sigma > 0, \end{cases}$$

in which *ψ*(*x*) is the digamma function (see [6] for a derivation of this result). Next, we find the value of the concentration parameter *τ* by numerically solving E[*K*] = *k*<sub>0</sub>. After this, for each *λ*<sub>max</sub> in a grid of values, we run the Gibbs sampler and compute the posterior expectation of the number of clusters E[*K* | *y*]. Finally, in the corresponding graph, we look for the value of *λ*<sub>max</sub> located at the "elbow" of the curve, that is, the value of *λ*<sub>max</sub> at which the values of E[*K* | *y*] level off.

#### **6. Simulated Data**

As an explicit example of the graphical criterion in action, we used the functional form of a first-order model with thinning parameter *α* = 0.15 to simulate a time series of length *T* = 1000, in which the distribution of the innovations is a symmetric mixture of three Poisson distributions with parameters 1, 8, and 15. Figure 1 shows the formation of the elbows for two values of the discount parameter: *σ* = 0.5 and *σ* = 0.75.

**Figure 1.** Formation of the elbows for *σ* = 0.5 (left) and *σ* = 0.75 (right). The red dotted lines indicate the chosen values of *λ*max.

For the simulated time series, Figures 2–5 display the behavior of the posterior distributions obtained using the elbow method for (*k*<sub>0</sub>, *σ*) ∈ {4, 10, 16, 30} × {0, 0.25, 0.5, 0.75}. These figures make explicit the relation between the choice of the discount parameter *σ* and the achieved robustness of the posterior distribution: as we increase *σ*, the posterior becomes insensitive to the choice of *k*<sub>0</sub>. In particular, for *σ* = 0.75, the posterior mode is always near 3, which is the number of components used in the distribution of the innovations of the simulated time series.

**Figure 2.** Posterior distributions of the number of clusters *K* for the simulated time series with *σ* = 0 and *k*<sup>0</sup> = 4, 10, 16, 30. The red dotted lines indicate the value of *k*0.

Once we understand the influence of the prior parameters on the robustness of the posterior distribution, an interesting question is how to obtain a point estimate of the clustering, in the sense that each *λ<sub>t</sub>*, for *t* = *p* + 1, ... , *T*, would be assigned to one of the available clusters.

From the Gibbs sampler, we can easily obtain a Monte Carlo approximation of the probabilities *d<sub>rt</sub>* = Pr{*λ<sub>r</sub>* = *λ<sub>t</sub>* | *y*}, for *r*, *t* = *p* + 1, ... , *T*. These probabilities define a dissimilarity matrix *D* = (*d<sub>rt</sub>*) among the innovation rates. Although *D* is not a distance matrix, we can use it as a starting point to represent the innovation rates in a two-dimensional Euclidean space using metric multidimensional scaling (see [18] for a general discussion). From this two-dimensional representation, we use hierarchical clustering techniques to build a dendrogram, which is cut appropriately to define three clusters, allowing us to assign a single cluster label to each innovation rate.
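As a sketch of this last step, one can also skip the MDS embedding and feed dissimilarities (here, generically, low within-cluster and high between-cluster values, e.g. complements of co-clustering probabilities) directly to agglomerative clustering with SciPy. This is a simplification for illustration, not the authors' exact pipeline:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_rates(D, n_clusters=3):
    """Average-linkage hierarchical clustering of a symmetric dissimilarity
    matrix D (zero diagonal); the dendrogram is cut into n_clusters labels."""
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```

With a block-structured dissimilarity matrix, the cut recovers the blocks, which is the analogue of assigning each innovation rate a single cluster label.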

**Figure 3.** Posterior distributions of the number of clusters *K* for the simulated time series with *σ* = 0.25 and *k*<sup>0</sup> = 4, 10, 16, 30. The red dotted lines indicate the value of *k*0.

Table 1 displays the confusion matrix of this assignment, showing that 83% of the innovations were grouped correctly in the clusters which correspond to the mixture components used to simulate the time series.


**Table 1.** Confusion matrix for the cluster assignments.

**Figure 4.** Posterior distributions of the number of clusters *K* for the simulated time series with *σ* = 0.5 and *k*<sup>0</sup> = 4, 10, 16, 30. The red dotted lines indicate the value of *k*0.

**Figure 5.** Posterior distributions of the number of clusters *K* for the simulated time series with *σ* = 0.75 and *k*<sup>0</sup> = 4, 10, 16, 30. The red dotted lines indicate the value of *k*0.

#### **7. Earthquake Data**

In this section, we analyze a time series of yearly worldwide earthquake events of substantial magnitude (equal to or greater than 7 points on the Richter scale) from 1900 to 2018 (http://www.usgs.gov/natural-hazards/earthquake-hazards/earthquakes).

The forecasting performances of the INAR(*p*) and the PY-INAR(*p*) models are compared using a cross-validation procedure in which the models are trained with data ranging from the beginning of the time series up to a certain time, and predictions are made for epochs outside this training range.

Using this cross-validation procedure, we trained the INAR(*p*) and the PY-INAR(*p*) models with orders *p* = 1, 2, and 3, and made one-step-ahead predictions. Table 2 shows the out-of-sample mean absolute errors (MAE) for the INAR(*p*) and the PY-INAR(*p*) models. In this table, the MAE's are computed by predicting the counts for the last 36 years of the series. For all three model orders, the PY-INAR(*p*) model yields a smaller MAE than the original INAR(*p*) model.
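The cross-validation scheme is model-agnostic: any fitting routine that maps a history to a one-step-ahead forecast can be plugged in. A minimal Python sketch (`one_step_mae` is our name; the persistence baseline only illustrates the interface):

```python
import numpy as np

def one_step_mae(y, fit_and_predict, n_test):
    """Rolling-origin evaluation: for each of the last n_test epochs, fit on
    y[:t] and forecast y[t]; fit_and_predict maps a history to a forecast."""
    errs = [abs(y[t] - fit_and_predict(y[:t]))
            for t in range(len(y) - n_test, len(y))]
    return float(np.mean(errs))

# e.g., a naive persistence baseline (forecast = last observed value):
mae = one_step_mae(np.array([10.0, 12, 11, 13, 12, 14]), lambda h: h[-1], 3)
```

In the paper's comparison, `fit_and_predict` would run the full Gibbs sampler on the training window and return the posterior predictive point forecast.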

**Table 2.** Out-of-sample MAE's for the INAR(*p*) and the PY-INAR(*p*) models, with orders *p* = 1, 2, and 3. The last column shows the relative variations of the MAE's for the PY-INAR(*p*) models with respect to the corresponding MAE's for the INAR(*p*) models.


**Author Contributions:** Theoretical development: H.G., A.L., H.F.L., P.C.M.F. and I.P. Software development: H.G. and P.C.M.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** Helton Graziadei and Hedibert F. Lopes thank FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for financial support through grants numbers 2017/10096-6 and 2017/22914-5. Antonio Lijoi and Igor Prünster are partially supported by MIUR, PRIN Project 2015SNS29B.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **The Decomposition and Forecasting of Mutual Investment Funds Using Singular Spectrum Analysis**

#### **Paulo Canas Rodrigues <sup>1,2,</sup>\*, Jonatha Pimentel <sup>1</sup>, Patrick Messala <sup>1</sup> and Mohammad Kazemi <sup>3</sup>**


Received: 7 November 2019; Accepted: 7 January 2020; Published: 9 January 2020

**Abstract:** Singular spectrum analysis (SSA) is a non-parametric method that breaks down a time series into a set of components that can be interpreted and grouped as trend, periodicity, and noise, emphasizing the separability of the underlying components and separating periodicities that occur at different time scales. The original time series can be recovered by summing all components. However, only the components associated with the signal should be considered for the reconstruction of the noise-free time series and for forecasting. When the time series contains outliers, SSA and other classical parametric and non-parametric methods may lead to misleading conclusions, and robust methodologies should be used instead. In this paper we consider the use of two robust SSA algorithms for model fit and one for model forecasting. The classical SSA model, the robust SSA alternatives, and the autoregressive integrated moving average (ARIMA) model are compared in terms of computational time and accuracy for model fit and model forecast, using a simulation example and time series data from the quotas and returns of six mutual investment funds. When outliers are present in the data, the simulation study shows that the robust SSA algorithms outperform the classical SSA and ARIMA models.

**Keywords:** singular spectrum analysis; robust singular spectrum analysis; time series forecasting; mutual investment funds

#### **1. Introduction**

Mutual investment funds provide management services to institutional and individual investors, together with high liquidity and low transaction costs [1,2]. These funds can be of fixed or variable income and allow investors to diversify their assets while reducing unsystematic risk. Fixed-income mutual investment funds are low risk, whereas variable-income mutual investment funds vary in terms of risk but also in terms of returns. In this study, we were interested in analyzing the quotas and returns of six of the largest Brazilian-based mutual investment funds: three purely based on stocks, (i) Alaska Black, (ii) APEX Long Biased, and (iii) Brasil Capital; and three balanced funds (usually combining a stock component, a bond component, and sometimes a money-market component in a single portfolio), (iv) ADAM Strategy, (v) Gavea Macro, and (vi) SPX Nimitz.

A natural framework for analyzing mutual investment funds, given their underlying temporal structure, is time series analysis.

Singular spectrum analysis (SSA) is a powerful non-parametric technique for time series analysis and forecasting, which incorporates elements of classical time series analysis, multivariate statistics, and matrix algebra. Its main aim is to decompose the original time series into a set of components that can be interpreted as trend, seasonal, and noise components [3–6]. SSA has proven widely useful and applicable across many applications [7–17], with a scope that ranges from parameter estimation to time series filtering, synchronization analysis, and forecasting [18].

The SSA methodology for model fit can be summarized in four steps: (i) embedding, which maps the original univariate time series into a trajectory matrix; (ii) singular value decomposition (SVD), which decomposes the trajectory matrix into a sum of rank-one matrices; (iii) eigentriple grouping, which identifies which components are associated with the signal and which with the noise; and (iv) diagonal averaging, which maps the rank-one matrices associated with the signal back into time series that can be interpreted as trend, seasonal, or other meaningful components.

SSA results and their interpretation, as in many other classical time series methods, can be sensitive to data contamination with outliers [19,20]. In such cases, even a small percentage of outliers can substantially change the results of model fit and model forecasting. Very few attempts have been made to assess the effect of outliers on SSA. The studies in [21,22] presented preliminary results on the effect of outliers in singular spectrum analysis, and [23] made a first attempt to robustify SSA for model fit by considering an SVD based on the robust *L*<sup>1</sup> norm [24] instead of the *L*<sup>2</sup> norm used in the classical algorithm.

In this paper we go one step further than [23] and propose a new robust SSA algorithm whose SVD is based on the Huber function [25]. Moreover, we propose two robust SSA forecasting algorithms, one based on the *L*<sup>1</sup> norm and another based on the Huber function. Comparisons are made between the classical SSA algorithm, the robust SSA algorithm based on the *L*<sup>1</sup> norm (RLSSA), the robust SSA algorithm based on the Huber function (RHSSA), and the classical autoregressive integrated moving average (ARIMA) model, in terms of computational time and accuracy for model fit and model forecast. These comparisons for decomposing and forecasting time series were carried out using a simulation example and the six mutual investment funds mentioned above.

The rest of this paper is organized as follows. Section 2 provides the materials and methods containing the data description, a brief introduction to the ARIMA and SSA methodologies, and the details of the proposed robust SSA algorithm that uses the SVD based on the Huber function. Section 3 presents the results and discussion, wherein the ARIMA, SSA, and robust SSA algorithms are compared in terms of model fit and model forecast, using the six mutual investment funds and the simulation example. The paper closes in Section 4, wherein some conclusions are drawn.

#### **2. Materials and Methods**

#### *2.1. Data*

In this paper we consider a dataset that includes daily observations of six mutual investment funds, three based purely on stocks and three balanced funds:

#### *Stock funds*

- Alaska Black
- APEX Long Biased
- Brasil Capital

#### *Balanced funds*

- ADAM Strategy
- Gavea Macro
- SPX Nimitz


The datasets were collected from https://infofundos.com.br/carteira.

#### *2.2. ARIMA Model*

The autoregressive integrated moving average (ARIMA) models are among the most widely used techniques for time series analysis and forecasting. Such a model depends on three parameters: *p* is the number of lagged observations in the model, i.e., the autoregressive (AR) order; *d* is the number of times that the original observations are differenced, i.e., the integration (I) degree; and *q* is the size of the moving average window, i.e., the order of the moving average (MA) [26]. This parametric model is then written as ARIMA(*p*, *d*, *q*), with *p*, *d*, and *q* non-negative integers. Given a time series *Y<sub>N</sub>* = (*y*<sub>1</sub>, ... , *y<sub>N</sub>*), the ARIMA(*p*, *d*, *q*) model can be written as:

$$(1 - \phi\_1 B - \cdots - \phi\_p B^p)(1 - B)^d y\_t = c + (1 + \theta\_1 B + \cdots + \theta\_q B^q) \varepsilon\_t,\tag{1}$$

where *φ*<sub>1</sub>, ... , *φ<sub>p</sub>* are the parameters, or coefficients, of the *p* autoregressive terms; *B* is the time lag, or backward shift, operator, the linear operator such that $B^k y\_t = y\_{t-k}$, *t* ∈ Z; *y<sub>t</sub>* is the observation at time point *t*; *c* = *μ*(1 − *φ*<sub>1</sub> − ··· − *φ<sub>p</sub>*), where *μ* is the mean of (1 − *B*)<sup>*d*</sup>*y<sub>t</sub>*; *θ*<sub>1</sub>, ... , *θ<sub>q</sub>* are the parameters, or coefficients, of the *q* moving average terms; and *ε<sub>t</sub>* is an error term, usually white noise with variance *σ*<sup>2</sup>.

Alternatively, the model can be written as:

$$(1 - \phi\_1 B - \cdots - \phi\_p B^p)(1 - B)^d(y\_t - \mu t^d/d!) = (1 + \theta\_1 B + \cdots + \theta\_q B^q) \varepsilon\_t,\tag{2}$$

which is the parameterization used in the "arima" function of the R software [27].
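As an illustrative sketch of the recursion implied by Equation (1), the following Python snippet (the paper itself works in R; the parameter values here are hypothetical) simulates an ARIMA(1,1,1) series by first generating the differenced ARMA(1,1) process and then integrating it:

```python
import random

def simulate_arima_111(n, phi, theta, c=0.0, sigma=0.1, seed=42):
    """Simulate an ARIMA(1,1,1) series: generate the differenced ARMA(1,1)
    process w_t = c + phi * w_{t-1} + eps_t + theta * eps_{t-1}, then
    integrate (cumulative sum) to undo the (1 - B) difference."""
    rng = random.Random(seed)
    w, w_prev, eps_prev = [], 0.0, 0.0
    for _ in range(n):
        eps = rng.gauss(0.0, sigma)
        w_t = c + phi * w_prev + eps + theta * eps_prev
        w.append(w_t)
        w_prev, eps_prev = w_t, eps
    y, level = [], 0.0
    for w_t in w:          # integration step: y_t = y_{t-1} + w_t
        level += w_t
        y.append(level)
    return y

series = simulate_arima_111(200, phi=0.5, theta=0.3)
```

In practice such models are estimated rather than simulated; in R this is done with "arima" or, as in Section 3, with "auto.arima" from the "forecast" package.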

#### *2.3. Singular Spectrum Analysis*

Singular spectrum analysis is a non-parametric technique for model fit and model forecasting that decomposes a time series into a number of components that are summed and interpreted as trend, periodicity, and noise. Like many other time series techniques, SSA can be used to solve a wide range of problems; among the most relevant are its ability to smooth the original time series and to separate the signal (i.e., trend and oscillatory components with different amplitudes) from the noise components. SSA can therefore be used to analyze and reconstruct smoother, noise-free time series that can then be used for model forecasting.

SSA is divided into two interconnected stages: decomposition and reconstruction of the time series. Each stage consists of two steps, for a total of four: embedding, singular value decomposition (SVD), grouping, and diagonal averaging. The complete algorithm for model fit is described in the following subsections. Further details can be found in, e.g., [5,6,28].

#### 2.3.1. Decomposition

In the first stage, the (univariate) time series is converted into a high-dimensional matrix called a trajectory matrix, which is then decomposed into the sum of rank-one matrices based on the SVD.

#### (1) Embedding:

Consider a non-zero time series *Y<sub>N</sub>* = (*y*<sub>1</sub>, ... , *y<sub>N</sub>*) of length *N* > 2. Let *L* (1 < *L* < *N*) be an integer called the window length, and let *K* be an integer such that the trajectory matrix includes all values, i.e., *K* = *N* − *L* + 1. The embedding step maps the original time series into a sequence of *K* vectors of length *L*:

$$Y\_i = (y\_i, \ldots, y\_{i+L-1})^T, \quad 1 \le i \le K. \tag{3}$$

Then, the trajectory matrix **X**, whose columns are the vectors *Y<sub>i</sub>*, *i* = 1, ... , *K*, can be written as:

$$\mathbf{X} = [Y\_1 : \cdots : Y\_K] = (x\_{ij})\_{i,j=1}^{L,K} = \begin{bmatrix} y\_1 & y\_2 & \cdots & y\_K \\ y\_2 & y\_3 & \cdots & y\_{K+1} \\ \vdots & \vdots & \ddots & \vdots \\ y\_L & y\_{L+1} & \cdots & y\_N \end{bmatrix} . \tag{4}$$
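The embedding step can be sketched in a few lines of Python (an illustrative translation; the paper's analyses use R):

```python
def embed(y, L):
    """Embedding step of SSA: map the series y_1..y_N into the L x K
    trajectory matrix of Eq. (4), whose j-th column is the lagged vector
    (y_j, ..., y_{j+L-1})^T, with K = N - L + 1."""
    N = len(y)
    if not 1 < L < N:
        raise ValueError("window length must satisfy 1 < L < N")
    K = N - L + 1
    return [[y[i + j] for j in range(K)] for i in range(L)]

X = embed([1, 2, 3, 4, 5, 6], L=3)
# X is Hankel: every anti-diagonal holds a single repeated value of y
```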

(2) Singular value decomposition:

Let $\mathbf{S} = \mathbf{X}\mathbf{X}^T$, let $\mathbf{U}\_1, \ldots, \mathbf{U}\_L$ be the eigenvectors of **S**, and let $\lambda\_1 \ge \cdots \ge \lambda\_L$ be the corresponding eigenvalues. If *d* is the number of non-null eigenvalues of **S**, and defining $\mathbf{V}\_i = \mathbf{X}^T\mathbf{U}\_i/\sqrt{\lambda\_i}$, we can decompose the trajectory matrix **X** as:

$$\mathbf{X} = \sum\_{i=1}^{d} \mathbf{X}\_i = \sum\_{i=1}^{d} \sqrt{\lambda\_i} \mathbf{U}\_i \mathbf{V}\_i^T. \tag{5}$$

The decomposition stage can be accomplished either through the eigendecomposition of $\mathbf{X}^T\mathbf{X}$ or through the SVD of **X** ($\mathbf{X} = \mathbf{U}\mathbf{D}\mathbf{V}^T$, with $\mathbf{D} = \mathrm{diag}(\sqrt{\lambda\_1}, \ldots, \sqrt{\lambda\_d})$). A comparison between the two decompositions can be found in [29].
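In practice the SVD is delegated to a library routine. Purely as an illustration of how the leading eigentriple of $\mathbf{S} = \mathbf{X}\mathbf{X}^T$ can be obtained, the following Python sketch uses power iteration (a simplification: a full SVD recovers all *d* eigentriples, not just the first):

```python
def matmul_vec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def leading_eigenpair(S, iters=500):
    """Power iteration: the dominant eigenvalue and unit-norm eigenvector
    of the symmetric matrix S = X X^T, i.e., the first eigentriple."""
    u = [1.0] * len(S)
    for _ in range(iters):
        w = matmul_vec(S, u)
        norm = sum(x * x for x in w) ** 0.5
        u = [x / norm for x in w]
    Su = matmul_vec(S, u)
    lam = sum(ui * si for ui, si in zip(u, Su))  # Rayleigh quotient
    return lam, u

# A tiny 2 x 2 trajectory matrix and the matrix S built from it
X = [[1.0, 2.0], [2.0, 3.0]]
S = [[sum(X[i][k] * X[j][k] for k in range(2)) for j in range(2)]
     for i in range(2)]
lam1, U1 = leading_eigenpair(S)
```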

#### 2.3.2. Reconstruction

In the second stage, after separating the signal from the noise components, a diagonal averaging procedure is applied to the matrices associated with the signal, resulting in a sum of time series components that can then be interpreted as trend or oscillatory components:

#### (1) Eigentriple grouping:

This step consists of identifying the first *r* eigentriples associated with the signal and discarding the *d* − *r* eigentriples associated with the noise. Formally, let *I* = {1, . . . , *r*} and *I<sup>c</sup>* = {*r* + 1, . . . , *d*}. The goal of this step is to choose *I* such that the trajectory matrix can be written as:

$$\mathbf{X}\_{I} = \sum\_{i \in I} \sqrt{\lambda\_{i}} \mathbf{U}\_{i} \mathbf{V}\_{i}^{T} + \epsilon, \tag{6}$$

where $\epsilon$ is the noise term.

The number of eigentriples used for the reconstruction is often decided based on **w**-correlations. Two series $Y^{(1)}$ and $Y^{(2)}$ are said to be approximately separable if all correlations between the rows and the columns of their corresponding trajectory matrices are close to zero. In [5], the authors considered a further characteristic of the quality of separability, namely the weighted correlation, or **w**-correlation, which is a natural measure of the deviation of two series $Y\_T^{(1)}$ and $Y\_T^{(2)}$ from **w**-orthogonality:

$$\rho\_{12}^{(w)} = \frac{\left(Y\_{T}^{(1)}, Y\_{T}^{(2)}\right)\_{w}}{||Y\_{T}^{(1)}||\_{w}\, ||Y\_{T}^{(2)}||\_{w}},\tag{7}$$

where $||Y\_T^{(i)}||\_w = \sqrt{\left(Y\_T^{(i)}, Y\_T^{(i)}\right)\_w}$, *i* = 1, 2, and $\left(Y\_T^{(1)}, Y\_T^{(2)}\right)\_w = \sum\_{t=1}^{T} w\_t\, y\_t^{(1)} y\_t^{(2)}$, with $w\_t = \min\{t, L, T - t + 1\}$. If the absolute value of the **w**-correlation is small, the two series are almost **w**-orthogonal. If it is large, the series are far from **w**-orthogonal and are, therefore, badly separable. Further explanation and intuition about this measure can be found in [5,28]. Other approaches to this choice were proposed by, e.g., [30,31].
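A minimal Python sketch of the w-correlation of Equation (7), using the weights defined above (illustrative only; in R this quantity is provided by the "wcor" function of the "Rssa" package):

```python
def wcorr(y1, y2, L):
    """Weighted correlation rho_12^(w) of Eq. (7) between two series of
    equal length T, with weights w_t = min(t, L, T - t + 1), t = 1..T."""
    T = len(y1)
    w = [min(t, L, T - t + 1) for t in range(1, T + 1)]

    def inner(a, b):  # weighted inner product (., .)_w
        return sum(wt * at * bt for wt, at, bt in zip(w, a, b))

    return inner(y1, y2) / (inner(y1, y1) ** 0.5 * inner(y2, y2) ** 0.5)
```

A series is perfectly w-correlated with itself (and with any rescaled copy of itself), which is why near-zero values indicate good separability between components.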

(2) Diagonal averaging:

In this step, the noise-free time series is reconstructed by anti-diagonal averaging of the matrices included in $\mathbf{X}\_I$. First, the approximate trajectory matrix $\mathbf{X}\_I$ is transformed into a Hankel matrix. Let $A\_s = \{(l, k) : l + k = s, 1 \le l \le L, 1 \le k \le K\}$, with *s* = *i* + *j*, and let $\#(A\_s)$ be the number of elements of $A\_s$. The element $\widetilde{x}\_{ij}$ of the new Hankel matrix $\widetilde{\mathbf{X}}$ is given by:

$$\widetilde{x}\_{ij} = \sum\_{(l,k)\in A\_{s}} \frac{x\_{lk}}{\#(A\_{s})} . \tag{8}$$

Next, the Hankel matrix $\widetilde{\mathbf{X}}\_I$ is transformed into a new series of length *N*, and the original time series $Y\_N$ can be approximated by:

$$\widetilde{y}\_{i} = \begin{cases} \widetilde{x}\_{i1} & \text{for } i = 1, \dots, L, \\ \widetilde{x}\_{Lj} & \text{for } i = L+1, \dots, N, \end{cases} \tag{9}$$

where *j* = *i* − *L* + 1.
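The diagonal averaging of Equations (8) and (9) amounts to averaging over the anti-diagonals of the matrix; a minimal Python sketch:

```python
def diagonal_average(X):
    """Diagonal averaging (Hankelization) of Eq. (8): average each
    anti-diagonal of the L x K matrix X and read off the reconstructed
    series of length N = L + K - 1, as in Eq. (9)."""
    L, K = len(X), len(X[0])
    N = L + K - 1
    sums, counts = [0.0] * N, [0] * N
    for l in range(L):
        for k in range(K):
            sums[l + k] += X[l][k]
            counts[l + k] += 1
    return [s / c for s, c in zip(sums, counts)]
```

Applied to an exact Hankel trajectory matrix, the procedure recovers the original series; applied to a non-Hankel matrix (such as a truncated sum of rank-one terms), it returns the series of the closest Hankel matrix in the Frobenius sense.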

The reconstructed noise-free time series can then be used for out-of-sample forecasting.

#### *2.4. Robust SSA*

Although SSA has been shown to be superior to traditional model-based methods in many applications, the singular value decomposition (the second step of the SSA algorithm) is highly sensitive to data contamination with outliers. Very few studies have been conducted to assess the effects of outliers on SSA and to generalize the methodology [21,22]. A first attempt to robustify SSA, by considering an SVD based on the robust *L*<sup>1</sup> norm [24] instead of the *L*<sup>2</sup> norm used in the classical algorithm, was proposed by [23], who compared that robust generalization with the classical SSA algorithm for model fit. In this subsection we review the robust SSA algorithm proposed by [23], propose a new robust SSA algorithm whose SVD is based on the Huber function [25], and also propose an algorithm for robust SSA model forecasting. While robust algorithms based on the *L*<sup>1</sup> norm are very popular, they have difficulty handling heavy-tailed outliers. Robust algorithms based on the Huber function combine the sum-of-squares loss and the least-absolute-deviation loss: the loss is quadratic for small errors but grows linearly for large errors. As a result, the Huber loss function is not only more robust against outliers but also more adaptive to different types of data [32]. Further details and comparisons between the *L*<sup>1</sup> and Huber loss functions, among others, can be found in [33]. The R source code is available upon request from the first author of this paper.

#### 2.4.1. Robust SSA Based on the *L*<sup>1</sup> Norm

The robust SSA algorithm proposed by [23] replaces the classical SVD, based on the least squares *L*<sup>2</sup> norm, with the robust SVD algorithm based on the *L*<sup>1</sup> norm [24]. This robust SVD is computed iteratively, starting from an initial estimate of the first left singular vector *U*<sub>1</sub>, leading to an outlier-resistant approach that also allows for missing data. The robust SVD based on the *L*<sup>1</sup> norm is implemented in the function "robustSVD()" from the R package "pcaMethods".

#### 2.4.2. Robust SSA based on the Huber Function

Here we propose a new alternative to robustify the SSA algorithm, in which the least squares SVD in step two is replaced by the robust SVD based on the Huber function [25]. The Huber loss function [34] can be defined as:

$$L\_{\delta}(a) = \begin{cases} \frac{1}{2}a^2 & \text{if } |a| \le \delta, \\ \delta \left( |a| - \frac{1}{2}\delta \right) & \text{if } |a| > \delta, \end{cases} \tag{10}$$

where *δ* is a parameter that controls the robustness level, and a smaller value of *δ* usually leads to more robust estimation.
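Equation (10) can be transcribed directly; a minimal Python sketch (the default δ = 1.345 is the value used in the R implementation discussed next):

```python
def huber(a, delta=1.345):
    """Huber loss of Eq. (10): quadratic for |a| <= delta and linear
    beyond, so large residuals (outliers) are penalized less severely
    than under the squared-error loss."""
    if abs(a) <= delta:
        return 0.5 * a * a
    return delta * (abs(a) - 0.5 * delta)
```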

The robust SVD based on the Huber function is a special case of the robust regularized SVD and can be obtained with the function "RobRSVD" of the R package "RobRSVD", in the following way: RobRSVD(data, rough = TRUE, uspar = 0, vspar = 0). In this implementation, the authors consider *δ* = 1.345, the value commonly used in robust regression because it produces 95% efficiency under normal errors [35]. However, numerical studies suggest that the RobRSVD function is not very sensitive to the choice of *δ* [25]. More details about this robust SVD can be found in [25].

#### *2.5. Robust SSA Forecasting Algorithm*

The standard recurrent SSA forecasting algorithm assumes that a given observation can be written as a linear combination of the *L* − 1 previous observations [5,6,30]. In the classical SSA forecasting algorithm, the coefficients of this linear combination are obtained from the left singular vectors, *U*, of the trajectory matrix **X**. This is valid for SSA because of the orthogonality of the vectors in *U* and the full rank decomposition of **X**, which does not hold for the robust SVD algorithms because of their construction and specific properties. To overcome this limitation and obtain out-of-sample forecasts using a robust SSA algorithm, a three-stage approach can be conducted:

(i) Fit the robust SSA algorithm to the time series, obtaining the reconstructed (noise-free) series $\widetilde{y}\_1, \ldots, \widetilde{y}\_N$ and the robust left singular vectors $U\_1, \ldots, U\_r$.

(ii) Compute the coefficients of the linear recurrence:

$$
\hat{a} = (\hat{a}\_{L-1}, \cdots, \hat{a}\_1)^T = \frac{1}{1 - \gamma^2} \sum\_{j=1}^r \pi\_j U\_j^{\nabla},\tag{11}
$$

where $\gamma^2 = \sum\_{j=1}^{r} \pi\_j^2$, $\pi\_j$ is the last component of $U\_j$, and $U\_j^{\nabla}$ denotes $U\_j$ with its last component removed.

(iii) The *h*-steps-ahead out-of-sample recurrent robust SSA forecasts $\hat{y}\_{N+1}, \ldots, \hat{y}\_{N+h}$ can be obtained as

$$\hat{y}\_t = \begin{cases} \widetilde{y}\_{t}, & \text{for } t = 1, \cdots, N, \\ \sum\_{j=1}^{L-1} \hat{a}\_j \hat{y}\_{t-j}, & \text{for } t = N+1, \cdots, N+h, \end{cases} \tag{12}$$

where $\widetilde{y}\_1, \ldots, \widetilde{y}\_N$ are the fitted values of the reconstructed time series, as obtained from the robust SSA algorithm in (i).
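Given the reconstructed series and the coefficients from Equation (11), the forecasts of Equation (12) follow by iterating the linear recurrence; a minimal Python sketch:

```python
def recurrent_forecast(fitted, a, h):
    """Recurrent SSA forecasting, Eq. (12): extend the reconstructed
    series fitted = (y~_1, ..., y~_N) by h steps, each new value being
    the linear combination sum_{j=1}^{L-1} a_j * y^_{t-j} with the
    recurrence coefficients a = (a_1, ..., a_{L-1}) of Eq. (11)."""
    y = list(fitted)
    for _ in range(h):
        y.append(sum(a_j * y[-j] for j, a_j in enumerate(a, start=1)))
    return y  # length N + h; the last h entries are the forecasts
```

With *a* = (2, −1), for instance, the recurrence reduces to $\hat{y}\_t = 2\hat{y}\_{t-1} - \hat{y}\_{t-2}$ and extrapolates a straight line.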

#### *2.6. Accuracy Measures*

There are several methods and measures for assessing model accuracy based on the behavior of the model errors. Here, two types of errors are considered: the model fit (in-sample) errors and the forecast (out-of-sample) errors.


Typically, the root mean squared error (RMSE) is used as a criterion for assessing the precision of a model. The RMSE used to investigate the quality of the model fit can be written as:

$$RMSE = \sqrt{\frac{1}{N} \sum\_{t=1}^{N} (y\_t - \hat{y}\_t)^2} \,\tag{13}$$

where *y<sub>t</sub>* are the observed values and $\hat{y}\_t$ the values fitted by the considered model/algorithm (i.e., ARIMA, SSA, or robust SSA).
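Equation (13) is a one-liner in any language; an illustrative Python transcription:

```python
def rmse(observed, fitted):
    """Root mean squared error of Eq. (13): square root of the average
    squared difference between observed and fitted values."""
    n = len(observed)
    return (sum((o - f) ** 2 for o, f in zip(observed, fitted)) / n) ** 0.5
```

The forecast-accuracy version of Equation (14) is the same computation restricted to the last *g* observations and their h-steps-ahead forecasts.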

To investigate the forecasting accuracy, assume that the last *g* observations are used as a reference (i.e., as the test set), and let *N*<sub>0</sub> = *N* − *h* − *g*. The RMSE used to investigate the quality of the forecasting model can be written as:

$$RMSE = \sqrt{\frac{1}{g} \sum\_{t=N\_0+h+1}^{N} (y\_t - \hat{y}\_t)^2},\tag{14}$$

where *y<sub>t</sub>* are the last *g* observed values and $\hat{y}\_t$ the respective *h*-steps-ahead forecast values.

#### **3. Results and Discussion**

In this section, comparisons are made between the classical ARIMA model, the classical SSA algorithm, and the robust SSA algorithms, in terms of computational time and accuracy for model fit and model forecast. These comparisons for decomposing and forecasting time series are done by considering a simulation example and the time series of six mutual investment funds.

Table 1 shows the descriptive statistics for the six mutual investment funds, including the minimum, maximum, and mean returns. It is clear that Alaska Black is the fund with the largest variation and the highest mean daily return. At the other end are Gavea Macro and SPX Nimitz, which show the smallest variations among the considered funds and low mean returns.

In addition to the descriptive measures, Figure 1 shows the behavior of the six investment funds over time. From these plots, it is possible to observe that all funds have an overall growing tendency, with similar patterns for Gavea Macro and SPX Nimitz.

**Figure 1.** Time series for the returns of the six mutual investment funds, ADAM Strategy, Alaska Black, APEX Long Biased, Brasil Capital, Gávea Macro and SPX Nimitz, from left to right and from top to bottom. The vertical axes show the quota values; i.e., the total net assets of a fund divided by the total number of existing quotas.


**Table 1.** Descriptive measures for returns of the six mutual investment funds.

#### *3.1. Model Fit*

The models/algorithms under comparison for model fit are: (i) ARIMA, (ii) SSA, (iii) robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and (iv) robust SSA based on the Huber function (RHSSA).

The parameters of the ARIMA model for each of the six mutual investment funds were estimated with the function "auto.arima" from the R package "forecast" [36].

For the SSA and robust SSA algorithms, there are two choices to be made by the researcher: (i) the window length *L*; and (ii) the number of eigentriples used for reconstruction, *r*. Three values of *L* were chosen for each time series, as defined in Table 2: *L*<sup>1</sup> = *N*/20, *L*<sup>2</sup> = *N*/2, and *Lp*, where *Lp* is obtained from the periodogram, based on the largest cycle of each time series [37] (i.e., about one trimester for ADAM Strategy, one semester for Alaska Black, one year for APEX Long Biased, and one quadrimester each for Brasil Capital, Gavea Macro, and SPX Nimitz), and *N* is the time series length. The choice of the number of eigentriples *r* used for reconstruction, for each of the considered window lengths and each time series, was made by taking into consideration the w-correlations among components [5]. Figure 2 shows the w-correlation matrices for each of the six mutual investment funds, considering a window length *L* = *N*/20, and Figure A1 in the appendix shows the corresponding matrices for a window length *L* = *N*/2. The w-correlation matrices can be obtained with the function "wcor" of the R package "Rssa" [38], and the number of eigentriples *r* should be chosen to maximize the separability between signal and noise components; i.e., to maximize the w-correlation among signal components, maximize the w-correlation among noise components, and minimize the w-correlation between signal and noise components. A summary of the number of eigentriples used for the reconstruction of each time series, for each window length considered, can be seen in Table 2.

Since one of the objectives in SSA is to decompose the original time series into interpretable components such as trend and seasonality, plus the noise component that is then discarded, Figure 3 shows the original time series for the Alaska Black mutual investment fund, its trend component (sum of individual trend components), its seasonal component (sum of individual seasonal components), and its residuals (sum of the remaining components associated with noise), considering a window length *L* = *N*/20 = 33 and *r* = 12 eigentriples for reconstruction. Similar SSA decompositions for ADAM Strategy, APEX Long Biased, Brasil Capital, Gavea Macro, and SPX Nimitz (considering the window length *L*<sup>1</sup> and *r*<sup>1</sup> eigentriples used for reconstruction, as defined in Table 2) can be found in Figures A2–A6 of the appendix, respectively.

**Figure 2.** W-correlation matrices for each of the six mutual investment funds, ADAM Strategy, Alaska Black, APEX Long Biased, Brasil Capital, Gávea Macro and SPX Nimitz, from left to right and from top to bottom, considering a window length *L* = *N*/20.

**Figure 3.** Decomposition of the original time series for the Alaska Black mutual investment fund (top panel), with a trend component (sum of individual trend components, second panel), a seasonal component (sum of individual seasonal components, third panel), and a residual (sum of the remaining components associated with noise, bottom panel), considering a window length *L* = *N*/20 = 33 and *r* = 12 eigentriples for reconstruction.


**Table 2.** Window length *L*1, *L*2, and *Lp*, and number of eigentriples *r* considered for model fit and model forecast for each of the mutual investment funds.

In order to evaluate and compare the model fit of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), the root mean square error (RMSE) was calculated for each time series. Tables 3–5 show the RMSE for model fit by each of the four models applied to each of the six mutual investment funds, considering window lengths of *L* = *N*/2, *L* = *N*/20, and the length of the largest cycle of each time series, respectively (Table 2). From the analysis of these tables, we can conclude that the ARIMA model shows an overall better performance when the window length in the SSA-related algorithms is set to half the time series length (Table 3). However, when the window length is set to *L*<sup>1</sup> = *N*/20 or *Lp* (i.e., equal to the length of the largest cycle), the classical SSA provides the best results, while the ARIMA model and the robust SSA algorithms alternate as second best. For all choices of window length, the two robust SSA algorithms behaved similarly.

**Table 3.** Root mean square error for each of the six mutual investment funds, considering each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), for the window length *L*<sup>2</sup> = *N*/2 and considering *r*<sup>2</sup> eigentriples for reconstruction as defined in Table 2.


**Table 4.** Root mean square error for each of the six mutual investment funds, considering each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), for the window length *L*<sup>1</sup> = *N*/20 and considering *r*<sup>1</sup> eigentriples for reconstruction as defined in Table 2.

**Table 5.** Root mean square error for each of the six mutual investment funds, considering each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), for the window length *Lp* (i.e., the length of the longest cycle) and considering *rp* eigentriples for reconstruction as defined in Table 2.




Tables 6–8 show the computational times for each combination of model/algorithm and mutual investment fund, as presented in Tables 3–5, respectively. From the analysis of these tables, we can conclude that the best performance was obtained by the ARIMA and SSA algorithms. The computational time for the classic and robust SSA algorithms increases with the window length *L*. Moreover, for larger trajectory matrices (i.e., considering *L* = *N*/2), the robust SSA algorithm based on the Huber function has a lower computational time than the robust SSA algorithm based on the *L*<sup>1</sup> norm (Table 6). However, when the trajectory matrices are more rectangular (i.e., considering *L* = *N*/20, Table 7, or *L* = *Lp*, Table 8), the robust SSA algorithm based on the *L*<sup>1</sup> norm has a much lower computational time (comparable to the ARIMA and SSA computational times) than the robust SSA algorithm based on the Huber function.

**Table 6.** Computational time, in minutes, for each of the six mutual investment funds, considering each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), for the window length *L*<sup>2</sup> = *N*/2 and considering *r*<sup>2</sup> eigentriples for reconstruction as defined in Table 2.


**Table 7.** Computational time, in minutes, for each of the six mutual investment funds, considering each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), for the window length *L*<sup>1</sup> = *N*/20 and considering *r*<sup>1</sup> eigentriples for reconstruction as defined in Table 2.


**Table 8.** Computational time, in minutes, for each of the six mutual investment funds, considering each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm (RLSSA), and robust SSA based on the Huber function (RHSSA), for the window length *Lp* (i.e., the length of the longest cycle) and considering *rp* eigentriples for reconstruction as defined in Table 2.


Figure 4 shows the original time series and the model fits by the SSA algorithm with *L* = *N*/20 and by the ARIMA model. We can confirm that both fits almost overlap and are very close to the original time series, which was expected given the small RMSE shown in Table 4.

**Figure 4.** Original time series (black line); smoothed time series after applying the SSA considering *L* = *N*/20, with the number of eigentriples *r* as they are defined in Table 2 (red line); and model fit by the ARIMA model (green line), for each of the six mutual investment funds, ADAM Strategy, Alaska Black, APEX Long Biased, Brasil Capital, Gávea Macro and SPX Nimitz, from left to right and from top to bottom. The vertical axes show the quota values.

#### *3.2. Model Forecasting*

In this section we compare the forecasting abilities of ARIMA; SSA with *L* = *N*/2, with *L* = *N*/20, and with *L* = *Lp* (based on the largest cycle of each time series); and robust SSA based on the *L*<sup>1</sup> norm with *L* = *N*/20 and *Lp*. The robust SSA algorithm based on the Huber function was not considered because of its similarity, in terms of RMSE, to the robust SSA based on the *L*<sup>1</sup> norm (Tables 3–5) and its much higher computational time (Tables 6–8). A similar argument applies to not presenting results for the robust SSA algorithm based on the *L*<sup>1</sup> norm with *L* = *N*/2.

Table 9 shows the RMSE for model forecasting for each of the six mutual investment funds, considering the models ARIMA, SSA with *L* = *N*/2, SSA with *L* = *N*/20, SSA with *L* = *Lp*, and robust SSA based on the *L*<sup>1</sup> norm (RLSSA) with *L* = *N*/20 and *Lp*, with the window lengths and eigentriples used for reconstruction as defined in Table 2. These values were obtained based on forecasting the last *g* = 12 observations of each time series, for one, five, and ten steps ahead out-of-sample; i.e., one day ahead, one week ahead, and two weeks ahead.

The overall best performance was obtained with the classic SSA algorithm using a lower value for the window length, either *L* = *N*/20 or *L* = *Lp*, followed closely by ARIMA and the robust SSA algorithm based on the *L*<sup>1</sup> norm. The ARIMA model obtained the best performance in three cases for one-step-ahead forecasting, and the robust SSA algorithm based on the *L*<sup>1</sup> norm with *L* = *N*/20 yielded the best performance for a couple of time series for five-steps-ahead forecasting. As expected, the RMSE shows an overall increase with the number of steps ahead to be forecast. A possible explanation for the similarity between the SSA and robust SSA algorithms is the possible lack of outliers in the data. Table 10 shows the computational time for model forecasting for each of the six mutual investment funds, considering each of the five models shown in Table 9. As expected after analyzing the computational times for model fit (Tables 6–8), the best performance in terms of computational time for model forecasting was obtained by the ARIMA and SSA (with lower values for the window length) models, and the worst by the robust SSA algorithm based on the *L*<sup>1</sup> norm.

**Table 9.** Root mean square error for model forecasting for each of the six mutual investment funds, considering the models ARIMA, SSA with *L* = *N*/2, SSA with *L* = *N*/20, SSA with *Lp*, robust SSA based on the *L*<sup>1</sup> norm (RLSSA) with *L* = *N*/20, and RLSSA with *Lp*, and their respective eigentriples, as defined in Table 2.


**Table 10.** Computational time, in minutes, for model forecasting for each of the six mutual investment funds, considering the models ARIMA, SSA with *L* = *N*/2, SSA with *L* = *N*/20, SSA with *Lp*, robust SSA based on the *L*<sup>1</sup> norm (RLSSA) with *L* = *N*/20, and RLSSA with *Lp*, and their respective eigentriples, as defined in Table 2.


#### *3.3. Simulation Example*

To verify the hypothesis raised in the previous subsection, namely that the similarity between the results of the SSA and robust SSA algorithms may be due to the lack of outliers in the time series, in this subsection we present a simulation example in which the methods are compared on a time series contaminated with outlying observations. The synthetic data were obtained by generating random values from the following function and transforming them into a time series (right-hand plot in Figure 5):

$$f(t) = \exp\{0.02t + 0.5\sin(2\pi t/5)\} + \epsilon, \quad t = 1, \dots, 100,$$

where $\epsilon$ is noise generated from the *N*(0, 0.1) distribution. A total of 100 simulated time series were considered.
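A Python sketch of this data-generating process (the contamination helper is hypothetical: the paper's exact outlier magnitudes are not reproduced here, so `fraction` and `size` are placeholder choices):

```python
import math
import random

def simulate_series(n=100, sigma=0.1, seed=1):
    """One synthetic series y_t = exp{0.02 t + 0.5 sin(2 pi t / 5)} + eps_t,
    t = 1..n, with eps_t drawn from N(0, sigma)."""
    rng = random.Random(seed)
    return [math.exp(0.02 * t + 0.5 * math.sin(2 * math.pi * t / 5))
            + rng.gauss(0.0, sigma) for t in range(1, n + 1)]

def add_outliers(y, fraction=0.05, size=3.0, seed=2):
    """Additive contamination: shift a random `fraction` of the points by
    `size`. Both values are placeholders, not the paper's exact scheme."""
    rng = random.Random(seed)
    y = list(y)
    for i in rng.sample(range(len(y)), max(1, int(fraction * len(y)))):
        y[i] += size
    return y

clean = simulate_series()
dirty = add_outliers(clean)
```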

**Figure 5.** Synthetic data without contamination (**right**), data with 5% additive outliers (**left**), and data with 5% multiplicative outliers (**center**). The vertical axes show the simulated value and the horizontal axes show the index of the simulated observation.

The data contamination, for illustration purposes, was performed by considering additive outliers and magnitude-increase (multiplicative) outliers in the following way:


Table 11 shows the mean of the root mean square errors for model fit, computed for each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm, and robust SSA based on the Huber function, for the simulated data, based on 100 runs, using *L* = 24 and *r* = 5, and considering both contamination scenarios with 2, 5, and 10% outliers. As expected, when there is no data contamination, the classic SSA model is the most appropriate. For the mild contamination scenario with additive outliers, the robust SSA algorithms outperform both the ARIMA and SSA models, their better performance becoming more evident as the percentage of outliers increases. For the more extreme contamination scenario with multiplicative outliers, a similar pattern was obtained, with RLSSA being the best robust algorithm in this simulation example.

Appendix B includes a second simulation scenario where the robust SSA algorithm based on the Huber function (RHSSA) outperforms the classic ARIMA and SSA models and the robust SSA algorithm based on the *L*<sup>1</sup> norm (RLSSA).

Table 12 shows the mean of the root mean square errors for model forecasting (*M* = 1, 5, and 10 steps-ahead), computed for each of ARIMA, SSA, and robust SSA based on the *L*<sup>1</sup> norm, for the simulated data, based on 100 runs, using *L* = 24 and *r* = 5. The results for the robust SSA based on the Huber function were not included because of its computational cost and lack of out-performance when compared with the robust SSA based on the *L*<sup>1</sup> norm. Again, as expected, the SSA model yielded the best performance when there was no data contamination. For the scenarios with data contamination, the best performance was obtained by the robust SSA forecasting algorithm, with a very large decrease in RMSE in many scenarios.


**Table 11.** Mean of the root mean square errors for model fit, computed for each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm, and robust SSA based on the Huber function, for the simulated data, based on 100 runs, using *L* = 24 and *r* = 5.


**Table 12.** Mean of the root mean square errors for model forecasting (*M* = 1, 5, and 10 steps-ahead), computed for each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm, and robust SSA based on the Huber function, for the simulated data, based on 100 runs, using *L* = 24 and *r* = 5.

\* 10% trimmed mean. The mean value is 1.566 × 10<sup>6</sup>.

#### **4. Conclusions**

In this paper we considered the problem of model fit and model forecasting in time series. In particular, we analyzed six mutual investment funds. Following up on [23], who proposed a robust SSA algorithm by replacing the standard least squares SVD by a robust SVD algorithm based on the *L*<sup>1</sup> norm [24] for model fit, we proposed another robust SSA algorithm where the robust SVD based on the Huber function is considered [25]. Moreover, we propose a forecasting strategy for the robust SSA algorithms, based on the linear recurrent SSA forecasting algorithm.

Comparisons were made between the classical SSA algorithm, the robust SSA algorithms, and the classical ARIMA model, both in terms of computational time and accuracy for model fit and model forecast. Those comparisons were made by using daily observations of six mutual investment funds, and a synthetic data set where the time series were contaminated with outlying observations.

For model fit of the six mutual investment funds, the best results were obtained for the SSA model when the window length *L* was set to be equal to the length of the time series divided by 20, or when the window length is defined as the length of the largest cycle in the time series. The ARIMA model and the robust SSA algorithms alternated for the second best performance. For model forecasting of the six mutual investment funds, the best overall performance was obtained for the classic SSA model considering a lower value for the window length, *L* = *N*/20 or *Lp*, followed closely by the ARIMA model and the robust SSA algorithm based on the *L*<sup>1</sup> norm.

Based on the similarity between the results from the classic SSA model and the robust SSA algorithms, both for model fit and for model forecasting, one may assume that the time series data from the six mutual investment funds had little or no contamination. To assess that hypothesis, and to better illustrate the usefulness of the robust SSA algorithms in a scenario with known and controlled outliers, a simulation study and its results were presented in this article. For both the mild and the more extreme contamination scenarios, the robust SSA algorithms clearly outperformed the classical ARIMA and SSA models, both for model fit and for model forecasting. Another important advantage of the robust SSA algorithms, because of their use of the robust SVD, is that they allow for missing values.

In terms of computational time, the SSA model gives the best performance, the robust algorithms being the most time consuming. A possible future development to reduce the computational time in the robust SSA algorithms is to consider a similar strategy as in [39], where a randomized SVD algorithm was used to speed up the SSA algorithm.

The usefulness of the proposed approach, regarding the forecasting case, can be assessed based on forecasting competitions (e.g., [40]) or large scale forecasting studies (see, e.g., [41]).

The methodology and results presented in this paper are of great generality and can be applied to other time series applications.

**Author Contributions:** Conceptualization, P.C.R.; Formal analysis, P.C.R., J.P. and P.M.; Methodology, P.C.R. and M.K.; Software, P.C.R., J.P., P.M. and M.K.; Supervision, P.C.R.; Visualization, J.P. and P.M.; Writing—original draft, P.C.R., J.P., P.M. and M.K.; Writing—review and editing, P.C.R., J.P. and M.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** The authors thank the associate editor and three anonymous reviewers for providing helpful suggestions which contributed to the improvement of the paper.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


ARIMA autoregressive integrated moving average

SSA singular spectrum analysis

SVD singular value decomposition

RLSSA robust SSA algorithm based on the *L*<sup>1</sup> norm

RHSSA robust SSA algorithm based on the Huber function


#### **Appendix A**

**Figure A1.** W-correlation matrices for each of the six mutual investment funds, ADAM Strategy, Alaska Black, APEX Long Biased, Brasil Capital, Gávea Macro and SPX Nimitz, from left to right and from top to bottom, considering a window length *L* = *N*/2.

**Figure A2.** Decomposition of the original time series for the ADAM Strategy mutual investment fund (top panel), with a trend component (sum of individual trend components, second panel), a seasonal component (sum of individual seasonal components, third panel), and a residual (sum of the remaining components associated with noise, bottom panel), considering a window length *L* = *N*/20 = 41 and *r* = 17 eigentriples used for reconstruction.

**Figure A3.** Decomposition of the original time series for the APEX Long Biased mutual investment fund (top panel), with a trend component (sum of individual trend components, second panel), a seasonal component (sum of individual seasonal components, third panel), and a residual (sum of the remaining components associated with noise, bottom panel), considering a window length *L* = *N*/20 = 80 and *r* = 14 eigentriples used for reconstruction.

**Figure A4.** Decomposition of the original time series for the Brasil Capital mutual investment fund (top panel), with a trend component (sum of individual trend components, second panel), a seasonal component (sum of individual seasonal components, third panel), and a residual (sum of the remaining components associated with noise, bottom panel), considering a window length *L* = *N*/20 = 88 and *r* = 12 eigentriples used for reconstruction.

**Figure A5.** Decomposition of the original time series for the Gávea Macro mutual investment fund (top panel), with a trend component (sum of individual trend components, second panel), a seasonal component (sum of individual seasonal components, third panel), and a residual (sum of the remaining components associated with noise, bottom panel), considering a window length *L* = *N*/20 = 140 and *r* = 12 eigentriples used for reconstruction.

**Figure A6.** Decomposition of the original time series for the SPX Nimitz mutual investment fund (top panel), with a trend component (sum of individual trend components, second panel), a seasonal component (sum of individual seasonal components, third panel), and a residual (sum of the remaining components associated with noise, bottom panel), considering a window length *L* = *N*/20 = 109 and *r* = 8 eigentriples used for reconstruction.

#### **Appendix B**

A second synthetic dataset was obtained by generating random values from the following function and then transforming them into a time series:

$$f(t) = \cos\left(2\pi wt + \phi\right) + \epsilon, \quad t = 1, \ldots, 100,$$

with *w* = 3/8, *φ* = *π*/8, and *ε* the noise generated from the *N*(0, 0.1) distribution (right-hand plot of Figure A7). A total of 100 simulated time series were considered.

The data contamination was done in the same manner as described before. An example of the 5% additive outliers scenario is shown in the left-hand plot of Figure A7, and an example of the 5% multiplicative outliers scenario in the central plot. The root mean square errors for model fit, computed for each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm, and robust SSA based on the Huber function, can be found in Table A1.


**Table A1.** Mean of the root mean square errors for model fit, computed for each of the four models, ARIMA, SSA, robust SSA based on the *L*<sup>1</sup> norm, and robust SSA based on the Huber function, for the simulated data, based on 100 runs, using *L* = 24 and *r* = 2.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Channels' Confirmation and Predictions' Confirmation: From the Medical Test to the Raven Paradox**

#### **Chenguang Lu**

Intelligence Engineering and Mathematics Institute, Liaoning Technical University, Fuxin 123000, China; survival99@gmail.com

Received: 24 January 2020; Accepted: 25 March 2020; Published: 26 March 2020

**Abstract:** After long arguments between positivism and falsificationism, the verification of universal hypotheses was replaced with the confirmation of uncertain major premises. Unfortunately, Hempel proposed the Raven Paradox. Then, Carnap used the increment of logical probability as the confirmation measure. So far, many confirmation measures have been proposed. Among them, measure *F*, proposed by Kemeny and Oppenheim, possesses the symmetries and asymmetries proposed by Eells and Fitelson, the monotonicity proposed by Greco et al., and the normalization property suggested by many researchers. Based on the semantic information theory, a measure *b*\* similar to *F* is derived from the medical test. Like the likelihood ratio, measures *b*\* and *F* can only indicate the quality of channels or testing means, not the quality of probability predictions. Furthermore, it is still not easy to use *b*\*, *F*, or another measure to clarify the Raven Paradox. For this reason, measure *c*\*, similar to the correct rate, is derived. Measure *c*\* supports the Nicod Criterion and undermines the Equivalence Condition, and hence can be used to eliminate the Raven Paradox. An example indicates that measures *F* and *b*\* are helpful for diagnosing infection with the novel coronavirus, whereas most popular confirmation measures are not. Another example reveals that none of the popular confirmation measures can explain why a black raven confirms "Ravens are black" more strongly than a piece of chalk does. Measures *F*, *b*\*, and *c*\* indicate that the existence of fewer counterexamples is more important than the existence of more positive examples, and hence are compatible with Popper's falsification thought.

**Keywords:** relative entropy; cross-entropy; uncertain reasoning; inductive logic; confirmation measure; semantic information; medical test; raven paradox

#### **1. Introduction**

A universal judgment is equivalent to a hypothetical judgment or a rule; for example, "All ravens are black" is equivalent to "For every *x*, if *x* is a raven, then *x* is black". Both can be used as the major premise of a syllogism. Deductive logic needs major premises; however, some major premises for empirical reasoning must be supported by inductive logic. Logical empiricism affirmed that a universal judgment can be verified finally by sense data. Against logical empiricism, Popper argued that a universal judgment could only be falsified rather than verified. However, for a universal or hypothetical judgment that is not strict, and is therefore uncertain, such as "Almost all ravens are black", "Ravens are black", or "If a man's coronavirus test is positive, then he is very possibly infected", we cannot say that one counterexample falsifies it. After long arguments, Popper and most logical empiricists reached the same conclusion [1,2]: we may use evidence to confirm universal judgments or major premises that are not strict or are uncertain.

In 1945, Hempel [3] proposed the confirmation paradox, or the Raven Paradox. According to the Equivalence Condition in classical logic, "If *x* is a raven, then *x* is black" (Rule I) is equivalent to "If *x* is not black, then *x* is not a raven" (Rule II). A piece of white chalk supports Rule II, and hence also supports Rule I. However, according to the Nicod criterion [4], a black raven supports Rule I, a non-black raven undermines Rule I, and a non-raven thing, such as a black cat or a piece of white chalk, is irrelevant to Rule I. Hence, there exists a paradox between the Equivalence Condition and the Nicod criterion.

To quantify confirmation, both Carnap [1] and Popper [2] proposed confirmation measures; however, only Carnap's are famous. So far, researchers have proposed many confirmation measures [1,5–13]; the induction problem has seemingly become the confirmation problem. To screen reasonable confirmation measures, Eells and Fitelson [14] proposed **symmetries and asymmetries** as desirable properties; Crupi et al. [8] and Greco et al. [15] suggested **normalization** (measures between −1 and 1) as a desirable property; and Greco et al. [16] proposed **monotonicity** as a desirable property. Among popular confirmation measures, only measure *F* (proposed by Kemeny and Oppenheim) and measure *Z* possess all these desirable properties. Measure *Z* was proposed by Crupi et al. [8] as the normalization of some other confirmation measures; it is also the certainty factor proposed by Shortliffe and Buchanan [7].

When the author of this paper researched semantic information theory [17], he found that an uncertain prediction could be treated as the combination of a clear prediction and a tautology; the combining proportion of the clear prediction could be used as the degree of belief; the degree of belief optimized with a sampling distribution could be regarded as a confirmation measure. This measure is denoted by *b*\*; it is similar to measure *F* and also possesses the above-mentioned desirable properties.

Good confirmation measures should possess not only mathematically desirable properties but also practicabilities. We can use medical tests to check their practicabilities. We use the degree of belief to represent the degree to which we believe a major premise and use the degree of confirmation to denote the degree of belief that is optimized by a sample or some examples. The former is subjective, whereas the latter is objective. A medical test provides the test-positive (or the test-negative) to predict if a person or a specimen is infected (or uninfected). Both the test-positive and the test-negative have degrees of belief and degrees of confirmation. In medical practices, there exists an important issue: if two tests provide different results, which test should we believe? For example, when both Nucleic Acid Test (NAT) and CT (Computed Tomography) are used to diagnose the infection of Novel Coronavirus Disease (COVID-19), if the result of NAT is negative and the result of CT is positive, which should we believe? According to the sensitivity and the specificity [18] of a test and the prior probability of the infection, we can use any confirmation measure to calculate the degrees of confirmation of the test-positive and the test-negative. Using popular confirmation measures, can we provide reasonable degrees of confirmation to help us choose a better result from NAT-negative and CT-positive? Can these degrees of confirmation reflect the probability of the infection?

This paper will show that only measures that are the functions of the likelihood ratio, such as *F* and *b*\*, can help us to diagnose the infection or choose a better result that can be accepted by the medical society. However, measures *F* and *b*\* do not reflect the probability of the infection. Furthermore, using *F*, *b*\*, or another measure, it is still difficult to eliminate the Raven Paradox.

Recently, the author found that the problem with the Raven Paradox is different from the problem with medical diagnosis. Measures *F* and *b*\* indicate how good the testing means are, not how good the probability predictions are. To clarify the Raven Paradox, we need a confirmation measure that indicates how good a probability prediction is. The confirmation measure *c*\* is hence derived. We call *c*\* a prediction confirmation measure and *b*\* a channel confirmation measure. The distinction between channels' confirmation and predictions' confirmation is similar to, yet different from, the distinction between Bayesian confirmation and Likelihoodist confirmation [19]. Measure *c*\* accords with the Nicod criterion and undermines the Equivalence Condition, and hence can be used to eliminate the Raven Paradox.

The main purposes of this paper are:


The confirmation methods in this paper are different from popular methods, since:


The main contributions of this paper are:


The rest of this paper is organized as follows. Section 2 provides background: it reviews existing confirmation measures, introduces the related semantic information method, and clarifies some questions about confirmation. Section 3 derives the new confirmation measures *b*\* and *c*\*, with the medical test as an example; it also provides many confirmation formulas for major premises with different antecedents and consequents. Section 4 presents results: cases that show the characteristics of the new confirmation measures, a comparison of various confirmation measures applied to the diagnosis of COVID-19, and an illustration of how an added example affects the degrees of confirmation under different measures. Section 5 discusses why the Raven Paradox can be eliminated only with measure *c*\*, addresses some conceptual confusion, explains how the new confirmation measures are compatible with Popper's falsification thought, and ends with conclusions.

#### **2. Background**

#### *2.1. Statistical Probability, Logical Probability, Shannon's Channel, and Semantic Channel*

First, we distinguish logical probability from statistical probability. The logical probability of a hypothesis (or a label) is the probability with which the hypothesis is judged to be true, whereas its statistical probability is the probability with which the hypothesis or the label is selected.

Suppose that ten thousand people go through a door. For each person, denoted by *x*, entrance guards judge whether *x* is elderly. If two thousand people are judged to be elderly, then the logical probability of the predicate "*x* is elderly" is 2000/10,000 = 0.2. If the task of the entrance guards is instead to select one label for every person from the four labels "Child", "Youth", "Adult", and "Elderly", there may be only one thousand people who are labeled "Elderly", so the statistical probability of "Elderly" is 1000/10,000 = 0.1. Why are not two thousand people labeled "Elderly"? The reason is that some elderly people are labeled "Adult". A person may make two labels true; for example, a 65-year-old person makes both "Adult" and "Elderly" true. That is why the logical probability of a label is often greater than its statistical probability. An extreme example is a tautology, such as "*x* is elderly or not elderly": its logical probability is 1, whereas its statistical probability is almost 0 in general, because a tautology is rarely selected. Statistical probability is normalized (the sum is 1), whereas logical probability is not normalized in general [17]. Therefore, we use two different symbols, "*P*" and "*T*", to distinguish statistical probability and logical probability.
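The door example can be written out directly (the numbers are those given in the text):

```python
# Logical probability counts how often a label would be judged TRUE of a person;
# statistical probability counts how often the label is actually SELECTED.
n_people = 10_000
n_true_elderly = 2_000      # people of whom "x is elderly" is judged true
n_selected_elderly = 1_000  # people for whom "Elderly" is the chosen label

logical_p = n_true_elderly / n_people          # T("Elderly") = 0.2
statistical_p = n_selected_elderly / n_people  # P("Elderly") = 0.1
print(logical_p, statistical_p)  # 0.2 0.1
```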

We now consider the Shannon channel [21] between human ages and labels "Child", "Adult", "Youth", "Middle age", "Elderly", and the like.

Let *X* be a random variable to denote an age and *Y* be a random variable to denote a label. *X* takes a value *x*∈{ages}; Y takes a value *y*∈{"Child", "Adult", "Youth", "Middle age", "Elderly", ... }. Shannon calls the prior probability distribution *P*(*X*) (or *P*(*x*)) the source, and calls *P*(*Y*) the destination. There is a Shannon channel *P*(*Y*|*X*) from *X* to *Y*. It is a transition probability matrix:

$$P(Y|X) \Leftrightarrow \begin{bmatrix} P(y\_1|x\_1) & P(y\_1|x\_2) & \dots & P(y\_1|x\_m) \\ P(y\_2|x\_1) & P(y\_2|x\_2) & \dots & P(y\_2|x\_m) \\ \dots & \dots & \dots & \dots \\ P(y\_n|x\_1) & P(y\_n|x\_2) & \dots & P(y\_n|x\_m) \end{bmatrix} \Leftrightarrow \begin{bmatrix} P(y\_1|x) \\ P(y\_2|x) \\ \dots \\ P(y\_n|x) \end{bmatrix} \tag{1}$$

where ⇔ indicates equivalence. This matrix consists of a group of conditional probabilities *P*(*yj*|*xi*) (*j* = 1, ... , *n*; *i* = 1, ... , *m*) or a group of transition probability functions (so called by Shannon [21]), *P*(*yj*|*x*) (*j* = 1, ... , *n*), where *yj* is a constant and *x* is a variable.
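A toy numeric channel (labels and probabilities invented for illustration) shows the structure: each column of *P*(*Y*|*X*) sums to 1, since some label is always selected for each *x*:

```python
import numpy as np

# Rows are labels y_j ("Child", "Adult", "Elderly"); columns are age groups x_i.
P_Y_given_X = np.array([
    [0.9, 0.1, 0.0],   # "Child"
    [0.1, 0.8, 0.3],   # "Adult"
    [0.0, 0.1, 0.7],   # "Elderly"
])
assert np.allclose(P_Y_given_X.sum(axis=0), 1.0)  # columns are distributions over labels

# Classical Bayes prediction P(x|y_j) = P(x) P(y_j|x) / P(y_j) from a prior P(x)
P_x = np.array([0.3, 0.5, 0.2])
P_y = P_Y_given_X @ P_x                           # P(y_j) = sum_i P(y_j|x_i) P(x_i)
P_x_given_elderly = P_x * P_Y_given_X[2] / P_y[2]  # posterior given label "Elderly"
print(P_x_given_elderly)
```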

There is also a semantic channel, which consists of a group of truth functions. Let *T*(θ*j*|*x*) be the truth function of *yj*, where θ*<sup>j</sup>* is a model or a set of model parameters by which we construct *T*(θ*j*|*x*). The θ*<sup>j</sup>* is also explained as a fuzzy subset of the domain of *x* [17]. For example, for *yj* = "*x* is young", the truth function may be

$$T(\theta\_j|x) = \exp[-(x - 20)^2/25], \tag{2}$$

where 20 and 25 are model parameters. For *yk* = "*x* is elderly", its truth function may be a logistic function:

$$T(\theta\_k|x) = 1/\{1 + \exp[-0.2(x - 65)]\}, \tag{3}$$

where 0.2 and 65 are model parameters. The two truth functions are shown in Figure 1.

**Figure 1.** The truth functions of two hypotheses about ages.
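Equations (2) and (3) translate directly to code:

```python
import math

def T_young(x):
    # Equation (2): truth function of "x is young"; 20 and 25 are the model parameters
    return math.exp(-(x - 20) ** 2 / 25)

def T_elderly(x):
    # Equation (3): logistic truth function of "x is elderly"; parameters 0.2 and 65
    return 1.0 / (1.0 + math.exp(-0.2 * (x - 65)))

print(T_young(20), T_elderly(65))  # 1.0 0.5
```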

According to Tarski's truth theory [22] and Davidson's truth-conditional semantics [23], a truth function can represent the semantic meaning of a hypothesis. Therefore, we call the matrix, which consists of a group of truth functions, a semantic channel:

$$T(\theta|X) \Leftrightarrow \begin{bmatrix} T(\theta\_1|x\_1) & T(\theta\_1|x\_2) & \dots & T(\theta\_1|x\_m) \\ T(\theta\_2|x\_1) & T(\theta\_2|x\_2) & \dots & T(\theta\_2|x\_m) \\ \dots & \dots & \dots & \dots \\ T(\theta\_n|x\_1) & T(\theta\_n|x\_2) & \dots & T(\theta\_n|x\_m) \end{bmatrix} \Leftrightarrow \begin{bmatrix} T(\theta\_1|x) \\ T(\theta\_2|x) \\ \dots \\ T(\theta\_n|x) \end{bmatrix}. \tag{4}$$

Using a transition probability function *P*(*yj*|*x*), we can make the probability prediction *P*(*x*|*yj*) by

$$P(\mathbf{x}|y\_j) = P(\mathbf{x})P(y\_j|\mathbf{x})/P(y\_j),\tag{5}$$

which is the classical Bayes' formula. Using a truth function *T*(θ*j*|*x*), we can also make a probability prediction or produce a likelihood function by

$$P(\mathbf{x}|\boldsymbol{\theta}\_{\dot{j}}) = P(\mathbf{x})T(\boldsymbol{\theta}\_{\dot{j}}|\mathbf{x})/T(\boldsymbol{\theta}\_{\dot{j}}),\tag{6}$$

where *T*(θ*j*) is the logical probability of *yj*. There is

$$T(\boldsymbol{\theta}\_{\boldsymbol{j}}) = \sum\_{i} P(\mathbf{x}\_{i}) T(\boldsymbol{\theta}\_{\boldsymbol{j}} | \mathbf{x}\_{i}). \tag{7}$$

Equation (6) is called the semantic Bayes' formula [17]. The likelihood function is subjective; it may be regarded as a hybrid of logical probability and statistical probability.
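A minimal sketch of the semantic Bayes prediction of Equations (6) and (7), assuming a uniform prior over ages 0–100 purely for illustration:

```python
import numpy as np

x = np.arange(0, 101)                          # ages 0..100
P_x = np.full(101, 1 / 101)                    # uniform prior, for illustration only
T_elderly = 1 / (1 + np.exp(-0.2 * (x - 65)))  # truth function of "x is elderly", Eq. (3)

T_theta = np.sum(P_x * T_elderly)              # logical probability T(theta_j), Eq. (7)
P_x_given_theta = P_x * T_elderly / T_theta    # semantic Bayes prediction, Eq. (6)
print(P_x_given_theta.sum())                   # normalized by construction
```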

When the source *P*(*x*) is changed, the above formulas for predictions still work. It is easy to prove that *P*(*x*|θ*j*) = *P*(*x*|*yj*) as *T*(θ*j*|*x*)∝*P*(*yj*|*x*). Since the maximum of *T*(θ*j*|*x*) is 1, letting *P*(*x*|θ*j*) = *P*(*x*|*yj*), we can obtain the optimized truth function [17]:

$$T^{*}(\theta\_j|x) = \frac{P(x|y\_j)/P(x)}{\max[P(x|y\_j)/P(x)]} = \frac{P(y\_j|x)}{\max[P(y\_j|x)]},\tag{8}$$

where *x* is a variable and max(·) is the maximum of the function in brackets.
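Reading Equation (8) as normalizing the transition probability function by its maximum, a minimal sketch with invented values of *P*(*yj*|*x*):

```python
import numpy as np

# Invented values of P(y_j|x) over five values of x, for illustration only
P_yj_given_x = np.array([0.02, 0.10, 0.35, 0.60, 0.70])

# Optimized truth function: divide by the maximum so the largest value becomes 1
T_star = P_yj_given_x / P_yj_given_x.max()
print(T_star.max())  # 1.0
```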

#### *2.2. To Review Popular Confirmation Measures*

We use *h*<sup>1</sup> to denote a hypothesis, *h*<sup>0</sup> to denote its negation, and *h* to denote one of them. We use *e*<sup>1</sup> as another hypothesis as the evidence of *h*1, *e*<sup>0</sup> as its negation, and *e* as one of them. We use *c*(*e*, *h*) to represent a confirmation measure, which means the degree of inductive support. Note that *c*(*e, h*) here is used as in [8], where *e* is on the left, and *h* is on the right.

In the existing studies of confirmation, logical probability and statistical probability are not clearly distinguished. We still use *P* for both when introducing popular confirmation measures.

The popular confirmation measures include:


Two measures *D* and *C* proposed by Carnap are for incremental confirmation and absolute confirmation respectively. There are more confirmation measures in [8,24]. Measure *F* is also denoted by *l*\* [13], *L* [8], or *k* [24]. Most authors explain that probabilities they use, such as *P*(*h*1) and *P*(*h*1|*e*1) in *D*, *R*, and *C*, are logical probabilities. Some authors explain that probabilities they use, such as *P*(*e*1|*h*1) in *F*, are statistical probabilities.

Firstly, we need to clarify that confirmation is to assess what kind of evidence supports what kind of hypotheses. Let us have a look at the following three hypotheses:

• Hypothesis 1: *h*1(*x*) = "*x* is elderly", where *x* is a variable for an age and *h*1(*x*) is a predicate. An instance *x* = 70 may be the evidence, and the truth value *T*(θ1|70) of the proposition *h*1(70) should be 1. If *x* = 50, the (uncertain) truth value should be smaller, such as 0.5. Letting *e*<sup>1</sup> = "*x* ≥ 60", a true *e*<sup>1</sup> may also be evidence that supports *h*<sup>1</sup>, so that *T*(θ1|*e*1) > *T*(θ1).


Hypothesis 1 has an (uncertain) truth function, or conditional logical probability function, between 0 and 1, which is ascertained by our definition or usage; it need not be confirmed. Hypothesis 2 or Hypothesis 3 is what we need to confirm. The degree of confirmation is between −1 and 1.

There exist two different understandings about *c*(*e*, *h*):


Fortunately, although researchers understand *c*(*e*, *h*) in different ways, most researchers agree to use a sample including four types of examples (*e*1, *h*1), (*e*0, *h*1), (*e*1, *h*0), and (*e*0, *h*0) as the evidence to confirm a rule and to use the four examples' numbers *a, b, c*, and *d* (see Table 1) to construct confirmation measures. The following statements are based on this common view.



Here, *a* is the number of examples (*e*1, *h*1). For example, with *e*<sup>1</sup> = "raven" ("raven" is a label, or an abbreviation of "*x* is a raven") and *h*<sup>1</sup> = "black", *a* is the number of black ravens. Similarly, *b* is the number of black non-raven things; *c* is the number of non-black ravens; and *d* is the number of non-black and non-raven things.

To make the confirmation task clearer, we follow Understanding 2, treating *e*→*h* = "if *e* then *h*" as the rule to be confirmed and replacing *c*(*e*, *h*) with *c*(*e*→*h*). Researching confirmation then amounts to constructing or selecting the function *c*(*e*→*h*) = *f*(*a*, *b*, *c*, *d*).
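Counting the four example types from a sample is straightforward; the sample below is hypothetical:

```python
from collections import Counter

# Hypothetical sample of (e, h) pairs; a = #(e1,h1), b = #(e0,h1), c = #(e1,h0), d = #(e0,h0)
sample = ([("raven", "black")] * 8 + [("raven", "non-black")] * 1
          + [("non-raven", "black")] * 5 + [("non-raven", "non-black")] * 86)
counts = Counter(sample)

a = counts[("raven", "black")]          # black ravens
b = counts[("non-raven", "black")]      # black non-raven things
c = counts[("raven", "non-black")]      # non-black ravens (counterexamples)
d = counts[("non-raven", "non-black")]  # non-black non-raven things
print(a, b, c, d)  # 8 5 1 86
```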

To screen reasonable confirmation measures, Eells and Fitelson [14] propose the following symmetries:


They conclude that only HS is desirable; the other three symmetries are not desirable. We call this conclusion the symmetry/asymmetry requirement. Their conclusion is supported by most researchers. Since TS is the combination of HS and ES, we only need to check HS, ES, and CS. According to this symmetry/asymmetry requirement, only measures *L*, *F*, and *Z* among the measures mentioned above are screened out. It is uncertain whether *N* can be ruled out by this requirement [15]. See [14,25,26] for more discussions about the symmetry/asymmetry requirement.

Greco et al. [15] propose monotonicity as a desirable property: if *f*(*a*, *b*, *c*, *d*) does not decrease with *a* or *d* and does not increase with *b* or *c*, then we say that *f*(*a*, *b*, *c*, *d*) has monotonicity. Measures *L*, *F*, and *Z* have this monotonicity, whereas measures *D*, *M*, and *N* do not. If we further require that *c*(*e*→*h*) be normalized (between −1 and 1) [8,12], then only *F* and *Z* are screened out. Other properties have also been discussed [15,19]. One is logicality, which means *c*(*e*→*h*) = 1 when there is no counterexample and *c*(*e*→*h*) = −1 when there is no positive example. We can also screen out *F* and *Z* using the logicality requirement.

Consider a medical test, such as the test for COVID-19. Let *e*<sup>1</sup> = "positive" (e.g., "*x* is positive", where *x* is a specimen), *e*<sup>0</sup> = "negative", *h*<sup>1</sup> = "infected" (e.g., "*x* is infected"), and *h*<sup>0</sup> = "uninfected". Then the positive likelihood ratio is *LR*<sup>+</sup> = *P*(*e*1|*h*1)/*P*(*e*1|*h*0), which indicates the reliability of the rule *e*1→*h*1. Measures *L* and *F* have a one-to-one correspondence with *LR*<sup>+</sup>:

$$L(\mathfrak{e}\_1 \to h\_1) = \log LR^+;\tag{9}$$

$$F(e\_1, h\_1) = (LR^+ - 1)/(LR^+ + 1). \tag{10}$$

Hence, *L* and *F* can also be used to assess the reliability of a medical test. In comparison with *LR* and *L*, *F* better indicates the distance between a given test (any *F*) and the best test (*F* = 1) or the worst test (*F* = −1). However, *LR* can be used more conveniently for the probability predictions of diseases [27].
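Equations (9) and (10) as code, with illustrative sensitivity and specificity values that are not from the paper:

```python
import math

def confirmation_from_test(sensitivity, specificity):
    """Channel-style measures from a test's sensitivity and specificity.
    LR+ = P(e1|h1)/P(e1|h0); L = log LR+ (Eq. 9); F = (LR+ - 1)/(LR+ + 1) (Eq. 10)."""
    lr_plus = sensitivity / (1 - specificity)
    return lr_plus, math.log(lr_plus), (lr_plus - 1) / (lr_plus + 1)

# Hypothetical test: sensitivity 0.7, specificity 0.95
lr, L, F = confirmation_from_test(0.7, 0.95)
print(lr, L, F)  # LR+ = 14, L = log 14, F = 13/15
```

Note that *F* stays in (−1, 1) however extreme *LR*<sup>+</sup> becomes, which is why it indicates the distance to the best or worst test more directly than *LR*<sup>+</sup> or *L*.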

#### *2.3. To Distinguish a Major Premise's Evidence and Its Consequent's Evidence*

The evidence for the consequent of a syllogism is the minor premise, whereas the evidence for a major premise or a rule is a sample or a sampling distribution *P*(*e*, *h*). In some studies, *e* is used sometimes as the minor premise and sometimes as an example or a sample; *h* is used sometimes as a consequent and sometimes as a major premise. Researchers use *c*(*e*, *h*) or *c*(*h*, *e*) instead of *c*(*e*→*h*) because they need to avoid the contradiction between the two understandings. However, if we distinguish the two types of evidence, there is no problem in using *c*(*e*→*h*). We only need to emphasize that the evidence for a major premise is a sampling distribution *P*(*e*, *h*) instead of *e*.

If *h* is used as a major premise and *e* is used as the evidence (such as in [14,28]), −*e* (the negation of *e*) is puzzling, because there are four types of examples instead of two. Suppose *h* = *p*→*q* and that *e* is one of (*p*, *q*), (*p*, −*q*), (−*p*, *q*), and (−*p*, −*q*). If (*p*, −*q*) is the counterexample, and the other three examples (*p*, *q*), (−*p*, *q*), and (−*p*, −*q*) are positive examples that support *p*→*q*, then (−*p*, *q*) and (−*p*, −*q*) should, for the same reason, also support *p*→−*q*. However, according to HS [14], it is unreasonable that the same evidence supports both *p*→*q* and *p*→−*q*. In addition, *e* is in general a sample with many examples; a sample's negation or a sample's probability is also puzzling.

Fortunately, though many researchers say that *e* is the evidence of a major premise *h*, they also treat *e* as the antecedent and *h* as the consequent of a major premise because only in this way can one calculate the probabilities or conditional probabilities of *e* and *h* for a confirmation measure. Why, then, do we not replace *c*(*e*, *h*) with *c*(*e*→*h*) to make the task clearer? Section 5.3 will show that using *h* as a major premise results in a misunderstanding of the symmetry/asymmetry requirement.

#### *2.4. Incremental Confirmation or Absolute Confirmation*

Confirmation is often explained as assessing the impact of evidence on hypotheses, or the impact of the premise on the consequent of a rule [14,19]. However, this paper takes a different view: confirmation assesses how well a sample or sampling distribution supports a major premise or a rule; the impact on the rule (e.g., the increment in the degree of confirmation) may be made by newly added examples.

Since one can use one or several examples to calculate the degree of confirmation with a confirmation measure, many researchers call their confirmation incremental confirmation [14,15]. There are also researchers who claim that we need absolute confirmation [29]. This paper supports absolute confirmation.

The problem with incremental confirmation is that the calculated degrees of confirmation are often bigger than 0.5 and are unrelated to our prior knowledge, i.e., the *a, b, c*, and *d* that we knew before. It is unreasonable to ignore prior knowledge. Suppose that the logical probability of *h*<sup>1</sup> = "*x* is elderly" is 0.2; the evidence is one or several people with age(s) *x* > 60; the conditional logical probability of *h*<sup>1</sup> is 0.9. With measure *D*, the degree of confirmation is 0.9 − 0.2 = 0.7, which is very large and unrelated to the prior knowledge.

In confirmation function *f*(*a*, *b*, *c*, *d*), the numbers *a*, *b*, *c*, and *d* should be those of all examples including past and current examples. A measure *f*(*a*, *b*, *c*, *d*) should be an absolute confirmation measure. Its increment should be

$$
\Delta f = f(a + \Delta a, b + \Delta b, c + \Delta c, d + \Delta d) - f(a, b, c, d). \tag{11}
$$

The increment of the degree of confirmation brought about by a new example is closely related to the number of old examples. Section 5.2 will further discuss incremental confirmation and absolute confirmation.
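The point that one new example changes an absolute measure only slightly when many old examples exist can be sketched as follows (the counts and the choice of measure *D* = *P*(*h*1|*e*1) − *P*(*h*1) are illustrative):

```python
def D_measure(a, b, c, d):
    """Difference measure D = P(h1|e1) - P(h1), computed from counts
    a=(e1,h1), b=(e0,h1), c=(e1,h0), d=(e0,h0)."""
    n = a + b + c + d
    return a / (a + c) - (a + b) / n

def delta_f(f, a, b, c, d, da=0, db=0, dc=0, dd=0):
    """Increment of an absolute confirmation measure, Equation (11)."""
    return f(a + da, b + db, c + dc, d + dd) - f(a, b, c, d)

# With 1000 old examples (P(h1)=0.2, P(h1|e1)=0.9, so D = 0.7), one new
# positive example (da=1) changes D only slightly.
print(D_measure(90, 110, 10, 790))               # 0.7
print(delta_f(D_measure, 90, 110, 10, 790, da=1))  # small positive increment
```

This shows why the increment brought by a new example is closely tied to the number of old examples.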

#### *2.5. The Semantic Channel and the Degree of Belief of Medical Tests*

We now consider the Shannon channel and the semantic channel of the medical test. The relation between *h* and *e* is shown in Figure 2.

**Figure 2.** The relationship between Positive/Negative and Infected/Uninfected in the medical test.

In Figure 2, *h*<sup>1</sup> denotes an infected specimen (or person), *h*<sup>0</sup> denotes an uninfected specimen, *e*<sup>1</sup> is positive, and *e*<sup>0</sup> is negative. We can treat *e*<sup>1</sup> as a prediction "*h* is infected" and *e*<sup>0</sup> as a prediction "*h* is uninfected". In other words, *h* is a true label or true statement, and *e* is a prediction or selected label. Here *x* is the observed feature of *h*; *E*1 and *E*0 are two subsets of the domain of *x*. If *x* is in *E*1, then *e*<sup>1</sup> is selected; if *x* is in *E*0, then *e*<sup>0</sup> is selected.

Figure 3 shows the relationship between *h* and *x* by two posterior probability distributions *P*(*x*|*h*0) and *P*(*x*|*h*1) and the magnitudes of four conditional probabilities (with four colors).

**Figure 3.** The relationship between two feature distributions and four conditional probabilities for the Shannon channel of the medical test.

In the medical test, *P*(*e*1|*h*1) is called sensitivity [18], and *P*(*e*0|*h*0) is called specificity. They ascertain a Shannon channel, which is denoted by *P*(*e*|*h*), as shown in Table 2.

**Table 2.** Sensitivity and specificity ascertain a Shannon channel *P*(*e*|*h*).


We regard predicate *e*1(*h*) as the combination of believable and unbelievable parts (see Figure 4). The truth function of the believable part is *T*(*E*1|*h*)∈{0,1}. The unbelievable part is a tautology, whose truth function is always 1. Then we have the truth functions of predicates *e*1(*h*) and *e*0(*h*):

$$T(\theta\_{e1}|h) = b\_1' + b\_1 T(E\_1|h); \quad T(\theta\_{e0}|h) = b\_0' + b\_0 T(E\_0|h), \tag{12}$$

where model parameter *b*1' is the proportion of the unbelievable part, and also the truth value for the counter-instance *h*0.

**Figure 4.** Truth function *T*(θ*e*1|*h*) includes the believable part with proportion *b*<sup>1</sup> and the unbelievable part with proportion *b*1' (*b*1' = 1 − |*b*1|).

The four truth values form a semantic channel, as shown in Table 3.

**Table 3.** The semantic channel ascertained by *b*1' and *b*0' for the medical test.

|  | *e*<sup>0</sup> **(Negative)** | *e*<sup>1</sup> **(Positive)** |
|---|---|---|
| *h*<sup>1</sup> (infected) | *T*(θ*e*0\|*h*1) = *b*0' | *T*(θ*e*1\|*h*1) = 1 |
| *h*<sup>0</sup> (uninfected) | *T*(θ*e*0\|*h*0) = 1 | *T*(θ*e*1\|*h*0) = *b*1' |

For medical tests, the logical probability of *e*<sup>1</sup> is

$$T(\theta\_{e1}) = \sum\_{i} P(h\_i) T(\theta\_{e1}|h\_i) = P(h\_1) + b\_1' P(h\_0). \tag{13}$$

The likelihood function is

$$P(h|\theta\_{e1}) = P(h)T(\theta\_{e1}|h)/T(\theta\_{e1}).\tag{14}$$

*P*(*h*|θ*e*1) is also the predicted probability of *h* according to *T*(θ*e*1|*h*), i.e., the semantic meaning of *e*1.

To measure subjective or semantic information, we need subjective probability or logical probability [17]. To measure confirmation, we need statistical probability.
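Equations (12)–(14) can be sketched numerically for the binary medical-test setup above (the function names and the chosen numbers are ours):

```python
def truth_e1(h_is_infected: bool, b1_prime: float) -> float:
    """Truth function T(theta_e1|h), Equation (12): 1 for the believable
    instance h1 and b1' for the counter-instance h0."""
    return 1.0 if h_is_infected else b1_prime

def logical_prob_e1(p_h1: float, b1_prime: float) -> float:
    """Logical probability T(theta_e1) = P(h1) + b1'*P(h0), Equation (13)."""
    return p_h1 + b1_prime * (1.0 - p_h1)

def likelihood_h1_given_e1(p_h1: float, b1_prime: float) -> float:
    """Semantic Bayes prediction P(h1|theta_e1), Equation (14)."""
    return p_h1 * truth_e1(True, b1_prime) / logical_prob_e1(p_h1, b1_prime)

# Example: prior P(h1) = 0.2 and b1' = 0.1 (a mostly believable positive).
print(round(likelihood_h1_given_e1(0.2, 0.1), 4))   # 0.7143
```

A smaller unbelievable proportion *b*1' yields a sharper posterior prediction, as expected.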

#### *2.6. Semantic Information Formulas and the Nicod–Fisher Criterion*

According to the semantic information G theory [17], the (amount of) semantic information conveyed by *yj* about *xi* is defined with the log-normalized-likelihood:

$$I(\mathbf{x}\_{i}; \theta\_{j}) = \log \frac{P(\mathbf{x}\_{i}|\theta\_{j})}{P(\mathbf{x}\_{i})} = \log \frac{T(\theta\_{j}|\mathbf{x}\_{i})}{T(\theta\_{j})},\tag{15}$$

where *T*(θ*j*|*xi*) is the truth value of proposition *yj*(*xi*) and *T*(θ*j*) is the logical probability of *yj*. If *T*(θ*j*|*x*) is always 1, then this semantic information formula becomes Carnap and Bar-Hillel's semantic information formula [30].

In semantic communication, we often see hypotheses or predictions, such as "The temperature is about 10 ◦C", "The time is about seven o'clock", or "The stock index will go up about 10% next month". Each one of them may be represented by *yj* = "*x* is about *xj*." We can express the truth functions of *yj* by

$$T(\theta\_{\dot{\jmath}}|\mathbf{x}) = \exp[-(\mathbf{x} - \mathbf{x}\_{\dot{\jmath}})^2/(2\sigma^2)].\tag{16}$$

Introducing Equation (16) into Equation (15), we have

$$I(\mathbf{x}\_i; \theta\_j) = \log[1/T(\theta\_j)] - (\mathbf{x}\_i - \mathbf{x}\_j)^2 / (2\sigma^2),\tag{17}$$

by which we can explain that this semantic information is equal to the Carnap–Bar-Hillel's semantic information minus the squared relative deviation. This formula is illustrated in Figure 5.

**Figure 5.** The semantic information conveyed by *yj* about *xi.*

Figure 5 indicates that the smaller the logical probability is, the more information there is; and the larger the deviation is, the less information there is. Thus, a wrong hypothesis will convey negative information. These conclusions accord with Popper's thought (see [2], p. 294).
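These conclusions can be checked with Equations (15) and (16) (a sketch; the discrete temperature grid and uniform prior are our assumptions):

```python
import math

def truth(x, xj, sigma):
    """Gaussian truth function T(theta_j|x), Equation (16)."""
    return math.exp(-(x - xj) ** 2 / (2 * sigma ** 2))

def semantic_info(xi, xj, sigma, xs, px):
    """I(xi; theta_j) = log2[T(theta_j|xi)/T(theta_j)], Equation (15), with
    logical probability T(theta_j) = sum_i P(x_i)*T(theta_j|x_i)."""
    T_j = sum(p * truth(x, xj, sigma) for x, p in zip(xs, px))
    return math.log2(truth(xi, xj, sigma) / T_j)

# Uniform prior over temperatures -20..40 C; y_j = "x is about 10 C".
xs = list(range(-20, 41))
px = [1 / len(xs)] * len(xs)
print(semantic_info(10, 10, sigma=3, xs=xs, px=px) > 0)   # accurate hypothesis
print(semantic_info(30, 10, sigma=3, xs=xs, px=px) < 0)   # far-off hypothesis
```

An accurate hypothesis conveys positive information, while a badly deviating one conveys negative information, in line with Popper's thought.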

To average *I*(*xi*; θ*j*), we have generalized Kullback–Leibler information or relative cross-entropy:

$$I(X; \theta\_j) = \sum\_{i} P(x\_i|y\_j) \log \frac{P(x\_i|\theta\_j)}{P(x\_i)} = \sum\_{i} P(x\_i|y\_j) \log \frac{T(\theta\_j|x\_i)}{T(\theta\_j)},\tag{18}$$

where *P*(*x*|*yj*) is the sampling distribution, and *P*(*x*|θ*j*) is the likelihood function. If *P*(*x*|θ*j*) is equal to *P*(*x*|*yj*), then *I*(*X*; θ*j*) reaches its maximum and becomes the relative entropy or the Kullback–Leibler divergence.

For medical tests, the semantic information conveyed by *e*<sup>1</sup> about *h* becomes

$$I(h\_i; \theta\_{e1}) = \log \frac{P(h\_i|\theta\_{e1})}{P(h\_i)} = \log \frac{T(\theta\_{e1}|h\_i)}{T(\theta\_{e1})}.\tag{19}$$

The average semantic information is:

$$I(h; \theta\_{e1}) = \sum\_{i=0}^{1} P(h\_i|e\_1) \log \frac{P(h\_i|\theta\_{e1})}{P(h\_i)} = \sum\_{i=0}^{1} P(h\_i|e\_1) \log \frac{T(\theta\_{e1}|h\_i)}{T(\theta\_{e1})},\tag{20}$$

where *P*(*hi*|*e*1) is the conditional probability from a sample.

We now consider the relationship between the likelihood and the average semantic information. Let **D** be a sample {(*h*(*t*), *e*(*t*))|*t* = 1 to *N*; *h*(*t*)∈{*h*0, *h*1}; *e*(*t*)∈{*e*0, *e*1}}, which includes two sub-samples or conditional samples **H**<sup>0</sup> with label *e*<sup>0</sup> and **H**<sup>1</sup> with label *e*1. When *N* data points in **D** come from Independent and Identically Distributed random variables, we have the log-likelihood

$$\begin{split} L(\theta\_{e1}) &= \log P(\mathbf{H}\_{1}|\theta\_{e1}) = \log P(h(1), h(2), \ldots, h(N)|\theta\_{e1}) = \log \prod\_{i=0}^{1} P(h\_i|\theta\_{e1})^{N\_{1i}} \\ &= N\_1 \sum\_{i=0}^{1} P(h\_i|e\_1) \log P(h\_i|\theta\_{e1}) = -N\_1 H(h|\theta\_{e1}), \end{split} \tag{21}$$

where *N*1*i* is the number of examples (*hi*, *e*1) in **D**, and *N*1 is the size of **H**1. *H*(*h*|θ*e*1) is the cross-entropy. If *P*(*h*|θ*e*1) = *P*(*h*|*e*1), then the cross-entropy becomes the Shannon entropy. Meanwhile, the cross-entropy reaches its minimum, and the likelihood reaches its maximum.

Comparing the above two equations, we have

$$I(h; \theta\_{\epsilon 1}) = L(\theta\_{\epsilon 1}) / N\_1 - \sum\_{i=0}^{1} P(h\_i | e\_1) \log P(h\_i) \tag{22}$$

which indicates the relationship between the average semantic information and the likelihood. Since the second term on the right side is constant, the maximum likelihood criterion is equivalent to the maximum average semantic information criterion. It is easy to find that a positive example (*e*1, *h*1) increases the average log-likelihood *L*(θ*e*1)/*N*1; a counterexample (*e*1, *h*0) decreases it; examples (*e*0, *h*0) and (*e*0, *h*1) with *e*<sup>0</sup> are irrelevant to it.
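The relationship in Equations (20)–(22) can be checked from counts (a sketch; with *P*(*h*|θ*e*1) = *P*(*h*|*e*1), the average semantic information is the KL divergence of the posterior from the prior, so a positive example raises it and a counterexample lowers it):

```python
import math

def avg_semantic_info_e1(a, b, c, d):
    """Maximum of I(h; theta_e1) per Equations (20)-(22): when
    P(h|theta_e1) = P(h|e1), the average semantic information equals the
    KL divergence of P(h|e1) from the prior P(h), in bits."""
    n = a + b + c + d
    n1 = a + c                                     # size of sub-sample H1
    post = {"h1": a / n1, "h0": c / n1}            # P(h|e1)
    prior = {"h1": (a + b) / n, "h0": (c + d) / n}  # P(h)
    return sum(p * math.log2(p / prior[k]) for k, p in post.items() if p > 0)

base = avg_semantic_info_e1(90, 110, 10, 790)
# A new positive example (e1,h1) raises the average information;
# a new counterexample (e1,h0) lowers it.
print(avg_semantic_info_e1(91, 110, 10, 790) > base)   # True
print(avg_semantic_info_e1(90, 110, 11, 790) < base)   # True
```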

The Nicod criterion for confirmation is that a positive example (*e*1, *h*1) supports rule *e*1→*h*1, whereas a counterexample (*e*1, *h*0) undermines *e*1→*h*1. No reference exactly indicates whether Nicod affirmed that (*e*0, *h*0) and (*e*0, *h*1) are irrelevant to *e*1→*h*1. If Nicod did not affirm this, we can add this affirmation to the criterion and call the resulting criterion the Nicod–Fisher criterion, since Fisher proposed maximum likelihood estimation. From now on, we use the Nicod–Fisher criterion in place of the Nicod criterion.

#### *2.7. Selecting Hypotheses and Confirming Rules: Two Tasks from the View of Statistical Learning*

Researchers have noted the similarity between most confirmation measures and information measures. One explanation [31] is that information is the average of confirmatory impact. However, this paper gives a different explanation as follows.

There are three tasks in statistical learning: label learning, classification, and reliability analysis. There are similar tasks in inductive reasoning:


Classification and reliability analysis are two different tasks. Similarly, hypothesis selection and confirmation are two different tasks.

In statistical learning, classification depends on the criterion. The often-used criteria are the maximum posterior probability criterion (which is equivalent to the maximum correctness criterion) and the maximum likelihood criterion (which is equivalent to the maximum semantic information criterion [17]). The classifier for binary classifications is

$$e(x) = \begin{cases} e\_1, & \text{if } P(\theta\_1|x) \ge P(\theta\_0|x),\ P(x|\theta\_1) \ge P(x|\theta\_0), \text{ or } I(x;\theta\_1) \ge I(x;\theta\_0);\\ e\_0, & \text{otherwise.} \end{cases} \tag{23}$$

After the above classification, we may use the information criterion to assess how well *ej* predicts *hj*:

$$\begin{split} I^\*(h\_j; \theta\_{ej}) = I(h\_j; e\_j) &= \log \frac{P(h\_j|e\_j)}{P(h\_j)} = \log \frac{P(e\_j|h\_j)}{P(e\_j)} \\ &= \log P(h\_j, e\_j) - \log[P(h\_j)P(e\_j)], \end{split} \tag{24}$$

where *I*\* means optimized semantic information. With information amounts *I*(*hi;* θ*ej*) (*i, j* = 0,1), we can optimize the classifier [17]:

$$e\_j^\* = f(x) = \underset{e\_j}{\mathrm{argmax}}\,[P(h\_0|x)I(h\_0; \theta\_{ej}) + P(h\_1|x)I(h\_1; \theta\_{ej})].\tag{25}$$

The new classifier will provide a new Shannon channel *P*(*e*|*h*). Maximum mutual information classification can be achieved by repeating Equations (23) and (25) [17,32].
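The maximum likelihood branch of the classifier in Equation (23) can be sketched as follows (the triangular feature densities are toy assumptions, not from the paper):

```python
def classify(x, p_x_given_h1, p_x_given_h0):
    """Binary classifier of Equation (23) under the maximum likelihood
    criterion: select e1 when P(x|theta_1) >= P(x|theta_0)."""
    return "e1" if p_x_given_h1(x) >= p_x_given_h0(x) else "e0"

# Toy feature model: infected specimens tend toward higher feature values x.
def p_x_h1(x):
    return max(0.0, 1 - abs(x - 7) / 5) / 5   # triangle density around 7

def p_x_h0(x):
    return max(0.0, 1 - abs(x - 2) / 5) / 5   # triangle density around 2

print(classify(6, p_x_h1, p_x_h0))   # e1
print(classify(3, p_x_h1, p_x_h0))   # e0
```

The decision boundary implicitly partitions the domain of *x* into the subsets *E*1 and *E*0 described in Section 2.5.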

With the above classifiers, we can make the prediction *ej* = "*x* is *hj*" according to *x*. To tell information receivers how reliable the rule *ej*→*hj* is, we need the likelihood ratio *LR* to indicate how good the channel is, or the correct rate to indicate how good the probability prediction is. Confirmation is similar: we need a confirmation measure similar to *LR*, such as *F*, and a confirmation measure similar to the correct rate. The difference is that confirmation measures should range between −1 and 1.

According to the above analyses, it is easy to find that confirmation measures *D*, *N*, *R*, and *C* are more like information measures for assessing and selecting predictions than measures for confirming rules. *Z* is their normalization [8]; it seems to lie between an information measure and a confirmation measure. However, confirming rules is different from measuring predictions' information; it requires the proportions of positive examples and counterexamples.

#### **3. Two Novel Confirmation Measures**

#### *3.1. To Derive Channel Confirmation Measure b\**

We use the maximum semantic information criterion, which is consistent with the maximum likelihood criterion, to derive the channel confirmation measure. According to Equations (13) and (18), the average semantic information conveyed by *e*<sup>1</sup> about *h* is

$$I(h; \theta\_{e1}) = P(h\_0|e\_1) \log \frac{b\_1'}{P(h\_1) + b\_1' P(h\_0)} + P(h\_1|e\_1) \log \frac{1}{P(h\_1) + b\_1' P(h\_0)}.\tag{26}$$

Letting d*I*(*h;*θ*e*1)/d*b*1' = 0, we can obtain the optimized *b*1':

$$b\_1^{\prime \*} = \frac{P(h\_0|e\_1)}{P(h\_0)} / \frac{P(h\_1|e\_1)}{P(h\_1)}\,,\tag{27}$$

where *P*(*h*1|*e*1)/*P*(*h*1) ≥ *P*(*h*0|*e*1)/*P*(*h*0). The *b*1'\* can be called a disconfirmation measure. Multiplying both the numerator and the denominator by *P*(*e*1), the above formula becomes:

$$b\_1'^\* = P(e\_1|h\_0)/P(e\_1|h\_1) = (1 - \text{specificity})/\text{sensitivity} = 1/LR^+. \tag{28}$$

According to the semantic information G theory [17], when a truth function is proportional to the corresponding transition probability function, e.g., *T*\*(θ*e*1|*h*)∝*P*(*e*1|*h*), the average semantic information reaches its maximum. Using *T*\*(θ*e*1|*h*)∝*P*(*e*1|*h*), we can directly obtain

$$\frac{b\_1^{\prime \*}}{P(e\_1 | h\_0)} = \frac{1}{P(e\_1 | h\_1)}\tag{29}$$

and Equation (28). We call

$$b\_1^\* = 1 - b\_1'^\* = [P(e\_1|h\_1) - P(e\_1|h\_0)]/P(e\_1|h\_1) \tag{30}$$

the degree of confirmation of the rule *e*1→*h*1. Considering *P*(*e*1|*h*1) < *P*(*e*1|*h*0), we have

$$b\_1^\* = 1/b\_1'^\* - 1 = [P(e\_1|h\_1) - P(e\_1|h\_0)]/P(e\_1|h\_0). \tag{31}$$

Combining the above two formulas, we obtain

$$b\_1^\* = b^\*(e\_1 \to h\_1) = \frac{P(e\_1|h\_1) - P(e\_1|h\_0)}{\max\left[P(e\_1|h\_1), P(e\_1|h\_0)\right]} = \frac{LR^+ - 1}{\max\left(LR^+, 1\right)}.\tag{32}$$

Since

$$b^\*(e\_1 \to h\_0) = \frac{P(e\_1|h\_0) - P(e\_1|h\_1)}{\max[P(e\_1|h\_0), P(e\_1|h\_1)]} = -b^\*(e\_1 \to h\_1),\tag{33}$$

the *b*1\* possesses HS or Consequent Symmetry.

In the same way, we obtain

$$b\_0^\* = b^\*(e\_0 \to h\_0) = \frac{P(e\_0|h\_0) - P(e\_0|h\_1)}{\max[P(e\_0|h\_0), P(e\_0|h\_1)]} = \frac{LR^- - 1}{\max(LR^-, 1)}.\tag{34}$$

Using Consequent Symmetry, we can obtain *b*\*(*e*1→*h*0) = −*b*\*(*e*1→*h*1) and *b*\*(*e*0→*h*1) = −*b*\*(*e*0→*h*0). Using measure *b*\* or *F*, we can answer the question: if the result of NAT is negative and the result of CT is positive, which should we believe? Section 4.2 will provide the answer that is consistent with the improved diagnosis of COVID-19 in Wuhan.

Compared with *F*, *b*\* is better for probability predictions. For example, from *b*1\* > 0 and *P*(*h*), we obtain

$$P(h\_1|\theta\_{e1}) = P(h\_1)/[P(h\_1) + b\_1'^\* P(h\_0)] = P(h\_1)/[1 - b\_1^\* P(h\_0)]. \tag{35}$$

This formula is much simpler than the classical Bayes' formula (see Equation (5)).
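The equivalence of Equation (35) and the classical Bayes' prediction can be checked numerically (a sketch; sensitivity 0.5 and specificity 0.95 are taken from the NAT example in Section 3.2, and the function names are ours):

```python
def predict_with_b(p_h1: float, b1_star: float) -> float:
    """Semantic Bayes prediction of Equation (35):
    P(h1|theta_e1) = P(h1) / (1 - b1* . P(h0)), for b1* >= 0."""
    return p_h1 / (1.0 - b1_star * (1.0 - p_h1))

def predict_with_bayes(p_h1: float, sens: float, spec: float) -> float:
    """Classical Bayes' formula (Equation (5)): P(h1|e1) from
    sensitivity and specificity."""
    p_e1 = sens * p_h1 + (1 - spec) * (1 - p_h1)
    return sens * p_h1 / p_e1

# Sensitivity 0.5, specificity 0.95 => b1'* = 0.05/0.5 = 0.1, b1* = 0.9.
print(round(predict_with_b(0.2, 0.9), 4))            # 0.7143
print(round(predict_with_bayes(0.2, 0.5, 0.95), 4))  # 0.7143 (same value)
```

The *b*\* form needs only one parameter besides the prior, which is what makes it simpler.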

If *b*1\* = 0, then *P*(*h*1|θ*e*1) = *P*(*h*1). If *b*1\* < 0, then we can make use of HS or Consequent Symmetry to obtain *b*10\* = *b*\*(*e*1→*h*0) = |*b*\*(*e*1→*h*1)| = |*b*1\*|. Then we have

$$P(h\_0|\theta\_{e1}) = P(h\_0)/[P(h\_0) + b\_{10}'^\* P(h\_1)] = P(h\_0)/[1 - b\_{10}^\* P(h\_1)]. \tag{36}$$

We can also obtain *b*1\* = 2*F*1/(1 + *F*1) from *F*<sup>1</sup> = *F*(*e*1→*h*1) for the probability prediction *P*(*h*1|θ*e*1), but the calculation of probability predictions with *F*<sup>1</sup> is a little complicated.

So far, it is still problematic to use *b*\*, *F*, or another measure to handle the Raven Paradox. For example, as shown in Table 13, the increment of *F*(*e*1→*h*1) caused by Δ*d* = 1 is 0.348 − 0.333, whereas the increment caused by Δ*a* = 1 is 0.340 − 0.333. The former is greater than the latter, which means that a piece of white chalk can support "Ravens are black" better than a black raven. Hence measure *F* does not accord with the Nicod–Fisher criterion. Measures *b*\* and *Z* do not either.

Why do measures *b*\* and *F* not accord with the Nicod–Fisher criterion? The reason is that the likelihood *L*(θ*e*1) is related to the prior probability *P*(*h*), whereas *b*\* and *F* are independent of *P*(*h*).

#### *3.2. To Derive Prediction Confirmation Measure c\**

Statistics uses not only the likelihood ratio to indicate how reliable a testing means (as a channel) is but also the correct rate to indicate how reliable a probability prediction is. Measures *F* and *b*\*, like *LR*, cannot indicate the quality of a probability prediction. Most other measures have similar problems.

For example, we assume that an NAT for COVID-19 [33] has sensitivity *P*(*e*1|*h*1) = 0.5 and specificity *P*(*e*0|*h*0) = 0.95. We can calculate *b*1'\* = 0.1 and *b*1\* = 0.9. When the prior probability *P*(*h*1) of the infection changes, predicted probability *P*(*h*1|θ*e*1) (see Equation (35)) changes with the prior probability, as shown in Table 4. We can obtain the same results using the classical Bayes' formula (see Equation (5)).

**Table 4.** Predictive probability *P*(*h*1|θ*e*1) changes with prior probability *P*(*h*1) as *b*1\* = 0.9.


Data in Table 4 show that measure *b*\* cannot indicate the quality of probability predictions. Therefore, we need to use *P*(*h*) to construct a confirmation measure that can reflect the correct rate.
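The dependence shown in Table 4 can be reproduced from Equation (35) (a sketch; the listed priors are illustrative choices):

```python
def predict(p_h1: float, b1_star: float = 0.9) -> float:
    """P(h1|theta_e1) = P(h1)/(1 - b1* . P(h0)), Equation (35)."""
    return p_h1 / (1 - b1_star * (1 - p_h1))

# With b1* fixed at 0.9, the predicted probability still varies widely
# with the prior P(h1), which is why b* alone cannot indicate the
# quality of a probability prediction.
for p in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(f"P(h1)={p:>5}  ->  P(h1|theta_e1)={predict(p):.3f}")
```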

We now treat probability prediction *P*(*h*|θ*e*1) as the combination of a believable part with proportion *c*<sup>1</sup> and an unbelievable part with proportion *c*1', as shown in Figure 6. We call *c*<sup>1</sup> the degree of belief of the rule *e*1→*h*<sup>1</sup> as a prediction.

**Figure 6.** Likelihood function *P*(*h*|θ*e*1) may be regarded as a believable part plus an unbelievable part.

When the prediction accords with the fact, i.e., *P*(*h*|θ*e*1) = *P*(*h*|*e*1), *c*<sup>1</sup> becomes *c*1\*. The degree of disconfirmation for predictions is

$$\begin{aligned} c'^\*(e\_1 \to h\_1) &= P(h\_0|e\_1)/P(h\_1|e\_1), \text{ if } P(h\_0|e\_1) \le P(h\_1|e\_1);\\ c'^\*(e\_1 \to h\_1) &= P(h\_1|e\_1)/P(h\_0|e\_1), \text{ if } P(h\_1|e\_1) \le P(h\_0|e\_1). \end{aligned} \tag{37}$$

Further, we have the prediction confirmation measure

$$\begin{split} c\_1^\* = c^\*(e\_1 \to h\_1) &= \frac{P(h\_1|e\_1) - P(h\_0|e\_1)}{\max(P(h\_1|e\_1), P(h\_0|e\_1))} \\ &= \frac{2P(h\_1|e\_1) - 1}{\max(P(h\_1|e\_1), 1 - P(h\_1|e\_1))} = \frac{2CR\_1 - 1}{\max(CR\_1, 1 - CR\_1)}, \end{split} \tag{38}$$

where *CR*<sup>1</sup> = *P*(*h*1|θ*e*1) = *P*(*h*1|*e*1) is the correct rate of rule *e*1→*h*1. This correct rate means that when we predict *h*<sup>1</sup> because *x*∈*E*<sup>1</sup>, the probability that *h*<sup>1</sup> is true is *CR*1. Multiplying both the numerator and denominator of Equation (38) by *P*(*e*1), we obtain

$$c\_1^\* = c^\*(e\_1 \to h\_1) = \frac{P(h\_1, e\_1) - P(h\_0, e\_1)}{\max(P(h\_1, e\_1), P(h\_0, e\_1))} = \frac{a - c}{\max(a, c)}.\tag{39}$$

The sizes of four areas covered by two curves in Figure 7 may represent *a*, *b*, *c*, and *d*.

**Figure 7.** The numbers of positive examples and counterexamples for *c*\*(*e*0→*h*0) (see the left side) and *c*\*(*e*1→*h*1) (see the right side).

*Entropy* **2020**, *22*, 384

In like manner, we obtain

$$c\_0^\* = c^\*(e\_0 \to h\_0) = \frac{P(h\_0, e\_0) - P(h\_1, e\_0)}{\max(P(h\_0, e\_0), P(h\_1, e\_0))} = \frac{d - b}{\max(d, b)}.\tag{40}$$

Making use of Consequent Symmetry, we can obtain *c*\*(*e*1→*h*0) = −*c*\*(*e*1→*h*1) and *c*\*(*e*0→*h*1) = −*c*\*(*e*0→*h*0).

In Figure 7, the sizes of the two areas covered by two curves are *P*(*h*0) and *P*(*h*1), which are different. If *P*(*h*0) = *P*(*h*1) = 0.5, then prediction confirmation measure *c*\* is equal to channel confirmation measure *b*\*.

Using measure *c*\*, we can directly assess the quality of the probability predictions. For *P*(*h*1|θ*e*1) = 0.77 in Table 4, we have *c*1\* = (0.77 − 0.23)/0.77 = 0.701. We can also use *c*\* for probability predictions. When *c*1\* > 0, according to Equation (39), we have the correct rate of rule *e*1→*h*1:

$$CR\_1 = P(h\_1|\theta\_{e1}) = 1/(1 + c\_1'^\*) = 1/(2 - c\_1^\*).\tag{41}$$

For example, if *c*1\* = 0.701, then *CR*<sup>1</sup> = 1/(2 − 0.701) = 0.77. If *c*\*(*e*1→*h*1) = 0, then *CR*<sup>1</sup> = 0.5. If *c*\*(*e*1→*h*1) < 0, we may make use of HS to obtain *c*10\* = *c*\*(*e*1→*h*0) = |*c*1\*|, and then make the probability prediction:

$$\begin{aligned} P(h\_0|\theta\_{e1}) &= 1/(2 - c\_{10}^\*), \\ P(h\_1|\theta\_{e1}) &= 1 - P(h\_0|\theta\_{e1}) = (1 - c\_{10}^\*)/(2 - c\_{10}^\*). \end{aligned} \tag{42}$$

We may define another prediction confirmation measure by replacing operation max( ) with +:

$$\begin{split} c\_{F1}^\* = c\_F^\*(e\_1 \to h\_1) &= \frac{P(h\_1|e\_1) - P(h\_0|e\_1)}{P(h\_1|e\_1) + P(h\_0|e\_1)} = P(h\_1|e\_1) - P(h\_0|e\_1) \\ &= \frac{P(h\_1, e\_1) - P(h\_0, e\_1)}{P(e\_1)} = \frac{a - c}{a + c}. \end{split} \tag{43}$$

The *cF*\* is also convenient for probability predictions when *P*(*h*) is fixed. We have

$$\begin{aligned} P(h\_1|\theta\_{e1}) &= CR\_1 = (1 + c\_{F1}^\*)/2; \\ P(h\_0|\theta\_{e1}) &= 1 - CR\_1 = (1 - c\_{F1}^\*)/2. \end{aligned} \tag{44}$$

However, when *P*(*h*) is variable, we should still use *b*\* with *P*(*h*) for probability predictions.

It is easy to prove that *c*\*(*e*1→*h*1) and *cF*\*(*e*1→*h*1) possess all the above-mentioned desirable properties.
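Measures *c*\* and *cF*\*, and the recovery of the correct rate via Equation (41), can be sketched from counts (the counts 77 and 23 mirror the *CR*1 = 0.77 example above; function names are ours):

```python
def c_star(a: int, c: int) -> float:
    """Prediction confirmation c*(e1->h1) = (a - c)/max(a, c), Equation (39)."""
    return (a - c) / max(a, c)

def c_F(a: int, c: int) -> float:
    """cF*(e1->h1) = (a - c)/(a + c), Equation (43)."""
    return (a - c) / (a + c)

def correct_rate(c1_star: float) -> float:
    """Recover CR1 = P(h1|e1) from c1* >= 0, Equation (41)."""
    return 1.0 / (2.0 - c1_star)

# a = 77 positive examples and c = 23 counterexamples among e1-labelled cases.
print(round(c_star(77, 23), 3))               # 0.701
print(round(correct_rate(c_star(77, 23)), 2))  # 0.77
print(round(c_F(77, 23), 2))                   # 0.54
```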

*3.3. Converse Channel*/*Prediction Confirmation Measures b\*(h*→*e) and c\*(h*→*e)*

Greco et al. [19] divide confirmation measures into


Similarly, this paper divides confirmation measures into


We now consider *c*\*(*h*1→*e*1). The positive examples' proportion and the counterexamples' proportion can be found in the upper part of Figure 7. Then we have

$$c^\*(h\_1 \to e\_1) = \frac{P(e\_1|h\_1) - P(e\_0|h\_1)}{\max(P(e\_1|h\_1), P(e\_0|h\_1))} = \frac{a-b}{\max(a,b)}.\tag{45}$$

The correct rate reflected by *c*\*(*h*1→*e*1) is the sensitivity or true positive rate *P*(*e*1|*h*1). The correct rate reflected by *c*\*(*h*0→*e*0) is the specificity or true negative rate *P*(*e*0|*h*0).

Consider the converse channel confirmation measure *b*\*(*h*1→*e*1). Now the source is *P*(*e*) instead of *P*(*h*). We may swap *e*<sup>1</sup> with *h*<sup>1</sup> in *b*\*(*e*1→*h*1) or swap *a* with *d* and *b* with *c* in *f*(*a, b, c, d*) to obtain

$$b^\*(h\_1 \to e\_1) = \frac{P(h\_1|e\_1) - P(h\_1|e\_0)}{P(h\_1|e\_1) \vee P(h\_1|e\_0)} = \frac{ad - bc}{a(b + d) \vee b(a + c)},\tag{46}$$

where ∨ is the operator for the maximum of two numbers, used to replace max( ). There are also four types of converse channel/prediction confirmation formulas with *a*, *b*, *c*, and *d* (see Table 7). Due to Consequent Symmetry, there are eight types of converse channel/prediction confirmation formulas altogether.

#### *3.4. Eight Confirmation Formulas for Di*ff*erent Antecedents and Consequents*

Table 5 shows the positive examples' and counterexamples' proportions needed by measures *b*\* and *c*\*.


**Table 5.** Eight proportions for calculating *b*\*(*e*→*h*) and *c*\*(*e*→*h*).

Table 6 provides four types of confirmation formulas with *a*, *b*, *c*, and *d* for rule *e*→*h*, where function max( ) is replaced with the operator ∨.


**Table 6.** Channel/prediction confirmation measures expressed by *a*, *b*, *c*, and *d*.

These confirmation measures are related to the misreporting rates of the rule *e*→*h*. For example, smaller *b*\*(*e*1→*h*1) or *c*\*(*e*1→*h*1) means that the test shows positive for more uninfected people. Table 7 includes four types of confirmation measures for *h*→*e*.

**Table 7.** Converse channel/prediction confirmation measures expressed by *a*, *b*, *c*, and *d*.



These confirmation measures are related to the underreporting rates of the rule *h*→*e*. For example, a smaller *b*\*(*h*1→*e*1) or *c*\*(*h*1→*e*1) means that the test shows negative for more infected people. Underreporting is the more serious problem.

Each of the eight types of confirmation measures in Tables 6 and 7 has its consequent-symmetrical form. Therefore, there are 16 types of function *f*(*a*, *b*, *c*, *d*) altogether for confirmation.

In a prediction or converse prediction confirmation formula, the conditions of the two conditional probabilities are the same; they are the antecedents of rules, so a confirmation measure *c*\* depends only on the numbers of positive examples and counterexamples. Therefore, these measures accord with the Nicod–Fisher criterion.

If we change "∨" into "+" in *f*(*a, b, c, d*), then measure *b*\* becomes measure *bF*\* = *F*, and measure *c*\* becomes measure *cF*\*. For example,

$$c\_F^\*(e\_1 \to h\_1) = (a - c)/(a + c). \tag{47}$$

*3.5. Relationship Between Measures b\* and F*

Measure *b*\* is like measure *F*. The two measures change with the likelihood ratio *LR*, as shown in Figure 8.

**Figure 8.** Measures *b*\* and *F* change with likelihood ratio *LR*.

Measure *F* has four confirmation formulas for different antecedents and consequents [8], which are related to measure *bF*\* as follows:

$$F(e\_1 \to h\_1) = \frac{P(e\_1|h\_1) - P(e\_1|h\_0)}{P(e\_1|h\_1) + P(e\_1|h\_0)} = \frac{ad - bc}{ad + bc + 2ac} = b\_F^\*(e\_1 \to h\_1);\tag{48}$$

$$F(h\_1 \to e\_1) = \frac{P(h\_1|e\_1) - P(h\_1|e\_0)}{P(h\_1|e\_1) + P(h\_1|e\_0)} = \frac{ad - bc}{ad + bc + 2ab} = b\_F^\*(h\_1 \to e\_1);\tag{49}$$

$$F(e\_0 \to h\_0) = \frac{P(e\_0|h\_0) - P(e\_0|h\_1)}{P(e\_0|h\_0) + P(e\_0|h\_1)} = \frac{ad - bc}{ad + bc + 2bd} = b\_F^\*(e\_0 \to h\_0);\tag{50}$$

$$F(h\_0 \to e\_0) = \frac{P(h\_0|e\_0) - P(h\_0|e\_1)}{P(h\_0|e\_0) + P(h\_0|e\_1)} = \frac{ad - bc}{ad + bc + 2cd} = b\_F^\*(h\_0 \to e\_0).\tag{51}$$

*F* is equivalent to *bF*\*. Measure *b*\* has all the above-mentioned desirable properties, just as measure *F* does. The differences are that measure *b*\* has a greater absolute value than measure *F* and can be used for probability predictions more conveniently (see Equation (35)).
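The identity in Equation (48) and the relation *b*\* = 2*F*/(1 + *F*) noted in Section 3.1 can be verified numerically (a sketch with illustrative counts):

```python
def F_e1_h1(a, b, c, d):
    """F(e1->h1) = (ad - bc)/(ad + bc + 2ac), Equation (48), from counts."""
    return (a * d - b * c) / (a * d + b * c + 2 * a * c)

def F_from_probs(a, b, c, d):
    """The same measure from conditional probabilities, Equation (48)."""
    p1, p0 = a / (a + b), c / (c + d)
    return (p1 - p0) / (p1 + p0)

def b_star_from_F(F):
    """b* = 2F/(1 + F) for F >= 0."""
    return 2 * F / (1 + F)

a, b, c, d = 90, 110, 10, 790
print(abs(F_e1_h1(a, b, c, d) - F_from_probs(a, b, c, d)) < 1e-12)  # True
# Cross-check against b* = (LR - 1)/LR with LR = P(e1|h1)/P(e1|h0):
lr = (a / (a + b)) / (c / (c + d))
print(abs(b_star_from_F(F_e1_h1(a, b, c, d)) - (lr - 1) / lr) < 1e-12)  # True
```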

#### *3.6. Relationships between Prediction Confirmation Measures and Some Medical Test's Indexes*

Channel confirmation measures are related to likelihood ratios, whereas prediction confirmation measures (PCMs), including converse PCMs, are related to correct rates and false rates in the medical test.

To clarify the significance of PCMs in the medical test, Table 8 shows which correct rate and which false rate each PCM is related to.

**Table 8.** PCMs (Prediction Confirmation Measures) are related to different correct rates and false rates in the medical test [18].


The false rates related to PCMs are the misreporting rates of the rule *e*→*h*, whereas the false rates related to converse PCMs are the underreporting rates of the rule *h*→*e*. For example, False Discovery Rate *P*(*h*0|*e*1) is also the misreporting rate of rule *e*1→*h*1; False Negative Rate *P*(*e*0|*h*1) is also the underreporting rate of rule *h*1→*e*1.

#### **4. Results**

#### *4.1. Using Three Examples to Compare Various Confirmation Measures*

In China's war against COVID-19, people often ask the question: since the true positive rate, i.e., sensitivity, of NAT is so low (less than 0.5), why do we still believe it? Medical experts explain that though NAT has low sensitivity, it has high specificity, and hence its positive result is very believable.

We use the following two extreme examples (see Figure 9) to explain why a test with very low sensitivity can provide a more believable positive result than another test with very high sensitivity, and to examine whether popular confirmation measures support this conclusion.

**Figure 9.** How the proportions of positive examples and counterexamples affect *b*\*(*e*1→*h*1). (**a**) Example 1: positive examples' proportion is *P*(*e*1|*h*1) = 0.1, and counterexamples' proportion is *P*(*e*1|*h*0) = 0.01. (**b**) Example 2: positive examples' proportion is *P*(*e*1|*h*1) = 1, and counterexamples' proportion is *P*(*e*1|*h*0) = 0.9.

In Example 1, *b*\*(*e*1→*h*1) = (0.1 − 0.01)/0.1 = 0.9, which is very large. In Example 2, *b*\*(*e*1→*h*1) = (1 − 0.9)/1 = 0.1, which is very small. The two examples indicate that having fewer counterexamples is more important to *b*\* than having more positive examples. Measures *F*, *c*\*, and *cF*\* also possess this characteristic, which is compatible with the Logicality requirement [15]. However, most confirmation measures do not possess this characteristic.
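Both examples can be recomputed directly from Equation (32) (a sketch; the proportions are those in Figure 9):

```python
def b_star(p_pos: float, p_neg: float) -> float:
    """b*(e1->h1) per Equation (32), from the positive examples' proportion
    P(e1|h1) and the counterexamples' proportion P(e1|h0)."""
    return (p_pos - p_neg) / max(p_pos, p_neg)

# Example 1: low sensitivity but very few counterexamples.
print(round(b_star(0.1, 0.01), 3))   # 0.9
# Example 2: perfect sensitivity but many counterexamples.
print(round(b_star(1.0, 0.9), 3))    # 0.1
```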

We supposed *P*(*h*1) = 0.2 and *n* = 1000 and then calculated the degrees of confirmation with different confirmation measures for the above two examples, as shown in Table 9, where the base of the logarithm for *R* and *L* is 2. Table 9 also includes Example 3 (Ex. 3), in which *P*(*h*1) is 0.01. Example 3 reveals the difference between *Z* and *b*\* (or *F*).


**Table 9.** Three examples to show the differences between different confirmation measures.

Data for Examples 1 and 2 show that *L*, *F*, and *b*\* give Example 1 a much higher rating than Example 2, whereas *M*, *C*, and *N* give Example 2 a higher rating than Example 1 (see red numbers). The Excel file for Tables 9, 12, and 13 can be found in the Supplementary Material.

In Examples 2 and 3, where *c* > *a* (there are more counterexamples than positive examples), only the values of *c*\*(*e*1→*h*1) are negative. Negative values are reasonable for assessing probability predictions when counterexamples outnumber positive examples.

The data for Example 3 show that when *P*(*h*0) = 0.99 >> *P*(*h*1) = 0.01, measure *Z* is very different from measures *F* and *b*\* (see blue numbers) because *F* and *b*\*, unlike *Z*, are independent of *P*(*h*).

Although measure *L* (log-likelihood ratio) is compatible with *F* and *b*\*, its values, such as 3.32 and 0.152, are not as intuitive as those of *F* or *b*\*, which are normalized.

#### *4.2. Using Measure b\* to Explain Why and How CT Is Also Used to Test for COVID-19*

The COVID-19 outbreak in Wuhan, China, in 2019 and 2020 infected many people. In the early stage, only NAT was used to diagnose the infection. Later, many doctors found that NAT often failed to report the viral infection. Because this test has low sensitivity (possibly less than 0.5) and high specificity, we can confirm the infection when NAT is positive, but a negative NAT is not good for confirming non-infection. That is, NAT-negative is not believable. To reduce underreports of the infection, CT (Computed Tomography) gained more attention because it has higher sensitivity than NAT.

When both NAT and CT were used in Wuhan, doctors improved the diagnosis, as shown in Figure 10 and Table 11. If we diagnose the infection according to confirmation measure *b*\*, will the diagnosis be the same as the improved diagnosis? Besides NAT and CT, patients' symptoms, such as fever and cough, were also used for the diagnosis. To simplify the problem, we assumed that all patients had the same symptoms so that we could diagnose only according to the results of NAT and CT.

**Figure 10.** Using both NAT and CT to diagnose the infection of COVID-19 with the help of confirmation measure *b*\*.

Reference [34] introduces the sensitivity and specificity of CT that the authors achieved. According to [33,34] and other reports on the internet, the author of this paper estimated the sensitivities and specificities, as shown in Table 10.


**Table 10.** Sensitivities and specificities of NAT (Nucleic Acid Test) and CT for COVID-19.

Figure 10 was drawn according to Table 10. Figure 10 also shows sensitivities and specificities. For example, the half of the red circle on the right side indicates that the sensitivity of NAT is 0.5.

We use *c*(NAT+) to denote the degree of confirmation of NAT-positive with any measure *c*, and use *c*(NAT−), *c*(CT+), and *c*(CT−) in like manner. Then we have

$$b^\*(\text{NAT}+) = [P(e\_1|h\_1) - P(e\_1|h\_0)]/P(e\_1|h\_1) = [0.5 - (1 - 0.95)]/0.5 = 0.9$$

$$b^\*(\text{NAT}-) = [P(e\_0|h\_0) - P(e\_0|h\_1)]/P(e\_0|h\_0) = [0.95 - (1 - 0.5)]/0.95 = 0.47$$

We can also obtain *b*\*(CT+) = 0.69 and *b*\*(CT−) = 0.73 in like manner (see Table 11).
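These degrees of confirmation follow directly from the sensitivities and specificities. The sketch below reproduces them; the NAT values (sensitivity 0.5, specificity 0.95) are from the text, while the CT values (sensitivity 0.8, specificity 0.75) are assumptions chosen to reproduce the reported *b*\*(CT+) = 0.69 and *b*\*(CT−) = 0.73, since Table 10 itself is not reproduced here:

```python
def b_star_pos(sens, spec):
    # b*(test+) = [P(e1|h1) - P(e1|h0)] / P(e1|h1)
    return (sens - (1 - spec)) / sens

def b_star_neg(sens, spec):
    # b*(test-) = [P(e0|h0) - P(e0|h1)] / P(e0|h0)
    return (spec - (1 - sens)) / spec

# NAT values from the text; CT values are assumed (see lead-in above).
for name, sens, spec in [("NAT", 0.5, 0.95), ("CT", 0.8, 0.75)]:
    print(name, round(b_star_pos(sens, spec), 2), round(b_star_neg(sens, spec), 2))
```

This prints 0.9 and 0.47 for NAT, and 0.69 and 0.73 for CT, matching the values above.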

**Table 11.** Improved diagnosis (for final positive or negative) according to NAT and CT.


If we only use the positive or negative of NAT as the final result, we confirm non-infection whenever NAT shows negative. According to measure *b*\*, if we use the results of both NAT and CT, then when NAT shows negative whereas CT shows positive, the final diagnosis should be positive (see blue words in Table 11) because *b*\*(CT+) = 0.69 is higher than *b*\*(NAT−) = 0.47. This diagnosis is the same as the improved diagnosis in Wuhan.

Assuming the prior probability of the infection *P*(*h*1) = 0.25, the author calculated the degrees of confirmation with different confirmation measures for the same sensitivities and specificities, as shown in Table 12.


**Table 12.** Various confirmation measures for assessing the results of NAT and CT.

If there is a "No" under a measure, this measure results in a diagnosis different from the improved diagnosis. The red numbers mean that *c*(CT+) < *c*(NAT−) or *c*(NAT+) < *c*(CT−). Measures *D*, *M*, and *F*, as well as *b*\*, are consistent with the improved diagnosis. However, if we change *P*(*h*1) from 0.1 to 0.6, we find that measure *M* is no longer consistent with the improved diagnosis. If we believe a test-positive or test-negative only when its degree of confirmation is greater than 0.2, then *D* is also undesirable, and only measures *F* and *b*\* satisfy our requirements.

The sensitivities and specificities in Table 10 were not specially selected. When the NAT-sensitivity was varied between 0.3 and 0.7, or the CT-sensitivity between 0.6 and 0.9, the result was the same: only measures *D*, *F*, and *b*\* were consistent with the improved diagnosis.

Measure *c*\* is also not suitable for the diagnosis because it reflects correctness and cannot reduce underreports of the infection. Yet underreporting the infection causes greater loss than misreporting it.

#### *4.3. How Various Confirmation Measures Are Affected by Increments Δa and Δd*

The following example is used to check whether popular confirmation measures can explain that a black raven confirms "Ravens are black" more strongly than a piece of white chalk does.

Table 13 shows the degrees of confirmation calculated with nine different measures. First, we supposed *a* = *d* = 20 and *b* = *c* = 10 to calculate the nine degrees of confirmation. Next, we only replaced *a* with *a* + 1 to calculate the nine degrees. Last, we only replaced *d* with *d* + 1 to calculate them.


**Table 13.** How confirmation measures are affected by Δ*a* = 1 and Δ*d* = 1.

The results may exceed many researchers' expectations. Table 13 indicates that no measure except *c*\* (see blue numbers) can ensure that Δ*a* = 1 increases *f*(*a*, *b*, *c*, *d*) more than Δ*d* = 1 does. If we vary *b* and *c* between 1 and 19, no measure except *c*\*, *S*, and *N* can ensure Δ*f*/Δ*a* ≥ Δ*f*/Δ*d*; when *b* > *c*, measures *S* and *N* also fail to ensure Δ*f*/Δ*a* ≥ Δ*f*/Δ*d*. The cause for measures *D* and *M* is that Δ*d* = 1 decreases *P*(*h*1) and *P*(*e*1) more than it increases *P*(*h*1|*e*1) and *P*(*e*1|*h*1). The causes for the other measures except *c*\* are similar.
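The effect of Δ*a* versus Δ*d* can be checked numerically. The sketch below uses *c*\*(*e*1→*h*1) = (*a* − *c*)/max(*a*, *c*) from Section 5.1 and, as an assumption (its formula is not restated in this section), Carnap's difference measure in the form *D* = *P*(*h*1|*e*1) − *P*(*h*1):

```python
def c_star(a, b, c, d):
    # c*(e1 -> h1) = (a - c) / max(a, c), as given in Section 5.1
    return (a - c) / max(a, c)

def D(a, b, c, d):
    # Carnap's difference measure, assumed form: P(h1|e1) - P(h1)
    n = a + b + c + d
    return a / (a + c) - (a + b) / n

a, b, c, d = 20, 10, 10, 20
for f in (c_star, D):
    base = f(a, b, c, d)
    da = f(a + 1, b, c, d) - base  # effect of one more black raven (delta a = 1)
    dd = f(a, b, c, d + 1) - base  # effect of one more white chalk (delta d = 1)
    print(f.__name__, round(da, 4), round(dd, 4))
```

With *a* = *d* = 20 and *b* = *c* = 10, Δ*a* = 1 raises *c*\* by about 0.024 while Δ*d* = 1 leaves it unchanged, whereas *D* gains more from Δ*d* (≈0.0082) than from Δ*a* (≈0.0026), matching the direction reported in Table 13.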

#### **5. Discussion**

#### *5.1. To Clarify the Raven Paradox*

To clarify the Raven Paradox, some researchers, including Hempel [3], affirm the Equivalence Condition and deny the Nicod–Fisher criterion; others, such as Scheffler and Goodman [35], affirm the Nicod–Fisher criterion and deny the Equivalence Condition. Still other researchers fully affirm neither the Equivalence Condition nor the Nicod–Fisher criterion.

First, we consider measure *F* to see whether we can use it to eliminate the Raven Paradox. The difference between *F*(*e*1→*h*1) and *F*(*h*0→*e*0) is that their counterexamples are the same, yet their positive examples are different. When *d* increases to *d* + Δ*d*, *F*(*e*1→*h*1) and *F*(*h*0→*e*0) increase unequally. Therefore,


Measure *b*\* is like *F*. The conclusion is that measures *F* and *b*\* cannot eliminate our confusion about the Raven Paradox.

After inspecting many different confirmation measures from the perspective of rough set theory, Greco et al. [15] conclude that the Nicod criterion (i.e., the Nicod–Fisher criterion) is right, but that it is difficult to find a suitable measure that accords with it. However, many researchers still think that the Nicod criterion is incorrect and that it accords with our intuition only because a confirmation measure *c*(*e*1→*h*1) can evidently increase with *a* and slightly increase with *d*. After comparing different confirmation measures, Fitelson and Hawthorne [28] believe that the likelihood ratio may be used to explain that a black raven can confirm "Ravens are black" more strongly than a non-black non-raven thing.

Unfortunately, Table 13 shows that the increments of all measures except *c*\* caused by Δ*d* = 1 are greater than or equal to those caused by Δ*a* = 1. That means these measures support the conclusion that a piece of white chalk can confirm "Ravens are black" more strongly than (or as strongly as) a black raven. Therefore, these measures cannot be used to clarify the Raven Paradox.

However, measure *c*\* is different. Since *c*\*(*e*1→*h*1) = (*a* − *c*)/(*a*∨*c*) and *c*\*(*h*0→*e*0) = (*d* − *c*)/(*d*∨*c*), where ∨ denotes the maximum, the Equivalence Condition does not hold, and measure *c*\* accords with the Nicod–Fisher criterion very well. Hence, according to measure *c*\*, the Raven Paradox no longer exists.

#### *5.2. About Incremental Confirmation and Absolute Confirmation*

In Table 13, if the initial numbers are *a* = *d* = 200 and *b* = *c* = 100, the increments of all measures caused by Δ*a* = 1 are much smaller than those in Table 13. For example, *D*(*e*1→*h*1) increases from 0.1667 to 0.1669, and *c*\*(*e*1→*h*1) increases from 0.5 to 0.5025. The increments are about 1/10 of those in Table 13. Therefore, the increment of the degree of confirmation brought about by a new example is closely related to the number of old examples, i.e., our prior knowledge.
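The scale dependence can be verified directly (again assuming Carnap's difference measure in the form *D* = *P*(*h*1|*e*1) − *P*(*h*1)):

```python
def c_star(a, c):
    # c*(e1 -> h1) = (a - c) / max(a, c)
    return (a - c) / max(a, c)

def D(a, b, c, d):
    # Carnap's difference measure, assumed form: P(h1|e1) - P(h1)
    n = a + b + c + d
    return a / (a + c) - (a + b) / n

for scale in (1, 10):  # scale 10 gives a = d = 200, b = c = 100
    a = d = 20 * scale
    b = c = 10 * scale
    print(scale,
          round(D(a + 1, b, c, d) - D(a, b, c, d), 5),
          round(c_star(a + 1, c) - c_star(a, c), 5))
```

At ten times the sample size, the increments from Δ*a* = 1 shrink to roughly one tenth: *D* moves from 0.1667 to about 0.1669 and *c*\* from 0.5 to about 0.5025, as stated above.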

The absolute confirmation requires that


Otherwise, the degree of confirmation calculated is unreliable. We need to replace the degree of confirmation with the degree interval of confirmation, such as [0.5, 1] instead of 1.

#### *5.3. Is Hypothesis Symmetry or Consequent Symmetry Desirable?*

Eells and Fitelson defined HS by *c*(*e*, *h*) = −*c*(*e*, −*h*); actually, this means *c*(*x*, *y*) = −*c*(*x*, −*y*) for any *x* and *y*. Similarly, ES is Antecedent Symmetry, which means *c*(*x*, *y*) = −*c*(−*x*, *y*) for any *x* and *y*. Since, from their point of view, *e* and *h* are not the antecedent and the consequent of a major premise, they could not speak of Antecedent Symmetry and Consequent Symmetry. Consider that *c*(*e*, *h*) becomes *c*(*h*, *e*). According to the literal meaning of HS (Hypothesis Symmetry), one may misunderstand HS as shown in Table 14.

**Table 14.** Misunderstood HS (Hypothesis Symmetry) and ES (Evidence Symmetry).


For example, this misunderstanding happens in [8,19], where the authors call *c*(*h*, *e*) = −*c*(*h*, −*e*) ES; however, it is in fact HS, or Consequent Symmetry. In [19], the authors think that *F*(*H*, *E*) (where the right side is the evidence) should have HS: *F*(*H*, *E*) = −*F*(−*H*, *E*), whereas *F*(*E*, *H*) should have ES: *F*(*E*, *H*) = −*F*(−*E*, *H*). However, this "ES" does not accord with the original meaning of ES in [14]. Both *F*(*H*, *E*) and *F*(*E*, *H*) possess HS instead of ES. More seriously, because of this misunderstanding, [19] concludes that ES and EHS (i.e., *c*(*H*, *E*) = *c*(−*H*, −*E*)), as well as HS, are desirable, and hence that measures *S*, *N*, and *C* are particularly valuable.

The author of this paper approves of the conclusion of Eells and Fitelson that only HS (i.e., Consequent Symmetry) is desirable. Therefore, it is necessary to make clear that *e* and *h* in *c*(*e*, *h*) are the antecedent and the consequent of the rule *e*→*h*. To avoid the misunderstanding, we had better replace *c*(*e*, *h*) with *c*(*e*→*h*) and use "Antecedent Symmetry" and "Consequent Symmetry" instead of "Evidence Symmetry" and "Hypothesis Symmetry".

#### *5.4. About Bayesian Confirmation and Likelihoodist Confirmation*

Measure *D*, proposed by Carnap, is often referred to as the standard Bayesian confirmation measure. The above analyses, however, show that *D* is only suitable as a measure for selecting hypotheses rather than for confirming major premises. Carnap opened the direction of Bayesian confirmation, but his explanation of *D* makes it easy to confuse a major premise's evidence (a sample) with a consequent's evidence (a minor premise).

Greco et al. [19] call confirmation measures using the conditional probability *P*(*h*|*e*) Bayesian confirmation measures, those using *P*(*e*|*h*) likelihoodist confirmation measures, and those for *h*→*e* converse Bayesian/likelihoodist confirmation measures. This division is very enlightening. However, the division of confirmation measures in this paper depends not on symbols but on methods. The optimized proportion of the believable part in the truth function is the channel confirmation measure *b*\*, which is similar to the likelihood ratio, reflecting how good the channel is. The optimized proportion of the believable part in the likelihood function is the prediction confirmation measure *c*\*, which is similar to the correct rate, reflecting how good the probability prediction is. The *b*\* may be called the logical Bayesian confirmation measure because it is derived with Logical Bayesian Inference [17], although *P*(*e*|*h*) may be used for *b*\*. The *c*\* may be regarded as the likelihoodist confirmation measure, although *P*(*h*|*e*) may be used for *c*\*.

This paper also provides converse channel/prediction confirmation measures for rule *h*→*e*. Confirmation measures *b*\*(*e*→*h*) and *c*\*(*e*→*h*) are related to misreporting rates, whereas converse confirmation measures *b*\*(*h*→*e*) and *c*\*(*h*→*e*) are related to underreporting rates.

#### *5.5. About the Certainty Factor for Probabilistic Expert Systems*

The Certainty Factor, denoted by *CF*, was proposed by Shortliffe and Buchanan for a backward-chaining expert system [7]. It indicates how true an uncertain inference *h*→*e* is. The relationship between measures *CF* and *Z* is *CF*(*h*→*e*) = *Z*(*e*→*h*) [36].

As pointed out by Heckerman and Shortliffe [36], although the Certainty Factor method has been widely adopted in rule-based expert systems, it also has theoretical and practical limitations, the main reason being that it is not compatible with statistical probability theory. They believe that the belief-network representation can overcome many of the limitations of the Certainty Factor model. However, the Certainty Factor model is simpler than the belief-network representation, so it may be possible to combine both to develop simpler probabilistic expert systems.

Measure *b*\*(*e*1→*h*1) is related to the believable part of the truth function of predicate *e*1(*h*). It is similar to *CF*(*h*1→*e*1). The differences are that *b*\*(*e*1→*h*1) is independent of *P*(*h*) whereas *CF*(*h*1→*e*1) is related to *P*(*h*); *b*\*(*e*1→*h*1) is compatible with statistical probability theory whereas *CF*(*h*1→*e*1) is not.

Is it possible to use measure *b*\* or *c*\* as the Certainty Factor to simplify belief-networks or probabilistic expert systems? This issue is worth exploring.

#### *5.6. How Confirmation Measures F, b\*, and c\* are Compatible with Popper's Falsification Thought*

Popper affirms that a counterexample can falsify a universal hypothesis or a major premise. However, for an uncertain major premise, how do counterexamples affect its degree of confirmation? Confirmation measures *F*, *b*\*, and *c*\* can reflect the importance of counterexamples. In Example 1 of Table 9, the proportion of positive examples is small, and the proportion of counterexamples is smaller still, so the degree of confirmation is large. This example shows that to improve the degree of confirmation, it is not necessary to increase the conditional probability *P*(*e*1|*h*1) (for *b*\*) or *P*(*h*1|*e*1) (for *c*\*). In Example 2 of Table 9, although the proportion of positive examples is large, the proportion of counterexamples is not small, so the degree of confirmation is very small. This example shows that to raise the degree of confirmation, it is not sufficient to increase the posterior probability; it is necessary and sufficient to decrease the relative proportion of counterexamples.

Popper affirms that a counterexample can falsify a universal hypothesis; that is, for the falsification of a strict universal hypothesis, what matters is the absence of counterexamples. For the confirmation of a universal hypothesis that is not strict, i.e., is uncertain, we can now explain that what matters is having fewer counterexamples. Therefore, confirmation measures *F*, *b*\*, and *c*\* are compatible with Popper's falsification thought.

Scheffler and Goodman [35] proposed selective confirmation based on Popper's falsification thought. They believe that black ravens support "Ravens are black" because black ravens undermine "Ravens are not black". Their reason why non-black ravens support "Ravens are not black" is that non-black ravens undermine the opposite hypothesis "Ravens are black". Their explanation is very meaningful. However, they did not provide the corresponding confirmation measure. Measure *c*\*(*e*1→*h*1) is what they need.

#### **6. Conclusions**

Using semantic information and statistical learning methods, and taking the medical test as an example, this paper has derived two confirmation measures, *b*\*(*e*→*h*) and *c*\*(*e*→*h*). Measure *b*\* is similar to measure *F* proposed by Kemeny and Oppenheim; like the likelihood ratio, it reflects the channel characteristics of the medical test, indicating how good a testing means is. Measure *c*\*(*e*→*h*) is similar to the correct rate but varies between −1 and 1. Both *b*\* and *c*\* can be used for probability predictions. The *b*\* is suitable for predicting the probability of disease when the prior probability of disease has changed. Measures *b*\* and *c*\* possess the symmetry/asymmetry properties proposed by Eells and Fitelson [14], the monotonicity proposed by Greco et al. [16], and the normalizing property (values between −1 and 1) suggested by many researchers. The new confirmation measures support absolute confirmation instead of incremental confirmation.

This paper has shown that most popular confirmation measures cannot help us diagnose COVID-19 infection, whereas measures *F* and *b*\* and the like, which are functions of the likelihood ratio, can. It has also shown that popular confirmation measures do not support the conclusion that a black raven confirms "Ravens are black" more strongly than a non-black non-raven thing, such as a piece of chalk. Measure *c*\* definitely denies the Equivalence Condition and exactly reflects the Nicod–Fisher criterion, and hence can be used to eliminate the Raven Paradox. The new confirmation measures *b*\* and *c*\*, as well as *F*, indicate that having fewer counterexamples is more important than having more positive examples; therefore, measures *F*, *b*\*, and *c*\* are compatible with Popper's falsification thought.

When the sample is small, the degree of confirmation calculated by any confirmation measure is unreliable, and hence the degree of confirmation should be replaced with a degree interval of confirmation. Further studies combining the theory of hypothesis testing are needed. It is also worth exploring whether the new confirmation measures can be used as Certainty Factors for belief networks.

**Supplementary Materials:** The Excel file for the data in Tables 9, 12 and 13 is available online at http://survivor99.com/lcg/Table9-12-13NAT.zip. Readers can test different confirmation measures by changing *a*, *b*, *c*, and *d*.

**Funding:** This research received no external funding.

**Acknowledgments:** The author thanks Zhilin Zhang of Fudan University and Jianyong Zhou of Changsha University because this study benefited from communication with them. The author thanks Peizhuang Wang of Liaoning Technical University for his long-term support and encouragement. The author also thanks the anonymous reviewers for their comments and suggestions, which evidently improved this paper.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **On a Class of Tensor Markov Fields**

#### **Enrique Hernández-Lemus 1,2**


Received: 6 March 2020; Accepted: 9 April 2020; Published: 16 April 2020

**Abstract:** Here, we introduce a class of Tensor Markov Fields intended as probabilistic graphical models for random variables spanned over multiplexed contexts. These fields are an extension of Markov Random Fields to tensor-valued random variables. By extending the results of Dobruschin, Hammersley, and Clifford to such tensor-valued fields, we prove that Tensor Markov Fields are indeed Gibbs fields whenever strictly positive probability measures are considered. Hence, there is a direct relationship with many results from theoretical statistical mechanics. We show how this class of Markov fields can be built based on statistical dependency structures inferred on information-theoretical grounds from empirical data. Thus, aside from purely theoretical interest, the Tensor Markov Fields described here may be useful for mathematical modeling and data analysis due to their intrinsic simplicity and generality.

**Keywords:** Markov random fields; probabilistic graphical models; multilayer networks

#### **1. General Definitions**

Here, we introduce Tensor Markov Fields, i.e., Markov random fields [1,2] over tensor spaces. Tensor Markov Fields (TMFs) represent the joint probability distribution for a set of tensor-valued random variables.

Let *X* = *X<sub>α</sub><sup>β</sup>* be one such tensor-valued random variable. Here *X<sub>i</sub><sup>j</sup>* ∈ *X* may represent either a variable *i* ∈ *α* that may exist in a given context or layer *j* ∈ *β* (giving rise to a class of so-called multilayer graphical models or multilayer networks) or a single tensor-valued quantity *X<sub>i</sub><sup>j</sup>*. A TMF will be an undirected multilayer graph representing the statistical dependency structure of *X* as given by the joint probability distribution P(*X*).

As an extension of the Markov random field case, a TMF is a multilayer graph *G*ˆ = (*V*, *E*) formed by a set *V* of vertices or nodes (the *X<sub>i</sub><sup>j</sup>*'s) and a set *E* ⊆ *V* × *V* of edges connecting the nodes, either within the same *layer* or across different layers (Figure 1). The set of edges represents a neighborhood law *N* stating which vertex is connected to (dependent on) which other vertex in the multilayer graph. With this in mind, a TMF can also be represented (slightly abusing notation) as *G*ˆ = (*V*, *N*). The set of neighbors of a given point *X<sub>i</sub><sup>j</sup>* will be denoted *N<sub>X<sub>i</sub><sup>j</sup></sub>*.

**Figure 1.** A Tensor Markov Field, represented as a multilayer graph spanning the *X<sub>i</sub><sup>j</sup>* with *i* = {1, 2, 3, 4} and *j* = {*I*, *II*}. To illustrate, layer *I* is colored blue and layer *II* is colored green.
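As an illustration of this structure, a multilayer graph can be coded with nodes as (layer, index) pairs. This is a minimal sketch; the edges below are illustrative rather than the exact edge set of Figure 1:

```python
from collections import defaultdict

class MultilayerGraph:
    """Undirected multilayer graph: nodes are (layer, index) pairs, and
    edges may be intra-layer or inter-layer."""
    def __init__(self):
        self.adj = defaultdict(set)

    def add_edge(self, u, v):
        self.adj[u].add(v)
        self.adj[v].add(u)

    def neighbors(self, node):
        # The neighborhood law N applied to one node.
        return self.adj[node]

g = MultilayerGraph()
g.add_edge(("I", 1), ("I", 2))   # intra-layer edge on layer I
g.add_edge(("I", 2), ("II", 2))  # inter-layer edge between layers I and II
print(sorted(g.neighbors(("I", 2))))  # [('I', 1), ('II', 2)]
```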

#### *1.1. Configuration*

It is possible to assign to each point in the multilayer graph one of a finite set *S* of labels. Such an assignment will be called a *configuration*. We will assign probability measures to the set Ω of all possible configurations *ω*. Hence, *ω<sub>A</sub>* represents the configuration *ω* restricted to the subset *A* of *V*. It is possible to think of *ω<sub>A</sub>* as a configuration on the smaller multilayer graph *G*ˆ<sub>*A*</sub> obtained by restricting *V* to the points of *A* (Figure 2).

**Figure 2.** Three different configurations of a Tensor Markov Field: panels (**i**), (**ii**) and (**iii**) present different configurations or *states* of the TMF. Labels are represented by color intensity.

#### *1.2. Local Characteristics*

It is also possible to extend the notion of *local characteristics* from MRFs. The local characteristics of a probability measure P defined on Ω are the conditional probabilities of the form:

$$\mathbb{P}(\omega\_{t} \mid \omega\_{\hat{G} \backslash t}) = \mathbb{P}(\omega\_{t} \mid \omega\_{N\_{t}}) \tag{1}$$

i.e., the probability that the point *t* is assigned the value *ωt*, given the values at all other points of the multilayer graph. In order to make explicit the tensorial nature of the multilayer graph *G*ˆ, let us re-write Equation (1). Let us also recall the fact that the probability measure will define a tensor Markov random field (a TMF) if the local characteristics depend only of the knowledge of the outcomes at neighboring points, i.e., if for every *ω*

$$\mathbb{P}(\omega\_{X\_i^j} \mid \omega\_{\hat{G} \backslash X\_i^j}) = \mathbb{P}(\omega\_{X\_i^j} \mid \omega\_{N\_{X\_i^j}}) \tag{2}$$

#### *1.3. Cliques*

Given an arbitrary graph (or, in the present case, a multilayer graph), we shall say that a set of points *C* is a *clique* if every pair of points in *C* are neighbors (see Figure 3). This definition includes the empty set as a clique. A clique is thus a set whose *induced subgraph* is complete; for this reason, cliques are also called *complete induced subgraphs* or *maximal subgraphs* (although this latter term may be ambiguous).

**Figure 3.** Cliques on a Tensor Markov Field: the set {*X<sub>2</sub><sup>I</sup>*, *X<sub>3</sub><sup>I</sup>*, *X<sub>4</sub><sup>I</sup>*} forms an intra-layer 2-clique (as marked by the red edges, all on layer *I*), and the set {*X<sub>3</sub><sup>I</sup>*, *X<sub>3</sub><sup>II</sup>*} forms an inter-layer 1-clique (marked by the blue edge connecting layers *I* and *II*). However, the set {*X<sub>3</sub><sup>I</sup>*, *X<sub>3</sub><sup>II</sup>*, *X<sub>4</sub><sup>I</sup>*, *X<sub>4</sub><sup>II</sup>*} is not a clique, since there are no edges between *X<sub>3</sub><sup>I</sup>* and *X<sub>4</sub><sup>II</sup>* nor between *X<sub>4</sub><sup>I</sup>* and *X<sub>3</sub><sup>II</sup>*.
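The clique test is direct to implement. The adjacency below is an illustrative reading of Figure 3; `is_clique` checks that every pair of nodes in the candidate set are neighbors:

```python
from itertools import combinations

def is_clique(adj, nodes):
    """True if every pair of distinct points in `nodes` are neighbors in `adj`
    (a dict mapping node -> set of neighbors). The empty set counts as a
    clique, as in the definition above."""
    return all(v in adj[u] for u, v in combinations(nodes, 2))

# Illustrative adjacency consistent with the caption of Figure 3.
adj = {
    ("I", 2):  {("I", 3), ("I", 4)},
    ("I", 3):  {("I", 2), ("I", 4), ("II", 3)},
    ("I", 4):  {("I", 2), ("I", 3), ("II", 4)},
    ("II", 3): {("I", 3), ("II", 4)},
    ("II", 4): {("I", 4), ("II", 3)},
}
print(is_clique(adj, [("I", 2), ("I", 3), ("I", 4)]))  # True: intra-layer clique
print(is_clique(adj, [("I", 3), ("II", 3)]))           # True: inter-layer clique
print(is_clique(adj, [("I", 3), ("II", 3), ("I", 4), ("II", 4)]))  # False
```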

#### *1.4. Configuration Potentials*

A *potential* *η* is a way to assign a number *η<sub>A</sub>*(*ω*) to every subconfiguration *ω<sub>A</sub>* of a configuration *ω* in the multilayer graph *G*ˆ. Given a potential, we shall say that it defines (or better, induces) a *dimensionless energy* *U*(*ω*) on the set of all configurations *ω* by

$$U(\omega) = \sum\_{A} \eta\_{A}(\omega) \tag{3}$$

In the preceding expression, for fixed *ω*, the sum is taken over all subsets *A* ⊆ *V*, including the empty set. We can define a probability measure, called the *Gibbs measure* induced by *U*, as

$$\mathbb{P}(\omega) = \frac{e^{-U(\omega)}}{Z} \tag{4}$$


with *Z* a normalization constant called the *partition function*.

$$Z = \sum\_{\omega} e^{-U(\omega)} \tag{5}$$

In physics, the term *potential* is often used in connection with so-called potential energies. Physicists often call *η<sub>A</sub>* a *dimensionless potential energy*, and they call *φ<sub>A</sub>* = *e*<sup>−*η<sub>A</sub>*</sup> a potential.

Equations (4) and (5) can be thus rewritten as:

$$\mathbb{P}(\omega) = \frac{\prod\_{A} \phi\_{A}(\omega)}{Z} \tag{6}$$

$$Z = \sum\_{\omega} \prod\_{A} \phi\_A(\omega) \tag{7}$$

Since this latter use is more common in probability and graph theory, we will refer to Equations (6) and (7) as the definitions of Gibbs measure and partition function (respectively) unless otherwise stated.
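Equations (6) and (7) can be evaluated by brute force on a small example. The sketch below builds a Gibbs measure on a three-node path graph with illustrative pairwise potentials (all names and values are assumptions for illustration):

```python
from itertools import product

nodes = ["x", "y", "z"]
cliques = [("x", "y"), ("y", "z")]  # the edges of a path graph

def phi(clique, omega):
    # Illustrative pairwise potential: favor agreeing neighbors.
    u, v = clique
    return 2.0 if omega[u] == omega[v] else 1.0

def weight(omega):
    # The unnormalized product of clique potentials in Equation (6).
    w = 1.0
    for c in cliques:
        w *= phi(c, omega)
    return w

configs = [dict(zip(nodes, labels)) for labels in product((0, 1), repeat=3)]
Z = sum(weight(o) for o in configs)                           # Equation (7)
gibbs = {tuple(o.values()): weight(o) / Z for o in configs}   # Equation (6)
print(Z, gibbs[(0, 0, 0)])
```

Here *Z* = 18, the all-agreeing configurations are the most probable (4/18 each), and the probabilities sum to 1 as required.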

#### *1.5. Gibbs Fields*

A potential is called a nearest neighbor Gibbs potential if *φA*(*ω*) = 1 whenever *A* is not a clique. It is customary to refer as a *Gibbs measure* to a measure induced by a nearest neighbor Gibbs potential. However, it is possible to define more general Gibbs measures by considering other types of potentials.

The inclusion of all cliques in the calculation of the Gibbs measure is necessary to establish the equivalence between Gibbs random fields and Markov random fields. Let us see how a nearest neighbor Gibbs measure on a multilayer graph determines a TMF.

Let P(*ω*) be a probability measure determined on Ω by a nearest neighbor Gibbs potential *φ*:

$$\mathbb{P}(\omega) = \frac{\prod\_{C} \phi\_{C}(\omega)}{Z} \tag{8}$$

with the product taken over all cliques *C* on the multilayer graph *G*ˆ. Then,

$$\mathbb{P}(\omega\_{X\_i^j} \mid \omega\_{\hat{G} \backslash X\_i^j}) = \frac{\mathbb{P}(\omega)}{\sum\_{\omega'} \mathbb{P}(\omega')} \tag{9}$$

Here *ω*′ is any configuration that agrees with *ω* at all points except *X<sub>i</sub><sup>j</sup>*.

$$\mathbb{P}(\omega\_{X\_i^j} \mid \omega\_{\hat{G} \backslash X\_i^j}) = \frac{\prod\_{C} \phi\_{C}(\omega)}{\sum\_{\omega'} \prod\_{C} \phi\_{C}(\omega')} \tag{10}$$

For any clique *C* that does not contain *X<sub>i</sub><sup>j</sup>*, *φ<sub>C</sub>*(*ω*) = *φ<sub>C</sub>*(*ω*′), so all the terms corresponding to cliques that do not contain the point *X<sub>i</sub><sup>j</sup>* cancel from both the numerator and the denominator in Equation (10). Therefore, this probability depends only on the values *x<sub>i</sub><sup>j</sup>* at *X<sub>i</sub><sup>j</sup>* and its neighbors. P thus defines a TMF.
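This cancellation can be checked numerically: the conditional of a single node computed from the full product over all cliques (Equation (10)) equals the one computed using only the cliques containing that node. The graph and potentials below are illustrative assumptions:

```python
nodes = ["x", "y", "z"]
cliques = [("x", "y"), ("y", "z"), ("x", "z")]  # a triangle of pairwise cliques

def phi(clique, omega):
    # Illustrative pairwise potential.
    u, v = clique
    return 3.0 if omega[u] == omega[v] else 1.0

def weight(omega, cs):
    w = 1.0
    for c in cs:
        w *= phi(c, omega)
    return w

rest = {"x": 0, "z": 0}  # observed labels at every node except "y"
# Conditional from the full product over all cliques (Equation (10)).
num_full = {y: weight({**rest, "y": y}, cliques) for y in (0, 1)}
full = num_full[1] / (num_full[0] + num_full[1])
# Keep only the cliques containing "y"; the others cancel out.
cl_y = [c for c in cliques if "y" in c]
num_loc = {y: weight({**rest, "y": y}, cl_y) for y in (0, 1)}
local = num_loc[1] / (num_loc[0] + num_loc[1])
print(full, local)  # the two conditionals agree
```

The clique ("x", "z") contributes the same factor to every term, so it divides out, exactly as argued above.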

A more general proof of this equivalence is given by the Hammersley–Clifford theorem, which will be presented in the following section.

#### **2. Extended Hammersley Clifford Theorem**

Here we will outline a proof of an extension of the Hammersley–Clifford theorem for Tensor Markov Fields (i.e., we will show that a Tensor Markov Field is equivalent to a Tensor Gibbs Field).

Let *G*ˆ = (*V*, *N*) be a multilayer graph representing a TMF as defined in the previous section, with *V* = *X<sub>α</sub><sup>β</sup>* = {*X<sub>i</sub><sup>j</sup>*} a set of vertices over a tensor field and *N* a neighborhood law that connects vertices over this tensor field. The field *G*ˆ obeys the following neighborhood law, given its Markov property (see Equation (2)):

$$\mathbb{P}(X\_i^j | X\_{\hat{G}\backslash X\_i^j}) = \mathbb{P}(X\_i^j | X\_{N\_i^j}) \tag{11}$$

Here *X<sub>N<sub>i</sub><sup>j</sup></sub>* denotes the neighbors of *X<sub>i</sub><sup>j</sup>*. The Hammersley–Clifford theorem states that an MRF is also a local Gibbs field. In the case of a TMF, we have the following expression:

$$\mathbb{P}(X) = \frac{1}{Z} \prod\_{c \in C\_{\hat{G}}} \phi\_{c}(X\_{c}) \tag{12}$$

In order to prove the equivalence of Equations (11) and (12), we will first build the deductive (backward direction) part of the proof, to be complemented with a constructive (forward direction) part, as presented in the following subsections.

#### *2.1. Backward Direction*

Let us consider Equation (11) in the light of Bayes' theorem:

$$\mathbb{P}(X\_i^j \mid X\_{\hat{G}\backslash X\_i^j}) = \frac{\mathbb{P}(X\_i^j, X\_{N\_i^j})}{\mathbb{P}(X\_{N\_i^j})} \tag{13}$$

Using a clique approach to calculate the joint and marginal probabilities (see the next subsection for support of the following statement; here *D<sub>i</sub><sup>j</sup>* denotes the domain formed by *X<sub>i</sub><sup>j</sup>* together with its neighbors):

$$\mathbb{P}(X\_i^j \mid X\_{\hat{G}\backslash X\_i^j}) = \frac{\sum\_{\hat{G}\backslash D\_i^j} \prod\_{c \in C\_{\hat{G}}} \phi\_{c}(X\_{c})}{\sum\_{X\_i^j} \sum\_{\hat{G}\backslash D\_i^j} \prod\_{c \in C\_{\hat{G}}} \phi\_{c}(X\_{c})} \tag{14}$$

Let us split the product ∏<sub>*c*∈*C*<sub>*G*ˆ</sub></sub> *φ<sub>c</sub>*(*X<sub>c</sub>*) into two products: one over the set of cliques that contain *X<sub>i</sub><sup>j</sup>* (let us call it *C<sub>i</sub><sup>j</sup>*) and another over the set of cliques not containing *X<sub>i</sub><sup>j</sup>* (let us call it *R<sub>i</sub><sup>j</sup>*):

$$\mathbb{P}(X_i^j | X_{\hat{G}\backslash X_i^j}) = \frac{\sum_{\hat{G}\backslash D_i^j} \prod_{c \in C_i^j} \phi_c(X_c) \prod_{c \in R_i^j} \phi_c(X_c)}{\sum_{X_i^j} \sum_{\hat{G}\backslash D_i^j} \prod_{c \in C_i^j} \phi_c(X_c) \prod_{c \in R_i^j} \phi_c(X_c)} \tag{15}$$

Factoring out the terms depending on $X_i^j$ (which do not contribute to cliques in the domain $\hat{G} \backslash X_i^j$):

$$\mathbb{P}(X_i^j | X_{\hat{G}\backslash X_i^j}) = \frac{\prod_{c \in C_i^j} \phi_c(X_c) \sum_{\hat{G}\backslash D_i^j} \prod_{c \in R_i^j} \phi_c(X_c)}{\sum_{X_i^j} \prod_{c \in C_i^j} \phi_c(X_c) \sum_{\hat{G}\backslash D_i^j} \prod_{c \in R_i^j} \phi_c(X_c)} \tag{16}$$

The term $\sum_{\hat{G}\backslash D_i^j} \prod_{c \in R_i^j} \phi_c(X_c)$ does not involve $X_i^j$ (by construction), so it can be factored out from the summation over $X_i^j$ in the denominator.

$$\mathbb{P}(X_i^j | X_{\hat{G}\backslash X_i^j}) = \frac{\prod_{c \in C_i^j} \phi_c(X_c) \sum_{\hat{G}\backslash D_i^j} \prod_{c \in R_i^j} \phi_c(X_c)}{\sum_{\hat{G}\backslash D_i^j} \prod_{c \in R_i^j} \phi_c(X_c) \sum_{X_i^j} \prod_{c \in C_i^j} \phi_c(X_c)} \tag{17}$$

We can cancel the common term in the numerator and denominator:

$$\mathbb{P}(X_i^j | X_{\hat{G}\backslash X_i^j}) = \frac{\prod_{c \in C_i^j} \phi_c(X_c)}{\sum_{X_i^j} \prod_{c \in C_i^j} \phi_c(X_c)} \tag{18}$$

*Entropy* **2020**, *22*, 451

Then we multiply by $\frac{\prod_{c \in R_i^j} \phi_c(X_c)}{\prod_{c \in R_i^j} \phi_c(X_c)}$:

$$\mathbb{P}(X_i^j | X_{\hat{G}\backslash X_i^j}) = \frac{\prod_{c \in C_i^j} \phi_c(X_c) \prod_{c \in R_i^j} \phi_c(X_c)}{\sum_{X_i^j} \prod_{c \in C_i^j} \phi_c(X_c) \prod_{c \in R_i^j} \phi_c(X_c)} \tag{19}$$

Remembering that $C_i^j \cup R_i^j = C_{\hat{G}}$,

$$\mathbb{P}(X_i^j | X_{\hat{G}\backslash X_i^j}) = \frac{\prod_{c \in C_{\hat{G}}} \phi_c(X_c)}{\sum_{X_i^j} \prod_{c \in C_{\hat{G}}} \phi_c(X_c)} \tag{20}$$

Equation (20) is nothing but the definition of a local Gibbs Tensor Field (Equation (12)).
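The equivalence just derived can be checked by brute-force enumeration on a toy field. The following sketch (an illustration with invented potentials, using an ordinary binary chain as the simplest instance of a Markov field) builds a Gibbs joint from clique potentials as in Equation (12) and verifies the Markov property of Equation (11):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
phi01 = rng.uniform(0.1, 1.0, (2, 2))   # potential on clique {0, 1}
phi12 = rng.uniform(0.1, 1.0, (2, 2))   # potential on clique {1, 2}

# Gibbs joint (Eq. 12) over the chain 0 - 1 - 2, normalized by Z = P.sum()
P = np.array([[[phi01[a, b] * phi12[b, c] for c in (0, 1)]
               for b in (0, 1)] for a in (0, 1)])
P /= P.sum()

# Markov property (Eq. 11): P(X0 | X1, X2) depends only on the neighbor X1
for b, c in itertools.product((0, 1), repeat=2):
    lhs = P[1, b, c] / P[:, b, c].sum()          # P(X0=1 | X1=b, X2=c)
    rhs = P[1, b, :].sum() / P[:, b, :].sum()    # P(X0=1 | X1=b)
    assert np.isclose(lhs, rhs)
print("Markov property verified on the toy chain")
```

Any strictly positive choice of potentials passes this check, which is exactly the backward direction of the theorem restricted to a three-node field.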

#### *2.2. Forward Direction*

In this subsection we will show how to express the clique potential functions *φc*(*Xc*), given the joint probability distribution over the tensor field and the Markov property.

Consider any subset $\sigma \subset \hat{G}$ of the multilayer graph $\hat{G}$. We define a candidate potential function (following the Möbius inversion lemma) [3] as follows:

$$f_\sigma(X_\sigma = x_\sigma) = \prod_{\zeta \subseteq \sigma} \mathbb{P}(X_\zeta = x_\zeta, X_{\hat{G}\backslash\zeta} = 0)^{(-1)^{|\sigma|-|\zeta|}} \tag{21}$$

In order for *fσ* to be a proper clique potential, it must satisfy the following two conditions:

(i) $\prod_{\sigma \subseteq \hat{G}} f_\sigma(X_\sigma) = \mathbb{P}(X)$

(ii) $f_\sigma(X_\sigma) = 1$ whenever $\sigma$ is not a clique

To prove (i), we need to show that all factors in $\prod_{\sigma \subseteq \hat{G}} f_\sigma(X_\sigma = x_\sigma)$ cancel out, except for $\mathbb{P}(X)$. To do this, it will be useful to consider the following *combinatorial expansion of zero*:

$$0 = (1 - 1)^{K} = \mathbb{C}_0^K - \mathbb{C}_1^K + \mathbb{C}_2^K - \cdots + (-1)^K \mathbb{C}_K^K \tag{22}$$

Here, of course, $\mathbb{C}_B^A$ is the number of combinations of $B$ elements from an $A$-element set. Let us consider any subset $\zeta$ of $\hat{G}$ and the factor $\Delta = \mathbb{P}(X_\zeta = x_\zeta, X_{\hat{G}\backslash\zeta} = 0)$.

For the case of $f_\zeta(X_\zeta)$ this factor occurs as $\Delta^{(-1)^0} = \Delta$. The same factor also occurs in the functions over subsets containing $\zeta$ and additional elements. If a subset includes $\zeta$ and one additional element, there are $\mathbb{C}_1^{|\hat{G}|-|\zeta|}$ such functions, and the additional element creates an inverse factor $\Delta^{(-1)^1} = \Delta^{-1}$. The functions over subsets containing $\zeta$ and two additional elements contribute a factor $\Delta^{(-1)^2} = \Delta$ each. If we continue this process and consider Equation (22), it is evident that all odd cardinality-difference terms cancel out with all even cardinality-difference terms, so that the only remaining factor corresponds to $\zeta = \hat{G}$ and equals $\mathbb{P}(X)$, thus fulfilling condition (i).
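The cancellation argument above reduces to Equation (22), which can be verified numerically. A small check (an illustration, not part of the original text): for a fixed subset $\zeta$, the net exponent of $\Delta$ contributed by all supersets of $\zeta$ is the alternating binomial sum, which vanishes unless $\zeta = \hat{G}$.

```python
from math import comb

# Equation (22): the alternating sum of binomial coefficients vanishes.
for K in range(1, 10):
    assert sum((-1) ** k * comb(K, k) for k in range(K + 1)) == 0

def net_exponent(n_extra):
    """Net exponent of Delta over all supersets of zeta, where
    n_extra = |G-hat| - |zeta| is the number of elements that may be added;
    each superset with k extra elements contributes exponent (-1)**k."""
    return sum((-1) ** k * comb(n_extra, k) for k in range(n_extra + 1))

assert net_exponent(0) == 1                             # zeta = G-hat: P(X) survives
assert all(net_exponent(m) == 0 for m in range(1, 10))  # every other factor cancels
print("combinatorial cancellation verified")
```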

In order to show how condition (ii) is fulfilled, we will need to use the Markov property of TMFs. Let us consider $\sigma^* \subset \hat{G}$ that is not a clique. Then it will be possible to find two nodes $X_i^h$ and $X_j^k$ in $\sigma^*$ that are not connected to each other. Let us recall Equation (21):

$$f_{\sigma^*}(X_{\sigma^*} = x_{\sigma^*}) = \prod_{\zeta \subseteq \sigma^*} \mathbb{P}(X_\zeta = x_\zeta, X_{\hat{G}\backslash\zeta} = 0)^{(-1)^{|\sigma^*|-|\zeta|}} \tag{23}$$

An arbitrary subset $\zeta$ may belong to one of the following classes: (i) $\zeta = \omega$, a generic subset of $\sigma^*$ containing neither $X_i^h$ nor $X_j^k$; (ii) $\zeta = \omega \cup \{X_i^h\}$; (iii) $\zeta = \omega \cup \{X_j^k\}$; or (iv) $\zeta = \omega \cup \{X_i^h, X_j^k\}$. If we write down Equation (23) factored into these contributions we get:

$$f_{\sigma^*}(X_{\sigma^*} = x_{\sigma^*}) = \prod_{\omega \subseteq \sigma^* \backslash \{X_i^h, X_j^k\}} \left[\frac{\mathbb{P}(X_\omega, X_{\hat{G}\backslash\omega} = 0)\,\mathbb{P}(X_{\omega \cup \{X_i^h, X_j^k\}}, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)}{\mathbb{P}(X_{\omega \cup \{X_i^h\}}, X_{\hat{G}\backslash(\omega \cup \{X_i^h\})} = 0)\,\mathbb{P}(X_{\omega \cup \{X_j^k\}}, X_{\hat{G}\backslash(\omega \cup \{X_j^k\})} = 0)}\right]^{(-1)^{|\sigma^*|-|\omega|}} \tag{24}$$

Let us consider two of the factors in Equation (24) in the light of Bayes' theorem:

$$\frac{\mathbb{P}(X_\omega, X_{\hat{G}\backslash\omega} = 0)}{\mathbb{P}(X_{\omega \cup \{X_i^h\}}, X_{\hat{G}\backslash(\omega \cup \{X_i^h\})} = 0)} = \frac{\mathbb{P}(X_i^h = 0 \,|\, X_j^k = 0, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)\,\mathbb{P}(X_j^k = 0, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)}{\mathbb{P}(X_i^h \,|\, X_j^k = 0, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)\,\mathbb{P}(X_j^k = 0, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)} \tag{25}$$

We can notice that the priors in the numerator and denominator of Equation (25) are the same, and we can then cancel them out. Since by definition $X_i^h$ and $X_j^k$ are conditionally independent given the rest of the multilayer graph, we can also replace the default value $X_j^k = 0$ with $X_j^k$ instead.

$$\frac{\mathbb{P}(X_\omega, X_{\hat{G}\backslash\omega} = 0)}{\mathbb{P}(X_{\omega \cup \{X_i^h\}}, X_{\hat{G}\backslash(\omega \cup \{X_i^h\})} = 0)} = \frac{\mathbb{P}(X_i^h = 0 \,|\, X_j^k, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)\,\mathbb{P}(X_j^k, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)}{\mathbb{P}(X_i^h \,|\, X_j^k, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)\,\mathbb{P}(X_j^k, X_\omega, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)} \tag{26}$$

Since $X_i^h$ and $X_j^k$ are conditionally independent given the rest of the multilayer graph, we can also replace the condition on $X_j^k$ with any other value without affecting $X_i^h$. By adjusting this prior *conveniently*, we can write:

$$\frac{\mathbb{P}(X_\omega, X_{\hat{G}\backslash\omega} = 0)}{\mathbb{P}(X_{\omega \cup \{X_i^h\}}, X_{\hat{G}\backslash(\omega \cup \{X_i^h\})} = 0)} = \frac{\mathbb{P}(X_{\omega \cup \{X_j^k\}}, X_{\hat{G}\backslash(\omega \cup \{X_j^k\})} = 0)}{\mathbb{P}(X_{\omega \cup \{X_i^h, X_j^k\}}, X_{\hat{G}\backslash(\omega \cup \{X_i^h, X_j^k\})} = 0)} \tag{27}$$

By substituting Equation (27) into Equation (24) we get (condition (ii)):

$$f_{\sigma^*}(X_{\sigma^*}) = 1 \tag{28}$$

#### **3. An Information-Theoretical Class of Tensor Markov Fields**

Let us consider again the set of tensor-valued random variables $X = X_\alpha^\beta$. It is possible to calculate, for every duplex (pair of variables) in $X$, the mutual information function $I(\cdot, \cdot)$ [4]:

$$I(X_i^h, X_j^k) = \sum_{\Omega} \sum_{\Omega'} p(X_i^h, X_j^k) \log \frac{p(X_i^h, X_j^k)}{p(X_i^h)\, p(X_j^k)} \tag{29}$$

Let us consider a multilayer graph scenario. From now on, the indices $i$, $j$ will refer to the random variables, whereas $h$, $k$ will be indices for the layers. $\Omega$ and $\Omega'$ are the respective sampling spaces (which may, of course, be equal). In order to discard self-information, let us define the *off-diagonal mutual information* as follows:

$$I^\dagger(X\_i^h, X\_j^k) = I(X\_i^h, X\_j^k) \times \left(1 - \delta\_{X\_i^h X\_j^k}\right) \tag{30}$$

With the bi-delta function $\delta_{X_i^h X_j^k}$ defined as:

$$\delta\_{X\_i^h X\_j^k} = \begin{cases} 1, & \text{if } i = j \text{ and } h = k \\ 0, & \text{otherwise} \end{cases} \tag{31}$$

By having the complete set of off-diagonal mutual information functions for all the random variables and layers, it is possible to define the following hyper-matrix elements:

$$A_{ij}^{hk} = \Theta\left[I^\dagger(X_i^h, X_j^k) - I_0\right] \tag{32}$$

as well as:

$$W_{ij}^{hk} = A_{ij}^{hk} \circ I^\dagger(X_i^h, X_j^k) \tag{33}$$

Here $\Theta[\cdot]$ is Heaviside's function and $I_0$ is a lower bound for mutual information (a threshold) above which it is considered *significant*.

We call $A_{ij}^{hk}$ and $W_{ij}^{hk}$ the *adjacency hypermatrix* and the *strength hypermatrix*, respectively (notice that $\circ$ in Equation (33) represents the product of a scalar times a hypermatrix). The adjacency hypermatrix and the strength hypermatrix define the (unweighted and weighted, respectively) neighborhood law of the associated TMF, hence the statistical dependency structure for the set of random variables and contexts (layers).
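As an illustration (not part of the original paper), a minimal `numpy` sketch of Equations (29)–(33) for discretized samples might look as follows; the plug-in mutual information estimator and the `data[h][i]` layout are assumptions made for the example:

```python
import numpy as np

def mutual_information(x, y, n_states):
    """Plug-in estimate of I(X;Y) (in nats) for discrete samples x, y (Eq. 29)."""
    joint = np.zeros((n_states, n_states))
    for a, b in zip(x, y):
        joint[a, b] += 1
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / np.outer(px, py)[nz])))

def hypermatrices(data, I0):
    """data[h][i] = discretized samples of variable i in layer h.
    Returns the adjacency hypermatrix A[h,k,i,j] (Eq. 32) and the
    strength hypermatrix W[h,k,i,j] (Eq. 33)."""
    L, N = len(data), len(data[0])
    n_states = int(max(s.max() for layer in data for s in layer)) + 1
    A = np.zeros((L, L, N, N))
    W = np.zeros((L, L, N, N))
    for h in range(L):
        for k in range(L):
            for i in range(N):
                for j in range(N):
                    if h == k and i == j:
                        continue  # off-diagonal MI discards self-information (Eq. 30)
                    mi = mutual_information(data[h][i], data[k][j], n_states)
                    A[h, k, i, j] = 1.0 if mi >= I0 else 0.0  # Heaviside threshold
                    W[h, k, i, j] = A[h, k, i, j] * mi
    return A, W
```

With two layers holding the same pair of variables, a strongly dependent pair survives the threshold while an independent pair does not, reproducing the intended neighborhood law.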

Although the adjacency and strength hypermatrices are indeed proper representations of the undirected (unweighted and weighted) dependency structure of $\mathbb{P}(X)$, it has been considered advantageous to embed them into a tensor linear structure, in order to work out some of the mathematical properties of such fields with the methods of tensor algebra. One relevant proposal in this regard has been advanced by De Domenico and collaborators in the context of multilayer networks.

Following the ideas of De Domenico and co-workers [5], we introduce the unweighted and weighted adjacency 4-tensors (respectively) as follows:

$$\mathbb{A}^{\alpha\gamma}_{\beta\delta} = \sum_{h,k=1}^{L} \sum_{i,j=1}^{N} A_{ij}^{hk} \otimes \xi^{\alpha\gamma}_{\beta\delta}[ijhk] \tag{34}$$

$$\mathbb{W}^{\alpha\gamma}_{\beta\delta} = \sum_{h,k=1}^{L} \sum_{i,j=1}^{N} W_{ij}^{hk} \otimes \xi^{\alpha\gamma}_{\beta\delta}[ijhk] \tag{35}$$

Here, $\xi^{\alpha\gamma}_{\beta\delta} = \xi^{\alpha\gamma}_{\beta\delta}[ijhk]$ is a unit four-tensor whose role is to provide the hypermatrices with the desired linear properties (projections, contractions, etc.). Square brackets indicate that the indices $i$, $j$, $h$ and $k$ belong to the $\alpha$, $\beta$, $\gamma$ and $\delta$ dimensions, and $\otimes$ represents a form of *tensor matricization* product (i.e., one producing a 4-tensor out of a 4-index hypermatrix times a unitary 4-tensor).

#### *3.1. Conditional Independence in Tensor Markov Fields*

In order to discuss the conditional independence structure induced by the present class of TMFs, let us analyze Equation (32). As already mentioned, the adjacency hypermatrix $A_{ij}^{hk}$ represents the neighborhood law (as given by the Markov property) on the multilayer graph $\hat{G}$ (i.e., the TMF). Every non-zero entry in this hypermatrix represents a statistical dependence relation between two elements of $X$. The conditional dependence structure of TMFs inferred from mutual information measures via Equation (32) is related not only to the statistical independence conditions (as given by a zero mutual information measure between two elements), but also to the lower bound $I_0$ and, in general, to the dependency structure of the whole multilayer graph.

The definition of conditional independence (CI) for tensor random variables is as follows:

$$(X_i^h \perp\!\!\!\perp X_j^k)|X_l^m \iff \mathbb{F}_{X_i^h, X_j^k | X_l^m = X_l^{m*}}(X_i^{h*}, X_j^{k*}) = \mathbb{F}_{X_i^h | X_l^m = X_l^{m*}}(X_i^{h*}) \cdot \mathbb{F}_{X_j^k | X_l^m = X_l^{m*}}(X_j^{k*}) \tag{36}$$

$$\forall\; X_i^h, X_j^k, X_l^m \in X$$

Here $\perp\!\!\!\perp$ represents conditional independence between two random variables, where $\mathbb{F}_{X_i^h, X_j^k | X_l^m = X_l^{m*}}(X_i^{h*}, X_j^{k*}) = Pr(X_i^h \leq X_i^{h*}, X_j^k \leq X_j^{k*} \,|\, X_l^m = X_l^{m*})$ is the joint conditional cumulative distribution of $X_i^h$ and $X_j^k$ given $X_l^m$, and $X_i^{h*}$, $X_j^{k*}$ and $X_l^{m*}$ are realization events of the corresponding random variables.

In the case of MRFs (and by extension TMFs), CI is defined by means of (multi)graph separation: in this sense we say that $X_i^h \perp\!\!\!\perp_{\hat{G}} X_j^k | X_l^m$ iff $X_l^m$ separates $X_i^h$ from $X_j^k$ in the multilayer graph $\hat{G}$. This means that if we remove node $X_l^m$ there are no undirected paths from $X_i^h$ to $X_j^k$ in $\hat{G}$.

Conditional independence in random fields is often considered in terms of subsets of $V$. Let $A$, $B$ and $C$ be three subsets of $V$. The statement $X_A \perp\!\!\!\perp_{\hat{G}} X_B | X_C$, which holds iff $C$ separates $A$ from $B$ in the multilayer graph $\hat{G}$ (meaning that if we remove all vertices in $C$ there will be no paths connecting any vertex in $A$ to any vertex in $B$), is called the *global Markov property* of TMFs.

The smallest set of vertices that renders a vertex $X_i^h$ conditionally independent of all other vertices in the multilayer graph is called its *Markov blanket*, denoted $mb(X_i^h)$. If we define the *closure* of a node $X_i^h$ as $\mathcal{C}(X_i^h) = \{X_i^h\} \cup mb(X_i^h)$, then $X_i^h \perp\!\!\!\perp \hat{G} \backslash \mathcal{C}(X_i^h) \,|\, mb(X_i^h)$.

It is possible to show that in a TMF the Markov blanket of a vertex is its set of first neighbors. This is called the *undirected local Markov property*. Starting from the local Markov property, it is possible to show that two vertices $X_i^h$ and $X_j^k$ are conditionally independent given the rest if there is no direct edge between them. This has been called the *pairwise Markov property*.

If we denote by $\hat{G}_{X_i^h \to X_j^k}$ the set of undirected paths in the multilayer graph $\hat{G}$ connecting vertices $X_i^h$ and $X_j^k$, then the pairwise Markov property of a TMF can be stated as:

$$X_i^h \perp\!\!\!\perp X_j^k \,|\, \hat{G} \backslash \{X_i^h, X_j^k\} \quad \text{iff} \quad \hat{G}_{X_i^h \to X_j^k} = \varnothing \tag{37}$$

It is clear that the global Markov property implies the local Markov property, which in turn implies the pairwise Markov property. For systems with strictly positive probability densities, it has been proved (in the case of MRFs) that pairwise Markov actually implies global Markov (see [6], p. 119 for a proof). For the present extension this is important, since it is easier to assess pairwise conditional independence statements.
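Graph-separation statements such as these can be checked mechanically. The following sketch (an illustration; the node labels and graph are invented) tests whether a set $C$ separates two multilayer nodes, encoding each node as a (variable, layer) pair:

```python
from collections import deque

def separated(adj, a, b, C):
    """True iff the vertex set C separates a from b in the undirected graph
    given as an adjacency dict {node: set of neighbors} (graph-separation CI)."""
    blocked = set(C)
    if a in blocked or b in blocked:
        raise ValueError("a and b must not belong to the separating set")
    seen, queue = {a}, deque([a])
    while queue:                      # breadth-first search avoiding C
        u = queue.popleft()
        if u == b:
            return False              # found an undirected path around C
        for v in adj[u]:
            if v not in seen and v not in blocked:
                seen.add(v)
                queue.append(v)
    return True                       # no path: a and b are graph-separated

# Toy multilayer chain: (1,'h') - (2,'h') - (3,'k') - (4,'k')
adj = {
    (1, 'h'): {(2, 'h')},
    (2, 'h'): {(1, 'h'), (3, 'k')},
    (3, 'k'): {(2, 'h'), (4, 'k')},
    (4, 'k'): {(3, 'k')},
}
print(separated(adj, (1, 'h'), (4, 'k'), [(2, 'h')]))  # True: CI given (2,'h')
print(separated(adj, (1, 'h'), (4, 'k'), []))          # False: still connected
```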

#### *3.2. Independence Maps*

Let $I_{\hat{G}}$ denote the set of all conditional independence relations encoded by the multilayer graph $\hat{G}$ (i.e., those CI relations given by the global Markov property). Let $I_{\mathbb{P}}$ be the set of all CI relations implied by the probability distribution $\mathbb{P}(X_i^j)$. A multilayer graph $\hat{G}$ will be called an *independence map* (*I-map*) for a probability distribution $\mathbb{P}(X_i^j)$ if all CI relations implied by $\hat{G}$ hold for $\mathbb{P}(X_i^j)$, i.e., $I_{\hat{G}} \subseteq I_{\mathbb{P}}$ [6].

The converse statement is not necessarily true, i.e., there may be some CI relations implied by $\mathbb{P}(X_i^j)$ that are not encoded in the multilayer graph $\hat{G}$. We are usually interested in *minimal I-maps*, i.e., I-maps from which no edge could be removed without destroying the I-map property.

Every distribution has a unique minimal I-map (for a given graph representation). Let $\mathbb{P}(X_i^j) > 0$, and let $\hat{G}^\dagger$ be the multilayer graph obtained by introducing edges between all pairs of vertices $X_i^h$, $X_j^k$ such that $X_i^h \not\perp\!\!\!\perp X_j^k \,|\, X \backslash \{X_i^h, X_j^k\}$; then $\hat{G}^\dagger$ is the unique minimal I-map. We call $\hat{G}$ a *perfect map* of $\mathbb{P}$ when there are no dependencies in $\hat{G}$ which are not indicated by $\mathbb{P}$, i.e., $I_{\hat{G}} = I_{\mathbb{P}}$ [6].

#### *3.3. Conditional Independence Tests*

Conditional independence tests are useful to evaluate whether CI conditions apply, either exactly or, in the case of applications, under a certain bounded error. In order to write down expressions for CI tests, let us introduce the following *conditional kernels* [7]:

$$\mathbb{C}\_A(B) = \mathbb{P}(B|A) = \frac{\mathbb{P}(AB)}{\mathbb{P}(A)}\tag{38}$$

as well as their generalized recursive relations:

$$\mathbb{C}_{ABC}(D) = \mathbb{C}_{AB}(D|C) = \frac{\mathbb{C}_{AB}(CD)}{\mathbb{C}_{AB}(C)} \tag{39}$$

The conditional probability of $X_h^k$ given $X_i^j$ can thus be written as:

$$\mathbb{C}_{X_i^j}(X_h^k) = \mathbb{P}(X_h^k | X_i^j) = \frac{\mathbb{P}(X_h^k, X_i^j)}{\mathbb{P}(X_i^j)} \tag{40}$$

We can then write down expressions for Markov conditional independence as follows:

$$X_i^j \perp\!\!\!\perp X_h^k | X_l^m \Rightarrow \mathbb{P}(X_i^j, X_h^k | X_l^m) = \mathbb{P}(X_i^j | X_l^m) \times \mathbb{P}(X_h^k | X_l^m) \tag{41}$$

Following Bayes' theorem, CI conditions will, in this case, be of the form:

$$\mathbb{P}(X_i^j, X_h^k | X_l^m) = \frac{\mathbb{P}(X_i^j, X_l^m)}{\mathbb{P}(X_l^m)} \times \frac{\mathbb{P}(X_h^k, X_l^m)}{\mathbb{P}(X_l^m)} = \frac{\mathbb{P}(X_i^j, X_l^m) \times \mathbb{P}(X_h^k, X_l^m)}{\mathbb{P}(X_l^m)^2} \tag{42}$$

Equation (42) is useful since, in large-scale data applications, it is computationally cheaper to work with joint and marginal probabilities than with conditionals.
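An empirical version of Equation (42) can be sketched as follows (an illustration, not from the paper; the plug-in frequency estimates are assumptions). Multiplying both sides by $\mathbb{P}(X_l^m)^2$, the CI condition becomes $\mathbb{P}(X_i^j, X_h^k, X_l^m)\,\mathbb{P}(X_l^m) = \mathbb{P}(X_i^j, X_l^m)\,\mathbb{P}(X_h^k, X_l^m)$, which uses only joints and marginals:

```python
import numpy as np

def ci_gap(xi, xh, xl):
    """Maximum empirical violation of the CI condition of Eq. (42), i.e.,
    max over observed states of |P(xi,xh,xl)*P(xl) - P(xi,xl)*P(xh,xl)|,
    with all probabilities estimated by plug-in relative frequencies."""
    gap = 0.0
    for (a, b, c) in set(zip(xi, xh, xl)):
        p_abc = np.mean((xi == a) & (xh == b) & (xl == c))
        p_c = np.mean(xl == c)
        p_ac = np.mean((xi == a) & (xl == c))
        p_bc = np.mean((xh == b) & (xl == c))
        gap = max(gap, abs(p_abc * p_c - p_ac * p_bc))
    return gap
```

A gap of (numerically) zero is consistent with conditional independence; a large gap rejects it, and in practice the decision can be tied to the same threshold logic as $I_0$.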

Now let us consider the case of conditional independence given several conditioning variables. The case of CI given two variables can be written, using conditional kernels, as follows:

$$X_i^j \perp\!\!\!\perp X_h^k | X_l^m, X_n^o \Rightarrow \mathbb{P}(X_i^j, X_h^k | X_l^m, X_n^o) = \mathbb{P}(X_i^j | X_l^m, X_n^o) \times \mathbb{P}(X_h^k | X_l^m, X_n^o) \tag{43}$$

Hence,

$$\mathbb{P}(X_i^j, X_h^k | X_l^m, X_n^o) = \mathbb{C}_{X_l^m, X_n^o}(X_i^j) \times \mathbb{C}_{X_l^m, X_n^o}(X_h^k) \tag{44}$$

Using Bayes' theorem,

$$\mathbb{P}(X_i^j, X_h^k | X_l^m, X_n^o) = \frac{\mathbb{P}(X_i^j, X_l^m, X_n^o)}{\mathbb{P}(X_l^m, X_n^o)} \times \frac{\mathbb{P}(X_h^k, X_l^m, X_n^o)}{\mathbb{P}(X_l^m, X_n^o)} \tag{45}$$

$$\mathbb{P}(X_i^j, X_h^k | X_l^m, X_n^o) = \frac{\mathbb{P}(X_i^j, X_l^m, X_n^o) \times \mathbb{P}(X_h^k, X_l^m, X_n^o)}{\mathbb{P}(X_l^m, X_n^o)^2} \tag{46}$$

In order to generalize the previous results to CI relations given an arbitrary set of conditionals, let us consider the following *sigma-algebraic* approach:

Let $\Sigma_{ih}^{jk}$ be the $\sigma$-algebra of all subsets of $X$ that do not contain $X_i^j$ or $X_h^k$. If we consider the contravariant index $i \in \alpha$ with $i = 1, 2, \ldots, N$ and the covariant index $j \in \beta$ with $j = 1, 2, \ldots, L$, then there are $M = \binom{NL}{2}$ such $\sigma$-algebras in $X$ (let us recall that TMFs are *undirected* graphical models).

A relevant problem for network reconstruction is that of establishing the more general Markov pairwise CI conditions, i.e., the CI relations for every edge not drawn in the graph. Two arbitrary nodes $X_i^j$ and $X_h^k$ are conditionally independent given the rest of the graph iff:

$$X_i^j \perp\!\!\!\perp X_h^k \,|\, \Sigma_{ih}^{jk} \Rightarrow \mathbb{P}(X_i^j, X_h^k | \Sigma_{ih}^{jk}) = \mathbb{P}(X_i^j | \Sigma_{ih}^{jk}) \times \mathbb{P}(X_h^k | \Sigma_{ih}^{jk}) \tag{47}$$

By using conditional kernels, the recursive relations and Bayes' theorem it is possible to write down M expressions of the form:

$$\mathbb{P}(X_i^j, X_h^k | \Sigma_{ih}^{jk}) = \frac{\mathbb{P}(X_i^j, \Sigma_{ih}^{jk}) \times \mathbb{P}(X_h^k, \Sigma_{ih}^{jk})}{\mathbb{P}(\Sigma_{ih}^{jk})^2} \tag{48}$$

The family of Equations (48) represents the CI relations for all the non-existing edges in the multilayer graph $\hat{G}$, i.e., every pair of nodes $X_i^j$ and $X_h^k$ not connected in $\hat{G}$ must be conditionally independent given the rest of the nodes in the graph. These expressions may serve to implement exact tests or optimization strategies for *graph reconstruction* and/or *graph sparsification* in applications considering a mutual information threshold $I_0$ as in Equation (32).

In brief, for every node pair with a mutual information value less than $I_0$, the presented graph reconstruction approach will not draw an edge, hence implying CI between the two nodes given the rest. Such a CI condition may be tested on the data to see whether it holds, or the threshold itself can be determined by resorting to optimization schemes (e.g., error bounds) on Equation (48).

#### **4. Graph Theoretical Features and Multilinear Structure**

Once the probabilistic properties of TMFs have been set, it may be fitting to briefly present some of their graph-theoretical features, as well as some preliminaries as to the reasons for embedding hyperadjacency matrices into multilayer adjacency tensors. Given that TMFs are indeed PGMs, some of their graph characteristics will be relevant here.

Since the work by De Domenico and coworkers [5] covers in great detail how the multilinear structure of the multilayer adjacency tensor allows the calculation of these quantities, usually as projection operations, we will only mention connectivity degree vectors, since these are related to the size of the TMF dependency neighborhoods.

Let us recall the multilayer adjacency tensors, as defined in Equations (34) and (35). To ease presentation, we will work with the unweighted tensor $\mathbb{A}^{\alpha\gamma}_{\beta\delta}$ (Equation (34)). The *multidegree centrality vector* $K^\alpha$, which contains the connectivity degrees of the nodes spanning the different layers, can be written as follows:

$$K^\alpha = \mathbb{A}^{\alpha\gamma}_{\beta\delta}\, U_\gamma^\delta\, u^\beta \tag{49}$$

Here $U_\gamma^\delta$ is a rank-2 tensor that contains a 1 in every component and $u^\beta$ is a rank-1 tensor that contains a 1 in every component; these quantities are called 1-tensors by De Domenico and coworkers [5]. It can be shown that $K^\alpha$ is indeed given by the sum of the *connectivity degree vectors* $k^\alpha$ corresponding to all the different layers:

$$K^\alpha = \sum_{h=1}^{L} \sum_{k=1}^{L} k^\alpha(hk) \tag{50}$$

$k^\alpha(hk)$ is the vector of connections that the nodes $\alpha = 1, 2, \ldots, N$ in layer $h$ have to any other nodes in layer $k$, whereas $K^\alpha$ is the vector of connections in all the layers. Appropriate projections will yield measures such as the size of the neighborhood of a given vertex, $|N_{X_i^j}|$, the size of its Markov blanket, $|mb(X_i^h)|$, or other similar quantities.
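As a minimal numerical illustration (not from the paper) of the contraction in Equation (49), assuming a dense `numpy` array for the adjacency 4-tensor with index order `[h, k, i, j]`, the all-ones tensors reduce to `np.ones` arrays and the contraction to a single `einsum` call:

```python
import numpy as np

# Hypothetical adjacency 4-tensor A[h, k, i, j]: node i in layer h is
# connected to node j in layer k (2 layers, 3 nodes).
L, N = 2, 3
A = np.zeros((L, L, N, N))
A[0, 0, 0, 1] = A[0, 0, 1, 0] = 1.0   # intra-layer edge 0-1 in layer 0
A[0, 1, 2, 2] = A[1, 0, 2, 2] = 1.0   # inter-layer edge for node 2

# Eq. (49): K^alpha = A^{alpha gamma}_{beta delta} U^delta_gamma u^beta,
# i.e., contract away the target-node index and both layer indices.
K = np.einsum('hkij,hk,j->i', A, np.ones((L, L)), np.ones(N))
print(K)   # node degrees summed across all layers: [1. 1. 2.]

# Eq. (50): the same vector as a sum of per-layer-pair degree vectors.
K50 = sum(A[h, k].sum(axis=1) for h in range(L) for k in range(L))
assert np.allclose(K, K50)
```

The final assertion checks the consistency of Equations (49) and (50) on this toy tensor: contracting with 1-tensors and summing per-layer degree vectors give the same multidegree centrality vector.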

#### **5. Specific Applications**

After having considered some of the properties of this class of Tensor Markov Fields, it may become evident that, aside from their purely theoretical importance, there are a number of important applications that may arise as probabilistic graphical models in tensor-valued problems; among the somewhat evident ones are the following:


Some of these problems are indeed being treated as multiple instances of Markov fields, or as multipartite graphs or hypergraphs. However, it may become evident that when random variables *across layers* are interdependent (which is often the case), the definitions of potentials, cliques and partition functions, as well as the conditional statistical independence features, become manageable (and in some cases even meaningful) under the presented formalism of Tensor Markov Fields.

**Figure 4.** Gene and microRNA regulatory network: a Tensor Markov Field depicting the statistical dependence of genome-wide gene and microRNA (miR) expression on a human phenotype. Edge width is given by the mutual information $I^\dagger(X_i^j, X_h^k)$ between expression levels of genes (layer $j$) and miRs (layer $k$) in a very large corpus of RNASeq samples; vertex size is proportional to the *degree*, i.e., the size of the node's neighborhood $N_{X_i^j}$.

#### **6. Conclusions**

Here we have presented the definitions and fundamental properties of Tensor Markov Fields, i.e., random Markov fields over tensor spaces. We have proved, by extending the results of Dobruschin, Hammersley and Clifford to such tensor-valued fields, that Tensor Markov Fields are indeed Gibbs fields whenever strictly positive probability measures are considered. We also introduced a class of Tensor Markov Fields obtained by using information-theoretical statistical dependence measures inducing local and global Markov properties, and showed how these can be used as probabilistic graphical models in multi-context environments, much in the spirit of the so-called multilayer network approach. Finally, we discussed the convenience of embedding Tensor Markov Fields in the multilinear tensor representation of multilayer networks.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Objective Bayesian Inference in Probit Models with Intrinsic Priors Using Variational Approximations**

**Ang Li †,‡, Luis Pericchi \*,†,‡ and Kun Wang †,‡**

Río Piedras Campus, University of Puerto Rico, 00925 San Juan, Puerto Rico; ang.li@upr.edu (A.L.); kun.wang@upr.edu (K.W.)


‡ These authors contributed equally to this work.

Received: 3 March 2020; Accepted: 26 April 2020; Published: 30 April 2020

**Abstract:** There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. On the other hand, variational inference methods have been employed to solve classification problems using probit regression and logistic regression with normal priors. In this article, we propose to apply the variational approximation on probit regression models with intrinsic prior. We review the mean-field variational method and the procedure of developing intrinsic prior for the probit regression model. We then present our work on implementing the variational Bayesian probit regression model using intrinsic prior. Publicly available data from the world's largest peer-to-peer lending platform, LendingClub, will be used to illustrate how model output uncertainties are addressed through the framework we proposed. With LendingClub data, the target variable is the final status of a loan, either charged-off or fully paid. Investors may very well be interested in how predictive features like FICO, amount financed, income, etc. may affect the final loan status.

**Keywords:** objective Bayesian inference; intrinsic prior; variational inference; binary probit regression; mean-field approximation

#### **1. Introduction**

There is not much literature on objective Bayesian analysis for binary classification problems, especially for intrinsic prior related methods. So far, only two articles have explored intrinsic prior related methods for classification problems. Reference [1] implements integral priors in generalized linear models with various link functions. In addition, reference [2] considers intrinsic priors for probit models. On the other hand, variational inference methods have been employed to solve classification problems with logistic regression ([3]) and probit regression ([4,5]) with normal priors. Variational approximation methods have been reviewed in [6,7], and more recently in [8].

In this article, we propose to apply variational approximations to probit regression models with intrinsic priors. In Section 3, procedures for developing intrinsic priors for probit models are introduced, following [2]. In Section 4, we review the mean-field variational method that will be used in this article. Our work is presented in Section 5. Our motivations for combining intrinsic prior methodology and variational inference are as follows.


problems. In fact, some recently developed priors that were proposed to solve inference or estimation problems turned out to also be intrinsic priors, for example, the Scaled Beta2 prior [9] and the Matrix-*F* prior [10].


As for model comparison, due to the fact that the output of variational inference methods cannot be employed directly to compare models, we propose in Section 5.3 to simply make use of the variational approximation of the posterior distribution as an importance function and get the Monte Carlo estimated marginal likelihood by importance sampling for model comparison.
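The importance-sampling estimator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and a toy conjugate normal model (where the exact posterior doubles as the importance function) stands in for the variational approximation.

```python
import numpy as np

def marginal_likelihood_is(log_joint, q_sample, q_logpdf, n_draws=5000, seed=0):
    """Estimate log p(x) by importance sampling, using an approximation q
    of the posterior as importance function: p(x) = E_q[p(x, theta)/q(theta)]."""
    rng = np.random.default_rng(seed)
    theta = q_sample(rng, n_draws)
    log_w = np.array([log_joint(t) - q_logpdf(t) for t in theta])
    m = log_w.max()                      # log-sum-exp for numerical stability
    return m + np.log(np.exp(log_w - m).mean())

# Toy check: x_i ~ N(theta, 1) with theta ~ N(0, 1).  The exact posterior
# N(mu, s2) is available here and plays the role of the variational q.
rng = np.random.default_rng(1)
x = rng.normal(0.3, 1.0, size=20)
n = len(x)

def log_joint(theta):
    return (-0.5 * np.sum((x - theta) ** 2) - 0.5 * n * np.log(2 * np.pi)
            - 0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi))

s2 = 1.0 / (n + 1.0)                     # posterior variance
mu = s2 * x.sum()                        # posterior mean
log_px = marginal_likelihood_is(
    log_joint,
    lambda r, k: r.normal(mu, np.sqrt(s2), size=k),
    lambda t: -0.5 * (t - mu) ** 2 / s2 - 0.5 * np.log(2 * np.pi * s2),
)
```

Because the proposal here equals the true posterior, the importance weights are constant and the estimator recovers the marginal likelihood exactly; with a variational approximation as proposal, the weights vary and the estimate carries Monte Carlo error.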

#### **2. Background and Development of Intrinsic Prior Methodology**

#### *2.1. Bayes Factor*

The Bayesian framework of model selection coherently involves the use of probability to express all uncertainty in the choice of model, including uncertainty about the unknown parameters of a model. Suppose that models *M*1, *M*2, ..., *Mq* are under consideration. We shall assume that the observed data **x** = (*x*1, *x*2, ..., *xn*) is generated from one of these models but we do not know which one it is. We express our uncertainty through prior probabilities *P*(*Mj*), *j* = 1, 2, ..., *q*. Under model *Mi*, **x** has density $f_i(\mathbf{x}|\theta_i, M_i)$, where $\theta_i$ are unknown model parameters, and the prior distribution for $\theta_i$ is $\pi_i(\theta_i|M_i)$. Given observed data and prior probabilities, we can then evaluate the posterior probability of *Mi* using Bayes' rule

$$P(M\_i|\mathbf{x}) = \frac{p\_i(\mathbf{x}|M\_i)P(M\_i)}{\sum\_{j=1}^{q} p\_j(\mathbf{x}|M\_j)P(M\_j)},\tag{1}$$

where

$$p\_i(\mathbf{x}|M\_i) = \int f\_i(\mathbf{x}|\boldsymbol{\theta}\_i, M\_i) \, \pi\_i(\boldsymbol{\theta}\_i|M\_i) d\boldsymbol{\theta}\_i \tag{2}$$

is the marginal likelihood of **x** under *Mi*, also called the evidence for *Mi* [12]. A common choice of prior model probabilities is $P(M_j) = 1/q$, so that each model has the same initial probability. However, there are other alternatives for assigning probabilities that correct for multiple comparison (see [13]). From (1), the posterior odds are therefore the prior odds multiplied by the Bayes factor

$$\frac{P(M_j|\mathbf{x})}{P(M_i|\mathbf{x})} = \frac{P(M_j)\,p_j(\mathbf{x})}{P(M_i)\,p_i(\mathbf{x})} = \frac{P(M_j)}{P(M_i)} \times B_{ji},\tag{3}$$

where the Bayes factor of *Mj* to *Mi* is defined by

$$B\_{ji} = \frac{p\_j(\mathbf{x})}{p\_i(\mathbf{x})} = \frac{\int f\_j(\mathbf{x}|\boldsymbol{\theta}\_j) \, \boldsymbol{\pi}\_j(\boldsymbol{\theta}\_j) d\boldsymbol{\theta}\_j}{\int f\_i(\mathbf{x}|\boldsymbol{\theta}\_i) \, \boldsymbol{\pi}\_i(\boldsymbol{\theta}\_i) d\boldsymbol{\theta}\_i}. \tag{4}$$

Here we omit the dependence on the models *Mj*, *Mi* to keep the notation simple. The marginal likelihood *pi*(**x**) expresses the preference shown by the observed data for the different models. When *Bji* > 1, the data favor *Mj* over *Mi*, and when *Bji* < 1 the data favor *Mi* over *Mj*. A scale for the interpretation of *Bji* is given by [14].
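Equations (1), (3), and (4) translate directly into code once the (log) marginal likelihoods are available. A small sketch, working in log space to avoid underflow; the function names are ours, not from the paper:

```python
import numpy as np

def posterior_model_probs(log_marginals, prior_probs=None):
    """Posterior model probabilities P(M_i | x) via Eq. (1).
    `log_marginals[i]` is log p_i(x); equal prior probabilities by default."""
    log_m = np.asarray(log_marginals, dtype=float)
    prior = (np.full(log_m.size, 1.0 / log_m.size)
             if prior_probs is None else np.asarray(prior_probs, dtype=float))
    log_post = log_m + np.log(prior)
    log_post -= log_post.max()          # stabilise before exponentiating
    w = np.exp(log_post)
    return w / w.sum()

def bayes_factor(log_mj, log_mi):
    """B_ji = p_j(x) / p_i(x), Eq. (4), from log marginal likelihoods."""
    return np.exp(log_mj - log_mi)
```

For instance, two models with log marginal likelihoods −10 and −12 under equal priors give a Bayes factor of e² ≈ 7.4 in favor of the first, and posterior probability 1/(1 + e⁻²) ≈ 0.88 for it.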

#### *2.2. Motivation and Development of Intrinsic Prior*

Computing *Bji* requires specification of $\pi_i(\theta_i)$ and $\pi_j(\theta_j)$. Often in Bayesian analysis, when prior information is weak, one can use non-informative (or default) priors $\pi_i^N(\theta_i)$. Common choices for non-informative priors are the uniform prior, $\pi_i^U(\theta_i) \propto 1$, and the Jeffreys prior, $\pi_i^J(\theta_i) \propto \det(\mathbf{I}_i(\theta_i))^{1/2}$, where $\mathbf{I}_i(\theta_i)$ is the expected Fisher information matrix corresponding to *Mi*.

Using any of the $\pi_i^N$ in (4) would yield

$$B\_{ji}^{N} = \frac{p\_j^{N}(\mathbf{x})}{p\_i^{N}(\mathbf{x})} = \frac{\int f\_j(\mathbf{x}|\boldsymbol{\theta}\_j)\pi\_j^{N}(\boldsymbol{\theta}\_j)d\boldsymbol{\theta}\_j}{\int f\_i(\mathbf{x}|\boldsymbol{\theta}\_i)\pi\_i^{N}(\boldsymbol{\theta}\_i)d\boldsymbol{\theta}\_i}.\tag{5}$$

The difficulty with (5) is that the $\pi_i^N$ are typically improper and hence are defined only up to an unspecified constant $c_i$. Thus $B_{ji}^N$ is defined only up to the ratio $c_j/c_i$ of two unspecified constants.

An attempt to circumvent the ill definition of Bayes factors under improper non-informative priors is the intrinsic Bayes factor introduced by [15], which is a modification of a partial Bayes factor [16]. To define the intrinsic Bayes factor, we consider the set of subsamples **x**(*l*) of the data **x** of minimal size *l* such that $0 < p_i^N(\mathbf{x}(l)) < \infty$. These subsamples are called training samples (not to be confused with training samples in machine learning), and there is a total of *L* such subsamples.

The main idea here is that the training sample **x**(*l*) is used to convert the improper $\pi_i^N(\theta_i)$ into the proper posterior

$$
\pi\_i^N(\boldsymbol{\theta}\_i|\mathbf{x}(l)) = \frac{f\_i(\mathbf{x}(l)|\boldsymbol{\theta}\_i)\pi\_i^N(\boldsymbol{\theta}\_i)}{p\_i^N(\mathbf{x}(l))}\tag{6}
$$

where $p_i^N(\mathbf{x}(l)) = \int f_i(\mathbf{x}(l)|\theta_i)\,\pi_i^N(\theta_i)\,d\theta_i$. Then, the Bayes factor for the remainder of the data $\mathbf{x}(n-l)$, where $\mathbf{x}(l) \cup \mathbf{x}(n-l) = \mathbf{x}$, using $\pi_i^N(\theta_i|\mathbf{x}(l))$ as prior is called a "partial" Bayes factor,

$$B_{ji}^{N}(\mathbf{x}(n-l)|\mathbf{x}(l)) = \frac{\int f_j(\mathbf{x}(n-l)|\boldsymbol{\theta}_j)\,\pi_j^N(\boldsymbol{\theta}_j|\mathbf{x}(l))\,d\boldsymbol{\theta}_j}{\int f_i(\mathbf{x}(n-l)|\boldsymbol{\theta}_i)\,\pi_i^N(\boldsymbol{\theta}_i|\mathbf{x}(l))\,d\boldsymbol{\theta}_i}.\tag{7}$$

This partial Bayes factor is a well-defined Bayes factor, and can be written as $B_{ji}^N(\mathbf{x}(n-l)|\mathbf{x}(l)) = B_{ji}^N(\mathbf{x})\,B_{ij}^N(\mathbf{x}(l))$, where $B_{ji}^N(\mathbf{x}) = p_j^N(\mathbf{x})/p_i^N(\mathbf{x})$ and $B_{ij}^N(\mathbf{x}(l)) = p_i^N(\mathbf{x}(l))/p_j^N(\mathbf{x}(l))$. Clearly, $B_{ji}^N(\mathbf{x}(n-l)|\mathbf{x}(l))$ will depend on the choice of the training sample $\mathbf{x}(l)$. To eliminate this arbitrariness and increase stability, reference [15] suggests averaging over all training samples, which yields the arithmetic intrinsic Bayes factor (AIBF)

$$B_{ji}^{AIBF}(\mathbf{x}) = B_{ji}^{N}(\mathbf{x})\,\frac{1}{L}\sum_{l=1}^{L} B_{ij}^{N}(\mathbf{x}(l)).\tag{8}$$

The strongest justification of the arithmetic IBF is its asymptotic equivalence with a proper Bayes factor arising from *Intrinsic priors*. These intrinsic priors were identified through an asymptotic analysis (see [15]). For the case where *Mi* is nested in *Mj*, it can be shown that the intrinsic priors are given by

$$
\pi\_i^I(\boldsymbol{\theta}\_i) = \pi\_i^N(\boldsymbol{\theta}\_i) \text{ and } \pi\_j^I(\boldsymbol{\theta}\_j) = \pi\_j^N(\boldsymbol{\theta}\_j) E\_{M\_j} \left[ \frac{m\_i^N(\mathbf{x}(l))}{m\_j^N(\mathbf{x}(l))} | \boldsymbol{\theta}\_j \right]. \tag{9}
$$
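The AIBF in Eq. (8) is a product of a full-data Bayes factor and an average of training-sample Bayes factors; when these quantities span many orders of magnitude, the average is best formed in log space. A minimal sketch (the helper name `aibf` is ours):

```python
import numpy as np

def aibf(log_Bji_full, log_Bij_train):
    """Arithmetic intrinsic Bayes factor, Eq. (8):
        B^AIBF_ji = B^N_ji(x) * (1/L) * sum_l B^N_ij(x(l)).
    Inputs are log Bayes factors; the average over the L training
    samples is computed with a log-sum-exp for numerical stability."""
    log_b = np.asarray(log_Bij_train, dtype=float)
    m = log_b.max()
    log_avg = m + np.log(np.mean(np.exp(log_b - m)))
    return np.exp(log_Bji_full + log_avg)
```

For example, with $B^N_{ji}(\mathbf{x}) = 10$ and training-sample factors 1, 2, 3, the average is 2 and the AIBF is 20.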

#### **3. Objective Bayesian Probit Regression Models**

#### *3.1. Bayesian Probit Model and the Use of Auxiliary Variables*

Consider a sample **y** = (*y*1, ..., *yn*), where *Yi*, *i* = 1, ..., *n*, is a 0–1 random variable that, under model *Mj*, follows a probit regression model with a (*j* + 1)-dimensional vector of covariates **x***i*, where *j* ≤ *p*. Here, *p* is the total number of covariate variables under our consideration. In addition, this probit model *Mj* has the form

$$Y_i|\beta_0, \dots, \beta_j, M_j \sim \text{Bernoulli}(\Phi(\beta_0 x_{0i} + \beta_1 x_{1i} + \dots + \beta_j x_{ji})), \qquad 1 \le i \le n,\tag{10}$$

where Φ denotes the standard normal cumulative distribution function and $\boldsymbol{\beta}_j = (\beta_0, \dots, \beta_j)'$ is a vector of dimension *j* + 1. The first component of the vector **x***i* is set equal to 1, so that when considering models of the form (10) the intercept is included in every submodel. The maximum length of the vector of covariates is *p* + 1. Let *π*(*β*), proper or improper, summarize our prior information about *β*. Then the posterior density of *β* is given by

$$\pi(\boldsymbol{\beta}|\mathbf{y}) = \frac{\pi(\boldsymbol{\beta})\prod_{i=1}^{n}\Phi(\mathbf{x}_i'\boldsymbol{\beta})^{y_i}\,(1-\Phi(\mathbf{x}_i'\boldsymbol{\beta}))^{1-y_i}}{\int \pi(\boldsymbol{\beta})\prod_{i=1}^{n}\Phi(\mathbf{x}_i'\boldsymbol{\beta})^{y_i}\,(1-\Phi(\mathbf{x}_i'\boldsymbol{\beta}))^{1-y_i}\,d\boldsymbol{\beta}},$$

which is largely intractable.

As shown by [17], the Bayesian probit regression model becomes tractable when a particular set of auxiliary variables is introduced. Following the data augmentation approach [18], we introduce *n* latent variables *Z*1, ..., *Zn*, where

$$Z_i|\boldsymbol{\beta} \sim N(\mathbf{x}_i'\boldsymbol{\beta},\ 1).$$

The probit model (10) can be thought of as a regression model with incomplete sampling information by considering that only the sign of *zi* is observed. More specifically, define *Yi* = 1 if *Zi* > 0 and *Yi* = 0 otherwise. This allows us to write the probability density of *yi* given *zi*

$$p(y\_i|z\_i) = \mathbb{I}(z\_i > 0)\mathbb{I}(y\_i = 1) + \mathbb{I}(z\_i \le 0)\mathbb{I}(y\_i = 0).$$

Expansion of the parameter set from {*β*} to {*β*, **Z**} is the key to achieving a tractable solution for variational approximation.
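The latent-variable representation above is easy to verify by simulation: drawing the latent regression and keeping only the signs reproduces the Bernoulli–probit model (10). A small sketch with hypothetical coefficient values:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(7)

# Simulate from the latent-variable form of the probit model:
#   Z_i | beta ~ N(x_i' beta, 1)  and  Y_i = 1 exactly when Z_i > 0.
n = 50_000
beta = np.array([0.5, -1.0])                 # hypothetical coefficients
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z = X @ beta + rng.normal(size=n)            # latent regression draws
y = (z > 0).astype(int)                      # only the sign is observed

# Marginally, P(Y_i = 1 | x_i) = Phi(x_i' beta), recovering model (10).
Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))
probs = Phi(X @ beta)
```

The empirical frequency of `y = 1` matches the average of `Phi(X @ beta)` up to Monte Carlo error, confirming that marginalizing the latent *Z* yields the probit likelihood.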

*3.2. Development of Intrinsic Prior for Probit Models*

For the sample **z** = (*z*1, ..., *zn*) , the null normal model is

$$\mathcal{M}_1 : \{\mathcal{N}_n(\mathbf{z}|\alpha\mathbf{1}_n, \mathbf{I}_n),\ \pi(\alpha)\}.$$

For a generic model *Mj* with *j* + 1 regressors, the alternative model is

$$\mathcal{M}_j : \{\mathcal{N}_n(\mathbf{z}|\mathbf{X}_j\boldsymbol{\beta}_j, \mathbf{I}_n),\ \pi(\boldsymbol{\beta}_j)\},$$

where the design matrix **X***<sup>j</sup>* has dimensions *n* × (*j* + 1). Intrinsic prior methodology for the linear model was first developed by [19], and was further developed in [20] by using the methods of [21]. This intrinsic methodology gives us an automatic specification of the priors *π*(*α*) and *π*(*β*), starting with the non-informative priors *πN*(*α*) and *πN*(*β*) for *α* and *β*, which are both improper and proportional to 1.

The marginal distributions for the sample **z** under the null model, and under the alternative model with intrinsic prior, are formally written as

$$p_1(\mathbf{z}) = \int \mathcal{N}_n(\mathbf{z}|\alpha\mathbf{1}_n, \mathbf{I}_n)\,\pi^N(\alpha)\,d\alpha,$$

$$p_j(\mathbf{z}) = \int\!\!\int \mathcal{N}_n(\mathbf{z}|\mathbf{X}_j\boldsymbol{\beta}_j, \mathbf{I}_n)\,\pi^I(\boldsymbol{\beta}_j|\alpha)\,\pi^N(\alpha)\,d\alpha\,d\boldsymbol{\beta}_j.\tag{11}$$

However, these are marginals of the sample **z**, but our selection procedure requires us to compute the Bayes factor of model *Mj* versus the reference model *M***<sup>1</sup>** for the sample **y** = (*y*1, ..., *yn*). To solve this problem, reference [2] proposed to transform the marginal *pj*(**z**) into the marginal *pj*(**y**) by using the probit transformations *yi* = 1(*zi* > 0), *i* = 1, ..., *n*. These latter marginals are given by

$$p_j(\mathbf{y}) = \int_{A_1 \times \dots \times A_n} p_j(\mathbf{z})\,d\mathbf{z},\tag{12}$$

where

$$A_i = \begin{cases} (0, \infty) & \text{if } y_i = 1, \\ (-\infty, 0) & \text{if } y_i = 0. \end{cases}\tag{13}$$

#### **4. Variational Inference**

#### *4.1. Overview of Variational Methods*

Variational methods have their origins in the 18th century, with the work of Euler, Lagrange, and others on the calculus of variations. (The derivations in this section are standard in the literature on variational approximation and at times follow the arguments in [22,23].) Variational inference is a body of deterministic techniques for making approximate inference for parameters in complex statistical models. Variational approximations are a much faster alternative to Markov Chain Monte Carlo (MCMC), especially for large models, and form a richer class of methods than the Laplace approximation [6].

Suppose we have a Bayesian model with a prior distribution for the parameters. The model may also have latent variables; here we shall denote the set of all latent variables and parameters by *θ*, and the set of all observed variables by **X**. Given a set of *n* independent, identically distributed data points, for which **X** = {**x**1, ..., **x***n*} and *θ* = {*θ*1, ..., *θn*}, our probabilistic model (e.g., the probit regression model) specifies the joint distribution *p*(**X**, *θ*), and our goal is to find an approximation for the posterior distribution *p*(*θ*|**X**) as well as for the marginal likelihood *p*(**X**). For any probability distribution *q*(*θ*), we have the following decomposition of the log marginal likelihood

$$\ln p(\mathbf{X}) = \mathcal{L}(q) + \text{KL}(q||p)$$

where we have defined

$$\mathcal{L}(q) = \int q(\theta) \ln \left\{ \frac{p(\mathbf{X}, \theta)}{q(\theta)} \right\} d\theta \tag{14}$$

$$\text{KL}(q||p) = -\int q(\theta) \ln \left\{ \frac{p(\theta|\mathbf{X})}{q(\theta)} \right\} d\theta \tag{15}$$

We refer to (14) as the lower bound of the log marginal likelihood with respect to the density *q*, and (15) is by definition the Kullback–Leibler divergence of the posterior *p*(*θ*|**X**) from the density *q*. Based on this decomposition, we can maximize the lower bound L(*q*) by optimizing over the distribution *q*(*θ*), which is equivalent to minimizing the KL divergence. The lower bound is attained when the KL divergence is zero, which happens when *q*(*θ*) equals the posterior distribution *p*(*θ*|**X**). In practice, however, the true posterior is intractable, so such a density cannot simply be written down.
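The identity $\ln p(\mathbf{X}) = \mathcal{L}(q) + \mathrm{KL}(q\|p)$ can be checked numerically on a toy discrete parameter space, where all integrals become sums. A small sketch with an arbitrary joint table (the numbers are illustrative only):

```python
import numpy as np

# Discrete check of ln p(X) = L(q) + KL(q || p(theta|X)).
# p_joint[k] plays the role of p(X, theta_k) for the fixed observed X.
p_joint = np.array([0.10, 0.25, 0.05, 0.20])
p_X = p_joint.sum()                  # marginal likelihood p(X)
p_post = p_joint / p_X               # posterior p(theta | X)

q = np.array([0.4, 0.3, 0.1, 0.2])   # any probability distribution q(theta)

L_q = np.sum(q * np.log(p_joint / q))   # lower bound, Eq. (14)
KL = -np.sum(q * np.log(p_post / q))    # KL divergence, Eq. (15)
```

Here `L_q + KL` equals `np.log(p_X)` exactly, `KL` is nonnegative, and therefore `L_q` is indeed a lower bound on the log marginal likelihood, as the decomposition asserts.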

#### *4.2. Factorized Distributions*

The essence of the variational inference approach is approximation to the posterior distribution *p*(*θ*|**X**) by *q*(*θ*) for which the *q* dependent lower bound L(*q*) is more tractable than the original model evidence. In addition, tractability is achieved by restricting *q* to a more manageable class of distributions, and then maximizing L(*q*) over that class.

Suppose we partition elements of *θ* into disjoint groups {*θi*} where *i* = 1, ..., *M*. We then assume that the *q* density factorizes with respect to this partition, i.e.,

$$q(\boldsymbol{\theta}) = \prod\_{i=1}^{M} q\_i(\theta\_i). \tag{16}$$

The product form is the only assumption we make about the distribution. Restriction (16) is also known as the *mean-field* approximation and has its roots in physics [24].

For all distributions *q*(*θ*) with the form (16), we need to find the distribution for which the lower bound L(*q*) is largest. Restriction of *q* to a subclass of product densities like (16) gives rise to explicit solutions for each product component in terms of the others. This fact, in turn, leads to an iterative scheme for obtaining the solutions. To achieve this, we first substitute (16) into (14) and then separate out the dependence on one of the factors *qj*(*θj*). Denoting *qj*(*θj*) by *qj* to keep the notation clear, we obtain

$$\begin{aligned} \mathcal{L}(q) &= \int \prod_{i=1}^{M} q_i \Big\{\ln p(\mathbf{X}, \boldsymbol{\theta}) - \sum_{i=1}^{M}\ln q_i\Big\}\,d\boldsymbol{\theta} \\ &= \int q_j \Big\{\int \ln p(\mathbf{X}, \boldsymbol{\theta}) \prod_{i \neq j} q_i\,d\boldsymbol{\theta}_i\Big\}\,d\boldsymbol{\theta}_j - \int q_j \ln q_j\,d\boldsymbol{\theta}_j + \text{constant} \\ &= \int q_j \ln \tilde{p}(\mathbf{X}, \boldsymbol{\theta}_j)\,d\boldsymbol{\theta}_j - \int q_j \ln q_j\,d\boldsymbol{\theta}_j + \text{constant} \end{aligned}\tag{17}$$

where $\tilde{p}(\mathbf{X}, \boldsymbol{\theta}_j)$ is given by

$$\ln \tilde{p}(\mathbf{X}, \boldsymbol{\theta}_j) = \mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \boldsymbol{\theta})] + \text{constant}.\tag{18}$$

The notation $\mathbb{E}_{i \neq j}[\cdot]$ denotes an expectation with respect to the *q* distributions over all variables $\theta_i$ for $i \neq j$, so that

$$\mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \boldsymbol{\theta})] = \int \ln p(\mathbf{X}, \boldsymbol{\theta}) \prod_{i \neq j} q_i\,d\boldsymbol{\theta}_i.$$

Now suppose we keep the factors $q_i$ for $i \neq j$ fixed and maximize L(*q*) in (17) with respect to all possible forms for the density *qj*(*θj*). Recognizing that (17) is the negative KL divergence between *qj*(*θj*) and $\tilde{p}(\mathbf{X}, \boldsymbol{\theta}_j)$, we see that maximizing (17) is equivalent to minimizing this KL divergence, and the minimum occurs when $q_j(\theta_j) = \tilde{p}(\mathbf{X}, \boldsymbol{\theta}_j)$. The optimal $q_j^*(\theta_j)$ is then

$$\ln q\_j^\*(\theta\_j) = \mathbb{E}\_{i \neq j} [\ln p(\mathbf{X}, \theta)] + \text{constant}.\tag{19}$$

The above solution says that the log of the optimal *qj* is obtained simply by taking the log of the joint distribution over all parameters, latent and observed variables, and then taking the expectation with respect to all the other factors $q_i$ for $i \neq j$. Normalizing the exponential of (19), we have

$$q_j^*(\boldsymbol{\theta}_j) = \frac{\exp\left(\mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \boldsymbol{\theta})]\right)}{\int \exp\left(\mathbb{E}_{i \neq j}[\ln p(\mathbf{X}, \boldsymbol{\theta})]\right) d\boldsymbol{\theta}_j}.$$

The set of equations in (19) for *j* = 1, ..., *M* does not constitute an explicit solution, because the expression on the right-hand side of (19) for the optimal $q_j^*$ depends on expectations taken with respect to the other factors $q_i$ for $i \neq j$. We therefore first initialize all of the factors *qi*(*θi*) and then cycle through the factors one by one, replacing each in turn with the updated estimate given by the right-hand side of (19) evaluated using the current estimates of all the other factors. Convexity properties can be used to show that convergence to at least a local optimum is guaranteed [25]. The iterative procedure is described in Algorithm 1.

**Algorithm 1** Iterative procedure for obtaining the optimal densities under factorized density restriction (16). The updates are based on the solutions given by (19).


$$\begin{aligned} q\_1^\*(\boldsymbol{\theta}\_1) &\leftarrow \frac{\exp\left(\mathbb{E}\_{i \neq 1}[\ln p(\mathbf{X}, \boldsymbol{\theta})]\right)}{\int \exp\left(\mathbb{E}\_{i \neq 1}[\ln p(\mathbf{X}, \boldsymbol{\theta})]\right) d\boldsymbol{\theta}\_1} \\ \vdots \\ q\_M^\*(\boldsymbol{\theta}\_M) &\leftarrow \frac{\exp\left(\mathbb{E}\_{i \neq M}[\ln p(\mathbf{X}, \boldsymbol{\theta})]\right)}{\int \exp\left(\mathbb{E}\_{i \neq M}[\ln p(\mathbf{X}, \boldsymbol{\theta})]\right) d\boldsymbol{\theta}\_M} \end{aligned}$$

until the increase in L(*q*) is negligible.
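Algorithm 1 can be illustrated on the classic toy case from the variational inference literature (see, e.g., [22]): approximating a bivariate normal $p(\boldsymbol{\theta}) = N(\boldsymbol{\mu}, \boldsymbol{\Lambda}^{-1})$ by a factorized $q(\boldsymbol{\theta}) = q_1(\theta_1)q_2(\theta_2)$, for which Eq. (19) gives closed-form updates of the two factor means. This is a sketch of the generic iteration, not the probit model itself; the precision matrix below is an arbitrary example:

```python
import numpy as np

# Target: bivariate normal with mean mu and precision matrix Lam.
mu = np.array([1.0, -1.0])
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])

# Mean-field factors q1, q2 are Gaussian; Eq. (19) fixes their variances
# (1/Lam[0,0] and 1/Lam[1,1]) and couples their means, which we iterate.
m1, m2 = 0.0, 0.0                     # initial factor means
for _ in range(50):                   # cycle through the factors (Algorithm 1)
    m1 = mu[0] - Lam[0, 1] / Lam[0, 0] * (m2 - mu[1])
    m2 = mu[1] - Lam[1, 0] / Lam[1, 1] * (m1 - mu[0])
```

The coupled updates contract geometrically, and the factor means converge to the true mean of the target; the factorized approximation captures the mean exactly but, as is well known, understates the correlation structure.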

#### **5. Incorporate Intrinsic Prior with Variational Approximation to Bayesian Probit Models**

#### *5.1. Derivation of Intrinsic Prior to Be Used in Variational Inference*

Let $\mathbf{X}_l$ be the design matrix of a minimal training sample (mTS) of the normal regression model *Mj* for the variable $\mathbf{Z} \sim N(\mathbf{X}_j\boldsymbol{\beta}_j, \mathbf{I}_n)$. We have, for the (*j* + 1)-dimensional parameter $\boldsymbol{\beta}_j$,

$$\int \mathcal{N}_{j+1}(\mathbf{z}_l|\mathbf{X}_l\boldsymbol{\beta}_j, \mathbf{I}_{j+1})\,d\boldsymbol{\beta}_j = \begin{cases} |\mathbf{X}_l'\mathbf{X}_l|^{-1/2} & \text{if } \operatorname{rank}(\mathbf{X}_l) \ge j+1, \\ \infty & \text{otherwise}. \end{cases}$$

Therefore, it follows that the mTS size is *j* + 1 [2]. Given that the priors for *α* and *β* are proportional to 1, the intrinsic prior for *β* conditional on *α* can be derived. Let $\boldsymbol{\beta}_0$ denote the vector with first component equal to *α* and the others equal to zero. Based on Formula (9), we have

$$\begin{aligned} \pi^I(\boldsymbol{\beta}|\alpha) &= \pi_j^N(\boldsymbol{\beta})\, E_{M_j}^{\mathbf{z}_l|\boldsymbol{\beta}}\left[\frac{p_1(\mathbf{z}_l|\alpha)}{\int p_j(\mathbf{z}_l|\boldsymbol{\beta})\,\pi_j^N(\boldsymbol{\beta})\,d\boldsymbol{\beta}}\right] \\ &= E_{M_j}^{\mathbf{z}_l|\boldsymbol{\beta}}\left[\frac{\exp\{-\frac{1}{2}(\mathbf{z}_l-\mathbf{X}_l\boldsymbol{\beta}_0)'(\mathbf{z}_l-\mathbf{X}_l\boldsymbol{\beta}_0)\}}{\int \exp\{-\frac{1}{2}(\mathbf{z}_l-\mathbf{X}_l\boldsymbol{\beta})'(\mathbf{z}_l-\mathbf{X}_l\boldsymbol{\beta})\}\,d\boldsymbol{\beta}}\right] \\ &= (2\pi)^{-\frac{j+1}{2}}\,|(\mathbf{X}_l'\mathbf{X}_l)^{-1}|^{-\frac{1}{2}} \times E_{M_j}^{\mathbf{z}_l|\boldsymbol{\beta}}\left[\exp\{-\tfrac{1}{2}(\mathbf{z}_l-\mathbf{X}_l\boldsymbol{\beta}_0)'(\mathbf{z}_l-\mathbf{X}_l\boldsymbol{\beta}_0)\}\right] \\ &= (2\pi)^{-\frac{j+1}{2}}\,|2(\mathbf{X}_l'\mathbf{X}_l)^{-1}|^{-\frac{1}{2}}\,\exp\Big\{-\tfrac{1}{2}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)'\,\tfrac{\mathbf{X}_l'\mathbf{X}_l}{2}\,(\boldsymbol{\beta}-\boldsymbol{\beta}_0)\Big\}. \end{aligned}$$

*Entropy* **2020**, *22*, 513

Therefore,

$$\pi^I(\boldsymbol{\beta}|\alpha) = N_{j+1}(\boldsymbol{\beta}|\boldsymbol{\beta}_0,\ 2(\mathbf{X}_l'\mathbf{X}_l)^{-1}), \quad \text{where } \boldsymbol{\beta}_0 = \begin{pmatrix} \alpha \\ 0 \\ \vdots \\ 0 \end{pmatrix}_{(j+1)\times 1}.$$

Notice that $\mathbf{X}_l'\mathbf{X}_l$ is unknown because it is a theoretical design matrix corresponding to the training sample $\mathbf{z}_l$. It can be estimated by averaging over all submatrices containing *j* + 1 rows of the *n* × (*j* + 1) design matrix $\mathbf{X}_j$. This average is $\frac{j+1}{n}\mathbf{X}_j'\mathbf{X}_j$ (see [26] and Appendix A in [2]), and therefore

$$
\pi^I(\boldsymbol{\beta}|\alpha) = N_{j+1}\Big(\boldsymbol{\beta}\,\Big|\,\boldsymbol{\beta}_0,\ \frac{2n}{j+1}(\mathbf{X}_j'\mathbf{X}_j)^{-1}\Big).
$$

Next, based on *π<sup>I</sup>* (*β*|*α*), the intrinsic prior for *β* can be obtained by

$$
\pi^I(\boldsymbol{\beta}) = \int \pi^I(\boldsymbol{\beta}|\alpha)\,\pi^I(\alpha)\,d\alpha. \tag{20}
$$


Since we assume that $\pi^I(\alpha) = \pi^N(\alpha)$ is proportional to one, set $\pi^N(\alpha) = c$, where *c* is an arbitrary positive constant. Denoting $\frac{2n}{j+1}(\mathbf{X}_j'\mathbf{X}_j)^{-1}$ by $\Sigma_{\beta|\alpha}$, we obtain

$$\begin{aligned} \pi^I(\boldsymbol{\beta}) &= \int \pi^I(\boldsymbol{\beta}|\alpha)\,\pi^I(\alpha)\,d\alpha \\ &= c\,(2\pi)^{-\frac{j+1}{2}}\,|\Sigma_{\beta|\alpha}|^{-\frac{1}{2}} \int \exp\{-\tfrac{1}{2}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)'\Sigma_{\beta|\alpha}^{-1}(\boldsymbol{\beta}-\boldsymbol{\beta}_0)\}\,d\alpha \\ &\propto \exp\{-\tfrac{1}{2}\boldsymbol{\beta}'\Sigma_{\beta|\alpha}^{-1}\boldsymbol{\beta}\} \times \int \exp\{-\tfrac{1}{2}[\boldsymbol{\beta}_0'\Sigma_{\beta|\alpha}^{-1}\boldsymbol{\beta}_0 - 2\boldsymbol{\beta}'\Sigma_{\beta|\alpha}^{-1}\boldsymbol{\beta}_0]\}\,d\alpha \\ &\propto \exp\{-\tfrac{1}{2}\boldsymbol{\beta}'\Sigma_{\beta|\alpha}^{-1}\boldsymbol{\beta}\} \times \int \exp\{-\tfrac{1}{2}(\Sigma_{\beta|\alpha(1,1)}^{-1}\,\alpha^2 - 2\boldsymbol{\beta}'\Sigma_{\beta|\alpha(\cdot 1)}^{-1}\,\alpha)\}\,d\alpha, \end{aligned}\tag{21}$$

where $\Sigma_{\beta|\alpha(1,1)}^{-1}$ is the entry of $\Sigma_{\beta|\alpha}^{-1}$ at position row 1, column 1, and $\Sigma_{\beta|\alpha(\cdot 1)}^{-1}$ is the first column of $\Sigma_{\beta|\alpha}^{-1}$. Denoting $\Sigma_{\beta|\alpha(1,1)}^{-1}$ by $\sigma_{11}$ and $\Sigma_{\beta|\alpha(\cdot 1)}^{-1}$ by $\gamma_1$, we then obtain

$$\begin{aligned} \pi^I(\boldsymbol{\beta}) &\propto \exp\{-\tfrac{1}{2}\boldsymbol{\beta}'\Sigma_{\beta|\alpha}^{-1}\boldsymbol{\beta}\} \times \int \exp\Big\{-\tfrac{1}{2}\sigma_{11}\Big(\alpha-\frac{\boldsymbol{\beta}'\gamma_1}{\sigma_{11}}\Big)^2 + \tfrac{1}{2}\frac{(\boldsymbol{\beta}'\gamma_1)^2}{\sigma_{11}}\Big\}\,d\alpha \\ &\propto \exp\Big\{-\tfrac{1}{2}\Big(\boldsymbol{\beta}'\Sigma_{\beta|\alpha}^{-1}\boldsymbol{\beta} - \boldsymbol{\beta}'\frac{\gamma_1\gamma_1'}{\sigma_{11}}\boldsymbol{\beta}\Big)\Big\} \times \sqrt{2\pi}\,\sigma_{11}^{-1/2} \\ &\propto \exp\Big\{-\tfrac{1}{2}\boldsymbol{\beta}'\Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)\boldsymbol{\beta}\Big\}. \end{aligned}\tag{22}$$

Therefore, we have derived that

$$
\pi^I(\boldsymbol{\beta}) \propto N_{j+1}\Big(\mathbf{0},\ \Big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big).\tag{23}
$$
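The construction of the intrinsic prior precision in (21)–(23) is purely matrix arithmetic once the design matrix is given. A minimal sketch (the helper name is ours): carrying out the subtraction shows that the first row and column of the difference vanish, i.e., the flat prior on *α* propagates to the intercept, so the matrix is positive semidefinite of rank *j* and the "inverse" in (23) is to be read as a proper normal distribution on the slope coordinates.

```python
import numpy as np

def intrinsic_prior_precision(X):
    """Precision matrix of pi^I(beta) following Eqs. (20)-(23).
    X is the n x (j+1) design matrix whose first column is all ones.
    Sigma_{beta|alpha} = (2n/(j+1)) (X'X)^{-1}, so its inverse is
    ((j+1)/(2n)) X'X; alpha is then integrated out analytically."""
    n, p = X.shape                           # p = j + 1
    prec_cond = (p / (2.0 * n)) * (X.T @ X)  # Sigma_{beta|alpha}^{-1}
    sigma11 = prec_cond[0, 0]                # (1,1) entry
    gamma1 = prec_cond[:, 0]                 # first column
    # Eq. (23): Sigma_beta^{-1} = Sigma_{beta|alpha}^{-1} - gamma1 gamma1'/sigma11
    return prec_cond - np.outer(gamma1, gamma1) / sigma11
```

By Cauchy–Schwarz (in the inner product induced by the conditional precision), the returned matrix is positive semidefinite with exactly one null direction, the intercept.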


For model comparison, the specific form of the intrinsic prior may be needed, including the constant factor. Therefore, by following (21) and (22) we have

$$\begin{aligned} \pi^I(\boldsymbol{\beta}) &= c\,(2\pi)^{-\frac{j+1}{2}}\,|\Sigma_{\beta|\alpha}|^{-\frac{1}{2}}\,(2\pi)^{\frac{j+1}{2}}\,\Big|\Big(\Sigma_{\beta|\alpha}^{-1}-\frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big|^{\frac{1}{2}}\,\sqrt{2\pi}\,\sigma_{11}^{-1/2} \times N_{j+1}\Big(\mathbf{0},\Big(\Sigma_{\beta|\alpha}^{-1}-\frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big) \\ &= c\,\Big|\Sigma_{\beta|\alpha}\Big(\Sigma_{\beta|\alpha}^{-1}-\frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)\Big|^{-\frac{1}{2}}\,\sqrt{2\pi}\,\sigma_{11}^{-1/2} \times N_{j+1}\Big(\mathbf{0},\Big(\Sigma_{\beta|\alpha}^{-1}-\frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big) \\ &= c\,\sqrt{2\pi}\,\sigma_{11}^{-1/2}\,\Big|\mathbf{I}-\frac{\gamma_1\gamma_1'}{\sigma_{11}}\Sigma_{\beta|\alpha}\Big|^{-\frac{1}{2}} \times N_{j+1}\Big(\mathbf{0},\Big(\Sigma_{\beta|\alpha}^{-1}-\frac{\gamma_1\gamma_1'}{\sigma_{11}}\Big)^{-1}\Big). \end{aligned}\tag{24}$$

#### *5.2. Variational Inference for Probit Model with Intrinsic Prior*

#### 5.2.1. Iterative Updates for Factorized Distributions

We have that

$$\begin{aligned} Z_i|\boldsymbol{\beta} &\sim N(\mathbf{x}_i'\boldsymbol{\beta}, 1) \quad \text{and} \\ p(y_i|z_i) &= \mathbb{I}(z_i > 0)\mathbb{I}(y_i = 1) + \mathbb{I}(z_i \le 0)\mathbb{I}(y_i = 0) \end{aligned}$$

in Section 3.1. We have shown in Section 5.1 that

$$
\pi^I(\boldsymbol{\beta}) \propto N_{j+1}(\boldsymbol{\mu}_{\beta},\ \Sigma_{\beta}),
$$

where $\boldsymbol{\mu}_{\beta} = \mathbf{0}$ and $\Sigma_{\beta} = \big(\Sigma_{\beta|\alpha}^{-1} - \frac{\gamma_1\gamma_1'}{\sigma_{11}}\big)^{-1}$. Since **y** is independent of *β* given **z**, we have

$$\begin{split} p(\mathbf{y}, \mathbf{z}, \boldsymbol{\beta}) &= p(\mathbf{y}|\mathbf{z}, \boldsymbol{\beta}) p(\mathbf{z}|\boldsymbol{\beta}) p(\boldsymbol{\beta}) \\ &= p(\mathbf{y}|\mathbf{z}) p(\mathbf{z}|\boldsymbol{\beta}) p(\boldsymbol{\beta}). \end{split} \tag{25}$$

To apply the variational approximation to the probit regression model, the unobservable variables are considered in two separate groups, the coefficient parameter *β* and the auxiliary variables **Z**. To approximate the posterior distribution of *β*, consider the product form

$$q(\mathbf{Z}, \boldsymbol{\beta}) = q_{\mathbf{Z}}(\mathbf{Z})\, q_{\beta}(\boldsymbol{\beta}).$$

We proceed by first describing the distribution for each factor of the approximation, *q***Z**(**Z**) and *qβ*(*β*). Then variational approximation is accomplished by iteratively updating the parameters of each factor distribution.

Starting with *q***Z**(**Z**): when *yi* = 1, we have

$$\log p(\mathbf{y}, \mathbf{z}, \boldsymbol{\beta}) = \log\left(\prod_i \frac{1}{\sqrt{2\pi}}\exp\Big\{-\frac{(z_i - \mathbf{x}_i'\boldsymbol{\beta})^2}{2}\Big\} \times \pi^I(\boldsymbol{\beta})\right), \qquad \text{where } z_i > 0.$$

Now, according to (19) and Algorithm 1, the optimal *q***<sup>Z</sup>** is proportional to

$$\begin{aligned} \mathbb{E}_{\beta}[\log p(\mathbf{y}, \mathbf{z}, \boldsymbol{\beta})] &= -\frac{1}{2}\mathbb{E}_{\beta}[\mathbf{z}'\mathbf{z} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{z} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}] + \mathbb{E}_{\beta}[\log \pi^I(\boldsymbol{\beta})] \\ &= -\frac{1}{2}\mathbf{z}'\mathbf{z} + \mathbb{E}_{\beta}[\boldsymbol{\beta}]'\mathbf{X}'\mathbf{z} + \text{constant}, \end{aligned}$$

where the terms not involving **z** have been absorbed into the constant. So, we have the optimal *q***Z**,

$$\begin{aligned} q_{\mathbf{Z}}^{*}(\mathbf{Z}) &\propto \exp\Big\{-\frac{1}{2}\mathbf{z}'\mathbf{z} + \mathbb{E}_{\beta}[\boldsymbol{\beta}]'\mathbf{X}'\mathbf{z}\Big\} \\ &\propto \exp\Big\{-\frac{1}{2}(\mathbf{z} - \mathbf{X}\mathbb{E}_{\beta}[\boldsymbol{\beta}])'(\mathbf{z} - \mathbf{X}\mathbb{E}_{\beta}[\boldsymbol{\beta}])\Big\}. \end{aligned}$$

A similar procedure can be used for the case *yi* = 0. Therefore, the optimal approximation for *q***Z** is a truncated normal distribution, where

$$q_{Z_i}^{*}(z_i) = \begin{cases} N_{[0, +\infty)}\big((\mathbf{X}\mathbb{E}_{\beta}[\boldsymbol{\beta}])_i,\ 1\big) & \text{if } y_i = 1, \\ N_{(-\infty, 0]}\big((\mathbf{X}\mathbb{E}_{\beta}[\boldsymbol{\beta}])_i,\ 1\big) & \text{if } y_i = 0. \end{cases}\tag{26}$$

Denote $\mathbf{X}\mathbb{E}_{\beta}[\boldsymbol{\beta}]$ by $\boldsymbol{\mu}_{\mathbf{z}}$, the location of the distribution $q_{\mathbf{Z}}^{*}(\mathbf{Z})$. The expectation $\mathbb{E}_{\beta}$ is taken with respect to the density $q_{\beta}(\boldsymbol{\beta})$, which we derive next.

For *qβ*(*β*), given the joint form in (25), we have

$$\log p(\mathbf{y}, \mathbf{z}, \boldsymbol{\beta}) = -\frac{1}{2}(\mathbf{z} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{z} - \mathbf{X}\boldsymbol{\beta}) - \frac{1}{2}(\boldsymbol{\beta} - \boldsymbol{\mu}_{\beta})'\Sigma_{\beta}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu}_{\beta}) + \text{constant}.$$

Taking expectation with respect to *q***Z**(**z**), we have

$$\begin{split} \mathbb{E}\_{\mathbf{Z}}[\log p(\mathbf{y}, \mathbf{z}, \boldsymbol{\beta})] &= -\frac{1}{2} \mathbb{E}\_{\mathbf{Z}}[\mathbf{z}'\mathbf{z}] + \mathbb{E}\_{\mathbf{Z}}[\mathbf{Z}]'\mathbf{X}\boldsymbol{\beta} - \frac{1}{2} \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} \\ &\quad - \frac{1}{2} \boldsymbol{\beta}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\beta} + \boldsymbol{\mu}\_{\boldsymbol{\beta}}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\beta} + \text{constant}. \end{split}$$

Again, based on (19) and Algorithm 1, the log of the optimal *qβ*(*β*) equals E**Z**[log *p*(**y**, **z**, *β*)] up to an additive constant,

$$q^{\*}\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \propto \exp\left\{-\frac{1}{2} \boldsymbol{\beta}'(\mathbf{X}'\mathbf{X} + \boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1})\boldsymbol{\beta} + (\mathbb{E}\_{\mathbf{Z}}[\mathbf{Z}]'\mathbf{X} + \boldsymbol{\mu}\_{\boldsymbol{\beta}}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1})\boldsymbol{\beta}\right\}.$$

First notice that all constant terms, including the constant factor in the intrinsic prior, cancel due to the ratio form of (19). Then, recognizing the quadratic form in the above expression, we have

$$q\_{\boldsymbol{\beta}}^{\*}(\boldsymbol{\beta}) = N(\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}, \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}), \tag{27}$$

where

$$\begin{aligned} \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}} &= (\mathbf{X}'\mathbf{X} + \boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1})^{-1}, \\ \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} &= (\mathbf{X}'\mathbf{X} + \boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1})^{-1}(\mathbf{X}'\mathbb{E}\_{\mathbf{Z}}[\mathbf{Z}] + \boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\mu}\_{\boldsymbol{\beta}}). \end{aligned}$$

Notice that *<sup>μ</sup>q<sup>β</sup>* , i.e., <sup>E</sup>*β*[*β*], depends on <sup>E</sup>**Z**[**Z**]. In addition, from our previous derivation, we found that the update for E**Z**[**Z**] depends on E*β*[*β*]. Given that the density form of *q***<sup>Z</sup>** is truncated normal, we have

$$\mathbb{E}\_{\mathbf{Z}}[Z\_{i}] = \begin{cases} (\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}])\_{i} + \dfrac{\phi(-(\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}])\_{i})}{1 - \Phi(-(\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}])\_{i})} & \text{if } y\_{i} = 1, \\[2ex] (\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}])\_{i} - \dfrac{\phi(-(\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}])\_{i})}{\Phi(-(\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}])\_{i})} & \text{if } y\_{i} = 0, \end{cases}$$

where *φ* is the standard normal density and Φ is the standard normal cumulative distribution function. Denote E**Z**[**Z**] by *μq***<sup>Z</sup>**. See properties of the truncated normal distribution in Appendix B. Updating procedures for the parameters *μq<sup>β</sup>* and *μq***<sup>Z</sup>** of each factor distribution are summarized in Algorithm 2.

**Algorithm 2** Iterative procedure for updating parameters to reach optimal factor densities *q*∗ *<sup>β</sup>* and *q*<sup>∗</sup> **Z** in Bayesian probit regression model. The updates are based on the solutions given by (26) and (27).

1: Initialize *μq***<sup>Z</sup>** .

2: Cycle through

$$\begin{split} \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} &\leftarrow (\mathbf{X}'\mathbf{X} + \boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1})^{-1} (\mathbf{X}'\boldsymbol{\mu}\_{q\_{\mathbf{Z}}} + \boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\mu}\_{\boldsymbol{\beta}}), \\ \boldsymbol{\mu}\_{q\_{\mathbf{Z}}} &\leftarrow \mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} + \frac{\phi(\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})}{\Phi(\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})^{\mathbf{y}}\,[\Phi(\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}) - \mathbf{1}]^{\mathbf{1}-\mathbf{y}}} \quad \text{(elementwise)} \end{split}$$

until the increase in L(*q*) is negligible.
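The cycle above can be sketched directly in NumPy/SciPy. This is a minimal illustration under assumed inputs (a design matrix `X`, binary responses `y`, and a normal approximation `N(mu_beta, Sigma_beta)` to the intrinsic prior); the function name `cavi_probit` and the stopping rule on parameter change (instead of monitoring L(*q*)) are our simplifications, not the authors' code.

```python
import numpy as np
from scipy.stats import norm

def cavi_probit(X, y, mu_beta, Sigma_beta, n_iter=200, tol=1e-8):
    """Coordinate-ascent updates of Algorithm 2 for the Bayesian probit model."""
    n, p = X.shape
    Sigma_beta_inv = np.linalg.inv(Sigma_beta)
    # Sigma_{q_beta} is fixed across iterations: (X'X + Sigma_beta^{-1})^{-1}
    Sigma_q = np.linalg.inv(X.T @ X + Sigma_beta_inv)
    mu_qz = np.where(y == 1, 0.5, -0.5)   # crude initialization of E_Z[Z]
    mu_qb = np.zeros(p)
    for _ in range(n_iter):
        mu_qb_new = Sigma_q @ (X.T @ mu_qz + Sigma_beta_inv @ mu_beta)
        m = X @ mu_qb_new                 # location mu_z of q_Z
        # truncated-normal means from (26), elementwise in y
        mu_qz = np.where(y == 1,
                         m + norm.pdf(-m) / (1.0 - norm.cdf(-m)),
                         m - norm.pdf(-m) / norm.cdf(-m))
        if np.max(np.abs(mu_qb_new - mu_qb)) < tol:
            mu_qb = mu_qb_new
            break
        mu_qb = mu_qb_new
    return mu_qb, Sigma_q
```

On simulated data the fixed point typically recovers the sign and rough scale of the generating coefficients, though mean-field approximations are known to understate the posterior variance (see Section 7).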

#### 5.2.2. Evaluation of the Lower Bound L(*q*)

During the optimization of the variational approximation densities, the lower bound on the log marginal likelihood needs to be evaluated and monitored to determine when the iterative updating process converges. Based on the derivations in the previous section, we now have the exact form of the variational inference density,

$$q(\boldsymbol{\beta}, \mathbf{Z}) = q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})\, q\_{\mathbf{Z}}(\mathbf{Z}).$$

According to (14), we can write down the lower bound L(*q*) with respect to *q*(*β*, **Z**).

$$\begin{split} \mathcal{L}(q) &= \int q(\boldsymbol{\beta}, \mathbf{Z}) \log \left\{ \frac{p(\mathbf{Y}, \boldsymbol{\beta}, \mathbf{Z})}{q(\boldsymbol{\beta}, \mathbf{Z})} \right\} d\boldsymbol{\beta}\, d\mathbf{Z} \\ &= \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{Z}) \log \left\{ \frac{p(\mathbf{Y}, \boldsymbol{\beta}, \mathbf{Z})}{q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{Z})} \right\} d\boldsymbol{\beta}\, d\mathbf{Z} \\ &= \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{Z}) \log \{ p(\mathbf{Y}, \boldsymbol{\beta}, \mathbf{Z}) \}\, d\boldsymbol{\beta}\, d\mathbf{Z} - \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{Z}) \log \{ q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{Z}) \}\, d\boldsymbol{\beta}\, d\mathbf{Z} \\ &= \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \{ p(\mathbf{Y}, \mathbf{Z} | \boldsymbol{\beta}) \}] + \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \pi^{I}(\boldsymbol{\beta})] - \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \{ q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \}] - \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \{ q\_{\mathbf{Z}}(\mathbf{Z}) \}]. \end{split} \tag{28}$$

As we can see in (28), L(*q*) has been divided into four different parts with expectation taken over the variational approximation density *q*(*β*, **Z**) = *qβ*(*β*)*q***Z**(**Z**). We now find the expression of these expectations one by one.

Part 1: <sup>E</sup>*β*,**Z**[log{*p*(**Y**, **<sup>Z</sup>**|*β*)}]

$$\begin{split} \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \{ p(\mathbf{Y}, \mathbf{Z} | \boldsymbol{\beta}) \}] &= \log(2\pi)^{-\frac{n}{2}} + \int \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{z}) \left\{-\frac{1}{2} (\mathbf{z} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{z} - \mathbf{X}\boldsymbol{\beta})\right\} d\boldsymbol{\beta}\, d\mathbf{z} \\ &= \log(2\pi)^{-\frac{n}{2}} + \int q\_{\mathbf{Z}}(\mathbf{z}) \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \left\{-\frac{1}{2} (\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\mathbf{z}'\mathbf{X}\boldsymbol{\beta} + \mathbf{z}'\mathbf{z})\right\} d\boldsymbol{\beta}\, d\mathbf{z}. \end{split} \tag{29}$$

Dealing with the inner integral first, we have

$$\begin{split} \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \left\{ -\frac{1}{2} (\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\mathbf{z}'\mathbf{X}\boldsymbol{\beta} + \mathbf{z}'\mathbf{z}) \right\} d\boldsymbol{\beta} &= -\frac{1}{2} \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) [\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}]\, d\boldsymbol{\beta} + \mathbf{z}'\mathbf{X}\mathbb{E}\_{\boldsymbol{\beta}}[\boldsymbol{\beta}] - \frac{1}{2} \mathbf{z}'\mathbf{z} \\ &= -\frac{1}{2} \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) [\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}]\, d\boldsymbol{\beta} + \mathbf{z}'\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} - \frac{1}{2} \mathbf{z}'\mathbf{z}, \end{split} \tag{30}$$

where

$$\begin{split}-\frac{1}{2}\int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})[\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}]\, d\boldsymbol{\beta} &= -\frac{1}{2}\int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})[(\boldsymbol{\beta}-\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}+\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})'\mathbf{X}'\mathbf{X}(\boldsymbol{\beta}-\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}+\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})]\, d\boldsymbol{\beta} \\ &= -\frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}\,\mathbb{E}\_{\boldsymbol{\beta}}[(\boldsymbol{\beta}-\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})(\boldsymbol{\beta}-\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})']) - \frac{1}{2}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}'\mathbf{X}'\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} \\ &= -\frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]).\end{split} \tag{31}$$

Substituting (31) into (30), we get

$$\int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \left\{-\frac{1}{2}(\boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta} - 2\mathbf{z}'\mathbf{X}\boldsymbol{\beta} + \mathbf{z}'\mathbf{z})\right\} d\boldsymbol{\beta} = -\frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \mathbf{z}'\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} - \frac{1}{2}\mathbf{z}'\mathbf{z}. \tag{32}$$

Substituting (32) back into (29) gives

$$\begin{split} \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \{ p(\mathbf{Y}, \mathbf{Z} | \boldsymbol{\beta}) \}] &= \log(2\pi)^{-\frac{n}{2}} + \int q\_{\mathbf{Z}}(\mathbf{z})\left\{-\frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \mathbf{z}'\mathbf{X}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} - \frac{1}{2}\mathbf{z}'\mathbf{z}\right\} d\mathbf{z} \\ &= \log(2\pi)^{-\frac{n}{2}} - \frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) - \frac{1}{2}\mathbb{E}\_{\mathbf{Z}}[\mathbf{z}'\mathbf{z}] + \boldsymbol{\mu}\_{q\_{\mathbf{Z}}}'\boldsymbol{\mu}\_{\mathbf{z}} \\ &= \log(2\pi)^{-\frac{n}{2}} - \frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \boldsymbol{\mu}\_{q\_{\mathbf{Z}}}'\boldsymbol{\mu}\_{\mathbf{z}} \\ &\quad - \frac{1}{2}\sum\_{i=1}^{n}\left[1 + \mu\_{\mathbf{z}\_{i}}^{2} - \mu\_{\mathbf{z}\_{i}}\frac{\phi(-\mu\_{\mathbf{z}\_{i}})}{\Phi(-\mu\_{\mathbf{z}\_{i}})}\right]^{\mathbb{I}(y\_{i}=0)}\left[1 + \mu\_{\mathbf{z}\_{i}}^{2} + \mu\_{\mathbf{z}\_{i}}\frac{\phi(-\mu\_{\mathbf{z}\_{i}})}{1 - \Phi(-\mu\_{\mathbf{z}\_{i}})}\right]^{\mathbb{I}(y\_{i}=1)} \\ &= \log(2\pi)^{-\frac{n}{2}} - \frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \boldsymbol{\mu}\_{q\_{\mathbf{Z}}}'\boldsymbol{\mu}\_{\mathbf{z}} - \frac{1}{2}\sum\_{i=1}^{n}\left[1 + \mu\_{q\_{\mathbf{Z}\_{i}}}\mu\_{\mathbf{z}\_{i}}\right] \\ &= \log(2\pi)^{-\frac{n}{2}} - \frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \frac{1}{2}\boldsymbol{\mu}\_{q\_{\mathbf{Z}}}'\boldsymbol{\mu}\_{\mathbf{z}} - \frac{n}{2}. \end{split} \tag{33}$$

We applied properties of the truncated normal distribution in Appendix B to find the expression for the second moment E**Z**[**z**'**z**].

Part 2: E*β*,**Z**[log *<sup>q</sup>***Z**(**z**)]

$$\begin{split} \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log q\_{\mathbf{Z}}(\mathbf{z})] &= \int \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) q\_{\mathbf{Z}}(\mathbf{z}) \log q\_{\mathbf{Z}}(\mathbf{z})\, d\boldsymbol{\beta}\, d\mathbf{z} \\ &= \int q\_{\mathbf{Z}}(\mathbf{z}) \log q\_{\mathbf{Z}}(\mathbf{z})\, d\mathbf{z} \\ &= -\frac{n}{2}(\log(2\pi) + 1) \\ &\quad - \sum\_{i=1}^{n} \left\{\left[\log(\Phi(-\mu\_{\mathbf{z}\_{i}})) + \mu\_{\mathbf{z}\_{i}}\frac{\phi(-\mu\_{\mathbf{z}\_{i}})}{2\Phi(-\mu\_{\mathbf{z}\_{i}})}\right]^{\mathbb{I}(y\_{i}=0)} \left[\log(1 - \Phi(-\mu\_{\mathbf{z}\_{i}})) - \mu\_{\mathbf{z}\_{i}}\frac{\phi(-\mu\_{\mathbf{z}\_{i}})}{2(1 - \Phi(-\mu\_{\mathbf{z}\_{i}}))}\right]^{\mathbb{I}(y\_{i}=1)}\right\} \\ &= -\frac{n}{2}(\log(2\pi) + 1) - \frac{1}{2}\boldsymbol{\mu}\_{\mathbf{z}}'\boldsymbol{\mu}\_{\mathbf{z}} + \frac{1}{2}\boldsymbol{\mu}\_{q\_{\mathbf{Z}}}'\boldsymbol{\mu}\_{\mathbf{z}} - \sum\_{i=1}^{n}\left\{[\log(\Phi(-\mu\_{\mathbf{z}\_{i}}))]^{\mathbb{I}(y\_{i}=0)}[\log(1 - \Phi(-\mu\_{\mathbf{z}\_{i}}))]^{\mathbb{I}(y\_{i}=1)}\right\}. \end{split} \tag{34}$$

Again, see Appendix B for well-known properties of the truncated normal distribution. Now, subtracting (34) from (33), we obtain

$$\begin{split} \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log\{p(\mathbf{Y}, \mathbf{Z}|\boldsymbol{\beta})\}] - \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log q\_{\mathbf{Z}}(\mathbf{z})] &= -\frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \frac{1}{2}\boldsymbol{\mu}\_{\mathbf{z}}'\boldsymbol{\mu}\_{\mathbf{z}} \\ &\quad + \sum\_{i=1}^{n}\left\{[\log(\Phi(-\mu\_{\mathbf{z}\_{i}}))]^{\mathbb{I}(y\_{i}=0)}[\log(1 - \Phi(-\mu\_{\mathbf{z}\_{i}}))]^{\mathbb{I}(y\_{i}=1)}\right\}. \end{split} \tag{35}$$

*Entropy* **2020**, *22*, 513

Based on the exact expression of the intrinsic prior *π<sup>I</sup>* (*β*), denoting all constant terms by *C*, we have

Part 3: E*β*,**Z**[log *π<sup>I</sup>*(*β*)]

$$\begin{split} \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \pi^{I}(\boldsymbol{\beta})] &= \int \int q\_{\mathbf{Z}}(\mathbf{z}) q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \log \pi^{I}(\boldsymbol{\beta})\, d\boldsymbol{\beta}\, d\mathbf{z} \\ &= \log C - \frac{(j+1)}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}| - \frac{1}{2}\int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})[\boldsymbol{\beta}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\beta}]\, d\boldsymbol{\beta}. \end{split} \tag{36}$$

To find the expression for the integral, we have

$$\begin{split} \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})[\boldsymbol{\beta}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\beta}]\, d\boldsymbol{\beta} &= \int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})(\boldsymbol{\beta} - \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} + \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} + \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})\, d\boldsymbol{\beta} \\ &= \mathbb{E}[\text{trace}(\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})(\boldsymbol{\beta} - \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})')] + \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}} \\ &= \text{trace}(\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}) + \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}. \end{split} \tag{37}$$

Substituting (37) back into (36), we obtain

$$\mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \pi^{I}(\boldsymbol{\beta})] = \log C - \frac{(j+1)}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}| - \frac{1}{2}\left[\text{trace}(\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}) + \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\right]. \tag{38}$$

Part 4: E*β*,**Z**[log *<sup>q</sup>β*(*β*)]

$$\begin{split} \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})] &= \int \int q\_{\mathbf{Z}}(\mathbf{z}) q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}) \log q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})\, d\boldsymbol{\beta}\, d\mathbf{z} \\ &= -\frac{j+1}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}| - \frac{1}{2}\int q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})(\boldsymbol{\beta} - \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})'\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}^{-1}(\boldsymbol{\beta} - \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}})\, d\boldsymbol{\beta} \\ &= -\frac{j+1}{2}\log(2\pi) - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}| - \frac{1}{2}\text{trace}(\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}^{-1}\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}) \\ &= -\frac{j+1}{2}(\log(2\pi) + 1) - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}|. \end{split} \tag{39}$$

Combining all four parts together, we get

$$\begin{split} \mathcal{L}(q) &= \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log\{p(\mathbf{Y}, \mathbf{Z}|\boldsymbol{\beta})\}] + \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log \pi^{I}(\boldsymbol{\beta})] - \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log\{q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})\}] - \mathbb{E}\_{\boldsymbol{\beta}, \mathbf{Z}}[\log\{q\_{\mathbf{Z}}(\mathbf{Z})\}] \\ &= \underbrace{-\frac{1}{2}\text{trace}(\mathbf{X}'\mathbf{X}[\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}' + \boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}]) + \frac{1}{2}\boldsymbol{\mu}\_{\mathbf{z}}'\boldsymbol{\mu}\_{\mathbf{z}} + \sum\_{i=1}^{n}\{[\log(\Phi(-\mu\_{\mathbf{z}\_{i}}))]^{\mathbb{I}(y\_{i}=0)}[\log(1 - \Phi(-\mu\_{\mathbf{z}\_{i}}))]^{\mathbb{I}(y\_{i}=1)}\}}\_{\mathbb{E}\_{\boldsymbol{\beta},\mathbf{Z}}[\log\{p(\mathbf{Y},\mathbf{Z}|\boldsymbol{\beta})\}] - \mathbb{E}\_{\boldsymbol{\beta},\mathbf{Z}}[\log\{q\_{\mathbf{Z}}(\mathbf{Z})\}]} \\ &\quad + \underbrace{\log C - \frac{1}{2}\log|\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}| - \frac{1}{2}\left[\text{trace}(\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}) + \boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}'\boldsymbol{\Sigma}\_{\boldsymbol{\beta}}^{-1}\boldsymbol{\mu}\_{q\_{\boldsymbol{\beta}}}\right] + \frac{j+1}{2} + \frac{1}{2}\log|\boldsymbol{\Sigma}\_{q\_{\boldsymbol{\beta}}}|}\_{\mathbb{E}\_{\boldsymbol{\beta},\mathbf{Z}}[\log \pi^{I}(\boldsymbol{\beta})] - \mathbb{E}\_{\boldsymbol{\beta},\mathbf{Z}}[\log\{q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})\}]}. \end{split} \tag{40}$$
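For monitoring convergence, the closed form (40) can be evaluated directly at each cycle. Below is a sketch under assumed inputs (design matrix `X`, binary `y`, fitted `mu_qb` and `Sigma_q`, and prior covariance `Sigma_beta`); the function name `elbo` is ours, and `log_C` stands for the constant log *C* of the intrinsic prior, treated here as a known input.

```python
import numpy as np
from scipy.stats import norm

def elbo(X, y, mu_qb, Sigma_q, Sigma_beta, log_C=0.0):
    """Evaluate the lower bound L(q) of Equation (40)."""
    Sigma_beta_inv = np.linalg.inv(Sigma_beta)
    mu_z = X @ mu_qb
    # E[log p(Y,Z|beta)] - E[log q_Z(Z)], the first brace of (40):
    # note log(1 - Phi(-mu_z)) = log Phi(mu_z)
    part_likelihood = (
        -0.5 * np.trace(X.T @ X @ (np.outer(mu_qb, mu_qb) + Sigma_q))
        + 0.5 * mu_z @ mu_z
        + np.sum(np.where(y == 1, norm.logcdf(mu_z), norm.logcdf(-mu_z)))
    )
    # E[log pi^I(beta)] - E[log q_beta(beta)], the second brace of (40)
    j_plus_1 = len(mu_qb)
    part_prior = (
        log_C
        - 0.5 * np.linalg.slogdet(Sigma_beta)[1]
        - 0.5 * (np.trace(Sigma_beta_inv @ Sigma_q)
                 + mu_qb @ Sigma_beta_inv @ mu_qb)
        + 0.5 * j_plus_1
        + 0.5 * np.linalg.slogdet(Sigma_q)[1]
    )
    return part_likelihood + part_prior
```

Iterating the updates until successive values of this quantity stop increasing reproduces the stopping rule of Algorithm 2.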

#### *5.3. Model Comparison Based on Variational Approximation*

Suppose we want to compare two models, *M*<sup>1</sup> and *M*0, where *M*<sup>0</sup> is the simpler model. An intuitive approach to comparing two models with variational approximation methods is simply to compare the lower bounds L(*q*1) and L(*q*0). However, by comparing the lower bounds, we are implicitly assuming that the KL divergences in the two approximations are the same, so that the lower bounds alone can serve as a guide. Unfortunately, it is not easy to measure in theory how tight any particular bound is; if this could be accomplished, we could estimate the log marginal likelihood more accurately from the beginning. As clarified in [27], when comparing two exact log marginal likelihoods, we have

$$\log p\_1(\mathbf{X}) - \log p\_0(\mathbf{X}) = \left[ \mathcal{L}(q\_1) + KL(q\_1 \parallel p\_1) \right] - \left[ \mathcal{L}(q\_0) + KL(q\_0 \parallel p\_0) \right] \tag{41}$$

$$=\mathcal{L}(q\_1) - \mathcal{L}(q\_0) + \left[ KL(q\_1 \parallel p\_1) - KL(q\_0 \parallel p\_0) \right] \tag{42}$$

$$\neq \mathcal{L}(q\_1) - \mathcal{L}(q\_0). \tag{43}$$

The difference in log marginal likelihoods, log *p*1(**X**) − log *p*0(**X**), is the quantity we wish to estimate. However, if we base it on the difference of the lower bounds, we are basing our model comparison on (43) rather than (42). Therefore, there is a systematic bias toward the simpler model whenever *KL*(*q*<sup>1</sup> ∥ *p*1) − *KL*(*q*<sup>0</sup> ∥ *p*0) is not zero.

Realizing that we have a variational approximation for the posterior distribution of *β*, we propose the following method to estimate *p*(**X**) based on our variational approximation *qβ*(*β*) in (27). First, write the marginal likelihood as

$$p(\mathbf{x}) = \int \left[ \frac{p(\mathbf{x}|\boldsymbol{\beta})\,\pi^{I}(\boldsymbol{\beta})}{q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})} \right] q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})\, d\boldsymbol{\beta},$$

we can interpret it as the conditional expectation

$$p(\mathbf{x}) = \mathbb{E}\left[\frac{p(\mathbf{x}|\boldsymbol{\beta})\,\pi^{I}(\boldsymbol{\beta})}{q\_{\boldsymbol{\beta}}(\boldsymbol{\beta})}\right],$$

with respect to *<sup>q</sup>β*(*β*). Next, draw samples *<sup>β</sup>*(1) , ..., *<sup>β</sup>*(*n*) from *<sup>q</sup>β*(*β*) and obtain the estimated marginal likelihood

$$\widehat{p(\mathbf{x})} = \frac{1}{n} \sum\_{i=1}^{n} \frac{p(\mathbf{x}|\boldsymbol{\beta}^{(i)})\,\pi^{I}(\boldsymbol{\beta}^{(i)})}{q\_{\boldsymbol{\beta}}(\boldsymbol{\beta}^{(i)})}.$$

Note that this proposed method is equivalent to importance sampling with the importance function *qβ*(*β*), whose exact form we know and from which generating the random draws *β*(*i*) is easy and inexpensive.
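The estimator can be written in a few lines. In this sketch (our naming, not the authors' code), a multivariate normal `N(mu_beta, Sigma_beta)` stands in for the intrinsic prior *π<sup>I</sup>*, and the log of the estimate is computed with the usual log-sum-exp stabilization.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_marginal_probit(X, y, mu_qb, Sigma_q, mu_beta, Sigma_beta,
                        n_draws=2000, seed=1):
    """Importance-sampling estimate of log p(x), with q_beta = N(mu_qb,
    Sigma_q) as the importance function. A multivariate normal stands in
    for the intrinsic prior pi^I (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    draws = rng.multivariate_normal(mu_qb, Sigma_q, size=n_draws)
    q = multivariate_normal(mu_qb, Sigma_q)
    prior = multivariate_normal(mu_beta, Sigma_beta)
    log_w = np.empty(n_draws)
    for i, b in enumerate(draws):
        eta = X @ b
        # probit log-likelihood: sum_i y*log Phi(x'b) + (1-y)*log Phi(-x'b)
        loglik = np.sum(y * norm.logcdf(eta) + (1 - y) * norm.logcdf(-eta))
        log_w[i] = loglik + prior.logpdf(b) - q.logpdf(b)
    m = log_w.max()                       # log-sum-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))
```

Because *qβ* closely tracks the posterior, the importance weights are well behaved and a few thousand draws usually suffice.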

#### **6. Modeling Probability of Default Using Lending Club Data**

#### *6.1. Introduction*

LendingClub (https://www.lendingclub.com/) is the world's largest peer-to-peer lending platform. LendingClub enables borrowers to create unsecured personal loans between \$1000 and \$40,000. The standard loan period is three or five years. Investors can search and browse the loan listings on the LendingClub website and select loans they want to invest in based on the information supplied about the borrower, the amount of the loan, the loan grade, and the loan purpose. Investors make money from interest. LendingClub makes money by charging borrowers an origination fee and investors a service fee. To attract lenders, LendingClub publishes most of the information available in borrowers' credit reports, as well as information reported by borrowers, for almost every loan issued through its website.

#### *6.2. Modeling Probability of Default—Target Variable and Predictive Features*

The publicly available LendingClub data, from June 2007 to Q4 2018, contain a total of 2,260,668 issued loans. Each loan has a status: Paid-off, Charged-off, or Ongoing. We only adopted loans with an end status, i.e., either paid-off or charged-off; this loan status is the target variable. We then selected the following loan features as our predictive covariates.


We took a sample from the original data set with customer yearly income between \$15,000 and \$60,000 and ended up with a data set of 520,947 rows.

#### *6.3. Addressing Uncertainty of Estimated Probit Model Using Variational Inference with Intrinsic Prior*

Using the process developed in Section 5, we can update the intrinsic prior for parameters (see Figure 1) of the probit model using variational inference, and get the posterior distribution for the estimated parameters. Based on the derived parameter distributions, questions of interest may be explored with model uncertainty being considered.

**Figure 1.** Intrinsic Prior.

Investors will be interested in understanding how each loan feature affects the probability of default, given a certain loan term, either 36 or 60 months. To answer this question, we sampled 6000 cases from the original data set and drew from the derived posterior distribution 100 times. We end up with 6000 × 100 calculated probabilities of default, where each one of the 6000 samples yields 100 different probit estimates based on the 100 posterior draws. We summarize some of our findings in Figure 2, where red represents 36-month loans and green represents 60-month loans.


• There is no clear pattern regarding income. This is probably because we only included customers with income between \$15,000 and \$60,000 in our training data, which may not represent the true income level of the whole population.

Model uncertainty can also be measured through credible intervals. Again, with the derived posterior distribution, the credible interval is just the range containing a particular percentage of the estimated effect/parameter values. For instance, the 95% credible interval of the estimated parameter value of FICO is simply the central portion of the posterior distribution that contains 95% of the estimated values. Contrary to frequentist confidence intervals, Bayesian credible intervals are much more straightforward to interpret. Using the Bayesian framework created in this article, from Figure 3, we can simply state that, given the observed data, the estimated effect of DTI on default has an 89% probability of falling within [8.300, 8.875]. Instead of the conventional 95%, we used 89% following suggestions in [28,29], which is just as arbitrary as any of the conventions.
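An equal-tailed 89% credible interval is just a pair of percentiles of the posterior draws. The sketch below uses synthetic stand-in draws (the mean 8.6 and spread 0.18 are made-up numbers chosen to resemble the DTI example; the interval [8.300, 8.875] in the text comes from the fitted model itself).

```python
import numpy as np

# Equal-tailed 89% credible interval from posterior draws.
rng = np.random.default_rng(42)
draws = rng.normal(loc=8.6, scale=0.18, size=100_000)  # hypothetical posterior draws

lo, hi = np.percentile(draws, [5.5, 94.5])        # central 89% of the mass
coverage = np.mean((draws >= lo) & (draws <= hi))  # fraction inside: ~0.89
```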

One of the main advantages of using variational inference over MCMC is that variational inference is much faster. Comparisons were made between the two approximation frameworks on a 64-bit Windows 10 laptop, with 32.0 GB RAM. Using the data set introduced in Section 6.2, we have that


**Figure 2.** Effect of term months and other covariates on probability of default

**Figure 3.** Credible intervals for estimated coefficients

#### *6.4. Model Comparison*

Following the procedure proposed in Section 5.3, we compare the following series of nested models. From the data set introduced in Section 6.2, 2000 records were sampled to estimate the likelihood *p*(**x**|*β*(*i*)), where *β*(*i*) is one of the 2500 draws sampled directly from the approximated posterior distribution *qβ*(*β*), which serves as the importance function used to estimate the marginal likelihood *p*(**x**).


The estimated log marginal likelihood for each model is plotted in Figure 4. We can see that the model evidence increases as the predictive features *Loan Amount* and *Annual Income* are added sequentially. However, if we further add home ownership information, i.e., *Mortgage Indicator*, as a predictive feature, the model evidence decreases. We have the Bayes factor

$$BF\_{45} = \frac{p(\mathbf{x}|M\_4)}{p(\mathbf{x}|M\_5)} = e^{-1014.78 - (-1016.42)} \approx 5.16,$$

which suggests substantial evidence for model *M*4, indicating that home ownership information may be irrelevant in predicting the probability of default given that all the other predictive features are included.
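The Bayes factor above is just the exponentiated difference of the two estimated log marginal likelihoods:

```python
import math

# Bayes factor BF_45 from the two estimated log marginal likelihoods
# reported in the text.
log_m4 = -1014.78   # estimated log p(x | M_4)
log_m5 = -1016.42   # estimated log p(x | M_5)
bf_45 = math.exp(log_m4 - log_m5)   # e^{1.64}, about 5.16
```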

**Figure 4.** Log marginal likelihood comparison

#### **7. Further Work**

The authors thank the reviewers for pointing out that mean-field variational Bayes underestimates the posterior variance. This could be an interesting topic for our future research. We plan to study the *linear response variational Bayes* (LRVB) method proposed in [30] to see if it can be applied on the framework we proposed in this article. To see if we can get the approximated posterior variance close enough to the true variance using our proposed method, comparisons should be made between normal conjugate prior with the MCMC procedure, normal conjugate prior with LRVB, and intrinsic prior with LRVB.

**Author Contributions:** Methodology, A.L., L.P. and K.W.; software, A.L.; writing–original draft preparation, A.L., L.P. and K.W.; writing–review and editing, A.L. and L.P.; visualization, A.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work of L.R.Pericchi was partially funded by NIH grants U54CA096300, P20GM103475 and R25MD010399.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Appendix A. Density Function**

Suppose *X* ∼ *N*(*μ*, *σ*2) has a normal distribution and lies within the interval (*a*, *b*), −∞ ≤ *a* < *b* ≤ ∞. Then *X* conditional on *a* < *X* < *b* has a truncated normal distribution. Its probability density function *f*, for *a* ≤ *x* ≤ *b*, is given by

$$f(x|\mu, \sigma, a, b) = \frac{\frac{1}{\sigma} \phi(\frac{x-\mu}{\sigma})}{\Phi(\frac{b-\mu}{\sigma}) - \Phi(\frac{a-\mu}{\sigma})}$$

and by *f* = 0 otherwise. Here

$$\phi(\xi) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\xi^2\right)$$

is the probability density function of the standard normal distribution, and Φ(·) is its cumulative distribution function. If *b* = ∞, then Φ((*b* − *μ*)/*σ*) = 1, and similarly, if *a* = −∞, then Φ((*a* − *μ*)/*σ*) = 0. The cumulative distribution function of the truncated normal distribution is

$$F(x|\mu, \sigma, a, b) = \frac{\Phi(\xi) - \Phi(\alpha)}{Z},$$

where *ξ* = (*x* − *μ*)/*σ*, *α* = (*a* − *μ*)/*σ*, *β* = (*b* − *μ*)/*σ*, and *Z* = Φ(*β*) − Φ(*α*).
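The density above can be checked numerically against SciPy's `truncnorm`, which is parameterized by the standardized bounds *α* = (*a* − *μ*)/*σ* and *β* = (*b* − *μ*)/*σ* (a quick verification sketch; the function name is ours).

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncnorm_pdf(x, mu, sigma, a, b):
    """Density f(x | mu, sigma, a, b) of the truncated normal, as above."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    Z = norm.cdf(beta) - norm.cdf(alpha)
    inside = (x >= a) & (x <= b)
    return np.where(inside, norm.pdf((x - mu) / sigma) / (sigma * Z), 0.0)
```

SciPy's `truncnorm.pdf(x, alpha, beta, loc=mu, scale=sigma)` returns the same values.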

#### **Appendix B. Moments and Entropy**

Let *α* = (*a* − *μ*)/*σ* and *β* = (*b* − *μ*)/*σ*. For two-sided truncation:

$$\begin{aligned} \mathbb{E}(X|a < X < b) &= \mu + \sigma \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}, \\ \mathrm{Var}(X|a < X < b) &= \sigma^2 \left[ 1 + \frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} - \left(\frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)}\right)^2 \right]. \end{aligned}$$

For one-sided truncation (upper tail, *X* > *a*):

$$\begin{aligned} \mathbb{E}(X|X>a) &= \mu + \sigma \lambda(\alpha), \\ \mathrm{Var}(X|X>a) &= \sigma^2 [1 - \delta(\alpha)], \end{aligned}$$

where *α* = (*a* − *μ*)/*σ*, *λ*(*α*) = *φ*(*α*)/[1 − Φ(*α*)], and *δ*(*α*) = *λ*(*α*)[*λ*(*α*) − *α*]. For one-sided truncation (lower tail, *X* < *b*):

$$\begin{aligned} \mathbb{E}(X|X<b) &= \mu - \sigma \frac{\phi(\beta)}{\Phi(\beta)}, \\ \mathrm{Var}(X|X<b) &= \sigma^2 \left[ 1 - \beta\frac{\phi(\beta)}{\Phi(\beta)} - \left(\frac{\phi(\beta)}{\Phi(\beta)}\right)^2 \right]. \end{aligned}$$
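These one-sided moments are easy to verify against SciPy's `truncnorm` (a check sketch; the function name is ours).

```python
import numpy as np
from scipy.stats import norm, truncnorm

def upper_tail_moments(mu, sigma, a):
    """Mean and variance of X ~ N(mu, sigma^2) conditional on X > a,
    using lambda(alpha) and delta(alpha) as defined above."""
    alpha = (a - mu) / sigma
    lam = norm.pdf(alpha) / (1.0 - norm.cdf(alpha))
    delta = lam * (lam - alpha)
    return mu + sigma * lam, sigma**2 * (1.0 - delta)
```

`truncnorm.stats((a - mu)/sigma, np.inf, loc=mu, scale=sigma, moments='mv')` gives the same pair.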

More generally, the moment generating function for truncated normal distribution is

$$e^{\mu t + \sigma^2 t^2 / 2} \cdot \left[ \frac{\Phi(\beta - \sigma t) - \Phi(\alpha - \sigma t)}{\Phi(\beta) - \Phi(\alpha)} \right].$$

For a density *f*(*x*) defined over a continuous variable, the *entropy* is given by

$$H[\mathbf{x}] = -\int f(\mathbf{x}) \log f(\mathbf{x}) d\mathbf{x}.$$

And the entropy for a truncated normal density is

$$
\log(\sqrt{2\pi e}\,\sigma Z) + \frac{\alpha\phi(\alpha) - \beta\phi(\beta)}{2Z}.
$$
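This entropy expression also matches SciPy's `truncnorm.entropy` under the standardized parameterization (a verification sketch; the function name is ours).

```python
import numpy as np
from scipy.stats import norm, truncnorm

def truncnorm_entropy(mu, sigma, a, b):
    """Entropy of N(mu, sigma^2) truncated to (a, b), from the closed form above."""
    alpha, beta = (a - mu) / sigma, (b - mu) / sigma
    Z = norm.cdf(beta) - norm.cdf(alpha)
    return (np.log(np.sqrt(2.0 * np.pi * np.e) * sigma * Z)
            + (alpha * norm.pdf(alpha) - beta * norm.pdf(beta)) / (2.0 * Z))
```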

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## **A New Multi-Attribute Emergency Decision-Making Algorithm Based on Intuitionistic Fuzzy Cross-Entropy and Comprehensive Grey Correlation Analysis**

#### **Ping Li 1, Ying Ji 1,\*, Zhong Wu <sup>1</sup> and Shao-Jian Qu <sup>2</sup>**


Received: 11 June 2020; Accepted: 9 July 2020; Published: 14 July 2020

**Abstract:** Intuitionistic fuzzy distance measurement is an effective method for studying multi-attribute emergency decision-making (MAEDM) problems. Unfortunately, the traditional intuitionistic fuzzy distance measurement method cannot accurately reflect the difference between membership and non-membership data, which easily causes information confusion. Therefore, starting from the intuitionistic fuzzy number (IFN), this paper constructs a decision-making model based on intuitionistic fuzzy cross-entropy and a comprehensive grey correlation analysis algorithm. For MAEDM problems with completely unknown and partially known attribute weights, this method establishes a grey correlation analysis algorithm based on the objective evaluation value and the subjective preference value of decision makers (DMs), which makes up for the information loss of traditional models and greatly improves the accuracy of MAEDM. Finally, taking the Wenchuan Earthquake of May 12, 2008 as a case study, this paper constructs and solves the ranking problem of shelters. Through sensitivity comparison analysis, the ranking of shelters remains stable as the grey resolution coefficient increases from 0.4 to 1.0. Compared to the traditional intuitionistic fuzzy distance, this method is shown to be more reliable.

**Keywords:** multi-attribute emergency decision-making; intuitionistic fuzzy cross-entropy; grey correlation analysis; earthquake shelters; attribute weights

#### **1. Introduction**

At present, earthquakes, fires, novel coronavirus infections, and other frequent disasters have caused great loss to human beings. Owing to the uncertainty and fuzziness of such emergency problems, it is difficult for decision makers (DMs) to determine alternatives with real numbers to make quick decisions. The accurate processing of information has become an unavoidable problem in the development of the emergency decision [1–3] field. Under this urgent demand, fuzzy set theory, which can deal well with the uncertainty of decision-making problems, came into being [4]. Fuzzy sets [5,6] use membership as a single scale to reflect the support and opposition of DMs to objective things. However, with the development of decision theory, it is difficult to accurately describe the uncertainty of objective things by fuzzy sets alone. Based on this, Atanassov, a Bulgarian professor, put forward the concept of the intuitionistic fuzzy set (IFS) in the 1980s [7,8]. He used membership degree and non-membership degree to express the support, opposition, and hesitation of decision information. Compared to the fuzzy set, the IFS can describe the natural attributes of objective things more accurately [9–11].

The IFS is a new mathematical tool for dealing with uncertain and complex information efficiently, and it is widely used in the field of multi-attribute decision-making (MADM) [12–14]. In recent years, scholars have made great progress in the research of intuitionistic fuzzy multi-attribute decision-making (IFMADM). The similarity measure is one of the most important decision-making tools in IFMADM. Xu et al. [15] systematically analyzed similarity measurement formulas based on geometric distance, set theory, and intuitionistic fuzzy matching degree. In order to improve the accuracy of similarity measurement for the IFS, Park et al. [16] and Hu et al. [17] applied similarity measures based on intuitionistic fuzzy entropy to the intuitionistic fuzzy number (IFN) and the interval IFN, respectively, and optimized the alternatives. The IFS can represent the uncertainty of decision information well, but there are some difficulties in data comparison. Score functions and precise functions are effective means for data comparison and ranking in IFMADM. Chen et al. [18] were the first to study the score function of the IFN. They used the difference between membership and non-membership in the IFN to construct a function to compare the ordering of IFNs, which is the basis of IFMADM. On the basis of the score function, Hong et al. [19] proposed an intuitionistic fuzzy precise function, which greatly improved the efficiency of decision-making. Classical multi-attribute methods have seen wide development and application in the intuitionistic fuzzy field. Table 1 summarizes some main methods of IFMADM.

**Table 1.** A brief overview of preprocessing methods in intuitionistic fuzzy multi-attribute decision-making (IFMADM).


Unfortunately, natural disasters, such as fires and floods, often lead to unexpected and disastrous consequences. A large number of emergency decision-making problems have evolved into MADM. Up to now, domestic and foreign scholars have conducted in-depth research in this field. Xu et al. [29] proposed a two-stage method to support the consensus-building process of large-scale MADM and applied it to earthquake shelter selection. Taking a fire and explosion accident as a case study, Xu et al. [30] defined a generalized asymmetric language D number and proposed the corresponding MADM fusion algorithm, which verified the effectiveness of the method. Li et al. [31] proposed a risk decision analysis method based on the TODIM (an acronym in Portuguese of interactive and MADM) method to solve the emergency evacuation problem of tourist attractions, in which the attribute values and the probabilities of state occurrence are in interval number format. This method solves this kind of emergency decision-making problem well and is shown to be more effective than the traditional method. Based on an example of ship collision, Xiong et al. [32] used two intelligent algorithms, the multi-attribute differential evolution algorithm and the non-dominated sorting genetic algorithm, to verify the feasibility and effectiveness of the model. From the prediction model of the triple exponential smoothing method, Wang et al. [33] proposed an MADM additive weighting method, weighted product method, and elimination selection transformation reality method to sort recycled electric vehicles, which provided an effective solution for managers and researchers in the electric vehicle industry and improved its efficiency. For the multi-attribute group decision-making problem of community sustainable development emergency response, Wu et al. [34] proposed a method based on subjective imprecise estimation of the reliability of binary language vocabulary, which greatly improved the efficiency of MADM. Karimi et al. [35] introduced the best and worst algorithm to solve the MADM problem in a fuzzy environment and applied this method to the evaluation of hospital maintenance, which proves its satisfactory performance. Based on the above analysis, the MADM method is widely used in the field of emergency decision-making and can handle uncertainty well in emergencies. Table 2 summarizes some applications of the MADM method in emergency situations.

**Table 2.** A brief literature list on the applications of multi-attribute decision-making (MADM) methods in emergency situations.


The above method is effective in solving the multi-attribute emergency decision-making (MAEDM) problem in a fuzzy environment. However, it has some limitations in the following aspects.


According to the above limitations, the motivation of this paper is summarized as follows:


Therefore, based on intuitionistic fuzzy theory and grey correlation analysis, this paper proposes a method to solve MAEDM using the intuitionistic fuzzy cross-entropy distance. First, the average information entropy of intuitionistic fuzzy sets is defined, and the measurement method for the intuitionistic fuzzy cross-entropy distance is given. On this basis, considering both unknown and known attribute weights, an optimization model incorporating the subjective preference of the DMs is established and solved. Secondly, the intuitionistic fuzzy decision matrix is obtained according to the objective attribute evaluations of the DMs. The intuitionistic fuzzy cross-entropy distance matrix is constructed by combining the objective evaluation values and subjective preference values of the alternatives. Then, the attribute weights are determined according to the adjusted intuitionistic fuzzy average information entropy. Using grey correlation analysis, the comprehensive grey relation coefficient of each alternative is obtained, and the ranking of alternatives is generated. Therefore, a new method is proposed to solve the MAEDM problem by using intuitionistic fuzzy cross-entropy and grey correlation analysis. The important contributions of this paper are mainly reflected in six aspects. (1) The intuitionistic fuzzy cross-entropy distance is defined. (2) A multi-attribute emergency decision with subjective preference is considered. (3) The uncertainty of attribute weights is discussed and solved by intuitionistic fuzzy information entropy. (4) The grey correlation analysis method is applied to MAEDM, which makes full use of decision-making information such as membership, non-membership, and hesitation. (5) A sensitivity analysis on the grey resolution coefficient is carried out to verify the reliability and stability of the decision results. (6) Compared to the traditional intuitionistic fuzzy distance, this method is shown to be more stable.

The remainder of this paper is organized as follows. Section 2 reviews some basic knowledge of intuitionistic fuzzy theory and introduces the concept of the intuitionistic fuzzy cross-entropy distance. In Section 3, a MAEDM model based on intuitionistic fuzzy cross-entropy and comprehensive grey correlation analysis is constructed. In Section 4, taking the ranking of earthquake shelters as an example, the practical application of this method is illustrated by comparison with the traditional intuitionistic fuzzy method. Lastly, Section 5 concludes the paper and discusses prospects for future research.

#### **2. Preliminaries**

This section first reviews some basic concepts and definitions of intuitionistic fuzzy theory.

As preference relations in fuzzy theory are often assigned using the complementary 0.1–0.9 scale, it is usually assumed that the distribution of levels between opposition and support is uniform and symmetric. However, in actual situations, some problems require a non-uniform, asymmetric distribution to evaluate variables, such as the rate of decline of marginal utility in economics. Fuzzy set theory is therefore widely used to handle this kind of asymmetric problem.

**Definition 1 [4].** *If the domain X is a non-empty set, a fuzzy set is defined as:*

$$A = \{ <\mathbf{x}, \mu\_A(\mathbf{x})> | \mathbf{x} \in X \} \tag{1}$$

*which is characterized by a membership function* μ*<sup>A</sup>* : *X* → [0, 1], *where* μ*A*(*x*)*denotes the degree of membership of the element x to the set A*.

Ordinary fuzzy sets can only represent membership function, which refers to the support degree of an alternative without non-membership degree information. Therefore, Atanassov [7,8] extended the fuzzy set to the IFS. It is shown as follows:

**Definition 2 [7].** *If the domain X is a non-empty set, then the intuitionistic fuzzy set A onX can be expressed as:*

$$A = \{ <\mathbf{x}, \mu\_A(\mathbf{x}), \nu\_A(\mathbf{x}) > |\mathbf{x} \in X \}\tag{2}$$

*where* μ*A*(*x*) *and* ν*A*(*x*) *are the membership degree and non-membership degree of the element x belonging to A in the domain X, respectively,*

$$\begin{aligned} \mu\_A: X &\to [0,1], x \in X \to \mu\_A(\mathfrak{x}) \in [0,1], \\ \nu\_A: X &\to [0,1], x \in X \to \nu\_A(\mathfrak{x}) \in [0,1]. \end{aligned}$$

*It satisfies* 0 ≤ μ*A*(*x*) + *vA*(*x*) ≤ 1; *let*

$$
\pi\_A = 1 - \mu\_A(\mathbf{x}) - \nu\_A(\mathbf{x}) \tag{3}
$$

*denote the degree of hesitation or uncertainty that element x in X belongs to IFS A, obviously for any x* ∈ *X, with the condition* 0 ≤ π*<sup>A</sup>* ≤ 1.

**Example 1.** *Take an example to illustrate the specific meaning of the IFS. Suppose there is an IFS A* = {< *x*, 0.7, 0.2 > |*x* ∈ *X*}*, which indicates that the membership degree of element x is 0.7, the non-membership degree is 0.2, and the hesitation degree is 0.1. If we use this set to represent a voting process with 10 participants, then 7 people support, 2 oppose, and 1 hesitates and remains neutral.*
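The IFN of Example 1 can be captured in a small data structure. The following Python sketch (the class name `IFN` is ours, not the paper's) validates the constraint 0 ≤ μ + ν ≤ 1 and derives the hesitation degree π of Equation (3):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IFN:
    """An intuitionistic fuzzy number (mu, nu) with 0 <= mu, nu and mu + nu <= 1."""
    mu: float  # membership degree (support)
    nu: float  # non-membership degree (opposition)

    def __post_init__(self):
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0
                and self.mu + self.nu <= 1.0):
            raise ValueError("IFN requires 0 <= mu, nu and mu + nu <= 1")

    @property
    def pi(self) -> float:
        """Hesitation degree, Equation (3): pi = 1 - mu - nu."""
        return 1.0 - self.mu - self.nu

# Example 1 as a vote of 10 people: 7 support, 2 oppose, 1 neutral.
vote = IFN(mu=0.7, nu=0.2)
```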

**Definition 3 [36].** *Let* α*<sup>A</sup>* = (μ*A*, *vA*) *and* α*<sup>B</sup>* = (μ*B*, *vB*) *be the two intuitionistic fuzzy numbers. Then, the normalized Hamming distance between* α*<sup>A</sup> and* α*<sup>B</sup> is defined as follows:*

$$d(\alpha\_{A\prime}, \alpha\_B) = \frac{1}{2} \left( \left| \mu\_A - \mu\_B \right| + \left| \nu\_A - \nu\_B \right| \right) \tag{4}$$

*where* μ*<sup>A</sup>* ∈ [0, 1]*, vA* ∈ [0, 1] *and* 0 ≤ μ*<sup>A</sup>* + *vA* ≤ 1*; meanwhile, all intuitionistic fuzzy numbers are expressed as* θ. *Obviously, the fuzzy number a*<sup>+</sup> = (1, 0) *is the maximum value in the fuzzy set, and a*<sup>−</sup> = (0, 1) *is the minimum value in the set.*

Geometric distance is not suitable for processing fuzzy decision information. According to the traditional distance model, Xu [15] proposed the distance measure formula of the intuitionistic fuzzy set:

**Definition 4.** *Suppose d is a mapping: d*: (φ(*x*))<sup>2</sup> <sup>→</sup> [0, 1]. *If there are intuitionistic fuzzy sets,*

$$\begin{array}{l} A = \{ < \mathfrak{x}, \mu\_{A}(\mathfrak{x}), \nu\_{A}(\mathfrak{x}) > |\mathfrak{x} \in X \} \\ B = \{ < \mathfrak{x}, \mu\_{B}(\mathfrak{x}), \nu\_{B}(\mathfrak{x}) > |\mathfrak{x} \in X \} \\ C = \{ < \mathfrak{x}, \mu\_{C}(\mathfrak{x}), \nu\_{C}(\mathfrak{x}) > |\mathfrak{x} \in X \} \end{array}$$

*then the distance measure between the IFSs is*

$$d\_{X\mu} = \left[\frac{1}{2n} \sum\_{j=1}^{n} \left( \begin{array}{c} \left| \mu\_{A}(\mathbf{x}\_{j}) - \mu\_{B}(\mathbf{x}\_{j}) \right|^{\lambda} + \left| \nu\_{A}(\mathbf{x}\_{j}) - \nu\_{B}(\mathbf{x}\_{j}) \right|^{\lambda} \\ + \left| \pi\_{A}(\mathbf{x}\_{j}) - \pi\_{B}(\mathbf{x}\_{j}) \right|^{\lambda} \end{array} \right) \right]^{\frac{1}{\lambda}} \tag{5}$$

*where* λ ≥ 1*. When* λ = 1*, dXu degenerates into the Hamming distance for IFSs:*

$$d\_{H} = \frac{1}{2n} \sum\_{j=1}^{n} \left( \left| \mu\_{A}(\mathbf{x}\_{j}) - \mu\_{B}(\mathbf{x}\_{j}) \right| + \left| \nu\_{A}(\mathbf{x}\_{j}) - \nu\_{B}(\mathbf{x}\_{j}) \right| + \left| \pi\_{A}(\mathbf{x}\_{j}) - \pi\_{B}(\mathbf{x}\_{j}) \right| \right) \tag{6}$$

*When* λ = 2*, dXu degenerates into Euclidean distance with IFS:*

$$d\_{E} = \left[ \frac{1}{2n} \sum\_{j=1}^{n} \left( \left| \mu\_{A}(\mathbf{x}\_{j}) - \mu\_{B}(\mathbf{x}\_{j}) \right|^{2} + \left| \nu\_{A}(\mathbf{x}\_{j}) - \nu\_{B}(\mathbf{x}\_{j}) \right|^{2} + \left| \pi\_{A}(\mathbf{x}\_{j}) - \pi\_{B}(\mathbf{x}\_{j}) \right|^{2} \right) \right]^{\frac{1}{2}} \tag{7}$$

*The Hamming and Euclidean distance formulas are thus special cases of the intuitionistic fuzzy distance (5).*
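As a minimal sketch, the Hamming and Euclidean IFS distances can be implemented directly from Equation (5) with λ = 1 and λ = 2. IFSs are represented here as lists of (μ, ν) pairs, with π derived; the function names are ours:

```python
# IFSs as lists of (mu, nu) pairs; the hesitation degree pi = 1 - mu - nu is derived.

def _pi(a):
    return 1.0 - a[0] - a[1]

def hamming(A, B):
    """Equation (6): normalized intuitionistic fuzzy Hamming distance (lambda = 1)."""
    return sum(abs(a[0] - b[0]) + abs(a[1] - b[1]) + abs(_pi(a) - _pi(b))
               for a, b in zip(A, B)) / (2 * len(A))

def euclidean(A, B):
    """Equation (7): normalized intuitionistic fuzzy Euclidean distance (lambda = 2)."""
    s = sum((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 + (_pi(a) - _pi(b)) ** 2
            for a, b in zip(A, B))
    return (s / (2 * len(A))) ** 0.5

# Two single-element IFSs for illustration:
a1, a3 = [(0.6, 0.3)], [(0.4, 0.2)]
```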

Considering the attribute weight vector ω = (ω1, ω2, ... , ω*n*)*<sup>T</sup>* of *xj*(*j* = 1, 2, ... *n*), satisfying 0 ≤ ω*<sup>j</sup>* ≤ 1 and $\sum\_{j=1}^{n} \omega\_j = 1$, the two distance formulas *dH* and *dE* can be expressed as:

$$d\_{H\omega} = \frac{1}{2} \sum\_{j=1}^{n} \omega\_{j} \left( \left| \mu\_{A}(\mathbf{x}\_{j}) - \mu\_{B}(\mathbf{x}\_{j}) \right| + \left| \nu\_{A}(\mathbf{x}\_{j}) - \nu\_{B}(\mathbf{x}\_{j}) \right| + \left| \pi\_{A}(\mathbf{x}\_{j}) - \pi\_{B}(\mathbf{x}\_{j}) \right| \right) \tag{8}$$

$$d\_{E\omega} = \left[ \frac{1}{2} \sum\_{j=1}^{n} \omega\_{j} \left( \left| \mu\_{A}(\mathbf{x}\_{j}) - \mu\_{B}(\mathbf{x}\_{j}) \right|^{2} + \left| \nu\_{A}(\mathbf{x}\_{j}) - \nu\_{B}(\mathbf{x}\_{j}) \right|^{2} + \left| \pi\_{A}(\mathbf{x}\_{j}) - \pi\_{B}(\mathbf{x}\_{j}) \right|^{2} \right) \right]^{\frac{1}{2}} \tag{9}$$

It is not difficult to see from the formula that all intuitionistic fuzzy distances satisfy the following properties:

(1) 0 ≤ *d*(*A*, *B*) ≤ 1;
(2) *d*(*A*, *B*) = 0 when *A* = *B*;
(3) *d*(*A*, *B*) = *d*(*B*, *A*);
(4) If *A* ⊆ *B* ⊆ *C*, then *d*(*A*, *B*) ≤ *d*(*A*,*C*) and *d*(*B*,*C*) ≤ *d*(*A*,*C*).

In order to define the concept of intuitionistic fuzzy cross-entropy, the definition of information entropy is introduced first. The average amount of information remaining after redundancy is eliminated is called information entropy, which measures the uncertainty of an information source in the communication process.

**Definition 5.** *A discrete random variable X* = {*x*1, *x*2, ... , *xn*} *with probability distribution P* = (*p*1, *p*2, ... , *pn*)*, where* 0 ≤ *pj* ≤ 1 *and* $\sum\_{j=1}^{n} p\_j = 1$*, can be represented as* $I = \begin{pmatrix} x\_1, & x\_2, & \cdots, & x\_n \\ p\_1, & p\_2, & \cdots, & p\_n \end{pmatrix}$*; then, the information entropy of I can be expressed as*

$$I = -\eta \sum\_{j=1}^{n} p\_j \log\_c p\_j \tag{10}$$

The constant η determines the unit of measurement of the information entropy and is greater than 0, and the base *c* of the logarithmic function can be any constant greater than 1. In particular, when *c* = 2, the unit of information entropy is the bit; when *c* = *e*, the unit is the nat; when *c* = 10, the unit is the dit. In general calculations, η = 1 and *c* = 2.
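A short illustration of Equation (10) and its units (the function name `info_entropy` is ours):

```python
import math

def info_entropy(p, eta=1.0, c=2):
    """Equation (10): I = -eta * sum_j p_j * log_c(p_j), with 0*log(0) taken as 0."""
    return -eta * sum(pj * math.log(pj, c) for pj in p if pj > 0)

fair = info_entropy([0.5, 0.5])            # a fair coin: 1 bit (c = 2)
nats = info_entropy([0.5, 0.5], c=math.e)  # the same source in nats: ln 2
```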

Burillo et al. [37] extended the basic idea of information entropy to the intuitionistic fuzzy setting and used it to describe the uncertainty of the IFS.

**Definition 6.** *Let X* = {*x*1, *x*2, ... *xn*} *be a domain and A* = {< *x*, μ*A*(*x*), *vA*(*x*) > |*x* ∈ *X*} *be an IFS on X. The intuitionistic fuzzy entropy of A can be expressed as:*

$$E\_{LH}(A) = \frac{1}{n} \sum\_{i=1}^{n} \frac{1 - \left| \mu\_A(\mathbf{x}\_i) - \nu\_A(\mathbf{x}\_i) \right| + \pi\_A(\mathbf{x}\_i)}{1 + \left| \mu\_A(\mathbf{x}\_i) - \nu\_A(\mathbf{x}\_i) \right| + \pi\_A(\mathbf{x}\_i)} \tag{11}$$

**Definition 7.** *Another equivalent transformation of intuitionistic fuzzy entropy ELH is:*

$$E(A) = \frac{1}{n} \sum\_{i=1}^{n} \frac{1 - \max(\mu\_A(\mathbf{x}\_i), \nu\_A(\mathbf{x}\_i))}{1 - \min(\mu\_A(\mathbf{x}\_i), \nu\_A(\mathbf{x}\_i))}. \tag{12}$$

**Proof.** Model (11) and model (12) are equivalent.

$$\begin{aligned}
E\_{LH}(A) &= \frac{1}{n} \sum\_{i=1}^{n} \frac{1-\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|+\pi\_A(x\_i)}{1+\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|+\pi\_A(x\_i)} \\
&= \frac{1}{n} \sum\_{i=1}^{n} \frac{1-\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|+1-\mu\_A(x\_i)-\nu\_A(x\_i)}{1+\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|+1-\mu\_A(x\_i)-\nu\_A(x\_i)} \\
&= \frac{1}{n} \sum\_{i=1}^{n} \frac{2-\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|-(\mu\_A(x\_i)+\nu\_A(x\_i))}{2+\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|-(\mu\_A(x\_i)+\nu\_A(x\_i))} \\
&= \frac{1}{n} \sum\_{i=1}^{n} \frac{1-\frac{1}{2}\left(\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|+\mu\_A(x\_i)+\nu\_A(x\_i)\right)}{1-\frac{1}{2}\left(\mu\_A(x\_i)+\nu\_A(x\_i)-\left|\mu\_A(x\_i)-\nu\_A(x\_i)\right|\right)} \\
&= \frac{1}{n} \sum\_{i=1}^{n} \frac{1-\max(\mu\_A(x\_i),\nu\_A(x\_i))}{1-\min(\mu\_A(x\_i),\nu\_A(x\_i))} = E(A). \qquad \square
\end{aligned}$$
Definition 7 is more concise in form and simpler in calculation. It eliminates the influence of hesitation and is a better expression of intuitionistic fuzzy entropy.
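The equivalence of Equations (11) and (12), with max/min taken over the pair (μ*A*(*xi*), ν*A*(*xi*)) as the derivation shows, can be checked numerically. This sketch uses arbitrary sample IFNs:

```python
def e_lh(ifns):
    """Equation (11): intuitionistic fuzzy entropy via |mu - nu| and pi."""
    total = 0.0
    for mu, nu in ifns:
        pi = 1.0 - mu - nu
        total += (1 - abs(mu - nu) + pi) / (1 + abs(mu - nu) + pi)
    return total / len(ifns)

def e_minmax(ifns):
    """Equation (12): the equivalent form via max/min over the pair (mu, nu)."""
    return sum((1 - max(mu, nu)) / (1 - min(mu, nu)) for mu, nu in ifns) / len(ifns)

A = [(0.7, 0.2), (0.5, 0.4), (0.3, 0.3)]  # arbitrary sample IFNs
```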

For the MAEDM problem discussed in this paper, when the attribute weights are completely unknown, it is necessary to calculate the average information entropy of each attribute. Combining this with intuitionistic fuzzy entropy, the intuitionistic fuzzy cross-entropy distance is defined as:

**Definition 8.** *Suppose there is a domain X* = {*x*1, *x*2, ... , *xn*}*, where A and B are two IFSs on X*,

$$\begin{array}{l} A = \left\{ <x\_j, \mu\_A(x\_j), \nu\_A(x\_j)> \left| x\_j \in X \right. \right\} \\ B = \left\{ <x\_j, \mu\_B(x\_j), \nu\_B(x\_j)> \left| x\_j \in X \right. \right\} \end{array}$$

*then, the intuitionistic fuzzy cross-entropy distance formula of A and B is* [38]:

$$\begin{split} \mathsf{CE}(A,B) &= \quad \sum\_{j=1}^{n} \left\{ \frac{1+\mu\_{A}(\mathbf{x}\_{j})-\nu\_{A}(\mathbf{x}\_{j})}{2} \times \\ & \quad \log\_{2} \frac{1+\mu\_{A}(\mathbf{x}\_{j})-\nu\_{A}(\mathbf{x}\_{j})}{1/2\left[1+\mu\_{A}(\mathbf{x}\_{j})-\nu\_{A}(\mathbf{x}\_{j})+1+\mu\_{B}(\mathbf{x}\_{j})-\nu\_{B}(\mathbf{x}\_{j})\right]} \right\} \\ & \quad + \sum\_{j=1}^{n} \left\{ \frac{1-\mu\_{A}(\mathbf{x}\_{j})+\nu\_{A}(\mathbf{x}\_{j})}{2} \times \\ & \quad \log\_{2} \frac{1-\mu\_{A}(\mathbf{x}\_{j})+\nu\_{A}(\mathbf{x}\_{j})}{1/2\left[1-\mu\_{A}(\mathbf{x}\_{j})+\nu\_{A}(\mathbf{x}\_{j})+1-\mu\_{B}(\mathbf{x}\_{j})+\nu\_{B}(\mathbf{x}\_{j})\right]} \right\} \end{split} \tag{13}$$

*As the intuitionistic fuzzy cross-entropy CE*(*A*, *B*) *does not satisfy symmetry, considering the characteristics of emergency decision-making problems, let*

$$\text{CE}^\*(A, B) = \text{CE}(A, B) + \text{CE}(B, A) \tag{14}$$

*which defines the symmetric intuitionistic fuzzy cross-entropy distance adapted to the multi-attribute setting.*
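A sketch of the cross-entropy CE of Equation (13) and its symmetrized distance CE∗ of Equation (14); the convention 0·log 0 = 0 is used for the degenerate boundary cases, and the function names are ours:

```python
import math

def ce(A, B):
    """Equation (13): intuitionistic fuzzy cross-entropy of IFSs A, B (lists of (mu, nu))."""
    total = 0.0
    for (mu_a, nu_a), (mu_b, nu_b) in zip(A, B):
        p, q = 1 + mu_a - nu_a, 1 + mu_b - nu_b  # "support" components of A and B
        r, s = 1 - mu_a + nu_a, 1 - mu_b + nu_b  # "opposition" components
        if p > 0:                                # 0 * log(0) taken as 0
            total += (p / 2) * math.log2(p / ((p + q) / 2))
        if r > 0:
            total += (r / 2) * math.log2(r / ((r + s) / 2))
    return total

def ce_star(A, B):
    """Equation (14): symmetrized distance CE*(A, B) = CE(A, B) + CE(B, A)."""
    return ce(A, B) + ce(B, A)

A, B = [(0.6, 0.3)], [(0.4, 0.2)]
```

The tests below confirm the properties of Theorem 1 numerically for this pair: non-negativity, zero on identical sets, and symmetry of CE∗.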

**Theorem 1.** *Referring to the properties of the intuitionistic fuzzy geometric distance formula, the intuitionistic fuzzy cross-entropy satisfies the following properties:*

$$\begin{aligned} \text{(1) } &0 \le \text{CE}^\*(A,B);\\ \text{(2) If } A = B, \text{CE}^\*(A,B) = 0;\\ \text{(3) If } A \subseteq B \subseteq \mathbb{C} \text{, then } \text{CE}^\*(A,B) \le \text{CE}^\*(A,\mathbb{C}) \text{ and } \text{CE}^\*(B,\mathbb{C}) \le \text{CE}^\*(A,\mathbb{C}). \end{aligned}$$

**Proof.** As

$$\begin{array}{lcl} CE(A,B) &=& \sum\_{j=1}^{n} \left\{ \frac{1+\mu\_{A}(\mathbf{x}\_{j})-\nu\_{A}(\mathbf{x}\_{j})}{2} \times \\ & \quad \log\_{2} \frac{1+\mu\_{A}(\mathbf{x}\_{j})-\nu\_{A}(\mathbf{x}\_{j})}{1/2\left[1+\mu\_{A}(\mathbf{x}\_{j})-\nu\_{A}(\mathbf{x}\_{j})+1+\mu\_{B}(\mathbf{x}\_{j})-\nu\_{B}(\mathbf{x}\_{j})\right]} \right\} \\ & \quad + \sum\_{j=1}^{n} \left\{ \frac{1-\mu\_{A}(\mathbf{x}\_{j})+\nu\_{A}(\mathbf{x}\_{j})}{2} \times \\ & \quad \log\_{2} \frac{1-\mu\_{A}(\mathbf{x}\_{j})+\nu\_{A}(\mathbf{x}\_{j})}{1/2\left[1-\mu\_{A}(\mathbf{x}\_{j})+\nu\_{A}(\mathbf{x}\_{j})+1-\mu\_{B}(\mathbf{x}\_{j})+\nu\_{B}(\mathbf{x}\_{j})\right]} \right\} \end{array}$$

and model (13) has been given, the following exists

$$\begin{aligned} -CE(A,B) &= -\sum\_{j=1}^{n} \left\{ \frac{1+\mu\_A(x\_j)-\nu\_A(x\_j)}{2} \times \log\_2 \frac{1+\mu\_A(x\_j)-\nu\_A(x\_j)}{1/2\left[1+\mu\_A(x\_j)-\nu\_A(x\_j)+1+\mu\_B(x\_j)-\nu\_B(x\_j)\right]} \right\} \\ &\quad - \sum\_{j=1}^{n} \left\{ \frac{1-\mu\_A(x\_j)+\nu\_A(x\_j)}{2} \times \log\_2 \frac{1-\mu\_A(x\_j)+\nu\_A(x\_j)}{1/2\left[1-\mu\_A(x\_j)+\nu\_A(x\_j)+1-\mu\_B(x\_j)+\nu\_B(x\_j)\right]} \right\} \\ &= \sum\_{j=1}^{n} \left\{ \frac{1+\mu\_A(x\_j)-\nu\_A(x\_j)}{2} \times \log\_2 \frac{1/2\left[1+\mu\_A(x\_j)-\nu\_A(x\_j)+1+\mu\_B(x\_j)-\nu\_B(x\_j)\right]}{1+\mu\_A(x\_j)-\nu\_A(x\_j)} \right\} \\ &\quad + \sum\_{j=1}^{n} \left\{ \frac{1-\mu\_A(x\_j)+\nu\_A(x\_j)}{2} \times \log\_2 \frac{1/2\left[1-\mu\_A(x\_j)+\nu\_A(x\_j)+1-\mu\_B(x\_j)+\nu\_B(x\_j)\right]}{1-\mu\_A(x\_j)+\nu\_A(x\_j)} \right\} \end{aligned}$$

As the logarithmic function is strictly concave, Jensen's inequality gives, for non-negative weights summing to one,

$$a\_1 f(x\_1) + a\_2 f(x\_2) + \dots + a\_n f(x\_n) \le f(a\_1 x\_1 + a\_2 x\_2 + \dots + a\_n x\_n) \tag{15}$$

therefore, we can obtain the following expression,

$$\begin{aligned} -CE(A,B) &\le \sum\_{j=1}^{n} \log\_2 \left\{ \frac{1+\mu\_A(x\_j)-\nu\_A(x\_j)}{2} \times \frac{1/2\left[1+\mu\_A(x\_j)-\nu\_A(x\_j)+1+\mu\_B(x\_j)-\nu\_B(x\_j)\right]}{1+\mu\_A(x\_j)-\nu\_A(x\_j)} \right. \\ &\qquad \left. + \frac{1-\mu\_A(x\_j)+\nu\_A(x\_j)}{2} \times \frac{1/2\left[1-\mu\_A(x\_j)+\nu\_A(x\_j)+1-\mu\_B(x\_j)+\nu\_B(x\_j)\right]}{1-\mu\_A(x\_j)+\nu\_A(x\_j)} \right\} \\ &= \sum\_{j=1}^{n} \log\_2 \left\{ \left[ \left(1+\mu\_A(x\_j)-\nu\_A(x\_j)\right) + \left(1+\mu\_B(x\_j)-\nu\_B(x\_j)\right) + \left(1-\mu\_A(x\_j)+\nu\_A(x\_j)\right) + \left(1-\mu\_B(x\_j)+\nu\_B(x\_j)\right) \right] / 4 \right\} = 0. \end{aligned}$$

Through the above proof, clearly *CE*(*A*, *B*) ≥ 0, and by the same argument *CE*(*B*, *A*) ≥ 0. According to models (13) and (14), it follows that *CE*∗(*A*, *B*) ≥ 0. □

**Proof.** When *A* = *B*, we have μ*A*(*xj*) = μ*B*(*xj*) and *vA*(*xj*) = *vB*(*xj*). Substituting into model (13), we obtain *CE*(*A*, *B*) = 0 and *CE*(*B*, *A*) = 0. Then, combining model (14), we can prove that *CE*∗(*A*, *B*) = 0. □

**Proof.** From the geometric intuitionistic fuzzy distance formula, it is not difficult to see that the intuitionistic fuzzy cross-entropy is positively correlated with distance. Assume *A* ⊆ *B* ⊆ *C*; then μ*A*(*xi*) ≤ μ*B*(*xi*) ≤ μ*C*(*xi*) and *vA*(*xi*) ≥ *vB*(*xi*) ≥ *vC*(*xi*). The following conclusion can be drawn: μ*A*(*xi*) − ν*A*(*xi*) ≤ μ*B*(*xi*) − ν*B*(*xi*) ≤ μ*C*(*xi*) − ν*C*(*xi*). For convenience, μ*A*(*xi*) − *vA*(*xi*), μ*B*(*xi*) − *vB*(*xi*), and μ*C*(*xi*) − *vC*(*xi*) are denoted a, b, and c, respectively, satisfying −1 ≤ *a* ≤ *b* ≤ *c* ≤ 1. The two intuitionistic fuzzy cross-entropies can be compared by subtraction. Δ*CE*<sup>∗</sup> = *CE*∗(*A*,*C*) − *CE*∗(*A*, *B*) can be written as:

$$\begin{aligned} \Delta CE^\* &= \frac{1+a}{2}\log\_2 \frac{1+a}{1/2[1+a+1+c]} + \frac{1-a}{2}\log\_2 \frac{1-a}{1/2[1-a+1-c]} \\ &\quad + \frac{1+c}{2}\log\_2 \frac{1+c}{1/2[1+c+1+a]} + \frac{1-c}{2}\log\_2 \frac{1-c}{1/2[1-c+1-a]} \\ &\quad - \frac{1+a}{2}\log\_2 \frac{1+a}{1/2[1+a+1+b]} - \frac{1-a}{2}\log\_2 \frac{1-a}{1/2[1-a+1-b]} \\ &\quad - \frac{1+b}{2}\log\_2 \frac{1+b}{1/2[1+b+1+a]} - \frac{1-b}{2}\log\_2 \frac{1-b}{1/2[1-b+1-a]} \end{aligned}$$

thus,

$$\begin{array}{rcl} -\Delta CE^\* &=& \frac{1+a}{2}\log\_2 \frac{1/2[1+a+1+c]}{1+a} + \frac{1-a}{2}\log\_2 \frac{1/2[1-a+1-c]}{1-a} \\ &+\frac{1+c}{2}\log\_2 \frac{1/2[1+c+1+a]}{1+c} + \frac{1-c}{2}\log\_2 \frac{1/2[1-c+1-a]}{1-c} \\ &-\frac{1+a}{2}\log\_2 \frac{1/2[1+a+1+b]}{1+a} - \frac{1-a}{2}\log\_2 \frac{1/2[1-a+1-b]}{1-a} \\ &-\frac{1+b}{2}\log\_2 \frac{1/2[1+b+1+a]}{1+b} - \frac{1-b}{2}\log\_2 \frac{1/2[1-b+1-a]}{1-b} \end{array}$$

As the terms of −Δ*CE*<sup>∗</sup> again involve the strictly concave logarithm, property (15) applies, and it satisfies

$$-\Delta CE^\* \le \log\_2 \left\{ \begin{array}{l} \frac{1+a}{2} \times \frac{1/2(1+a+1+c)}{1+a} + \frac{1-a}{2} \times \frac{1/2(1-a+1-c)}{1-a} \\ + \frac{1+c}{2} \times \frac{1/2(1+c+1+a)}{1+c} + \frac{1-c}{2} \times \frac{1/2(1-c+1-a)}{1-c} \\ - \frac{1+a}{2} \times \frac{1/2(1+a+1+b)}{1+a} - \frac{1-a}{2} \times \frac{1/2(1-a+1-b)}{1-a} \\ - \frac{1+b}{2} \times \frac{1/2(1+b+1+a)}{1+b} - \frac{1-b}{2} \times \frac{1/2(1-b+1-a)}{1-b} \end{array} \right\} = 0$$

Obviously, −Δ*CE*<sup>∗</sup> ≤ 0, that is, Δ*CE*<sup>∗</sup> ≥ 0, so *CE*∗(*A*, *B*) ≤ *CE*∗(*A*,*C*). By the same reasoning, *CE*∗(*A*,*C*) − *CE*∗(*B*,*C*) ≥ 0; thus, *CE*∗(*B*,*C*) ≤ *CE*∗(*A*,*C*). □

It can be seen from property (1) that the fuzzy entropy distance is non-negative. Property (2) means that when two IFSs are completely equal, the minimum intuitionistic fuzzy cross-entropy distance is equal to 0; thus, cross-entropy can be used to measure the difference degree or distance between two IFSs. Property (3) provides a sufficient basis for the comparison of intuitionistic fuzzy cross-entropy distance. Intuitionistic fuzzy cross-entropy extends the meaning of information entropy, which can be used to measure the fuzzy degree and unknown degree between IFSs on the basis of preserving the complete information of the original IFS. The greater the distance between two IFSs, the greater the cross-entropy of the fuzzy numbers. However, the traditional intuitionistic fuzzy distance measurement method cannot accurately reflect the differences between the data.

Based on this, a simple set of data can be used to compare the traditional intuitionistic fuzzy distances with the fuzzy cross-entropy distance, illustrating the reliability and stability of cross-entropy as a measure of the degree of fuzziness.

**Example 2.** *Suppose that there are three voting activities, each with 10 voters. The votes can be represented by three intuitionistic fuzzy numbers:* α<sup>1</sup> = (0.6, 0.3)*,* α<sup>2</sup> = (0.5, 0.4)*,* α<sup>3</sup> = (0.4, 0.2). *First, using the traditional Hamming and Euclidean distance models (6) and (7), respectively, we obtain dH*(α1, α3) = *dH*(α2, α3) = 0.3 *and dE*(α1, α3) = *dE*(α2, α3) = 0.2646. *Clearly, the two traditional distance formulas cannot distinguish the distance between fuzzy numbers* α<sup>1</sup> *and* α3 *from that between* α<sup>2</sup> *and* α3, *which is the disadvantage of the classical intuitionistic fuzzy distance measurement method. This is resolved by the intuitionistic fuzzy cross-entropy distance method: CE*∗(α1, α3) = 0.0037 *and CE*∗(α2, α3) = 0.0101.

The results show that α<sup>1</sup> is closer to α<sup>3</sup> than α<sup>2</sup> is, a distinction the traditional intuitionistic fuzzy distances cannot make. Therefore, introducing intuitionistic fuzzy cross-entropy is more effective for handling uncertain decision information.
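Example 2 can be checked numerically with a compact sketch (the printed CE∗ values may reflect different rounding or implementation details; the point verified here is that the Hamming distances tie while CE∗ separates the two pairs):

```python
import math

def d_hamming(a, b):
    """Equation (6) for single IFNs a = (mu, nu), b = (mu, nu)."""
    (m1, n1), (m2, n2) = a, b
    p1, p2 = 1 - m1 - n1, 1 - m2 - n2  # hesitation degrees
    return (abs(m1 - m2) + abs(n1 - n2) + abs(p1 - p2)) / 2

def ce(a, b):
    """Equation (13) for single IFNs (support/opposition components strictly positive here)."""
    (m1, n1), (m2, n2) = a, b
    p, q = 1 + m1 - n1, 1 + m2 - n2
    r, s = 1 - m1 + n1, 1 - m2 + n2
    return (p / 2) * math.log2(p / ((p + q) / 2)) + (r / 2) * math.log2(r / ((r + s) / 2))

def ce_star(a, b):
    """Equation (14): CE*(a, b) = CE(a, b) + CE(b, a)."""
    return ce(a, b) + ce(b, a)

a1, a2, a3 = (0.6, 0.3), (0.5, 0.4), (0.4, 0.2)
```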

#### **3. A Multi-Attribute Emergency Decision Model Based on Intuitionistic Fuzzy Cross-Entropy and Grey Correlation Analysis**

This section analyzes the IFMAEDM problem in which DMs have a certain subjective preference for alternatives.

#### *3.1. Problem Description*

Taking the Wenchuan earthquake of May 12, 2008 as a study case, the government needs to build a batch of temporary shelters to rescue the victims in the disaster area. Considering the impact of earthquakes, the government has a certain priority (subjective preference) for the construction of regional shelters. After determining the geographical location, disaster risk, rescue facilities, and feasibility, rescues in the disaster-affected areas began in an orderly manner. The whole decision-making process aims to find the optimal solution through intuitionistic fuzzy cross-entropy and grey correlation analysis, determining the area where a shelter is built first. It can be abstracted as follows: the decision maker (government) assigns an IFN representing the attribute value (agree, disagree, neutral), (μ*ij*, ν*ij*), to each alternative (disaster-affected area) *Ai*(*i* = 1, 2, ... *m*) according to the objective evaluation attributes (specific factors of the disaster situation) *Cj*(*j* = 1, 2, ... *n*); this denotes that the decision maker's approval degree is μ*ij*, objection degree is ν*ij*, and neutrality degree is π*ij* = 1 − μ*ij* − ν*ij* for alternative *Ai* under attribute *Cj*. The attribute weight is denoted ω*j* and satisfies 0 ≤ ω*j*(*j* = 1, 2, ... *n*) ≤ 1 and $\sum\_{j=1}^{n} \omega\_j = 1$. The IFN satisfies 0 ≤ μ*ij*, ν*ij*, π*ij* ≤ 1. Using these fuzzy numbers, the multi-attribute intuitionistic fuzzy decision matrix *Rmn* is constructed as shown in Table 3:

**Table 3.** Intuitionistic fuzzy decision matrix.

| **Alternative** | *C*1 | *C*2 | ... | *Cn* |
|---|---|---|---|---|
| *A*1 | (μ11, ν11) | (μ12, ν12) | ... | (μ1*n*, ν1*n*) |
| *A*2 | (μ21, ν21) | (μ22, ν22) | ... | (μ2*n*, ν2*n*) |
| ... | ... | ... | ... | ... |
| *Am* | (μ*m*1, ν*m*1) | (μ*m*2, ν*m*2) | ... | (μ*mn*, ν*mn*) |

Analyzing the Wenchuan earthquake, DMs have a certain subjective preference for alternatives, which needs to account for the severity of the disaster in each area. The preference value is also an IFN, *ci* = (σ*i*, δ*i*)(*i* = 1, 2, ... *m*). In the following, the method of intuitionistic fuzzy cross-entropy and grey correlation analysis is used to build and solve the optimal decision model.

#### *3.2. Steps of Intuitionistic Fuzzy Cross-Entropy and Grey Correlation Analysis Algorithm*

For the uncertain MAEDM problem with certain subjective preference, taking the Wenchuan earthquake shelter ranking problem for analysis, the comprehensive algorithm of intuitionistic fuzzy cross-entropy and grey correlation analysis is used to solve it. The specific steps are as follows (see Figure 1 for the flow framework):

**Figure 1.** Algorithm framework of intuitionistic fuzzy cross-entropy and grey correlation analysis.

**Step 1.** According to the data given in the background of the Wenchuan earthquake case, alternative *Ai*, objective evaluation attribute value *Cj*, decision maker's subjective preference value *ci*, and intuitionistic fuzzy evaluation decision matrix *Rmn* are determined.

**Step 2.** The intuitionistic fuzzy cross-entropy distance is used to compute the grey correlation coefficient between the objective evaluation values of the alternatives and the subjective preference values of the DMs:

$$\theta\_{ij} = \frac{\min\_{i} \min\_{j} CE^\*\_{ij} + \xi \max\_{i} \max\_{j} CE^\*\_{ij}}{CE^\*\_{ij} + \xi \max\_{i} \max\_{j} CE^\*\_{ij}}, \tag{16}$$

ξ is called the grey resolution coefficient, and the value range is 0 ≤ ξ ≤ 1, which is often set as ξ = 0.5. It satisfies 0 ≤ θ*ij*(*i* = 1, 2, ... *m*; *j* = 1, 2, ... *n*) ≤ 1. The larger the grey correlation coefficient θ*ij*, the closer the objective evaluation value and subjective preference value. In model (16), *CE*<sup>∗</sup> *ij* is the intuitionistic fuzzy cross-entropy distance, and the specific formula is as follows:

$$\begin{aligned} CE\_{ij}^{\*} &= \frac{1+\mu\_{ij}-\nu\_{ij}}{2} \times \log\_2 \frac{1+\mu\_{ij}-\nu\_{ij}}{1/2\left[1+\mu\_{ij}-\nu\_{ij}+1+\sigma\_i-\delta\_i\right]} \\ &\quad + \frac{1-\mu\_{ij}+\nu\_{ij}}{2} \times \log\_2 \frac{1-\mu\_{ij}+\nu\_{ij}}{1/2\left[1-\mu\_{ij}+\nu\_{ij}+1-\sigma\_i+\delta\_i\right]} \\ &\quad + \frac{1+\sigma\_i-\delta\_i}{2} \times \log\_2 \frac{1+\sigma\_i-\delta\_i}{1/2\left[1+\sigma\_i-\delta\_i+1+\mu\_{ij}-\nu\_{ij}\right]} \\ &\quad + \frac{1-\sigma\_i+\delta\_i}{2} \times \log\_2 \frac{1-\sigma\_i+\delta\_i}{1/2\left[1-\sigma\_i+\delta\_i+1-\mu\_{ij}+\nu\_{ij}\right]} \end{aligned} \tag{17}$$
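Model (16) can be sketched as follows; the CE∗ matrix below is hypothetical, purely for illustration, and the function name is ours:

```python
def grey_correlation(ce_matrix, xi=0.5):
    """Model (16): grey correlation coefficients theta_ij from a CE* distance matrix.

    xi is the grey resolution coefficient, 0 <= xi <= 1 (commonly 0.5)."""
    flat = [v for row in ce_matrix for v in row]
    lo, hi = min(flat), max(flat)
    return [[(lo + xi * hi) / (v + xi * hi) for v in row] for row in ce_matrix]

# Hypothetical 2 x 2 cross-entropy distance matrix (illustrative numbers only):
ce = [[0.0, 0.2],
      [0.1, 0.4]]
theta = grey_correlation(ce)  # xi = 0.5, the common choice
```

Note that the alternative attaining the global minimum CE∗ gets the maximum coefficient 1, and larger distances give smaller coefficients, consistent with the interpretation below model (16).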

**Step 3.** On the basis of the grey correlation coefficient obtained from model (16), the weight of each attribute is calculated to determine the comprehensive correlation coefficient θ*<sup>i</sup>* of each alternative. Three cases are discussed: the attribute weights are completely unknown, completely known, or known only within a range.

**Case 1.** Attribute weight is completely unknown. In order to determine the attribute weight, the average information entropy of each attribute must be obtained. On the basis of intuitionistic fuzzy entropy, the calculation method of information entropy is as follows:

$$E(C_j) = -\frac{1}{\ln m} \sum_{i=1}^{m} \frac{CE^{*}_{ij}}{\sum\limits_{i=1}^{m} CE^{*}_{ij}} \ln \frac{CE^{*}_{ij}}{\sum\limits_{i=1}^{m} CE^{*}_{ij}} \tag{18}$$

The factor 1/ln *m* normalizes the entropy to at most 1 and ensures the boundedness of the information entropy. Transforming the average information entropy yields the attribute weights:

$$\omega_{j} = \frac{1 - E(C_{j})}{\sum\limits_{k=1}^{n} \left[1 - E(C_{k})\right]} \quad (j = 1, 2, \dots, n), \tag{19}$$

The weight of each attribute can thus be determined and substituted into

$$\theta\_i = \sum\_{j=1}^{n} \theta\_{ij} \omega\_j (i = 1, 2, \dots, m; j = 1, 2, \dots, n), \tag{20}$$

In model (20), the comprehensive correlation coefficient of alternatives θ*<sup>i</sup>* can be aggregated.

**Case 2.** Attribute weights are completely known. In this case, the grey correlation coefficient θ*ij* of each alternative attribute is obtained using model (16), and the comprehensive correlation degree θ*<sup>i</sup>* of each alternative follows from model (20).

**Case 3.** The value range of the attribute weights is known. Based on maximizing the agreement between the weights, whose range of values is known, and the decision maker's subjective preference, a linear programming model with the attribute weights as variables is constructed,

$$\begin{aligned} \max\ & y(\omega_{j}) = \sum_{i=1}^{m} \sum_{j=1}^{n} \theta_{ij}\, \omega_{j} \\ \text{s.t.}\ & \begin{cases} \sum\limits_{j=1}^{n} \omega_{j} = 1,\ \omega_{j} \in W \\ 0 \le \omega_{j} \le 1 \quad (j = 1, 2, \dots, n) \end{cases} \end{aligned} \tag{21}$$

In this way, the weight parameters of each attribute can be determined.
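Model (21) has box bounds plus a single sum-to-one constraint, a structure for which a greedy continuous-knapsack allocation attains the linear-programming optimum. The sketch below illustrates this; the coefficients and bounds are hypothetical, not taken from the case study:

```python
def max_weighted_simplex(c, lower, upper):
    """Maximize sum(c[j] * w[j]) s.t. sum(w) == 1 and lower[j] <= w[j] <= upper[j].

    Greedy continuous-knapsack solution: start every weight at its lower
    bound, then spend the remaining budget on the attributes with the
    largest objective coefficients first.
    """
    n = len(c)
    w = list(lower)
    budget = 1.0 - sum(lower)
    assert budget >= 0 and sum(upper) >= 1.0, "constraints are infeasible"
    for j in sorted(range(n), key=lambda k: c[k], reverse=True):
        step = min(upper[j] - w[j], budget)
        w[j] += step
        budget -= step
    return w

# Illustrative data (hypothetical theta column sums and weight ranges).
c = [4.0, 3.0, 2.0, 1.0]
lower = [0.1, 0.1, 0.1, 0.1]
upper = [0.4, 0.4, 0.4, 0.4]
print([round(x, 3) for x in max_weighted_simplex(c, lower, upper)])  # [0.4, 0.4, 0.1, 0.1]
```

A general-purpose LP solver (such as MATLAB's, used later in the paper) of course handles the same model; the greedy form just makes the structure of (21) explicit.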

The weight ω*<sup>j</sup>* of each attribute can be calculated by establishing the optimization model of the maximum comprehensive grey correlation coefficient θ*i*:

$$\theta_{i} = \sum_{j=1}^{n} \theta_{ij}\, \omega_{j} \quad (i = 1, 2, \dots, m), \tag{22}$$

The corresponding linear programming model is solved in MATLAB (R2017b), yielding the attribute weights of each alternative; these are then substituted into model (20) to determine the comprehensive correlation degree θ*i*.

**Step 4.** Based on the comprehensive correlation coefficients obtained under the three attribute-weight cases in Step 3, the earthquake shelter alternatives are ranked. The larger θ*i*, the better the alternative and the higher it ranks.

**Step 5.** A sensitivity analysis is performed by setting different values of the grey resolution coefficient in the correlation coefficient, and the rankings of the alternatives under the different resolution coefficients are compared and analyzed.

#### **4. A Numerical Case Study on the Ranking of Wenchuan Earthquake Shelters**

In this section, the traditional intuitionistic fuzzy distance and the intuitionistic fuzzy cross-entropy distance are used to analyze and compare the ranking of earthquake shelters.

#### *4.1. Intuitionistic Fuzzy Cross-Entropy Distance and Grey Correlation Analysis*

The stability and reliability of the method of intuitionistic fuzzy cross-entropy and the grey correlation coefficient are analyzed through comparative experiments. Assume that the government carries out shelter assessment and optimization for the five areas with a large disaster impact, and use *A*, *B*, *C*, *D*, and *E* to represent them. The government analyzes and evaluates the geographical location *C*1, disaster risk *C*2, rescue facilities *C*3, and feasibility *C*<sup>4</sup> of the five disaster areas. The decision-maker adopts an IFN to express the objective evaluation value of alternatives under different attributes, and the intuitionistic fuzzy decision matrix *R*5×<sup>4</sup> is shown in Table 4.


**Table 4.** Objective evaluation value of each alternative.

The decision maker's subjective preference values for alternatives A, B, C, D, and E are also expressed by IFNs: *c*<sup>1</sup> = (0.5, 0.4 ), *c*<sup>2</sup> = (0.6, 0.3), *c*<sup>3</sup> = (0.4, 0.3 ), *c*<sup>4</sup> = (0.4, 0.5), and *c*<sup>5</sup> = (0.6, 0.2). In order to choose the best alternative to build a shelter in the earthquake disaster area, the government adopts the intuitionistic fuzzy cross-entropy and grey correlation analysis method to make a decision.

**Step 1**. Determine the values of alternative *A*, *B*, *C*, *D*, and *E*; the objective evaluation attribute values *C*1,*C*2,*C*3,*C*4; the decision makers' objective evaluation matrix *R*5×4; and subjective preference values *c*1,*c*2,*c*3,*c*4,*c*5.

**Step 2**. According to model (17), the intuitionistic fuzzy cross-entropy distance between the objective evaluation value and the subjective preference value of each alternative is calculated to form the distance matrix:

$$CE^{*}_{5 \times 4} = \begin{bmatrix} 0.0000 & 0.0151 & 0.0000 & 0.1378 \\ 0.0151 & 0.0038 & 0.2402 & 0.0411 \\ 0.0000 & 0.0327 & 0.0348 & 0.0151 \\ 0.0036 & 0.0000 & 0.0036 & 0.0145 \\ 0.0041 & 0.0159 & 0.1810 & 0.0041 \end{bmatrix}$$

**Step 3**. Assuming that the grey resolution coefficient is ξ= 0.5, the grey correlation coefficient between the decision-maker's subjective preference value and the objective evaluation value is calculated according to model (16). The coefficient matrix is as follows:


$$\theta_{5 \times 4} = \begin{bmatrix} 1.0000 & 0.8883 & 1.0000 & 0.4657 \\ 0.8883 & 0.9693 & 0.3333 & 0.7450 \\ 1.0000 & 0.7860 & 0.7753 & 0.8883 \\ 0.9709 & 1.0000 & 0.9709 & 0.8923 \\ 0.9670 & 0.8831 & 0.3989 & 0.9670 \end{bmatrix}$$
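This coefficient matrix can be reproduced from the cross-entropy distance matrix of Step 2 via Equation (16). A short sketch (not the authors' code):

```python
# Cross-entropy distance matrix CE* from Step 2 of the case study.
CE = [
    [0.0000, 0.0151, 0.0000, 0.1378],
    [0.0151, 0.0038, 0.2402, 0.0411],
    [0.0000, 0.0327, 0.0348, 0.0151],
    [0.0036, 0.0000, 0.0036, 0.0145],
    [0.0041, 0.0159, 0.1810, 0.0041],
]
xi = 0.5  # grey resolution coefficient

lo = min(min(row) for row in CE)  # global minimum of CE*_ij (here 0)
hi = max(max(row) for row in CE)  # global maximum of CE*_ij (here 0.2402)

# Equation (16): theta_ij = (min + xi*max) / (CE_ij + xi*max).
theta = [[(lo + xi * hi) / (ce + xi * hi) for ce in row] for row in CE]

for row in theta:
    print([round(t, 4) for t in row])
# first row: [1.0, 0.8883, 1.0, 0.4657]
```

The printed values agree, entry by entry, with the matrix θ5×4 above.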

**Step 4**. Calculate the attribute weight ω*<sup>j</sup>* according to the known information provided by the above case. When the attribute weight is known, the model is relatively easy to solve. The following focuses on the analysis of two situations: The attribute weight is completely unknown and the attribute weight range is known.

**Case 1**. The attribute weights are completely unknown. Following the idea of intuitionistic fuzzy entropy, the average intuitionistic fuzzy entropy of each attribute is obtained from model (18): *E*(*C*1) = 0.5424, *E*(*C*2) = 0.7385, *E*(*C*3) = 0.5837, *E*(*C*4) = 0.6498. Then, according to model (19), the attribute weights are ω1 = 0.3080, ω2 = 0.1761, ω3 = 0.2802, and ω4 = 0.2357. Substituting these weights into model (22) gives the comprehensive grey correlation coefficients of the alternatives: θ1 = 0.8544, θ2 = 0.7133, θ3 = 0.8730, θ4 = 0.9575, and θ5 = 0.7930. Hence θ4 > θ3 > θ1 > θ5 > θ2, that is, *D* ≻ *C* ≻ *A* ≻ *E* ≻ *B*. Therefore, alternative *D* is the best, and the government should give priority to building earthquake shelters in that region.

To demonstrate the superiority and stability of the proposed intuitionistic fuzzy cross-entropy and comprehensive grey correlation analysis algorithm, different resolution coefficients ξ are set for a sensitivity analysis, to check whether the ranking of the alternatives fluctuates. Setting ξ = 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, and 1.00, the resulting comprehensive correlation coefficients are shown in Table 5. The ranking of the alternatives did not fluctuate with the change in the resolution coefficient.
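The Case 1 computation and this sensitivity analysis can be reproduced end to end. The sketch below (an independent re-implementation, not the authors' code) recomputes the entropy-based weights from the cross-entropy distance matrix via Equations (18) and (19), then repeats the ranking for each resolution coefficient:

```python
from math import log

# Cross-entropy distance matrix CE* from Step 2 of the case study.
CE = [
    [0.0000, 0.0151, 0.0000, 0.1378],
    [0.0151, 0.0038, 0.2402, 0.0411],
    [0.0000, 0.0327, 0.0348, 0.0151],
    [0.0036, 0.0000, 0.0036, 0.0145],
    [0.0041, 0.0159, 0.1810, 0.0041],
]
m, n = len(CE), len(CE[0])

# Entropy-based attribute weights, Equations (18) and (19); 0 * ln 0 := 0.
E = []
for j in range(n):
    col = [CE[i][j] for i in range(m)]
    total = sum(col)
    E.append(-sum(p * log(p) for p in (x / total for x in col) if p > 0) / log(m))
w = [(1 - e) / sum(1 - ek for ek in E) for e in E]
print([round(x, 3) for x in w])  # [0.308, 0.176, 0.28, 0.236]

# Sensitivity of the final ranking to the grey resolution coefficient.
hi = max(max(row) for row in CE)  # global max of CE*_ij (global min is 0)
rankings = []
for xi in (0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    theta = [[xi * hi / (ce + xi * hi) for ce in row] for row in CE]
    score = [sum(t * wj for t, wj in zip(row, w)) for row in theta]
    order = sorted(range(m), key=lambda i: -score[i])
    rankings.append("".join("ABCDE"[i] for i in order))
print(rankings)  # every entry is 'DCAEB', i.e., D > C > A > E > B
```

The weights agree with those reported above, and the order D ≻ C ≻ A ≻ E ≻ B is obtained at every ξ, consistent with Table 5.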


**Table 5.** Comprehensive grey correlation coefficient of alternatives under different grey resolution coefficients based on completely unknown attribute weights.

To verify the reliability and stability of the proposed method more intuitively, we use Python graphics to run simulation experiments on the ranking of each alternative against the grey resolution coefficient; the specific results are shown in Figure 2 (G denotes the grey resolution coefficient).

**Figure 2.** Ranking results of alternatives with different grey resolution coefficients based on completely unknown attribute weights.

It can be seen from Figure 2 that, in the seven sensitivity-analysis experiments on the grey resolution coefficient, the ranking of the alternatives never changes: *D* ≻ *C* ≻ *A* ≻ *E* ≻ *B* is always maintained. The simulation shows that *D* is the best alternative for building a shelter in the earthquake disaster area, and the decision result does not fluctuate, which demonstrates the method's strong stability.

**Case 2**. The value range of attribute weight is known: 0.30 ≤ ω<sup>1</sup> ≤ 0.32, 0.17 ≤ ω<sup>2</sup> ≤ 0.20, 0.25 ≤ ω<sup>3</sup> ≤ 0.28, and 0.20 ≤ ω<sup>4</sup> ≤ 0.24. Through the linear programming model (21), the objective function *Y* to maximize the grey correlation coefficient of alternatives is constructed and solved:

$$\max\, Y(\omega_{j}) = 4.8262\,\omega_{1} + 4.5267\,\omega_{2} + 3.4784\,\omega_{3} + 3.9583\,\omega_{4} \quad \text{s.t.} \begin{cases} 0.30 \le \omega_{1} \le 0.32 \\ 0.17 \le \omega_{2} \le 0.20 \\ 0.25 \le \omega_{3} \le 0.28 \\ 0.20 \le \omega_{4} \le 0.24 \\ \omega_{1} + \omega_{2} + \omega_{3} + \omega_{4} = 1 \\ 0 \le \omega_{j} \le 1 \quad (j = 1, 2, 3, 4) \end{cases} \tag{23}$$

The attribute weights obtained with MATLAB are ω1 = 0.30, ω2 = 0.18, ω3 = 0.28, and ω4 = 0.24. Combined with model (22), the comprehensive grey correlation coefficient of each alternative is θ1 = 0.8517, θ2 = 0.7131, θ3 = 0.8718, θ4 = 0.9573, and θ5 = 0.7928. Hence θ4 > θ3 > θ1 > θ5 > θ2, so the order of alternatives is *D* ≻ *C* ≻ *A* ≻ *E* ≻ *B* and alternative *D* is again the best. The government should give priority to building earthquake shelters in area *D*, the same result as when the attribute weights are unknown.

To further verify the stability and superiority of the intuitionistic fuzzy cross-entropy and comprehensive grey correlation analysis algorithm when the attribute weight range is known, different resolution coefficients are again set for a sensitivity analysis, and the optimal alternative and decision results are compared. Taking ξ = 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, and 1.00, the attribute weights and the comprehensive grey correlation coefficient of each alternative are shown in Tables 6 and 7. The table data show that the change in the grey resolution coefficient affects neither the attribute weights nor the decision result, which remains *D* ≻ *C* ≻ *A* ≻ *E* ≻ *B*; building the seismic shelter in area *D* is always the best alternative. In addition, when the weights are completely unknown, the comprehensive grey correlation coefficients of the alternatives are higher than when only the range of the attribute weights is known.


**Table 6.** Attribute weight values under different grey resolution coefficients.

**Table 7.** Comprehensive grey correlation coefficient of alternatives under different grey resolution coefficients based on known range of attribute weight.


More importantly, when the grey resolution coefficient fluctuates from 0.4 to 1.0, whether the weight is known or unknown, the change range of the comprehensive grey correlation coefficient of alternative *D* is the smallest, which is 0.0300 and 0.0302, respectively (see Table 8). Alternative *B* is always the worst, and its fluctuation is also the largest, which is 0.1438 and 0.1239, respectively. Based on this, the stability of the proposed method is proved.


**Table 8.** Change degree of comprehensive grey correlation coefficient of alternatives under fluctuation of grey resolution coefficient.

The Python simulation results based on Table 7 are shown in Figure 3. Compared with Figure 2, the comprehensive grey correlation coefficient decreases, but the overall trend of each alternative does not change, and the decision results remain unchanged. Whether the attribute weights are known or not, the optimal alternative and the ranking results are the same, which shows the superiority and stability of the method.

**Figure 3.** Ranking results of alternatives with different grey resolution coefficients based on known attribute weight range.

Through the above comparative analysis, the intuitionistic fuzzy cross-entropy and grey correlation analysis method achieves good results in solving MAEDM problems, and the ranking results show strong stability and environmental adaptability.

#### *4.2. Traditional Intuitionistic Fuzzy Distance and Grey Correlation Analysis*

Based on the data given by the above problem of ranking earthquake shelters, the traditional intuitionistic fuzzy distance and grey correlation degree are used to analyze and give the ranking results.

The traditional intuitionistic fuzzy distance model (4) has been given; thus, the corresponding grey correlation coefficient ε*ij* is

$$\varepsilon_{ij} = \frac{\min\limits_{i} \min\limits_{j} d(r_{ij}, c_{i}) + \xi \max\limits_{i} \max\limits_{j} d(r_{ij}, c_{i})}{d(r_{ij}, c_{i}) + \xi \max\limits_{i} \max\limits_{j} d(r_{ij}, c_{i})} \tag{24}$$

where *rij* denotes the objective evaluation value, *ci* denotes the subjective preference information, and grey resolution coefficient ξ = 0.50.

**Step 1**. Calculating the grey correlation coefficient of each alternative between the objective evaluation value and subjective preference information.

$$\varepsilon_{5 \times 4} = \begin{bmatrix} 0.6667 & 0.6667 & 1.0000 & 0.4000 \\ 0.6667 & 0.8000 & 0.3333 & 0.5714 \\ 1.0000 & 0.5714 & 0.5714 & 0.6667 \\ 0.8000 & 1.0000 & 0.8000 & 0.6667 \\ 0.8000 & 0.6667 & 0.8000 & 0.8000 \end{bmatrix}$$

**Step 2.** Determining the attribute weight. Due to the fact that the range of attribute weight values is known, utilize model (21) to establish the following single-objective programming model:

$$\max\, Z(\omega_{j}) = 3.9334\,\omega_{1} + 3.7048\,\omega_{2} + 3.5047\,\omega_{3} + 3.1048\,\omega_{4} \quad \text{s.t.} \begin{cases} 0.30 \le \omega_{1} \le 0.32 \\ 0.17 \le \omega_{2} \le 0.20 \\ 0.25 \le \omega_{3} \le 0.28 \\ 0.20 \le \omega_{4} \le 0.24 \\ \omega_{1} + \omega_{2} + \omega_{3} + \omega_{4} = 1 \\ 0 \le \omega_{j} \le 1 \quad (j = 1, 2, 3, 4) \end{cases} \tag{25}$$


Solving this model, the attribute weights are obtained: ω1 = 0.30, ω2 = 0.18, ω3 = 0.28, and ω4 = 0.24.

**Step 3.** On the basis of model (20), the comprehensive grey correlation coefficients are calculated: ε1 = 0.6960, ε2 = 0.5745, ε3 = 0.7229, ε4 = 0.8040, and ε5 = 0.7760.

**Step 4.** Determining the ranking of the alternatives. Ranking the alternatives by the size of the comprehensive grey correlation coefficient ε*i* gives *D* ≻ *E* ≻ *C* ≻ *A* ≻ *B*.
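Steps 3 and 4 amount to a weighted aggregation followed by a sort, and can be checked in a few lines of Python (an illustrative sketch using the grey correlation matrix and the weights obtained above):

```python
# Grey correlation matrix from Step 1 (rows: alternatives A..E).
eps = [
    [0.6667, 0.6667, 1.0000, 0.4000],  # A
    [0.6667, 0.8000, 0.3333, 0.5714],  # B
    [1.0000, 0.5714, 0.5714, 0.6667],  # C
    [0.8000, 1.0000, 0.8000, 0.6667],  # D
    [0.8000, 0.6667, 0.8000, 0.8000],  # E
]
w = [0.30, 0.18, 0.28, 0.24]  # attribute weights from model (25)

# Model (20): comprehensive coefficient = weighted sum over attributes.
score = {name: sum(e * wj for e, wj in zip(row, w))
         for name, row in zip("ABCDE", eps)}
print({k: round(v, 4) for k, v in score.items()})
# {'A': 0.696, 'B': 0.5745, 'C': 0.7229, 'D': 0.804, 'E': 0.776}

ranking = sorted(score, key=score.get, reverse=True)
print("".join(ranking))  # DECAB
```

The aggregated values match those reported in Step 3, and sorting them reproduces the order D ≻ E ≻ C ≻ A ≻ B.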

#### *4.3. Comparative Analysis*

Based on the ranking problem of earthquake shelters, this paper makes a comparative analysis from two aspects:

(1). The attribute weight is completely unknown and the attribute weight range is known

For a more intuitive comparison, it is further explored based on Figures 2 and 3. Regardless of whether the attribute weight is known or unknown, the ranking results of alternatives maintain high stability. The best alternative is always *D*, and the worst is always *B*. The comprehensive grey correlation coefficient of the alternative is positively correlated with the grey resolution coefficient, which indicates that the larger the resolution coefficient, the greater the correlation coefficient of the corresponding alternative.

Moreover, in the case of unknown weights, the comprehensive grey correlation coefficient of each alternative is always higher than under the known weight range, which indirectly reflects the fact that attribute weights are uncertain in most decision problems (see Figures 4 and 5). In addition, the results obtained by determining the attribute weights with a reasonable method are more practical.

**Figure 4.** The alternatives with different grey resolution coefficients based on completely unknown attribute weights.

**Figure 5.** The alternatives with different grey resolution coefficients based on known attribute weight range.

Meanwhile, based on the data in Table 8, we can further analyze the volatility of the comprehensive grey correlation coefficient in two cases. From Figure 6 (deviation 1 represents unknown weights and deviation 2 represents known weights range), the deviation curves of the comprehensive grey correlation coefficient in the two kinds of weights situation almost coincide. However, when the weight is unknown, the fluctuation amplitude of the comprehensive grey correlation coefficient is still less than that of the known attribute weight range.

**Figure 6.** Deviation of comprehensive grey correlation coefficient in two cases.

Through the comparative analysis, we can see that the ranking result with unknown weight is more reasonable and more consistent with the uncertainty of the decision environment in MAEDM problems.

(2). The traditional intuitionistic fuzzy distance versus the intuitionistic fuzzy cross-entropy distance

With the intuitionistic fuzzy cross-entropy method, the ranking result is *D* ≻ *C* ≻ *A* ≻ *E* ≻ *B*, and under an extensive sensitivity analysis the result remains highly stable. With the traditional intuitionistic fuzzy distance method, however, the ranking becomes *D* ≻ *E* ≻ *C* ≻ *A* ≻ *B*. Although the ranking changes slightly, the best alternative is still *D* and the worst is still *B* (see Table 9). This further confirms that the method based on intuitionistic fuzzy cross-entropy and grey correlation analysis proposed in this paper has strong stability.


**Table 9.** Ranking results under different methods.

According to the above two groups of comparative analysis, it can be concluded from many aspects that *D* is the best alternative. For the decision maker to make rescue measures, it is the most reasonable decision to give priority to the establishment of earthquake shelters in the *D* area.

#### **5. Conclusions**

This paper presents a new MAEDM method based on intuitionistic fuzzy cross-entropy and comprehensive grey correlation analysis. The main contributions are as follows: (1) It overcomes the limitations of the traditional intuitionistic fuzzy geometric distance algorithm and introduces the intuitionistic fuzzy cross-entropy distance measure, which not only retains the integrity of the decision information but also directly reflects the differences between intuitionistic fuzzy data. (2) It focuses on the weight problem in MAEDM and compares the cases of known and unknown attribute weights, which greatly improves the reliability and stability of the decision results. (3) By using grey correlation analysis, the fit between the objective evaluation values and the decision maker's subjective preference values can be fully taken into account; on this basis, a sensitivity analysis of the grey resolution coefficient makes the ranking result more reasonable. (4) The intuitionistic fuzzy cross-entropy and grey correlation analysis algorithm is introduced into emergency decision-making problems such as ranking shelter locations in earthquake disaster areas, which greatly reduces the risk of decision-making. (5) By comparing the traditional intuitionistic fuzzy distance with the intuitionistic fuzzy cross-entropy, the validity of the proposed method is verified.

A limitation is that the method proposed in this paper applies only to emergency decision-making problems with known subjective preferences. For emergency problems in which the decision maker has no clear preference, the method needs further study. In addition, considering more attribute indicators when ranking alternatives may yield more convincing results.

The following aspects are likely to become future research topics: (1) In MAEDM, the attribute weight problem will remain a research focus; considering the time factor and extending the weights to a dynamic setting may be an interesting direction. (2) The decision maker's preference relation and the attribute weights often carry great uncertainty; addressing multi-attribute emergency decisions with more reliable robust optimization [39–41] is an effective approach.

**Author Contributions:** Conceptualization, P.L.; Data curation, P.L.; Formal analysis, P.L.; Funding acquisition, P.L.; Investigation, P.L.; Methodology, Y.J. and P.L.; Project administration, S.-J.Q.; Resources, Y.J. and S.-J.Q.; Software, P.L.; Supervision, Y.J., Z.W. and S.-J.Q.; Validation, P.L., Y.J. and Z.W.; Visualization, P.L.; Writing—original draft, P.L.; Writing—review & editing, Y.J. and Z.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work is supported by the National Social Science Foundation of China (No. 17BGL083).

**Acknowledgments:** Thanks is given to my tutor for the guidance of this paper, which greatly improved the quality of the article. Thank you for providing this academic platform for me to submit my manuscript. We are very grateful to the editors and referees for their careful reading and constructive suggestions on the manuscript.

**Conflicts of Interest:** This article has never been published in any journal or institution, and there will be no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Cointegration and Unit Root Tests: A Fully Bayesian Approach**

#### **Marcio A. Diniz 1,\*, Carlos A. B. Pereira <sup>2</sup> and Julio M. Stern <sup>3</sup>**


Received: 3 August 2020; Accepted: 27 August 2020; Published: 31 August 2020

**Abstract:** To perform statistical inference for time series, one should be able to assess if they present deterministic or stochastic trends. For univariate analysis, one way to detect stochastic trends is to test if the series has unit roots, and for multivariate studies it is often relevant to search for stationary linear relationships between the series, or if they cointegrate. The main goal of this article is to briefly review the shortcomings of unit root and cointegration tests proposed by the Bayesian approach of statistical inference and to show how they can be overcome by the Full Bayesian Significance Test (FBST), a procedure designed to test sharp or precise hypotheses. We will compare its performance with the most used frequentist alternatives, namely, the Augmented Dickey–Fuller for unit roots and the maximum eigenvalue test for cointegration.

**Keywords:** time series; Bayesian inference; hypothesis testing; unit root; cointegration

Several time series present deterministic or stochastic trends, which imply that the effects of these trends on the level of the series are permanent. Consequently, the mean and variance of the series will not be constant and will not revert to a long-term value. This feature reflects the fact that the stochastic processes generating these series are not (weakly) stationary, imposing problems to perform inductive inference using the most traditional estimators or predictors. This is so because the usual properties of these procedures will not be valid under such conditions.

Therefore, when modeling non-stationary time series, one should be able to properly detrend the used series, either by directly modeling the trend by deterministic functions, or by transforming the series to remove stochastic trends. To determine which strategy is the suitable solution, several statistical tests were developed since the 1970s by the frequentist school of statistical inference.

The Augmented Dickey–Fuller (ADF) test is one of the most popular tests used to assess if a time series has a stochastic trend or, for series described by auto-regressive models, if they have a unit root. When one is searching for long term relationships between multiple series under analysis, it is crucial to know if there are stationary linear combinations of these series, i.e., if the series are cointegrated. Cointegration tests were developed, also by the frequentist school, in the late 1980s [1] and early 1990s [2]. Only in the late 1980s did the Bayesian approach to test the presence of unit roots start to be developed.

Both unit root and cointegration tests may be considered tests on precise or sharp hypotheses, i.e., those in which the dimension of the parameter space under the tested hypothesis is smaller than the dimension of the unrestricted parameter space. Testing sharp hypotheses poses major difficulties for either the frequentist or Bayesian paradigms, such as the need to eliminate nuisance parameters.

The main goal of this article is to briefly review the shortcomings of the tests proposed by the Bayesian school and how they can be overcome by the Full Bayesian Significance Test (FBST). More specifically, we will compare its performance with the most used frequentist alternatives, the ADF for unit roots, and the maximum eigenvalue test for cointegration. Since this is a review article, it is important to remark that the results presented here were published elsewhere by the same authors, see [3,4].

To accomplish this objective, we will define the FBST in the next section, also showing how it can be implemented in a general context. The following section discusses the problems of testing the existence of unit roots in univariate time series and how the Bayesian tests approach the problem. Section 4 then shows how the FBST is applied to test if a time series has unit roots and illustrates this with applications on a real data set. In the sequel, we discuss the Bayesian alternatives to cointegration tests and then apply the FBST to test for cointegration using real data sets. We conclude with some remarks and possible extensions for future work.

#### **1. FBST**

The Full Bayesian Significance Test was proposed in [5] mainly to deal with sharp hypotheses. The procedure has several properties, see [6,7], most interestingly the fact that it is only based on posterior densities, thus avoiding the necessity of complications such as the elimination of nuisance parameters or the adoption of priors with positive probabilities attached to sets of zero Lebesgue measure.

We shall consider general statistical models in which the parameter space is denoted by Θ ⊆ R<sup>*m*</sup>, *m* ∈ N. A sharp hypothesis *H* assumes that *θ*, the parameter vector of the chosen statistical model, belongs to a sub-manifold Θ*<sup>H</sup>* of smaller dimension than Θ. This implies, for continuous parameter spaces, that the subset Θ*<sup>H</sup>* has null Lebesgue measure whenever *H* is sharp. The sample space, the set of all possible values of the observable random variables (or vectors), is here denoted by X .

Following the Bayesian paradigm, let *h*(·) be a probability prior density over Θ, **x** ∈ X , the observed sample (scalar or vector), and *L*(· | **x**) the likelihood derived from data **x**. To evaluate the Bayesian evidence based on the FBST, the sole relevant entity is the posterior probability density for *θ* given **x**,

$$g(\theta \mid \mathbf{x}) \propto h(\theta) \cdot L(\theta \mid \mathbf{x}).$$

It is important to highlight that the procedure may be used when the parameter space is discrete. However, when the posterior probability distribution over Θ is absolutely continuous, the FBST appears as a more suitable alternative to significance hypothesis testing. For notational simplicity, we will denote Θ*<sup>H</sup>* by *H* in the sequel.

Let *r*(*θ*) be a reference density on Θ such that the function *s*(*θ*) = *g*(*θ* | **x**)/*r*(*θ*) is a *relative surprise* function (see [8], pp. 145–146). The reference density is important because it guarantees that the FBST is invariant to reparametrizations, even when *r*(*θ*) is improper, see [6,9]. Thus, when considering *r*(*θ*) proportional to a constant, the surprise function will be, in practical terms, equivalent to the posterior distribution. For the applications considered in this article, we will use the improper uniform density as reference density on Θ. The authors of [10] remark that it is possible to generalize the procedure using other reference densities such as neutral, invariant, maximum-entropy or non-informative priors, if they are available and desirable.

**Definition 1** (**Tangent set**)**.** *Considering a sharp hypothesis H* : *θ* ∈ Θ*H, the tangential set of the hypothesis given the sample is given by*

$$\mathbb{T}_{\mathbf{x}} = \{ \theta \in \Theta : s(\theta) > s^{*} \}, \tag{1}$$

where *s*<sup>∗</sup> = sup<sub>*θ*∈*H*</sub> *s*(*θ*).

Notice that the tangent set T**<sup>x</sup>** is the highest relative surprise set, that is, the set of points of the parameter space with higher relative surprise than any point in *H*, being *tangential* to *H* in this sense. This approach takes into consideration the statistical model in which the hypothesis is defined, using several components of the model to define an evidential measure favoring the hypothesis.

**Definition 2** (**Evidence**)**.** *The Bayesian evidence value against H, denoted* $\overline{ev}$*, is defined as*

$$\overline{ev} = P\left(\theta \in \mathbb{T}_{\mathbf{x}} \mid \mathbf{x}\right) = \int_{\mathbb{T}_{\mathbf{x}}} dG_{\mathbf{x}}(\theta),\tag{2}$$

*where G***x**(*θ*) *denotes the posterior distribution function of θ and the above integral is of the Riemann–Stieltjes type.*

Definition 2 sets $\overline{ev}$ as the posterior probability of the tangent set, which is interpreted as an evidence value against *H*. Hence, the evidence value supporting *H* is the complement of $\overline{ev}$, namely, *ev* = 1 − $\overline{ev}$. Notwithstanding, *ev* is not evidence against *A* : *θ* ∉ Θ*H*, the alternative hypothesis (which is not sharp anyway). Equivalently, $\overline{ev}$ is not evidence in favor of *A*, although it is evidence against *H*.

**Definition 3** (**Test**)**.** *The FBST is the procedure that rejects H whenever ev* = 1 − $\overline{ev}$ *is smaller than a critical level, evc.*

Thus, we are left with the problem of deciding the critical level $ev\_c$ for each particular application. We briefly discuss this and other practical issues in the following subsection.

#### *1.1. Practical Implementation: Critical Values and Numerical Computation*

Since *ev* (also called the e-value) is a statistic, it has a sampling distribution derived from the adopted statistical model, and in principle this distribution could be used to find a threshold value. Assuming the likelihood and the posterior distribution satisfy certain regularity conditions (see [11], p. 436), [12] proved that, asymptotically, there is a relationship between *ev* and the *p*-values obtained from the frequentist likelihood ratio procedure used to test the same hypotheses. This fact provides a way to find, at least asymptotically, a critical value for *ev* below which the hypothesis being tested is rejected.

In a recent review [7], the authors discuss different ways to provide a threshold for *ev*. Among these alternatives, we highlight the standardized e-value, which follows, asymptotically, the uniform distribution on (0, 1). See also [13] for more on the standardized version of *ev*.

One could also try to define the FBST as a Bayes test derived from a particular loss function and the respective minimization of the posterior expected loss. Following this strategy, [10] showed that there are loss functions which result in *ev* as a Bayes estimator of *φ* = I*H*(*θ*), where I*A*(*x*) denotes the indicator function, equal to one if *x* ∈ *A* and zero otherwise. Hence, the FBST is in fact a Bayes procedure in the formal sense defined by Wald in [14].


To compute the evidence value supporting *H* defined in the last section, we need to follow the steps shown in Table 1. Appendix A provides detailed information about the computational resources and codes used to implement the FBST in the examples presented in this work. After defining the statistical model and the prior, it is simple to find the surprise function, *s*(*θ*). In step 3, one should find the point of the parameter space in *H* that maximizes *s*(*θ*), that is, solve a constrained numerical maximization problem. In several applications, this step does not have a closed-form solution, requiring the use of numerical optimizers.

Step 4 involves the integration of the posterior distribution over a subset of Θ, the tangent set T**<sup>x</sup>**, which can be highly complex. Once more, since in many cases it is fairly difficult to find an explicit expression for T**x**, one may use various numerical techniques to compute the integral. If it is possible to generate random samples from the posterior distribution, Monte Carlo integration provides an estimate of *ev*, as we will show in this work. Another alternative is to use approximation techniques, such as those proposed in [15], based on a Laplace approximation. We discuss how to implement such approximations for unit root and cointegration tests in [3,4].
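To make the steps concrete, the following toy sketch (entirely hypothetical data, not from the paper) runs the whole procedure for a binomial model with a uniform prior and the sharp hypothesis *H* : *θ* = 0.5; with a flat reference density, the surprise function is just the posterior Beta kernel:

```python
# Toy FBST sketch (hypothetical numbers): binomial model, uniform Beta(1, 1) prior,
# flat reference density, sharp hypothesis H: theta = 0.5.
import numpy as np

rng = np.random.default_rng(42)
x, n = 7, 10                       # made-up data: 7 successes in 10 trials
a, b = 1 + x, 1 + (n - x)          # Beta posterior parameters

# Flat reference: the surprise function is the posterior kernel
# (normalizing constants cancel in the comparison with s*).
def s(theta):
    return theta ** (a - 1) * (1 - theta) ** (b - 1)

# Step 3: under the sharp hypothesis, the supremum is attained at theta = 0.5.
s_star = s(0.5)

# Step 4: Monte Carlo estimate of the posterior probability of the tangent set.
draws = rng.beta(a, b, size=200_000)
ev_bar = np.mean(s(draws) > s_star)    # evidence against H

# Step 5: evidence supporting H.
ev = 1.0 - ev_bar
print(round(float(ev), 2))
```

With these made-up data, the e-value measures how far *θ* = 0.5 sits inside the highest relative surprise region of the Beta posterior; only the kernel is needed, since constants cancel.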

#### **2. Bayesian Unit Root Tests**

Before presenting the Bayesian procedures used to test for the presence of unit roots, let us fix notation. We will denote by *yt* the *t*-th value of a univariate time series observed at dates *t* = 1, ... , *T* + *p*, where *T* and *p* are positive integers. The usual approach is to assume that the series under analysis is described by an autoregressive process with *p* lags, *AR*(*p*), meaning that the data generating process is fully described by a stochastic difference equation of order *p*, possibly with an intercept or drift and a deterministic linear trend, i.e.,

$$y\_t = \mu + \delta \cdot t + \phi\_1 y\_{t-1} + \dots + \phi\_p y\_{t-p} + \varepsilon\_t \tag{3}$$

with $\varepsilon\_t$ i.i.d. $N(0, \sigma^2)$ for $t = 1, \dots, T+p$. Using the lag or backshift operator $B$, we denote $y\_{t-k}$ by $B^k y\_t$, allowing us to rewrite (3) as

$$(1 - \phi\_1 B - \dots - \phi\_p B^p) y\_t = \mu + \delta \cdot t + \varepsilon\_t \tag{4}$$

where $\phi(B) = (1 - \phi\_1 B - \dots - \phi\_p B^p)$ is the autoregressive polynomial. The difference Equation (3) will be stable, implying that the process generating $\{y\_t\}\_{t=1}^{T+p}$ is (weakly) stationary, whenever all the roots of the characteristic polynomial $\phi(z)$, $z \in \mathbb{C}$ (the roots may be complex), lie outside the unit circle. The set of polynomial operators, such as lag polynomials like $\phi(B)$, induces an algebra that is isomorphic to the algebra of polynomials in real or complex variables; see [16].
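The stability condition just stated can be checked numerically; the sketch below (coefficients are illustrative, not taken from the paper) evaluates the roots of the characteristic polynomial:

```python
# Sketch (assumed coefficients): check AR(p) stationarity via the roots of the
# characteristic polynomial phi(z) = 1 - phi_1 z - ... - phi_p z^p.
import numpy as np

def is_stationary(phi):
    """phi = [phi_1, ..., phi_p]; True if all roots of phi(z) lie outside the unit circle."""
    # numpy.roots expects coefficients from the highest degree down, so
    # [-phi_p, ..., -phi_1, 1] encodes -phi_p z^p - ... - phi_1 z + 1.
    coeffs = np.r_[-np.array(phi, dtype=float)[::-1], 1.0]
    roots = np.roots(coeffs)
    return bool(np.all(np.abs(roots) > 1.0))

print(is_stationary([0.5]))        # AR(1) with root z = 2: stationary
print(is_stationary([1.0]))        # random walk, root z = 1: unit root
print(is_stationary([0.6, 0.3]))   # AR(2), both roots outside the unit circle
```

This is the same criterion used throughout the section: a root on the unit circle (absolute value exactly one) signals a unit root process.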

If some of the roots lie exactly on the unit circle, it is said that the process has unit roots. In order to test such a hypothesis statistically, (3) is rewritten as

$$
\Delta y\_t = \mu + \delta \cdot t + \Gamma\_0 \, y\_{t-1} + \Gamma\_1 \Delta y\_{t-1} + \dots + \Gamma\_{p-1} \Delta y\_{t-p+1} + \varepsilon\_t \tag{5}
$$

where $\Delta y\_t = y\_t - y\_{t-1}$, $\Gamma\_0 = \phi\_1 + \dots + \phi\_p - 1$ and $\Gamma\_i = -\sum\_{j=i+1}^{p} \phi\_j$, for $i = 1, \dots, p-1$. If the generating process has only one unit root, one root of the complex polynomial $\phi(z)$,

$$1 - \phi\_1 z - \phi\_2 z^2 - \dots - \phi\_p z^p,$$

is equal to one, meaning that

$$1 - \phi\_1 - \phi\_2 - \dots - \phi\_p = 0$$

i.e., $\phi(1) = 0$, and all the other roots lie outside the unit circle. In this case, $\Gamma\_0 = 0$, which is the hypothesis to be tested when modeling (5). Even though tests based on these assumptions only verify whether the process has a single unit root, there are generalizations based on the same principles that test for the existence of multiple unit roots; see [17].
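The reparametrization can be sanity-checked numerically; the sketch below (illustrative coefficients, not from the paper) maps AR coefficients to the Γ's of Equation (5):

```python
# Sketch verifying the reparametrization: Gamma_0 = phi_1 + ... + phi_p - 1,
# so Gamma_0 = 0 exactly when phi(1) = 0 (a unit root). Coefficients are made up.
import numpy as np

def adf_coefficients(phi):
    """Map AR(p) coefficients to (Gamma_0, [Gamma_1, ..., Gamma_{p-1}]) of Eq. (5)."""
    phi = np.asarray(phi, dtype=float)
    gamma0 = phi.sum() - 1.0
    gammas = [-phi[i + 1:].sum() for i in range(len(phi) - 1)]
    return gamma0, gammas

# AR(2) with a unit root: phi(z) = 1 - 1.3 z + 0.3 z^2, so phi(1) = 0.
g0, g = adf_coefficients([1.3, -0.3])
print(g0)   # ~0 up to floating point: the null hypothesis Gamma_0 = 0 holds
print(g)    # [Gamma_1] = [-phi_2] = [0.3]
```

Under these assumed coefficients, Γ₀ vanishes exactly when the coefficients sum to one, matching the unit root condition φ(1) = 0.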

The search for Bayesian unit root tests began in the late 1980s. As far as we know, [18,19] were the first works to propose a Bayesian approach to unit root tests. The frequentist criticisms of these articles received a proper answer in [20,21], generating a fruitful debate that produced a long list of papers in the Bayesian time series literature. A good summary of the debate and of the Bayesian papers that resulted from it is presented in [22]. We will present here only the most relevant strategies proposed by the Bayesian school to test for unit roots.

Let $\theta = (\rho, \psi)$ be the parameter vector, in which $\rho = \sum\_{i=1}^{p} \phi\_i$ and $\psi = (\mu, \delta, \Gamma\_1, \dots, \Gamma\_{p-1})$. Assuming $\sigma^2$ fixed, the prior density for $\theta$ can be factorized as

$$h(\theta) = h\_0(\rho) \cdot h\_1(\psi \mid \rho).$$

The marginal likelihood for $\rho$, denoted by $L\_m$, is:

$$L\_m(\rho \mid \mathbf{y}) \propto \int\_{\Psi} L(\theta \mid \mathbf{y}) \cdot h\_1(\psi \mid \rho) \, d\psi$$

where $\mathbf{y} = \{y\_t\}\_{t=1}^{T+p}$ is the observations vector, $L(\theta \mid \mathbf{y})$ the full likelihood, and $\Psi$ the support of the random vector $\psi$. This marginal likelihood, combined with a prior for $\rho$, is the main ingredient used by standard Bayesian procedures to test for the existence of unit roots. Even though the procedure varies among authors in some specific aspects, mentioned below, basically all of them use Bayes factors and posterior probabilities.

One important issue is the specification of the null hypothesis: some authors, starting from [23], consider $H\_0 : \rho = 1$ against $H\_1 : \rho < 1$. Since [24], this is the way the frequentist school addresses the problem, but following this approach no explosive value for $\rho$ is considered. The decision-theoretic Bayesian approach addresses the problem using the posterior probabilities ratio or Bayes factor:

$$B\_{01} = \frac{L\_m(\rho = 1 \mid \mathbf{y})}{\int L\_m(\rho \mid \mathbf{y}) \cdot h\_0(\rho) \, d\rho}.$$

Advocates of this solution argue that one of the advantages of this approach is that the null and the alternative hypotheses are given equal weight. However, the expression above is not defined if $h\_0(\rho)$ is not a proper density, since the denominator of the Bayes factor is equal to the predictive density, which is defined only if $h\_0(\rho)$ is proper. There are also problems if $L\_m(\rho = 1 \mid \mathbf{y})$ is zero or infinite.

In [20,25], the problem is approached by testing $H\_0 : \rho \geq 1$ against $H\_1 : \rho < 1$, considering explosive values for $\rho$ explicitly. The main advantage of this strategy is the possibility of computing posterior probabilities like

$$P(\rho > 1 \mid \mathbf{y}) = \int\_{1}^{\infty} g\_m(\rho \mid \mathbf{y}) \, d\rho$$

which is defined even for improper priors on $\rho$, where $g\_m$ is the marginal posterior density of $\rho$.

In [26], the authors do not choose *ρ* as the parameter of interest, examining instead the largest absolute value of the roots of the characteristic polynomial and then verifying whether it is smaller or larger than one. Usually, this value is slightly smaller than *ρ*, but the authors argue that this small difference may be important. When this approach is used, unit roots are found less frequently. For an AR(3) model with a constant and a deterministic trend, [26] derived the posterior density of the dominant root for the 14 series used in [27] and concluded the following: for eleven of the series, the dominant root was smaller than one, that is to say, those series were trend-stationary. These results were based on flat priors for the autoregressive parameters and the deterministic trend coefficient.

Another controversy concerns the prior over *ρ*: [20] argues that the difference between the results given by the frequentist and Bayesian inferences is due to the fact that the flat prior proposed in [18] overweights the stationary region of *ρ*. Hence, he derived a Jeffreys prior for the AR(1) model: this prior diverges quickly as *ρ* increases beyond one. The resulting posterior led to the same conclusions as [27], which will be discussed in detail in the following section. The critics of the approach adopted by Phillips in [20] judged the Jeffreys prior unrealistic from a subjective point of view; see the comments on Phillips's paper in the *Journal of Applied Econometrics*, volume 6, number 4, 1991. The subsequent papers in the same number support the Bayesian approach. This objection is unconvincing if one considers that the Jeffreys prior is crucial to ensure an invariant inferential procedure, and invariance is a highly desirable property, for either objective or subjective reasons. See [28] for more on invariance in physics and statistical models.

A final controversial point concerns the modeling of initial observations. If the likelihood explicitly models the initial observed values (i.e., it is an *exact* likelihood), the process is implicitly considered stationary. In fact, when it is known that the process is stationary, and it is believed that the data generating process has been working for a long period, it is reasonable to assume that the parameters of the model determine the marginal distribution of the initial observations. In the simplest AR(1) model, this would imply that $y\_1 \sim N(0, \sigma^2/(1-\rho^2))$. In this scenario, performing the inference conditional on the first observation would discard relevant information. On the other hand, there is no marginal distribution defined for $y\_1$ if the generating process is not stationary. Then, it is valid to use a likelihood conditional on the initial observations. For the models presented here, we always work with the conditional likelihood. As argued in [18], inferences for stationary models are little affected by the use of conditional likelihoods, especially for large samples; the author compares these inferences with ones based on exact likelihoods under explicit modeling of the initial observations.

#### **3. Implementing the FBST for Unit Root Testing**

We will now describe how to use the FBST to test for the presence of unit roots, referring to the general model (5), reproduced below. It is also possible to include $q \in \mathbb{N}$ moving average terms in (3) to model the process, a case that will not be covered in this article but that, in principle, should not pose major problems for the FBST.

$$
\Delta y\_t = \mu + \delta \cdot t + \Gamma\_0 \, y\_{t-1} + \Gamma\_1 \Delta y\_{t-1} + \dots + \Gamma\_{p-1} \Delta y\_{t-p+1} + \varepsilon\_t, \tag{5}
$$

where *ε<sup>t</sup> <sup>i</sup>*.*i*.*d*. <sup>∼</sup> *<sup>N</sup>*(0, *<sup>σ</sup>*2) for *<sup>t</sup>* <sup>=</sup> 1, ... , *<sup>T</sup>* <sup>+</sup> *<sup>p</sup>*, recalling also that the hypothesis being tested is <sup>Γ</sup><sup>0</sup> <sup>=</sup> 0. We slightly change the notation of the last section now using *ψ* to denote the vector (*μ*, *δ*, Γ0, ... , Γ*p*−1) and setting *θ* = (*ψ*, *σ*).

Recalling the steps to implement the FBST displayed in Table 1, we have just specified the statistical model. The likelihood, conditional on the first *p* observations, derived from the Gaussian model is

$$L(\theta \mid \mathbf{y}) = (2\pi)^{-T/2} \sigma^{-T} \exp\left\{-\frac{1}{2\sigma^2} \cdot \sum\_{t=p+1}^{T+p} \varepsilon\_t^2\right\},\tag{6}$$

in which $\varepsilon\_t = \Delta y\_t - \mu - \delta \cdot t - \Gamma\_0 y\_{t-1} - \Gamma\_1 \Delta y\_{t-1} - \dots - \Gamma\_{p-1} \Delta y\_{t-p+1}$. To complete step 1 of Table 1, we need a prior distribution for $\theta$. For all the series modeled in this article, we will use the following non-informative prior:

$$h(\theta) = h(\psi, \sigma) \propto 1/\sigma. \tag{7}$$

We are aware of the problems caused by improper priors applied to this problem when one uses alternative approaches, like those mentioned by [22]. However, one of our goals is to show how the FBST can be implemented even for a potentially problematic prior like this one. To write the posterior, we use the following notation:

$$
\Delta Y = \begin{bmatrix}
\Delta y\_{p+1} \\
\Delta y\_{p+2} \\
\vdots \\
\Delta y\_{T+p}
\end{bmatrix}, \quad X = \begin{bmatrix}
1 & p+1 & y\_p & \Delta y\_p & \dots & \Delta y\_2 \\
1 & p+2 & y\_{p+1} & \Delta y\_{p+1} & \dots & \Delta y\_3 \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
1 & T+p & y\_{T+p-1} & \Delta y\_{T+p-1} & \dots & \Delta y\_{T+1}
\end{bmatrix}, \quad \psi = \begin{bmatrix}
\mu \\
\delta \\
\Gamma\_0 \\
\Gamma\_1 \\
\vdots \\
\Gamma\_{p-1}
\end{bmatrix},
$$

with Δ*Y* of dimension *T* × 1, *X* of dimension *T* × (*p* + 2), and *ψ* of dimension (*p* + 2) × 1. Thanks to this notation, and using primes to denote transposed matrices, we can write:

$$\sum\_{t=p+1}^{T+p} \varepsilon\_t^2 = (\Delta Y - X\psi)'(\Delta Y - X\psi) = (\Delta Y - \widehat{\Delta Y})'(\Delta Y - \widehat{\Delta Y}) + (\psi - \widehat{\psi})'X'X(\psi - \widehat{\psi}),$$

where $\widehat{\psi} = (X'X)^{-1}X'\Delta Y$ is the ordinary least squares (OLS) estimator of $\psi$ and $\widehat{\Delta Y} = X\widehat{\psi}$ its prediction for $\Delta Y$. Thus, the full posterior is

$$g(\theta \mid \mathbf{y}) \propto \sigma^{-(T+1)} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (\Delta Y - \widehat{\Delta Y})'(\Delta Y - \widehat{\Delta Y}) + (\psi - \widehat{\psi})'X'X(\psi - \widehat{\psi}) \right] \right\},\tag{8}$$

a Normal-Inverse Gamma density.

Step 2 demands a reference density in order to define the relative surprise function. Since we will use the improper density *r*(*θ*) ∝ 1, the surprise function will be equivalent to the posterior distribution in our applications. Given this, to find *s*∗ (Step 3), we need to find the maximum value of the posterior under the hypothesis being tested, in our case, Γ<sup>0</sup> = 0.

This maximization step is fairly simple to implement given the modeling choices made here: Gaussian likelihood, non-informative prior, and a reference density proportional to a constant. The restricted (assuming *H*) posterior distribution is

$$g\_r(\theta\_r \mid \mathbf{y}) \propto \sigma^{-(T+1)} \exp\left\{ -\frac{1}{2\sigma^2} \left[ (\Delta Y - \widehat{\Delta Y}\_r)'(\Delta Y - \widehat{\Delta Y}\_r) + (\psi\_r - \widehat{\psi}\_r)'X\_r'X\_r(\psi\_r - \widehat{\psi}\_r) \right] \right\},\tag{9}$$

in which *θ<sup>r</sup>* = (*ψr*, *σ*), *ψ<sup>r</sup>* being vector *ψ* without Γ0,

$$X\_r = \begin{bmatrix} 1 & p+1 & \Delta y\_p & \dots & \Delta y\_2 \\ 1 & p+2 & \Delta y\_{p+1} & \dots & \Delta y\_3 \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & T+p & \Delta y\_{T+p-1} & \dots & \Delta y\_{T+1} \end{bmatrix}, \quad \widehat{\psi}\_r = (X\_r'X\_r)^{-1}X\_r'\Delta Y \quad \text{and} \quad \widehat{\Delta Y}\_r = X\_r\widehat{\psi}\_r,$$

that is, $X\_r$ is simply the matrix $X$ above without its third column, since under $H : \Gamma\_0 = 0$ the coefficient of the third column of $X$ is $\Gamma\_0$—see Equation (5). Here $\widehat{\psi}\_r$ is the least squares estimator of $\psi\_r$ and $\widehat{\Delta Y}\_r$ denotes the predicted values for $\Delta Y$ given by the restricted model. From (9), it is easy to show that the maximum a posteriori (MAP) estimator of $\theta\_r$ is given by $(\widehat{\psi}\_r, \widehat{\sigma}\_r)$, with

$$
\widehat{\sigma}\_r = \sqrt{\frac{(\Delta Y - \widehat{\Delta Y}\_r)'(\Delta Y - \widehat{\Delta Y}\_r)}{T+1}}.
$$

Plugging the values of $\widehat{\psi}\_r$ and $\widehat{\sigma}\_r$ into (9), we find $s^\*$, as requested in Step 3. Step 4 will also be easy to implement thanks to the structure of the models assumed in this section. Since the full posterior, (8), is a Normal-Inverse Gamma density, a simple Gibbs sampler allows us to obtain a random sample from this distribution, suggesting a Monte Carlo approach to compute *ev*. From (8), the conditional posteriors of $\psi$ and $\sigma^2$ are, respectively,

$$g\_{\psi}(\psi \mid \sigma, \mathbf{y}) \propto N(\widehat{\psi}, \sigma^2(X'X)^{-1}) \tag{10}$$

$$g\_{\sigma^2}(\sigma^2 \mid \psi, \mathbf{y}) \propto IG\left(\frac{T+1}{2}, H\right) \tag{11}$$

in which $H = 0.5\left[(\Delta Y - \widehat{\Delta Y})'(\Delta Y - \widehat{\Delta Y}) + (\psi - \widehat{\psi})'X'X(\psi - \widehat{\psi})\right]$, $IG$ denotes the Inverse-Gamma distribution, and $\widehat{\psi}$ is the OLS estimator of $\psi$, as above. Appendix B presents the parametrization and the probability density function of the Inverse-Gamma distribution. With a sizable random sample from the full posterior, we estimate $\overline{ev}$ as the proportion of sampled vectors whose posterior density value is greater than $s^\*$, found in Step 3. Hence, in Step 5, we only compute one minus the estimate of $\overline{ev}$ found in Step 4. The whole procedure is summarized in Table 2. For the implementations in this article, we sampled 51,000 vectors from (8) and discarded the first 1,000 as a burn-in sample.
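A minimal end-to-end sketch of this procedure (simulated data; only an intercept and one lagged level, i.e., model (5) with *δ* = 0 and *p* = 1) could look as follows. Variable names, the simulation, and the chain length are our own illustration, not the paper's code:

```python
# Hypothetical sketch of the FBST unit root test for Delta y_t = mu + Gamma_0 y_{t-1} + eps_t,
# testing H: Gamma_0 = 0, with prior h(psi, sigma) proportional to 1/sigma as in Eq. (7).
import numpy as np

rng = np.random.default_rng(0)

# Simulated random walk with drift: the null Gamma_0 = 0 is true by construction.
T = 200
y = np.cumsum(0.1 + rng.normal(size=T + 1))

dY = np.diff(y)                              # Delta y_t (T values)
X = np.column_stack([np.ones(T), y[:-1]])    # columns: intercept, y_{t-1}

def log_kernel(psi, sigma):
    """Log of the posterior kernel (8): -(T+1) log(sigma) - RSS(psi) / (2 sigma^2)."""
    resid = dY - X @ psi
    return -(T + 1) * np.log(sigma) - resid @ resid / (2.0 * sigma ** 2)

# Step 3: restricted MAP under H (drop the y_{t-1} column) gives log s*.
Xr = X[:, :1]
psi_r, *_ = np.linalg.lstsq(Xr, dY, rcond=None)
rss_r = float(np.sum((dY - Xr @ psi_r) ** 2))
sigma_r = np.sqrt(rss_r / (T + 1))
log_s_star = -(T + 1) * np.log(sigma_r) - (T + 1) / 2.0

# Step 4: Gibbs sampler alternating the conditionals (10) and (11).
XtX_inv = np.linalg.inv(X.T @ X)
psi_hat = XtX_inv @ X.T @ dY
chol = np.linalg.cholesky(XtX_inv)

n_iter, burn = 6000, 1000
sigma2, above = 1.0, 0
for i in range(n_iter):
    psi = psi_hat + np.sqrt(sigma2) * (chol @ rng.normal(size=2))
    rss = float(np.sum((dY - X @ psi) ** 2))
    sigma2 = 1.0 / rng.gamma((T + 1) / 2.0, 2.0 / rss)   # IG((T+1)/2, rss/2)
    if i >= burn and log_kernel(psi, np.sqrt(sigma2)) > log_s_star:
        above += 1

ev_bar = above / (n_iter - burn)    # Monte Carlo estimate of the evidence against H
ev = 1.0 - ev_bar                   # e-value supporting H: Gamma_0 = 0
print(round(ev, 2))
```

The e-value is one minus the fraction of posterior draws whose (log) density exceeds log *s*\*; working on the log scale avoids underflow for large *T*.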


**Table 2.** Pseudocode to implement the FBST to unit root tests.

#### *Results*

We implemented the FBST as described above to test for the presence of unit roots in 14 U.S. macroeconomic time series, all with annual frequency, first analyzed in [27]. We used the extended series analyzed in [23]. Appendix A provides more information on the data set and on the computational resources and codes used to obtain the results displayed in Table 3 below.

Table 3 reports the names of the tested series, the number of available observations (sample size), the adopted value for *p*—as denoted in Equation (5)—whether a (deterministic) linear trend was included in the model, the ADF test statistic, and its respective *p*-value. We used the computer package described in [29], available in the R library urca, to find the ADF *p*-values. The last two columns report the posterior probability of non-stationarity, $\Gamma\_0 \geq 0$, and the FBST e-values for the specified models. In order to obtain comparable results, we adopted the models chosen by [22] for all the series. All the models considered the intercept or constant term, *μ* in (5).

The results show that the non-stationarity posterior probabilities are quite distant from the ADF *p*-values. These differences were highlighted in [18,19]. Considering the simplest AR(1) model, they argued that frequentist inference is based on the distribution of $\widehat{\rho} \mid \rho = 1$, which is skewed and may therefore lead to counterintuitive conclusions; their main argument is that Bayesian inference uses a distribution (the marginal posterior of $\rho$) that is not skewed.

As mentioned before, ref. [20] claims that the difference in results between frequentist and Bayesian approaches is due to the flat prior that puts much weight on the stationary region. He proposed the use of Jeffreys priors, which restored the conclusions drawn by the frequentist test. Phillips argued that the flat prior was, actually, informative when used in time series models like those for unit root tests. Using simulations, he shows that *" [the use of a] flat prior has a tendency to bias the posterior towards stationarity. ... even when [the estimate] is close to unity, there may still be a non negligible downward bias in the [flat] posterior probabilities"*. Notwithstanding, the e-values reported in the last column are quite close to the ADF *p*-values even using the flat prior criticized by Phillips.

**Table 3.** Unit root tests for the extended Nelson and Plosser data set.


#### **4. Bayesian Cointegration Tests**

Before starting our brief review of the most relevant Bayesian cointegration tests, we fix notation and present the definitions to which we will refer in the sequel.

All the tests mentioned here are based on the following multivariate framework. Let $\mathbf{Y}\_t = [y\_{1t} \dots y\_{nt}]'$ be a vector of $n \in \mathbb{N}$ time series, all of them assumed to be integrated of order $d \in \mathbb{N}$, i.e., to have $d$ unit roots. The series are said to be cointegrated if there is a nontrivial linear combination of them that has $b \in \mathbb{N}$ unit roots, $b < d$. We will assume that, as in most applications, $d = 1$ and $b = 0$, meaning that, if the time series in $\mathbf{Y}\_t$ are cointegrated, there is a linear combination $\mathbf{a}'\mathbf{Y}\_t$ that is stationary, where $\mathbf{a} \in \mathbb{R}^n$ is the cointegrating vector. Since the linear combination $\mathbf{a}'\mathbf{Y}\_t$ is often motivated by problems found in economics, it is called a long-run equilibrium relationship. The explanation is that non-stationary time series related by a long-run relationship cannot drift too far from the equilibrium, because economic forces will act to restore the relationship.

Notice also that: (i) the cointegrating vector is not uniquely determined since, for any scalar *s*, (*s* · **a**) is a cointegrating vector; and (ii) if **Y***<sup>t</sup>* has more than two series, it is possible that there is more than one cointegrating vector generating a stationary linear combination.
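For intuition, a quick simulation (entirely synthetic, not from the paper) builds two I(1) series from a common random-walk trend, for which **a** = (1, −1)′ is a cointegrating vector:

```python
# Synthetic illustration: two I(1) series sharing a common stochastic trend,
# so the combination a'Y_t with a = (1, -1)' is stationary.
import numpy as np

rng = np.random.default_rng(1)
T = 500
trend = np.cumsum(rng.normal(size=T))    # common random-walk component
y1 = trend + rng.normal(size=T)          # I(1): dispersion grows with t
y2 = trend + rng.normal(size=T)          # I(1), same stochastic trend
combo = y1 - y2                          # stationary: the trends cancel

print(round(float(np.var(y1)), 1))       # large, trend-dominated sample variance
print(round(float(np.var(combo)), 1))    # close to Var(noise1) + Var(noise2) = 2
```

This also illustrates remark (i) above: any scalar multiple of (1, −1)′ cancels the common trend equally well.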

It is assumed that the data generating process of **Y***<sup>t</sup>* is described by the following vector autoregression with *<sup>p</sup>* <sup>∈</sup> <sup>N</sup> lags, denoted VAR(p), and given by:

$$\mathbf{Y}\_t = \mathbf{c} + \Phi\_0\mathbf{D}\_t + \Phi\_1\mathbf{Y}\_{t-1} + \dots + \Phi\_p\mathbf{Y}\_{t-p} + \mathbf{E}\_t \tag{12}$$

in which **c** is an (*n* × 1) vector of constants, **D***<sup>t</sup>* an (*n* × 1) vector of deterministic variables, such as deterministic trends or seasonal dummies, Φ*<sup>i</sup>* are (*n* × *n*) coefficient matrices, and **E***<sup>t</sup>* is an (*n* × 1) stochastic vector with a multivariate normal distribution with null expected value and covariance matrix Ω, denoted *Nn*(**0**, Ω). This dynamic model is assumed valid for *t* = 1, ... , *T* + *p*, the available span of observations of **Y***t*. As in the univariate case, one may include moving average terms in (12), i.e., lags of **E***t*, but this, in principle, would not cause major problems in the Bayesian framework. Model (12) can be rewritten using the lag or backshift operator as

$$(I\_n - \Phi\_1 B - \dots - \Phi\_p B^p)\mathbf{Y}\_t = \mathbf{c} + \Phi\_0 \mathbf{D}\_t + \mathbf{E}\_t. \tag{13}$$

where $\Phi(B) = I\_n - \Phi\_1 B - \dots - \Phi\_p B^p$ is the (multivariate) autoregressive polynomial and $I\_n$ denotes the $n$-dimensional identity matrix. The associated characteristic polynomial in this context is the determinant of $\Phi(z)$, $z \in \mathbb{C}$. If all the roots of the characteristic polynomial lie outside the unit circle, it is possible to show that $\mathbf{Y}\_t$ has a stationary representation—see [30]—such as Equation (12). In order to determine whether this is the case, model (12) is rewritten as a (vectorial) error correction model (VECM):

$$
\Delta \mathbf{Y}\_t = \mathbf{c} + \Phi\_0 \mathbf{D}\_t + \Gamma\_1 \Delta \mathbf{Y}\_{t-1} + \dots + \Gamma\_{p-1} \Delta \mathbf{Y}\_{t-p+1} + \Pi \mathbf{Y}\_{t-1} + \mathbf{E}\_t \tag{14}
$$

where $\Delta\mathbf{Y}\_t = [\Delta y\_{1t} \dots \Delta y\_{nt}]'$, $\Gamma\_i = -(\Phi\_{i+1} + \dots + \Phi\_p)$ for $i = 1, 2, \dots, p-1$, and $\Pi = -\Phi(1) = -(I\_n - \Phi\_1 - \dots - \Phi\_p)$. It is possible to show that, when all the roots of $\det(\Phi(z))$ are outside the unit circle, the matrix $\Pi$ in (14) has full rank, i.e., all the $n$ eigenvalues of $\Pi$ are non-null. If the rank of $\Pi$ is null, this matrix cannot be distinguished from a null matrix, implying that the series in $\mathbf{Y}\_t$ have at least one unit root and that a valid representation is a VAR of order $p-1$ in the differences, i.e., model (14) without the term $\Pi\mathbf{Y}\_{t-1}$. It is also possible that the series in $\mathbf{Y}\_t$ have two unit roots each, implying that the correct VECM must be written with $\Delta^2\mathbf{Y}\_t$ as the dependent variable.

Finally, if the $(n \times n)$ matrix $\Pi$ has rank $r$, $0 < r < n$, it has $r$ non-null eigenvalues (and $n - r$ null ones), implying that the series in $\mathbf{Y}\_t$ have at least one unit root and that their valid representation is given by the VECM in Equation (14). In this case, $\Pi = \alpha\beta'$, where $\alpha$ and $\beta$ are $(n \times r)$ matrices of rank $r$. Matrix $\beta$ contains the cointegrating vectors, and matrix $\alpha$ is called the loading matrix, since it contains the weights of the equilibrium relationships. The tests developed in [2] focus on the rank of the matrix $\Pi$.
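The eigenvalue statement above can be checked directly on a made-up example with *n* = 3 and *r* = 1:

```python
# Illustrative check (made-up numbers): Pi = alpha beta' with n = 3, r = 1 has
# exactly r non-null eigenvalues.
import numpy as np

alpha = np.array([[0.5], [-0.2], [0.1]])   # loading matrix, (n x r)
beta = np.array([[1.0], [-1.0], [0.0]])    # cointegrating vector, (n x r)
Pi = alpha @ beta.T                        # long run matrix, (n x n), rank r

print(np.linalg.matrix_rank(Pi))                          # 1
nonnull = int(np.sum(np.abs(np.linalg.eigvals(Pi)) > 1e-10))
print(nonnull)                                            # 1
```

The single non-null eigenvalue equals the trace β′α; the other two are numerically zero, as the rank-one structure requires.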

The pioneer Bayesian works to study VAR models and reduced rank regressions are [31–33]. However, the main concern of these papers is to estimate the model parameters and their (marginal) posterior distributions. The usual approach is to assume a given rank for the long run matrix Π, and proceed with all the computations conditional on the given rank. The Bayesian initiatives to test the rank of the referred matrix are recent, the main reference for Bayesian inference on VECM's being [34].

To justify inferential procedures based on prespecified ranks of the matrix Π, [22] argued that an empirical cointegration analysis should be based on economic theory, which proposes models obeying equilibrium relationships. According to this view, cointegration research should be "confirmatory" rather than "exploratory". Even though the advocated conditional inference is simple to implement and very useful for small samples, [22] recognized that tests for the rank of the matrix Π should be developed. To our knowledge, few initiatives with this purpose have been developed so far.

One common approach to test sharp hypotheses in the Bayesian framework is by means of Bayes factors. Testing the rank of the matrix Π by Bayes factors entails several computational complications and requires the use of proper priors, as shown in [35]. Following an informal approach, [33] obtained the posterior distribution of the ordered eigenvalues of the "squared" long run matrix, $\Pi\Pi'$, obtained from a VAR model without assuming the existence of cointegration relations. If the long run matrix has reduced rank, it has some null eigenvalues, which should be revealed by the smallest eigenvalues accumulating a lot of probability mass on values close to zero. The computations can be made straightforwardly by simulating values for the long run matrix from its (marginal) posterior distribution, which is a matrix Student-*t* distribution under the non-informative prior (16), also considered in the sequel.

Another common procedure is to estimate the rank of Π as the value *r* that maximizes the (marginal) posterior distribution of the rank. Conditioned on such an estimate, one proceeds to derive the full posterior and eventually estimate the cointegration space, i.e., the linear space spanned by *β*.

A different approach was proposed by [36], who used the Posterior Information Criterion (PIC), developed in [37], as a criterion to choose the mode of the posterior distribution of the rank of Π. However, as highlighted in [34], one of the advantages of the Bayesian approach is the possibility of incorporating the uncertainty about the parameters into the analysis, represented by the posterior distribution of the rank; whatever tool the scientist uses to infer the value of *r*, it is derived from this posterior distribution.

The authors of [38] nested the reduced rank models in an unrestricted VAR and used Metropolis–Hastings sampling with the Savage–Dickey density ratio—see [39]—to estimate the Bayes factors of all the models with incomplete rank up to the model with full rank. The Bayes factor derivation requires the estimation of an error correction factor for the incomplete rank models. This factor, however, is not defined for improper priors due to a problem known as the *Bartlett paradox*, which arises whenever one compares models of different dimensions. The difficulty is relevant in the present case because, after deriving the rank posterior density, one may consider that models of different dimensions are being compared. The paradox can be stated informally as: improper priors should be avoided when computing Bayes factors (except for parameters common to both models), as the factors then depend on arbitrary constants (that are integrals).

More recently, [40] developed an efficient procedure to obtain the posterior distribution of the rank using a uniform proper prior over the cointegration space linearly normalized. The author derived solutions for the posterior probabilities for the null rank and for the full rank of Π. The posterior probabilities of each intermediate rank are derived from the posterior samples of the matrices that compose the long run matrix (*α* and *β*), properly normalized, under each rank and using the method proposed by [41].

#### **5. Implementing the FBST as a Cointegration Test**

This section describes how to implement the FBST to test for cointegration. We will proceed in the same spirit as Section 3, i.e., describing the steps given in Table 1 to implement the test for cointegration.

*Entropy* **2020**, *22*, 968

Let us begin by recalling the VECM given by Equation (14):

$$
\Delta \mathbf{Y}\_t = \mathbf{c} + \Phi\_0 \mathbf{D}\_t + \Gamma\_1 \Delta \mathbf{Y}\_{t-1} + \dots + \Gamma\_{p-1} \Delta \mathbf{Y}\_{t-p+1} + \Pi \mathbf{Y}\_{t-1} + \mathbf{E}\_t \tag{14}
$$

*t* = 1, . . . , *T* + *p*, in which $\mathbf{E}\_t \overset{i.i.d.}{\sim} N\_n(\mathbf{0}, \Omega)$, with **0** a null vector of dimension *n* × 1 and Ω a symmetric positive definite real matrix. Notice that these assumptions already specify the statistical model (Gaussian) and its implied likelihood. Before giving it explicitly, let us rewrite Equation (14) using matrix notation:

$$
\Delta \mathbf{Y} = \mathbf{Z} \cdot \boldsymbol{\eta} + \mathbf{E} \tag{15}
$$

$$\text{where } \Delta\mathbf{Y} = \begin{bmatrix} \Delta\mathbf{Y}\_{p+1}' \\ \Delta\mathbf{Y}\_{p+2}' \\ \vdots \\ \Delta\mathbf{Y}\_{T+p}' \end{bmatrix}, \mathbf{Z} = \begin{bmatrix} 1 & \mathbf{D}\_{p+1}' & \Delta\mathbf{Y}\_{p}' & \dots & \Delta\mathbf{Y}\_{2}' & \mathbf{Y}\_{p}' \\ 1 & \mathbf{D}\_{p+2}' & \Delta\mathbf{Y}\_{p+1}' & \dots & \Delta\mathbf{Y}\_{3}' & \mathbf{Y}\_{p+1}' \\ \vdots & \vdots & \vdots & & \vdots & \vdots \\ 1 & \mathbf{D}\_{T+p}' & \Delta\mathbf{Y}\_{T+p-1}' & \dots & \Delta\mathbf{Y}\_{T+1}' & \mathbf{Y}\_{T+p-1}' \end{bmatrix}, \boldsymbol{\eta} = \begin{bmatrix} \mathbf{c}' \\ \Phi\_0' \\ \Gamma\_1' \\ \vdots \\ \Gamma\_{p-1}' \\ \Pi' \end{bmatrix}$$

and the error vector is given by **E** ∼ *MNT*×*n*(0, *IT*, Ω), denoting the matrix normal distribution. See Appendix B for more information on this distribution. Now the parameter vector is given by Θ = (*η*, Ω).

Notice that Δ**Y** is formed by piling up the *T* transposed vectors Δ**Y***t*, thus resulting in a matrix with *T* rows and *n* columns (*n* is the number of time series in vector **Y***t*), which are also the dimensions of matrix **E**. Matrix **Z** is constructed likewise—always piling up the transposed vectors—resulting in a matrix with *T* rows and *pn* + *n* + 1 columns. Finally, matrix *η* stacks the matrices of coefficients, resulting in a matrix with *pn* + *n* + 1 rows and *n* columns.
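As a concrete illustration of this stacking, the sketch below builds ΔY and Z in Python/NumPy (the authors' own routines are written in Matlab/Octave, as noted in Appendix A). It is a simplified sketch in which the deterministic block D*t* is omitted, so this Z has (*p* − 1)*n* + *n* + 1 columns rather than *pn* + *n* + 1:

```python
import numpy as np

def build_design(Y, p):
    """Stack the matrices of Equation (15) for a VECM of lag order p.
    Y has shape (T + p, n). The deterministic block D_t is omitted here,
    so Z has 1 + (p - 1) * n + n columns."""
    Tp, n = Y.shape
    dY = np.diff(Y, axis=0)          # dY[i] = Y[i+1] - Y[i] (0-based indexing)
    rows = []
    for t in range(p + 1, Tp + 1):   # 1-based time index t = p+1, ..., T+p
        lags = [dY[t - ell - 2] for ell in range(1, p)]   # dY'_{t-1}, ..., dY'_{t-p+1}
        rows.append(np.concatenate([[1.0], *lags, Y[t - 2]]))  # last block: Y'_{t-1}
    dY_stack = dY[p - 1:]            # rows dY'_{p+1}, ..., dY'_{T+p}
    return dY_stack, np.array(rows)
```

Each row of Z then multiplies the stacked coefficient matrix *η*, reproducing one equation of the system.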

Given the assumptions above, Δ**Y** ∼ *MNT*×*n*(**Z** · *η*, *IT*, Ω), implying that the likelihood is

$$L(\boldsymbol{\Theta} \mid \mathbf{y}) \propto |\boldsymbol{\Omega}|^{-T/2} \exp\left\{ -\frac{1}{2} \cdot \text{tr}[\boldsymbol{\Omega}^{-1} (\boldsymbol{\Delta} \mathbf{Y} - \mathbf{Z} \cdot \boldsymbol{\eta})' (\boldsymbol{\Delta} \mathbf{Y} - \mathbf{Z} \cdot \boldsymbol{\eta})] \right\}$$

where **y** denotes the set of observed values of vectors **Y***<sup>t</sup>* for *t* = 1, ... , *T* + *p*. As in Section 3, we will consider an improper prior for Θ, given by

$$h(\Theta) = h(\eta, \Omega) \propto |\Omega|^{-(n+1)/2}, \tag{16}$$

and our reference density, *r*(Θ), will be proportional to a constant, leading to a surprise function equivalent to the (full) posterior distribution. These choices correspond to steps 1 and 2 of Table 1. These modeling choices imply the following posterior density:

$$\begin{aligned} g(\boldsymbol{\Theta} \mid \mathbf{y}) & \propto |\Omega|^{-(T+n+1)/2} \exp\left\{-\frac{1}{2} \cdot \text{tr}[\boldsymbol{\Omega}^{-1} (\boldsymbol{\Delta} \mathbf{Y} - \mathbf{Z} \cdot \boldsymbol{\eta})' (\boldsymbol{\Delta} \mathbf{Y} - \mathbf{Z} \cdot \boldsymbol{\eta})] \right\} \\ &= |\Omega|^{-(T+n+1)/2} \exp\left\{-\frac{1}{2} \cdot \text{tr}\{\boldsymbol{\Omega}^{-1} [\mathbf{S} + (\boldsymbol{\eta} - \hat{\boldsymbol{\eta}})' \cdot \mathbf{Z}' \mathbf{Z} \cdot (\boldsymbol{\eta} - \hat{\boldsymbol{\eta}})] \} \right\} \end{aligned} \tag{17}$$

where $\hat{\eta} = (\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\Delta\mathbf{Y}$ and $\mathbf{S} = \Delta\mathbf{Y}'\Delta\mathbf{Y} - \Delta\mathbf{Y}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\Delta\mathbf{Y}$.

To implement Step 3 of Table 1, we need to find the maximum a posteriori of (17) under the constraint Θ ∈ Θ*H*, i.e., we need to maximize the posterior in Θ*H*. Since we are testing the rank of matrix Π, as discussed in the beginning of Section 4, it is necessary to maximize the posterior assuming the rank of Π is *r*, 0 ≤ *r* ≤ *n*. Thanks to the modeling choices made here—Gaussian likelihood and Equation (16) as prior—our posterior is almost identical to a Gaussian likelihood, allowing us to find this maximum using a strategy similar to that proposed by [2], who derived the maximum of the (Gaussian) likelihood function assuming a reduced rank for Π. We will summarize Johansen's algorithm, providing in Appendix C a heuristic argument for why it indeed provides the maximum value of the posterior under the assumed hypotheses.


We begin by estimating a VAR(*p* − 1) model for Δ**Y***<sup>t</sup>* with all the explanatory variables shown in (14) except for **Y***t*−1. Using the matrix notation established above, this corresponds to estimating

$$
\Delta \mathbf{Y} = \mathbf{Z}\_1 \cdot \eta\_1 + \mathbf{U}\_2
$$

$$\text{where } \mathbf{Z}\_1 = \begin{bmatrix} 1 & \mathbf{D}\_{p+1}' & \Delta\mathbf{Y}\_{p}' & \dots & \Delta\mathbf{Y}\_{2}' \\ 1 & \mathbf{D}\_{p+2}' & \Delta\mathbf{Y}\_{p+1}' & \dots & \Delta\mathbf{Y}\_{3}' \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & \mathbf{D}\_{T+p}' & \Delta\mathbf{Y}\_{T+p-1}' & \dots & \Delta\mathbf{Y}\_{T+1}' \end{bmatrix} \text{ and } \eta\_1 = \begin{bmatrix} \mathbf{e}' \\ \tau\_0 \\ \nu\_1 \\ \vdots \\ \nu\_{p-1} \end{bmatrix},$$

showing that **Z**1 is obtained from matrix **Z** by removing its last *n* columns, exactly those corresponding to **Y***t*−1.

We also estimate a second set of auxiliary equations, regressing **Y***t*−1 on a vector of constants and **D***t*, Δ**Y***t*−1, . . . , Δ**Y***t*−*p*+1. By piling up the (transposed) vectors $\mathbf{Y}\_{t-1}'$ for *t* = *p* + 1, . . . , *T* + *p*, we obtain a (*T* × *n*) matrix, denoted by **Y**−1. As above, these equations can be represented by

$$\mathbf{Y}\_{-1} = \mathbf{Z}\_1 \cdot \eta\_2 + \mathbf{V}\_1$$
 
$$\text{where } \mathbf{Y}\_{-1} = \begin{bmatrix} \mathbf{Y}\_p' \\ \mathbf{Y}\_{p+1}' \\ \vdots \\ \mathbf{Y}\_{T+p-1}' \end{bmatrix} \text{ and } \eta\_2 = \begin{bmatrix} \mathbf{m}' \\ \mathbf{v}\_0 \\ \Xi\_1 \\ \vdots \\ \Xi\_{p-1} \end{bmatrix}.$$

Considering the OLS estimates of these sets of equations and their respective estimated residuals, we may write

$$
\Delta \mathbf{Y} = \mathbf{Z}\_1 \cdot \hat{\eta}\_1 + \hat{\mathbf{U}} \tag{18}
$$

$$
\mathbf{Y}\_{-1} = \mathbf{Z}\_1 \cdot \hat{\eta}\_2 + \hat{\mathbf{V}} \tag{19}
$$

where $\hat{\eta}\_1 = (\mathbf{Z}\_1'\mathbf{Z}\_1)^{-1}\mathbf{Z}\_1' \cdot \Delta\mathbf{Y}$, $\hat{\eta}\_2 = (\mathbf{Z}\_1'\mathbf{Z}\_1)^{-1}\mathbf{Z}\_1' \cdot \mathbf{Y}\_{-1}$, and $\hat{\mathbf{U}}$ and $\hat{\mathbf{V}}$ are the respective matrices of estimated residuals. Thanks to the Frisch–Waugh–Lovell theorem—see [42], Theorem 3.3, or [43], Section 2.4—it is possible to show that the estimated residuals of these auxiliary regressions are related by Π in the following regression:

$$
\hat{\mathbf{U}} = \hat{\mathbf{V}}\,\Pi' + \mathbf{W}. \tag{20}
$$

One can prove that the OLS estimates of Π obtained from (15) and from (20) are numerically identical, as are the estimated residuals $\hat{\mathbf{E}}$ and $\mathbf{W}$.

The second stage of Johansen's algorithm requires the computation of the following sample covariance matrices of the OLS residuals obtained above:

$$\begin{aligned} \hat{\Sigma}\_{\mathbf{V}\mathbf{V}} &= \frac{1}{T} \cdot \hat{\mathbf{V}}'\hat{\mathbf{V}} & \hat{\Sigma}\_{\mathbf{U}\mathbf{U}} &= \frac{1}{T} \cdot \hat{\mathbf{U}}'\hat{\mathbf{U}} \\ \hat{\Sigma}\_{\mathbf{U}\mathbf{V}} &= \frac{1}{T} \cdot \hat{\mathbf{U}}'\hat{\mathbf{V}} & \hat{\Sigma}\_{\mathbf{V}\mathbf{U}} &= \hat{\Sigma}\_{\mathbf{U}\mathbf{V}}' \end{aligned}$$

and, from these, we find the *n* eigenvalues of matrix

$$
\hat{\Sigma}\_{\mathbf{V}\mathbf{V}}^{-1} \cdot \hat{\Sigma}\_{\mathbf{V}\mathbf{U}} \cdot \hat{\Sigma}\_{\mathbf{U}\mathbf{U}}^{-1} \cdot \hat{\Sigma}\_{\mathbf{U}\mathbf{V}},
$$

ordering them decreasingly: $\hat{\lambda}\_1 > \hat{\lambda}\_2 > \dots > \hat{\lambda}\_n$. The maximum value attained by the log-posterior subject to the constraint that there are *r* (0 ≤ *r* ≤ *n*) cointegration relationships is

$$\ell^\* = K - \frac{(T+n+1)}{2} \cdot \log|\hat{\Sigma}\_{\mathbf{U}\mathbf{U}}| - \frac{(T+n+1)}{2} \cdot \sum\_{i=1}^r \log(1 - \hat{\lambda}\_i),\tag{21}$$

where *K* is a constant that depends only on *T*, *n*, and **y** by means of the marginal distribution of the data set **y**. Since $\ell^\*$ represents the maximum of the log-posterior, to obtain *s*∗ one should take $s^\* = \exp(\ell^\*)$, completing step 3 of Table 1.
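The two-stage computation just described can be sketched as follows in Python/NumPy; the function name and the absorbed constant *K* are illustrative choices, not part of the original implementation:

```python
import numpy as np

def log_posterior_max(dY, Z1, Ym1, r):
    """Maximum of the log-posterior under rank(Pi) = r, Equation (21),
    up to the additive constant K.
    dY:  (T, n) stacked Delta Y';  Z1: (T, k1) regressors without Y_{t-1};
    Ym1: (T, n) stacked Y'_{t-1}."""
    T, n = dY.shape
    # Stage 1: OLS residuals of the auxiliary regressions (18) and (19)
    U = dY - Z1 @ np.linalg.lstsq(Z1, dY, rcond=None)[0]    # U-hat
    V = Ym1 - Z1 @ np.linalg.lstsq(Z1, Ym1, rcond=None)[0]  # V-hat
    # Stage 2: sample covariances and eigenvalues of
    # Svv^{-1} . Svu . Suu^{-1} . Suv
    Suu, Svv, Suv = U.T @ U / T, V.T @ V / T, U.T @ V / T
    M = np.linalg.solve(Svv, Suv.T) @ np.linalg.solve(Suu, Suv)
    lam = np.sort(np.linalg.eigvals(M).real)[::-1]   # lambda-hat, decreasing
    c = (T + n + 1) / 2.0
    return -c * np.log(np.linalg.det(Suu)) - c * np.sum(np.log(1.0 - lam[:r]))
```

Because each term log(1 − λ̂*i*) is non-positive, the returned maxima are non-decreasing in *r*, which is the nestedness property remarked on in the sequel.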

As in Section 3, we compute *ev* in step 4 by means of a Monte Carlo algorithm. It is easy to factor the full posterior (17) as a product of a (matrix) normal and an Inverse-Wishart, suggesting a Gibbs sampler to generate random samples from the full posterior. See Appendix B for more on the Inverse-Wishart distribution. Thus, the conditional posteriors for *η* and Ω are, respectively,

$$g\_{\eta}(\eta \mid \Omega, \mathbf{y}) \propto \text{MN}\_{k \times n}(\hat{\eta}, (\mathbf{Z}'\mathbf{Z})^{-1}, \Omega) \tag{22}$$

$$g\_{\Omega}(\Omega \mid \eta, \mathbf{y}) \propto I\mathcal{W}(\Omega \mid \mathbf{S} + (\eta - \hat{\eta})' \cdot \mathbf{Z}'\mathbf{Z} \cdot (\eta - \hat{\eta}), T) \tag{23}$$

where $\mathbf{S} = \Delta\mathbf{Y}'\Delta\mathbf{Y} - \Delta\mathbf{Y}'\mathbf{Z}(\mathbf{Z}'\mathbf{Z})^{-1}\mathbf{Z}'\Delta\mathbf{Y}$, *IW* denotes the Inverse-Wishart, *k* = *pn* + *n* + 1 is the number of rows of *η*, and $\hat{\eta}$ is its OLS estimator, as above. From a Gibbs sampler set with these conditionals, we obtain a random sample from the full posterior to estimate *ev* as the proportion of sampled vectors that generate a value for the posterior greater than *s*∗. Finally, we obtain $\overline{ev} = 1 - ev$ in the final step (5). The whole implementation for cointegration tests, following the assumptions made in this section, is summarized in Table 4. See Appendix A for more information on the computational resources needed to implement the steps given by Table 4.

**Table 4.** Pseudocode to implement the FBST to cointegration tests.


1. Statistical model: Gaussian; prior: *h*(Θ) ∝ |Ω| <sup>−</sup>(*n*+1)/2.

2. Reference density: *r*(Θ) ∝ 1; relative surprise function: *g*(Θ | **y**).

3. Find *s*∗: Johansen's algorithm; obtain $\ell^\*$ from Equation (21) and set $s^\* = \exp(\ell^\*)$.

4. Gibbs sampler (from Equations (22) and (23)) to obtain *N* random samples of parameter vectors from (17). Evaluate the posterior at the sampled vectors and estimate *ev* as the proportion of *N* for which the evaluated values are larger than *s*∗.

5. Find $\overline{ev} = 1 - ev$.
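Step 4 of the pseudocode above can be sketched in Python/NumPy as below, assuming integer degrees of freedom *T* ≥ *n* so that the Inverse-Wishart draw can be obtained by inverting a Wishart draw; for numerical stability, the comparison with *s*∗ is made on the log scale, so the function takes log *s*∗ (i.e., ℓ∗) as input. The function name is illustrative:

```python
import numpy as np

def gibbs_evalue(dY, Z, log_s_star, n_iter=1000, burn=100, seed=0):
    """Gibbs sampler for the posterior (17) via the conditionals (22)-(23).
    Returns the estimated ev: the proportion of sampled (eta, Omega) at which
    the log-posterior kernel exceeds log_s_star."""
    rng = np.random.default_rng(seed)
    T, n = dY.shape
    k = Z.shape[1]
    ZtZ = Z.T @ Z
    ZtZ_inv = np.linalg.inv(ZtZ)
    eta_hat = ZtZ_inv @ Z.T @ dY                     # OLS estimator
    S = dY.T @ dY - dY.T @ Z @ ZtZ_inv @ Z.T @ dY    # residual cross-product
    L_row = np.linalg.cholesky(ZtZ_inv)              # row-covariance factor

    def log_kernel(eta, Om):
        R = dY - Z @ eta
        return (-(T + n + 1) / 2 * np.linalg.slogdet(Om)[1]
                - 0.5 * np.trace(np.linalg.solve(Om, R.T @ R)))

    Om = S / T
    hits = kept = 0
    for it in range(n_iter):
        # eta | Omega, y ~ MN(eta_hat, (Z'Z)^{-1}, Omega)              [Eq. (22)]
        eta = eta_hat + L_row @ rng.standard_normal((k, n)) @ np.linalg.cholesky(Om).T
        # Omega | eta, y ~ IW(S + (eta - eta_hat)'Z'Z(eta - eta_hat), T)  [Eq. (23)]
        scale = S + (eta - eta_hat).T @ ZtZ @ (eta - eta_hat)
        A = np.linalg.cholesky(np.linalg.inv(scale))
        X = A @ rng.standard_normal((n, T))          # Wishart(T, scale^{-1}) draw
        Om = np.linalg.inv(X @ X.T)                  # Inverse-Wishart draw
        if it >= burn:
            kept += 1
            hits += log_kernel(eta, Om) > log_s_star
    return hits / kept
```

The matrix normal draw uses the factorization η = η̂ + A E B′ with AA′ = (Z′Z)⁻¹ and BB′ = Ω, following the characterization in Appendix B.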

Before presenting the results of the procedure applied to real data sets, it is important to remark on one feature of the FBST applied to cointegration tests. The estimated eigenvalues of matrix Π, $\hat{\lambda}\_i$, correspond to the squared canonical correlations between Δ**Y***<sup>t</sup>* and **Y**−1, corrected for the variables in **Z**1, and therefore lie between 0 and 1. Therefore, (21) shows that $\ell^\*\_0 \le \ell^\*\_1 \le \dots \le \ell^\*\_n$, where $\ell^\*\_r$ denotes the maximum of the posterior (17) assuming Π has rank *r*, 0 ≤ *r* ≤ *n*. Therefore, one may say that the hypotheses rank(Π) = *r* are nested, in the sense that the respective e-values obtained by the FBST for these hypotheses are always non-decreasing: *ev*(0) ≤ *ev*(1) ≤ . . . ≤ *ev*(*n*).

This nested structure is also present in the frequentist procedure proposed by [2], based on the likelihood ratio statistics for successive ranks of Π. Thus, the FBST should be used, like the maximum eigenvalue test, in a sequential procedure to test for the number of cointegrating relationships. We show how this is done when presenting the applied results below.

#### *Results*

Now we present, by means of four examples, the application of FBST as a cointegration test. In all the examples, we have adopted a Gaussian likelihood and the improper prior (16). The Gibbs sampler was implemented as described above, providing 51,000 random vectors from the posterior (17). The first 1000 samples were discarded as a burn-in sample, the remaining 50,000 being used to estimate the integral (2). The tables show the e-value computed from the FBST and the maximum eigenvalue test statistics with their respective *p*-values.

**Example 1.** *We analyzed four electroencephalography (EEG) signals from a subject who had previously presented epileptic seizures. The original study, [44], aimed at detecting seizures based on multiple hours of recordings for each individual, and the cointegration analysis of the mentioned signals was presented by [45]. In fact, the cointegration hypothesis is tested using the phase processes estimated from the original signals. This is done by passing the signal through the Hilbert transform and then "unwrapping" the resulting transform. Sections 2 and 5 of [45,46] provide more details on the Hilbert transform and unwrapping.*

The labels of the modeled series refer to the electrodes on the scalp. As seen in Figures 1 and 2, the series are called FP1-F7, FP1-F3, FP2-F4, and FP2-F8, where FP refers to the frontal lobes and F refers to a row of electrodes placed behind these. Even-numbered electrodes are on the right side and odd-numbered electrodes are on the left side. The electrodes for these four signals mirror each other on the left and right sides of the scalp. The recordings of the studied subject, an 11-year-old female, identified a seizure in the interval (measured in seconds) [2996, 3036]. Therefore, like [45], we analyze the period of 41 seconds prior to the seizure—interval [2956, 2996]—and the subsequent 41 seconds—interval [2996, 3036]—the seizure period. In the sequel, we will refer to these as *prior to seizure* and *during seizure*, respectively. Since the sampling frequency is 256 measurements per second, there is a total of 10,496 measurements for each of the four signals. Ref. [45] used 40 seconds for each period, obtaining slightly different results.

Figures 1 and 2 display the estimated phases based on the original signals. The model proposed by [45] is a VAR(1), resulting in a VECM given by

$$
\Delta \mathbf{Y}\_t = \mathbf{c} + \Pi \mathbf{Y}\_{t-1} + \mathbf{E}\_t. \tag{24}
$$

Tables 5 and 6 present the results that essentially lead to the same conclusions obtained by [45], even though they have based their findings on the trace test. See Table 8 of [45].

The comparison between *p*-values and FBST e-values must be made carefully, mainly because *p*-values are not measures of support for the null hypothesis, while e-values provide exactly this kind of support. That being said, a possible way to compare them is by checking the decision their use recommends regarding the hypothesis being tested, i.e., whether or not to reject the null hypothesis.

**Figure 1.** Estimated phase processes prior to a seizure.

**Figure 2.** Estimated phase processes during a seizure.

**Table 5.** FBST and max. eig. test: prior to seizure.


**Table 6.** FBST and max. eig. test: during seizure.


Frequentist tests often adopt a significance level approach: given an observed *p*-value, the hypothesis is rejected if the *p*-value is smaller than or equal to the mentioned level, usually 0.1, 0.05, or 0.01. Since the cointegration ranks generate nested likelihoods, the hypotheses are tested sequentially, starting with the null rank, *r* = 0. For Table 5, adopting a 0.01 significance level, the maximum eigenvalue test would reject *r* = 0 and *r* = 1, and would not reject *r* = 2. The same conclusions follow for Table 6. Thus, the recommended action is to work, for estimation purposes for instance, assuming two cointegration relationships.

The question of which threshold value to adopt for the FBST was already mentioned in Section 1.1, but it is worthwhile to underline it once more. We highly recommend a principled approach, deriving the cut-off value from a loss function specific to the problem at hand and the purposes of the analysis. A naive but simpler approach would be to reject the hypothesis if the e-value is smaller than 0.05 or 0.01, emulating the frequentist strategy. Even though we do not recommend this path, since *p*-values are not supporting measures for the hypothesis being tested while e-values are, the researcher may numerically compare *p*-values and e-values in a specific scenario. If the researcher derived the *p*-values from a generalized likelihood ratio test, it is possible to compare them asymptotically. The relationship is $ev = 1 - F\_m[F\_{m-h}^{-1}(1 - \mathbf{p})]$, where *m* is the dimension of the full parameter space, *h* the dimension of the parameter space under the null hypothesis, $F\_m$ the chi-square distribution function with *m* degrees of freedom, and **p** the corresponding *p*-value. See [9,12] for the proof of the asymptotic relationship between e-values and *p*-values.
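This conversion can be coded with the Python standard library alone whenever *m* and *m* − *h* are even (the chi-square distribution function then has a closed form); the degrees of freedom in the usage note below are illustrative and do not correspond to any table in the text:

```python
import math

def chi2_cdf(x, m):
    """Chi-square CDF F_m(x); closed form valid for even degrees of freedom m."""
    assert m > 0 and m % 2 == 0, "closed form requires even df"
    half, term, total = x / 2.0, 1.0, 1.0
    for i in range(1, m // 2):
        term *= half / i
        total += term
    return 1.0 - math.exp(-half) * total

def chi2_ppf(q, m):
    """Inverse chi-square CDF by bisection."""
    lo, hi = 0.0, 1e4
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, m) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def ev_from_p(p, m, h):
    """Asymptotic e-value from a likelihood-ratio p-value:
    ev = 1 - F_m[F_{m-h}^{-1}(1 - p)]."""
    return 1.0 - chi2_cdf(chi2_ppf(1.0 - p, m - h), m)
```

For instance, `ev_from_p(0.05, 4, 2)` is about 0.20, illustrating that e-values typically exceed the corresponding *p*-values when *h* < *m*.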

Since the maximum eigenvalue test is derived as a likelihood ratio test, this comparison may be done for the results of all the examples presented here, and most appropriately for this example, given its sample size of 10,496 observations. Regarding Tables 5 and 6, one could be in doubt about whether to reject the hypothesis *r* = 1, since the e-values are larger than 0.01. However, for this model and hypothesis, the e-value corresponding to a 0.01 *p*-value is 0.436. Therefore, in both tables, one could reject the hypothesis and proceed to the next rank, which has plenty of evidence in its favor. In conclusion, the practical decisions of both tests (FBST and maximum eigenvalue) would be the same: to not reject *r* = 2.

**Example 2** ([47])**.** *The authors of [47] compare three methods for modeling empirical seasonal temperature forecasts over South America. One of these methods is based on a (possible) long-term cointegration relationship between the temperatures of the quarter March–April–May (MAM) of each year and the temperature of the previous months of November–December–January (NDJ). When there is such a relationship, the authors use the NDJ temperatures (of the previous year) as a predictor for the following MAM season.*

The original data set has monthly temperatures for each coordinate (latitude and longitude) of the covered area. The mentioned series of temperatures (MAM and NDJ) are computed as seasonal averages from this monthly data set by averaging over three consecutive months. Since data are available from January 1949 to May 2020, the seasonal time series of average surface temperatures have length 72 for each grid point.

The authors of [47] consider **Y***t* as a two-dimensional vector, its first component being the seasonal (average) MAM temperature of year *t* and the second component the seasonal NDJ temperature of the *previous* year. They consider a VAR(2) without deterministic terms to model the series, resulting in a VECM

$$
\Delta \mathbf{Y}\_t = \Gamma\_1 \Delta \mathbf{Y}\_{t-1} + \Pi \mathbf{Y}\_{t-1} + \mathbf{E}\_t. \tag{25}
$$

We have chosen five grid points corresponding to major Brazilian cities to test the cointegration hypothesis of the mentioned seasonal series. The coordinates chosen were the closest ones from: 23.5505◦ S, 46.6333◦ W for São Paulo; 22.9068◦ S, 43.1729◦ W for Rio de Janeiro; 19.9167◦ S, 43.9345◦ W for Belo Horizonte; 15.8267◦ S, 47.9218◦ W for Brasília and 12.9777◦ S, 38.5016◦ W for Salvador. Figures 3 and 4 show the seasonal temperatures for São Paulo and Brasília, respectively, indicating that the cointegration hypothesis is plausible for both cities.

**Figure 3.** Seasonal (MAM and NDJ) temperatures for São Paulo from 1949 to 2020.

**Figure 4.** Seasonal (MAM and NDJ) temperatures for Brasília from 1949 to 2020.

**Table 7.** FBST and maximum eigenvalue test applied to temperature data (MAM and NDJ series) of the mentioned Brazilian cities.


The results are shown in Table 7. Assuming a significance level of 0.01, the maximum eigenvalue test rejects the null rank and does not reject *r* = 1 for all five cities. If we adopt the asymptotic relationship between *p*-values and e-values for the model under analysis, we obtain an e-value of 0.276 corresponding to a 0.01 *p*-value for *r* = 0. Therefore, the FBST would also reject the null rank for all the cities. The hypothesis *r* = 1 is not rejected, since all the e-values are close to 1, once more agreeing with the maximum eigenvalue test.

One remark about Brasília seems in order. The city was built to be the federal capital, being officially inaugurated on 21 April 1960. The construction began circa 1957, and before that the site had no human occupation. The process of moving the entire administration from Rio de Janeiro, the former capital, was slow, and only the 1980 census detected a population over 1 million inhabitants. The present population is almost 3.2 million inhabitants, living in the Federal District that includes Brasília and smaller surrounding cities. Figure 4 indicates that the seasonal temperatures began to rise exactly after 1980.

**Example 3.** *We applied the FBST to the Finnish data set used in the seminal work of Johansen and Juselius [2].*

The authors used the logarithms of the series of the *M*1 monetary aggregate, the inflation rate, real income, and the primary interest rate set by the Bank of Finland to model money demand, which, in theory, follows a long-term relationship. The sample has 106 quarterly observations of the mentioned variables, starting in the second quarter of 1958 and finishing in the third quarter of 1984. The chosen model was a VAR(2) with unrestricted constant, meaning that the series in **Y***<sup>t</sup>* have one unit root with drift vector **c** and the cointegrating relations may have a non-zero mean. For more information about how to specify deterministic terms in a VAR, see [48], chapter 6. Seasonal dummies for the first three quarters of the year were also considered in the model chosen by [2]. Writing the model in the error correction form, we have:

$$
\Delta\mathbf{Y}\_t = \mathbf{c} + \Phi\_{0,1}\mathbf{D}\_{1t} + \Phi\_{0,2}\mathbf{D}\_{2t} + \Phi\_{0,3}\mathbf{D}\_{3t} + \Gamma\_1\Delta\mathbf{Y}\_{t-1} + \Pi\mathbf{Y}\_{t-1} + \mathbf{E}\_t,\tag{26}
$$

where Π = Φ<sup>1</sup> + Φ<sup>2</sup> − *I*4, Γ<sup>1</sup> = −Φ2, **c** is a vector of constants, and **D***it* denotes the seasonal dummy for quarter *i* = 1, 2, 3. The results are displayed in Table 8.

**Table 8.** FBST and maximum eigenvalue test applied to the Finnish data of Johansen and Juselius (1990).


In [2], the authors concluded that there are at least two cointegration vectors, a conclusion that follows if one adopts a 0.01 significance level, for instance. Using the asymptotic relationship between *p*-values and e-values for Equation (26), we obtain, for *r* = 0, an e-value of 0.998 and, for *r* = 1, an e-value of 0.999, corresponding to a 0.01 *p*-value. These apparently discrepant e-values are due to the high dimensions of the unrestricted (*m* = 58) and under-*H*<sup>0</sup> (*h* = 42 for *r* = 0 and *h* = 43 for *r* = 1) parameter spaces. Therefore, under this criterion, the FBST also rejects the null rank and *r* = 1 (since 0.132 < 0.998 and 0.994 < 0.999, respectively) and does not reject *r* = 2, recommending the same action as the maximum eigenvalue test.

**Example 4.** *As a final example, we apply the FBST to a US data set discussed in [49]. The observations have annual periodicity and span 1900 to 1985. We tested for cointegration between real national income, the M*1 *monetary aggregate deflated by the GDP deflator, and the commercial paper return rate. The chosen model was a VAR(1) with unrestricted constant. The series were used in natural logarithms and the results follow below:*

**Table 9.** FBST and maximum eigenvalue test applied to US data of Lucas (2000).


Table 9 shows that the maximum eigenvalue test rejects *r* = 0 and does not reject *r* = 1 at a 0.05 significance level. Once more adopting the asymptotic relationship between *p*-values and e-values for the chosen model, we obtain, for *r* = 0, an e-value of 0.247 corresponding to a 0.01 *p*-value. Thus, under this criterion, the FBST also rejects the null rank and does not reject *r* = 1.

#### **6. Conclusions**

In the past few decades, the econometric literature introduced statistical tests to identify unit roots and cointegration relationships in time series. The Bayesian approach to these topics advanced considerably after the 1990s, developing interesting alternatives, mostly for unit root testing. The (parametric) frequentist tests mentioned here may not be suitable, since these procedures rely on the distribution of the test statistic—usually derived assuming the hypothesis being tested is true—which depends on a particular statistical model, usually Gaussian. When the distributions of such statistics cannot be obtained, the procedure is saved by asymptotic results. If the researcher considers different statistical models and the available sample is small, the results of the tests may be quite misleading.

The present work reviewed a simple and powerful Bayesian procedure that can be applied to both purposes: unit root and cointegration testing. We have also shown that the FBST works considerably well even when one uses improper priors, a choice that may preclude the derivation of Bayes Factors, a standard Bayesian procedure in hypotheses testing.

A long series of articles, provided in [7] and the references therein, has shown the versatility and properties of the FBST, such as: a. the e-value derivation and computation are straightforward from its general definition; b. it uses absolutely no artificial restrictions, like a distinct probability measure on the hypothesis set induced by some specific parametrization; c. it is in strict compliance with the likelihood principle; d. it can conduct the test with any prior distribution; e. it does not need closed-form results concerning error distributions, even for small samples; f. it is an exact procedure, since it does not rely on asymptotic assumptions; and g. it is invariant with respect to the null hypothesis parametrization and with respect to the parameter space parametrization. See [9], p. 253, for this property.

To proceed with this research agenda, it would be interesting to perform more simulation studies with the FBST applied to unit root testing for a larger group of parametric and semi-parametric models (likelihoods). Another possibility is to include moving average terms in the data generating processes and work with Gaussian and non-Gaussian ARMA models. Notice that, given the points made above, these extensions would not pose major problems for the FBST as they would for the frequentist procedures. Regarding cointegration, the same extensions may be studied in future works, although the adoption of statistical models outside the Gaussian family would require further efforts to numerically implement the FBST. We shall also investigate the effect of the prior choice on the estimates of cointegration relations, especially for small samples.

**Author Contributions:** M.A.D. was responsible for conceptualization, computational implementation of the methods, formal analysis, investigation, and visualization. C.A.B.P. and J.M.S. were responsible for conceptualization, methodology, formal analysis, supervision, and funding acquisition. All the authors were responsible for writing, reviewing, and editing the original text. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was also partially funded by CNPq—the Brazilian National Council of Technological and Scientific Development (grants PQ 307648-2018-4, 302767-2017-7, 301892-2015-6, and 308776-2014-3); and FAPESP—the State of São Paulo Research Foundation (grants CEPID Shell-RCGI 2014/50279-4 and CEPID CeMEAI 2013/07375-0). The authors are extremely grateful for the support received from their colleagues, collaborators, users, and critics in the construction of their research projects.

**Acknowledgments:** The authors would like to thank J. Østergaard and C. A. Coelho for kindly providing us access to the data sets used in [45,47], respectively. We also would like to thank the support provided by UFSCar—Federal University of São Carlos, USP—University of São Paulo, and UFMS—Federal University of Mato Grosso do Sul.

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **Appendix A. Computational Resources**

The FBST was implemented in all the examples using codes written by the authors in the Matlab/Octave programming language. The results displayed in Tables 3 and 5–9 were obtained using GNU Octave version 4.4.1. The only package required to run the routines was the statistics package (version 1.4.1), necessary to simulate vectors of random variables from the distributions mentioned in the text. The codes are briefly described at https://www.ime.usp.br/~jstern/software/, where they can be freely downloaded.

The original data sets used in the examples presented in this work can be obtained from the following sources:


#### **Appendix B. Non-Standard Distributions Used in This Article**

#### *Appendix B.1. Inverse-Gamma*

The probability density function of the Inverse-Gamma distribution is given by

$$f\_0(x \mid a, b) = \frac{b^a}{\Gamma(a)} \cdot \left(\frac{1}{x}\right)^{a+1} \exp\left(-\frac{b}{x}\right)$$

for *x* > 0, and zero otherwise. The parameters *a* and *b* are both positive real numbers and Γ is the gamma function.
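A direct standard-library transcription of this density reads (a minimal sketch; the function name is ours):

```python
import math

def inv_gamma_pdf(x, a, b):
    """Inverse-Gamma density f_0(x | a, b); zero for x <= 0."""
    if x <= 0:
        return 0.0
    return (b ** a) / math.gamma(a) * x ** (-(a + 1)) * math.exp(-b / x)
```

If *Y* follows a Gamma(*a*, *b*) distribution in the rate parameterization, then 1/*Y* follows this Inverse-Gamma law.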

#### *Appendix B.2. Matrix Normal*

The probability density function of the random matrix **X** with dimensions *p* × *q* that follows the matrix normal distribution *MNp*×*q*(**M**, **U**, **V**) has the form:

$$f\_1(\mathbf{X} \mid \mathbf{M}, \mathbf{U}, \mathbf{V}) = \frac{\exp\left(-\frac{1}{2}\text{tr}\left[\mathbf{V}^{-1}(\mathbf{X} - \mathbf{M})^\prime \mathbf{U}^{-1}(\mathbf{X} - \mathbf{M})\right]\right)}{(2\pi)^{pq/2}|\mathbf{V}|^{p/2}|\mathbf{U}|^{q/2}}$$

where **<sup>M</sup>** <sup>∈</sup> <sup>R</sup>*p*×*q*, **<sup>U</sup>** <sup>∈</sup> <sup>R</sup>*p*×*<sup>p</sup>* and **<sup>V</sup>** <sup>∈</sup> <sup>R</sup>*q*×*q*, being **<sup>U</sup>** and **<sup>V</sup>** symmetric positive semidefinite matrices. The matrix normal distribution can be characterized by the multivariate normal distribution as follows: **X** ∼ *MNp*×*q*(**M**, **U**, **V**) if and only if vec(**X**) ∼ *Npq*(vec(**M**), **V** ⊗ **U**), where ⊗ denotes the Kronecker product and vec the vectorization of **M**.
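The vec characterization can be checked numerically; the sketch below (function names are ours) evaluates the same log-density both directly from *f*1 and through the multivariate normal of vec(**X**) with covariance **V** ⊗ **U**, using the column-stacking convention:

```python
import numpy as np

def matnorm_logpdf(X, M, U, V):
    """Log of f_1(X | M, U, V), computed directly from the density above."""
    p, q = X.shape
    D = X - M
    # tr[V^{-1} D' U^{-1} D]
    quad = np.trace(np.linalg.solve(V, D.T) @ np.linalg.solve(U, D))
    return (-0.5 * quad - 0.5 * p * q * np.log(2 * np.pi)
            - 0.5 * p * np.linalg.slogdet(V)[1]
            - 0.5 * q * np.linalg.slogdet(U)[1])

def vec_logpdf(X, M, U, V):
    """Same value via vec(X) ~ N_pq(vec(M), V kron U)."""
    d = (X - M).flatten(order="F")       # column-stacking vec
    S = np.kron(V, U)
    quad = d @ np.linalg.solve(S, d)
    return -0.5 * (quad + len(d) * np.log(2 * np.pi) + np.linalg.slogdet(S)[1])
```

The two evaluations agree up to floating-point error, confirming the Kronecker identity |**V** ⊗ **U**| = |**V**|<sup>*p*</sup>|**U**|<sup>*q*</sup> used in the normalizing constant.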

#### *Appendix B.3. Inverse-Wishart*

The probability density function of the Inverse-Wishart distribution is

$$f\_2(\mathbf{x} \mid \Lambda, \nu) = \frac{|\Lambda|^{\nu/2}}{2^{\nu p/2}\, \Gamma\_p(\frac{\nu}{2})} \left| \mathbf{x} \right|^{-(\nu+p+1)/2} \exp\left[ -\frac{1}{2} \operatorname{tr}(\Lambda \mathbf{x}^{-1}) \right]$$

where **x** and Λ are *p* × *p* positive-definite matrices, and Γ*<sup>p</sup>* is the multivariate gamma function. Notice that we may also write the same density with tr(**x**<sup>−1</sup>Λ) inside the exponential function, as would be convenient in our implementation of the Gibbs sampler in Section 5.
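As a numerical sanity check (ours), the log-density above can be evaluated with SciPy's multivariate gamma function `multigammaln` and compared against `scipy.stats.invwishart`, which uses the same (ν, Λ) parameterization via its `df` and `scale` arguments:

```python
import numpy as np
from scipy.special import multigammaln
from scipy.stats import invwishart

rng = np.random.default_rng(1)
p, nu = 3, 6.0

# Positive-definite scale matrix Lambda and evaluation point x.
A = rng.standard_normal((p, p)); Lam = A @ A.T + p * np.eye(p)
B = rng.standard_normal((p, p)); x = B @ B.T + p * np.eye(p)

# Log-density from the formula; note tr(Lam x^{-1}) equals tr(x^{-1} Lam).
xinv = np.linalg.inv(x)
logf = (0.5 * nu * np.log(np.linalg.det(Lam))
        - 0.5 * nu * p * np.log(2.0)
        - multigammaln(nu / 2.0, p)
        - 0.5 * (nu + p + 1) * np.log(np.linalg.det(x))
        - 0.5 * np.trace(Lam @ xinv))

print(np.exp(logf))
print(invwishart(df=nu, scale=Lam).pdf(x))  # SciPy agrees
```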

#### **Appendix C. Heuristic Proof of Johansen's Procedure**

The goal of this appendix is to provide a brief heuristic explanation of the procedure, discussed in Section 5, that finds the maximum of posterior (17) subject to the hypothesis that matrix Π has reduced rank *r*, 0 ≤ *r* ≤ *n*. The procedure is based on the algorithm proposed in [2,50] to maximize a Gaussian likelihood under the same assumption (reduced rank of matrix Π). The formal proof of Johansen's algorithm can be found in [51], Chapter 20. As mentioned in Section 5, Johansen's algorithm can be applied to the posterior (17) since this distribution is very close to a (multivariate) Gaussian likelihood. *Entropy* **2020**, *22*, 968

The first step of the algorithm involves "concentrating" the posterior, i.e., assuming Ω and Π are given and maximizing the posterior with respect to all the other parameters in Θ. Hence, let *γ* denote the matrix *η* without Π, i.e., *γ* = [**c** Φ0 Γ1 ... Γ*p*−1]. The concentrated log-posterior, denoted by M, is found by replacing *γ* with its maximizer *γ*ˆ(Π) in (17):

$$\mathcal{M}(\Omega, \Pi \mid \mathbf{y}) = \ln[g(\hat{\gamma}(\Pi); \Omega, \Pi \mid \mathbf{y})] = \mathcal{C} + \frac{(T + n + 1)}{2} \ln|\Omega^{-1}| - \frac{1}{2} \cdot \text{tr}[\Omega^{-1}(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})^{\prime}(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})] \tag{A1}$$

where *C* is a constant that depends on *T*, *n* and **y**. The strategy behind concentrating the posterior is that, if we can find the values Ωˆ and Πˆ that maximize M, then these same values, along with *γ*ˆ(Πˆ ), will maximize (17) under the constraint rank(Π) = *r*. Carrying the concentration one step further, we can find the value of Ω that maximizes (A1) assuming Π known, giving

$$
\hat{\Omega}(\Pi) = \frac{1}{T + n + 1} \cdot (\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})'(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}}) .
$$

To evaluate the concentrated log-posterior at Ωˆ (Π), notice that

$$\text{tr}\left[\hat{\Omega}(\Pi)^{-1}(\hat{\mathbf{U}}-\Pi\hat{\mathbf{V}})^{\prime}(\hat{\mathbf{U}}-\Pi\hat{\mathbf{V}})\right] = \text{tr}[(T+n+1)I\_{n}] = n(T+n+1)$$

and, therefore, denoting by M<sup>∗</sup> this new concentrated log-posterior, we have

$$\mathcal{M}^\*(\Pi \mid \mathbf{y}) = \mathcal{C} - \frac{(T+n+1)n}{2} - \frac{(T+n+1)}{2} \ln \left| \frac{1}{T+n+1} (\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})'(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}}) \right| \tag{A2}$$

$$=\mathcal{C} - \frac{(T+n+1)n}{2} - \frac{(T+n+1)}{2} \ln\left|\frac{T}{T+n+1} \cdot \frac{1}{T}(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})'(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})\right|\tag{A3}$$

$$=\mathcal{C} - \frac{(T+n+1)n}{2} - \frac{(T+n+1)}{2} \ln\left[ \left(\frac{T}{T+n+1}\right)^n \cdot \left| \frac{1}{T} (\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})' (\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}}) \right| \right] \tag{A4}$$

$$=K - \frac{(T+n+1)}{2} \cdot \ln\left|\frac{1}{T}(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})'(\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})\right|\tag{A5}$$

where *K* is a new constant depending only on *T*, *n* and **y**. Equation (A5) represents the maximum value one can achieve for the log-posterior for any given matrix Π. Thus, maximizing the posterior comes down to choosing Π so as to minimize the determinant

$$\left| \frac{1}{T} (\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}})' (\hat{\mathbf{U}} - \Pi \hat{\mathbf{V}}) \right| $$

subject to the constraint rank(Π) = *r*. The solution of this problem demands the analysis of the sample covariance matrices of the OLS residuals **U**ˆ and **V**ˆ , and here we only present the final expression for the maximum value achieved for the log-posterior, denoted ℓ∗ in Section 5:

$$\ell^\* = K - \frac{(T+n+1)}{2} \cdot \ln|\widehat{\Sigma}\_{\mathbf{U}\mathbf{U}}| - \frac{T+n+1}{2} \cdot \sum\_{i=1}^r \ln(1-\widehat{\lambda}\_i). \tag{A6}$$

Chapter 20 of [51] provides the formal derivation of (A6).
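The final reduced-rank step can be illustrated with a short numerical sketch (ours, not the authors' Octave routines); the residual matrices below are synthetic stand-ins for **U**ˆ and **V**ˆ , and the eigenvalues λˆ*i* entering (A6) are the squared canonical correlations between the two sets of residuals:

```python
import numpy as np

rng = np.random.default_rng(2)
T, n, r = 200, 3, 2

# Synthetic stand-ins for the OLS residual matrices U-hat (T x n) and V-hat (T x n).
U = rng.standard_normal((T, n))
V = 0.5 * U + rng.standard_normal((T, n))

# Sample (cross-)covariance matrices of the residuals.
S_uu = U.T @ U / T
S_vv = V.T @ V / T
S_uv = U.T @ V / T

# Squared canonical correlations: eigenvalues of S_vv^{-1} S_vu S_uu^{-1} S_uv.
lam = np.linalg.eigvals(np.linalg.solve(S_vv, S_uv.T @ np.linalg.solve(S_uu, S_uv)))
lam = np.sort(lam.real)[::-1]          # sorted decreasing; all lie in [0, 1]

# Non-constant part of the maximised log-posterior (A6), under rank constraint r.
c = (T + n + 1) / 2
ell_star_minus_K = -c * np.log(np.linalg.det(S_uu)) - c * np.sum(np.log(1 - lam[:r]))
print(lam, ell_star_minus_K)
```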

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **A Novel Perspective of the Kalman Filter from the Rényi Entropy**

#### **Yarong Luo 1, Chi Guo 1,2,\*, Shengyong You <sup>1</sup> and Jingnan Liu 1,2**


Received: 21 July 2020; Accepted: 31 August 2020; Published: 3 September 2020

**Abstract:** Rényi entropy, as a generalization of the Shannon entropy, allows for different averaging of probabilities through a control parameter *α*. This paper gives a new perspective of the Kalman filter from the Rényi entropy. Firstly, the Rényi entropy is employed to measure the uncertainty of the multivariate Gaussian probability density function. Then, we calculate the temporal derivative of the Rényi entropy of the Kalman filter's mean square error matrix, which is minimized to obtain the Kalman filter's gain. Moreover, the continuous Kalman filter approaches a steady state when the temporal derivative of the Rényi entropy is equal to zero, which means that the Rényi entropy remains stable. As the temporal derivative of the Rényi entropy is independent of the parameter *α* and coincides with the temporal derivative of the Shannon entropy, the same conclusion holds for the Shannon entropy. Finally, an experiment of falling-body tracking by radar using an unscented Kalman filter (UKF) in noisy conditions and a loosely coupled navigation experiment are performed to demonstrate the effectiveness of the conclusion.

**Keywords:** Rényi entropy; discrete Kalman filter; continuous Kalman filter; algebraic Riccati equation; nonlinear differential Riccati equation

#### **1. Introduction**

In the late 1940s, Shannon introduced a logarithmic measure of information [1] and a theory that included information entropy (the literature shows that it is related to Boltzmann entropy in statistical mechanics). The more stochastic and unpredictable a variable is, the larger its entropy is. As a measure of information, entropy has been used in various fields, such as information theory, signal processing, information-theoretic learning [2,3], etc. As a generalization of the Shannon entropy, Rényi entropy, named after Alfréd Rényi [4], allows for different averaging of probabilities through a control parameter *α*, and is usually used to quantify the diversity, uncertainty, or randomness of random variables. Liang [5] presented the evolutionary entropy equations and the uncertainty estimation for Shannon entropy and relative entropy, which is also called Kullback–Leibler divergence [6], within the framework of dynamical systems. Moreover, by a suitable choice of the control parameter *α*, Rényi entropy has better properties than Shannon entropy in many cases.

The Kalman filter [7] and its variants have been widely used in navigation, control, tracking, etc. Many works focus on combining different entropy and entropy-like quantities with the original Kalman filter to improve the performance. When the state space equation is nonlinear, Rényi entropy can be used to measure the nonlinearity [8,9]. Shannon entropy was used to estimate the weight of each particle from the weights of different measurement models for the fusion algorithm in [10]. Quadratic Rényi entropy [11] of the innovation has been used as a minimum entropy criterion under nonlinear and non-Gaussian circumstances [12] in the unscented Kalman filter (UKF) [13] and finite mixtures [14]. A generalized density evolution equation [15] and polynomial-based non-linear compensation [16] were used to improve the minimum entropy filtering [17]. Relative entropy has been used to measure the similarity between the probabilistic density functions during the recursive processes of the nonlinear filter [18,19]. As for the nonlinear measurement equation with additive Gaussian noise, relative entropy can be deduced to measure the nonlinearity of the measurement [20], and can also be used to measure the approximation error of the *i*-th measurement element in the partitioned update Kalman filter [21]. When the state variables and the measurement variables do not belong to a strict Gaussian distribution, such as in the seamless indoor/outdoor multi-source fusion positioning problem [22], the estimation error can be measured by the relative entropy. Relative entropy can also be used to calculate the number of particles in the unscented particle filter for mobile robot self-localization [23] and to calculate the sample window size in the cubature Kalman filter (CKF) [24] for attitude estimation [25]. Moreover, it has been verified that the original Kalman filter can be derived by maximizing the relative entropy [26]. Meanwhile, the robust maximum correntropy criterion has been adopted as the optimal criterion to derive the maximum correntropy Kalman filter [27,28]. However, there has been no work on the direct connections between the Rényi entropy and the Kalman filter theory until now.

In this paper, we propose a new perspective of the Kalman filter from the Rényi entropy for the first time, which bridges the gap between the Kalman filter and the Rényi entropy. We calculate the temporal derivative of the Rényi entropy for the Kalman filter mean square error matrix, which provides the optimal recursive solution mathematically and will be minimized to obtain the Kalman filter gain. Moreover, from the physical point of view, the continuous Kalman filter approaches a steady state when the temporal derivative of the Rényi entropy is equal to zero, which also means that the Rényi entropy will keep stable. A numerical experiment of falling body tracking in noisy conditions with radar using the UKF and a practical experiment of loosely-coupled integration are provided to demonstrate the effectiveness of the above conclusion.

The structure of this paper is as follows. In Section II, the definitions and properties of Shannon entropy and Rényi entropy are presented. In Section III, the Kalman filter is derived from the perspective of minimizing the temporal derivative of Rényi entropy, and the connection between the Rényi entropy and the algebraic Riccati equation is explained. In Section IV, experimental results and analysis are given by the simulation of the UKF and the real integrated navigation data. We finally conclude this paper and provide an outlook for future work in Section V.

#### **2. The Connection between the Kalman Filter and the Temporal Derivative of the Rényi Entropy**

#### *2.1. Rényi Entropy*

To calculate the Rényi entropy of the continuous probability density function (PDF), it is necessary to extend the definition of the Rényi entropy to the continuous form. The Rényi entropy of order *α* for a continuous random variable with a multivariate Gaussian PDF *p*(*x*) is defined [4] and calculated [9] as:

$$H\_R^{\alpha}(\mathbf{x}) = \frac{1}{1-\alpha} \ln \int\_S p^{\alpha}(\mathbf{x}) d\mathbf{x} = \frac{N}{2} \ln(2\pi \alpha^{\frac{1}{\alpha-1}}) + \frac{1}{2} \ln(\det \Sigma), \tag{1}$$

where *α* > 0, *α* ≠ 1, and *α* is a parameter providing a family of entropy functions. *N* is the dimension of the random variable *x*. *S* is the support. Σ is the covariance matrix of *p*(*x*).
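Equation (1) can be verified numerically in the one-dimensional case; the sketch below (ours, with arbitrary *α* and *σ*) approximates the integral on a fine grid and compares it with the closed form:

```python
import numpy as np

# Renyi entropy of N(0, sigma^2) for one arbitrary alpha (alpha > 0, alpha != 1).
alpha, sigma = 2.0, 1.3

# Numerical value: (1/(1-alpha)) * ln( integral of p(x)^alpha dx ), Riemann sum.
x = np.linspace(-40 * sigma, 40 * sigma, 400001)
p = np.exp(-(x**2) / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
h_num = np.log(np.sum(p**alpha) * (x[1] - x[0])) / (1 - alpha)

# Closed form of Eq. (1) with N = 1 and det(Sigma) = sigma^2.
h_closed = 0.5 * np.log(2 * np.pi * alpha ** (1 / (alpha - 1))) + 0.5 * np.log(sigma**2)

print(h_num, h_closed)  # the two values agree to several decimals
```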

It is straightforward to show that the temporal derivative of the Rényi entropy is given by [9]:

$$\dot{H}\_{R}^{(\alpha)}(\mathbf{x}) = \frac{1}{2} Tr\{\Sigma^{-1}\dot{\Sigma}\},\tag{2}$$

where <sup>Σ</sup>˙ is the temporal derivative of the covariance matrix and *Tr*(·) is the trace operator.

It is easy to get the Shannon entropy for the multivariate Gaussian PDF by taking the limit of Equation (1) as *α* approaches 1. This entropy is given as *H*(*x*) = (*N*/2) ln(2*πe*) + (1/2) ln(det Σ), and the temporal derivative of the Shannon entropy is given as *H*˙ (*x*) = (1/2)*Tr*{Σ−1Σ˙ }. It is obvious that the temporal derivative of the Shannon entropy is the same as the temporal derivative of the Rényi entropy. Therefore, we will see later that the conclusion can also be derived from the temporal derivative of the Shannon entropy. However, in most cases it is the Rényi entropy for the multivariate Gaussian PDF, rather than its temporal derivative, that is used, with the free parameter *α* adjusted for different uncertainty measurements; since the filtering problem has to account for nonlinearity and non-Gaussian noise, we adopt the Rényi entropy as the measure of uncertainty.

#### *2.2. Kalman Filter*

Given the continuous-time linear system [29]:

$$
\dot{X}(t) = F(t)X(t) + G(t)w(t) \tag{3}
$$

$$Z(t) = H(t)X(t) + v(t),\tag{4}$$

where *X*(*t*) is the state vector; *F*(*t*) is the state transition matrix; *G*(*t*) is the system noise driving matrix; *Z*(*t*) is the measurement vector; *H*(*t*) is the measurement matrix; and *w*(*t*) and *v*(*t*) are independent white Gaussian noise with zero mean value; their covariance matrices are *Q*(*t*) and *R*(*t*), respectively:

$$\mathbb{E}[w(t)] = 0, \mathbb{E}[w(t)w^T(\tau)] = Q(t)\delta(t-\tau) \tag{5}$$

$$\mathbb{E}[v(t)] = 0, \mathbb{E}[v(t)v^T(\tau)] = R(t)\delta(t-\tau) \tag{6}$$

$$\mathbb{E}[w(t)v^T(\tau)] = 0,\tag{7}$$

where *δ*(*t*) is the Dirac impulse function, *Q*(*t*) is a symmetric non-negative definite matrix, and *R*(*t*) is a symmetric positive definite matrix.

The continuous Kalman filter can be deduced by taking the limit of the discrete Kalman filter. The discrete-time state-space model is arranged as follows [29]:

$$X\_k = \Phi\_{k|k-1} X\_{k-1} + \Gamma\_{k|k-1} \mathcal{W}\_{k-1} \tag{8}$$

$$Z\_k = H\_k X\_k + V\_k \tag{9}$$

where *Xk* is an n-dimensional state vector; *Zk* is an m-dimensional measurement vector; <sup>Φ</sup>*k*|*k*−1, <sup>Γ</sup>*k*|*k*−1, and *Hk* are the known system structure parameters, which are called the *<sup>n</sup>* × *<sup>n</sup>* dimensional one-step state update matrix, the *n* × *l* dimensional system noise distribution matrix, and the *m* × *n* dimensional measurement matrix, respectively; *Wk*−<sup>1</sup> is the *l*-dimensional system noise vector, and *Vk* is the m-dimensional measurement noise vector. Both of them are Gaussian noise vector sequences with zero mean value, and are independent of each other:

$$\mathbb{E}[\mathcal{W}\_k] = 0, \mathbb{E}[\mathcal{W}\_k \mathcal{W}\_j^T] = Q\_k \delta\_{kj} \tag{10}$$

$$\mathbb{E}[V\_k] = 0, \mathbb{E}[V\_k V\_j^T] = R\_k \delta\_{kj} \tag{11}$$

$$\mathbb{E}[\mathbb{W}\_k V\_j^T] = 0.\tag{12}$$

The above equation is the basic assumption for the noise requirement in the Kalman filtering state space model, where *Qk* is a symmetric non-negative definite matrix, and *Rk* is a symmetric positive definite matrix. *δkj* is the Kronecker *δ* function.

The covariance parameters *Qk* and *Rk* play roles similar to those of Q and R in the continuous filter, but they do not have the same numerical values. Next, the relationship between the corresponding continuous and discrete filter parameters will be derived.

To achieve the transformation from the continuous form to the discrete form, the relations between Q and R and the corresponding *Qk* and *Rk* for a small step size *Ts* are needed. According to the linear system theory, the relation between *Q* and *Qk* from Equation (3) to Equation (8) is as follows:

$$\Phi\_{k|k-1} = \Phi(t\_k, t\_{k-1}) \approx e^{\int\_{t\_{k-1}}^{t\_k} F(\tau)d\tau} \tag{13}$$

$$
\Gamma\_{k|k-1} \mathcal{W}\_{k-1} = \int\_{t\_{k-1}}^{t\_k} \Phi(t\_k, \tau) G(\tau) w(\tau) d\tau. \tag{14}
$$

Denote the discrete-time interval by *Ts* = *tk* − *tk*−1, and assume *F*(*t*) does not change dramatically within the short integration interval [*tk*−1, *tk*]. Taking the Taylor expansion of *eF*(*tk*−1)*Ts* with respect to *F*(*tk*−1)*Ts* and assuming *F*(*tk*−1)*Ts* << *I*, the higher-order terms are negligible and the one-step transition matrix, Equation (13), can be approximated as:

$$\Phi\_{k|k-1} \approx e^{F(t\_{k-1})T\_s} = I + F(t\_{k-1})T\_s + F^2(t\_{k-1})\frac{T\_s^2}{2!} + F^3(t\_{k-1})\frac{T\_s^3}{3!} + \dots \approx I + F(t\_{k-1})T\_s.\tag{15}$$
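The quality of this first-order truncation can be checked against the exact matrix exponential (a sketch of ours; `F` and `Ts` are arbitrary):

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
F = rng.standard_normal((4, 4))   # stand-in for F(t_{k-1})
Ts = 1e-3                         # small sampling interval

Phi_exact = expm(F * Ts)          # full matrix exponential
Phi_approx = np.eye(4) + F * Ts   # first-order truncation of Eq. (15)

# The truncation error is O(Ts^2), so it shrinks quadratically with Ts.
err = np.linalg.norm(Phi_exact - Phi_approx)
print(err)
```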

Equation (14) shows that Γ*k*|*k*−1*Wk*−1 is a linear transform of the Gaussian white noise *w*(*τ*); the result remains a normally distributed random vector. Therefore, Γ*k*|*k*−1*Wk*−1 can be fully characterized by its first- and second-order statistical moments. Referring to Equation (5), the mean of Γ*k*|*k*−1*Wk*−1 is given as follows:

$$\mathbb{E}[\Gamma\_{k|k-1}\mathcal{W}\_{k-1}] = \mathbb{E}[\int\_{t\_{k-1}}^{t\_k} \Phi(t\_k, \tau)G(\tau)w(\tau)d\tau] = \int\_{t\_{k-1}}^{t\_k} \Phi(t\_k, \tau)G(\tau)\mathbb{E}[w(\tau)]d\tau = 0. \tag{16}$$

For the second-order statistical characteristics, when *k* ≠ *j*, the noise terms *w*(*τ*) over disjoint intervals are independent, so Γ*k*|*k*−1*Wk*−1 and Γ*j*|*j*−1*Wj*−1 are uncorrelated:

$$\mathbb{E}[(\Gamma\_{k|k-1}\mathcal{W}\_{k-1})(\Gamma\_{j|j-1}\mathcal{W}\_{j-1})^T] = 0 \quad (k \neq j). \tag{17}$$

When *k* = *j*, we have:

$$\begin{split} \mathbb{E}[(\Gamma\_{k|k-1}\mathcal{W}\_{k-1})(\Gamma\_{k|k-1}\mathcal{W}\_{k-1})^{T}] &= \mathbb{E}\left\{ [\int\_{t\_{k-1}}^{t\_{k}} \Phi(t\_{k},\tau)G(\tau)w(\tau)d\tau] [\int\_{t\_{k-1}}^{t\_{k}} \Phi(t\_{k},s)G(s)w(s)ds]^{T} \right\} \\ &= \mathbb{E}\left\{ \int\_{t\_{k-1}}^{t\_{k}} \Phi(t\_{k},\tau)G(\tau)w(\tau) \int\_{t\_{k-1}}^{t\_{k}} w^{T}(s)G^{T}(s)\Phi^{T}(t\_{k},s)dsd\tau \right\} \\ &= \int\_{t\_{k-1}}^{t\_{k}} \Phi(t\_{k},\tau)G(\tau) \int\_{t\_{k-1}}^{t\_{k}} \mathbb{E}[w(\tau)w^{T}(s)]G^{T}(s)\Phi^{T}(t\_{k},s)dsd\tau. \end{split} \tag{18}$$

Substituting Equation (5) into the above equation:

$$\begin{split} \mathbb{E}[(\Gamma\_{k|k-1}\mathcal{W}\_{k-1})(\Gamma\_{k|k-1}\mathcal{W}\_{k-1})^{T}] &= \int\_{t\_{k-1}}^{t\_{k}} \Phi(t\_{k},\tau) G(\tau) \int\_{t\_{k-1}}^{t\_{k}} Q(\tau)\delta(\tau-s) G^{T}(s) \Phi^{T}(t\_{k},s) ds d\tau \\ &= \int\_{t\_{k-1}}^{t\_{k}} \Phi(t\_{k},\tau) G(\tau) Q(\tau) G^{T}(\tau) \Phi^{T}(t\_{k},\tau) d\tau. \end{split} \tag{19}$$

When the noise control matrix *<sup>G</sup>*(*τ*) changes slowly during the time interval [*tk*−1, *tk*], Equation (19) becomes:

*Entropy* **2020**, *22*, 982

$$\begin{split} & \mathbb{E}[(\Gamma\_{k|k-1} W\_{k-1})(\Gamma\_{k|k-1} W\_{k-1})^T] \\ & \approx \int\_{t\_{k-1}}^{t\_k} [I + F(t\_{k-1})(t\_k - \tau)] G(t\_{k-1}) Q(\tau) G^T(t\_{k-1}) [I + F(t\_{k-1})(t\_k - \tau)]^T d\tau \\ & = [I + F(t\_{k-1}) T\_s] \cdot [G(t\_{k-1}) Q(t\_{k-1}) G^T(t\_{k-1}) T\_s] \cdot [I + F(t\_{k-1}) T\_s]^T \\ & \quad + \frac{1}{12} F(t\_{k-1}) G(t\_{k-1}) Q(t\_{k-1}) G^T(t\_{k-1}) F(t\_{k-1})^T T\_s^3 \\ & \approx \left\{ \left[ I + F(t\_{k-1}) T\_s \right] G(t\_{k-1}) \right\} \cdot [Q(t\_{k-1}) T\_s] \cdot \left\{ \left[ I + F(t\_{k-1}) T\_s \right] G(t\_{k-1}) \right\}^T. \end{split} \tag{20} $$

When *<sup>F</sup>*(*tk*−1)*Ts* << *<sup>I</sup>* is satisfied, the above equation can be further approximated:

$$\mathbb{E}\left[\left(\Gamma\_{k|k-1}\mathcal{W}\_{k-1}\right)\left(\Gamma\_{k|k-1}\mathcal{W}\_{k-1}\right)^{T}\right] \approx \mathbf{G}\left(t\_{k-1}\right) \cdot \left[\mathbf{Q}\left(t\_{k-1}\right)T\_{\mathfrak{s}}\right] \cdot \mathbf{G}^{T}\left(t\_{k-1}\right). \tag{21}$$

Comparing the result with Equation (10):

$$\Gamma\_{k|k-1} \approx \left[ I + F(t\_{k-1}) T\_s \right] G(t\_{k-1}) \approx G(t\_{k-1}) \tag{22}$$

$$\mathbb{E}[\mathcal{W}\_k \mathcal{W}\_j^T] = Q\_k \delta\_{kj} = [Q(t\_k) T\_s] \delta\_{kj}. \tag{23}$$

Notice that [29]:

$$Q\_k = Q(t\_k)T\_s.\tag{24}$$

The derivation of the equation relating *Rk* and *R* is more subtle. In the continuous model, *v*(*t*) is white, so simple sampling of *Z*(*t*) leads to measurement noise with infinite variance. Hence, in the sampling process, we have to imagine averaging the continuous measurement over the *Ts* interval to get an equivalent discrete sample. This is justified because *x* is not Gaussian white noise and can be considered approximately constant within the interval.

$$Z\_k = \frac{1}{T\_s} \int\_{t\_{k-1}}^{t\_k} Z(t)dt = \frac{1}{T\_s} \int\_{t\_{k-1}}^{t\_k} [H(t)x(t) + v(t)]dt = H(t\_k)x\_k + \frac{1}{T\_s} \int\_{t\_{k-1}}^{t\_k} v(t)dt. \tag{25}$$

Then, the discrete noise matrix and the continuous noise matrix are equivalent:

$$V\_k = \frac{1}{T\_s} \int\_{t\_{k-1}}^{t\_k} v(t)dt. \tag{26}$$

From Equations (6) and (26), we have:

$$\begin{split} \mathbb{E}[V\_{k}V\_{j}^{T}] &= R\_{k}\delta\_{kj} = \frac{1}{T\_{s}^{2}} \int\_{t\_{k-1}}^{t\_{k}} \int\_{t\_{j-1}}^{t\_{j}} \mathbb{E}[v(\tau)v^{T}(s)] d\tau ds \\ &= \frac{1}{T\_{s}^{2}} \int\_{t\_{k-1}}^{t\_{k}} \int\_{t\_{j-1}}^{t\_{j}} R(\tau)\delta(s-\tau) d\tau ds = \frac{1}{T\_{s}^{2}} \int\_{t\_{k-1}}^{t\_{k}} R(\tau)\delta\_{kj} d\tau \approx \frac{R(t\_{k})}{T\_{s}} \delta\_{kj}. \end{split} \tag{27}$$

Comparing it with Equation (11), we have [29]:

$$R\_k = \frac{R(t\_k)}{T\_s}.\tag{28}$$
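The relations *Qk* = *Q*(*tk*)*Ts* and *Rk* = *R*(*tk*)/*Ts* can be illustrated with a small Monte Carlo sketch (ours, with arbitrary values of *R*, *dt*, and *Ts*): sampling the continuous white noise on sub-steps of length *dt* (variance *R*/*dt*) and averaging over the interval *Ts*, as in Equation (26), reproduces the predicted discrete variance *R*/*Ts*:

```python
import numpy as np

rng = np.random.default_rng(4)
R, dt, Ts = 2.0, 1e-3, 0.1
m = int(Ts / dt)                  # sub-steps per sampling interval
trials = 20000

# Discretised continuous white noise v(t): variance R/dt per sub-step.
v = rng.normal(0.0, np.sqrt(R / dt), size=(trials, m))
Vk = v.mean(axis=1)               # Eq. (26), discretised averaging over Ts

print(Vk.var())                   # empirical variance of V_k
print(R / Ts)                     # predicted R_k = R/T_s
```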

#### *2.3. Derivation of the Kalman Filter*

Assuming that the optimal state estimation at *tk*−<sup>1</sup> is *<sup>X</sup>*<sup>ˆ</sup> *<sup>k</sup>*−1, the state estimation error is *<sup>X</sup>*˜ *<sup>k</sup>*−1, and the state estimation covariance matrix is Σ*k*−1:

$$
\tilde{X}\_{k-1} = X\_{k-1} - \hat{X}\_{k-1} \tag{29}
$$

and


$$\Sigma\_{k-1} = \mathbb{E}[\tilde{X}\_{k-1}\tilde{X}\_{k-1}^T] = \mathbb{E}[(X\_{k-1} - \hat{X}\_{k-1})(X\_{k-1} - \hat{X}\_{k-1})^T].\tag{30}$$

If we take the expectation operator of both sides of Equation (8), we obtain the state one-step prediction and the state one-step estimation error:

$$X\_{k|k-1}^{-}=\mathbb{E}[X\_k]=\mathbb{E}[\Phi\_{k|k-1}X\_{k-1}+\Gamma\_{k|k-1}\mathcal{W}\_{k-1}]=\Phi\_{k|k-1}\mathbb{E}[X\_{k-1}]=\Phi\_{k|k-1}\hat{X}\_{k-1}\tag{31}$$

$$
\tilde{X}\_{k|k-1} = X\_k - X\_{k|k-1}^-.\tag{32}
$$

Substituting Equations (8) and (31) into Equation (32) leads to:

$$\begin{split} \tilde{X}\_{k|k-1} &= (\Phi\_{k|k-1} X\_{k-1} + \Gamma\_{k|k-1} \mathcal{W}\_{k-1}) - \Phi\_{k|k-1} \hat{X}\_{k-1} \\ &= \Phi\_{k|k-1} (X\_{k-1} - \hat{X}\_{k-1}) + \Gamma\_{k|k-1} \mathcal{W}\_{k-1} = \Phi\_{k|k-1} \tilde{X}\_{k-1} + \Gamma\_{k|k-1} \mathcal{W}\_{k-1}. \end{split} \tag{33}$$

Since *<sup>X</sup>*˜ *<sup>k</sup>*−<sup>1</sup> is uncorrelated with *Wk*−1, we therefore obtain the covariance of the state one-step estimation error *<sup>X</sup>*˜ *<sup>k</sup>*|*k*−<sup>1</sup> as follows:

$$\begin{split} \Sigma\_{k|k-1} &= \mathbb{E}[\tilde{X}\_{k|k-1} \tilde{X}\_{k|k-1}^T] = \mathbb{E}[(\Phi\_{k|k-1} \tilde{X}\_{k-1} + \Gamma\_{k|k-1} \mathcal{W}\_{k-1})(\Phi\_{k|k-1} \tilde{X}\_{k-1} + \Gamma\_{k|k-1} \mathcal{W}\_{k-1})^T] \\ &= \Phi\_{k|k-1} \mathbb{E}[\tilde{X}\_{k-1} \tilde{X}\_{k-1}^T] \Phi\_{k|k-1}^T + \Gamma\_{k|k-1} \mathbb{E}[\mathcal{W}\_{k-1} \mathcal{W}\_{k-1}^T] \Gamma\_{k|k-1}^T \\ &= \Phi\_{k|k-1} \Sigma\_{k-1} \Phi\_{k|k-1}^T + \Gamma\_{k|k-1} Q\_{k-1} \Gamma\_{k|k-1}^T. \end{split} \tag{34}$$

In a similar way, the measurement at *tk* can be predicted by the state one-step prediction *X*− *k*|*k*−1 and the system measurement Equation (9) as follows:

$$Z\_{k|k-1}^{-} = \mathbb{E}[H\_k X\_{k|k-1}^{-} + V\_k] = H\_k X\_{k|k-1}^{-}. \tag{35}$$

In fact, there is a difference between the measurement one-step prediction *Z*− *k*|*k*−1 and the actual measurement *Zk*. The difference is denoted as the measurement one-step prediction error:

$$
\tilde{Z}\_{k|k-1} = Z\_k - Z\_{k|k-1}^-.\tag{36}
$$

Substituting the measurement Equations (9) and (35) into Equation (36) yields:

$$\tilde{Z}\_{k|k-1} = Z\_k - H\_k X\_{k|k-1}^- = H\_k X\_k + V\_k - H\_k X\_{k|k-1}^- = H\_k \tilde{X}\_{k|k-1} + V\_k. \tag{37}$$

In general, the measurement one-step prediction error *Z*˜ *k*|*k*−1 is called the innovation in the classical Kalman filter theory, and it carries the new information about the state brought by the current measurement.

On the one hand, if the estimation of *Xk* only includes the state one-step prediction *X*<sup>−</sup> *<sup>k</sup>*|*k*−<sup>1</sup> of the system state equation, the estimation accuracy will be low, as no information of the measurement equation has been used. On the other hand, according to Equation (37), the measurement one-step prediction error calculated using the system measurement equation contains the information of the state one-step prediction of *X*− *<sup>k</sup>*|*k*−1. Consequently, it is natural to consider all the state information that comes from the system state equation and the measurement equation, respectively, and correct the state one-step prediction mean *X*− *<sup>k</sup>*|*k*−<sup>1</sup> with the measurement one-step prediction error *<sup>Z</sup>*˜ *<sup>k</sup>*|*k*−1. Thereby, the optimal estimation of *Xk* can be calculated by the combination of *X*<sup>−</sup> *<sup>k</sup>*|*k*−<sup>1</sup> and *<sup>Z</sup>*˜ *k*|*k*−1 as follows:

$$
\hat{X}\_k = X\_{k|k-1}^- + K\_k \tilde{Z}\_{k|k-1} \tag{38}
$$

where *Kk* is the undetermined correction factor matrix.

Substituting Equations (31) and (37) into Equation (38) yields:

$$\begin{split} \hat{X}\_{k} &= X\_{k|k-1}^{-} + K\_{k}(Z\_{k} - H\_{k}X\_{k|k-1}^{-}) = (I - K\_{k}H\_{k})X\_{k|k-1}^{-} + K\_{k}Z\_{k} \\ &= (I - K\_{k}H\_{k})\Phi\_{k|k-1}\hat{X}\_{k-1} + K\_{k}Z\_{k}. \end{split} \tag{39}$$

From Equation (39), the current state estimation *X*ˆ *k* is a linear combination of the last state estimation *X*ˆ *k*−1 and the current measurement *Zk*, which accounts for the influence of the structural parameters Φ*k*|*k*−1 of the state equation and *Hk* of the measurement equation.

The state estimation error at the current time *tk* is denoted as:

$$
\tilde{X}\_k = X\_k - \hat{X}\_k, \tag{40}
$$

where *Xk* is the true state and *X*ˆ *k* is the posterior estimation of *Xk*.

Substituting Equation (39) into Equation (40) yields:

$$\begin{split} \tilde{X}\_{k} &= X\_{k} - \left[ X\_{k|k-1}^{-} + K\_{k} (Z\_{k} - H\_{k} X\_{k|k-1}^{-}) \right] = (X\_{k} - X\_{k|k-1}^{-}) - K\_{k} (H\_{k} X\_{k} + V\_{k} - H\_{k} X\_{k|k-1}^{-}) \\ &= \tilde{X}\_{k|k-1} - K\_{k} (H\_{k} \tilde{X}\_{k|k-1} + V\_{k}) = (I - K\_{k} H\_{k}) \tilde{X}\_{k|k-1} - K\_{k} V\_{k}. \end{split} \tag{41}$$

Then, the mean square error matrix of state estimation *X*ˆ *<sup>k</sup>* is given by:

$$\begin{split} \Sigma\_{k} &= \mathbb{E}[\tilde{X}\_{k}\tilde{X}\_{k}^{T}] = \mathbb{E}\{ [(I - K\_{k}H\_{k})\tilde{X}\_{k|k-1} - K\_{k}V\_{k}][(I - K\_{k}H\_{k})\tilde{X}\_{k|k-1} - K\_{k}V\_{k}]^{T} \} \\ &= (I - K\_{k}H\_{k})\mathbb{E}[\tilde{X}\_{k|k-1}\tilde{X}\_{k|k-1}^{T}](I - K\_{k}H\_{k})^{T} + K\_{k}\mathbb{E}[V\_{k}V\_{k}^{T}]K\_{k}^{T} \\ &= (I - K\_{k}H\_{k})\Sigma\_{k|k-1}(I - K\_{k}H\_{k})^{T} + K\_{k}R\_{k}K\_{k}^{T}. \end{split} \tag{42}$$

Substituting Equation (34) into Equation (42) yields:

$$\begin{split} \Sigma\_{k} &= (I - K\_{k} H\_{k}) [\Phi\_{k|k-1} \Sigma\_{k-1} \Phi\_{k|k-1}^{T} + \Gamma\_{k|k-1} Q\_{k-1} \Gamma\_{k|k-1}^{T}] (I - K\_{k} H\_{k})^{T} + K\_{k} R\_{k} K\_{k}^{T} \\ &= \Phi\_{k|k-1} \Sigma\_{k-1} \Phi\_{k|k-1}^{T} + K\_{k} H\_{k} \Phi\_{k|k-1} \Sigma\_{k-1} \Phi\_{k|k-1}^{T} H\_{k}^{T} K\_{k}^{T} - \Phi\_{k|k-1} \Sigma\_{k-1} \Phi\_{k|k-1}^{T} H\_{k}^{T} K\_{k}^{T} \\ &\quad - K\_{k} H\_{k} \Phi\_{k|k-1} \Sigma\_{k-1} \Phi\_{k|k-1}^{T} + \Gamma\_{k|k-1} Q\_{k-1} \Gamma\_{k|k-1}^{T} - K\_{k} H\_{k} \Gamma\_{k|k-1} Q\_{k-1} \Gamma\_{k|k-1}^{T} \\ &\quad - \Gamma\_{k|k-1} Q\_{k-1} \Gamma\_{k|k-1}^{T} H\_{k}^{T} K\_{k}^{T} + K\_{k} H\_{k} \Gamma\_{k|k-1} Q\_{k-1} \Gamma\_{k|k-1}^{T} H\_{k}^{T} K\_{k}^{T} + K\_{k} R\_{k} K\_{k}^{T}. \end{split} \tag{43}$$
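The expansion above can be verified numerically; in this sketch (ours), random matrices stand in for Σ*k*−1, *Qk*−1, *Rk*, Φ*k*|*k*−1, Γ*k*|*k*−1, *Hk*, and an arbitrary gain *Kk*:

```python
import numpy as np

rng = np.random.default_rng(5)
n, m = 4, 2

def spd(k):
    """Random symmetric positive-definite k x k matrix."""
    A = rng.standard_normal((k, k))
    return A @ A.T + k * np.eye(k)

# Random stand-ins for the matrices appearing in Eqs. (34) and (42).
Sig, Q, Rk = spd(n), spd(n), spd(m)
Phi, Gam = rng.standard_normal((n, n)), rng.standard_normal((n, n))
H, K = rng.standard_normal((m, n)), rng.standard_normal((n, m))

P = Phi @ Sig @ Phi.T + Gam @ Q @ Gam.T        # Sigma_{k|k-1}, Eq. (34)
I = np.eye(n)

compact = (I - K @ H) @ P @ (I - K @ H).T + K @ Rk @ K.T   # Eq. (42)

A, B = Phi @ Sig @ Phi.T, Gam @ Q @ Gam.T                  # expanded, Eq. (43)
expanded = (A + K @ H @ A @ H.T @ K.T - A @ H.T @ K.T - K @ H @ A
            + B + K @ H @ B @ H.T @ K.T - K @ H @ B - B @ H.T @ K.T
            + K @ Rk @ K.T)

print(np.allclose(compact, expanded))  # True: the expansion is exact
```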

We now use the approximation Φ*k*|*k*−1 ≈ *I* + *F*(*tk*−1)*Ts* from Equation (15) and, from Equation (22), Γ*k*|*k*−1 ≈ *G*(*tk*−1); we then have:

$$\begin{split} \Sigma\_{k} &= [I + F(t\_{k-1})T\_s] \Sigma\_{k-1} [I + F(t\_{k-1})T\_s]^{T} + K\_{k}H\_{k}[I + F(t\_{k-1})T\_s] \Sigma\_{k-1} [I + F(t\_{k-1})T\_s]^{T} H\_{k}^{T} K\_{k}^{T} \\ &\quad - [I + F(t\_{k-1})T\_s] \Sigma\_{k-1} [I + F(t\_{k-1})T\_s]^{T} H\_{k}^{T} K\_{k}^{T} - K\_{k}H\_{k}[I + F(t\_{k-1})T\_s] \Sigma\_{k-1} [I + F(t\_{k-1})T\_s]^{T} \\ &\quad + G(t\_{k-1})Q\_{k-1} G^{T}(t\_{k-1}) - K\_{k}H\_{k}G(t\_{k-1})Q\_{k-1} G^{T}(t\_{k-1}) - G(t\_{k-1})Q\_{k-1} G^{T}(t\_{k-1})H\_{k}^{T} K\_{k}^{T} \\ &\quad + K\_{k}H\_{k}G(t\_{k-1})Q\_{k-1} G^{T}(t\_{k-1})H\_{k}^{T} K\_{k}^{T} + K\_{k}R\_{k}K\_{k}^{T}. \end{split} \tag{44}$$

Note from Equation (24) that *Qk* is of the order of *Ts* and from Equation (28) that *Rk* = *R*(*tk*)/*Ts*; then, Equation (44) becomes:

$$\begin{split} \Sigma\_{k} &= [I + F(t\_{k-1})T\_s]\Sigma\_{k-1}[I + F(t\_{k-1})T\_s]^{T} + K\_{k}H\_{k}[I + F(t\_{k-1})T\_s]\Sigma\_{k-1}[I + F(t\_{k-1})T\_s]^{T}H\_{k}^{T}K\_{k}^{T} \\ &\quad - [I + F(t\_{k-1})T\_s]\Sigma\_{k-1}[I + F(t\_{k-1})T\_s]^{T}H\_{k}^{T}K\_{k}^{T} - K\_{k}H\_{k}[I + F(t\_{k-1})T\_s]\Sigma\_{k-1}[I + F(t\_{k-1})T\_s]^{T} \\ &\quad + G(t\_{k-1})Q(t\_{k})T\_sG^{T}(t\_{k-1}) - K\_{k}H\_{k}G(t\_{k-1})Q(t\_{k})T\_sG^{T}(t\_{k-1}) \\ &\quad - G(t\_{k-1})Q(t\_{k})T\_sG^{T}(t\_{k-1})H\_{k}^{T}K\_{k}^{T} + K\_{k}H\_{k}G(t\_{k-1})Q(t\_{k})T\_sG^{T}(t\_{k-1})H\_{k}^{T}K\_{k}^{T} + K\_{k}\frac{R(t\_{k})}{T\_s}K\_{k}^{T}. \end{split} \tag{45}$$

#### *2.4. The Temporal Derivative of the Rényi Entropy and the Kalman Filter Gain*

To obtain the continuous form of the covariance matrix Σ, the limit as $T_s \to 0$ will be taken. However, the relation between the undetermined correction factor matrix $K_k$ and its continuous form remains unknown. Therefore, we make the following assumption.

**Assumption 1.** *Kk is of the order of Ts, that is:*

$$K(t\_k) = \frac{K\_k}{T\_s}.\tag{46}$$

Conversely, this assumption can also be derived from the conclusion. We now state the conclusion as a theorem under this assumption:

**Theorem 1.** *The discrete form of the undetermined correction factor matrix is the same as the continuous form when the temporal derivative of Rényi entropy is minimized. This can be presented in a mathematical form as follows:*

$$\{K_k = \Sigma_k H_k^T R_k^{-1},\; K = \Sigma H^T R^{-1} \mid K^* = \arg\min_{K} \dot{H}_R^{(\alpha)}(K)\}. \tag{47}$$

**Proof of Theorem 1.** We substitute the expression for $K_k$ into Equation (45) and neglect higher-order terms in $T_s$; Equation (45) becomes:

$$\begin{split} \Sigma_k &= [I + F(t_{k-1})T_s]\Sigma_{k-1}[I + F(t_{k-1})T_s]^T + T_s K(t_k) H_k [I + F(t_{k-1})T_s]\Sigma_{k-1}[I + F(t_{k-1})T_s]^T H_k^T T_s K^T(t_k) \\ &\quad - [I + F(t_{k-1})T_s]\Sigma_{k-1}[I + F(t_{k-1})T_s]^T H_k^T T_s K^T(t_k) - T_s K(t_k) H_k [I + F(t_{k-1})T_s]\Sigma_{k-1}[I + F(t_{k-1})T_s]^T \\ &\quad + G(t_{k-1})Q(t_k)T_s G^T(t_{k-1}) - T_s K(t_k) H_k G(t_{k-1})Q(t_k)T_s G^T(t_{k-1}) \\ &\quad - G(t_{k-1})Q(t_k)T_s G^T(t_{k-1}) H_k^T T_s K^T(t_k) + T_s K(t_k) H_k G(t_{k-1})Q(t_k)T_s G^T(t_{k-1}) H_k^T T_s K^T(t_k) + T_s K(t_k) \frac{R(t_k)}{T_s} T_s K^T(t_k) \\ &= \Sigma_{k-1} + T_s F(t_{k-1})\Sigma_{k-1} + T_s \Sigma_{k-1} F^T(t_{k-1}) - \Sigma_{k-1} H_k^T T_s K^T(t_k) - T_s K(t_k) H_k \Sigma_{k-1} \\ &\quad + G(t_{k-1})Q(t_k)T_s G^T(t_{k-1}) + T_s K(t_k) R(t_k) K^T(t_k). \end{split} \tag{48}$$

Moving $\Sigma_{k-1}$ to the left-hand side of Equation (48) and dividing both sides by $T_s$ yields the finite-difference expression:

$$\begin{split} \frac{\Sigma_k - \Sigma_{k-1}}{T_s} &= F(t_{k-1})\Sigma_{k-1} + \Sigma_{k-1} F^T(t_{k-1}) - \Sigma_{k-1} H_k^T K^T(t_k) - K(t_k) H_k \Sigma_{k-1} \\ &\quad + G(t_{k-1})Q(t_k)G^T(t_{k-1}) + K(t_k) R(t_k) K^T(t_k). \end{split} \tag{49}$$

Finally, passing to the limit as *Ts* → 0 and dropping the subscripts leads to the matrix differential equation:

$$
\dot{\Sigma} = F\Sigma + \Sigma F^T - \Sigma H^T K^T - KH\Sigma + GQG^T + KRK^T. \tag{50}
$$

Σ is invertible, as it is a positive definite matrix. Multiplying Equation (50) by $\Sigma^{-1}$, we can consider the temporal derivative of the Rényi entropy of the mean square error matrix Σ using Equation (2):


$$\begin{split} \dot{H}_R^{(\alpha)} &= \frac{1}{2}\text{Tr}\{\Sigma^{-1}\dot{\Sigma}\} \\ &= \frac{1}{2}\text{Tr}\{\Sigma^{-1}F\Sigma + F^T - H^T K^T - \Sigma^{-1}KH\Sigma + \Sigma^{-1}GQG^T + \Sigma^{-1}KRK^T\} \\ &= \frac{1}{2}\text{Tr}\{F + F^T - H^T K^T - KH + \Sigma^{-1}GQG^T + \Sigma^{-1}KRK^T\} \\ &= \frac{1}{2}\text{Tr}\{2F - 2KH + \Sigma^{-1}GQG^T + \Sigma^{-1}KRK^T\}, \end{split} \tag{51}$$

where the invariance of the trace operator under cyclic permutation has been used to eliminate $\Sigma^{-1}$ and Σ, and the fact that $\text{Tr}(F) = \text{Tr}(F^T)$ has been used to simplify the formula.
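Since the Rényi entropy of a Gaussian depends on Σ only through $\frac{1}{2}\ln\det\Sigma$ (plus terms constant in time), the first line of Equation (51) can be checked numerically. The matrix trajectory below is a made-up positive definite example, not one from the paper:

```python
import numpy as np

# Check that d/dt of (1/2) ln det Sigma equals (1/2) Tr(Sigma^{-1} dSigma/dt),
# the identity behind the first line of Equation (51).
def sigma(t):
    # Smooth positive definite trajectory (assumed for illustration).
    return np.array([[2.0 + np.sin(t), 0.3],
                     [0.3, 1.5 + 0.2 * t]])

t, h = 1.0, 1e-6
sigma_dot = (sigma(t + h) - sigma(t - h)) / (2 * h)        # central difference
lhs = 0.5 * np.trace(np.linalg.inv(sigma(t)) @ sigma_dot)  # (1/2) Tr(Sigma^-1 Sigma_dot)
rhs = (0.5 * np.log(np.linalg.det(sigma(t + h)))
       - 0.5 * np.log(np.linalg.det(sigma(t - h)))) / (2 * h)
print(abs(lhs - rhs))
```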

Equation (51) is clearly a quadratic function of the undetermined correction factor matrix *K*; hence, $\dot{H}_R^{(\alpha)}$ must have a minimum. Taking the derivative of both sides of Equation (51) with respect to the matrix *K* yields:

$$\begin{split} \frac{\partial}{\partial K}\dot{H}_R^{(\alpha)} &= -2\frac{\partial\,\text{Tr}(KH)}{\partial K} + \frac{\partial\,\text{Tr}(\Sigma^{-1}KRK^T)}{\partial K} \\ &= -2H^T + \Sigma^{-1}KR + (RK^T\Sigma^{-1})^T. \end{split} \tag{52}$$

In addition, since $\Sigma^{-1}$ and *R* are symmetric matrices, the result is:

$$\frac{\partial}{\partial K}\dot{H}_R^{(\alpha)} = -2H^T + 2\Sigma^{-1}KR. \tag{53}$$

*R* is invertible, as it is a positive definite matrix. By the extreme value principle, setting Equation (53) equal to zero gives:

$$K = \Sigma H^T R^{-1}.\tag{54}$$

So far, we have found the analytic solution to the undetermined correction factor matrix *K*, which is called the continuous-time Kalman filter gain in the classical Kalman filter. Then, the recursive formulations of the Kalman filter can be established through the Kalman filter gain *K*. Most importantly, this implies the connection between the temporal derivative of Rényi entropy and the classical Kalman filter: The temporal derivative of the Rényi entropy is minimized when the Kalman filter gain satisfies Equation (54).
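The optimality of the gain in Equation (54) can be verified numerically: at $K = \Sigma H^T R^{-1}$, the gradient of Equation (53) vanishes. The dimensions and random matrices below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Sanity check: the gradient -2 H^T + 2 Sigma^{-1} K R of Equation (53)
# is zero at the gain K = Sigma H^T R^{-1} of Equation (54).
rng = np.random.default_rng(0)
n, m = 4, 2
A = rng.standard_normal((n, n))
Sigma = A @ A.T + n * np.eye(n)          # symmetric positive definite
B = rng.standard_normal((m, m))
R = B @ B.T + m * np.eye(m)              # symmetric positive definite
H = rng.standard_normal((m, n))

K = Sigma @ H.T @ np.linalg.inv(R)       # Equation (54)
grad = -2 * H.T + 2 * np.linalg.inv(Sigma) @ K @ R   # Equation (53)
print(np.abs(grad).max())
```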

Looking back to Assumption 1 and substituting Equation (28) into Equation (54), we obtain:

$$K(t_k) = \frac{K_k}{T_s} = K = \Sigma H^T R^{-1} = \Sigma_k H_k^T (R_k T_s)^{-1} = \frac{\Sigma_k H_k^T R_k^{-1}}{T_s}. \tag{55}$$

Therefore, the discrete-time Kalman filter gain can be expressed as follows:

$$K_k = \Sigma_k H_k^T R_k^{-1}. \tag{56}$$

**Remark 1.** *The discrete-time Kalman filter gain has the same form as the continuous-time filter gain shown in Equation (54). This is consistent with our intuition and, in turn, confirms the correctness and rationality of Assumption 1.*

**Remark 2.** *The Kalman filter gain is obtained here by minimizing the temporal derivative of the Rényi entropy; it coincides with the gain of the original Kalman filter, which is deduced under the minimum mean square error criterion.*

Substituting Equation (54) into Equation (50), we have:

$$\begin{split} \dot{\Sigma} &= F\Sigma + \Sigma F^T - \Sigma H^T K^T - \Sigma H^T R^{-1} H \Sigma + G Q G^T + \Sigma H^T R^{-1} R K^T \\ &= F\Sigma + \Sigma F^T - \Sigma H^T R^{-1} H \Sigma + G Q G^T. \end{split} \tag{57}$$

This is a matrix differential equation that is quadratic in the mean square error matrix Σ; it is commonly called the Riccati equation. This is the same result as that of the Kalman–Bucy filter [7].
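The Riccati equation (57) can be propagated numerically with a standard ODE solver. The system matrices below are invented for illustration only; they are not from the paper:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Propagate dSigma/dt = F Sigma + Sigma F^T - Sigma H^T R^{-1} H Sigma + G Q G^T
# (Equation (57)) for a small, stable linear time-invariant system.
F = np.array([[0.0, 1.0], [-1.0, -0.5]])
G = np.eye(2)
Q = 0.1 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])
R_inv = np.linalg.inv(R)

def riccati_rhs(t, s):
    Sigma = s.reshape(2, 2)
    dSigma = (F @ Sigma + Sigma @ F.T
              - Sigma @ H.T @ R_inv @ H @ Sigma + G @ Q @ G.T)
    return dSigma.ravel()

sol = solve_ivp(riccati_rhs, (0.0, 20.0), np.eye(2).ravel(), rtol=1e-8, atol=1e-10)
Sigma_T = sol.y[:, -1].reshape(2, 2)
# Near the steady state the right-hand side (and hence the entropy rate) is ~0.
residual = np.linalg.norm(riccati_rhs(0.0, Sigma_T.ravel()))
print(residual)
```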

If the system equation, Equation (3), and the measurement equation, Equation (4), form a linear time-invariant system with constant noise covariance, the mean square error matrix Σ may reach a steady-state value, and $\dot{\Sigma}$ may eventually reach zero. So, we have the continuous algebraic Riccati equation as follows:

$$
\dot{\Sigma} = F\Sigma + \Sigma F^T - \Sigma H^T R^{-1} H\Sigma + GQG^T = 0. \tag{58}
$$

As we can see, the time derivative of covariance at the steady state is zero; then, the temporal derivative of the Rényi entropy should also be zero:

$$
\dot{H}_R^{(\alpha)} = 0.\tag{59}
$$

This implies that when the system approaches a stable state, the Rényi entropy approaches a steady value, so that the temporal derivative of the Rényi entropy is zero. This is reasonable: a steady system has constant Rényi entropy because its uncertainty is stable, which matches our intuitive understanding. Consequently, the stability of the Rényi entropy can serve as a valid indicator of whether the system is approaching the steady state.
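The steady-state relation can also be checked directly: solving the algebraic Riccati equation (58) and evaluating the entropy rate of Equation (51) at the solution gives zero. The system matrices are the same kind of made-up illustrative example as above, and the mapping onto SciPy's control-form ARE (with $a = F^T$, $b = H^T$) is an assumption of this sketch:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

F = np.array([[0.0, 1.0], [-1.0, -0.5]])
G = np.eye(2)
Q = 0.1 * np.eye(2)
H = np.array([[1.0, 0.0]])
R = np.array([[0.5]])

# Filter ARE  F S + S F^T - S H^T R^{-1} H S + G Q G^T = 0  maps onto SciPy's
# control-form ARE  a^T X + X a - X b r^{-1} b^T X + q = 0  with a = F^T, b = H^T.
S = solve_continuous_are(F.T, H.T, G @ Q @ G.T, R)
S_dot = (F @ S + S @ F.T
         - S @ H.T @ np.linalg.inv(R) @ H @ S + G @ Q @ G.T)  # Equation (58) residual
entropy_rate = 0.5 * np.trace(np.linalg.inv(S) @ S_dot)       # Equation (51) at steady state
print(entropy_rate)
```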

#### **3. Simulations and Analysis**

In this section, we give two experiments to show that when the nonlinear filter system approaches the steady state, the Rényi entropy of the system approaches stability. The first experiment is a numerical example of a falling body in noisy conditions, tracked by radar [30] using the UKF. The second experiment is a practical experiment of loosely coupled integration [29]. The simulations were carried out in MATLAB 2018a on a computer with an Intel i5-5200U 2.20 GHz CPU, and the graphs were plotted in MATLAB.

#### *3.1. Falling Body Tracking*

In the example of a falling body being tracked by radar, the body falls vertically. The radar is placed at a vertical distance *L* from the body, and the radar measures the distance *y* from the radar to the body. The state-space equation of the body is given by:

$$\begin{aligned} \dot{x}\_1 &= x\_2\\ \dot{x}\_2 &= d + g\\ \dot{x}\_3 &= 0 \end{aligned} \tag{60}$$

where $x_1$ is the height, $x_2$ is the velocity, $x_3$ is the ballistic coefficient, $g = -9.81$ m/s$^2$ is the gravitational acceleration, and $d$ is the air drag, which can be approximated as:

$$d = \frac{\rho x_2^2}{2x_3} = \rho_0 \exp\left(-\frac{x_1}{k}\right)\frac{x_2^2}{2x_3}, \tag{61}$$

where *ρ* is the air density; *ρ*<sup>0</sup> = 1.225 and *k* = 6705.6 are constants.

The measurement equation is:

$$y = \sqrt{L^2 + x\_1^2}.\tag{62}$$

It is worth noting that the drag and the square root cause severe nonlinearity in the state-space function and the measurement function, respectively.

The discrete-time nonlinear system can be obtained by the Euler discretization method. Adding additive Gaussian white noise to the process and the measurement, we obtain:

$$\begin{aligned} \mathbf{x}\_1(n+1) &= \mathbf{x}\_1(n) + \mathbf{x}\_2(n) \cdot T + w\_1(n) \\ \mathbf{x}\_2(n+1) &= \mathbf{x}\_2(n) + (d+\mathbf{g}) \cdot T + w\_2(n) \\ \mathbf{x}\_3(n+1) &= \mathbf{x}\_3(n) + w\_3(n) \end{aligned} \tag{63}$$

$$y(n) = \sqrt{L^2 + x\_1^2(n)} + v(n). \tag{64}$$
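The discretized model of Equations (63) and (64) can be sketched in a minimal open-loop simulation. The noise levels and step count below are illustrative assumptions, not the paper's exact experiment settings:

```python
import numpy as np

# Open-loop simulation of the falling-body model, Equations (63) and (64).
rng = np.random.default_rng(1)
T = 0.4                      # sampling period [s]
L = 100.0                    # horizontal distance from the radar [m]
g = -9.81
rho0, k = 1.225, 6705.6      # air-density constants from Equation (61)

def step(x, w):
    x1, x2, x3 = x
    d = rho0 * np.exp(-x1 / k) * x2**2 / (2 * x3)    # air drag, Equation (61)
    return np.array([x1 + x2 * T + w[0],
                     x2 + (d + g) * T + w[1],
                     x3 + w[2]])

x = np.array([1e5, -5000.0, 400.0])   # [height, velocity, ballistic coefficient]
w_std = np.array([10.0, 1.0, 0.1])    # assumed process-noise standard deviations
for _ in range(50):
    x = step(x, w_std * rng.standard_normal(3))
y = np.sqrt(L**2 + x[0]**2) + rng.standard_normal()  # noisy range, Equation (64)
print(x, y)
```

As the body descends into denser air, the drag term grows and decelerates it toward a terminal velocity, which is the source of the strong nonlinearity noted above.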

In the UKF numerical experiment, we set the sampling period to $T = 0.4$ s, the horizontal distance to $L = 100$ m, the maximum number of samples to $N = 100$, the process noise to $S_w = \mathrm{diag}(10^5, 10^3, 10^2)$, the measurement noise to $S_v = 10^6$, and the initial state to $x = [10^5; -5000; 400]$. The results are shown as follows:

Figure 1 shows the evolution of the covariance matrix Σ. Figures 2 and 3 show the Rényi entropy of the covariance matrix Σ and its change between adjacent time steps, respectively. Notice that the uncertainty increases near the middle of the plots, which coincides with the drag peak. However, the Rényi entropy fluctuates around 15, even though the fourth element of Σ changes dramatically. The entropy changes closely accompany the drag peak, which means the change in the entropy of the covariance reflects the evolution of matrix Σ. Consequently, the Rényi entropy can be viewed as an indicator of whether the system is approaching the steady state.

**Figure 1.** Evolution of matrix Σ.

#### *3.2. Practical Integrated Navigation*

In the loosely integrated navigation system, the system state parameter *x* is composed of inertial navigation system (INS) error states in the North–East–Down (NED) local-level navigation frame, and can be expressed as follows:

$$x = [(\delta r^n)^T \quad (\delta v^n)^T \quad (\psi)^T \quad (b_g)^T \quad (b_a)^T]^T, \tag{65}$$

where $\delta r^n$, $\delta v^n$, and $\psi$ represent the position error, the velocity error, and the attitude error, respectively; $b_g$ and $b_a$ are modeled as first-order Gauss–Markov processes, representing the gyroscope bias and the accelerometer bias, respectively.

The discrete-time state update equation is used to update state parameters as follows:

$$\mathbf{x}\_k = \Phi\_{k|k-1} \mathbf{x}\_{k-1} + G\_{k|k-1} w\_{k-1} \tag{66}$$

where $G_{k|k-1}$ is the system noise matrix, $w_{k-1}$ is the system noise, and $\Phi_{k|k-1}$ is the state transition matrix from $t_{k-1}$ to $t_k$; this is determined by the dynamic model of the state parameter.
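The propagation structure of Equation (66) for the 15-state error vector of Equation (65) can be sketched as follows. The matrices here are placeholders (identity-like, randomly perturbed) that only show the shapes involved, not an actual INS error model:

```python
import numpy as np

# Sketch of the discrete state update x_k = Phi x_{k-1} + G w_{k-1}, Equation (66).
n = 15                                   # 3 pos + 3 vel + 3 att + 3 gyro + 3 accel biases
rng = np.random.default_rng(0)
Phi = np.eye(n) + 0.01 * rng.standard_normal((n, n))   # state transition (assumed)
G = np.eye(n)                                          # system noise matrix (assumed)
w = 1e-3 * rng.standard_normal(n)                      # system noise sample
x_prev = np.zeros(n)                                   # previous error state
x_k = Phi @ x_prev + G @ w                             # Equation (66)
print(x_k.shape)
```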

In the loosely coupled integration, the measurement equation can be simply expressed as:

$$
\delta z_k = H_k x_k + v_k, \tag{67}
$$

where $v_k$ is the measurement noise, $H_k$ is the measurement matrix, and $\delta z_k$ is the measurement vector, calculated by differencing the global navigation satellite system (GNSS) observations and the inertial navigation system (INS) mechanization output.

The experiments reported in this section were carried out by processing data from an unmanned ground vehicle test. The gyroscope random walk was set to 0.03 deg/√*h* and the velocity random walk was set to 0.05 m/s/√*h*. The sampling rates of the inertial measurement unit (IMU) and the GNSS are 200 Hz and 1 Hz, respectively. The test lasted 48 min.

The position error curve, velocity error curve, and attitude error curve of the loosely coupled integration are shown in Figures 4–6. The RMSs of the position errors in the north, east, and down directions are 0.0057 m, 0.0024 m, and 0.0134 m, respectively. The RMSs of the velocity errors in the north, east, and down directions are 0.0023 m/s, 0.0021 m/s, and 0.0038 m/s, respectively. The RMSs of the attitude errors in the roll, pitch, and yaw directions are 0.0034 deg, 0.0030 deg, and 0.0178 deg, respectively.

The Rényi entropy of the covariance *P* is shown in Figure 7. As we can see, the Rényi entropy fluctuates around −100 once the filter converges, which is consistent with the conclusion from the entropy perspective.

**Figure 4.** Position error of the loosely coupled integration.

**Figure 6.** Attitude error of the loosely coupled integration.

**Figure 7.** Rényi entropy of the covariance Σ.

#### **4. Conclusions and Final Remarks**

We have derived the original Kalman filter by minimizing the temporal derivative of the Rényi entropy. In particular, we showed that the temporal derivative of the Rényi entropy is equal to zero when the Kalman filter system approaches the steady state, which means that the Rényi entropy approaches a stable value. Finally, simulation and practical experiments show that the Rényi entropy indeed stays stable when the system becomes steady.

Future work includes calculating the Rényi entropy of the innovation term when the measurements and the noise are non-Gaussian [14] in order to evaluate the effectiveness of measurements and adjust the noise covariance matrix. Meanwhile, we can also calculate the Rényi entropy of the nonlinear dynamical equation to measure the nonlinearity in the propagation step.

**Author Contributions:** Conceptualization, Y.L. and C.G.; Funding acquisition, C.G. and J.L.; Investigation, Y.L.; Methodology, Y.L., C.G., and S.Y.; Project administration, J.L.; Resources, C.G.; Software, Y.L. and S.Y.; Supervision, J.L.; Validation, S.Y.; Visualization, S.Y.; Writing—original draft, Y.L.; Writing—review and editing, C.G., S.Y., and J.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by a grant from the National Key Research and Development Program of China (2018YFB1305001).


**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Application of Cloud Model in Qualitative Forecasting for Stock Market Trends**

#### **Oday A. Hassen <sup>1</sup>, Saad M. Darwish <sup>2,\*</sup>, Nur A. Abu <sup>3</sup> and Zaheera Z. Abidin <sup>3</sup>**


Received: 23 July 2020; Accepted: 26 August 2020; Published: 6 September 2020

**Abstract:** Forecasting stock prices plays an important role in setting a trading strategy or determining the appropriate timing for buying or selling a stock. Technical analysis for financial forecasting has been successfully employed by many researchers. The existing qualitative methods developed based on fuzzy reasoning techniques cannot describe the data comprehensively, which has greatly limited the objectivity of fuzzy time series in uncertain data forecasting. Extended fuzzy sets (e.g., the fuzzy probabilistic set) study the fuzziness of the membership grade to a concept. The cloud model, based on probability measure space, automatically produces random membership grades of a concept through a cloud generator. In this paper, a cloud model-based approach was proposed to confirm accurate stock trends based on Japanese candlesticks. By incorporating probability statistics and fuzzy set theories, the cloud model can aid the required transformation between qualitative concepts and quantitative data. The degree of certainty associated with candlestick patterns can be calculated through repeated assessments by employing the normal cloud model. The hybrid weighting method comprising the fuzzy time series and the Heikin–Ashi candlestick was employed for determining the weights of the indicators in the multi-criteria decision-making process. Fuzzy membership functions are constructed by the cloud model to deal effectively with the uncertainty and vagueness of the stock historical data, with the aim of predicting the next open, high, low, and close prices for the stock. The experimental results prove the feasibility and high forecasting accuracy of the proposed model.

**Keywords:** cloud model; fuzzy time series; stock trend; Heikin–Ashi candlestick

#### **1. Introduction**

Forecasting stock prices is an attractive pursuit for investors and researchers who want to beat the stock market. The benefits of having a good estimation of the stock market behavior are well-known, minimizing the risk of investment and maximizing profits. Recently, the stock market has become an easily accessible investment tool, not only for strategic investors, but also for ordinary people. Over the years, investors and researchers have been interested in developing and testing models of stock price behavior. However, analyzing stock market movements and price behaviors is extremely challenging because of the market's dynamic, nonlinear, non–stationary, nonparametric, noisy, and chaotic nature [1]. Stock markets are affected by many highly interrelated uncertain factors that include economic, political, psychological, and company-specific variables. These uncertain factors are undesirable for the stock investor and make stock price prediction very difficult, but at the same time, they are also unavoidable whenever stock trading is preferred as an investment

tool [1,2]. To invest in stocks and achieve high profits with low risks, investors have used technical and fundamental analysis as two major approaches in decision-making in financial markets [2].

Fundamental analysis studies all of the factors that have an impact on the stock price of the company in the future such as financial statements, management processes, industry, etc. It analyzes the intrinsic value of the firm to identify whether the stock is underpriced or overpriced. On the other hand, technical analysis uses past charts, patterns, and trends to forecast the price movements of the entity in the coming time [2,3]. The main weakness of fundamental analysis is that it is time-consuming as people cannot quickly locate and absorb the information needed to make thoughtful stock picks. People's judgments are subjective, as is their definition of fair value. The second drawback of a fundamental analysis is in relation to the efficient market hypothesis. Since all information about stocks is public knowledge—barring illegal insider information—stock prices reflect that knowledge.

A major advantage of technical analysis is its simple logic and application. It ignores all economic, market, technological, and any other factors that may have an impact on the company and the industry, and only focuses on the data on prices and the volume traded to estimate future prices. The second advantage of technical analysis is that it excludes the subjective aspects of certain companies, such as the analyst's personal expectations [4]. However, technical analysis may get an investor trapped: price movements may be artificially created to lure investors into a stock, and once enough investors have entered, the manipulators start selling, leaving those investors trapped. Furthermore, it is too reliant on mathematics and patterns in the chart of the stock and ignores the underlying reasons or causes of price movements. As a result, the stock movements are too wild to handle or predict through technical analysis.

There exist two types of forecasting techniques to be implemented [5,6]: (a) qualitative forecasting models; and (b) quantitative forecasting models. The qualitative forecasting models are generally subjective in nature and are mostly based on the opinions and judgments of experts. Such types of methods are generally used when there is little or no past data available that can be used to base the forecast. Hence, the outcome of the forecast is based upon the knowledge of the experts regarding the problem. On the other hand, quantitative forecasting models make use of the data available to make predictions into the future. The model basically sums up the interesting patterns in the data and presents a statistical association between the past and current values of the variable. Management can use qualitative inputs in conjunction with quantitative forecasts and economic data to forecast sales trends. Qualitative forecasting is useful when there is ambiguous or inadequate data. The qualitative method of forecasting has certain disadvantages such as anchoring events and selective perception. Qualitative forecasts enable a manager to decrease some of this uncertainty to develop plans that are fairly accurate, but still inexact. However, the lack of precision in the development of a qualitative forecast versus a quantitative forecast ensures that no single qualitative technique produces an accurate forecast every time [2,4,7–10].

Over nearly two decades, the fuzzy time series approach has been widely used for its superiority in dealing with imprecise knowledge (such as linguistic variables) in decision making. In the process of forecasting with fuzzy time series models, the fuzzy logical relationship is one of the most critical factors that influence the forecasting accuracy. Many studies seek to deploy neuro-fuzzy inference to the stock market in order to deal with probability. Fuzzy logic is known to be useful for decision-making where there is a great deal of uncertainty as well as vague phenomena, but it lacks the learning capability; on the other hand, neural networks are useful in constructing an adaptive system that can learn from historical data, but are not able to process ambiguous rules and probabilistic datasets. It is tedious to develop fuzzy rules and membership functions, and fuzzy outputs can be interpreted in a number of ways, making analysis difficult. In addition, it requires a lot of data and expertise to develop a fuzzy system.

Recently, a probabilistic fuzzy set was suggested for forecasting by introducing probability theory into the fuzzy set framework. It changes the secondary membership function of a type-2 fuzzy set into a probability density function (PDF), so it is able to capture the random uncertainties in the membership degree. It has the ability to capture uncertainties of both a fuzzy and a random nature. However, the membership functions are difficult to obtain for existing fuzzy approaches to measurement uncertainty. To overcome this disadvantage, the cloud model was used to calculate the measurement uncertainty. A cloud is a new, easily visualized concept for uncertainty with well-defined semantics, mediating between the concept of a fuzzy set and that of a probability distribution [11–16]. A cloud model is an effective tool for transforming between qualitative concepts and their quantitative expressions. The digital characteristics of a cloud, the expected value (*Ex*), entropy (*En*), and hyper-entropy (*He*), integrate the fuzziness and randomness of linguistic concepts in a unified way. A cloud is composed of many cloud drops, and the shape of the cloud reflects the important characteristics of the quantitative concept [17]. The essential difference between the cloud model and the fuzzy probability concept lies in the method used to calculate a random membership degree. Basically, with the three numerical characteristics, the cloud model can randomly generate the membership degree of an element and implement the uncertain transformation between linguistic concepts and their quantitative instantiations.

Candlestick patterns provide a way to understand which buyer and seller groups currently control the price action. This information is visually represented in the form of different colors on these charts. Recently, several traders and investors have used the traditional Japanese candlestick chart pattern and analyzed the pattern visually for both quantitative and qualitative forecasting [6–10]. Heikin–Ashi candlesticks are an offshoot from Japanese candlesticks. Heikin–Ashi candlesticks use the open–close data from the prior period and the open–high–low–close data from the current period to create a combo candlestick. The resulting candlestick filters out some noise in an effort to better capture the trend.
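The Heikin–Ashi construction described above can be sketched with the standard formulas (each close is the average of the current open, high, low, and close; each open is the midpoint of the previous Heikin–Ashi candle). The price series below is made up for illustration:

```python
import numpy as np

# Compute Heikin-Ashi candles from ordinary OHLC series.
def heikin_ashi(o, h, l, c):
    ha_close = (o + h + l + c) / 4.0
    ha_open = np.empty_like(o)
    ha_open[0] = (o[0] + c[0]) / 2.0                  # conventional seed value
    for i in range(1, len(o)):
        # current open = midpoint of the previous Heikin-Ashi candle
        ha_open[i] = (ha_open[i - 1] + ha_close[i - 1]) / 2.0
    ha_high = np.maximum.reduce([h, ha_open, ha_close])
    ha_low = np.minimum.reduce([l, ha_open, ha_close])
    return ha_open, ha_high, ha_low, ha_close

o = np.array([10.0, 10.4, 10.9])
h = np.array([10.6, 11.0, 11.2])
l = np.array([9.8, 10.3, 10.7])
c = np.array([10.5, 10.8, 11.1])
ha_o, ha_h, ha_l, ha_c = heikin_ashi(o, h, l, c)
print(ha_o, ha_c)
```

Because each candle averages in the previous one, short-term noise is smoothed out, which is the filtering effect described above.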

#### *1.1. Problem Statement*

The price variation of the stock market is a non–linear dynamic system that deals with non–stationary and volatile data. This is the reason why its modeling is not a simple task. In fact, it is regarded as one of the most challenging modeling problems due to the fact that prices are stochastic. Hence, the best way to predict the stock price is to reduce the level of uncertainty by analyzing the movement of the stock price. The main motivation of our work was the successful prediction of stock future value that can yield enormous capital profits and can avoid potential market risk. Several classical approaches have been evolved based on linear time series models, but the patterns of the stock market are not linear. These approaches lead to inaccurate results, which may be susceptible to highly dynamic factors such as macroeconomic conditions and political events. Moreover, the existing qualitative based methods developed based on fuzzy reasoning techniques cannot describe the data comprehensively, which has greatly limited the objectivity of fuzzy time series in uncertain data forecasting. The most important disadvantage of the fuzzy time series approach is that it needs subjective decisions, especially in the fuzzification stage.

#### *1.2. Contribution and Novelty*

The objective of the work presented in this paper is to construct an accurate stock trend prediction model through utilizing a combination of the cloud model, Heikin–Ashi candlesticks, and fuzzy time series (FTS) in a unified model. The purpose of the cloud model is to add the randomness and uncertainty to the fuzziness linguistic definition of Heikin–Ashi candlesticks. FTS is utilized to abstract linguistic values from historical data, instead of numerical ones, to find internal relationship rules. Heikin–Ashi candlesticks were employed to give easier readability of the candle's features through the reduction of noise, eliminates the gaps between candles, and smoothens the movement of the market.

As far as the authors know, this is the first time that the cloud model has been used in forecasting stock market trends that is unlike the current methods that adopt a fuzzy probability approach for forecasting that requires an expert to define the extra parameters of the probabilistic fuzzy system such as output probability vector in probabilistic fuzzy rules and variance factor. These selected statistical parameters specify the degree of randomness. The cloud model not only focuses on the studies regarding the distribution of samples in the universe, but also try to generalize the point–based membership to a random variable on the interval [0, 1], which can give a brand new method to study the relationship between the randomness of samples and uncertainty of membership degree. More practically speaking, the degree with the aid of three numeric characteristics, by which the transformation between linguistic concepts and numeric values will become possible.

The outline of the remainder of this paper is as follows. Section 2 presents the background and summary of the state-of-the-art approaches. Section 3 describes the proposed model. The test results and discussion of the meaning are shown in Section 4. The conclusion of this work is given in Section 5.

#### **2. Preliminaries and Literature Review**

In this section, we summarize material that we need later that includes the cloud model, fuzzy time series, and Heikin–Ashi candlesticks. Finally, some state-of-the-art related works are discussed.

#### *2.1. Cloud Model*

The cloud model (CM) proposed by Li et al. [17] relies on probability statistics and traditional fuzzy theory [18,19]. The membership cloud model, as shown in Figure 1, can mix fuzziness and randomness to objectively describe the uncertainty of a complex system. This model makes it possible to obtain the range and the distribution of the quantitative data from qualitative information described by linguistic values, and it effectively transforms precise data into an appropriate qualitative linguistic value. The digital character of the cloud can be expressed by the expected value (*Ex*), entropy (*En*), and hyper-entropy (*He*). CM uses *Ex* to represent the qualitative concept; it is usually the value of *x* corresponding to the cloud center. *En* represents the uncertainty measure of the qualitative concept; it measures the ambiguity of the quantitative numerical range. *He* denotes the uncertainty measure of the entropy, namely the entropy of the entropy, which reflects the dispersion degree of the cloud drops and appears as the cloud's thickness [17–21].

**Figure 1.** Cloud model.

The theoretical foundation of CM is the probability measure (i.e., the measure function in the sense of probability). On the basis of the normal distribution and the Gaussian membership function, CMs describe the vagueness of the membership degree of an element by a random variable defined in the universe. Being an uncertain transition between a qualitative concept described by linguistic terms and its numerical representation, the cloud depicts such abundant uncertainties in linguistic terms as randomness, fuzziness, and the relationship between them. CM can acquire the range and distribution law of the quantitative data from the qualitative information expressed in linguistic terms. CM has been successfully applied and gives better performance results in several fields, such as intelligence control [11], data mining [19], and others. Figure 2 illustrates the types of cloud model (see [11,17] for more details).

**Figure 2.** Two different types of cloud generators. (**a**) Forward cloud generator; (**b**) Backward cloud generator.

#### *2.2. The Fuzzy Time Series Model*

Fuzzy time series is another approach to forecasting problems in which the historical data are linguistic values. It has recently received increasing attention because of its capability to deal with vague and incomplete data, and a variety of models have been developed to either improve forecasting accuracy or reduce computation overhead [22]. The fuzzy time series model uses a four-step framework to make forecasts, as shown in Figure 3: (1) define the universe of discourse and partition it into intervals; (2) determine the fuzzy sets on the universe of discourse and fuzzify the time series; (3) model the fuzzy logical relationships in the fuzzified time series; and (4) make forecasts and defuzzify the forecast values [23–25].
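As a concrete illustration, the four steps above can be sketched as follows; the interval count, the sample series, and the centroid defuzzification rule are illustrative assumptions, not the specific scheme used by any particular model in this paper:

```python
# Minimal sketch of the four-step fuzzy time series procedure.
# n_intervals, the sample data, and centroid defuzzification are assumptions.
import numpy as np

def fts_forecast(series, n_intervals=7):
    lo, hi = min(series), max(series)
    edges = np.linspace(lo, hi, n_intervals + 1)           # (1) partition the universe
    mids = (edges[:-1] + edges[1:]) / 2
    states = np.clip(np.searchsorted(edges, series, side="right") - 1,
                     0, n_intervals - 1)                   # (2) fuzzify each observation
    groups = {}                                            # (3) fuzzy logical relationships
    for a, b in zip(states[:-1], states[1:]):
        groups.setdefault(a, []).append(b)
    successors = groups.get(states[-1], [states[-1]])
    return float(np.mean(mids[successors]))                # (4) defuzzify (centroid of successors)

series = [50.1, 50.8, 50.4, 51.2, 51.0, 51.9, 52.3, 51.8]
print(round(fts_forecast(series), 2))
```

The forecast always falls inside the universe of discourse, since it is a mean of interval midpoints.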

**Figure 3.** Processes of fuzzy time series forecasting.

Nevertheless, forecasting performance can be significantly affected by the partition of the universe of discourse. Another issue is the consistency of forecasting accuracy with the interval length: in general, better accuracy can be achieved with a shorter interval length. However, an effective forecasting model should adhere to the consistency principle. (In accounting, for example, consistency requires that a company's financial statements follow the same accounting principles, methods, practices, and procedures from one accounting period to the next.) In general, the effect of parameters of the fuzzy time series, such as population size, number of intervals, and order of the fuzzy time series, must be tested and analyzed [26,27].

#### *2.3. Heikin–Ashi Candlestick Pattern*

The current forecasting models do not contain the qualitative information that would help in predicting the future. Japanese candlesticks are a technical analysis tool that traders use to chart and analyze the price movement of securities. Japanese candlesticks provide more detailed and accurate information about price movements compared to bar charts. They provide a graphical representation of the supply and demand behind each time period's price action. Each candlestick includes a central portion that shows the distance between the open and the close of the security being traded, the area referred to as the body. The upper shadow is the price distance between the top of the body and the high for the trading period. The lower shadow is the price distance between the bottom of the body and the low for the trading period. The closing price of the security being traded determines whether the candlestick is bullish or bearish. The real body is usually white if the candlestick closes at a higher price than when it opened. In such a case, the closing price is located at the top of the real body and the opening price is located at the bottom. If the security being traded closed at a lower price than it opened for the time period, the body is usually filled up or black in color. The closing price is located at the bottom of the body and the opening price is located at the top. Modern candlesticks now replace the white and black colors of the body with more colors such as red, green, and blue. Traders can choose among the colors when using electronic trading platforms (see Figure 4) [6,7].

**Figure 4.** The dark candle and white candle.

Normal candlestick charts are composed of a series of open–high–low–close (OHLC) candles set apart by a time series. The Heikin–Ashi technique shares some characteristics with standard candlestick charts but uses a modified formula of close–open–high–low (COHL). There are a few differences to note between the two types of charts. Heikin–Ashi has a smoother look, as it essentially takes an average of the movement. Heikin–Ashi candles tend to stay red during a downtrend and green during an uptrend, whereas normal candlesticks alternate colors even when the price is moving dominantly in one direction. Since Heikin–Ashi takes an average, the current price on the candle may not match the price the market is actually trading at. For this reason, many charting platforms show two prices on the *y*-axis: one for the calculation of the Heikin–Ashi and another for the current price of the asset [7–10].

#### *2.4. Related Work*

Researchers who believe in the existence of patterns in financial time series that make them predictable have centered their work mainly on two different approaches: statistical and artificial intelligence (AI). The statistical techniques most used in financial time series modeling are the autoregressive integrated moving average (ARIMA) and the smooth transition autoregressive (STAR) models [2]. Artificial intelligence, on the other hand, provides sophisticated techniques to model time series and search for behavior patterns: genetic algorithms, fuzzy models, the adaptive neuro-fuzzy inference system (ANFIS), artificial neural networks (ANN), support vector machines (SVM), hidden Markov models, and expert systems are some examples. Unlike statistical techniques, they are capable of obtaining adequate models for nonlinear and unstructured data. There exists a huge amount of literature that uses AI approaches for time series forecasting [2,4,8]. However, most of these approaches are inaccurate: computer programs are more effective at syntax analysis than semantic analysis. Furthermore, most of them follow the quantitative forecasting category, whereas qualitative forecasting is useful when data are ambiguous or inadequate. Most current studies were conducted using single time scale features of the stock market index, but it is also meaningful to study multiple time scale features [8]. With the development of deep learning, many deep learning-based methods have been applied to stock forecasting and have drawn some essential conclusions [3].

In the literature, many studies have used an integrated neuro-fuzzy model to estimate the dynamics of the stock market using technical indicators [3]. This approach integrates the advantages of both the neural and fuzzy models to facilitate reliable intelligent stock value forecasting. However, most of these works did not consider the fractional deviation within a day. Another group of research utilized hidden Markov models (HMMs) to predict the stock price based on the daily fractional change in the stock share value between the intra-day high and low. To benefit from the correlation between the technical indicators and to reduce the large dimensionality of the feature space, the principal component analysis (PCA) concept was deployed to select the most effective technical indicators from a large number of highly correlated variables. PCA linearly transforms the original large set of input variables into a smaller set of uncorrelated variables.
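A minimal sketch of this kind of PCA reduction follows; the indicator matrix here is synthetic random data, and the column count and component count are assumptions for illustration, not the features used in the studies cited above:

```python
# Sketch of PCA-based indicator reduction: project correlated technical-indicator
# columns onto a few uncorrelated principal components.
# The indicator matrix is synthetic, not real market features.
import numpy as np

rng = np.random.default_rng(0)
raw = rng.normal(size=(200, 2))
indicators = raw @ rng.normal(size=(2, 6))     # 6 correlated columns of rank ~2

X = indicators - indicators.mean(axis=0)       # center each indicator
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]             # keep the top-2 directions
scores = X @ components                        # uncorrelated low-dimensional features

print(scores.shape)                            # (200, 2)
corr = np.corrcoef(scores, rowvar=False)
print(abs(corr[0, 1]) < 1e-8)                  # component scores are uncorrelated
```

The score columns are uncorrelated by construction, which is exactly the property the paragraph above attributes to PCA.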

In addition, some researchers currently use soft computing techniques (e.g., genetic algorithms) to select an optimal subset of features from a large number of input features; the selected features are then given as input to the machine learning module (e.g., the SVM Light software package). Technical analysis is carried out based on technical indicators from the stock to be predicted and also from other stocks that are highly correlated with it. However, the decision is made only on the basis of the input feature variables of technical indicators. This leads to prediction errors due to the lack of precise domain knowledge and the lack of consideration of the various political and economic factors, other than the technical indicators, that affect the stock market [3,8].

Song and Chissom [13] suggested a forecasting model using fuzzy time series, which provided a theoretical framework to model a special dynamic process whose observations are linguistic values. The main difference between traditional time series and fuzzy time series is that the observed values of the former are real numbers while those of the latter are fuzzy sets or linguistic values. Chen et al. [16] presented a new method for forecasting university enrolment using fuzzy time series. Their method is more efficient than that of Song and Chissom because it uses simplified arithmetic operations rather than the complicated max–min composition operation. Hwang [22] suggested a new fuzzification-based method to revise Song and Chissom's method, using a different triangle fuzzification method to fuzzify crisp values. His method determines an interval of extension on both sides of the crisp value in the triangle membership function to obtain a varying degree of membership, and it achieved a better average forecasting error. In addition, the influences of factors and variables in a fuzzy time series model, such as the definition area, the number and length of intervals, and the interval of extension in the triangle membership function, were discussed in detail. More techniques that use fuzzy time series for forecasting can be found in [23–27].

Nison [5] introduced Japanese candlestick concepts to the Western world. Japanese candlestick patterns are believed to show both quantitative information, such as price and trend, and qualitative information, such as the psychology of the market. They consider not only the close values; the information in the body of the candlestick can also offer an informative summary of the trading sessions [28], and some of its components are predictable [29]. Some researchers have combined technical patterns and candlestick information [30]. In recent decades, several researchers have used Japanese candlesticks in creative forecasting methods [31–36]. Lee et al. [31] suggested an expert system with IF–THEN rules to detect candlestick patterns and flag sell and buy orders, with good hit ratios in the Korean market. The authors in [32] represented Japanese candlestick patterns using fuzzy linguistic variables and a knowledge base by fuzzifying both the candle line and the relationship between candle lines. In [33], a prediction model was suggested for the financial decision system based on fuzzy candlestick patterns. Lee [34] extended this work by creating and using personal candlestick pattern ontologies to allow different users to have their own explanations of a candlestick pattern. Kamo et al. [8,35,36] suggested a model that combined neural networks, committee machines, and fuzzy logic to identify candlestick patterns and generate a market strength weight using fuzzy rules in [35], a type-1 fuzzy logic system in [36], and finally, a type-2 fuzzy logic system in [6].

Naranjo et al. [37] presented a model that used the K-nearest neighbors (KNN) algorithm to forecast the candlestick one day ahead using the fuzzy candlestick representation. Naranjo et al. [38] fuzzified the gap between candles and added it as an extended element in candlestick patterns. However, Japanese candlesticks can carry contradictory information due to market noise [38]. The Heikin–Ashi technique modifies the traditional candlestick chart to reduce noise, eliminate the gaps between candles, and smooth the movement of the market, letting traders focus on the main trend. The Heikin–Ashi chart is not only more readable than traditional candles but is also a real trading system [10].

In general, most existing fuzzy time series forecasting models follow fuzzy rules according to the relationships between neighboring states without considering the inconsistency of fluctuations for a related period [38–40]. This paper proposes a new perspective to study the problem of prediction, in which inconsistency is quantified and regarded as a key characteristic of prediction rules by utilizing a combination of the cloud model, Heikin–Ashi candlesticks, and fuzzy time series (FTS) in a unified model that can represent both fluctuation trend and fluctuation consistency information.

#### **3. Proposed Model**

The purpose of this study is to predict and confirm stock future trends accurately, given the insufficient levels of accuracy and certainty in previous studies. The main problems in the data are uncertainty, noise, non-linearity, non-stationarity, and the dynamic process of stock prices in time series. Many models are used for prediction; statistical methods such as the ARMA family rely on trial-and-error iterations. Traders also face problems that include predicting the stock price every day, finding the reversal patterns of the stock price, the difficulty of model parameter tuning, and the gap between prediction results and investment decisions. Additionally, traditional candlestick patterns have problems such as ambiguous pattern definitions and the large number of patterns.

In order to deal with the above problems, the suggested prediction model uses both the cloud model and Heikin–Ashi (HA) candlestick patterns. Figure 5 illustrates the main steps of the suggested model: preparing historical data, HA candlestick processing, representing the HA candlestick using the cloud model, forecasting the next day's price (open, high, low, close) using cloud-based time series prediction, formalizing the next day's HA candlestick features, and finally forecasting the trend and its strong patterns. The following subsections discuss each step in detail [9].

#### *3.1. Step 1: Preparing the Historical Data*

Publicly available stock market datasets containing historical data on the four price time series for several companies were collected from Yahoo (http://finance.yahoo.com). The dataset specifies the opening price, lowest price, closing price, highest price, adjusted closing price, and volume against each date. The data were divided into two parts: a training part, used to formulate the model, and a testing part, used to validate the proposed model.

**Figure 5.** The procedure of the proposed forecasting model.

#### *3.2. Step 2: Candlestick Data*

The first stage in stock market forecasting is the selection of input variables. The two most common types of features that are widely used for predicting the stock market are fundamental indicators and technical indicators. The suggested model used technical indicators that are determined by employing candlestick patterns such as open price, close price, low price, and high price to try to find future stock prices [5,6]. A standard candlestick pattern is composed of one or more candlestick lines. However, the extended candlestick (Heikin–Ashi) patterns have one candlestick line. The HA candlestick uses the modified OHLC values as candlesticks that are calculated using [5]:

$$\begin{aligned} Ha_{\text{Close}} &= \frac{\text{Open} + \text{High} + \text{Low} + \text{Close}}{4} \\ Ha_{\text{Open}} &= \frac{Ha_{\text{Open}}(\text{previous bar}) + Ha_{\text{Close}}(\text{previous bar})}{2} \\ Ha_{\text{High}} &= \max\left\{\text{High},\ Ha_{\text{Open}},\ Ha_{\text{Close}}\right\} \\ Ha_{\text{Low}} &= \min\left\{\text{Low},\ Ha_{\text{Open}},\ Ha_{\text{Close}}\right\} \end{aligned} \tag{1}$$

Herein, each candlestick line has the following parameters: length of the upper shadow, length of the lower shadow, length of the body, color, open style, and close style. The open style and close style are formed by the relationship between a candlestick line and its previous candlestick line. The crisp value of the length of the upper shadow, length of the lower shadow, length of the body, and color play an important role in identifying a candlestick pattern and determining the efficiency of the candlestick pattern. The candlestick parameters are directly calculated using [9,10].

$$\begin{aligned} HaL_{\text{Body}} &= \frac{\max\left(Ha_{\text{Open}}, Ha_{\text{Close}}\right) - \min\left(Ha_{\text{Open}}, Ha_{\text{Close}}\right)}{Ha_{\text{Open}}} \times 100 \\ HaL_{\text{UpperShadow}} &= \frac{Ha_{\text{High}} - \max\left(Ha_{\text{Open}}, Ha_{\text{Close}}\right)}{Ha_{\text{Open}}} \times 100 \\ HaL_{\text{LowerShadow}} &= \frac{\min\left(Ha_{\text{Open}}, Ha_{\text{Close}}\right) - Ha_{\text{Low}}}{Ha_{\text{Open}}} \times 100 \\ Ha_{\text{Color}} &= Ha_{\text{Close}} - Ha_{\text{Open}} \end{aligned} \tag{2}$$

where *HaL* denotes the length of the body, upper shadow, or lower shadow of the HA candlestick, and *HaColor* represents the body color of the HA candlestick. Heikin–Ashi candlesticks are similar to conventional ones, but rather than using opens, closes, highs, and lows, they use averaged values of these four price metrics.
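Equations (1) and (2) can be sketched directly in code. The seed used for the first bar's HaOpen (the average of the raw open and close, since no previous HA bar exists) and the sample candle values are assumptions for illustration:

```python
# Sketch of Equations (1) and (2): derive Heikin-Ashi OHLC values from raw
# candles, then the body/shadow lengths as percentages of HaOpen.
# The first-bar seed and the sample candles are illustrative assumptions.
def heikin_ashi(candles):
    """candles: list of (open, high, low, close); returns HA candles + features."""
    result = []
    for i, (o, h, l, c) in enumerate(candles):
        ha_close = (o + h + l + c) / 4
        if i == 0:
            ha_open = (o + c) / 2                       # assumed seed for the first bar
        else:
            prev = result[-1]
            ha_open = (prev["open"] + prev["close"]) / 2
        ha_high = max(h, ha_open, ha_close)
        ha_low = min(l, ha_open, ha_close)
        body_top, body_bot = max(ha_open, ha_close), min(ha_open, ha_close)
        result.append({
            "open": ha_open, "high": ha_high, "low": ha_low, "close": ha_close,
            "body": (body_top - body_bot) / ha_open * 100,   # HaL_Body
            "upper": (ha_high - body_top) / ha_open * 100,   # HaL_UpperShadow
            "lower": (body_bot - ha_low) / ha_open * 100,    # HaL_LowerShadow
            "color": ha_close - ha_open,                     # Ha_Color
        })
    return result

bars = heikin_ashi([(10.0, 10.5, 9.8, 10.2), (10.2, 10.9, 10.1, 10.8)])
print(round(bars[1]["close"], 3), "white" if bars[1]["color"] > 0 else "black")
```

A positive `color` corresponds to a white (bullish) HA candle, matching Equation (6) below.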

In stock market prediction, the quality of data is the main factor because the accuracy and the reliability of the prediction model depends upon the quality of data. Any unwanted anomalies in the dataset are known as noise. Outliers are the set of observations that do not obey the general behavior of the dataset. The presence of noise and outliers may result in poor prediction accuracy of forecasting models. The data must be prepared so that it covers the range of inputs for which the network is going to be used. Data pre-processing techniques attempt to reduce errors and remove outliers, hence improving the accuracy of prediction models. The purpose of HA charts is to filter noise and provide a clearer visual representation of the trend. Heikin–Ashi has a smoother look, as it is essentially taking an average of the movement [9,10].

#### *3.3. Step 3: Cloud Model-Based Candlestick Representation*

There is no crisp value defining the length of the body and shadow in the HA candlestick; these variables are usually described as imprecise and vague. Herein, the cloud model was used to transform the crisp candlestick parameters (HA quantitative values) into linguistic variables that define the candlestick (qualitative values). To achieve this goal, a fuzzy HA candlestick pattern ontology was built that contains [4,8]:


$$Equal(x:a,b) = \begin{cases} 0 & x < a \\ \exp\left(-\frac{1}{2}\left(\frac{x-Ex}{En}\right)^{2}\right) & a \le x \le b \\ 0 & x > b \end{cases} \tag{3}$$

$$Short(Middle)(x:a,b,c,d) = \begin{cases} 0 & x < a \\ \exp\left(-\frac{1}{2}\left(\frac{x-Ex_{1}}{En_{1}}\right)^{2}\right) & a \le x \le b \\ 1 & b < x < c \\ \exp\left(-\frac{1}{2}\left(\frac{x-Ex_{2}}{En_{2}}\right)^{2}\right) & c \le x \le d \\ 0 & x > d \end{cases} \tag{4}$$

$$Large(x:a,b) = \begin{cases} 0 & x < a \\ \exp\left(-\frac{1}{2}\left(\frac{x-Ex}{En}\right)^{2}\right) & a \le x \le b \\ 1 & x > b \end{cases} \tag{5}$$
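The three membership shapes in Equations (3)–(5) can be sketched as follows. How *Ex* and *En* are instantiated from the interval endpoints (midpoint and a "3*En*"-style rule here) is an assumption for illustration, not the paper's exact parameterization:

```python
# Sketch of the Equal / Short-Middle / Large membership functions of Eqs. (3)-(5).
# The mapping from interval endpoints (a, b, c, d) to Ex and En is assumed.
import math

def gauss(x, ex, en):
    return math.exp(-0.5 * ((x - ex) / en) ** 2)

def equal(x, a, b):
    ex, en = (a + b) / 2, (b - a) / 6          # assumed midpoint / "3En" rule
    return gauss(x, ex, en) if a <= x <= b else 0.0

def middle(x, a, b, c, d):                     # trapezoid-like cloud of Eq. (4)
    if x < a or x > d:
        return 0.0
    if b < x < c:
        return 1.0
    if x <= b:
        return gauss(x, b, (b - a) / 3)        # rising edge: Ex1 = b
    return gauss(x, c, (d - c) / 3)            # falling edge: Ex2 = c

def large(x, a, b):                            # saturating cloud of Eq. (5)
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return gauss(x, b, (b - a) / 3)

print(middle(2.0, 1.5, 2.5, 3.5, 5.0))         # on the rising edge: between 0 and 1
print(large(6.0, 2.5, 5.0))
```

The (a, b, c, d) values in the usage line match the SHORT/MIDDLE parameters reported in Section 4.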

The body color *BodyColor* is also an important feature of a candlestick line. It is defined by three terms: Black, White, and Doji. A Doji describes the situation where the open price equals the close price; in this case, the height of the body is 0, and the shape is represented by a horizontal bar. The body color is defined as [10]:

$$\begin{array}{l} \text{If} (\text{Open} - \text{Close}) > 0 \text{ Then } \text{Body}\_{\text{Color}} = \text{Black} \\ \text{If} (\text{Open} - \text{Close}) < 0 \text{ Then } \text{Body}\_{\text{Color}} = \text{White} \\ \text{If} (\text{Open} - \text{Close}) = 0 \text{ Then } \text{Body}\_{\text{Color}} = \text{Doji} \end{array} \tag{6}$$


$$X\_Style(x:a,b) = \begin{cases} 0 & x < a \\ \exp\left(-\frac{1}{2}\left(\frac{x-Ex}{En}\right)^{2}\right) & a \le x \le b \\ 0 & x > b \end{cases} \tag{7}$$

**Figure 6.** The membership function of the body and shadow length based on the cloud model.

**Figure 7.** The membership function of the open and close styles based on the cloud model.

In our case, the membership cloud function (forward normal cloud generator) converts the statistical results to fuzzy numbers and constructs the one-to-many mapping model. The input of the forward normal cloud generator is the three numerical characteristics of a linguistic term, (*Ex*, *En*, *He*), and the number of cloud drops to be generated, *N*; the output is the quantitative positions of the *N* cloud drops in the data space and the certainty degree with which each cloud drop represents the linguistic term. The algorithm in detail is:


– Generate a normally distributed random number *En*′ with expectation *En* and standard deviation *He*
– Generate a normally distributed random number *x* with expectation *Ex* and standard deviation |*En*′|
– Calculate $Y = \exp\left(-\frac{1}{2}\left(\frac{x - Ex}{En'}\right)^{2}\right)$
– Let (*x*, *Y*) be one cloud drop, and repeat the steps above until *N* cloud drops are generated


The expected value (*Ex*), at the center of gravity of the cloud drops, is the central value of the distribution. Entropy (*En*) is the fuzzy measure of the qualitative concept that describes its uncertainty and randomness; the larger the entropy, the larger the acceptable interval of the qualitative concept, meaning the concept is fuzzier. Hyper entropy (*He*) is the uncertainty measure of the qualitative concept that describes its dispersion; the larger the hyper entropy, the thicker the cloud, meaning the concept is more discrete [20,21].
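A minimal sketch of the forward normal cloud generator described above; the parameter values (*Ex* = 0, *En* = 1, *He* = 0.1) are illustrative:

```python
# Sketch of the forward normal cloud generator: for each of N drops,
# draw En' ~ N(En, He^2), then x ~ N(Ex, En'^2), and compute the
# certainty degree Y. Parameter values are illustrative assumptions.
import math
import random

def forward_cloud(ex, en, he, n, seed=1):
    rng = random.Random(seed)
    drops = []
    for _ in range(n):
        en_prime = rng.gauss(en, he)                       # second-order randomness from He
        x = rng.gauss(ex, abs(en_prime))                   # drop position
        y = math.exp(-0.5 * ((x - ex) / en_prime) ** 2)    # certainty degree in (0, 1]
        drops.append((x, y))
    return drops

drops = forward_cloud(ex=0.0, en=1.0, he=0.1, n=1000)
xs = [x for x, _ in drops]
print(abs(sum(xs) / len(xs)) < 0.2)   # sample mean of the drops clusters near Ex
```

A larger *He* thickens the cloud (more spread in the certainty degrees for a given *x*), which is exactly the dispersion behavior described in the paragraph above.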

– Forecast the next day price (open, high, low, close)

In the fuzzy candlestick pattern approach, the measured values are the open, close, high, and low price of trading targets in a specific time period. The features of the trading target price fluctuation are represented by the fuzzy candlestick pattern. The classification rules of fuzzy candlestick patterns can be determined by the investors or the computer system. In general, using a candlestick pattern approach for financial time series prediction consists of the following steps [21]:

- ✔ Rule 1: If there is only one PLR in the PLRG (*A*<sup>1</sup> → *Ai*), then

$$P(t) = \frac{Ex\_i + S(t-1)}{2} \tag{8}$$

- ✔ Rule 2: If there are *p* PLRs in the PLRG (*A*<sup>1</sup> → *A*<sup>1</sup>, *A*<sup>2</sup>, ... , *Ap*), then

$$P(t) = \frac{1}{2} \left( \frac{(n_{1} \times Ex_{1}) + (n_{2} \times Ex_{2}) + \dots + (n_{p} \times Ex_{p})}{n_{1} + n_{2} + \dots + n_{p}} + S(t-1) \right) \tag{9}$$
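The two forecasting rules in Equations (8) and (9) can be sketched as follows; the numbers in the usage lines are illustrative, not data from the paper:

```python
# Sketch of Equations (8) and (9): combine the expected values Ex_i of the
# states in a matched relationship group (weighted by occurrence counts n_i)
# with the previous actual value S(t-1). Sample numbers are illustrative.
def forecast(group, s_prev):
    """group: list of (Ex_i, n_i) pairs for the matched relationship group."""
    if len(group) == 1:                        # Rule 1: single relationship
        ex, _ = group[0]
        return (ex + s_prev) / 2
    num = sum(ex * n for ex, n in group)       # Rule 2: count-weighted mean
    den = sum(n for _, n in group)
    return (num / den + s_prev) / 2

print(forecast([(50.5, 1)], 50.0))                        # Rule 1 -> 50.25
print(forecast([(50.5, 2), (51.5, 1), (52.0, 1)], 50.0))  # Rule 2
```

Both rules average the group's defuzzified value with the previous observation, which damps the forecast toward the most recent actual price.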

	- ✔ Rule 1: **If** *BodyColor* is White and *HaLBody* is Long **Then**, UP Trend.
	- ✔ Rule 2: **If** *BodyColor* is Black and *HaLBody* is Long **Then**, Down Trend.
	- ✔ Rule 3: **If** *BodyColor* is White and *HaLBody* is Long and *HaLLowerShadow* is Equal **Then**, Strong UP Trend.
	- ✔ Rule 4: **If** *BodyColor* is Black and *HaLBody* is Long and *HaLUpperShadow* is Equal **Then**, Strong Down Trend.
	- ✔ Rule 5: **If** (*HaLBody* is Equal) and (*HaLUpperShadow* & *HaLLowerShadow*) is Long **Then** Change of Trend.
	- ✔ Rule 6: **If** (*HaLBody* is Short) and (*HaLUpperShadow* & *HaLLowerShadow*) is not Equal **Then**, Consolidation Trend.
	- ✔ Rule 7: **If** (*HaLBody* is Short or Equal) and (*HaOpen*\_Style and *HaClose*\_Style) is (Low\_Style or EqualLow\_Style) and *HaLUpperShadow* is Equal **Then** Weaker Trend.

**Figure 8.** The clouds of the linguistic terms.

**Table 1.** The digital characteristics of cloud member function for each linguistic term.


#### **4. Experimental Results**

In order to test the efficiency and validity of the proposed model, it was implemented in MATLAB. The prototype was built in a modular fashion and was implemented and tested on a Dell™ Inspiron™ N5110 laptop (Dell Corporation, Texas) with the following specifications: Intel(R) Core(TM) i5–2410M CPU @ 2.30 GHz, 4.00 GB of RAM, and 64-bit Windows 7. A dataset composed of real stock time series from the NYSE (New York Stock Exchange) was used in the experimentation. The dataset had 13 time series of NYSE companies, each with the four prices (open, high, low, and close). The time series were downloaded from the Yahoo finance website (http://finance.yahoo.com); Table 2 shows the companies' names, symbols, and the starting and ending dates of the selected dataset. The dataset was divided into 2/3 for training and 1/3 for testing.


**Table 2.** Selected time series datasets.

In the proposed forecasting model, the parameters were set as follows. The ranges of the body and shadow lengths were set to (0, 14) to represent the percentage fluctuation of the stock price, because the varying percentages of stock prices are limited to 14 percent in, for example, the Taiwanese stock market. It should be noted that although we limited the fluctuation of the body and shadow lengths to 14 percent, in other applications the designer can change the range of the fluctuation length to any number [4]. The four parameters (a–d) of the functions describing the linguistic variables SHORT and MIDDLE were (0, 0.5, 1.5, 2.5) and (1.5, 2.5, 3.5, 5). The parameters (a, b) used to model the EQUAL fuzzy set were (0, 0.5). Regarding the two parameters *D*<sup>1</sup> and *D*<sup>2</sup>, which are used to determine the universe of discourse (UoD), we set *D*<sup>1</sup> = 0.17 and *D*<sup>2</sup> = 0.34 [6,8]. Finally, the number of drops in the cloud model used to build the membership function is usually equal to the number of samples in the dataset so as to describe the data efficiently. The mean squared error (MSE) and mean absolute percentage error (MAPE), which are used by both academicians and practitioners [4,21], were used to evaluate the accuracy of the proposed method. Tables 3–6 show the output of applying each model step to the Yahoo dataset.

$$MSE = \frac{1}{n}\sum_{i=1}^{n} \left( (\text{Forecasted Value})_{i} - (\text{Actual Value})_{i} \right)^{2} \tag{10}$$

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{(\text{Actual Value})_{i} - (\text{Forecasted Value})_{i}}{(\text{Actual Value})_{i}} \right| \tag{11}$$
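Equations (10) and (11) translate directly to code; the sample price values are illustrative, not results from the paper:

```python
# Direct implementations of Equations (10) and (11) with illustrative numbers.
def mse(actual, forecasted):
    return sum((f - a) ** 2 for a, f in zip(actual, forecasted)) / len(actual)

def mape(actual, forecasted):
    return sum(abs((a - f) / a) for a, f in zip(actual, forecasted)) / len(actual)

actual = [50.0, 51.0, 52.0]
forecasted = [50.5, 50.8, 52.4]
print(round(mse(actual, forecasted), 4))    # mean of squared errors
print(round(mape(actual, forecasted), 4))   # mean absolute percentage error
```

Note that MSE is in squared price units while MAPE is scale-free, which is why both are commonly reported together.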


Sample model output (table fragment): 11/12/2014 — 49.54, 50.58, 49.43, 49.94; −1.57 (A4), −0.22 (A4), 0.49 (A5), 1.48 (A6).

**Table 5.** The PLR results.


The suggested model was verified with respect to the MSE on both the training and testing data. The predicted prices were close to the actual prices. There was a clear difference between the MSE values for the training and testing data, showing that the model overfit the training data, as the error on the training dataset was minimized: the model was not generalized but specialized to the structure of the training dataset. Using cross-validation is one possible way to handle overfitting, and using multiple runs of cross-validation is better still. The model MSE is summarized in Table 7.

**Table 7.** Average MSE of the suggested model for all dataset.


Table 8 shows the comparison results between the two versions of the suggested model: the first uses the open, high, low, and close prices as the initial prices in the cloud FTS model (Cloud FTS), and the second uses the HaOpen, HaHigh, HaLow, and HaClose prices as the initial prices (HA Cloud FTS); they are compared with two standard models, Song's fuzzy time series (FTS) [13,14] and Yu's weighted fuzzy time series (WFTS) [23]. In Song's studies, the fuzzy relationships were treated as equally important, which might not properly reflect the importance of each individual fuzzy relationship in forecasting; Yu's study recommends assigning different weights to the various fuzzy relationships. From Table 8, the MSE of the forecasting results of the proposed model was smaller than that of the other methods for all datasets. That is, the proposed model could obtain higher forecasting accuracy for stock prices than the Song FTS and Yu WFTS models. In general, the MSE values changed according to the nature of each dataset. It can be noted from the table that the Wells Fargo dataset yielded the best results in terms of MSE for both the training and testing data. The Wells Fargo dataset is a small dataset (2,313 rows and 12 columns) that is probably linearly separable, so it produced high accuracy; this is more difficult to accomplish with larger datasets, so the algorithm produced lower accuracy there.

**Table 8.** MSE Comparison for CLOSE price prediction between HA Cloud FTS, Cloud FTS, Yu WFTS and Song.


One possible explanation of these results is that, compared with standard models that use FTS only, combining FTS with the cloud model automatically produces random membership grades of a concept through a cloud generator. In this way, the membership functions are built from the characteristics of the data, whereas traditional fuzzy-based forecasting methods depend on an expert. Regarding the importance of using HA candlesticks with the cloud model for forecasting, the HA candlesticks exposed significant features that can identify market turning points and the direction of the trend, which helps improve prediction accuracy.

The last set of experiments was performed to validate the efficiency of the suggested model compared to the state-of-the-art models listed in Figure 9, using the Taiwan Capitalization Weighted Stock Index (TAIEX). The data used for comparison were obtained from the Taiwan Stock Exchange website (https://www.twse.com.tw/). As shown in Figure 9, the proposed model can perform effective prediction: the predicted stock price closely resembles the actual price in the stock market. The MSE of the suggested model was 665.40, compared with 1254.90, 4530.45, and 4698.78 for the other methods, respectively. Clearly, the suggested model had a smaller MSE than the previous methods. One reason for this result is the merging of the cloud model and HA candlesticks, which makes it possible to account for the vagueness and uncertainty of the pattern features based on the data characteristics.

**Figure 9.** Comparison of the forecasting values of different methods.

#### **5. Conclusions**

In recent years, mathematical and computational models from artificial intelligence have been used for forecasting. Knowing future values and the stock market trend has attracted much attention from researchers, investors, financial experts, and brokers. This work analyzed stock trading because of its highly non-linear, uncertain, and dynamic data over time. This paper therefore presented a Japanese candlestick-based cloud model for stock price prediction that minimizes investor risk when investing money in the stock market. The proposed work presented an enhanced fuzzy time series forecasting model based on the cloud model and Heikin–Ashi Japanese candlesticks to predict and confirm accurate stock trends. The objective of this model was to handle qualitative forecasting, not quantitative forecasting only. The experimental results showed that HA Cloud FTS and Cloud FTS had lower average errors than the other methods in the literature, which indicates the high accuracy of the proposed model. HA Cloud FTS yielded an MSE of 0.779 for the training data and 0.176 for the test data, and Cloud FTS gave an MSE of 0.939 for the training data and 0.240 for the test data; these results mean that the HA Cloud FTS method, which uses the HaOpen, HaHigh, HaLow, and HaClose prices as the initial prices, provides a significant improvement in stock market trend prediction. Future work includes embedding neutrosophic logic to enhance qualitative forecasting.

**Author Contributions:** Conceptualization, S.M.D. and O.A.H.; Methodology, S.M.D., O.A.H. and N.A.A.; Software, O.A.H., N.A.A. and Z.Z.A.; Validation, S.M.D., N.A.A. and O.A.H.; Formal analysis, S.M.D., N.A.A. and O.A.H.; Investigation, S.M.D., N.A.A. and O.A.H.; Resources, O.A.H., N.A.A. and Z.Z.A.; Data curation, O.A.H., N.A.A. and Z.Z.A.; Writing—original draft preparation, S.M.D., N.A.A. and O.A.H.; Writing—review and editing, S.M.D., O.A.H. and N.A.A.; Visualization, O.A.H., N.A.A. and Z.Z.A.; Supervision, S.M.D. and N.A.A. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

#### *Article*

## **A Novel Comprehensive Evaluation Method for Estimating the Bank Profile Shape and Dimensions of Stable Channels Using the Maximum Entropy Principle**

**Hossein Bonakdari <sup>1</sup>, Azadeh Gholami <sup>2</sup>, Amir Mosavi <sup>3,4,5,</sup>\*, Amin Kazemian-Kale-Kale <sup>2</sup>, Isa Ebtehaj <sup>1</sup> and Amir Hossein Azimi <sup>6</sup>**


Received: 31 July 2020; Accepted: 23 October 2020; Published: 26 October 2020

**Abstract:** This paper presents an extensive and practical study of the estimation of stable channel bank shape and dimensions using the maximum entropy principle. The transverse slope (*St*) distribution of threshold channel bank cross-sections satisfies the properties of a probability space. The entropy of *St* is subject to two constraint conditions, and the principle of maximum entropy is applied to find the least biased probability distribution. Accordingly, the Lagrange multiplier (λ), a critical parameter in the entropy equation, is calculated numerically based on the maximum entropy principle. The main goal of the present paper is a comprehensive investigation of the hydraulic parameters governing the mean transverse slope (*St*) value using Gene Expression Programming (GEP), given only the initial information (discharge (*Q*) and mean sediment size (*d*50)) of the problem. An explicit and simple equation relating the *St* of the banks to the geometric and hydraulic parameters of the flow is introduced based on GEP in combination with the shape profile equations of previous researchers. On this basis, a reliable numerical hybrid model, the Entropy-based Design Model of Threshold Channels (EDMTC), is designed by combining entropy theory with the evolutionary GEP algorithm to estimate the bank profile shape and dimensions of threshold channels. A wide range of laboratory and field data is utilized to verify the proposed EDMTC. The results demonstrate that the Shannon entropy model, applied here to bank profile estimation based on entropy for the first time, is accurate, with a lower average Mean Absolute Relative Error (MARE) of 0.317 compared with the earlier model of Cao and Knight (1997) (MARE = 0.98). Furthermore, the EDMTC proposed in this paper has acceptable accuracy in predicting the shape profile and, consequently, the dimensions of threshold channel banks over a wide range of laboratory and field data when only the channel hydraulic characteristics (e.g., *Q* and *d*50) are known. Thus, the EDMTC can be used in threshold channel design and implementation when the channel dimensions are unknown. The uncertainty analysis of the EDMTC supports the model's high reliability, with a Width of Uncertainty Bound (*WUB*) of ±0.03 and standard deviation (*Sd*) of 0.24.

**Keywords:** water resources; channel; mathematical entropy model; bank profile shape; gene expression programming (GEP); entropy; genetic programming; artificial intelligence; data science; big data

#### **1. Introduction**

The sections and dimensions of rivers and alluvial channels change due to the constant interactions between water and sediments. River and channel plans and cross-sections undergo dimensional changes until an equilibrium or stable state is attained. After equilibrium, the average dimensions of a stable cross-section do not change over time; in fact, the rates of sedimentation and erosion in a channel cross-section are theoretically in equilibrium [1–3]. In this case, the particles on the bed and at the channel banks are in dynamic balance. In channels with coarse particles, the movement of sediments at any location in the channel contradicts the term "channel stability" [4,5]. In this type of channel, it is not possible for sediments to move without changing the channel dimensions and width [6]. Moreover, the channel dimensions and the width of the water surface are only preserved (channel stability) when the sediment particles on the channel bed move slightly and those at the banks are at the threshold of motion [7]. In such a case, one problem in river morphology is predicting the erosion process of river banks and the formation of the profile shape until stable sections are achieved [8,9].

The *St* is distributed between a value of zero on the channel bed and the maximum *St* value (*St* <sup>+</sup>) at the free water surface at the water margin. The *St* distribution is related to the lateral distance (*x*) from the channel bed (*x* = 0) to the water margin. At the water margin, *x* is denoted *L*, which is equal to the half-width of the free water surface (*L* = *B*/2). It is therefore worthwhile to apply the entropy concept to the *St* of bank profiles, because the entropy concept is based on the probability principle and its relation to a channel's geometric parameters. Furthermore, since the *St* <sup>+</sup> value at the free water surface is equal to μ (the submerged static coefficient of Coulomb friction), the *St* of the banks is also affected by the hydraulic parameters of the channel cross-sections (including flow and sediment characteristics); under these conditions, the *St* <sup>+</sup> values are homogeneous across channels. Because the mean *St* value is not specified for channels (and there is no established relation for computing it), a uniform distribution of the transverse bank slope is assumed, so that the mean *St* value is obtained as the ratio of the maximum flow depth at the channel centerline (*hc*) to the corresponding lateral distance of this depth from the central channel axis (*L*). Consequently, if the channel dimension values are not specified, the mean *St* value cannot be obtained. A new relationship is therefore needed to estimate mean *St* values from the available data (not only the channel dimensions).

Furthermore, with the obtained entropy equation it is possible to accurately predict the *St* of the banks, provided the Lagrange multipliers contained in the equation take correct values. Therefore, if the entropy equation is to predict the transverse bank slope correctly, the multiplier λ should be closely related to the hydraulic and geometric parameters of the banks, which had not been investigated before the authors' recent study. Gholami et al. [10] analyzed the sensitivity of the λ multiplier to different hydraulic and geometric parameters. They noted the considerable impact of the maximum slope of the bank profile and the dimensionless lateral distance of the river banks on λ variations. Therefore, by investigating the relationship between the entropy parameters and the hydraulic and geometric parameters of a channel, it is possible to obtain a simpler equation for the transverse bank slope distribution and thus the bank profile shape. Based on the results of Gholami et al. [10], a simple relation based on the maximum entropy principle is presented to compute the entropy parameter using the maximum and mean values of *St*. Consequently, the ratio of the mean *St* to *St* <sup>+</sup> (δ) is evaluated, and a relationship between the δ ratio and the entropy parameter (*K* = λμ) is presented. Moreover, a regression model based on GEP is used to relate the *St* of the banks to the geometric and hydraulic parameters of the flow (when the channel dimensions are unknown and only the hydraulic characteristics, e.g., *Q* and *d*50, are available). This relationship is combined with Vigilar and Diplas' [11] polynomial equation to present an equation for estimating the stable free surface width based on the relationship between δ and *K*. The EDMTC proposed in this paper is used together with the bank profile shape equation to obtain the channel bank dimensions.

#### **2. Literature Review**

So far, many studies have examined channel dimensions in the dynamic equilibrium state [12–19]. However, few studies have examined the bank profile shape of threshold channels, i.e., the static equilibrium of channels. Parker [6] did extensive research in this field and justified the stable channel paradox with the nonuniform shear stress distribution on the channel bed and banks due to the longitudinal transformation of the lateral flow momentum. Parker's model estimated the bank profile shape as a cosine curve. Later, Ikeda [20] conducted extensive laboratory studies to investigate the shape of stable channel banks. Ikeda then employed a mathematical model based on Parker's idea and presented an exponential equation for bank profile shapes. Ikeda [20] pointed out that the most influential parameters in determining the shape of stable channels are *Q* and *d*50. Diplas [21] used an analytical model with their experimental data and proposed a special case of Ikeda's [20] equation as an exponential function for the bank profile shape. Pizzuto [22] examined the stability criterion using an analytical solution of the widening process at the free water surface. Pizzuto [22] considered the shear stress redistribution due to lateral diffusion and reported an exponential function for the bank profile after channel widening stops. Diplas and Vigilar [23] presented a numerical model to assess the difference between the shape of threshold channels and the previous conventional (cosine) bank shape. They stated that for particles that do not move along the banks, the transverse slope of the banks should be milder, in which case a wider and deeper channel would form. Hence, they introduced a fifth-degree polynomial profile shape for stable channel banks. Vigilar and Diplas [11,24,25] provided graphs for predicting the dimensions and profile shapes of stable channel banks with a third-degree polynomial equation.
This equation can accurately predict the bank profile shape, because it is in accordance with the results obtained by several other researchers who used various other methods [26,27]. Babaeyan [7] did an extensive laboratory study and, according to their observational data, introduced a hyperbolic bank profile shape. Cao and Knight [28] were the first to examine the shape of bank profiles using the entropy concept. By applying the shape equation obtained with the maximum entropy principle, they reported a parabolic equation. In solving their entropy equation, the Lagrange multiplier (λ) contained within was tested numerically. The equation was validated according to Chow's [29] definition of natural rivers, considering a value of zero for λ. Cao and Knight [28] emphasized the need to further consider the physical meaning of the multiplier λ. Following Cao and Knight's [28] brief study, no other study was based on the entropy concept to predict the *St* and hence the bank profile shape of stable channels. Gholami et al. [30–34] assessed the ability of different artificial intelligence (AI) methods to estimate the bank profile shapes of threshold channels. They noted the high efficiency of these methods and the need for further research on the formation of stable bank profile shapes.

Due to the significance of the entropy concept, many studies have addressed entropy in examining different variables [35–38]. In hydraulic science, Chiu [39] was the first to examine the flow velocity distribution using entropy. Later, other studies applied entropy to evaluate the mean and maximum velocity ratio, shear stress, and sediment concentration distributions in channel cross-sections [40–53]. Regarding the application of entropy concepts to determining the *St* of stable channels, Gholami et al. [54,55] recently assessed the ability of the Tsallis and Shannon entropy concepts to estimate the *St* of stable channel banks. They extensively assessed the variation of different entropy parameters and their signs in the obtained entropy-based equations. However, they reported nothing about the significant effects of the relations between the maximum and mean values of *St*, the entropy parameters, and the other hydraulic and geometric conditions.

#### **3. Materials and Methods**

#### *3.1. Maximum Entropy Principle in Estimating the Transverse Slope of Stable Banks*

Cao and Knight [28] evaluated the *St* of banks in the threshold state using the principle of maximum entropy for the first time. Subsequently, Gholami et al. [54,56] modified Cao and Knight's [28] application of the maximum entropy principle. Cao and Knight [28] employed the Shannon entropy [56] in the form of Equation (1) and presented Equation (2), considering the *St* of stable banks as a random variable and applying the principle of maximum entropy [57,58] with the two constraint conditions of continuity and momentum in Equations (3) and (4) [59].

$$H(S_t) = -\int p(S_t) \ln p(S_t)\, dS_t, \tag{1}$$

where *p*(*St*) is the Probability Density Function (PDF) of the *St* of the banks, and *H* is the amount of entropy.

$$S_t = \frac{1}{\lambda} \ln\left[1 + \left(e^{\lambda\mu} - 1\right)\frac{x}{L}\right], \tag{2}$$

$$\int_0^{\mu} p(S_t)\,dS_t = 1, \tag{3}$$

$$\int_0^{\mu} S_t\, p(S_t)\,dS_t = \overline{S_t}, \tag{4}$$

where *x* is the lateral distance of points on the banks from the channel centerline and λ is the Lagrange multiplier. Figure 1 represents a symmetrical bank cross-section of stable alluvial channels. In stable channels, the *St* of the banks changes monotonically from zero at the centerline of the channel bed (*x* = 0 and *y* = 0) to the *St* <sup>+</sup> value at the free water surface at the water margin (*x* = *L* = *B*/2 and *y* = *hc*), which is equal to μ (the submerged static coefficient of Coulomb friction).

**Figure 1.** Symmetrical cross section of alluvial threshold channels and its characteristics.
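The two boundary conditions described above (*St* = 0 at the centerline, *St* = μ at the water margin) can be checked directly from Equation (2). The following is a minimal sketch with λ treated as already known; the function name is an illustrative choice.

```python
import math

# Sketch of Equation (2): transverse slope distribution along a stable
# bank, given the Lagrange multiplier lam, the Coulomb friction
# coefficient mu, and the half-width L of the free water surface.

def transverse_slope(x, lam, mu, L):
    """Transverse slope S_t at lateral distance x from the centerline."""
    return (1.0 / lam) * math.log(1.0 + (math.exp(lam * mu) - 1.0) * x / L)
```

At *x* = 0 the logarithm vanishes, giving *St* = 0 on the bed; at *x* = *L* the bracket equals *e*<sup>λμ</sup>, so *St* = μ, reproducing both boundary conditions.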

Cao and Knight [28] carried out numerical testing and considered a specified range for λ (1, 5, 10, 50, 100). They stated that when λ tends toward zero, the cross-sectional bank shape is a parabolic curve. Consequently, this multiplier was deleted from their equation. The following equation was presented with numerical justification for bank profile shape estimation:

$$y\* = \left(\frac{\mu^2}{4}\right)x\*^2,\tag{5}$$

where *x\** = *x*/*hc* is the dimensionless lateral distance from the channel centerline and *y\** = *y*/*hc* is the dimensionless vertical boundary level. The Lagrange multiplier is a key component of the maximum entropy principle. In the following, Gholami et al. [54] presented an equation based on the maximum entropy principle to caculate λ numerically [54] which is explained in summary in the following. Accordingly, by using the Lagrange Multiplier Method (LMM) and variable calculation technique [39,60,61], the equation below is obtained for *p*(*St*):

$$p(S\_t) = \exp\left(\lambda\_1 + \lambda S\_t - 1\right). \tag{6}$$

Equation (6) is used with the first constraint (Equation (3)) to obtain the following equation:

$$e^{\lambda_1 - 1} = \lambda\left(e^{\lambda\mu} - 1\right)^{-1}, \tag{7}$$

where λ<sub>1</sub> is a Lagrange multiplier, equal to λ<sub>1</sub> = ln[λ/(*e*<sup>λμ</sup> − 1)] + 1.

Furthermore, by substituting Equations (6) and (7) into the second constraint condition (Equation (4)), the following equation is obtained to calculate λ:

$$\overline{S\_t} = \frac{\mu e^{\lambda \mu}}{(e^{\lambda \mu} - 1)} - \frac{1}{\lambda}. \tag{8}$$

On the other hand, by dividing both sides of Equation (8) by μ, the following equation is obtained:

$$\frac{\overline{S_t}}{\mu} = \delta = \frac{e^{K}}{e^{K} - 1} - \frac{1}{K}, \tag{9}$$

where *K* is a dimensionless parameter known as the entropy parameter, equal to *K* = λμ, which measures the uniformity of the probability distribution of *St*, and δ is the ratio of the mean *St* to *St* <sup>+</sup> (=μ). In the present study, when the values of *hc*, *L*, and *St* <sup>+</sup> (=μ) are known, the mean *St* value along the banks is obtained by assuming a uniform distribution of *St*, i.e., as the *hc*/*L* ratio. Therefore, λ is obtained by numerically solving Equation (8). Then, the *St* distribution of stable banks can be computed according to Equation (2). The physical justification of the λ multiplier and the effect of different hydraulic and geometric parameters on it are investigated in Gholami et al. [10]. On the other hand, the *St* at each point on the channel banks is formulated as *St* = *dy*/*dx*, where *y* is the vertical boundary level of the points. Integrating this relation, the bank profile shape equation for threshold channels becomes Equation (10), where the integral constant (*C*) is obtained by applying the boundary condition at the channel centerline (*x* and *y* = 0).

$$y = \frac{1}{\lambda}\left[\left(x + \frac{L}{e^{\lambda\mu} - 1}\right) \ln\left(1 + \left(e^{\lambda\mu} - 1\right)\frac{x}{L}\right) - x\right]. \tag{10}$$

This is the bank profile shape equation based on the developed entropy model, which is derived in detail in Gholami et al. [54]. If the channel dimensions (*B* and *hc*) are not specified, it is not possible to estimate λ and hence the *St* and *y* values. Therefore, the next section presents a numerical model for cases in which the channel dimensions are not specified and only *Q* and *d*<sup>50</sup> are known from the problem conditions.
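The numerical procedure just outlined (solve Equation (8) for λ given the mean slope *hc*/*L*, then evaluate Equation (10)) can be sketched as follows. The bisection root finder and the function names are our own choices for illustration; the paper solves Equation (8) numerically but does not prescribe a particular solver.

```python
import math

def mean_slope(lam, mu):
    """Right-hand side of Eq. (8): mean transverse slope implied by lam."""
    if abs(lam) < 1e-9:                 # removable singularity at lam = 0
        return mu / 2.0                 # uniform-distribution limit
    k = lam * mu
    return mu * math.exp(k) / (math.exp(k) - 1.0) - 1.0 / lam

def solve_lambda(st_mean, mu, lo=-200.0, hi=200.0, tol=1e-12):
    """Bisect Eq. (8) for lam; mean_slope is monotone increasing in lam."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_slope(mid, mu) < st_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def bank_elevation(x, lam, mu, L):
    """Vertical boundary level y at lateral distance x (Eq. 10)."""
    a = math.exp(lam * mu) - 1.0
    return (1.0 / lam) * ((x + L / a) * math.log(1.0 + a * x / L) - x)
```

Because the mean slope in Equation (8) increases monotonically from 0 to μ as λ grows, a single bisection bracket suffices; with λ in hand, Equation (10) gives the whole bank profile point by point.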

#### *3.2. Calculating* μ

The μ value can be calculated as μ = tan ϕ, where ϕ is the angle of repose of the sediment. Since the value of μ changes with the sand size and roughness [5,62], the following relationship between ϕ and the sediment size (*d*50) is used in the current study to compute ϕ for uniform sediments [10,27,54]:

$$\varphi = \begin{bmatrix} 0.302(\log d\_{50})^5 + 0.126(\log d\_{50})^4 - 1.811(\log d\_{50})^3 \\ -0.57(\log d\_{50})^2 + 5.952(\log d\_{50}) + 37.52 \end{bmatrix} \tag{11}$$

where ϕ is in degrees and *d*<sup>50</sup> is in centimeters.
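Equation (11) and the relation μ = tan ϕ can be implemented directly; this is a minimal sketch with illustrative function names.

```python
import math

# Sketch of Eq. (11): angle of repose phi (degrees) as a fifth-degree
# polynomial in log10 of the median sediment size d50 (in centimeters),
# followed by mu = tan(phi).

def repose_angle_deg(d50_cm):
    """Angle of repose phi in degrees from d50 in centimeters (Eq. 11)."""
    t = math.log10(d50_cm)
    return (0.302 * t**5 + 0.126 * t**4 - 1.811 * t**3
            - 0.57 * t**2 + 5.952 * t + 37.52)

def coulomb_mu(d50_cm):
    """Submerged static Coulomb friction coefficient mu = tan(phi)."""
    return math.tan(math.radians(repose_angle_deg(d50_cm)))
```

For *d*50 = 1 cm the logarithm is zero, so only the constant term survives and ϕ = 37.52°, a convenient sanity check on the polynomial.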

#### *3.3. Entropy-Based Design Model of Threshold Channels (EDMTC)*

As stated in the previous section, by assuming a uniform distribution for *St*, the mean *St* value can be obtained from the *hc*/*L* ratio when the values of *hc* and *L* (=*B*/2) are known. Accordingly, if the *hc* and *B* values are not known, it is not possible to calculate *St*. In this section, an explicit relationship is provided to calculate the *St* value for cases in which the channel dimensions (*hc*, *B*) are not available.

In this way, using several series of available observational data with different hydraulic conditions, the *Q*, *d*<sup>50</sup> and μ values are determined, and a relationship for the *St* value based on these parameters is derived, which can then be applied to any other data for which the channel dimensions are not specified. Accordingly, taking *Q*, *d*<sup>50</sup> and μ as input parameters and *St* as the output parameter, a numerical GEP model (Figure 2) [32,63,64] provides a relationship for predicting *St* in the form of Equation (12):

$$\begin{aligned} \overline{S_t} &= G_1 + G_2 + G_3,\\ G_1 &= \exp\Big\{-\big[\mu^2 - 2\mu + \ln(\mu + 4.433)\big] + \exp\big(-(Q+\mu)^2\big) + \exp\big(-(0.936 + d_{50})^2\big)\Big\},\\ G_2 &= \exp\Big[-\big((17.693 - 1.565Q) + (1/d_{50}) + \mu + 1.565 - \mu Q\big)^2\Big],\\ G_3 &= \exp\Big[-\big(\big(1.112\,Q\,d_{50} - \ln(6.5Q)\big)/\mu + \mu\big)^2\Big]. \end{aligned} \tag{12}$$

**Figure 2.** Flowchart of the proposed Entropy-based Design Model of Threshold Channels (EDMTC) computational procedure for designing the dimensions and shape of threshold channels in the present study.

In fact, with the input parameters *Q*, *d*<sup>50</sup> and μ (=*St* <sup>+</sup>), the value of *St* is calculated using Equation (12). By knowing the *St* value for a channel whose stable dimensions are not specified, the width and depth of the channel after stabilization, in addition to the bank profile shape, can be determined. To do this, *St* can also be calculated using the equations presented by former researchers who applied analytical and theoretical frameworks to derive these relationships. As stated, the polynomial shape proposed by some researchers is more acceptable than the earlier classic cosine, parabolic, and exponential forms [23]. Therefore, in the present study, the polynomial function provided by Vigilar and Diplas [11] is used to estimate the bank profile shape of stable channels as follows [11]:

$$y^* = 1 - a_3 x^{*3} - a_2 x^{*2} - a_1 x^* - a_0. \tag{13}$$

Coefficients *a*0, *a*1, *a*<sup>2</sup> and *a*<sup>3</sup> depend on the values of δ*\*cr* and μ, which are obtained from Table 1 for each given dataset [11]. δ*\*cr* is the dimensionless critical shear stress depth (δ*\*cr* = δ*cr*/*hc*) at the critical condition of the sediments in the bank profile. Here, the shear stress depth is δ = τ/(ρ*gS*), where τ is the shear stress along the channel and *S* is the longitudinal slope of the water surface. The value of δ*\*cr* can be obtained from the (μ − δ*\*cr*) figures of Vigilar and Diplas [11].


**Table 1.** Coefficients in the bank profile shape equation related to Vigilar and Diplas [11] (Equation (13)) for different values of μ and δ*\*cr* [11].

Now, differentiating the above function (Equation (13)) with respect to *x\** yields the transverse slope function at different points in the channel:

$$S_t = \frac{dy^*}{dx^*} = -3a_3 x^{*2} - 2a_2 x^* - a_1. \tag{14}$$

Now, according to the mean value theorem for definite integrals, the mean slope of the bank profiles (*St*) is computed from the *y\** distribution (Equation (13)) over the transverse interval 0 ≤ *x*∗ ≤ 0.5*B*∗, as in Equations (15a–c):

$$\overline{S_t} = \frac{1}{0.5B^*} \int_0^{0.5B^*} y^*(x^*)\,dx^*, \tag{15a}$$

$$\overline{S_t} = \frac{2}{B^*}\left[1 - a_3\frac{B^{*3}}{8} - a_2\frac{B^{*2}}{4} - a_1\frac{B^*}{2} - a_0\right], \tag{15b}$$

$$\overline{S_t} = -a_3\frac{B^{*2}}{4} - a_2\frac{B^*}{2} - a_1 - \frac{2}{B^*}(a_0 - 1). \tag{15c}$$

Therefore, by obtaining the mean *St* value from Equation (12) with the input parameters *Q*, *d*<sup>50</sup> and μ (=*St* <sup>+</sup>), the *B\** value of the free water surface of the bank profile is obtained by solving Equation (15b) (or its expanded form, Equation (15c)). Accordingly, in this study, the EDMTC (Figure 2) is presented to predict the dimensions and shape of bank profiles using the entropy principle. The value of *x\** (lateral distance from the channel axis) is selected over a range of arbitrary values 0 ≤ *x*∗*<sup>i</sup>* ≤ 0.5*B* ∗ (= *L*). The *y\** values obtained by the entropy model facilitate plotting the bank shape profiles against the different *xi*. Figure 2 shows the flowchart of the GEP model and the model developed in the present study (EDMTC) to predict the shape and dimensions of threshold channels.
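The final EDMTC step, inverting Equation (15c) for the dimensionless width *B\**, can be sketched with a simple bracketing root finder. The function names and the solver choice are ours; the *a*0–*a*3 coefficients must come from Table 1, and the values used in any example are placeholders, not tabulated entries.

```python
# Sketch of the width-recovery step of the EDMTC: given the mean slope
# (from Eq. (12)) and the Vigilar-Diplas coefficients a0..a3 (Table 1),
# solve Eq. (15c) for the dimensionless free-surface width B*.

def mean_slope_15c(b_star, a0, a1, a2, a3):
    """Right-hand side of Eq. (15c) as a function of B*."""
    return (-a3 * b_star**2 / 4.0 - a2 * b_star / 2.0 - a1
            - (2.0 / b_star) * (a0 - 1.0))

def solve_b_star(st_mean, a0, a1, a2, a3, lo=0.1, hi=50.0, tol=1e-9):
    """Bisect for B* on [lo, hi]; assumes one sign change on the bracket."""
    f_lo = mean_slope_15c(lo, a0, a1, a2, a3) - st_mean
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = mean_slope_15c(mid, a0, a1, a2, a3) - st_mean
        if (f_lo > 0) == (f_mid > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

Since Equation (15c) is a cubic in *B\** once multiplied through by *B\**, a bracketing method is a safe choice as long as the physical root is isolated within the scanned interval.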

#### *3.4. Experimental Data*

The observational data series used in the present study were collected in previous investigations by Mikhailova et al. [65], Ikeda [20], Diplas [19], Babaeyan [7], Macky [66], Hassanzadeh et al. [67], and Khodashenas [68]. The hydraulic and geometric conditions of the data vary, with different ranges of *Q* and *d*<sup>50</sup> values in the channel as well as different geometric conditions of the laboratory flumes used with each data series. Furthermore, several tests were carried out at different discharge rates within each data series, and the channels passed through different conditions until reaching an equilibrium state. In each observational data series, in addition to the channel dimensions (*B* and *hc*), the coordinate data of the points on the stable bank profiles (*x, y*) were extracted for some discharge values as well. Moreover, all experiments were done in laboratory flumes with different aspect ratios (*B*/*hc* = α) in the range 4–30. In each test, the selected sediment sizes were somewhat coarse, so the corresponding proportional discharge in the channels would cause no movement of sediment particles. Hence, the stresses on the walls and the channel bed were, respectively, less than and greater than the critical stress, so that threshold channel conditions would govern. Table 2 summarizes the hydraulic and geometric conditions of the data used.


**Table 2.** Summary of experimental characteristics for the data used in the present study.

#### *3.5. Used Data in Modeling*

As stated in the previous section, in this paper, 12 observed runs (S1–S12) (according to Table 2) with different hydraulic and geometric characteristics are selected for training and testing the EDMTC model. The hydraulic and geometric conditions of the data series vary, so the range of *Q* and *d*<sup>50</sup> values in the channel, as well as the geometric conditions of the laboratory flumes used in each data series, are different. Furthermore, each of the seven available observational data series (Mikhailova et al. 1980; Ikeda 1981; Diplas 1990; Babaeyan 1996; Macky 1999; Hassanzadeh et al. 2014; and Khodashenas 2016) comprises several runs with different discharges; therefore, the stable bank shape formed in each observed run is different.


In fact, in this paper, external validation is performed. External validation means that, of the 12 data series (367 samples in total), some series are used for training and others are selected for testing the models. Accordingly, nine data series, S1, S2, S3, S7, S8, S9, S10, S11, and S12 (65% of all samples: 233 samples), are used for training the EDMTC model, and three data series, S4, S5, and S6 (35% of all samples: 134 samples), corresponding to Diplas' (1990), Babaeyan's (1996) and Macky's (1999) data, are selected for testing the EDMTC model. This kind of validation is acceptable because the proposed EDMTC model is trained and tested on data series with different hydraulic and geometric characteristics.
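The series-wise ("external") split described above, holding out whole data series rather than randomly sampled points, can be sketched as follows; the series labels and sample values in the example are illustrative, not the actual S1–S12 measurements.

```python
# Sketch of an external-validation split: entire data series are assigned
# to either the training set or the test set, so the test series come from
# experiments the model has never seen.

def split_by_series(samples, test_series):
    """Partition (series_id, sample) pairs into train/test by series id."""
    train = [s for sid, s in samples if sid not in test_series]
    test = [s for sid, s in samples if sid in test_series]
    return train, test
```

Keeping each series intact is what makes the validation "external": a random point-level split would leak profile points from the same run into both sets and overstate the model's generalization.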

#### *3.6. Evaluation of Model Efficiency*

In order to evaluate the methods presented in this study, several statistical indices are used: the determination coefficient (*R*2), Root Mean Squared Error (RMSE), Mean Absolute Relative Error (MARE), Mean Absolute Error (MAE), and Bias. These evaluation criteria are defined by Equations (16)–(20):

$$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - x_i)^2}{\sum_{i=1}^n (y_i - \overline{y})^2}, \tag{16}$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - y_i)^2}, \tag{17}$$

$$MARE = \frac{1}{n} \sum_{i=1}^{n} \frac{|x_i - y_i|}{x_i}, \tag{18}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |x_i - y_i|, \tag{19}$$

$$Bias = \frac{1}{n} \sum_{i=1}^{n} (x_i - y_i), \tag{20}$$

where *yi* and *xi* denote the estimated and observed values, *y* represents the mean of the modeled values, and *n* is the sample size. The closer the *R*<sup>2</sup> coefficient is to unity (1), the higher the agreement between the observed and predicted values. Likewise, the closer the MARE, RMSE, Bias, and MAE indices are to zero, the higher the estimation accuracy. Positive and negative Bias values imply model overestimation and underestimation, respectively [69–71]. Therefore, computing several evaluation criteria can better reveal model performance [72,73].
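The five criteria in Equations (16)–(20) translate directly into code; this is a minimal sketch (the function name and dictionary return format are our own choices), with `x` the observed and `y` the estimated values.

```python
import math

# Direct implementations of Eqs. (16)-(20). Note that, as in Eq. (16),
# the R^2 denominator uses deviations of the modeled values from their
# own mean, matching the definition given in the text.

def metrics(x, y):
    """Return R2, RMSE, MARE, MAE, and Bias for observed x, estimated y."""
    n = len(x)
    y_bar = sum(y) / n
    ss_res = sum((yi - xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "R2":   1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MARE": sum(abs(xi - yi) / xi for xi, yi in zip(x, y)) / n,
        "MAE":  sum(abs(xi - yi) for xi, yi in zip(x, y)) / n,
        "Bias": sum(xi - yi for xi, yi in zip(x, y)) / n,
    }
```

Note that MARE divides by the observed value, so it is undefined when any *xi* is zero, one reason several complementary indices are reported together.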

#### **4. Results**

In the first section, the ability of the entropy model to predict bank profile shapes is evaluated. In the second section, the EDMTC proposed in this study is examined in detail. Finally, the uncertainty of the proposed EDMTC is examined using different uncertainty indices.

#### *4.1. Entropy Model in Predicting Bank Profile Shapes*

In Figure 3, the vertical boundary level of stable channel banks is estimated by the developed entropy model based on the maximum entropy principle, first proposed in Gholami et al. [54]. The λ value is obtained by numerically solving Equation (8); accordingly, one λ value is obtained for each data series (each bank profile shape). In Equation (8), the mean *St* value is calculated by assuming a uniform distribution of *St*, i.e., the ratio *hc*/*L*. Using the obtained λ value, the *y* value is computed by the entropy method from Equation (10). The *y\** distribution obtained from Equation (10) for each *x\** value is drawn for each data series in Figure 3. Moreover, the results of Cao and Knight's [28] model (CKM) (Equation (5)) are extracted, and their proposed bank profile shape is drawn in Figure 3 to evaluate the entropy model's performance. Table 3 contains the different error indices for the entropy model and CKM. Figure 3 indicates that the entropy model conforms acceptably to the corresponding observational data series in predicting the vertical boundary level and hence estimates the bank profile shape with low error values. Over all data series, the entropy model estimates the governing bank profile shape trend with lower MARE and RMSE values (0.317 and 0.08) than CKM (0.981 and 0.363, respectively). Figure 3 also shows that for two data series, S1 and S2 (Mikhailova et al.'s [65] data), CKM has high error values in *y\** estimation, particularly in the area near the free water surface, where high MARE values in the 2–4 range are observed. In contrast, the proposed entropy model detects the bank profile shape trend with lower error values (MARE = 0.2 and 0.8 for the S1 and S2 datasets, respectively) than CKM (MARE = 1.95 and 3.95), which represents the significant superiority of the entropy model. The same pattern is repeated for the S2 and S3 data series: although CKM exhibits acceptable performance, the entropy model is more accurate, with lower error values, and coincides closely with the observed values (especially in the area near the surface). For the S6 field data series, although neither model performs well (with Bias values of −0.31 and 0.45 for the entropy model and CKM, respectively), the entropy model again has a lower error (MARE = 0.58) than CKM (MARE = 1.03). Furthermore, the high MARE value for CKM reflects its inability to estimate low *y\** values (in the vicinity of the channel bed), a problem that the entropy model largely resolves; the RMSE values of CKM and the entropy model (0.5 and 0.38, respectively) confirm the inefficiency of CKM in estimating low *y\** levels. With data series S7, the improvement of the entropy model over CKM, by about 60% and 85% in the MARE and RMSE values respectively, is observed clearly in Figure 3, as the entropy model conforms closely to the observational data with an *R*<sup>2</sup> value of 0.98. With Khodashenas' [68] data (S9–S12), the higher efficiency of the entropy model over CKM is evident from its lower MARE and RMSE values. Furthermore, the entropy model estimates the water surface widening at high *y\** values well, with low RMSE values and Bias values close to 0. Negative and positive Bias values represent underestimation and overestimation of the models, respectively. As seen in the Bias values, CKM has positive Bias values for most of the datasets and overestimates the *y\** values in comparison with the corresponding observed values. It can thus be said that the entropy model proposed in the present study based on the maximum entropy principle is more accurate in estimating the bank profile shape of stable channels than CKM, which suggests a parabolic curve (Equation (5)) for channel banks.
A notable point in this paper is the significant physical effect of the λ values on the accurate estimation of the intended variables, an effect that CKM does not capture. The λ values obtained by the entropy model in this study are gathered in Table 3, where it can be seen that this multiplier lies within a specified range of −2 to 2 for almost all data series (with one or two exceptions). Furthermore, the λ values are the same for different runs of one experiment.
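Equation (8) itself is not reproduced in this excerpt, so the constraint function below is only a placeholder; the sketch illustrates just the numerical solution step for λ, here with simple bisection bracketed over the reported λ range of roughly [−2, 2].

```python
import math

def solve_lambda(constraint, lo=-2.0, hi=2.0, tol=1e-10):
    """Bisection solve of constraint(lam) = 0 on [lo, hi].

    The text reports that the Lagrange multiplier obtained from
    Equation (8) falls in roughly [-2, 2] for almost all data series,
    so that range is used as the default bracket.
    """
    f_lo = constraint(lo)
    if f_lo * constraint(hi) > 0:
        raise ValueError("constraint must change sign on [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f_lo * constraint(mid) <= 0:
            hi = mid          # root lies in the lower half
        else:
            lo, f_lo = mid, constraint(mid)  # root lies in the upper half
    return 0.5 * (lo + hi)

# Placeholder constraint standing in for Equation (8); the real one
# involves the transverse slope St and the ratio hc/L.
lam = solve_lambda(lambda x: math.exp(x) - 2.0)
```

Any standard root finder (e.g., Brent's method) would serve equally well; bisection is shown only because it is self-contained and robust given a sign-changing bracket.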

**Figure 3.** *Cont*.

**Figure 3.** Bank profile shape predicted by developed entropy model and Cao and Knight's [28] model (CKM) for different observational data series (S1–S12).

**Table 3.** Assessment of the efficiency of the developed entropy model (DEM) and CKM against different observational data series, according to different error indices, together with the λ values related to the DEM in this paper.


*4.2. Presenting the Entropy-Based Design Model of Threshold Channels (EDMTC)*

In the previous sections, the entropy model was evaluated for its ability to predict bank profile shapes when the depth and the width of the free water surface in the channel are known. In this study, the EDMTC, which is based on the relationship between the entropy parameter and the *St* of the channel banks and predicts the channel dimensions as well as the bank profile shape, is presented and explained in detail in Section 3.2 and Figure 2. The proposed EDMTC is evaluated in the first subsection below, and the model's uncertainty is examined in the second.

#### Evaluation of EDMTC Performance

Figure 4 displays scatter plots of the EDMTC proposed in this study for several observational data series. The left side of the figure contains the regression plots of the *y\** values predicted by EDMTC against the corresponding observational values. The right side of the figure shows the cross-sectional profile shapes predicted by EDMTC compared with the profile shapes obtained from the observational values. Table 4 lists the error indices of EDMTC relative to the corresponding observational values. The scatter plots indicate that EDMTC can predict the vertical elevation of stable channel banks very accurately, as most of the data are concentrated around the trend line and only slight scatter is observed for some of the datasets. In Figure 4, a trend line *y* = *ax* + *b* is fitted to the data; values of *a* close to 1 and of *b* close to 0 represent acceptable model prediction performance. For all datasets, the predicted values are concentrated around this line and the values of *a* and *b* are close to 1 and 0, respectively, which indicates the high efficiency of the proposed EDMTC in predicting the vertical elevation of channel banks. Moreover, the *R*<sup>2</sup> value in this figure is higher than 0.95 for all observational data series, indicating high EDMTC prediction accuracy. The value of this index is very close to 1 for some of the observational data [20,21,68], signifying very high model conformity to the corresponding observational values. Furthermore, according to the diagrams on the right side of Figure 4, EDMTC is able to accurately estimate the bank profile shape trend for all data series.
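The trend-line diagnostics described above amount to an ordinary least-squares fit of predicted against observed values plus the coefficient of determination. A minimal sketch (the observed/predicted pairs are hypothetical, for illustration only):

```python
def fit_trend(x, y):
    """Least-squares trend line y = a*x + b and coefficient of determination R^2.

    For Figure 4's scatter plots, x would be the observed y* values and
    y the EDMTC predictions; a near 1, b near 0, and R^2 near 1 indicate
    good agreement.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx                 # slope
    b = my - a * mx               # intercept
    ss_res = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot

# Hypothetical observed/predicted y* pairs for illustration
obs = [0.1, 0.3, 0.5, 0.7, 0.9]
pred = [0.12, 0.29, 0.52, 0.70, 0.93]
a, b, r2 = fit_trend(obs, pred)
```

Note that *R*<sup>2</sup> here is computed for the fitted trend line; a slope far from 1 with a high *R*<sup>2</sup> would still signal a systematic bias, which is why *a* and *b* are reported alongside *R*<sup>2</sup>.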
Although some differences between the *y\** values predicted by the model and the observational values are seen, it is notable that EDMTC models both the vertical bank elevation (from the channel center on the bed to the free water surface margins) and the water surface widening near the surface in close agreement with the corresponding observational values. The error index values in Table 4 corroborate this: the MARE values for all datasets lie between 0.3 and 0.5, indicating the accuracy of the proposed EDMTC in predicting the vertical elevation of the banks as well as the free water surface width in stable channels. An important point is that the proposed EDMTC predicts the profile shape trend successfully and can therefore be used to design the width and depth (dimensions) of stable channels when only flow inputs such as *Q*, *d*<sup>50</sup>, and μ are known. The high accuracy of this model is confirmed, and achieving such a model with so few parameters to predict the dimensions and cross-sectional bank shapes formed in stable channels is of considerable importance. Furthermore, EDMTC not only considers the geometric conditions of the channel cross sections but also incorporates the hydraulic conditions of the problem (through Vigilar and Diplas' [11] equation), which is one of the notable features of this model. For most observational data series, the estimated channel width is very similar to the observational values (in some cases slightly less). For example, for the EDMTC profile predictions based on the observational data from Diplas [21], Babaeyan [7], and Hassanzadeh et al. [67], the water surface width is estimated very close to the observed values. For most observational datasets, the proposed model estimates somewhat greater values for the vertical elevation at the water surface, although the estimated profile trend fits the observational values closely.
The partial errors of EDMTC, seen mostly near the channel bed and the free water surface for some of the datasets, can be attributed to measurement errors in the observational data [74]. For some data, e.g., Hassanzadeh et al. [67] and Khodashenas [68], this error appears at the channel bed. Additionally, Figure 4 shows that EDMTC based on Khodashenas' [68] data estimates lower *y\** than the actual values, which results in a negative Bias and a 14% increase in absolute error (MAE) according to Table 4 (MAE represents the absolute magnitude of the difference between the observational values and the model's predictions). It is worth noting that EDMTC can estimate a more plausible shape than the profile derived from the corresponding observational values, which has a uniform distribution from the bed to the water surface. With the rest of the data series, EDMTC estimates *y\** values roughly equal to, or slightly higher than, the observational values, with RMSE values of about 0.09–0.13, which is acceptable. Therefore, EDMTC, with low average error values (MARE = 0.55 and MAE = 0.19), is generally highly accurate in predicting bank profiles and stable channel dimensions.

**Table 4.** Evaluation of the EDMTC proposed in the present study in estimating the dimensions of stable channels in comparison with several available observational data series.


**Figure 4.** *Cont*.

**Figure 4.** Comparison of values predicted for the vertical boundary level of stable channels by the EDMTC proposed in the present study using scatter plots (left side) and cross-sectional profile shapes (right side) for different observational data: (**a**) Ikeda [20]-S3, (**b**) Babaeyan [7]-S5, (**c**) Diplas [21]-S4, (**d**) Hassanzadeh et al. [67]-S7, (**e**) Khodashenas [68]-S9, and (**f**) Khodashenas [68]-S12.

#### *4.3. Uncertainty Analysis of the Proposed EDMTC and GEP Model*

In this section, the uncertainty of EDMTC in predicting the bank profile shape based on the entropy model, and also of the GEP model in predicting the *St* of the banks (according to Equation (12)), is examined; the uncertainty indices are shown in Table 5. With the Uncertainty Wilson Score Method (UWSM) [10,19,75–79], the error between the *St* values predicted by the GEP model (and the *y\** values predicted by EDMTC) and the corresponding observed values is calculated. The error between the estimated and observed values (*ei*), the Mean Prediction Error (MPE or *e*), and the standard deviation of the errors (*Sd*) are obtained from Equations (21)–(23):

$$e\_i = x\_i - y\_i.\tag{21}$$

$$MPE = \overline{e} = \frac{1}{n} \sum\_{i=1}^{n} e\_{i}, \tag{22}$$

$$S\_d = \sqrt{\sum\_{i=1}^{n} \frac{\left(e\_i - \overline{e}\right)^2}{n-1}}.\tag{23}$$

where *n* is the sample size. With these indices, the *WUB* is calculated with Equation (24):

$$WUB = \frac{1}{n^{0.5}} \left( I\_{lt}\, S\_d \right),\tag{24}$$

where *Ilt* is the left-tailed inverse of the error distribution, evaluated at a given error probability for the number of degrees of freedom that characterize the distribution [76,80]. In the present paper, an error probability of 0.05 (a 95% Confidence Bound (CB)) with *n* − 1 degrees of freedom is used in calculating the *Ilt* value [80]. The *WUB* defines the upper and lower uncertainty bounds of the CB, the Upper Bound (*UB*) and Lower Bound (*LB*), calculated as *e* ± *WUB*; the CB is therefore centered on the mean error value. Furthermore, *dx* represents the average width of the CB, calculated from Equation (25). A smaller average CB width, associated with lower *Sd* and *WUB* values, indicates higher model certainty.

$$\overline{d}\_x = \frac{1}{n} \sum\_{i=1}^{n} \left( UB - LB \right),\tag{25}$$
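Equations (21)–(25) can be assembled into one small routine. A sketch under stated assumptions: the paper's *Ilt* is the left-tailed inverse of the error (Student's t) distribution with *n* − 1 degrees of freedom, but for a dependency-free example the standard library's normal quantile is substituted, which is adequate only for moderate to large *n* (swap in `scipy.stats.t.ppf(1 - alpha / 2, n - 1)` for the exact t quantile). The error values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def uwsm_indices(errors, alpha=0.05):
    """Uncertainty indices of Equations (21)-(25).

    errors: e_i = estimated - observed for each sample.
    Returns MPE, Sd, WUB, the (LB, UB) confidence bound, its width,
    and the fraction of errors bracketed by the bound.
    NOTE: the normal quantile below approximates the t quantile I_lt;
    it is an assumption of this sketch, not the paper's exact choice.
    """
    n = len(errors)
    mpe = sum(errors) / n                                     # Eq. (22)
    sd = sqrt(sum((e - mpe) ** 2 for e in errors) / (n - 1))  # Eq. (23)
    i_lt = NormalDist().inv_cdf(1 - alpha / 2)                # ~ I_lt for large n
    wub = i_lt * sd / sqrt(n)                                 # Eq. (24)
    lb, ub = mpe - wub, mpe + wub                             # CB = MPE +/- WUB
    width = ub - lb                                           # Eq. (25)
    coverage = sum(lb <= e <= ub for e in errors) / n         # fraction inside CB
    return mpe, sd, wub, (lb, ub), width, coverage

# Hypothetical e_i values for illustration
mpe, sd, wub, (lb, ub), width, cov = uwsm_indices([0.1, -0.05, 0.02, 0.07, -0.03])
```

A narrower CB width together with a high bracketed fraction corresponds to the "ideal certainty" condition discussed below.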


**Table 5.** Uncertainty analysis for the Gene Expression Programming (GEP) model in *St* prediction according to Equation (12) and EDMTC.

The ideal certainty analysis is achieved when most of the estimated values are bracketed within the CB and the CB width is narrowest.

Table 5 shows the MPE, CB, *dx*, and *WUB* for the *St* predictions of the GEP model, as well as the values of these indices for EDMTC. Figure 5 displays the CB calculated using the MPE for several observational data series (S3, S4, S5, S9, S10, and S12). According to Table 5, for all datasets the low values of *dx* (0.14) and *WUB* (±0.04), together with the low MPE (−0.14), represent the low uncertainty and high precision of the proposed EDMTC in predicting *y\** values. For almost every observational data series, 95% of the predicted and observed values fall within the CB range, alongside a narrow *WUB*, which demonstrates the acceptable accuracy of the proposed models in predicting the vertical boundary elevation of stable channel profiles. According to Table 5, for the S3 [20] and S5 [7] data, almost all of the *y\** values predicted by EDMTC lie on one side of the CB: for these series, the model's stronger under- and overestimation produces relatively high *WUB* values, and since the CB is calculated from the mean error, predictions above and below the observed values end up on one side of the bound. For the rest of the data, more than 95% of the values fall within this bound. The *WUB* is low for EDMTC in all tests, and for the GEP model the *WUB* is 0.01. The low *WUB*, together with the low *dx* values, indicates the high certainty and precision of EDMTC for S3 [20], S4 [21], and S5 [7], while for S12 [68] the low *WUB* is associated with high *Sd* values. The low *Sd* and *WUB* values of the GEP model represent high precision (low MPE) and high certainty simultaneously. Based on these results, it can be said that the proposed EDMTC and GEP models have great certainty, and their ability to predict stable channel dimensions and bank profiles with high accuracy is assured.
Therefore, the models proposed in this study can be used to predict channel dimensions in cases where little channel information is given. Moreover, the proposed model is capable of predicting the profile shape of stable channel banks when observational data for the bank profile shape are not available.

**Figure 5.** *Cont*.

**Figure 5.** CB (95%) ranges for the observational values and values predicted by EDMTC for the vertical boundary elevation of stable channels based on different datasets: (**a**) Ikeda [20] (S3), (**b**) Diplas [21] (S4), (**c**) Babaeyan [7] (S5), (**d**) Khodashenas [68] (S9), (**e**) Khodashenas [68] (S10), and (**f**) Khodashenas [68] (S11).

Finally, the proposed EDMTC can be used to determine the maximum value of *y\** as the maximum dimensionless depth at the channel center and the predicted free surface width. In this case, the channel dimensions can be obtained using the proposed model.

#### **5. Conclusions**

In the present study, the maximum entropy principle was employed to derive an equation for calculating the Lagrange multipliers. Accordingly, an equation was developed to predict the bank profile shape of threshold channels. The relation between the (δ) ratio, the entropy parameter (*K*), and the hydraulic and geometric characteristics of channels was evaluated. Next, the EDMTC computational model for estimating the shape of bank profiles and the channel dimensions (*B* and *hc*) was designed based on the maximum entropy principle in combination with the GEP regression model, for cases when only *Q* and *d*<sup>50</sup> are known as problem conditions. The results indicate that the entropy model is capable of predicting the bank profile shape trend with acceptable error values (MARE = 0.317, RMSE = 0.09) according to the experimental data, in comparison with Cao and Knight's [28] model (MARE = 0.981, RMSE = 0.363). The λ multiplier thus has a significant role in determining the transverse slope and consequently the vertical elevation of the banks, and the physical meaning of λ is associated with the hydraulic parameters governing the problem. The EDMTC proposed in this study, with *R*<sup>2</sup> greater than 0.95 and MAE in the 0.076–0.436 range for different observational data series, is able to predict the bank profile shape trend as well as the free water surface level in threshold channels. In addition, the uncertainty analysis of EDMTC demonstrated that more than 95% of the predicted and observed data fall within the CB with low *WUB*, so the model's reliability is largely assured. The EDMTC computational model presented in this paper can be widely used to predict stable channel profiles when the given problem information includes only *Q* and *d*50. This study was developed on the Shannon entropy concept; it is suggested that the results be extended using other generalized entropies.
It is further recommended that equations provided by other researchers be used to estimate the free surface width of channels. Regression and AI models based on more field data should also be used to estimate the mean transverse slope of the banks, and other entropy types should be examined to test the accuracy of the model presented in this study.

**Author Contributions:** Conceptualization, H.B. and A.G.; methodology, H.B., A.G., and I.E.; software, A.G. and I.E.; validation, H.B., A.G., and A.M.; formal analysis, H.B., A.G., and A.K.-K.-K.; investigation, H.B. and A.G.; resources, H.B., A.G., and A.H.A.; data curation, H.B., A.G., and I.E.; writing—original draft preparation, H.B. and A.G.; writing—review and editing, H.B., A.G., A.K.-K.-K., A.H.A., and A.M.; visualization, H.B. and A.G.; supervision, H.B.; project administration, H.B.; funding acquisition, A.M. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no funding.

**Acknowledgments:** This research in part is supported by the Hungarian State and the European Union under the EFOP-3.6.2-16-2017-00016 project. Support of European Union, the new Szechenyi plan, European Social Fund and the Alexander von Humboldt Foundation are also acknowledged.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*Entropy* Editorial Office E-mail: entropy@mdpi.com www.mdpi.com/journal/entropy
