1. Introduction
Entropy serves as a pivotal metric for quantifying the inherent uncertainty within a system, with applications across a multitude of domains, including dimension reduction [1,2], parameter estimation [3,4], identification [5,6], feedback control [7,8], and anomaly detection [9,10,11]. The minimum error entropy (MEE) criterion is a notable framework that minimizes the entropy of the estimation errors, thereby mitigating uncertainty within the estimation model. In linear regression estimation problems in particular, MEE diverges from conventional least-squares methods by considering not only the variance of the prediction errors but also their higher-order cumulants, rendering it adept at handling non-Gaussian noise distributions [12,13]. Given the prevalence of non-Gaussian noise in real-world scenarios, the efficacy of MEE has been validated across various applications, including adaptive filtering [14,15,16], face recognition [17,18], sparse system identification [19,20,21], stochastic control systems [22,23], and visible light communication [24].
Furthermore, the convergence properties of MEE have been examined in the prior literature [25,26,27]. In-depth analyses have addressed the convergence dynamics of fixed-point MEE methods [28], while the interplay between MEE and minimum mean-squared error (MMSE) estimation [29], as well as its relationship with maximum correntropy [30], has been explored extensively. Noteworthy extensions of the conventional MEE framework include kernel MEE variants [31], regularization strategies for MEE implementations [32], and semi-supervised and distributed adaptations [33]. Moreover, within the errors-in-variables (EIV) modeling realm, robustness analyses of MEE have been presented [34], alongside novel methods such as the minimum total error entropy approach [35].
The computational demands of the MEE criterion stem from its time complexity, which scales quadratically, i.e., O(N²) in the number of samples N. Central to MEE's computational load is the double summation required to compute the gradient of the error probability density function (PDF), particularly when Parzen windowing [36] and certain types of kernel functions are employed. Moreover, an inappropriate choice of kernel function or kernel parameters can degrade the accuracy of the resulting PDF. To address this problem, efforts have been made to optimize the gradient-descent MEE approach, such as normalizing the step size by the power of the input entropy [37]. Additionally, the quantization operator proposed in [38] has been instrumental in mitigating computational complexity: by mapping the errors onto a set of real-valued code words, it effectively transforms the double summation into a single summation, thereby reducing the algorithm's computational overhead.
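The quantization idea of [38] can be sketched as follows. This is a minimal illustration, not the exact construction of that reference: the online codebook rule, the parameter names (`eps`, `sigma`), and the Gaussian kernel are assumptions.

```python
import numpy as np

def gaussian_kernel(u, sigma):
    # Unnormalized Gaussian kernel; the normalization cancels in comparisons.
    return np.exp(-u**2 / (2.0 * sigma**2))

def quantize(errors, eps):
    """Map each error to a nearby code word: an error starts a new code
    word only if it is farther than eps from every existing one."""
    codebook, counts = [], []
    for e in errors:
        if codebook:
            d = np.abs(np.array(codebook) - e)
            k = int(d.argmin())
            if d[k] <= eps:
                counts[k] += 1      # reuse the nearest code word
                continue
        codebook.append(float(e))   # open a new code word
        counts.append(1)
    return np.array(codebook), np.array(counts)

def information_potential_quantized(errors, sigma, eps):
    """O(N*M) information potential (M = codebook size), replacing the
    O(N^2) double sum over all error pairs used by plain MEE."""
    c, m = quantize(errors, eps)
    n = len(errors)
    total = sum(m[k] * gaussian_kernel(errors - c[k], sigma).sum()
                for k in range(len(c)))
    return total / n**2
```

With `eps = 0` every distinct error is its own code word and the quantized value coincides with the full double sum; a larger `eps` trades a small bias for a much smaller codebook.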
This paper presents the fast minimum error entropy estimation (FMEE) algorithm, a novel approach tailored to linear regression tasks that applies a polynomial expansion to the error probability density function (PDF). In contrast to the traditional minimum error entropy (MEE) methodology, which employs Parzen windowing to derive the error PDF, FMEE approximates the error PDF with the Gram–Charlier expansion, reducing the time complexity from O(N²) to O(N). Notably, FMEE obviates the need for kernel functions and their associated parameters. The proposed algorithm proceeds in several steps: first, the PDF of a random variable is approximated via the Gram–Charlier expansion; next, the entropy of the random variable is simplified using the orthogonality of the Hermite polynomials; the error of the linear regression model is then expressed, with the error entropy written as a function of the regression coefficient vector; finally, gradient descent is employed to find the regression coefficient vector that minimizes the error entropy. Experimental validation underscores the efficacy of FMEE: its time consumption is less than 1‰ of that of the MEE approach on identical problem settings, while only minimal discrepancies are observed between the accuracy of FMEE and MEE.
The rest of the paper is organized as follows: Section 2 introduces the related algorithms, Section 3 presents FMEE in detail, Section 4 contains the experiments, and Section 5 concludes.
3. Methodology
This section derives the relation between the differential entropy and the Gram–Charlier expansion of the PDF of the error in linear regression. Throughout, the PDF of the regression error is assumed to be close to the Gaussian distribution with the same mean and variance as the error.
The differential entropy [42] of a random variable x is shown in Equation (10):
H(x) = −∫ p(x) ln p(x) dx,      (10)
where p(·) is the PDF of x. Substituting Equation (9) into Equation (10), the result can be derived as Equation (11):
This integral is rather difficult to compute. However, if the PDF of x is near the normal density, as assumed, the coefficients of the third- and fourth-order Hermite terms in the expansion are very small. Then, the approximation in Equation (12) can be used, where o(·) denotes a higher-order infinitesimal. Equation (11) can then be transformed into Equation (13):
As the Hermite polynomials form an orthogonal system, Equation (13) can be simplified into Equation (14).
The simplification leading to Equation (14) proceeds as follows. First, it is necessary to show that Equations (15) and (16) hold. For Equation (15), the integrand is the product of an even function (the Gaussian density) and an odd function (an odd-degree Hermite term); the integrand is therefore odd, its integral over the real line vanishes, and Equation (15) holds.
For Equation (16), the corresponding substitution yields the three-term expression in Equation (19). The first term in Equation (19) can be calculated as Equation (20), the second term as Equation (21), and the last term as Equation (22). Combining Equations (20)–(22), Equation (16) holds.
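The orthogonality relations invoked above can be checked numerically. A minimal sketch with the probabilists' Hermite polynomials, for which E[He_m(Z) He_n(Z)] = n! when m = n and 0 otherwise (Z standard normal):

```python
import numpy as np
from numpy.polynomial import hermite_e
from math import factorial

rng = np.random.default_rng(0)
z = rng.normal(size=400_000)  # standard normal samples

def hermite_inner(m, n):
    """Monte-Carlo estimate of E[He_m(Z) He_n(Z)], the weighted inner
    product underlying the orthogonality used for Equations (15)-(16)."""
    hm = hermite_e.hermeval(z, [0.0] * m + [1.0])  # He_m(z)
    hn = hermite_e.hermeval(z, [0.0] * n + [1.0])  # He_n(z)
    return float(np.mean(hm * hn))
```

For example, `hermite_inner(3, 2)` is close to 0 while `hermite_inner(3, 3)` is close to 3! = 6.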
The second term of Equation (24) is 0. As x is assumed to be close to a normal random variable, the third- and fourth-order expansion coefficients are very small, so the last term of Equation (24) can also be omitted: it consists of third-order monomials in these coefficients, which are of higher order than the terms containing only second-order monomials. Taking Equations (15) and (16) into account, Equation (24) then reduces to the entropy approximation in Equation (25):
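This entropy approximation can be sketched in code. The coefficients below (1/12 and 1/48 applied to the sample skewness and excess kurtosis) are those of the classical Gram–Charlier entropy approximation; they are an assumption here, since the exact display of Equation (25) is not reproduced in this text. Every quantity is a single pass over the data, hence the O(N) cost:

```python
import numpy as np

def gram_charlier_entropy(x):
    """Entropy approximation for a near-Gaussian sample, using the
    classical Gram-Charlier result
        H(x) ~ 0.5*log(2*pi*e*var) - k3**2/12 - k4**2/48,
    where k3 and k4 are the sample skewness and excess kurtosis."""
    x = np.asarray(x, float)
    var = x.var()
    z = (x - x.mean()) / np.sqrt(var)   # zero-mean shift as in Equation (27)
    k3 = np.mean(z**3)                  # sample skewness
    k4 = np.mean(z**4) - 3.0            # sample excess kurtosis
    return 0.5 * np.log(2 * np.pi * np.e * var) - k3**2 / 12 - k4**2 / 48
```

For a Gaussian sample the correction terms vanish and the value matches the Gaussian entropy 0.5·log(2πe·var); for non-Gaussian data the approximation is strictly below it, reflecting that the Gaussian maximizes entropy for a given variance.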
For the linear regression model, the standard deviation of the error is given in Equation (26), where ȳ denotes the mean of y and the mean of the fitted term is denoted analogously.
To simplify the calculation, the error is shifted to a random variable with zero mean. Note that the transformation in Equation (27) changes neither the standard deviation nor the entropy of the error, since std(x + c) = std(x) and H(x + c) = H(x) when c is a constant. In the rest of the paper, for clarity, the shifted error is still denoted by the original symbol. Then, for linear regression, the entropy of the error can be expressed as Equation (28):
where the quantities involved are defined in Equations (29) and (30). To minimize the error entropy, its derivative with respect to w is calculated in Equation (31), with the remaining terms given in Equations (32)–(34). The optimal w for the minimum error entropy can then be obtained by the gradient-descent iteration scheme in Equation (35), where the step size is determined by the Armijo condition [43].
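The iteration of Equation (35) with Armijo backtracking can be sketched generically as follows. The objective `f` and gradient `grad` stand in for the error entropy of Equation (28) and its derivative from Equations (31)–(34), whose exact expressions are not reproduced here; the constants `c1`, `beta`, and `t0` are conventional line-search choices, not values from the paper.

```python
import numpy as np

def armijo_step(f, grad, w, c1=1e-4, beta=0.5, t0=1.0):
    """One gradient step with backtracking until the Armijo condition
    f(w - t*g) <= f(w) - c1 * t * ||g||^2 holds (cf. [43])."""
    g = grad(w)
    t, fw = t0, f(w)
    while f(w - t * g) > fw - c1 * t * g.dot(g):
        t *= beta                      # shrink the step until it passes
    return w - t * g

def minimize(f, grad, w0, max_iter=200, tol=1e-8):
    """Gradient descent in the style of Equation (35)."""
    w = np.asarray(w0, float)
    for _ in range(max_iter):
        w_new = armijo_step(f, grad, w)
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```

On a simple quadratic objective the scheme converges to the exact minimizer, which makes it easy to sanity-check before plugging in the entropy gradient.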
To compute the gradient for FMEE in every iteration, the quantities in Equations (26) and (29)–(34) are needed, and each of them can be evaluated in O(N) time. The computational complexity of FMEE is therefore O(N). As the time complexity of MEE is O(N²) due to the double summation operation, FMEE runs faster than MEE does.
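For contrast, the O(N²) double summation at the core of MEE can be sketched as follows, assuming the usual choice in the MEE literature of Renyi's quadratic entropy with a Gaussian Parzen window (the paper does not display its MEE baseline, so the kernel normalization here is an assumption):

```python
import numpy as np

def mee_entropy_parzen(e, h):
    """Renyi quadratic entropy of the errors via Parzen windowing.
    The pairwise difference matrix below is the double summation that
    makes plain MEE O(N^2) in both time and memory."""
    d = e[:, None] - e[None, :]                       # all N*N error pairs
    g = np.exp(-d**2 / (4.0 * h**2)) / (h * np.sqrt(4.0 * np.pi))
    return -np.log(g.mean())                          # -log information potential
```

For Gaussian errors with variance s², the estimate concentrates around 0.5·log(2π(2s² + 2h²)), i.e., a kernel-smoothed version of the true quadratic entropy; the point of FMEE is to avoid forming the N×N matrix at all.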
4. Experiments
In this section, a comprehensive array of experiments is undertaken, encompassing both numerical simulations and real-world scenarios. The numerical simulations serve to validate the efficacy of the FMEE approach and to compare the time efficiency of FMEE and MEE. In parallel, practical experiments are conducted to forecast power outages in a city in northwest China.
For the numerical simulations, a linear regression model with additive noise e is considered, with the regressors drawn from a normal distribution. Two distinct types of noise are examined. The first is Gaussian noise, while the second is generalized Gaussian noise, a heavy-tailed distribution whose use aligns with previous works such as [27,35]. Although FMEE is derived under the assumption of near-Gaussian noise, it is anticipated that FMEE remains effective in the presence of non-Gaussian, heavy-tailed noise. The scale of the noise density is a constant adjusted to maintain a variance of 1 for e. For MEE, the step size and the scale parameter of the Gaussian kernel, set to 10, mirror the settings adopted in [27]. Across the experiments, sample sizes range from 100 to 500, and each combination of sample size and noise type is repeated 100 times. The experimental results are detailed below.
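Unit-variance generalized Gaussian noise can be generated as sketched below. The Gamma-based sampler is a standard construction, but the shape parameter `beta` is illustrative: the paper's exact shape value is not reproduced here.

```python
import numpy as np
from math import gamma, sqrt

def generalized_gaussian(n, beta, rng):
    """Draw n samples from a zero-mean generalized Gaussian density
    p(e) proportional to exp(-|e/alpha|**beta), with the scale alpha
    chosen so that Var(e) = 1. Values beta < 2 give the heavy-tailed
    case used in the non-Gaussian simulations."""
    # Var = alpha**2 * Gamma(3/beta) / Gamma(1/beta)  =>  solve for alpha.
    alpha = sqrt(gamma(1.0 / beta) / gamma(3.0 / beta))
    # If E ~ generalized Gaussian, then (|E|/alpha)**beta ~ Gamma(1/beta, 1).
    g = rng.gamma(1.0 / beta, 1.0, size=n)
    return rng.choice([-1.0, 1.0], size=n) * alpha * g ** (1.0 / beta)
```

With `beta = 1` this reduces to a unit-variance Laplace distribution, a common heavy-tailed test case.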
Table 1 provides a comparative analysis of the time consumption of MEE and FMEE in the presence of Gaussian noise. FMEE demonstrates a notable acceleration over MEE: while the computational time of MEE grows quadratically with the sample size, FMEE's computational overhead scales linearly with the number of samples. Notably, FMEE processes 500 samples in approximately 0.1 s, positioning it as a highly promising candidate for time-sensitive applications.
Table 2 compares the mean squared error (MSE) of MEE and FMEE in the context of Gaussian noise. Overall, MEE outperforms FMEE in terms of MSE, its advantage being most evident when the sample size is 100. However, as the sample size increases, the discrepancy in MSE between the two algorithms diminishes; by the time the sample size reaches 500, the difference has become marginal. This observation suggests that while FMEE may lag slightly behind MEE, particularly on smaller datasets, the disparity is acceptable in practical applications, given FMEE's significantly reduced computational time.
Table 3 analyses the iteration counts required for convergence between MEE and FMEE in the context of Gaussian noise. Notably, FMEE achieves convergence with significantly fewer iterations compared to MEE. Furthermore, it is observed that the iteration count of MEE gradually decreases as the sample size increases. Conversely, the iteration count of FMEE remains relatively constant, irrespective of variations in sample size.
Table 4 provides a comparative assessment of the time consumption between MEE and FMEE in the context of non-Gaussian noise. Across all five experimental groups, FMEE consistently demonstrates a remarkable reduction in time consumption, consistently amounting to less than 1‰ of MEE’s time expenditure. Moreover, in accordance with theoretical predictions, the computational cost of MEE exhibits quadratic growth relative to the sample size, whereas FMEE’s time overhead scales linearly with the sample size.
Table 5 conducts a comparative evaluation of the mean squared error (MSE) of MEE and FMEE in the context of non-Gaussian noise. Generally, the disparity between the MSE values of FMEE and MEE is minor: FMEE outperforms MEE when the sample size is 100 or 400, while MEE performs better for sample sizes of 200, 300, and 500. Additionally, FMEE attains the optimal MSE at all but one of the sample sizes. Furthermore, the MSE values for non-Gaussian noise in Table 5 span a broader range than those for Gaussian noise, indicating greater variability in the results of both MEE and FMEE.
Table 6 compares the number of iterations required for convergence between MEE and FMEE in the context of non-Gaussian noise. FMEE consistently demonstrates faster convergence compared to MEE. Furthermore, it is observed that the iteration count of MEE decreases as the sample size increases. In contrast, the iteration count of FMEE remains relatively stable despite variations in sample size.
Based on the findings presented in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6, FMEE consistently yields outcomes comparable to MEE with significantly reduced time consumption and iteration counts. Notably, FMEE is approximately 1000 times faster than MEE, converging within approximately 0.1 s for 500 instances. Given its linear time complexity of O(N), FMEE holds considerable promise for deployment in real-time scenarios characterized by non-Gaussian noise.
Subsequently, we undertake practical experiments that utilize transformer characteristic data to forecast distribution network failures. The transformers are categorized into two groups based on their geographical location within distinct levels of urban activity in a specific city in northwest China. Initially, we focus on a subset of 1265 transformers situated in a smaller, less densely populated area of the city. This dataset is randomly partitioned into two segments: a training set comprising 80% of the data (1012 instances) and a verification set comprising the remaining 20% (253 instances). The transformer dataset contains various characteristic variables, including the standardized heavy overload duration, the maximum active load ratio, the average active load ratio, the mean three-phase unbalance, the standardized heavy three-phase unbalance duration, and so forth.
We conduct a comparative analysis pitting our proposed FMEE against three established baselines: logistic regression, a neural network, and a support vector machine. The characteristic variables of the 1265 transformers under substantial load undergo 30 rounds of random partitioning, and each randomly divided dataset is processed by the four algorithms. The predictive efficacy of the algorithms on the verification set is assessed with the F-measure and the error rate.
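The two verification metrics can be computed as sketched below; this is a plain illustration in which the failure class is encoded as 1, an assumption about the labeling.

```python
def f_measure_and_error_rate(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall on the failure
    class) and error rate, the two verification-set metrics used here."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f = 2 * precision * recall / denom if denom else 0.0
    err = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
    return f, err
```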
Figure 1 depicts the F-measure evaluations from 30 prediction runs of the four algorithms; in each run, the dataset partition is shared across the four algorithms. All algorithms yield F-measure values above 0.8, and our proposed FMEE exhibits the most promising performance, with an average F-measure of 0.904, outperforming the three baselines by a significant margin.
To facilitate a comprehensive comparison among the four algorithms, Figure 2 depicts the error rate and F-measure of the fault outage prediction results in the form of line and box charts. FMEE emerges as the frontrunner, with a substantial advantage in error rate even over the sophisticated support vector machine: its average error rate is the lowest of the four, surpassing the neural network, logistic regression, and support vector machine. These findings underscore the clear superiority of our proposed FMEE methodology.
Furthermore, we extend our experimentation to a more densely populated urban region within the northwest Chinese city. Figure 3 illustrates the F-measure evaluations derived from 30 independent tests of the four comparative algorithms. Overall, our proposed FMEE attains an F-measure of 0.891, surpassing the neural network, logistic regression, and support vector machine baselines by margins of 0.013, 0.023, and 0.028, respectively.
Finally, we report the error rate and F-measure of the fault outage prediction results of the four comparative algorithms, represented through line and box charts. As shown in Figure 4, our proposed FMEE achieves the lowest average error rate, showing notable superiority over the three baseline methodologies. These findings robustly underscore the efficacy of our algorithm in addressing regression problems.