*3.2. Complexity Features Based on Huffman-Multi-Scale Entropy (HMSE)*

Sample entropy (SampEn) is a time-series complexity measure proposed by Richman et al. [31]. It improves on approximate entropy; both quantify the complexity of a time series through the probability that new patterns appear when the embedding dimension changes. The greater this probability, the more complex the sequence and the higher the entropy value. Compared with other nonlinear dynamic methods such as the Lyapunov exponent, information entropy, and the correlation dimension, sample entropy requires only short data records, is robust to noise and interference, and shows good consistency over a wide range of parameters. It has therefore attracted the attention of many scholars and has been widely used in mechanical signal analysis and fault diagnosis in recent years.

#### 3.2.1. Traditional Multi-Scale Sample Entropy (MSE)

Suppose a time series of length *N* is $X = \{x\_1, x\_2, \cdots, x\_{N-1}, x\_N\}$. The sample entropy is calculated as follows:

Step 1: Construct *m*-dimensional vectors from the time series *X*:

$$X(i) = \{x\_i, x\_{i+1}, \cdots, x\_{i+m-1}\}, \ i = 1, 2, \cdots, N - m + 1\tag{9}$$

Step 2: Define the distance between *X*(*i*) and *X*(*j*) ($i \neq j$) as *d*[*X*(*i*), *X*(*j*)], the largest difference between the two corresponding elements:

$$d[X(i), X(j)] = \max\_{k \in [0, m-1]} \left| x(i+k) - x(j+k) \right|, \ (i \neq j)\tag{10}$$

Step 3: Given a threshold *r* > 0, count the number of vectors *X*(*j*) satisfying *d*[*X*(*i*), *X*(*j*)] < *r* and calculate its ratio to the total number of vectors *N* − *m*:

$$B\_i^m(r) = \frac{1}{N - m} num\{d[X(i), X(j)] < r\} \tag{11}$$

Step 4: Average the results of Equation (11) over all *i*, as in Equation (12):

$$B^{m}(r) = \frac{1}{N - m + 1} \sum\_{i=1}^{N-m+1} B\_i^{m}(r) \tag{12}$$

Step 5: Set *m* = *m* + 1 and repeat Steps 1–4 to obtain $B^{m+1}(r)$.

Step 6: Theoretically, the sample entropy of this sequence is:

$$SampEn(m,r) = \lim\_{N \to \infty} \left\{ -\ln(\frac{B^{m+1}(r)}{B^m(r)}) \right\} \tag{13}$$

In practice, however, *N* is finite, and the sample entropy is estimated as:

$$SampEn(m,r,N) = -\ln\left(\frac{B^{m+1}(r)}{B^m(r)}\right)\tag{14}$$

Sample entropy excludes the comparison of data segments with themselves, which improves calculation accuracy, saves computation time, and makes the result largely independent of the data length. In addition, sample entropy has better consistency: if one sequence has a higher SampEn than another, it retains the relatively higher SampEn value when the parameters *m* and *r* are changed. Its disadvantage, however, is that it does not consider the different time scales that may exist in a time series.
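To make Steps 1–6 concrete, the following is a minimal NumPy sketch of the SampEn estimate in Equations (9)–(14). It is an illustrative implementation rather than the authors' code: the function name `sample_entropy`, the default `r = 0.15 * std(X)`, and the strict `d < r` match test are assumptions consistent with the text.

```python
import numpy as np

def sample_entropy(x, m=2, r=None):
    """Estimate SampEn(m, r, N) following Equations (9)-(14).

    Self-matches (i == j) are excluded, as required by the text.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    if r is None:
        r = 0.15 * np.std(x)  # a common choice inside 0.1-0.2 std(X)

    def match_ratio(dim):
        # Equation (9): all (n - dim + 1) embedding vectors of length dim.
        vecs = np.array([x[i:i + dim] for i in range(n - dim + 1)])
        count, total = 0, 0
        for i in range(len(vecs)):
            # Equation (10): Chebyshev distance from vector i to all others.
            d = np.max(np.abs(vecs - vecs[i]), axis=1)
            d = np.delete(d, i)          # drop the self-match
            count += np.sum(d < r)       # Equation (11): matches within r
            total += len(d)
        return count / total             # Equation (12): average over i

    # Equation (14): finite-N estimate.
    return -np.log(match_ratio(m + 1) / match_ratio(m))
```

A call such as `sample_entropy(signal, m=2)` returns a single complexity value for the series; larger values indicate a less regular signal.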

To measure the complexity of a signal at different time scales, Costa et al. proposed multi-scale entropy [32], which extends sample entropy to multiple time scales and provides additional observation perspectives when the relevant time scale is uncertain. Like other entropy measures, multi-scale entropy aims to evaluate the complexity of a time series, and one of the main reasons for using it is that the relevant time scale of the series is not known in advance. For example, when analyzing a speech signal, measuring the complexity at the word time scale is more informative than measuring the complexity of the entire segment; in practice, however, we usually do not know how many words a speech segment contains, nor which time scale would extract the most useful information from the original signal. Analyzing the problem over multiple time scales therefore yields more effective information.

The basic principle of multi-scale entropy (MSE) is to coarse-grain or downsample the time series so that it can be analyzed at increasingly coarse time resolutions. Given a time series $X = \{x\_1, x\_2, \cdots, x\_{N-1}, x\_N\}$ of length *N* and a coarse-graining scale *s*, the original series is split into *i* consecutive non-overlapping segments, where *i* = *floor*(*N*/*s*) and *floor*(∗) denotes the largest integer not exceeding ∗. A new sequence is obtained by averaging each segment according to Equation (15). The MSE of the original sequence is then obtained by computing the sample entropy of the new sequence $Y = \{y\_1, y\_2, \cdots, y\_i\}$ for each scale *s*. The process of coarse-graining the time series is shown in Figure 3.

$$y\_i = \frac{\sum\_{k=1}^{s} x\_{(i-1)s+k}}{s} \tag{15}$$

**Figure 3.** The process of coarse-graining the time series.
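Under the same assumptions as the previous sketch, Equation (15) and the classic MSE pipeline might look as follows; `coarse_grain` and `multiscale_entropy` are hypothetical names, and `sample_entropy` is the function defined above.

```python
import numpy as np

def coarse_grain(x, s):
    """Equation (15): means of consecutive non-overlapping length-s segments."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // s                  # i = floor(N / s)
    return x[:n_seg * s].reshape(n_seg, s).mean(axis=1)

def multiscale_entropy(x, scales, m=2, r=None):
    """Classic MSE: SampEn of the coarse-grained series at every scale."""
    if r is None:
        r = 0.15 * np.std(x)             # keep r fixed across scales
    return [sample_entropy(coarse_grain(x, s), m=m, r=r) for s in scales]
```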

#### 3.2.2. The Huffman-Multi-Scale Entropy (HMSE)

According to the calculation process of multi-scale sample entropy, the core of the method is to coarse-grain the original time series at different time scales by averaging. Figure 4a shows a satellite momentum wheel voltage telemetry signal with a length of 10,000 samples. This signal is coarse-grained by averaging at time scales *s* = 10, *s* = 50, and *s* = 100, respectively; the results are shown in Figure 4b–d. As Figure 4 shows, the waveforms of the new signals obtained by averaging the original signal at different time scales are almost identical.

Figure 4 also shows that the state of the signal changes at about the 4000th sample point of the original signal, yet different coarse-graining scales fail to reflect this change. Therefore, this paper proposes an improved multi-scale entropy calculation method based on a Huffman mean model. Its main innovation is that when the original data is coarse-grained at different time scales, the Huffman mean, rather than the arithmetic mean, is taken. This section introduces the Huffman mean model and the improved multi-scale entropy calculation method based on it in detail.

(1) Huffman Coding.

In 1952, Huffman proposed an optimum method of coding an ensemble of messages consisting of a finite number of members [33]. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized. The process of Huffman coding is as follows.

Step 1: Given a sequence containing *n* kinds of symbols, let the set of symbol types be $S^0 = \{s\_1^0, s\_2^0, \cdots, s\_i^0, \cdots, s\_n^0\}$, $i = 1, 2, \cdots, n$, and let the probability of each symbol be $P^0 = \{p\_1^0, p\_2^0, \cdots, p\_i^0, \cdots, p\_n^0\}$, with $\sum\_{i=1}^{n} p\_i^0 = 1$.

Step 2: Let *t* be the iteration parameter, with initial value 0 and maximum value *n* − 1. The symbol sequence and the corresponding probabilities at the beginning of the *k*-th iteration are $S^{k-1}$ and $P^{k-1}$; at the end of the *k*-th iteration they are $S^k$ and $P^k$.

**Figure 4.** The average value of voltage telemetry under different scales: (**a**) original signal, (**b**) scale = 10, (**c**) scale = 50, (**d**) scale = 100.

Step 3: When $t = k$, arrange the symbols of $S^{k-1}$ in ascending order of probability $P^{k-1}$ as $S^{k-1} = \{s\_1^{k-1}, s\_2^{k-1}, \cdots, s\_i^{k-1}, \cdots, s\_{n-k+1}^{k-1}\}$; the probabilities are rearranged accordingly as $P^{k-1} = \{p\_1^{k-1}, p\_2^{k-1}, \cdots, p\_i^{k-1}, \cdots, p\_{n-k+1}^{k-1}\}$.

Step 4: Take the two symbols $s\_1^{k-1}$ and $s\_2^{k-1}$ with the smallest probabilities in the symbol sequence. Encode the symbol $s\_2^{k-1}$ with the higher probability as "1" and the symbol $s\_1^{k-1}$ with the lower probability as "0". Add the probabilities $p\_1^{k-1}$ and $p\_2^{k-1}$ of $s\_1^{k-1}$ and $s\_2^{k-1}$ to obtain the probability $p^\*$ of a new symbol $s^\*$.

Step 5: Delete $s\_1^{k-1}$ and $s\_2^{k-1}$ from $S^{k-1}$ and add $s^\*$; $S^{k-1}$ then becomes $S^k$, of size $n - k$. Likewise, delete $p\_1^{k-1}$ and $p\_2^{k-1}$ from $P^{k-1}$ and add $p^\*$; $P^{k-1}$ then becomes $P^k$, also of size $n - k$.

Step 6: Repeat Steps 3 to 5 until $t = n - 1$. The symbol sequence is then $S^{n-1} = \{s\_1^{n-1}\}$, with probability $P^{n-1} = \{p\_1^{n-1}\}$ and $p\_1^{n-1} = 1$.

The above coding process shows that a symbol with lower probability in the original signal receives a longer Huffman code, while a symbol with higher probability receives a shorter one. The complexity of the signal's probability distribution can therefore be described by its average Huffman code length. Based on the above coding process, the average Huffman code length is calculated as follows.

Backtrack from the final symbol $s\_1^{n-1}$, with probability $p\_1^{n-1} = 1$, to each source symbol, recording the 0/1 labels along the backtracking path. The Huffman code of $s\_i^0$ is $c\_i$, and $L(c\_i)$ is its length. The average Huffman code length $L^\*$ is calculated from the lengths $L(c\_i)$ and the corresponding probabilities $p\_i^0$ as in Equation (16).

$$L^\* = \sum\_{i=1}^n p\_i^0 \ast L(c\_i) \tag{16}$$

For a set of source symbols $S^0 = \{s\_1, s\_2, s\_3, s\_4, s\_5, s\_6\}$ with probabilities $P^0 = \{0.35, 0.28, 0.14, 0.13, 0.07, 0.03\}$, the Huffman coding process is shown in Table 2. The average Huffman code length of $S^0$ is calculated as 2.33:

$$L^\*(S^0) = (0.35 + 0.28 + 0.14) \ast 2 + 0.13 \ast 3 + (0.07 + 0.03) \ast 4 = 2.33$$


**Table 2.** An example of the Huffman coding process.
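The merge procedure of Steps 1–6 can be checked numerically. The sketch below is one possible implementation, not the paper's code: it uses a min-heap rather than re-sorting at every iteration and tracks only code lengths (not the codewords), which is all Equation (16) needs. Applied to the example above, it reproduces $L^\* = 2.33$; exact codewords may differ from Table 2 under probability ties, but the average length is unchanged.

```python
import heapq

def huffman_code_lengths(probs):
    """Code length L(c_i) of every source symbol via the merge process."""
    if len(probs) == 1:
        return [1]                          # a lone symbol still needs one bit
    # Heap entries: (probability, unique tie-breaker, source-symbol indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, ids1 = heapq.heappop(heap)   # two least probable symbols
        p2, _, ids2 = heapq.heappop(heap)   # (Step 4 of the text)
        for i in ids1 + ids2:
            lengths[i] += 1                 # every member gains one code bit
        heapq.heappush(heap, (p1 + p2, tie, ids1 + ids2))  # merged symbol s*
        tie += 1
    return lengths

# The worked example above: probabilities of s1..s6.
probs = [0.35, 0.28, 0.14, 0.13, 0.07, 0.03]
lengths = huffman_code_lengths(probs)               # -> [2, 2, 2, 3, 4, 4]
print(sum(p * l for p, l in zip(probs, lengths)))   # Equation (16): 2.33
```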

(2) Huffman Mean Model.

The basic principle of Huffman coding and the calculation of the average Huffman code length were introduced above. In this paper, a new Huffman mean model based on Huffman coding is proposed for the problem of satellite anomaly detection. For a sequence $T = \{t\_1, t\_2, \cdots, t\_i, \cdots, t\_n\}$, the Huffman mean model is expressed by Equation (17).

$$HM(T) = \begin{cases} T' = T/sum(T) \\ C = Huffman\\_coding(T') \\ \ell = L(C) \\ Huffman\\_mean = sum(T \ast (\ell/sum(\ell))) \end{cases} \tag{17}$$

where $HM(T)$ is the Huffman mean value of $T$; $T' = T/sum(T)$ converts the original time series into a probability series; $C = Huffman\\_coding(T')$ is the Huffman coding result of the probability sequence $T'$, where $C = \{c\_1, c\_2, \cdots, c\_i, \cdots, c\_n\}$, $i = 1, 2, \cdots, n$, and $c\_i$ is the Huffman code corresponding to $t\_i$ in the original sequence; $\ell = L(C)$ computes the length of each $c\_i$; and $Huffman\\_mean = sum(T \ast (\ell/sum(\ell)))$ is the Huffman mean of the sequence $T$, weighted by the Huffman code lengths.
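A direct transcription of Equation (17) might look as follows, reusing the hypothetical `huffman_code_lengths` helper from the previous sketch. It assumes the segment values are positive so that `T/sum(T)` is a valid probability sequence; telemetry containing negative values would need an offset or absolute value first, a detail the text does not specify.

```python
import numpy as np

def huffman_mean(t):
    """Equation (17): code-length-weighted Huffman mean of the segment t."""
    t = np.asarray(t, dtype=float)
    t_prime = t / t.sum()                            # T' = T / sum(T)
    ell = np.asarray(huffman_code_lengths(t_prime))  # l = L(Huffman_coding(T'))
    return float(np.sum(t * ell / ell.sum()))        # sum(T * (l / sum(l)))
```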

(3) The Improved Method of Huffman-multi-scale Entropy.

Figure 5 shows the calculation process of Huffman-multi-scale entropy. The inputs of both methods are the original signal $X = \{x\_1, x\_2, \cdots, x\_{N-1}, x\_N\}$, the scale sequence $Scale = \{s\_1, s\_2, \cdots, s\_p\}$, and the parameter set $\theta = \{m, r\}$, where usually $0.1\,std(X) < r < 0.2\,std(X)$. Compared with classic MSE, the Huffman-multi-scale entropy method proposed in this paper adopts a coarse-graining method based on the Huffman mean model.

**Figure 5.** The process of the improved method of Huffman-multi-scale Entropy.
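Putting the pieces together, a sketch of the HMSE pipeline in Figure 5 replaces the arithmetic-mean coarse-graining of Equation (15) with the Huffman mean and then applies SampEn at each scale; `huffman_mean` and `sample_entropy` are the hypothetical helpers defined earlier.

```python
import numpy as np

def huffman_coarse_grain(x, s):
    """Coarse-grain with the Huffman mean instead of the arithmetic mean."""
    x = np.asarray(x, dtype=float)
    n_seg = len(x) // s
    return np.array([huffman_mean(x[j * s:(j + 1) * s]) for j in range(n_seg)])

def hmse(x, scales, m=2, r=None):
    """Huffman-multi-scale entropy: SampEn of the Huffman-mean series."""
    if r is None:
        r = 0.15 * np.std(x)   # inside the suggested 0.1 std(X) < r < 0.2 std(X)
    return [sample_entropy(huffman_coarse_grain(x, s), m=m, r=r) for s in scales]

# e.g., hmse(telemetry, scales=[10, 50, 100]) mirrors the scales in Figure 6
```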

Figure 6 shows the same satellite momentum wheel voltage telemetry signal with a length of 10,000 samples, coarse-grained by the Huffman mean model at time scales *s* = 10, *s* = 50, and *s* = 100, respectively. Obviously, the coarse-graining method based on the Huffman mean model enhances the differences in signal changes across time scales.

**Figure 6.** The Huffman average code length of voltage telemetry under different scales: (**a**) original signal, (**b**) scale = 10, (**c**) scale = 50, (**d**) scale = 100.

#### **4. Anomaly Detection Method Based on Multi-Class SVM**

#### *4.1. Multi-Class SVM*

A support vector machine (SVM) is widely used in classification problems. The basic idea is to find a hyperplane that separates the positive and negative classes with the maximum margin: the points closest to the hyperplane should be as far from it as possible, and points sufficiently far from the plane can then be correctly classified.

This article uses the maximum-margin classifier to realize the two-class SVM and then uses the directed acyclic graph (DAG) method to build the multi-class SVM from two-class SVMs.

Let the dataset be $\{(x\_i, y\_i) \mid i = 1, 2, \cdots, N\}$, $x\_i \in R^n$, $y\_i \in \{-1, +1\}$, and let the hyperplane be $w^T x + b = 0$. The support vectors satisfy $w^T x + b = y$, so their distance to the hyperplane can be written as

$$\frac{|y(w^T\mathbf{x}+b)|}{||w||\_2} = \frac{1}{||w||\_2} \tag{18}$$

The SVM model keeps the points of each class on their respective sides of the support-vector boundaries, as far from the hyperplane as possible. Equation (18) shows that the margin is largest when $||w||\_2$ is smallest. Introducing the penalty parameter *λ* for misclassification and slack variables $\xi\_i$ that allow misclassification, the objective function becomes

$$\begin{array}{c} \min\_{w,b,\xi} \frac{1}{2}||w||\_2^2 + \lambda \sum\_{i=1}^N \xi\_i \\ \text{s.t.} \quad y\_i(w^T x\_i + b) \ge 1 - \xi\_i, \ i = 1, 2, \cdots, N \\ \xi\_i \ge 0, \ i = 1, 2, \cdots, N \end{array} \tag{19}$$

According to Lagrange duality, the optimization objective can be converted into an equivalent dual problem, and Equation (19) becomes:

$$\begin{aligned} \min\_{\alpha} \ & \frac{1}{2} \sum\_{i,j=1}^{N} y\_i y\_j \alpha\_i \alpha\_j K\langle x\_i, x\_j \rangle - \sum\_{j=1}^{N} \alpha\_j \\ \text{s.t.} \ & \sum\_{i=1}^{N} \alpha\_i y\_i = 0 \\ & 0 \le \alpha\_i \le \lambda, \ i = 1, 2, \cdots, N \end{aligned} \tag{20}$$

where $K\langle x\_i, x \rangle$ is the kernel function. The radial basis function is used as the kernel in this paper, $K\langle x\_i, x \rangle = \exp\left(-\frac{||x\_i - x||^2}{2\sigma^2}\right)$, where *σ* is the kernel function parameter. Then the decision function is

$$f(x) = \mathrm{sgn}(w^T x + b) = \mathrm{sgn}\left[\sum\_{i=1}^{N} y\_i \alpha\_i K\langle x\_i, x \rangle + b\right], \ 0 < \alpha\_i < \lambda \tag{21}$$

Among multi-class SVM methods, one approach solves the multi-class problem directly, but it has high time complexity, is difficult to implement, and is unsuitable for large amounts of data. The other combines multiple two-class SVM models into a multi-class model. In this paper, the directed acyclic graph method is used to construct the multi-class SVM.

The DAG method uses a "competition" rule: for *n* classes, the height of the decision tree is *n* − 1. Classes that are easy to distinguish are placed on the upper layers and classes that are difficult to distinguish on the lower layers. A schematic diagram of the DAG method for a five-class SVM is shown in Figure 7, and a sketch of the corresponding classifier follows the figure.

**Figure 7.** The schematic diagram of DAG method for five-class SVM model.
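As an illustration of the DAG scheme in Figure 7, the sketch below builds a multi-class classifier from pairwise scikit-learn `SVC` models with the RBF kernel. The class name `DAGSVM` and the parameter defaults are assumptions, with `C` playing the role of the penalty parameter *λ* in Equation (19); for simplicity it eliminates classes in a fixed first-versus-last order rather than the easy-classes-first ordering described above, which would require a class-separability estimate the text does not detail.

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

class DAGSVM:
    """Sketch of a DAG multi-class SVM built from pairwise RBF two-class SVMs.

    Trains one SVC per class pair; prediction walks the DAG, eliminating
    one class per node, so each sample needs only n - 1 evaluations.
    """
    def __init__(self, C=1.0, gamma='scale'):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.models_ = {}
        for a, b in combinations(self.classes_, 2):
            mask = (y == a) | (y == b)          # train on the two classes only
            clf = SVC(kernel='rbf', C=self.C, gamma=self.gamma)
            clf.fit(X[mask], y[mask])
            self.models_[(a, b)] = clf
        return self

    def predict(self, X):
        preds = []
        for x in np.asarray(X):
            remaining = list(self.classes_)     # stays sorted as classes drop
            while len(remaining) > 1:
                a, b = remaining[0], remaining[-1]
                winner = self.models_[(a, b)].predict(x.reshape(1, -1))[0]
                # eliminate the class the pairwise SVM voted against
                remaining.remove(a if winner == b else b)
            preds.append(remaining[0])
        return np.array(preds)
```

Training cost is one two-class SVM per pair, i.e., *n*(*n* − 1)/2 models, but each prediction traverses only *n* − 1 of them, which is the main appeal of the DAG arrangement.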
