#### 3.1.1. Removing Stop Words

In computing, stop words are words that are filtered out before or after the processing of natural-language data (text). Any group of words can be chosen as stop words for a given purpose; for some search engines, they are the most common short function words, such as 'the', 'is', 'at', and 'which'.
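As a minimal sketch of this filtering step (the stop-word list below is an illustrative sample, not any engine's actual list):

```python
# Illustrative stop-word list; real systems use much larger, task-specific lists.
STOP_WORDS = {"the", "is", "at", "which", "on", "a", "an"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "cat", "is", "at", "home"]))  # ['cat', 'home']
```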

#### 3.1.2. Segmentation

Chinese differs fundamentally from English. In English, words are separated by spaces and each word carries an independent meaning. Chinese text, by contrast, has no spaces between words. Furthermore, although each character has its own meaning, the meaning changes when characters are combined. Separating words according to context is therefore both important and difficult: a wrong segmentation can completely change a sentence's meaning and increases the difficulty of classification.

After comparing the most commonly used tools for Chinese word segmentation, we finally chose "Jieba", which aims to be the best Python Chinese word segmentation module. Its main algorithm is as follows:


For Chinese words, it uses four states (BEMS) to label each character: B (begin), E (end), M (middle), and S (single). In addition, after training on large quantities of corpora, it obtains three probability tables: a transition probability matrix, an emission probability matrix, and an initial state probability vector. Then, for a sentence that needs to be segmented, the HMM model uses the Viterbi algorithm to obtain the best BEMS sequence, in which each multi-character word begins with a 'B' character and ends with an 'E' character.
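Once a BEMS sequence has been decoded, recovering the segmented words is mechanical: 'B' opens a word, 'M' continues it, 'E' closes it, and 'S' stands alone. A small sketch of this decoding step (the example sentence and tags are illustrative):

```python
def bems_to_words(chars, tags):
    """Recover words from characters and their BEMS tags.

    'B' opens a multi-character word, 'M' extends it, 'E' closes it,
    and 'S' is a single-character word on its own.
    """
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            buf = ch
        elif tag == "M":
            buf += ch
        else:  # "E"
            words.append(buf + ch)
            buf = ""
    return words

print(bems_to_words("我来到北京", "SBEBE"))  # ['我', '来到', '北京']
```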

Assume that the HMM state space *S* contains *K* states, that the probability of starting in state *i* is $\pi\_i$, and that the transition probability from state *i* to state *j* is $a\_{ij}$. Let the observed outputs be $y\_1, \ldots, y\_T$. The most likely state sequence $x\_1, \ldots, x\_T$ that produced the observations is given by the recurrence relations:

$$V\_{1,k} = P(y\_1|k) \cdot \pi\_k, \tag{1}$$

$$V\_{t,k} = \max\_{x \in S} \left( P(y\_t|k) \cdot a\_{x,k} \cdot V\_{t-1,x} \right) . \tag{2}$$

Here, $V\_{t,k}$ is the probability of the most likely state sequence that accounts for the first *t* observations and ends in state *k*. The Viterbi path can be obtained by saving back pointers that remember which state *x* was used in Equation (2): declare a function $\mathrm{Ptr}(k, t)$ that returns the *x* value used to compute $V\_{t,k}$ if *t* > 1, or returns *k* if *t* = 1. Then:

$$x\_T = \arg\max\_{x \in S} (V\_{T,x}), \tag{3}$$

$$x\_{t-1} = \mathrm{Ptr}(x\_t, t). \tag{4}$$
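The recurrences above can be sketched directly in Python. The probability tables below are toy values chosen for illustration, not Jieba's trained tables, and the two-symbol observation alphabet is an assumption made to keep the example small:

```python
STATES = "BMES"

def viterbi(obs, start_p, trans_p, emit_p):
    """Viterbi decoding for an HMM, following Eqs. (1)-(4).

    obs     : sequence of observed symbols (characters)
    start_p : dict state -> initial probability pi_k
    trans_p : dict state -> dict state -> transition probability a_{x,k}
    emit_p  : dict state -> dict symbol -> emission probability P(y_t|k)
    Returns the most likely state sequence x_1 ... x_T as a string.
    """
    # Eq. (1): V_{1,k} = P(y_1|k) * pi_k
    V = [{k: emit_p[k].get(obs[0], 0.0) * start_p[k] for k in STATES}]
    ptr = [{}]  # back pointers Ptr(k, t)
    # Eq. (2): V_{t,k} = max_x P(y_t|k) * a_{x,k} * V_{t-1,x}
    for t in range(1, len(obs)):
        V.append({})
        ptr.append({})
        for k in STATES:
            best_x = max(STATES, key=lambda x: V[t - 1][x] * trans_p[x].get(k, 0.0))
            V[t][k] = emit_p[k].get(obs[t], 0.0) * trans_p[best_x].get(k, 0.0) * V[t - 1][best_x]
            ptr[t][k] = best_x
    # Eq. (3): x_T = argmax_x V_{T,x}
    x = max(STATES, key=lambda k: V[-1][k])
    path = [x]
    # Eq. (4): x_{t-1} = Ptr(x_t, t)
    for t in range(len(obs) - 1, 0, -1):
        x = ptr[t][x]
        path.append(x)
    return "".join(reversed(path))

# Toy tables: words tend to start ('B') and then close ('E').
START_P = {"B": 0.6, "M": 0.0, "E": 0.0, "S": 0.4}
TRANS_P = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT_P = {k: {"a": 0.5, "b": 0.5} for k in STATES}

print(viterbi("ab", START_P, TRANS_P, EMIT_P))  # BE
```

With these toy tables, the two-character observation "ab" is tagged "BE", i.e. one two-character word, which matches the BEMS convention described above.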
