Article

A Mandarin Tone Recognition Algorithm Based on Random Forest and Feature Fusion †

Jiameng Yan, Qiang Meng, Lan Tian, Xiaoyu Wang, Junhui Liu, Meng Li, Ming Zeng and Huifang Xu
1 School of Microelectronics, Shandong University, Jinan 250100, China
2 China Telecom Shandong Branch, Jinan 250098, China
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in Proceedings of the 7th International Conference on Control Engineering and Artificial Intelligence (CCEAI 2023), Sanya, China, 28–30 January 2023; pp. 168–172.
These authors contributed equally to this work and are co-first authors.
Mathematics 2023, 11(8), 1879; https://doi.org/10.3390/math11081879
Submission received: 1 March 2023 / Revised: 24 March 2023 / Accepted: 13 April 2023 / Published: 15 April 2023

Abstract

In human–computer interaction (HCI) systems for Mandarin learning, tone recognition is of great importance. A novel tone recognition method based on random forest (RF) and feature fusion is proposed in this study. First, three fusion feature sets (FFSs) were created by applying different fusion methods to sound source features linked to Mandarin syllable tone. After the CART decision trees were constructed using the three FFSs, the corresponding RF tone classifiers were modeled and optimized. The method was tested and evaluated on the Syllable Corpus of Standard Chinese (SCSC), a speaker-independent Mandarin monosyllable corpus, and its performance was also assessed on small sample sets. The results show that the tone recognition algorithm achieves high accuracy and has good generalization capability and classification ability with unbalanced data, indicating that the proposed approach is highly efficient and robust and is appropriate for mobile HCI learning systems.

1. Introduction

Unlike English and other Western non-tonal languages, Mandarin monosyllables with the same pinyin have four tones, such as mother (ma1), hemp (ma2), horse (ma3), and abuse (ma4). Therefore, for a learner whose mother tongue is non-tonal, mastering the tonal pronunciation of monosyllables is both a difficulty and a key point in learning Mandarin [1]; even a Chinese child with severe to profound prelingual deafness would have a similar problem after receiving a cochlear implant (CI) [2]. Since Mandarin sentences are composed of many continuous monosyllabic words, and because of vocal-cord vibration inertia and the mutual influence of adjacent syllables during pronunciation, some syllables' tones change according to tone-change (sandhi) rules. A portable Mandarin tone training system with a human–computer interaction (HCI) function is therefore of great significance for learners. However, such a system requires a tone recognition algorithm that is low-complexity, efficient, robust, and speaker-independent.
The key to tone recognition lies in the extraction of feature parameters and the design of the tone classifier. A number of tone recognition approaches already exist, based on either machine learning or deep learning. The method proposed by Fu et al. [3] extracted features related to fundamental frequency and energy in voiced segments and used support vector machines (SVMs) to automatically recognize the four Mandarin tones, achieving 93.52% accuracy. For Mizo, which has four tones like Mandarin, an SVM-based classifier using six F0 features achieved 73.39% accuracy and a deep neural network (DNN)-based classifier achieved 74.11% [4]. However, the extracted sound source features in these two methods were not fully optimized and fused. Zheng et al. [5] proposed a tone recognition algorithm for Mandarin three-syllable words based on fundamental frequency and an improved back propagation neural network (BPNN), with accuracies of 87.50%, 70.83%, and 79.17% for the first, middle, and last words, respectively, though training the BPNN model required 800,000 epochs. Shen et al. [6] put forward a Chinese four-tone recognition method based on the fusion of prosodic and cepstral features; the tone classification accuracies of the Gaussian mixture model, BPNN, SVM, and convolutional neural network (CNN) were 84.55%, 86.28%, 85.50%, and 87.60%, respectively. Liu et al. [7] proposed a one-step continuous Mandarin tone recognition method based on pitch and spectral features using an MSD-HMM (multi-space distribution hidden Markov model), with a final tone accuracy of 88.80%. The performance of these two algorithms is still not ideal, and their tone feature parameters are not optimized for the tone recognition task. In [8], a speaker-independent four-tone recognition system for Mandarin digits was realized based only on the pitch contour of each syllable, achieving 90.20% recognition accuracy with a response time of about 1.64 s; although the system is very simple, both figures still leave room for improvement. A CNN-architecture Mandarin tone classifier [9] built on features pre-trained from Mel frequency cepstrum coefficient (MFCC) vectors with a denoising autoencoder achieved 95.53% accuracy; however, 3600 syllables were required for training. Based on a CNN and a multi-layer perceptron, the ToneNet model [10] classified Mandarin monosyllabic tones and achieved 99.16% accuracy on the Syllable Corpus of Standard Chinese (SCSC). However, this method takes the Mel spectrogram image as the model input and requires a large amount of training data and heavy computation.
Although some of the above methods have achieved good results in Mandarin tone recognition, no existing method combines high accuracy with low complexity. Random forest (RF) is an efficient ensemble modeling approach: during model construction, the bootstrap aggregating (bagging) algorithm and a random feature selection strategy help avoid falling into local optima [11,12]. To date, no Mandarin tone recognition method based on RF and feature fusion has been reported, and how to construct appropriate fusion features for an RF classifier, as well as the resulting recognition performance, remains to be explored. Thus, in this paper, we introduce RF into tone recognition to establish a highly efficient, high-performance Mandarin tone recognition method suitable for mobile training systems, especially with small sample sets.
In this study, sound source features related to Mandarin tones are first comprehensively selected and fused, and fusion feature sets (FFSs) are produced using three fusion methods; each FFS is used to build a corresponding RF model, after which classification and comparison experiments are conducted. Furthermore, the tone recognition performance of the different FFSs and RF models is evaluated on small sample sets. The results show that the RF classifier is highly effective and robust, making it suitable for Mandarin tone recognition with small sample sets.
The contributions of this paper are summarized as follows:
  • RF is applied to speaker-independent Mandarin tone recognition for the first time.
  • Through a large number of experiments with three FFSs built from sound source features alone, we find that RF-based tone recognition is a high-stability, low-complexity approach.
  • The proposed method is shown to perform well on small sample sets and to have strong generalization ability.

2. Materials and Methods

2.1. Data Description

Fundamental frequency, which varies greatly with speaker and gender, is an important feature parameter for tone recognition (the fundamental frequency range of male voices is 70~200 Hz and that of female voices is 140~400 Hz [13]). An effective speaker-independent Mandarin tone training system can therefore be designed to first select the gender and then start pronunciation practice. In this way, we only need to study the performance of the proposed algorithm for one gender; the method for the other gender can then be inferred.
In this work, the Syllable Corpus of Standard Chinese (SCSC) [14] was used to evaluate the effectiveness and robustness of the presented method. The corpus contains syllables used in daily Mandarin from fifteen male speakers, named m01, m02, …, m15, whose ages are not noted. There are 1275 identical syllables per speaker, and the sound files are stored in high-quality mono WAV format (16 kHz sampling, 16-bit). To form the experimental speech dataset and balance the four tones, 40 monosyllables were selected from each speaker, yielding 600 syllables in total: 10 syllables each of tone 1, tone 2, tone 3, and tone 4 per speaker (see Appendix A for details).

2.2. Preprocessing

In short-time frame processing, the frame length was set to 30 ms and the frame shift to 10 ms. The double threshold method based on the short-time zero-crossing rate and energy was then used to detect the voiced segments [15]. A Chebyshev low-pass filter with a cut-off frequency of 900 Hz was used to remove high-frequency vocal-tract components, and the auto-correlation method was used to derive the fundamental frequency parameters [16].
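The authors implemented this pipeline in MATLAB. As a rough illustration only, the following Python sketch (NumPy/SciPy) shows the framing, Chebyshev low-pass filtering, and autocorrelation-based F0 estimation described above; the filter order, ripple, and pitch search range are our own illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.signal import cheby1, filtfilt

FS = 16000                      # SCSC sampling rate (Hz)
FRAME_LEN = int(0.030 * FS)     # 30 ms frame length
FRAME_SHIFT = int(0.010 * FS)   # 10 ms frame shift

# Type-I Chebyshev low-pass filter with a 900 Hz cut-off, used to suppress
# high-frequency vocal-tract detail before pitch estimation (order and
# ripple are assumed here, not stated in the paper).
b, a = cheby1(N=4, rp=0.5, Wn=900, btype='low', fs=FS)

def frames(x):
    """Split a signal into overlapping 30 ms frames with a 10 ms shift."""
    n = 1 + max(0, (len(x) - FRAME_LEN) // FRAME_SHIFT)
    return np.stack([x[i * FRAME_SHIFT: i * FRAME_SHIFT + FRAME_LEN]
                     for i in range(n)])

def f0_autocorr(frame, fmin=70.0, fmax=400.0):
    """Estimate F0 of one voiced frame via the autocorrelation method."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(FS / fmax), int(FS / fmin)   # admissible pitch lags
    lag = lo + int(np.argmax(r[lo:hi]))
    return FS / lag

x = np.random.randn(FS)          # stand-in for one syllable's samples
x = filtfilt(b, a, x)            # remove high-frequency vocal-tract detail
contour = np.array([f0_autocorr(f) for f in frames(x)])
```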

2.3. Feature Fusion

In a preliminary experiment (see Section 3.2.1 for details), we found that cepstrum features contribute little to improving tone recognition accuracy, so only sound source features are used for feature fusion in tone recognition.
In this paper, seven original feature sets were used after reviewing the literature; they are shown in Table 1.
Efficient RF models rely on high-quality FFSs. At present, it is not clear which of the above feature parameters are important for the tone recognition task, so we used the three feature fusion methods shown in Figure 1 to explore this.
The specific details of these sets are as follows:
  • FFS SI was directly composed of all 94 feature parameters from S1 to S7.
  • The second method involved a BPNN, which has fine performance and wide application and was selected as a fixed classifier model for feature optimization in the tone recognition task. The number of nodes in the BPNN's hidden layer was set to 32. The process was as follows: for each feature in SI, the ReliefF algorithm [22] was used to calculate a weight, and the features were ranked by weight from large to small. The features were then input into the BPNN in that order for tone recognition, and the process stopped once recognition accuracy no longer rose (a sketch of this ranking-and-selection loop is given after this list). FFS SII was thus formed, which contained fifteen features.
  • FFS SIII, which included twelve features, was obtained using the third method. Firstly, the top three feature sets of S1 to S7 were selected by the BPNN. Each feature from the top three sets was then ranked by ReliefF. Lastly, the twelve features could be optimized according to a process similar to that of the second method.
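As a rough illustration of the second fusion method, the following Python sketch pairs a deliberately simplified ReliefF (the paper uses the full algorithm of [22]) with scikit-learn's MLPClassifier standing in for the 32-node BPNN. All names and settings here are our own assumptions, not the authors' MATLAB implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def relieff_weights(X, y, n_rounds=200, rng=np.random.default_rng(0)):
    """Tiny ReliefF variant: one nearest hit and one nearest miss per
    sampled instance; larger weight means a more relevant feature."""
    X = (X - X.min(0)) / (np.ptp(X, axis=0) + 1e-12)   # scale to [0, 1]
    w = np.zeros(X.shape[1])
    for i in rng.integers(0, len(X), n_rounds):
        d = np.abs(X - X[i]).sum(1)                    # Manhattan distances
        d[i] = np.inf                                  # exclude the sample itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / n_rounds
    return w

def select_by_accuracy(X, y, hidden=32):
    """Add features in descending ReliefF-weight order and stop as soon
    as cross-validated accuracy of the fixed BPNN stops improving."""
    order = np.argsort(relieff_weights(X, y))[::-1]
    best_acc, kept = -np.inf, []
    for j in order:
        cand = kept + [j]
        clf = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=2000)
        acc = cross_val_score(clf, X[:, cand], y, cv=5).mean()
        if acc <= best_acc:
            break                 # accuracy no longer rises
        best_acc, kept = acc, cand
    return kept, best_acc
```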

2.4. Classifier

The proposed method aims to handle small sample sets well, achieve fast modeling, and rely on simple calculations so that it can be deployed on small mobile terminals for Mandarin learners. Therefore, the tone recognition classifier should not be too complicated, and low-complexity machine learning models are more suitable for constructing it. Back propagation neural networks, support vector machines, the Naive Bayes model (NBM), AdaBoost, and random forest are commonly used machine learning classifiers [6,23,24], so these five classifiers were used for the preliminary experiment in Section 3.2.2. We found that the RF classifier has better learning ability; however, no prior study has applied RF to Mandarin tone recognition.

2.5. Tone Recognition Classifier Based on Random Forest

2.5.1. CART Decision Tree and Random Forest Modeling

A random forest is composed of T decision trees, where T is a hyperparameter. The ID3, C4.5, and CART algorithms are commonly used decision tree algorithms [25], and the classification accuracy of the CART algorithm has been shown to be better than that of the others [26]. Thus, in this paper, the CART algorithm was used to construct the decision trees. The samples for each decision tree in the random forest are chosen randomly (via the bootstrap sampling algorithm used to construct the sample set during training), which effectively avoids overfitting and improves robustness.
In the training process, the shape of the training set is M1×N, where M1 denotes the number of training samples (covering L tone types), N denotes the number of features of each sample, and T is the number of decision trees.
The specific process is as follows:
1. During the construction of each decision tree, M1 samples are drawn from the training set randomly and with replacement to obtain a bootstrap sample set of size M1, in which some training samples appear multiple times and some not at all. This forms the M1×N feature matrix F = {fi,j, i = 1,…, M1, j = 1,…, N}.
2. At the root node of each decision tree, the optimal feature, i.e., the one with the smallest Gini index over F, is selected, and its feature value is the decision at the root node.
In the CART algorithm, the Gini index is used to select the node feature and represents the impurity of the dataset; the smaller the Gini index, the lower the impurity (a short sketch of this computation is given at the end of this subsection). It is expressed by the following equations:

$$\mathrm{Gini}(D,F)=\frac{|D_1|}{|D|}\mathrm{Gini}(D_1)+\frac{|D_2|}{|D|}\mathrm{Gini}(D_2)$$

$$\mathrm{Gini}(D_1)=1-\sum_{t=1}^{S}\left(\frac{|S_t|}{|D_1|}\right)^2$$

$$\mathrm{Gini}(D_2)=1-\sum_{t=1}^{S}\left(\frac{|S_t|}{|D_2|}\right)^2$$

where D denotes the sample set at a certain node, containing S tone types, and |D| is its size. The number of samples of the t-th tone type is |S_t|. D is divided into two subsets (D_1 and D_2) according to the value of the current node feature f_{i,j}, with D_1 containing the samples whose f_{i,j} feature value is lower than that of the current node.
3. Next, the M1×N matrix is divided into two parts, M11×N and M12×N. The optimal feature with the smallest Gini index is selected within M11, and its feature value is the decision at this branch node. A similar step is performed in M12. M11 is in turn divided into two parts, and the above process is repeated until all features are used or a tone type is output at a leaf node.
4. Steps 1–3 are repeated T times to construct T decision trees, which together form the random forest. Figure 2 shows one decision tree.
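To make the node-selection step concrete, here is a minimal Python sketch of the Gini computation and exhaustive split search described above. It illustrates the equations rather than reproducing the authors' MATLAB code, and the function names are our own.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a tone-label array: 1 - sum_t (|S_t|/|D|)^2."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return 1.0 - np.sum(p ** 2)

def split_gini(feature_col, labels, threshold):
    """Gini(D, F): weighted impurity after splitting D on one feature value."""
    left = labels[feature_col < threshold]    # D1: feature value below threshold
    right = labels[feature_col >= threshold]  # D2: the remaining samples
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_split(F, labels):
    """Scan all features and candidate thresholds; return the pair with the
    smallest Gini index (the decision stored at the current node)."""
    best = (np.inf, None, None)
    for j in range(F.shape[1]):
        for t in np.unique(F[:, j]):
            g = split_gini(F[:, j], labels, t)
            if g < best[0]:
                best = (g, j, t)
    return best   # (gini, feature index, threshold)
```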

2.5.2. Tone Recognition Based on the Random Forest Classifier

In the test process, the matrix of the test set is M2×N, where M2 denotes the number of test samples. The test set is input into the modeled random forest classifier, and the steps are as follows:
  • The test set is fed into the pre-trained random forest classifier.
  • Starting from the root node of the current decision tree, the random forest classifier compares the feature parameters based on the value of the current node on each decision tree until the decision reaches the leaf node, which outputs the corresponding tone type.
  • Since each decision tree is independent in the recognition process of each test sample, the final recognition result of the test sample is obtained via a voting process involving the results of multiple decision trees.
The entire training and test process is shown in Figure 3.
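The training-and-voting pipeline of Sections 2.5.1 and 2.5.2 can be approximated with scikit-learn, whose RandomForestClassifier likewise combines CART-style trees, the Gini criterion, bootstrap sampling, and aggregation of per-tree decisions. The sketch below uses random stand-in data; apart from T = 350 (Section 3.1) and the Gini criterion, all settings are our own assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: M samples x N fused features, tone labels 1..4.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))        # e.g., the 12 features of FFS SIII
y = rng.integers(1, 5, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

rf = RandomForestClassifier(
    n_estimators=350,    # hyperparameter T (optimal for SIII per Section 3.1)
    criterion='gini',    # CART-style node selection via the Gini index above
    bootstrap=True,      # draw M1 training rows with replacement per tree
    random_state=0,
)
rf.fit(X_tr, y_tr)
pred = rf.predict(X_te)  # per-tree decisions aggregated by voting
print('ACC: %.2f%%' % (100 * (pred == y_te).mean()))
```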

3. Experiment and Result

All experiments in this paper were implemented and tested on MATLAB R2020a using a 64-bit computer (Intel Core i7-12700 CPU, 2.10 GHz; 16.0 GB RAM).

3.1. Optimization Experiment of RF Classifier’s Hyperparameter T

The number of decision trees T is an important hyperparameter of the RF classifier, and three RF classifiers were constructed based on SI, SII, and SIII. T was set to 100, 200, 300, 400, and 500 to locate the optimal value, and the search step was then reduced to 50 around the best value. The evaluation metrics were recognition accuracy rate (ACC), the area under the receiver operating characteristic curve (AUROC), and the area under the precision–recall curve (AUPRC), which are commonly used for evaluating classifier performance [27].
As shown in Figure 4 and Table 2, Table 3 and Table 4, the RF classifiers based on SI, SII, and SIII achieve their best performance at T = 400 (ACC, AUROC, and AUPRC of 98.33%, 98.88%, and 98.32%, respectively), T = 350 (97.50% ACC, 98.35% AUROC, and 97.50% AUPRC), and T = 350 (98.00% ACC, 98.65% AUROC, and 97.97% AUPRC), respectively.
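As an illustration of this hyperparameter sweep (not the authors' MATLAB code), the following sketch evaluates ACC, macro AUROC, and macro AUPRC for each candidate T under five-fold cross-validation; the helper name evaluate_T is our own.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.preprocessing import label_binarize

def evaluate_T(X, y, T_values=(100, 200, 300, 350, 400, 450, 500)):
    """Five-fold sweep over the number of trees T, reporting the three
    metrics used in the paper (ACC, macro AUROC, macro AUPRC)."""
    classes = np.unique(y)
    y_bin = label_binarize(y, classes=classes)        # one-hot tone labels
    for T in T_values:
        rf = RandomForestClassifier(n_estimators=T, random_state=0)
        proba = cross_val_predict(rf, X, y, cv=5, method='predict_proba')
        acc = (classes[proba.argmax(1)] == y).mean()
        auroc = roc_auc_score(y_bin, proba, average='macro')
        auprc = average_precision_score(y_bin, proba, average='macro')
        print(f'T={T}: ACC={acc:.2%}  AUROC={auroc:.2%}  AUPRC={auprc:.2%}')
```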

3.2. Preliminary Experiment

3.2.1. Analysis of the Role of Vocal Tract Features in Tone Recognition

The feature parameters of the original speech signal can be divided into vocal tract features and sound source features. The former are mainly spectrum envelope parameters such as MFCC, and the latter are mainly time domain features, such as duration, energy, and fundamental frequency. MFCC, as a typical vocal tract feature, is commonly used in acoustic analysis such as automatic speech recognition [28] and dialect and language recognition [29]. However, the role of this feature in tone recognition needs to be evaluated through analysis experiments.
Extracting MFCCs requires preprocessing (without the 900 Hz low-pass filtering), fast Fourier transform (FFT) calculation, spectral line energy, Mel filter-bank energy, and a discrete cosine transform (DCT) cepstrum. For each speech frame, twelve MFCCs and twelve ΔMFCCs were extracted, and the 24 feature parameters formed a one-dimensional feature vector. The ten frames at the center of the syllable were selected for calculation, and the resulting 240 feature parameters were named the cepstrum feature set, which was used for the tone recognition pre-experiment. With five-fold cross-validation and a three-layer BPNN with 64 hidden-layer nodes, tone recognition accuracy was 50.67%. Evidently, the result of tone recognition using the cepstrum feature set alone is not ideal. Next, we used both sound source features and cepstrum features in the experiment, selecting the fundamental frequency statistical features introduced in reference [6] as the sound source features. The fundamental frequency statistical features and cepstrum features were first used separately to carry out tone recognition experiments on the BPNN; the outputs of the two BPNN tone classifiers were then combined with weights α and 1−α, respectively, to explore how tone recognition accuracy changes with the weight α. The specific formulas are as follows:
$$T^{*}_{pitch} = \mathrm{average}\{T_{pitch}(T_n, X_i)\}$$

$$T^{*}_{MFCC} = \mathrm{average}\{T_{MFCC}(T_n, X_i)\}$$
where $T^{*}_{pitch}$ and $T^{*}_{MFCC}$ are the recognition accuracies obtained using the fundamental frequency statistical features and the cepstrum features, respectively; n is the tone label, with values of 1, 2, 3, and 4; $T_{pitch}(T_n, X_i)$ denotes the accuracy for tone $T_n$ on test sample set $X_i$ (1 ≤ i ≤ N, where N is the total number of samples) when using fundamental frequency statistical features, and $T_{MFCC}(T_n, X_i)$ the corresponding accuracy when using cepstrum features. The combined recognition accuracy of the two tone classifiers is defined as follows:
$$T^{*} = \alpha \cdot T^{*}_{pitch} + (1 - \alpha) \cdot T^{*}_{MFCC}$$
We carried out five-fold cross-validation with α varied from 0 to 1 in steps of 0.1; the results are shown in Figure 5.
It can be seen that the accuracy of tone recognition is not high when the cepstrum features are used alone. When the fundamental frequency statistical features and cepstrum features are used jointly for tone recognition, the fundamental frequency statistical features play a major role in greatly improving classification accuracy. Cepstrum features have little effect on improving the accuracy of tone recognition but greatly increase the operational complexity of parameter extraction. Therefore, cepstrum features are not used for tone recognition in this paper.
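For readers who want to reproduce this pre-experiment, the sketch below shows one plausible construction of the 240-dimensional cepstrum feature set, with librosa standing in for the paper's MATLAB extraction (librosa's default frame sizes differ from the 30 ms/10 ms framing above, and at least 10 frames are assumed), followed by the α-weighted combination of the two classifiers' accuracies. The numeric accuracies are illustrative placeholders, not the paper's results.

```python
import numpy as np
import librosa

def cepstrum_vector(y, sr=16000):
    """240-dim cepstrum feature set: 12 MFCCs + 12 delta-MFCCs for the
    10 frames at the centre of the syllable."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)
    feats = np.vstack([mfcc, librosa.feature.delta(mfcc)])  # 24 x n_frames
    mid = feats.shape[1] // 2
    return feats[:, mid - 5: mid + 5].T.ravel()             # 10 frames -> 240 values

y = librosa.tone(200, sr=16000, duration=0.5)   # stand-in for one syllable
vec = cepstrum_vector(y)                        # shape (240,)

# Combined accuracy T* = alpha * T*_pitch + (1 - alpha) * T*_MFCC,
# with made-up per-classifier accuracies for illustration.
t_pitch, t_mfcc = 0.95, 0.51
for alpha in np.arange(0.0, 1.01, 0.1):
    print(f'alpha={alpha:.1f}  T*={alpha * t_pitch + (1 - alpha) * t_mfcc:.3f}')
```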

3.2.2. Analysis of Classifiers in Tone Recognition

The five classifiers, each with a typical structure, were used for pre-experiments on S1 to S7, and the experimental results are shown in Figure 6. On S2, S5, S6, and S7, RF achieved the highest recognition accuracy, and on the remaining feature sets (S1, S3, and S4), the BPNN achieved the highest recognition accuracy, which indicates that the random forest has better learning ability. Furthermore, the recognition accuracy of the five classifiers was averaged over the seven feature sets, and the average accuracy of the RF classifier was slightly higher than that of the others. The best result for the NBM reached 96.67% while its worst was only 50.67%, indicating that the NBM's classification is either unstable or not robust; the accuracy of RF and the BPNN always remained above 95%. The classification results of the five classifiers were obtained on the same seven feature sets, and since different classifiers are based on different mathematical models, the comparison reflects the role of the underlying mathematical model in classification performance.
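A rough scikit-learn counterpart of this five-classifier comparison is sketched below; the specific configurations are our own "typical" choices rather than the paper's MATLAB settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The five candidate classifiers under typical settings (our assumptions).
classifiers = {
    'BPNN': make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000)),
    'SVM': make_pipeline(StandardScaler(), SVC()),
    'NBM': GaussianNB(),
    'AdaBoost': AdaBoostClassifier(),
    'RF': RandomForestClassifier(),
}

def compare(feature_sets, y):
    """Average five-fold accuracy of each classifier over the feature sets
    S1..S7 (feature_sets: dict mapping set name to its feature matrix)."""
    for name, clf in classifiers.items():
        accs = [cross_val_score(clf, X, y, cv=5).mean()
                for X in feature_sets.values()]
        print(f'{name}: mean ACC over sets = {np.mean(accs):.2%}')
```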

3.3. Comparative Experiment

In order to study the performance of the proposed method, comparative experiments were performed.

3.3.1. Comparative Experiments of Different Fusion Feature Sets

The four performance results were obtained by applying five-fold cross-validation on the three FFSs and RF classifiers. The results are shown in Table 5.
The Average Processing Time Per Sample (APTPS) index is the average time from feature extraction to tone identification for a single syllable. This index shows that the RF methods achieve real-time identification together with high recognition accuracy for all three FFSs, demonstrating that the FFS-based RF tone recognition methods are highly efficient.
SI contains 94 features, the most of the three FFSs, and can therefore describe the characteristics of tone most comprehensively; accordingly, its ACC, AUROC, and AUPRC values are the highest and its classification effect is the best. The experimental results also show that the classification effect of SII (15 features) is slightly worse than that of SIII (12 features), even though SII contains more features. This shows that simply increasing the number of features does not necessarily improve recognition accuracy and suggests that how the features are fused matters more.
In addition, feature optimization is also needed to decrease algorithm complexity. Comparing SII and SIII, the classification effect is better and the running speed faster with SIII. In general, the feature fusion method not only reduces the running time but also maintains high recognition accuracy.
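The APTPS index can be measured with a simple timing harness such as the following sketch, where extract_features and model are placeholders for the FFS extraction routine and the trained RF classifier.

```python
import time

def aptps(model, extract_features, wavs):
    """Average Processing Time Per Sample: mean wall-clock time from
    feature extraction to the tone decision for single syllables."""
    start = time.perf_counter()
    for wav in wavs:
        feats = extract_features(wav)          # build the FFS vector
        model.predict(feats.reshape(1, -1))    # one-syllable tone decision
    return (time.perf_counter() - start) / len(wavs)
```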

3.3.2. Comparative Experiments of Small Sample Sets

To analyze the performance of the RF classifier in tone recognition, we carried out comparative experiments on small training sets. The experimental speech database (i.e., 600 syllables) was divided into 10 parts, each containing the same number of samples of the four tones. The proportion of training samples was reduced from 90% to 10% in steps of 10%; consequently, the percentage of test samples increased from 10% to 90% in the same steps.
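A sketch of this shrinking-training-set protocol, assuming a feature matrix X and tone labels y and using stratified splits to keep the four tones balanced in each part (our own scikit-learn rendering, not the paper's code):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def small_sample_curve(X, y, seed=0):
    """Shrink the training share from 90% to 10% in 10% steps and report
    the RF test accuracy at each share."""
    for train_share in np.arange(0.9, 0.05, -0.1):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=float(train_share), stratify=y, random_state=seed)
        rf = RandomForestClassifier(n_estimators=350, random_state=seed)
        rf.fit(X_tr, y_tr)
        acc = (rf.predict(X_te) == y_te).mean()
        print(f'train={train_share:.0%}  ACC={acc:.2%}')
```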
Figure 7 shows the recognition results of the corresponding nine groups based on the three FFSs and RF classifiers. As the proportion of training samples decreases, tone recognition accuracy drops by only about four percentage points. Even when the training samples accounted for only ten percent of the database (i.e., 60 samples), accuracy remained above 93.57%, which demonstrates the powerful learning ability of random forest.

4. Discussion

Using state-of-the-art methods (ToneNet [10], CNN [9]) as a benchmark, we compared accuracy; the results are shown in Table 6. Although [10] used ToneNet to perform tone recognition on the SCSC database with an accuracy of 99.16%, it depends on a large amount of data and heavy computation, which runs counter to our intention of deploying a model on small mobile terminals. In [10], the authors applied the CNN method of [9] to the SCSC, with results that were not as good as those obtained in this work. Therefore, considering operation time and computational complexity together, our method is the most cost-effective.
In addition, we also carried out cross-dataset (i.e., beyond the original 600 syllables) testing experiments with balanced samples and extremely unbalanced samples. The balanced samples were 100 new syllables from the 15 original speakers, randomly selected from the SCSC, with 25 samples each of tone 1, tone 2, tone 3, and tone 4. The extremely unbalanced samples were another 400 syllables from the same 15 speakers, with tone 1:tone 2:tone 3:tone 4 ratios of 7:1:1:1, 1:7:1:1, 1:1:7:1, and 1:1:1:7, respectively. The balanced samples were fed into the three modeled FFS-based RF classifiers for tone recognition experiments, with the results shown in Figure 8a. The extremely unbalanced samples were fed into the modeled RF classifier corresponding to SIII (the most cost-effective), with the results shown in Figure 8b. We find that tone 2 and tone 3 are more difficult to recognize than tone 1 and tone 4, for both balanced and extremely unbalanced samples; this is consistent with previous research [30]. The cross-dataset test results show that the tone classification algorithms based on RF have strong generalization ability.
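For illustration, test sets with the 7:1:1:1-style ratios can be assembled as in the following sketch, where y holds the tone labels of the candidate syllables; the helper is hypothetical, not taken from the paper.

```python
import numpy as np

def make_unbalanced(y, major_tone, total=400, ratio=7,
                    rng=np.random.default_rng(0)):
    """Draw a test set of `total` syllables in which one tone outweighs
    each of the other three by `ratio`:1:1:1 (7 + 1 + 1 + 1 = 10 parts)."""
    idx = []
    for tone in (1, 2, 3, 4):
        share = ratio if tone == major_tone else 1
        n = total * share // (ratio + 3)          # samples for this tone
        pool = np.flatnonzero(y == tone)          # candidates of this tone
        idx.append(rng.choice(pool, size=n, replace=False))
    return np.concatenate(idx)
```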

5. Conclusions

This study introduces a novel Mandarin tone recognition approach based on random forest and feature fusion. Feature fusion and optimization clearly reduce the complexity of the algorithm and make it more suitable for portable Mandarin tone recognition, and the random forest model is a robust, low-complexity classifier for this task. Comparative experiments validate the performance of RF on three different FFSs, showing that RF modeling offers high recognition accuracy, simplicity, and strong learning capability. The method achieves good recognition based on the simplified FFS SIII using a small number of training samples. The proposed algorithm has achieved the expected results; however, only simulation verification has been completed so far, and in future work we will verify the method on more databases and deploy it on a practical learning terminal.

Author Contributions

Conceptualization, J.Y., M.L. and L.T.; methodology, M.L. and X.W.; software, J.Y. and J.L.; validation, J.Y., X.W. and M.L.; investigation, J.Y.; writing—original draft preparation, J.Y. and M.L.; writing—review and editing, L.T., J.Y., Q.M., M.Z. and H.X.; visualization, J.L. and X.W.; supervision, L.T.; funding acquisition, L.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Shandong Province, grant numbers ZR2021ZD40 and ZR2021MF065, and in part by the Research Project for Graduate Education and Teaching Reform, Shandong University, China, grant number XYJG2020108.

Data Availability Statement

The SCSC data can be obtained from the Laboratory of Phonetics and Speech Science, Institute of Linguistics, CASS at http://paslab.phonetics.org.cn/?p=1741 (accessed on 20 March 2023).

Acknowledgments

We gratefully acknowledge the support from the above funds.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. List of the 600 Syllables from Fifteen Speakers in the SCSC Database

Table A1. Speaker: m01.
Tone 1: a1, ai1, ao1, cheng1, e1, fang1, feng1, hou1, ji1, lao1
Tone 2: a2, ai2, ao2, cheng2, e2, fang2, feng2, hou2, ji2, lao2
Tone 3: a3, ai3, ao3, cheng3, e3, fang3, feng3, hou3, ji3, lao3
Tone 4: a4, ai4, ao4, cheng4, e4, fang4, feng4, hou4, ji4, lao4

Table A2. Speaker: m02.
Tone 1: ang1, en1, eng1, wei1, wo1, wu1, yan1, yang1, yao1, yi1
Tone 2: a2, ai2, ang2, ao2, er2, wang2, wei2, wen2, tu2, e2
Tone 3: fa3, lou3, yi3, pai3, yuan3, wen3, wo3, xiang3, yan3, yang3
Tone 4: a4, ang4, er4, gun4, lian4, lie4, lun4, ou4, si4, weng4

Table A3. Speaker: m03.
Tone 1: fan1, bie1, chai1, cun1, en1, mo1, tuo1, wu1, xiong1, yi1
Tone 2: a2, ai2, fo2, cun2, fang2, cu2, ju2, she2, wang2, wen2
Tone 3: wu3, yan3, wei3, fa3, yao3, ha3, lou3, wang3, weng3, xiang3
Tone 4: lie4, na4, lun4, gun4, mie4, mi4, ou4, si4, tie4, wen4

Table A4. Speaker: m04.
Tone 1: a1, bie1, tuo1, en1, yue1, wa1, tou1, you1, yuan1, mo1
Tone 2: ai2, wen2, fo2, fang2, ju2, lou2, wang2, wu2, she2, zhuo2
Tone 3: shu3, a3, yang3, yan3, ti3, yong3, you3, wo3, wei3, yu3
Tone 4: ci4, gun4, lie4, lian4, lun4, xian4, hu4, pao4, ou4, zuan4

Table A5. Speaker: m05.
Tone 1: weng1, en1, eng1, wo1, wu1, yao1, yin1, you1, yuan1, yue1
Tone 2: a2, ai2, er2, ban2, wang2, wen2, wu2, cu2, yong2, yuan2
Tone 3: fa3, er3, wen3, weng3, xiang3, yao3, yong3, you3, yuan3, yun3
Tone 4: ou4, wen4, wu4, yang4, yao4, ye4, ying4, you4, yuan4, yun4

Table A6. Speaker: m06.
Tone 1: en1, eng1, weng1, wo1, wu1, yao1, yin1, you1, yuan1, yue1
Tone 2: a2, ai2, er2, wan2, wang2, wen2, wu2, ang2, yong2, yuan2
Tone 3: ai3, er3, wen3, weng3, wu3, yao3, yong3, you3, yuan3, yun3
Tone 4: ou4, wen4, wu4, yang4, yao4, ye4, ying4, you4, yuan4, yun4

Table A7. Speaker: m07.
Tone 1: an1, ang1, e1, ou1, wai1, wan1, wei1, yan1, yang1, gou1
Tone 2: ao2, e2, fang2, zhuo2, qu2, ye2, ying2, yu2, yun2, chao2
Tone 3: a3, an3, yao3, wei3, zhai3, yang3, yan3, ye3, yin3, fa3
Tone 4: a4, an4, ang4, en4, er4, wa4, wang4, weng4, ying4, yong4

Table A8. Speaker: m08.
Tone 1: gou1, hao1, ji1, nie1, luo1, mang1, nang1, ao1, yong1, wen1
Tone 2: chao2, gen2, tong2, pu2, nu2, ping2, yan2, yi2, cheng2, zhi2
Tone 3: ao3, gan3, gou3, gen3, wa3, wai3, wan3, gu3, yin3, yu3
Tone 4: bai4, dong4, he4, hou4, ze4, nan4, miu4, wan4, wei4, zun4

Table A9. Speaker: m09.
Tone 1: dun1, fei1, hai1, hui1, jia1, qiu1, sui1, tan1, xian1, zha1
Tone 2: chou2, da2, ji2, jia2, li2, nang2, niang2, peng2, ruan2, gu2
Tone 3: da3, dang3, dian3, jia3, lin3, qian3, rao3, shi3, ta3, zen3
Tone 4: cheng4, cun4, guan4, jing4, niu4, qing4, sui4, xia4, zhe4, zuan4

Table A10. Speaker: m10.
Tone 1: cui1, fu1, gai1, gu1, hei1, jiao1, liao1, man1, dun1, zi1
Tone 2: fen2, hai2, hao2, tuan2, kang2, ruo2, qin2, nong2, shao2, gu2
Tone 3: chang3, dai3, ga3, nian3, shui3, tu3, xing3, xue3, zhuang3, lao3
Tone 4: ba4, bi4, dao4, duan4, ju4, nie4, nong4, shui4, qu4, kao4

Table A11. Speaker: m11.
Tone 1: li1, lao1, tei1, meng1, shang1, tong1, pian1, za1, zhe1, peng1
Tone 2: kuang2, chu2, chuang2, gang2, hen2, po2, jie2, lun2, zha2, zhuo2
Tone 3: hen3, mu3, jiang3, jue3, liang3, meng3, niao3, kou3, tui3, zhang3
Tone 4: bao4, chao4, gang4, gui4, lue4, nuo4, qi4, rui4, shang4, shuo4

Table A12. Speaker: m12.
Tone 1: sheng1, pan1, dao1, de1, kou1, sen1, shai1, shou1, che1, deng1
Tone 2: cen2, luo2, mu2, qie2, lin2, shou2, hong2, ta2, tai2, xu2
Tone 3: shun3, guang3, kua3, li3, qiang3, zun3, gei3, nang3, zhe3, sun3
Tone 4: kuai4, den4, di4, guang4, kong4, mu4, sa4, shen4, shou4, cuan4

Table A13. Speaker: m13.
Tone 1: gan1, gao1, hong1, hou1, lu1, pa1, po1, qi1, zhen1, zhi1
Tone 2: biao2, cong2, hang2, wa2, luo2, na2, nan2, nian2, pi2, qia2
Tone 3: bang3, dia3, dou3, gai3, jie3, ka3, mai3, qia3, zhen3, bie3
Tone 4: cuan4, duo4, gai4, jie4, juan4, mai4, qia4, she4, shua4, zhan4

Table A14. Speaker: m14.
Tone 1: bu1, chang1, pin1, dui1, huang1, kai1, kua1, lin1, ba1, zhao1
Tone 2: bu2, chui2, eng2, ge2, hu2, kui2, min2, pei2, ting2, tou2
Tone 3: bie3, chai3, huang3, na3, re3, bei3, sheng3, za3, zhao3, zu3
Tone 4: biao4, bie4, cha4, kun4, pei4, pie4, pu4, rao4, tie4, zong4

Table A15. Speaker: m15.
Tone 1: beng1, dai1, he1, jie1, jing1, long1, bin1, han1, qian1, jiu1
Tone 2: die2, huan2, ke2, liao2, nao2, neng2, nuo2, pian2, qiu2, qu2
Tone 3: beng3, duan3, pi3, fou3, mian3, fu3, niu3, mou3, tao3, yao3
Tone 4: chou4, guai4, heng4, huang4, jiang4, mao4, ba4, ren4, xiong4, xiu4

References

  1. Pelzl, E. What makes second language perception of Mandarin tones hard? A non-technical review of evidence from psycholinguistic research. Chin. Second Lang. 2019, 54, 51–78.
  2. Peng, S.C.; Tomblin, J.B.; Cheung, H.; Lin, Y.S.; Wang, L.S. Perception and production of Mandarin tones in prelingually deaf children with cochlear implants. Ear Hear. 2004, 25, 251–264.
  3. Fu, D.; Li, S.; Wang, S. Tone recognition based on support vector machine in continuous Mandarin Chinese. Comput. Sci. 2010, 37, 228–230.
  4. Gogoi, P.; Dey, A.; Lalhminghlui, W.; Sarmah, P.; Prasanna, S.R.M. Lexical Tone Recognition in Mizo using Acoustic-Prosodic Features. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020.
  5. Zheng, Y. Phonetic Pitch Detection and Tone Recognition of the Continuous Chinese Three-Syllabic Words. Master's Thesis, Jilin University, Jilin, China, 2004.
  6. Shen, L.J.; Wang, W. Fusion Feature Based Automatic Mandarin Chinese Short Tone Classification. Technol. Acoust. 2018, 37, 167–174.
  7. Liu, C.; Ge, F.; Pan, F.; Dong, B.; Yan, Y. A One-Step Tone Recognition Approach Using MSD-HMM for Continuous Speech. In Proceedings of Interspeech 2009, Brighton, UK, 6–10 September 2009.
  8. Chang, K.; Yang, C. A real-time pitch extraction and four-tone recognition system for Mandarin speech. J. Chin. Inst. Eng. 1986, 9, 37–49.
  9. Chen, C.; Bunescu, R.; Xu, L.; Liu, C. Tone Classification in Mandarin Chinese using Convolutional Neural Networks. In Proceedings of Interspeech 2016, San Francisco, CA, USA, 8–12 September 2016.
  10. Gao, Q.; Sun, S.; Yang, Y. ToneNet: A CNN Model of Tone Classification of Mandarin Chinese. In Proceedings of Interspeech 2019, Graz, Austria, 15–19 September 2019.
  11. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  12. Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
  13. Biemans, M. Gender Variation in Voice Quality. Ph.D. Thesis, Catholic University of Nijmegen, Nijmegen, The Netherlands, 2000.
  14. SCSC-Syllable Corpus of Standard Chinese | Laboratory of Phonetics and Speech Science, Institute of Linguistics, CASS. Available online: http://paslab.phonetics.org.cn/?p=1741 (accessed on 20 March 2023).
  15. He, R. Endpoint Detection Algorithm for Speech Signal in Low SNR Environment. Master's Thesis, Shandong University, Jinan, China, 2018.
  16. Li, M. Study on Multi-Feature Fusion Chinese Tone Recognition Algorithm Based on Machine Learning. Master's Thesis, Shandong University, Jinan, China, 2021.
  17. Zhang, W. Study on Acoustic Features and Tone Recognition of Speech Recognition. Master's Thesis, Shanghai Jiaotong University, Shanghai, China, 2003.
  18. Nie, K. Study on Speech Processing Strategy for Chinese-Spoken Cochlear Implants on the Basis of Characteristics of Chinese Language. Ph.D. Thesis, Tsinghua University, Beijing, China, 1999.
  19. Taylor, P. Analysis and synthesis of intonation using the Tilt model. J. Acoust. Soc. Am. 2000, 107, 1697–1714.
  20. Quang, V.M.; Besacier, L.; Castelli, E. Automatic question detection: Prosodic-lexical features and crosslingual experiments. In Proceedings of Interspeech 2007, Antwerp, Belgium, 27–31 August 2007.
  21. Ma, M.; Evanini, K.; Loukina, A.; Wang, X.; Zechner, K. Using F0 Contours to Assess Nativeness in a Sentence Repeat Task. In Proceedings of Interspeech 2015, Dresden, Germany, 6–10 September 2015.
  22. Robnik-Šikonja, M.; Kononenko, I. Theoretical and Empirical Analysis of ReliefF and RReliefF. Mach. Learn. 2003, 53, 23–69.
  23. Onan, A.; Korukoglu, S.; Bulut, H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Syst. Appl. 2016, 57, 232–247.
  24. Yan, J.; Tian, L.; Wang, X.; Liu, J.; Li, M. A Mandarin Tone Recognition Algorithm Based on Random Forest and Features Fusion. In Proceedings of the 7th International Conference on Control Engineering and Artificial Intelligence, CCEAI 2023, Sanya, China, 28–30 January 2023.
  25. Bittencourt, H.R.; Clarke, R.T. Use of classification and regression trees (CART) to classify remotely-sensed digital images. In Proceedings of IGARSS 2003, Toulouse, France, 21–25 July 2003.
  26. Javed Mehedi Shamrat, F.M.; Ranjan, R.; Hasib, K.M.; Yadav, A.; Siddique, A.H. Performance Evaluation Among ID3, C4.5, and CART Decision Tree Algorithm. In Proceedings of ICPCSN 2021, Salem, India, 19–20 March 2021.
  27. Xie, X.; Liu, H.; Chen, D.; Shu, M.; Wang, Y. Multilabel 12-Lead ECG Classification Based on Leadwise Grouping Multibranch Network. IEEE Trans. Instrum. Meas. 2022, 71, 1–11.
  28. Paul, B.; Bera, S.; Paul, R.; Phadikar, S. Bengali Spoken Numerals Recognition by MFCC and GMM Technique. In Proceedings of Advances in Electronics, Communication and Computing, Odisha, India, 5–6 March 2020.
  29. Koolagudi, S.G.; Rastogi, D.; Rao, K.S. Identification of Language using Mel-Frequency Cepstral Coefficients (MFCC). In Proceedings of the International Conference on Modelling Optimization and Computing, Kumarakoil, India, 10–11 April 2012.
  30. Hao, Y. Second language acquisition of Mandarin Chinese tones by tonal and non-tonal language speakers. J. Phon. 2012, 40, 269–279.
Figure 1. Optimization process of feature sets in tone recognition. S1 to S7 are the original feature sets selected. SI, SII, and SIII are fusion feature sets obtained using three different optimization methods.
Figure 2. Schematic diagram of one decision tree in the RF tone recognition process based on set SIII. xj indicates the jth feature of set SIII. The blue triangular box indicates the branch node, the blue line is the branch, the solid blue dot is the leaf node, and the related number is the tone prediction result.
Figure 3. Flow block of the RF training and test process. The blue section is the training process and the green section is the test process.
Figure 4. Results of optimizing hyperparameter T of the random forest classifier (i.e., the number of decision trees) with three FFSs. (ac) show that the optimal value of T on SI, SII, and SIII is 400, 350, and 350, respectively.
Figure 5. Bar graph of tone recognition ACC analysis based on vocal tract features and fundamental frequency statistical features. α is the proportion of tone recognition results based on fundamental frequency statistical features. The bar’s value when α = 0 indicates the ACC result only based on vocal tract features, and the bar’s value when α = 1 indicates the ACC result only based on fundamental frequency statistical features.
Figure 6. Tone recognition results of five different classifiers for set S1 to S7. On the right end, “Mean” represents the average accuracy of each classifier under seven feature sets. The value of the red box is the highest recognition accuracy under each feature set.
Figure 7. Recognition results of comparative experiments with small sample sets. The 600 samples were divided into 10 parts. The x-coordinate at ninety percent means 9 parts were taken for training and 1 part was taken for the tone recognition test. For the smallest sample set, only 1 part was taken for training and the remaining 9 parts were taken for the test.
Figure 8. Test results of cross-dataset experiments with balanced samples and extremely unbalanced samples using the modeled RF classifiers. (a) Results of balanced samples based on SI, SII, and SIII. Different colors represent different tones. (b) Results of extremely unbalanced samples based on SIII. The ratio of 7:1:1:1 indicates that tone 1 accounts for seventy percent of the samples, with tone 2, tone 3, and tone 4 accounting for ten percent. The ratio of 1:7:1:1 indicates that tone 2 accounts for seventy percent of samples, with tone 1, tone 3, and tone 4 accounting for ten percent. The other ratios follow this same pattern.
Table 1. The seven original feature sets.

Feature Set Name    Source                Number of Features
S1                  Reference [6]         22
S2                  Reference [17]        13
S3                  Reference [18]        6
S4                  Reference [3]         16
S5                  Reference [5]         18
S6                  Reference [19]        7
S7                  References [20,21]    12
Table 2. Results of SI and RF.

Number of Decision Trees (T)    100      200      300      350      400      450      500
ACC (%)                         98.17    98.00    98.00    98.00    98.33    98.00    98.00
AUROC (%)                       98.79    98.69    98.68    98.68    98.88    98.68    98.68
AUPRC (%)                       98.15    97.99    97.98    97.98    98.32    97.98    97.98
Table 3. Results of SII and RF.

Number of Decision Trees (T)    100      200      250      300      350      400      500
ACC (%)                         97.00    97.17    97.33    97.33    97.50    97.17    97.33
AUROC (%)                       98.02    98.12    98.23    98.23    98.35    98.12    98.23
AUPRC (%)                       97.02    97.17    97.32    97.32    97.50    97.17    97.32
Table 4. Results of SIII and RF.

Number of Decision Trees (T)    100      200      300      350      400      450      500
ACC (%)                         97.50    97.50    97.67    98.00    97.83    97.83    97.67
AUROC (%)                       98.33    98.31    98.42    98.65    98.52    98.55    98.42
AUPRC (%)                       97.47    97.48    97.63    97.97    97.80    97.81    97.63
Table 5. Recognition results from comparative experiments of different FFSs.

Set          SI        SII       SIII
ACC (%)      98.33     97.50     98.00
AUROC (%)    98.88     98.35     98.65
AUPRC (%)    98.32     97.50     97.97
APTPS (s)    0.0022    0.0011    0.0007
Table 6. Recognition performance of different methods on the SCSC database.

Method          ACC       Database    Suitable for Small Learning Terminal
ToneNet [10]    99.16%    SCSC        NO
CNN [9]         94.45%    SCSC        NO
The proposed    98.33%    SCSC        YES
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

