*Article* **Lightweight Multi-Scale Dilated U-Net for Crop Disease Leaf Image Segmentation**

**Cong Xu, Changqing Yu and Shanwen Zhang \***

School of Electronic Information, Xijing University, Xi'an 710123, China

**\*** Correspondence: zhangshanwen@xijing.edu.cn

**Abstract:** Crop disease leaf image segmentation (CDLIS) is the premise of disease detection, disease type recognition and disease degree evaluation. Various convolutional neural networks (CNNs) and their modified models have been proposed for CDLIS, but their training times are very long. Aiming at the low segmentation accuracy of the traditional U-Net on diseased leaf images with different sizes, colors, shapes, blurred speckle edges and complex backgrounds, a lightweight multi-scale dilated U-Net (LWMSDU-Net) is constructed for CDLIS. It is composed of encoding and decoding sub-networks: the encoding sub-network adopts multi-scale dilated convolution, the decoding sub-network adopts a deconvolution model, and a residual connection between each encoding module and its corresponding decoding module fuses the shallow and deep features of the input image. Compared with the classical U-Net and multi-scale U-Net, LWMSDU-Net has one fewer layer, a small number of trainable parameters and lower computational complexity, and the skip connection of U-Net is replaced by a residual path (Respath) that connects the encoder and decoder before concatenation. Experimental results on a crop disease leaf image dataset demonstrate that the proposed method can effectively segment crop disease leaf images with an accuracy of 92.17%.

**Keywords:** crop disease leaf image segmentation (CDLIS); U-Net; dilated convolution; lightweight multi-scale dilated U-Net (LWMSDU-Net)

**1. Introduction**

Plant diseases severely affect the quality and yields of crops. Early detection of crop diseases reduces economic losses and has a positive impact on crop quality [1,2]. Crop disease leaf image segmentation (CDLIS) is a key prerequisite for the automatic detection, early warning, diagnosis and recognition of leaf diseases [3,4]. However, CDLIS is an important and challenging topic due to the various colors, shapes, textures, sizes and backgrounds of crop disease leaf images, as shown in Figure 1 [5,6].

**Figure 1.** Disease leaf image examples.

Many image segmentation algorithms, such as fixed threshold, Otsu, K-means clustering, C-means clustering, fuzzy clustering, maximum entropy, seven invariant moments and Local Binary Patterns (LBP), can be applied to CDLIS [7]. Wang et al. [8] proposed an adaptive three-stage CDLIS method based on K-means clustering. Fernandez et al. [9] applied principal component analysis (PCA) to the spectrum to evaluate the spectral separability between healthy and infected leaves, used the spectral ratio between infected and healthy leaves to determine the optimal wavelength for disease detection, and applied a linear support vector machine (SVM) classifier to selected spectral features.

**Citation:** Xu, C.; Yu, C.; Zhang, S. Lightweight Multi-Scale Dilated U-Net for Crop Disease Leaf Image Segmentation. *Electronics* **2022**, *11*, 3947. https://doi.org/10.3390/electronics11233947

Academic Editor: Juan M. Corchado

Received: 8 October 2022; Accepted: 24 November 2022; Published: 29 November 2022

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The accuracy of the above traditional algorithms mainly depends on hand-tuned experience, and due to the complexity of diseased leaf images, they lack generalization ability. With improvements in computing power, storage, the Internet of Things, big data and artificial intelligence, deep learning methods, such as the convolutional neural network (CNN), fully convolutional network (FCN) and U-Net, have been widely applied to the detection, segmentation and classification of crop disease leaf images and have achieved high accuracy [10–13]. Ashwinkumar [14] proposed an optimal mobile network-based CNN (OMNCNN) for detecting and classifying plant leaf diseases; it involves bilateral filtering-based image preprocessing and Kapur's thresholding-based image segmentation to detect the affected portions of the leaf image. U-Net is a relatively simple and widely used image semantic segmentation model and has achieved remarkable performance in medical image segmentation. However, its segmentation performance on small targets of widely varying scales can be poor. U-Net can be improved in many respects, such as the number of encoders, the convolution operation, the up-sampling and down-sampling operations, residual operations, attention mechanisms, multi-scale convolution, the model optimization strategy and the connection type between encoding and decoding layers [15,16]. Tarasiewicz et al. [17] proposed a lightweight U-Net (LWU-Net) and applied it to multi-modal magnetic resonance brain tumor image segmentation, obtaining accurate brain tumor contours. Xiong et al. [18] proposed a multi-scale feature fusion attention U-Net (AU-Net) to improve defect detection accuracy under large background noise, unpredictable environments, and defects of different shapes and sizes in images of industrial parts. This model combines attention U-Net with a multi-scale feature fusion module to detect defects in low-noise images effectively. Yuan et al.
[19] presented an improved AU-Net, which integrates deep, rich semantic information with shallow detail information to perform adaptive and accurate segmentation of aneurysm images with large size differences in MRI angiography. Multi-scale U-Net (MSU-Net) concatenates the fixed and moving images with multi-scale inputs or an image pyramid and connects them with the corresponding layers of the same size in U-Net [20]. Tian et al. [21] proposed a modified MSU-Net with a dilated convolution structure, squeeze-and-excitation blocks and spatial transformer layers; experimental results indicated that it is competitive on both normal and abnormal images. Wang et al. [22] proposed an improved U-Net named HDA-ResUNet with residual connections, a plug-and-play, portable channel attention block and a hybrid dilated attention convolutional layer. It makes full use of the advantages of U-Net, the attention mechanism and dilated convolution, and performs accurate and effective medical image segmentation across different tasks. Nevertheless, in U-Net some relevant discriminative features may be lost during image segmentation.

Inspired by LWU-Net, AU-Net and MSU-Net, a lightweight multi-scale dilated U-Net (LWMSDU-Net) is constructed to improve the performance of CDLIS. The model is lightweight, and dilated convolutional encoding is used to fuse features from receptive fields of different sizes. The main contributions of this paper are as follows:


The rest of this paper is arranged as follows. Section 2 introduces the related works. LWMSDU-Net is described in detail in Section 3. Extensive experiments on a crop disease leaf image dataset are reported in Section 4. Finally, the paper is concluded and future work is given in Section 5.

#### **2. Related Works**

#### *2.1. Residual Block*

The difference between a residual convolution block and a standard convolution block is the skip connection [23], which effectively alleviates gradient vanishing and network degradation. A residual is the difference between the predicted value and the observed value. Suppose a layer of the network is described as *y* = *H*(*x*) and a residual block is written as *H*(*x*) = *F*(*x*) + *x*; then *F*(*x*) = *H*(*x*) − *x*. Taking *y* = *x* as the observed value and *H*(*x*) as the predicted value, *F*(*x*) = *H*(*x*) − *x* is the residual, hence the name residual network.
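As a toy illustration of this identity, the sketch below (plain NumPy, with dense matrices standing in for convolutions; all names are ours, not the paper's) builds a residual block *H*(*x*) = *F*(*x*) + *x* and checks that it reduces to the identity when the residual branch is zero:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, W1, W2):
    """H(x) = F(x) + x, with residual branch F(x) = W2 . relu(W1 . x)."""
    F = W2 @ relu(W1 @ x)  # the learned residual F(x) = H(x) - x
    return F + x           # skip connection adds the observed value x

x = np.array([1.0, -2.0, 3.0])
identity_out = residual_block(x, np.zeros((3, 3)), np.zeros((3, 3)))
# with a zero residual branch, H(x) = x: the skip connection alone
# carries the signal, which is what eases optimization of deep stacks
```

This is why stacking many such blocks remains trainable: each block only has to learn the deviation from the identity.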

#### *2.2. Dilated Convolution*

The basic principle of dilated convolution is to insert zeros between the elements of the convolution kernel to expand the receptive field, as shown in Figure 2. By setting a different dilation rate for each layer, convolution domains of multiple scales, and thus multi-scale features, can be obtained. Its advantage is that the receptive field is enlarged without the feature loss caused by pooling, so that each convolution output covers a wide range of the input. Figure 2a corresponds to a 1-dilated 3 × 3 convolution, which is identical to an ordinary convolution without zero insertion. Figure 2b corresponds to a 2-dilated 3 × 3 convolution: the actual kernel size is still 3 × 3 but with a gap of 1, so that for a 7 × 7 image patch only the 9 red points are convolved with the 3 × 3 kernel and the remaining points are skipped. Figure 2c is a 4-dilated convolution; stacked after the previous two layers, it reaches a receptive field of 15 × 15. In contrast, stacking three ordinary 3 × 3 convolutions with stride 1 yields a receptive field of only (kernel − 1) × layers + 1 = 7; that is, the receptive field of stacked dilated convolutions grows exponentially. The corresponding convolutional images are shown in Figure 3.

**Figure 2.** Dilated convolution kernel: (**a**) rate = 1; (**b**) rate = 2; (**c**) rate = 4.

**Figure 3.** Dilated convolution images: (**a**) original image; (**b**) rate = 1; (**c**) rate = 2; (**d**) rate = 3.
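The receptive-field arithmetic above can be checked with a few lines of Python (a sketch; the helper name is ours): each stride-1 layer with kernel size *k* and dilation rate *r* adds (*k* − 1)·*r* pixels to the receptive field.

```python
def stacked_receptive_field(kernel, dilations):
    """Receptive field of stride-1 convolutions stacked in sequence;
    a layer with dilation rate r contributes (kernel - 1) * r pixels."""
    rf = 1
    for r in dilations:
        rf += (kernel - 1) * r
    return rf

print(stacked_receptive_field(3, [1, 1, 1]))  # plain convs: (3-1)*3 + 1 = 7
print(stacked_receptive_field(3, [1, 2, 4]))  # dilated stack of Figure 2: 15
```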

#### *2.3. U-Net*

U-Net consists of a mutually symmetrical encoding subnetwork, decoding subnetwork and the skip connection. Its basic architecture is shown in Figure 4.

**Figure 4.** U-Net architecture.

The encoding subnetwork consists of four down-sampling stages and a middle layer. Each down-sampling stage includes Conv 3 × 3, BN, ReLU and MaxPool 2 × 2, where Conv 3 × 3 is a 3 × 3 convolution for feature extraction, BN is a batch normalization layer that alleviates gradient vanishing, ReLU is the activation layer that introduces nonlinearity and accelerates network convergence, and MaxPool 2 × 2 is a 2 × 2 maximum pooling layer that condenses semantic information. The decoding subnetwork takes the output of the encoding subnetwork as input and carries out three up-sampling operations, each described as upconv 2 × 2 + copy & crop + Conv 3 × 3 + BN + ReLU, where upconv 2 × 2 is a 2 × 2 up-sampling (transposed) convolution used to restore the size of the feature maps, and copy & crop, namely the skip connection, integrates the coarse features of the encoding subnetwork with the refined features of the decoding subnetwork to better retain the spatial and detail information of the original image and thereby improve segmentation accuracy.
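As a quick sanity check of the encoder geometry (a sketch assuming the 512 × 512 input size used later in Section 4; the function name is ours), each 2 × 2 max-pooling halves the spatial resolution:

```python
def encoder_feature_sizes(input_size=512, pool_steps=4):
    """Spatial size of the feature maps after each 2x2 max-pooling step."""
    sizes = [input_size]
    for _ in range(pool_steps):
        sizes.append(sizes[-1] // 2)
    return sizes

print(encoder_feature_sizes())  # [512, 256, 128, 64, 32]
```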

#### *2.4. Summarization*

The characteristics of the Residual block, dilated convolution and U-Net are summarized as follows.

Residual blocks can increase the depth of the network, help solve the problems of gradient disappearance and gradient explosion, and ensure good performance while training deeper networks.

When a network layer requires a large receptive field but limited computing resources do not allow increasing the number or size of convolution kernels, dilated convolution can be considered. Its advantage is that the receptive field can be increased without losing information through pooling, so that each convolution output covers a large range of the input. However, dilated convolution may suffer from the gridding effect: the sampled positions are discontinuous, and if 3 × 3 kernels with dilation rate = 2 are simply stacked multiple times, not all input pixels contribute to the output. The key to designing a good dilated convolution layer is handling objects of different sizes at the same time.

U-Net can provide contextual semantic information of the segmentation target in the whole image, can be trained end-to-end from few images, and outperforms the earlier sliding-window convolutional networks. It splices features together along the channel dimension to form thicker features, which provide finer cues for image segmentation; the element-wise addition used in FCN fusion does not form such thicker features.

#### **3. Lightweight Multi-Scale Dilated U-Net (LWMSDU-Net)**

#### *3.1. LWMSDU-Net Architecture*

Although many improved U-Net models have been constructed and have achieved remarkable results, they do not take into account the number of trainable parameters, the computational cost of the model, or the characteristics of the disease leaf images shown in Figure 1, and are therefore not suitable for deployment on devices with limited computing power and storage space. To improve the accuracy and efficiency of CDLIS, a lightweight multi-scale dilated U-Net (LWMSDU-Net) is constructed by combining the advantages of lightweight design, multi-scale convolution, residual connection and dilated convolution in U-Net. Its architecture is shown in Figure 5a.

**Figure 5.** The architecture of LWMSDU-Net: (**a**) LWMSDU-Net structure; (**b**) dilated structure, where the size of the feature map is *a* × *a*, and M is the channel number; (**c**) Respath structure.

Multi-scale dilated convolution is employed instead of the plain convolution of U-Net in the convolutional layers. According to the characteristics of diseased leaf images, the dilation rate *r* is set to 1, 2 and 3, respectively, so that irregular disease regions of different scales can be effectively segmented and the overall segmentation performance improved. The structure of the multi-scale dilated convolution is shown in Figure 5b. Let *fh* and *fw* denote the height and width of the original convolution kernel; the effective kernel height is then *fh* + (*fh* − 1)(*r* − 1), and the width is *fw* + (*fw* − 1)(*r* − 1).
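The effective kernel size formula can be written directly as code (a minimal sketch; the function name is ours):

```python
def effective_kernel(f, r):
    """Effective size of an f x f kernel with dilation rate r:
    f + (f - 1) * (r - 1)."""
    return f + (f - 1) * (r - 1)

# the three dilation rates used in LWMSDU-Net
print([effective_kernel(3, r) for r in (1, 2, 3)])  # [3, 5, 7]
```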

In U-Net, the skip connection is used to connect the encoder and decoder. It is simple to implement, but there is often a large semantic gap between encoder and decoder features because of the complexity of disease leaf images. To improve segmentation results and relieve this semantic gap, a residual path (Respath) is constructed in place of the skip connection, so that the encoder features undergo some additional convolution operations before being concatenated with the corresponding decoder features. The Respath structure, consisting of four residual convolution blocks, is shown in Figure 5c [22].
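A minimal NumPy sketch of the residual path, with a per-pixel linear map standing in for each 3 × 3 convolution (the simplified block form and all names are our assumptions, not the paper's exact layers):

```python
import numpy as np

def residual_conv_block(feat, W):
    """One Respath block: a learned transform plus an identity shortcut."""
    return np.maximum(feat @ W, 0.0) + feat  # relu(conv-like map) + skip

def respath(feat, weights):
    """Chain of residual blocks applied to encoder features before they
    are concatenated with the decoder, narrowing the semantic gap."""
    for W in weights:
        feat = residual_conv_block(feat, W)
    return feat

feat = np.random.default_rng(0).normal(size=(8, 8, 4))  # toy feature map
out = respath(feat, [np.zeros((4, 4))] * 4)  # four blocks, as in Figure 5c
# zero weights make every block an identity, so out equals feat
```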

#### *3.2. Process of CDLIS*

The LWMSDU-Net-based CDLIS method includes a training stage and a test stage. The initial parameters of LWMSDU-Net are set by transfer learning, the training dataset is then used to optimize its parameters iteratively, and the test set is used to verify the segmentation performance. Model training is the most crucial step of the experiment, because an appropriately trained model improves accuracy; the experimental protocol and hyper-parameter configuration of this paper are standardized to ensure validity. In the pre-training stage, to enhance feature extraction ability and training speed, the PlantVillage dataset is used as the input of LWMSDU-Net and the pre-trained parameters are retained. The pre-trained network is then trained on the constructed augmented dataset of maize and cucumber diseases. Pre-training accelerates model training, effectively enhances the fitting ability of the network, and improves the accuracy of CDLIS on the limited dataset.

The training stage includes the following steps:

Step 1: Convert the disease leaf images from the RGB color space to the L\*a\*b\* color space, and use the simple linear iterative clustering (SLIC) method to preprocess the transformed disease leaf images;

Step 2: Disease leaf images are converted to TensorFlow2 format, divided into different batches and then input into LWMSDU-Net for feature extraction (https://github.com/tzutalin/labeling/releases, accessed on 7 October 2022);

Step 3: Use transfer learning to reduce the number of training iterations and speed up training the network;

Step 4: Fuse the extracted features from LWMSDU-Net, and input the fused features into the classifier for training the classifier;

Step 5: If the error between the authentic labeled training images and the predictive labeled training images is more than the given threshold, go back to Step 2 and further train LWMSDU-Net. Otherwise, the training stage is stopped.

The test stage includes the following steps:

Step 1: Normalize the scale of the test images;

Step 2: Put the normalized images into the trained LWMSDU-Net and extract features;

Step 3: Fuse the extracted features and then put them into the SoftMax classifier;

Step 4: Output the recognition result of the input image.

#### **4. Experiments and Analysis**

In this section, extensive CDLIS experiments are conducted to validate the proposed method, and comparative results are analyzed and discussed. All experiments are carried out on a Windows 7 64-bit operating system with an Intel Xeon E5-2643 v3 @ 3.40 GHz CPU, 64 GB RAM and an NVidia Quadro M4000 GPU with 8 GB of video memory, using CUDA Toolkit 9.0, cuDNN v7.0, Python 3.5.2 and TensorFlow-GPU 1.8.0 with the Keras open-source deep learning framework. In LWMSDU-Net, the initial weight parameters are set randomly, the number of iterations is set to 500, the initial learning rate is 0.001 and is gradually reduced by a factor of 0.1 during training, the momentum is set to 0.99 to reduce overfitting, the weight decay is set to 0.005, and the training images are divided into 10 batches and fed to the network in turn. To improve the segmentation effect, LWMSDU-Net is trained for 1200 rounds with 3000 iterations per round, and the widely used stochastic gradient descent (SGD) is adopted as the training mechanism. Since the last layer of the network is a Softmax classifier, Softmax loss, which is numerically stable, is used as the loss function. Other parameters are set to the defaults of the U-Net framework. The trained model is evaluated on the verification images. All RGB disease leaf images are preprocessed by median filtering and then standardized by cropping to reduce computation and training time; each image is normalized and cropped to 512 × 512 pixels.

#### *4.1. Dataset*

PlantVillage (https://tensorflow.google.cn/datasets/catalog/plant\_village, accessed on 7 October 2022) is an open-source dataset for diagnosing and recognizing crop diseases, collected at experimental research stations associated with Land Grant Universities in the USA (Penn State, Florida State, Cornell and others). It consists of 54,303 healthy and unhealthy leaf images covering 26 diseases of 14 crops, taken in farmland environments. In this paper, it is utilized for pre-training to make up for the shortage of training samples; the pre-trained model is then trained and tested on the real crop dataset.

In this paper, leaf images of five maize and cucumber diseases were taken with digital cameras, smartphones and other devices in the Yangling Agricultural Demonstration Field, Shaanxi Province: corn leaf images of two diseases (blight and brown spot) and cucumber leaf images of three diseases (target spot, brown spot and anthracnose), with 20 leaf images per disease. Since disease leaf images vary with crop growth environment, background, sunshine and photographic equipment, to reflect real scenes and improve the generalization ability of the model, images were taken in the morning, at noon and in the afternoon, on both sunny and cloudy days, from April to June 2021. Samples of the five disease leaf images are shown in Figure 6.

**Figure 6.** Five typical disease leaf images: (**a**) Original images for leaf blight of maize; (**b**) original images for brown blotch of maize; (**c**) original images for target spot of cucumber; (**d**) original images for brown spot of cucumber; (**e**) original images for anthrax disease of cucumber; (**f**) 10 augmented images of a maize disease leaf image; (**g**) equalized images of the above images in the above (**a**–**e**).

The number of collected disease leaf images is limited, which easily leads to overfitting. Augmentation operations, such as random lighting enhancement, random cropping, rotation, shifting, adding random noise and mirroring, are often used to enlarge the training set; they increase the diversity of the training samples and help avoid overfitting. In the following experiments, each image is augmented into 10 images, as shown in Figure 6f, yielding an augmented dataset of 1100 images: 100 original and 1000 augmented images. The details of the original dataset and its augmented dataset are shown in Table 1.
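A simplified stand-in for this augmentation step (NumPy only; the exact transforms and parameters used in the paper may differ) that expands one image into 10 variants:

```python
import numpy as np

def augment_ten(image, rng):
    """Produce 10 augmented copies: mirrors, rotations, shifts,
    brightness changes and additive noise (pixel values assumed in [0, 1])."""
    return [
        np.flip(image, axis=0),                        # vertical mirror
        np.flip(image, axis=1),                        # horizontal mirror
        np.rot90(image, 1),                            # 90-degree rotation
        np.rot90(image, 2),                            # 180-degree rotation
        np.rot90(image, 3),                            # 270-degree rotation
        np.roll(image, 16, axis=0),                    # vertical shift
        np.roll(image, 16, axis=1),                    # horizontal shift
        np.clip(image * 1.2, 0.0, 1.0),                # brighter
        np.clip(image * 0.8, 0.0, 1.0),                # darker
        np.clip(image + rng.normal(0.0, 0.02, image.shape), 0.0, 1.0),
    ]

rng = np.random.default_rng(42)
img = rng.random((128, 128, 3))
augmented = augment_ten(img, rng)  # 10 images with the original shape
```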


**Table 1.** The details of the original dataset and its augmented dataset.

To reduce environmental noise and computational complexity, smooth the image, remove salt and pepper noise and retain image edge information, the median filtering algorithm is carried out on the crop disease leaf image, as follows:

$$y(i) = \text{med}[x(i-N), \dots, x(i+N)] \tag{1}$$

where *x*(*i*) is the value of the pixel at the center of the sliding window, med denotes the median taken over the 2*N* + 1 samples in the window, and *y*(*i*) is the median filtering output value.
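Equation (1) in code, as a one-dimensional sketch with edge replication (window half-width *N*; a 2-D filter applies the same idea over a square neighborhood):

```python
import numpy as np

def median_filter_1d(x, N):
    """y(i) = med[x(i-N), ..., x(i+N)] over a window of 2N + 1 samples."""
    padded = np.pad(x, N, mode="edge")  # replicate edge values
    return np.array([np.median(padded[i:i + 2 * N + 1])
                     for i in range(len(x))])

signal = np.array([1.0, 1.0, 9.0, 1.0, 1.0])  # isolated salt-noise impulse
print(median_filter_1d(signal, N=1))           # impulse removed: all ones
```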

From Figure 6g, it is observed that median filtering can enhance the contrast of the disease leaf images, and the filtered images characterize the disease features more clearly, so the accuracy of CDLIS can be improved after median filtering.

The effective disease leaf image blocks are cropped from the collected images to reduce the influence of complex backgrounds on CDLIS, and the leaf images are uniformly resized to 512 × 512 resolution. Secondly, Labelme is used to annotate the crop disease leaf images from the demonstration base. Each image contains two labels: 1 represents crop leaf disease spot regions, and 0 represents the background. Annotation data are stored in JSON format, and the labelme\_json\_to\_dataset command is used to convert the labels into binarized PNG maps. The color-annotated image is obtained by multiplying the original image with the binarized one. The cropping and annotation process is shown in Figure 7.

**Figure 7.** The cropping and annotating process.

In order to reduce the influence of geometric transformation and accelerate the gradient descent search for the optimal solution, each image is normalized by linearly mapping its pixel values to (0, 1),

$$y = \frac{x - \text{MinValue}}{\text{MaxValue} - \text{MinValue}} \tag{2}$$

where *x* and *y* are the values before and after conversion, respectively, and MaxValue and MinValue are the maximum and minimum values of the sample, respectively.
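Equation (2) as a NumPy one-liner (a sketch; assumes MaxValue > MinValue):

```python
import numpy as np

def minmax_normalize(x):
    """Map pixel values linearly to [0, 1] per Equation (2)."""
    return (x - x.min()) / (x.max() - x.min())

pixels = np.array([0.0, 64.0, 128.0, 255.0])
scaled = minmax_normalize(pixels)  # ranges from 0.0 to 1.0
```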

Several schemes exist for forming statistical tests [24]. In this paper, five-fold cross-validation is employed to validate LWMSDU-Net: all 1100 leaf images are randomly divided into five subsets with the same number of images, each subset is used once as the test set, and the remaining images are used for training the model. A total of five tests are thus conducted, and the average segmentation result over the five runs is reported as the final result.
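The five-fold scheme can be sketched as follows (NumPy; names are ours): the 1100 image indices are shuffled once and split into five folds of 220 images, each serving once as the test set:

```python
import numpy as np

def five_fold_splits(n_images=1100, k=5, seed=0):
    """Return k (train, test) index pairs; each fold is the test set once."""
    idx = np.random.default_rng(seed).permutation(n_images)
    folds = np.array_split(idx, k)
    return [(np.concatenate([folds[j] for j in range(k) if j != i]), folds[i])
            for i in range(k)]

splits = five_fold_splits()
# 5 splits, each with 880 training images and 220 test images
```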

#### *4.2. Results*

The average precision, average recall and average *F*1-score over the five-fold cross-validation experiments are adopted to evaluate network performance, calculated as follows:

$$Recall = \frac{B_{seg}}{B_{seg} + I_{unseg}} \tag{3}$$

$$Precision = \frac{B_{seg}}{B_{seg} + I_{wseg}} \tag{4}$$

$$F\_1\text{-score} = 2 \times \frac{precision \times recall}{precision + recall} \tag{5}$$

where *Bseg* is the pixel number correctly segmented into spot pixels, *Iunseg* is the pixel number not segmented into spot pixels but being spot pixels in the image, and *Iwseg* is the pixel number that segments the background pixels into spot pixels.

Pixel accuracy *AccPixel* is also often used to evaluate the performance of the model. It is the proportion of pixels whose predicted category matches the real category, calculated as follows:

$$Acc_{Pixel} = \frac{1}{m} \sum_{i=1}^{m} f_i, \quad f_i = \begin{cases} 1, & \left| y_i - y_i' \right| < T \\ 0, & \left| y_i - y_i' \right| \ge T \end{cases} \tag{6}$$

where *yi* is the *i*th real pixel category, *y′i* is the *i*th predicted category, and *T* is a threshold.

In fact, the final output of an image segmentation model is a grayscale map whose pixel values vary from 0 to 1, so *T* is often set to 0.5.
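Equations (3)–(6) can be computed directly from binary masks (a sketch with our variable names; 1 marks spot pixels, 0 background, and the prediction is assumed already thresholded at *T* = 0.5):

```python
import numpy as np

def segmentation_metrics(pred, truth):
    """Precision, recall, F1 and pixel accuracy for binary spot masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    b_seg = np.sum(pred & truth)      # spot pixels correctly segmented
    i_unseg = np.sum(~pred & truth)   # spot pixels that were missed
    i_wseg = np.sum(pred & ~truth)    # background segmented as spot
    recall = b_seg / (b_seg + i_unseg)
    precision = b_seg / (b_seg + i_wseg)
    f1 = 2 * precision * recall / (precision + recall)
    pixel_acc = np.mean(pred == truth)  # Equation (6), threshold applied
    return precision, recall, f1, pixel_acc

truth = np.array([[1, 1, 0, 0]])
pred = np.array([[1, 0, 1, 0]])
print(segmentation_metrics(pred, truth))
# b_seg = 1, i_unseg = 1, i_wseg = 1, so all four metrics equal 0.5 here
```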

LWMSDU-Net is trained on the augmented dataset, and the training accuracy and loss are recorded after each iteration, as shown in Figure 8. As the number of training iterations increases, the accuracy of the model keeps rising while the loss keeps decreasing. When the number of iterations reaches 2500, the accuracy stabilizes at 0.91 with fluctuations within 1 percentage point, and the loss stabilizes at 0.043 with fluctuations within 0.01. The model thus has high accuracy and good robustness, indicating that LWMSDU-Net is effective and feasible for CDLIS.

The model pre-trained on the PlantVillage dataset is then trained on the constructed dataset. To test the training performance of LWMSDU-Net, it is compared with U-Net, LWU-Net [17], AU-Net [18] and MSU-Net [21] on the augmented dataset. Each of the three improved baselines has its own advantage: LWU-Net is a lightweight U-Net, AU-Net exploits attention, and MSU-Net is a multi-scale U-Net. Figure 9 shows their segmentation accuracies versus the number of iterations during convergence, with all models pre-trained on the PlantVillage dataset. From Figure 9, it is observed that the loss values of all five network models drop rapidly before the 1000th iteration and are nearly stable after the 1500th iteration. It is also found that LWMSDU-Net outperforms the other four models and achieves the best convergence after the 2700th iteration, probably because dilated convolution and Respath speed up its training and improve its segmentation performance. Comparing Figures 9 and 10, the performance of LWMSDU-Net after pre-training is clearly strong.

**Figure 8.** Accuracy and loss value versus iteration.

**Figure 9.** Segmentation accuracy versus the number of iterations of five networks.

For fairness, all five trained models are taken after the 3000th iteration. Typical segmented disease leaf images from the five models are shown in Figure 10.

From Figures 9 and 10, it is observed that all four modified U-Net models are much better than the traditional U-Net. In five-fold cross verification experiments, the trained U-Net, LWU-Net, AU-Net, MSU-Net and LWMSDU-Net are used to segment the disease leaf images of the augmented dataset, and their segmentation results are shown in Table 2.


**Table 2.** Segmentation results of U-Net, LWU-Net, AU-Net, MSU-Net and LWMSDU-Net.

**Figure 10.** Typical segmented disease leaf images by 5 models.

#### *4.3. Ablation Experiments and Results*

The proposed LWMSDU-Net is based on U-Net and makes use of the Respath connection, dilated convolution and a multi-scale Inception module. To verify the effectiveness of their combination, ablation experiments are carried out by combining different convolution structures and connection structures. The results are shown in Table 3, where U-Net employs 3 × 3 convolution and skip connections, Res-U-Net combines U-Net with residual blocks for image segmentation [25], and Inception U-Net consists of a normalization layer, convolution layers and Inception layers (concatenated 1 × 1, 3 × 3 and 5 × 5 convolutions) [26].

**Table 3.** Segmentation results by different combinations of convolution and connection.


From Table 3, it is found that the proposed LWMSDU-Net achieves significantly better results than the original U-Net, Inception U-Net and the other combined architectures, which validates the effectiveness of the dilated Inception module, the Respath connection and their combination.

#### **5. Analysis and Discussion**

From Figures 9 and 10 and Tables 2 and 3, it is observed that LWMSDU-Net and the other modified U-Net networks can recover detailed spot regions even when the spots are small and poorly contrasted with the healthy leaf areas and background, and LWMSDU-Net in particular is superior to the other models in accuracy and computational complexity. LWU-Net and LWMSDU-Net have shorter training times because they are lightweight with fewer trainable parameters, and LWMSDU-Net has the shortest training time because it utilizes dilated convolution and the Respath connection. U-Net splices features together along the channel dimension to form richer segmentation features, and it can segment lesion areas completely, including small ones; however, it cannot effectively separate adhering lesions, which leads to more missed lesion pixels, and it produces some false positive areas where the lesion cannot be distinguished from the background. The modified models all improve on U-Net for CDLIS. MSU-Net is slightly better than LWU-Net and AU-Net thanks to its multi-scale convolution, and AU-Net is slightly superior to LWU-Net because of its attention mechanism. LWMSDU-Net can accurately segment the disease leaf images, including the lesion area and its edge details, because it replaces the skip connection of U-Net with Respath and plain convolution with dilated convolution. This indicates that Respath and dilated convolution can improve the performance of CDLIS.

Compared with other networks, the experimental results demonstrate that LWMSDU-Net achieves a significant segmentation effect. However, it has been validated only on a single augmented dataset, and the hyper-parameters of the training network need to be adjusted for the dataset being processed, so there is no guarantee that the learned weight parameters transfer directly to other datasets.

In terms of memory footprint, VGG16 occupies the largest memory at 552.0 MB and AlexNet occupies 227.6 MB, because the fully connected layers account for most of the parameters in these models. In a deep CNN, down-sampling (pooling or stride-2 convolution) is usually performed to increase receptive fields and reduce computation; this enlarges the receptive fields but lowers the spatial resolution. To expand the receptive field without losing resolution, dilated convolution can be used: by inserting zeros, a 3 × 3 kernel attains a 5 × 5 (dilation rate = 2) or larger receptive field with the same number of parameters and the same amount of computation, so no down-sampling is required. Dilated convolution introduces only one extra hyper-parameter to the convolution layer, the dilation rate, which defines the spacing between sampled values when the kernel processes data; an ordinary convolution corresponds to a dilation rate of 1. Compared with standard convolutions of the same receptive field, the number of parameters is thus greatly reduced. On this basis, dilated convolution is added to U-Net, which effectively reduces the number of model parameters: U-Net has 7.76 M parameters, while the proposed trained model has 5.8 M.

#### **6. Conclusions**

Aiming at the problem of crop disease leaf image segmentation (CDLIS), the traditional U-Net model is improved by making use of dilated convolution and Respath. Multi-scale dilated convolution is used instead of traditional convolution to increase the receptive field and improve the feature learning ability of U-Net. Respath is utilized instead of the skip connection between the encoder and decoder to concatenate the lesion information of the disease leaf image. PlantVillage is employed for pre-training to make up for the shortage of training samples, overcome the overfitting problem and improve the network performance. The proposed CDLIS method based on LWMSDU-Net can be applied to the actual agricultural environment, help farmers quickly and accurately detect crop diseases, and provide effective technical means for scientific disease control. For future work, it is necessary to further verify and optimize the model and construct a more lightweight version for deployment on personal computers and smartphones.

**Author Contributions:** Conceptualization, C.X. and S.Z.; methodology, C.X.; software, C.X. and C.Y.; formal analysis, C.X. and C.Y.; writing—original draft preparation, C.X.; writing—review and editing, S.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work is supported by the National Natural Science Foundation of China (Nos. 62172338 and 62072378).

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Human Perception Intelligent Analysis Based on EEG Signals**

**Bingrui Geng 1,\*, Ke Liu <sup>2</sup> and Yiping Duan <sup>3</sup>**

**\*** Correspondence: gengbr@cuc.edu.cn

**Abstract:** The research on brain cognition provides theoretical support for intelligence and cognition in computational intelligence, and it is further applied in various fields of scientific and technological innovation, production and life. Use of the 5G network and intelligent terminals has also brought diversified experiences to users. This paper studies human perception and cognition in the quality of experience (QoE) through audio noise. It proposes a novel method to study the relationship between human perception and audio noise intensity using electroencephalogram (EEG) signals. This kind of physiological signal can be used to analyze the user's cognitive process through transformation and feature calculation, so as to overcome the deficiency of traditional subjective evaluation. Experimental and analytical results show that the EEG signals in frequency domain can be used for feature learning and calculation to measure changes in user-perceived audio noise intensity. In the experiment, the user's noise tolerance limit for different audio scenarios varies greatly. The noise power spectral density of soothing audio is 0.001–0.005, and the noise spectral density of urgent audio is 0.03. The intensity of information flow in the corresponding brain regions increases by more than 10%. The proposed method explores the possibility of using EEG signals and computational intelligence to measure audio perception quality. In addition, the analysis of the intensity of information flow in different brain regions invoked by different tasks can also be used to study the theoretical basis of computational intelligence.

**Keywords:** computational intelligence; quality of experience; human perception; electroencephalogram

#### **1. Introduction**

With the continuous development of computer technology, how to deal with and analyze the potentially insightful information in big data has become an extremely urgent problem that must be overcome. The emergence of computational intelligence and artificial intelligence technology has become an effective way to solve the above problems in various scientific fields. Many outstanding works have further promoted the application of computational intelligence. In the field of image analysis, machine learning (ML) and deep neural networks are used for feature extraction and image segmentation [1,2].

In the field of multimedia communication, with the development of multimedia and communication technology, new services and applications emerge in an endless stream. There are more and more ways for people to obtain information through various terminals, and the audio–visual forms are becoming increasingly abundant; traditional audio and video as well as emerging virtual reality, augmented reality and other forms are becoming more and more convenient. Ubiquitous multimedia and converged media services are changing people's lives, which also leads to great changes in business content and data volume. Whether a product can provide users with satisfactory services has become a decisive factor for success in the rapidly changing market environment, which is crucial for communication service providers and business service providers. Under the new market demand, communication is changing from data communication to multimedia communication. User satisfaction is also affected by a variety of factors, and the mechanism of action is much more complex [3,4]. At this time, ML is often used for resource allocation, quality management and quality prediction [5].

**Citation:** Geng, B.; Liu, K.; Duan, Y. Human Perception Intelligent Analysis Based on EEG Signals. *Electronics* **2022**, *11*, 3774. https://doi.org/10.3390/electronics11223774

Academic Editor: Akshya Swain

Received: 7 October 2022; Accepted: 15 November 2022; Published: 17 November 2022

Traditionally, the most widely recognized method is a technology-parameter-centric quality metric named quality of service (QoS) [6], which mainly considers objective technical parameters such as jitter, packet loss and delay, and it has been widely used in technology and industry. However, research has found that QoS, as the key performance measure of traditional networks, captures only objective quality [6] and does not consider the actual experience of users. Therefore, a good QoS may not satisfy users, which creates a bottleneck for improving user satisfaction [7].

The international standardization organization ITU-T [8] defined QoE as "the overall acceptability of an application or service, as perceived subjectively by the end-user" [9]. According to this definition, the factors influencing QoE are more diverse, including not only audio quality, video quality and network quality, but also service content, multimedia devices and users' personal feelings [3]. For service providers and network operators, the shift from the traditional quality evaluation method focusing on QoS service performance to QoE evaluation aimed at users' perception and demand better reflects the original intention of providing users with better-quality services. Therefore, QoE research has become an interdisciplinary field involving knowledge from social psychology, cognitive science, intelligent computing and engineering science [10].

At present, the evaluation methods of QoE are mainly divided into two categories: objective parameter-based evaluation and intelligent cognitive-based subjective evaluation [4,7,11], as shown in Figure 1. The objective parameter-based evaluation method first measures or calculates the objective parameters, or establishes a mathematical estimation model from objective parameters to subjective experience, which is based on the statistical knowledge derived from a large number of data, then the estimation model is further used to transform the objective parameters into the estimated value of experience quality [11]. Both the advantages and disadvantages of this kind of method are very prominent. One advantage is that if a suitable mathematical model has been embedded in the QoE evaluation system, the evaluation of QoE will be efficient. Therefore, it is still the best choice for the actual multimedia business scenario [12]. The disadvantage is that it is impossible to truly experience the multi-level satisfaction of users without their participation. Intelligent cognitive-based subjective evaluation refers to evaluation that requires users' participation. Either the specific indicators or the information of experience quality needs to be obtained directly from users. It can be reported by users straight away or be measured by users' relevant physiological variables. These physiological data need to further adopt feature extraction and learning to calculate and analyze the real feelings of the user [7,10,13]. Based on the correlation between perceptual processes and neurophysiology, using advanced calculation and analysis of user neurophysiological indicators to quantify users' subjective experience is an important way to overcome the bias caused by users' upper cognitive behavior in the process of subjective feedback [14]. 
In addition, due to the amount of data and analytical requirements, computational intelligence techniques also provide more feasible methods for subjective QoE prediction and quality analysis [15,16].

In multimedia communication, the sound is the sensory channel with the highest priority, which is the basis of audiovisual perception. Nonetheless, to our knowledge, the influence of auditory perception on QoE is much less studied than that of visual perception on QoE. This paper proposed a new method to explore the possibility of measuring the user's auditory subjective feelings by collecting the physiological sensory signals from the user's central nervous system. The main contributions of our work are summarized as follows. First, a complete experiment was designed to collect perceptual data of users under different audio quality conditions, including EEG data, subjective judgment data and perceptual semantic data. Second, a new method of studying the relationship between human perception ability and audio noise intensity using EEG signals was proposed, and the perceptual tolerance of audio noise in different semantic scenarios was obtained. In addition, the relationship between audio signal to noise ratio (SNR), audio scenarios, user emotions, and noise perceptual tolerance was explored. Finally, the location of the brain area for audio processing was explored, and the connectivity of related brain regions was quantitatively analyzed.

**Figure 1.** The evaluation methods of QoE.

The rest of this paper is organized as follows. Section 2 reviews related work for QoE evaluation. Section 3 briefly describes the experiment design and data recording. In Section 4, we describe the signal processing and analysis methods in detail, and Section 5 expands on the experimental results and discussion. In Section 6, we conclude the current work and give the direction for future work.

#### **2. Related Work**

Since the concept of QoE was proposed, there has been a lot of excellent work published continuously on QoE prediction and evaluation. In the paper [17], the authors used subjective mean opinion score (MOS) data and evolutionary algorithms to optimize QoE on a global scale. In the paper [18], deep learning (DL) was used to extract generalized features and representation learning from text data, video and audio data and classification parameters, and finally achieved QoE prediction through the classifier. The data in the above works came from communication networks and multimedia devices. Psychological and physiological data were retrieved directly from the user. The psychology aspect mainly involves the user questionnaire, the ratings, and so on. The physiology aspect mainly involves the collection and processing of users' physiological signals. Currently, physiological measures used to assess the quality of multimedia experience fall into three categories: central nervous system measurement, peripheral autonomic nervous system measurement, and eye measurements [19]. Human primary perception and thinking activities belong to the central nervous system function. The neural connections between attention, decision making, and memory in animals and humans have been described in a wide range of experimental studies [20]. Because the physiological indicators measured by the central nervous system can directly reflect human perception and other thinking activities, this method is more conducive to the calculation and analysis of users' perception and cognitive process of multimedia stimulation [14]. The most common devices available are electroencephalography (EEG) [19], near-infrared spectroscopy (NIRS) [21], functional magnetic resonance imaging (fMRI) [22] and magnetoencephalography (MEG) [23]. The activity of the peripheral autonomic nervous system is not controlled by the upper cognition of the brain. 
The peripheral autonomic nervous system regulates physiological functions such as respiration, heart rate and skin conductance, so electrocardiography (ECG) [24] and electrodermal activity (EDA) [25] can be used to measure the fatigue degree and emotional changes of users. There is also an eye measurements method that evaluates QoE by measuring eye gaze tracking, blinking, or pupillometry [26].

EEG is one of the basic research methods of brain science. Human mental and physical activities depend on bioelectricity, and the brain produces and transmits different but regular electrical signals all the time. Therefore, the physiological signals of brain activity can overcome the influence of user fatigue, preference, educational background and external environment when analyzing the user's real feelings [27]. When neurons in the brain fire, the resulting electrical activity passes through the dura and skull, creating weak potential fluctuations on the scalp. This allows non-invasive EEG measurement to infer the firing of intracranial neurons, which can be observed and collected by attaching special electrodes to the surface of the scalp [27]. The locations of these electrodes are usually specified by the standard 10–20 system, and an appropriate reference electrode is selected. A standard system facilitates the spatial localization and signal tracking of electrodes in EEG signal analysis.

Induced event-related potentials (ERPs) [28], time-frequency domain analysis [29] and spatial brain connectivity [30] are important methods for EEG experiments and signal processing. An ERP is a special brain potential evoked by sensory stimulation and cognitive processes in the brain. The relative strength of the component is significantly improved during the superposition averaging process. After a sensory stimulation event, the waveforms of specific channel signals show distinct successive fluctuations, and these peaks and troughs represent different patterns of ERPs. The middle-latency response generally refers to the potentials induced at 50–200 ms, mainly including N100, P100, N200 and P200. In the paper [31], the authors pointed out that N100 is widely present in a variety of cognitive processing functions, including auditory, visual, behavioral and cognitive tasks; it reflects early simple sensory processing and can be used as a biomarker of neuroplasticity. P300 is the neural activity triggered by a task-related target stimulus, which is an important aspect of ERP research. It is a widely existing component that can be recorded and observed on the scalp, with a large amplitude and a wide span [32]. The P3a subcomponent reflects the top-down frontal attentional mechanism during task processing. Another subcomponent, P3b, reflects top-down temporoparietal activity related to memory mechanisms [33]. N400 can be used as a neurophysiological index of semantic priming: the absolute value of the N400 amplitude is smaller when a word is a good match with the previous word/context, and larger when the two do not match [34].
Time-frequency decompositions of non-stationary time signals, such as the continuous wavelet transform (CWT) [35], discrete wavelet transform (DWT) [29] and empirical mode decomposition (EMD) [36], are effective EEG signal analysis methods, which can accurately capture and locate transient features in the time and frequency domains to better understand the dynamic characteristics of the human brain. Assessing information exchange between brain regions is also a common method for analyzing EEG signals. This method can be combined with graph theory to analyze and quantify the structure, function and causality of the brain. The directed transfer function (DTF) in the autoregressive model framework was proposed and used to determine the direction and frequency content of brain activity, and the validity of the DTF algorithm was verified with real neurobiological data [37,38]. In the paper [39], the authors validated a connection-based EEG feature detection method using ML on tone-mapped high dynamic range videos and confirmed that DTF outperformed undirected functions.

It is clear from a large amount of research that visual stimuli have been studied far more than auditory stimuli. In the paper [40], the authors pointed out that there were not as many physiological studies on hearing as vision, so early auditory perception activation could be explored by means of physiological measurement and computational intelligence. In our previous article, we carried out some preliminary research, including recruiting volunteers, collecting EEG signal samples, selecting appropriate threshold of DTF to construct edge sets and using weighted degree for clustering [41]. The work of this paper was based on the previous work, so part of the previous experimental results are presented in Section 5.3.

#### **3. Design of Experiments**

#### *3.1. Procedure*

The experiments were performed in the Wireless Multimedia Communication Lab (WMC) at Tsinghua University. The subjects were required to complete all the experimental tasks in a professional EEG shielding room, as shown in Figure 2. This shielding room strictly controls external noise, indoor temperature, light and electromagnetic interference. Mobile phones and other devices were banned during the experimental phase. Before the experiment, every participant was asked to read and sign an informed consent form, and the researchers explained the experimental procedure and operation to the subjects in detail; the subjects did not know the specific principles and methods of the experiment. During the experiment, the subjects completed their tasks alone in the shielding room. Researchers could watch the room through a monitor in the control room and observe the subjects' brain waves on a computer screen in real time. In special cases, researchers could communicate with the subject through the internal microphone and sound system as necessary.

We recruited 12 students and young teachers as volunteers, consisting of 6 females and 6 males, aged between 18 and 28. None of them had major illnesses; they all had normal hearing and had never had any neurological problems. Participants were tested in a soundproof, standardized EEG lab and asked to minimize blinking, body movements and swallowing during the experiment. The data of two subjects were discarded due to too many behavioral interference artifacts. We finally admitted EEG data from a total of 10 subjects [41].

**Figure 2.** EEG experiment environment.

#### *3.2. Stimuli and Experimental Procedure*

In the experiment, four kinds of specially processed audio materials with very different semantic content were played through the headset, and each audio clip was played for 15 s. The four semantic contents were classical piano music, ocean waves, fire alarms and mosquitoes, all with periodic rhythms. Six levels of white Gaussian noise were added to each audio clip. The six Gaussian noise levels were defined according to the power spectral density of the noise, which was 0, 0.001, 0.005, 0.01, 0.03 and 0.1. Depending on the level, the noise was added to the audio starting between the second and the sixth second and lasted for 5 s: the noise of level 1 started from the second second, the noise of level 2 started from the third second, and so on. In the end, 24 different audio clips were obtained.
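The stimulus construction described above can be sketched as follows. This is a hypothetical reconstruction: the sampling rate, the use of NumPy and the treatment of the reported power-spectral-density level as a per-sample noise variance are our assumptions, not details given in the paper.

```python
import numpy as np

def add_segment_noise(audio, fs, psd_level, t_start, duration=5.0, rng=None):
    """Add white Gaussian noise to `audio` from t_start for `duration` s.

    Assumption: the reported 'power spectral density' level is treated
    here as the per-sample noise variance; the paper's exact scaling
    may differ.
    """
    rng = np.random.default_rng(rng)
    noisy = audio.astype(float)
    i0 = int(t_start * fs)
    i1 = min(len(audio), i0 + int(duration * fs))
    noisy[i0:i1] += rng.normal(0.0, np.sqrt(psd_level), i1 - i0)
    return noisy

fs = 8000                        # assumed sampling rate
clip = np.zeros(15 * fs)         # one 15 s clip, as in the experiment
out = add_segment_noise(clip, fs, psd_level=0.03, t_start=2.0, rng=0)
```

Applying this with each of the five nonzero levels and start times to the four base clips would yield the 24 stimuli (4 clean clips plus 20 noisy variants).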

In each section of the experiment, the audio clips (24 in total) were randomly played twice, so over the whole experiment each clip was played six times. After each audio clip was played, the subjects were asked whether they could tolerate the noise in the audio; a response of Y meant yes, and N meant no. At the end of each section, subjects rested for 3 min. At the end of the experiment, the subjects were asked to complete a subjective audio semantic questionnaire. We used the semantic difference method to make the subjects perform multiple perceptual evaluations of the four different kinds of audio. The subjects were asked to evaluate three contrasting pairs of attributes: pleasant–unpleasant, relaxed–tense, and calm–upset. Matlab was used for audio material synthesis and signal processing, and Presentation, a program used for stimulation presentation and experimental control in physiological experiments, was used to present the stimulus materials. The whole experimental procedure is shown in Figure 3.

**Figure 3.** The experimental procedure consisted of three sections and two rests. In each session, 48 stimuli clips were played randomly.


#### **4. Signal Processing**

#### *4.1. Directed Transfer Function*

In brain network research, directional functional brain connections can also be called causal brain connections. The information between the connected nodes is statistically causal. Methods for constructing causal connections mainly include directional transfer function (DTF) and partial directed coherence (PDC), and network connection thresholds need to be further selected for quantification. In this paper, we used the DTF method to construct the brain network and carried out degree feature extraction.

DTF is an autoregressive (AR) model [37], which can be described as

$$\sum\_{d=0}^{D} A\_d x\_{t-d} = e\_t$$

where $D$ is the model order determined by the Akaike information criterion, $A_d$ is the delay matrix of the AR model (for $d = 0$ it is the identity matrix), $x_t = (x_{1,t}, x_{2,t}, \ldots, x_{k,t})$ is the EEG data time series and $e_t = (e_{1,t}, \ldots, e_{k,t})$ is a vector of uncorrelated zero-mean Gaussian white noise processes. If $x_{k,t}$ is a stationary stochastic process, $A_d$ can be obtained from the Yule–Walker equations. The Z transformation then gives the following result.

$$X(f) = H(f)E(f)$$

where $H(f)$ is the transfer function, and $X(f)$ and $E(f)$ represent the transformed EEG data and noise data at frequency $f$. The DTF value, denoted by $DTF_{i,j}(f)$, is obtained by square-sum normalization of $H$ and indicates the intensity of information flow from the $j$-th to the $i$-th electrode.
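A minimal sketch of the DTF computation from fitted AR coefficients might look like the following. The function name is ours; we follow the common Kaminski–Blinowska definition and normalize the squared entries of $H$ row-wise (conventions differ, and the paper's "column square sum normalization" may correspond to the transposed convention).

```python
import numpy as np

def dtf(A_list, f, fs=1.0):
    """Directed transfer function from fitted AR coefficients.

    A_list = [A_0, A_1, ..., A_D] with A_0 the identity matrix.
    DTF[i, j] measures information flow from channel j to channel i,
    normalized so each row of squared |H| values sums to 1.
    """
    # A(f) = sum_d A_d * exp(-2j*pi*f*d/fs);  H(f) = A(f)^{-1}
    Af = sum(A_d * np.exp(-2j * np.pi * f * d / fs)
             for d, A_d in enumerate(A_list))
    H = np.linalg.inv(Af)
    P = np.abs(H) ** 2
    return P / P.sum(axis=1, keepdims=True)   # row-wise normalization
```

In practice the AR coefficients would come from a Yule–Walker fit of the multichannel EEG, and `dtf` would be evaluated over the frequency band of interest.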

There is a large amount of redundancy in the DTF coefficients. In simulated signal tests, only dimensions of 3, 4, 5 or 7 are used frequently [37,42], while in actual multichannel EEG signal processing the dimensions are generally much greater than in the simulated tests. Therefore, we first simulated and tested the DTF algorithm on vector time series systems of the same dimension as our EEG data so as to determine an appropriate threshold for constructing the brain connectivity network. We controlled the spectral radius $\rho(A_d)$ to solve the problem of randomly generating a large number of $A_d$ matrices while maintaining system stability in the high-dimensional vector time series simulation [37,41]. The formula is as follows.

$$r(A\_d) \le \rho(A\_d) \le R(A\_d)$$

where $r(A_d)$ and $R(A_d)$ are the minimum and maximum row sums of $A_d$, respectively. In the simulation, we let each row sum of $A_d$ be a random variable obeying a uniform distribution with extreme values 0.30 and 0.95; thus, for all $i$, we had

$$\sum\_{j=1}^{31} A\_d(i,j) \sim \mathcal{U}(0.30, 0.95)$$

Specifically, we generated the row sum and then randomly divided it into 5–16 parts as the elements of the corresponding row, ensuring that $A_d$ was non-negative and $R(A_d) < 1$.
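The random generation of delay matrices described above could be sketched as follows (the function name and the NumPy-based implementation are our assumptions; the stability claim relies on the spectral radius of a non-negative matrix being bounded by its maximum row sum):

```python
import numpy as np

def random_delay_matrix(k=31, n_parts=(5, 16), rng=None):
    """Generate a non-negative A_d whose row sums are ~ U(0.30, 0.95).

    Each row sum is drawn uniformly and then split at random into
    5-16 nonzero entries, so R(A_d) < 1 and the simulated system
    stays stable (spectral radius <= max row sum).
    """
    rng = np.random.default_rng(rng)
    A = np.zeros((k, k))
    for i in range(k):
        row_sum = rng.uniform(0.30, 0.95)
        parts = int(rng.integers(n_parts[0], n_parts[1] + 1))
        cols = rng.choice(k, size=parts, replace=False)
        weights = rng.random(parts)
        A[i, cols] = row_sum * weights / weights.sum()
    return A
```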

In our previous work [41], we found a strong correlation between the information flow accuracy of the DTF algorithm and the *Ad* of the actual AR model through large-scale testing of random analog signals. Previous experimental results have shown that when 10% was chosen as the threshold for constructing the brain connectivity network, the accuracy of effective connectivity could be guaranteed at most densities of *Ad*.

#### *4.2. Network Structure and Comprehensive Weighted Degree*

In order to characterize the intensity of information flow in the cerebral cortex, we constructed a brain connectivity graph from $DTF(f)$, denoted by $G^q_f = (V, A, W)$, where $V = \{1, 2, \ldots, 31\}$ is the vertex set of the network, corresponding to the 31 electrodes, $A = \{(i, j) \mid i, j \in \{1, \ldots, 31\} \text{ and } i \neq j\}$ is the directed edge set of the graph, and $W : A \to [0, 1]$ represents the weight of each directed edge. Figure 4 shows the different brain connection networks constructed for one subject when listening to piano music of different quality levels. Different colors represent different connection strengths. As can be seen from the figure, the strength of the noise in the audio affected brain connectivity.

**Figure 4.** The brain connection networks of a subject when listening to piano music with 0 (**a**), 1 (**b**), 4 (**c**) and 5 (**d**) noise level.

To further quantify the information strength feature, for each vertex *v* ∈ *V*(*G*), we calculated the following parameters.

$$\deg(v) = \sum\_{w \in ON(v) \setminus IN(v)} \mathcal{W}(v, w) + \sum\_{w \in IN(v) \setminus ON(v)} \mathcal{W}(w, v) + \sum\_{w \in ON(v) \cap IN(v)} \max\left(\mathcal{W}(v, w), \mathcal{W}(w, v)\right)$$

where $IN(v)$ and $ON(v)$ are the input and output neighbors of vertex $v$, respectively, and $\deg(v)$ is the comprehensive weighted degree of $v$. We also let $\deg_{G^q_f}(V)$ denote the comprehensive weighted degree sequence of graph $G^q_f$, and $\lambda^q$ denote that of the full frequency band [41].

$$
\lambda^q = \frac{\sum\_f \deg\_{G\_f^q}(V)}{f\_{\max} - f\_{\min} + 1}
$$
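A small sketch of the comprehensive weighted degree and the full-band feature $\lambda^q$, under the assumption that the edge weights are stored as a dense matrix (function names are ours):

```python
import numpy as np

def comprehensive_weighted_degree(W):
    """Comprehensive weighted degree of each vertex.

    W is a (k, k) matrix with W[i, j] the weight of edge i -> j
    (0 if absent). Out-only edges contribute W[v, w], in-only edges
    contribute W[w, v], and reciprocal edges contribute the larger
    of the two weights, matching the deg(v) formula.
    """
    k = W.shape[0]
    deg = np.zeros(k)
    for v in range(k):
        for w in range(k):
            if w == v:
                continue
            out_w, in_w = W[v, w], W[w, v]
            if out_w and in_w:
                deg[v] += max(out_w, in_w)
            else:
                deg[v] += out_w + in_w
    return deg

def lambda_q(deg_per_freq):
    """Full-band feature: mean degree sequence over frequency bins."""
    return np.mean(deg_per_freq, axis=0)
```

With `deg_per_freq` holding one degree sequence per integer frequency bin, the mean matches the $\lambda^q$ formula's division by $f_{\max} - f_{\min} + 1$.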

Figure 5 shows the brain topography of comprehensive weighted degree of a user in four different audio scenarios under two extreme conditions (the audio with no noise and the audio with noise intensity of 0.1). It can be seen that the user's EEG response varies greatly under different conditions.

**Figure 5.** The brain topography of comprehensive weighted degree in four different audio scenarios under two extreme conditions.

#### *4.3. Clustering*

For each given audio semantic scenario, we performed the clustering algorithm separately on $\lambda^0, \ldots, \lambda^5$. Clustering optimization was carried out according to the error-sum-of-squares criterion function.

$$\mathcal{J} = \sum\_{i=1}^{K} \sum\_{j=1}^{N} w\_{ji} \|\lambda^q - \mathcal{C}\_i\|^2$$

where $w_{ji}$ is the membership coefficient, which is either zero or one, and $\lambda^q$ is the feature data for K-means clustering; the comprehensive weighted degree feature has 31 dimensions. The clustering categories were defined as the acceptable level space and the unacceptable level space, and the user's tolerance level for different audio semantics was determined by the cluster membership of the samples, defined as the proportion of EEG signal samples classified into the unacceptable category at each noise level.
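The clustering step can be illustrated with a plain K-means that minimizes the criterion $J$ above. The deterministic farthest-point seeding is our choice for reproducibility; the paper does not specify an initialization.

```python
import numpy as np

def kmeans(X, K=2, iters=50):
    """Plain K-means minimizing the error-sum-of-squares criterion J.

    X holds one lambda^q feature vector per EEG sample (31-dimensional
    in the paper); K = 2 separates the 'acceptable' and 'unacceptable'
    level spaces.
    """
    C = [X[0]]
    while len(C) < K:                        # farthest-point seeding
        d = np.min([((X - c) ** 2).sum(1) for c in C], axis=0)
        C.append(X[np.argmax(d)])
    C = np.array(C, dtype=float)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == k].mean(0) if np.any(labels == k) else C[k]
                      for k in range(K)])
    J = sum(((X[labels == k] - C[k]) ** 2).sum() for k in range(K))
    return labels, C, J
```

The per-level proportion of samples landing in the "unacceptable" cluster then gives the tolerance estimate described in the text.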

#### **5. Result and Discussion**

#### *5.1. Results of Subjective Data Analysis on Noise Level*

Figure 6 shows the statistical subjective evaluation results for the number of times users reported noise that affected audio quality. It can be seen from the results that the purely subjective evaluations of users were not completely consistent with the objective facts; in many cases, the subjective evaluation results were intuitive but not reliable. For example, in the case of the sound of ocean waves, when the noise level was low there were no negative evaluations, and users did not make many negative quality evaluations even when the noise level reached level 4, which was unexpected. In addition, although subjects were required to evaluate only the impact of noise on audio quality, in the last two audio scenarios of the experiment many negative evaluations of audio quality were received even when the noise level was zero; in fact, the objective audio quality was very good at that point and included no noise. This is the disadvantage of subjective evaluation: the uncontrollable subjective arbitrariness of users.

**Figure 6.** Number of times the user experiences noise that affects audio quality.

#### *5.2. Results of Semantic Questionnaire Analysis*

The attributes and semantic questionnaire analysis results are shown in Figure 7. It can be clearly seen from the figure that the perceptive semantic radar map of the four audio scenarios expressed two completely different audio emotions. This data and result can also be seen in our previous work [41]. The details were discussed in Section 5.3, together with the physiological data results.

**Figure 7.** The subjective audio semantic questionnaire: the result of multiple perceptual evaluations on four different kinds of audio. (**a**) Piano music (**b**) Ocean wave (**c**) Fire alarm (**d**) Mosquito.

#### *5.3. Perceptual Tolerance*

An important goal of our analysis of EEG signals is to find the level of noise perceptual tolerance: when the noise level is higher than the perceptual tolerance, almost all subjects show an intolerable trend. According to general experience, the perceptual tolerance of humans to audio noise should be determined by the value of the SNR. Figure 8 shows the SNR results of all noisy audio stimulus materials in our experiment. As can be seen from Figure 8, the value of SNR decreases significantly as the noise level increases. In addition, due to the different semantics of the audio scenes, the SNR at the same noise level fluctuates within a small range. However, the physiological signal analysis results given in Figure 9 show that humans have different perceptual tolerances at the same noise level. In this work, the brain map of the comprehensive weighted degree when users listened to raw audio and low-intensity-noise audio was very different from that for high-intensity-noise audio. Therefore, the comprehensive weighted degree over the full frequency band can be used as the EEG feature for the clustering algorithm. Figure 9 shows the clustering visualization results as block diagrams for all subjects.
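The SNR behavior discussed here follows from the usual definition, $\mathrm{SNR} = 10 \log_{10}(P_{\mathrm{signal}} / P_{\mathrm{noise}})$; a minimal sketch (the function name is ours):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in dB: 10 * log10(P_signal / P_noise),
    where power is the mean squared amplitude of each sequence."""
    p_s = np.mean(np.asarray(signal, float) ** 2)
    p_n = np.mean(np.asarray(noise, float) ** 2)
    return 10.0 * np.log10(p_s / p_n)
```

Because the noise power grows with the power-spectral-density level while the signal power is fixed for a given clip, the SNR falls monotonically with the noise level, as Figure 8 shows.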

It can be clearly seen from Figure 9 that the user's noise tolerance level for a particular audio scenario was determined. Specifically, the user's limits of audio 1, 2, 3 and 4 were noise levels 1, 2, 4 and 4, respectively. We suspected the above results were related to the audio scenario and the difference between the original audio signal and the noise. So, we focused on the analysis of the semantic environment of the audio and the absolute integral value of the deviation between the four semantic audios with different levels of white noise.

**Figure 8.** The SNR of audio stimulus materials.

Combined with the results of the perceptual semantic questionnaire in Figure 7, it can be seen from the choices of all subjects that the smooth piano music and ocean waves make people feel pleasant, relaxed and calm. In this situation, even low-intensity white Gaussian noise added to such audio greatly degrades the subject's quality of experience; the user is very sensitive to the noise, and their brainwave signals change significantly. A different situation appears for audios 3 and 4. The fire alarm makes subjects feel tense and unpleasant, and the mosquito audio makes subjects even more upset. In such a semantic audio environment, the subjects' sensitivity to noise is reduced, and their perceptual tolerance of noise intensity is increased. Different audio scenes evoke different perceptual emotions, which explains why human perceptual tolerance does not exactly correspond to the objective SNR of the audio. Figure 10 gives more details on the signal absolute difference integral proportion difference between audio with five levels of noise and raw audio under the four audio scenarios; it strongly supports the finding that the perceptual tolerance of audios 3 and 4 is higher than that of audios 1 and 2. The clustering results can also be found in our previous work [41].

In conclusion, human perceptual tolerance of noise was related to the audio semantic environment perceived by the user, and it was inversely proportional to the signal absolute difference integral proportion difference between the noisy and raw audio under the different audio scenarios. Moreover, both the EEG signal analysis and the subjective evaluations indicated that users were more sensitive to noise-induced quality changes in calming and soothing audio scenarios.

**Figure 10.** The signal absolute difference integral proportion difference between audio with five levels of noise and raw audio under four different audio scenarios.
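One plausible reading of the "signal absolute difference integral proportion" is the integral of |noisy − raw| normalized by the integral of |raw|. The sketch below uses that interpretation on a synthetic tone; both the metric definition and the signals are assumptions for illustration, not the paper's exact formulation or data:

```python
import numpy as np

def abs_diff_integral_proportion(raw, noisy):
    """Integral of |noisy - raw| as a proportion of the integral of |raw|."""
    return np.sum(np.abs(noisy - raw)) / np.sum(np.abs(raw))

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 8000, endpoint=False)
raw = np.sin(2 * np.pi * 220 * t)  # synthetic stand-in for an audio clip
for level, sigma in enumerate([0.05, 0.1, 0.2, 0.4, 0.8], start=1):
    noisy = raw + sigma * rng.normal(size=raw.shape)
    print(f"noise level {level}: proportion = "
          f"{abs_diff_integral_proportion(raw, noisy):.3f}")
```

The proportion grows with the noise level; per the conclusion above, scenarios where this proportion difference is smaller would exhibit higher perceptual tolerance.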

#### *5.4. Connectivity Analysis of Related Brain Regions*

To better illustrate the experimental results, we present the comprehensive weighted degree of the key channel signals of the ten users with qualified experimental data. We define a key channel as one whose degree is >1 in an audio condition (high-quality audio or audio with noise) and whose amplitude at noise level 5 increases by more than 10% compared with level 0. The specific values are shown in Table 1 below.

The brain is divided into the frontal, parietal, temporal and occipital regions. The naming of the channel electrodes on the EEG cap follows the locations of these brain regions: F denotes the frontal region, P the parietal region, T the temporal region, O the occipital region, C the central region, FC the frontal-central region, CP the central-parietal region, and FP the frontal-pole region; odd numbers denote the left hemisphere, even numbers the right hemisphere, and Z the midline.

As can be seen from Table 1, when users heard the audio, the degree of the CP-related channel nodes was higher than that of other nodes (8/10 users), as was the degree of the FC-related channel nodes (8/10 users), indicating that certain brain regions were activated after users heard the audio stimulation. We found that, regardless of the audio scenario, the node degree increased significantly when noise was present, indicating that the activation of electrical nerve signals in those brain areas increased. For example, under the four audio scenarios, the channel degree increased from the original audio to the audio with noise level 5 by 39.59%, 35.08%, 16.07%, and 41.66%, respectively. As another example, the CP2 channel degree of user 1 increased by 28.2%, 65.92%, and 32.3% under audio scenarios 1, 2 and 4; in audio scenario 3, the degree of channel CP5 in the same brain area increased by 13.99%. Similarly, the increase in the FC-related channels was also obvious: under audio scenarios 1, 2 and 3, the degree of FC2 of user 4 increased by 105.88%, 37.97%, and 28.78%, respectively, and the degree of FC6 in the same area increased by 10.16% under audio scenario 2 and 18.75% under audio scenario 4. These results suggest that noise had a greater effect on the brain regions where these channels were located. In particular, the central-parietal region (CP channels) and the frontal-central region (FC channels) are brain regions related to cognitive integration and preference decisions, which is consistent with previous research on brain perception [43]. These conclusions held regardless of the individual or the audio scenario. However, activation of the brain regions did not rule out individual differences. For example, when user 7 was under audio scenarios 1 and 4, both the degree values and the increases in the F3 and Fz channels of the frontal region were large.
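The channel degrees discussed above can be computed from a symmetric connectivity matrix, and the key-channel rule (degree > 1 in the noisy condition, with a > 10% rise from level 0 to level 5) then selects the reported channels. The channel subset and connectivity values below are hypothetical, chosen only to demonstrate the computation:

```python
import numpy as np

def weighted_degree(conn, names):
    """Weighted node degree per EEG channel from a symmetric connectivity
    matrix (e.g., pairwise coherence), ignoring self-connections."""
    conn = np.array(conn, dtype=float)
    np.fill_diagonal(conn, 0.0)
    return dict(zip(names, conn.sum(axis=1)))

names = ["FC2", "CP2", "Fz"]  # hypothetical subset of the cap's channels
conn_lv0 = [[0.0, 0.4, 0.3],  # connectivity at noise level 0 (raw audio)
            [0.4, 0.0, 0.5],
            [0.3, 0.5, 0.0]]
conn_lv5 = [[0.0, 0.7, 0.4],  # connectivity at noise level 5
            [0.7, 0.0, 0.8],
            [0.4, 0.8, 0.0]]
d0 = weighted_degree(conn_lv0, names)
d5 = weighted_degree(conn_lv5, names)
# Key channel: degree > 1 under noise and a > 10% increase vs. level 0
key = [ch for ch in names if d5[ch] > 1 and (d5[ch] - d0[ch]) / d0[ch] > 0.10]
print(key)
```

With real data, the same selection would be run per user and per audio scenario to produce the per-channel percentages reported above.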


**Table 1.** The values and ranges of degree.



#### **6. Conclusions**

This paper discussed methods for evaluating human subjective perception from two aspects: the analysis of physiological signals from the central nervous system and the users' subjective behavioral data. EEG was used to record real brainwave data, and brain connectivity maps were constructed to obtain the perceptual tolerance of audio noise in different scenarios. The relationships among the audio signal-to-noise ratio, the audio scenario, user emotions and noise perception tolerance were analyzed comprehensively. Meanwhile, a change in brain activity intensity was also demonstrated.

**Author Contributions:** Conceptualization, B.G.; Methodology, B.G. and K.L.; Software, K.L.; Supervision, Y.D.; Writing—original draft, B.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Fundamental Research Funds for the Central Universities (CUC220C007, CUC22GZ007).

**Institutional Review Board Statement:** The data collection part of the study was conducted at Tsinghua University, and this study was approved by the Medical Ethics Committee of Tsinghua University (1100000118937).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** This work was also supported by Communication University of China and Tsinghua University-China Mobile Communications Group Co., Ltd. Joint Institute. We wish to thank Tsinghua WMC EEG Lab for providing experimental conditions. A small part of this work was presented at the conference IWCMC, and we have officially obtained IEEE permission to reuse the materials.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **References**

