#### *5.1. Questionnaire*

We first analyzed the questionnaire results, summarized in Figure 4, which were used to label the datasets. We regrouped the questionnaire answers into two groups: "None" and "Little" were merged into the weak boredom group, and the remaining answers were merged into the strong boredom group. The total number of questionnaire answers was 56, comprising 28 participants' answers for two video stimuli. Thirty answers fell into the weak boredom group, while 26 were assigned to the strong boredom group. These regrouped questionnaire results were then used to label the collected samples. As Figure 4 shows, the game trailer and the comedy clip did not induce boredom in most participants, whereas the circle video successfully evoked boredom.

**Figure 4.** Questionnaire results (Question: How much boredom did you feel from "stimulus name"?).
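As a minimal illustration (in Java, matching the Weka API used for modeling throughout), the regrouping rule above maps each questionnaire answer to a label; the helper name is hypothetical, and the rule branches only on the two answer values named in the text.

```java
// Sketch of the labeling rule from Section 5.1: "None" and "Little" form
// the weak boredom group; all remaining answers form the strong boredom group.
static String boredomGroup(String answer) {
    return (answer.equals("None") || answer.equals("Little")) ? "weak" : "strong";
}
```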

#### *5.2. Initial Test for Model Selection*

We performed initial testing to compare the 19 candidate algorithms and to select the best models for further analysis. Table 4 presents the top ten models for each dataset. Based on the results, we selected RF, MLP, and NB for further investigation. In particular, RF was ranked as the best algorithm for both the EEG-GSR combined and the EEG datasets. MLP was ranked in the top ten for all datasets, more often than any other algorithm. We also chose NB because it has a relatively low time complexity of *O*(*np*), where *n* is the number of training samples and *p* is the number of features, whereas other algorithms with similar performance, such as IBk or J48, have a time complexity of *O*(*n*<sup>2</sup>).


**Table 4.** Initial testing results for model selection.
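As a minimal sketch of the initial testing procedure, the following snippet evaluates the three retained candidates with Weka's cross-validation API. The ARFF file name is hypothetical, and the paper's full comparison covered 19 algorithms rather than these three.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InitialModelSelection {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name: the labeled EEG-GSR combined dataset.
        Instances data = DataSource.read("eeg_gsr_combined.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] candidates = { new RandomForest(), new MultilayerPerceptron(), new NaiveBayes() };
        for (Classifier c : candidates) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1)); // 10-fold cross validation
            System.out.printf("%s: accuracy = %.2f%%, AUC = %.3f%n",
                    c.getClass().getSimpleName(), eval.pctCorrect(), eval.areaUnderROC(0));
        }
    }
}
```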

#### *5.3. Hyperparameter Tuning*

#### 5.3.1. Random Forest

Weka's RF API has three major hyperparameters related to performance: the number of features to randomly investigate at each split, the number of trees, and the maximum depth of the trees. To find the best hyperparameters, we trained models with all possible combinations of the parameters within predefined ranges. As a result, we trained 107,100 models and measured their performance with 10-fold cross validation. The predefined ranges of the tuning parameters were as follows:


Table 5 presents the top three hyperparameter combinations for the Random Forest algorithm on each dataset. To select the best hyperparameters, we used the following prioritization: (1) accuracy, (2) Area Under the receiver operating characteristic Curve (AUC), and (3) expected classification cost. Consequently, we selected 7, 14, and 7 as the number of features, the number of trees, and the maximum depth, respectively, for the EEG-GSR combined dataset. For the EEG and GSR datasets, the default values for the number of features (6 and 3, respectively) were optimal, and 18 and 11 were selected as the number of trees and the depth, respectively. A depth of "no limit" means that each tree in the random forest is grown without restriction; considering the expected classification cost, a depth of 11 appears to be more suitable.


**Table 5.** RF hyperparameter tuning results.
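The exhaustive search described above can be sketched as nested loops over the three hyperparameters. This is only an illustration: the ranges below are placeholders (the paper's exact ranges are not reproduced in this excerpt), the class index of `data` is assumed to be set already, and ties are broken by AUC following the prioritization of Section 5.3.1 (expected classification cost is omitted for brevity).

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;

public class RandomForestGridSearch {
    // Exhaustive grid search over Weka's RF options: -K (features per split),
    // -I (number of trees), -depth (maximum tree depth).
    static void tune(Instances data) throws Exception {
        double bestAcc = -1, bestAuc = -1;
        String bestOpts = null;
        for (int k = 1; k <= 10; k++) {                     // features per split (illustrative range)
            for (int trees = 10; trees <= 30; trees++) {    // number of trees (illustrative range)
                for (int depth = 1; depth <= 15; depth++) { // maximum depth (illustrative range)
                    RandomForest rf = new RandomForest();
                    String opts = String.format("-K %d -I %d -depth %d", k, trees, depth);
                    rf.setOptions(Utils.splitOptions(opts));
                    Evaluation eval = new Evaluation(data);
                    eval.crossValidateModel(rf, data, 10, new Random(1));
                    // Prioritize accuracy, then AUC, as in Section 5.3.1.
                    if (eval.pctCorrect() > bestAcc
                            || (eval.pctCorrect() == bestAcc && eval.areaUnderROC(0) > bestAuc)) {
                        bestAcc = eval.pctCorrect();
                        bestAuc = eval.areaUnderROC(0);
                        bestOpts = opts;
                    }
                }
            }
        }
        System.out.println("Best options: " + bestOpts + " (accuracy " + bestAcc + "%)");
    }
}
```

Under this scheme, the combination selected for the EEG-GSR combined dataset corresponds to the options `-K 7 -I 14 -depth 7`.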

#### 5.3.2. Multilayer Perceptron

We tuned the neural network design, learning rate, and epoch parameters for MLP in the Weka API. We started by evaluating the neural network design parameter, testing all possible cases while keeping the other parameters at their default values. For flexible neural network design, we followed the Weka API's MLP neural network design parameter rule (see Table 3). The general design principle was to gradually decrease the number of nodes in each layer, from the first hidden layer to the last.

To determine the optimal learning rate and epoch values for MLP, we tested all possible combinations of these parameters within predefined ranges. As a result, 3,600,000 models were trained and their performance was measured using 10-fold cross validation. The predefined ranges of each tuning parameter were as follows (the default value of MLP's momentum parameter is 0.2; however, we fixed it at 0.1):


Table 6 presents the hyperparameter tuning results. To select the best hyperparameters, we used the following prioritization: (1) accuracy, (2) AUC, and (3) a low epoch count. Given the characteristics of MLP, and to avoid overfitting, training should stop once the accuracy no longer increases. Regarding the network design, models with three hidden layers could not be tested because WSE produced a model that contained only the label data. According to the results in Table 6, we selected "t", 0.76, and 73 as the values of the network design, learning rate, and epoch parameters, respectively, for the EEG-GSR combined dataset. For the EEG dataset, "i", 0.19, and 489 were found to be the best values; for the GSR dataset, "t", 0.95, and 321 were selected.


**Table 6.** MLP hyperparameter tuning results.
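As a concrete sketch, the configuration selected for the EEG-GSR combined dataset maps onto the setters of Weka's `MultilayerPerceptron` as follows. The wrapping class is hypothetical; the wildcard meanings in the comments are those documented for Weka's hidden-layers option ("i" = one hidden layer with as many nodes as attributes, "t" = attributes plus classes).

```java
import weka.classifiers.functions.MultilayerPerceptron;

public class TunedMlp {
    // MLP with the hyperparameters selected for the EEG-GSR combined dataset
    // in Section 5.3.2 (momentum fixed at 0.1 rather than the default 0.2).
    static MultilayerPerceptron forCombinedDataset() {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setHiddenLayers("t");  // "t": one hidden layer with (#attributes + #classes) nodes
        mlp.setLearningRate(0.76);
        mlp.setTrainingTime(73);   // number of training epochs
        mlp.setMomentum(0.1);
        return mlp;
    }
}
```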

#### 5.3.3. Naïve Bayes

Table 7 presents the tuning results of the NB hyperparameters for each dataset. Weka's NB API has two major options, which are mutually exclusive: whether to use a kernel density estimator rather than the normal distribution for numeric attributes ("Kernel" in Table 7), and whether to use a supervised discretization to process numeric attributes ("Discretization" in Table 7).

As in the parameter tuning processes for RF and MLP, we tested all possible combinations of the NB hyperparameter options. Because the two options are mutually exclusive, each dataset admits three valid configurations (both disabled, kernel estimator only, and supervised discretization only); we therefore trained and tested nine models in total, and found that both options should be disabled for the EEG-GSR combined, EEG, and GSR datasets.


**Table 7.** NB hyperparameters' tuning results.
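In Weka, the two options correspond directly to two setters on `weka.classifiers.bayes.NaiveBayes`. A minimal sketch of the selected configuration follows (both disabled, so numeric attributes are modeled with a normal distribution); the wrapping class is hypothetical.

```java
import weka.classifiers.bayes.NaiveBayes;

public class TunedNaiveBayes {
    // Configuration selected for all three datasets (Table 7): with both
    // options disabled, NB models numeric attributes with a normal
    // (Gaussian) distribution.
    static NaiveBayes build() {
        NaiveBayes nb = new NaiveBayes();
        nb.setUseKernelEstimator(false);          // "Kernel" column in Table 7
        nb.setUseSupervisedDiscretization(false); // "Discretization" column in Table 7
        return nb;
    }
}
```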

#### *5.4. Final Performance Analysis*

#### 5.4.1. Performance Measurement

A common method of evaluating a classification model's performance is k-fold cross validation. However, k-fold cross validation is based on a random split of the data, so its results can differ each time it is executed. Therefore, to obtain more reliable performance results, we measured the final models' performance by repeating 10-fold cross validation 1000 times with different seed values. Table 8 presents the mean, maximum, and minimum accuracies and AUCs produced by the 1000 iterations of 10-fold cross validation for each parameter combination and dataset. Considering the mean accuracy, the MLP algorithm produced the best performance on all datasets. However, considering the mean AUC values, the RF model outperformed the MLP model on the EEG-GSR combined dataset. The last column of Table 8 reports the average computation time per run, measured over the last 100 of the 1000 iterations of 10-fold cross validation. MLP took the longest time overall; for example, it took 71 ms on the EEG-GSR combined dataset, about 2.7 times longer than RF and 7.5 times longer than NB. In all cases, performance on the EEG-GSR combined dataset was better than on either individual dataset.


**Table 8.** Final performance comparison—1000 runs of 10-fold cross validation.
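A minimal sketch of this repeated-evaluation procedure, using a different seed per run and timing only the last 100 runs as described above; the class and method names are hypothetical.

```java
import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.core.Instances;

public class RepeatedCrossValidation {
    // 1000 runs of 10-fold cross validation with different seeds (Section 5.4.1).
    static void evaluate(Classifier model, Instances data) throws Exception {
        final int runs = 1000;
        double accSum = 0, accMin = Double.MAX_VALUE, accMax = -Double.MAX_VALUE;
        long timeSumMs = 0;
        for (int seed = 1; seed <= runs; seed++) {
            Evaluation eval = new Evaluation(data);
            long start = System.nanoTime();
            eval.crossValidateModel(model, data, 10, new Random(seed));
            if (seed > runs - 100) { // average the time over the last 100 runs only
                timeSumMs += (System.nanoTime() - start) / 1_000_000;
            }
            double acc = eval.pctCorrect();
            accSum += acc;
            accMin = Math.min(accMin, acc);
            accMax = Math.max(accMax, acc);
        }
        System.out.printf("mean %.2f%%, min %.2f%%, max %.2f%%, avg time %d ms%n",
                accSum / runs, accMin, accMax, timeSumMs / 100);
    }
}
```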

Figure 5 shows the models' final performances using box plots. We observe that the MLP and NB models' performances were more stable than that of RF over the 1000 runs of 10-fold cross validation. The mean accuracies of MLP were generally higher than those of the other models. The RF model on the EEG-GSR combined dataset showed a high mean AUC but had a large variance. Muller et al. [42] defined a model's discriminatory ability in terms of AUC as follows: (1) excellent discrimination (AUC ≥ 0.90), (2) good discrimination (0.80 ≤ AUC < 0.90), (3) fair discrimination (0.70 ≤ AUC < 0.80), and (4) poor discrimination (0.60 ≤ AUC < 0.70). For many of our final models, the AUC was over 0.70, and thus these models can be classified as fair discrimination models.
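The categorization of Muller et al. [42] reduces to a simple threshold chain; a hypothetical helper is shown below (values below 0.60 are not categorized in [42]).

```java
// Discriminatory-ability categories of Muller et al. [42].
static String discrimination(double auc) {
    if (auc >= 0.90) return "excellent";
    if (auc >= 0.80) return "good";
    if (auc >= 0.70) return "fair";
    if (auc >= 0.60) return "poor";
    return "uncategorized"; // values below 0.60 are not defined in [42]
}
```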

Considering all aspects of our model performance validation, the MLP model classified boredom most reliably on all the datasets. Furthermore, the model using the EEG-GSR combined dataset showed the highest performance, while MLP on the EEG dataset and MLP on the GSR dataset ranked second and third, respectively.

#### 5.4.2. Analysis of the Selected Features

Table 9 presents the features that the WSE algorithm selected for each model. These results indicate that the standard deviation of MV was selected for every classifier that used GSR data. For a more detailed analysis of the selected features, we illustrate the distribution of the selected EEG features by frequency band and electrode in Figure 6. As the left pie chart indicates, features related to the Alpha and Beta bands were selected more frequently than those of the Delta and Gamma bands, and the Theta band features were not selected by WSE at all. The right pie chart in Figure 6 illustrates the distribution of features by electrode; the distribution between FP1 and FP2 is nearly balanced, although FP2 features were picked slightly more frequently.


**Table 9.** Selected features by WSE in each model.
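For context, wrapper subset evaluation (WSE) in Weka scores candidate feature subsets by the cross-validated performance of the wrapped classifier itself. The sketch below pairs `WrapperSubsetEval` with a `BestFirst` search; the search strategy, fold count, and wrapped classifier here are illustrative assumptions, not the paper's exact configuration.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.WrapperSubsetEval;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;

public class WseFeatureSelection {
    // Wrapper subset evaluation: each candidate feature subset is scored by
    // the cross-validated performance of the target classifier itself.
    static int[] select(Instances data) throws Exception {
        WrapperSubsetEval evaluator = new WrapperSubsetEval();
        evaluator.setClassifier(new MultilayerPerceptron()); // classifier being wrapped
        evaluator.setFolds(10);                              // internal CV folds

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(evaluator);
        selector.setSearch(new BestFirst());
        selector.SelectAttributes(data);
        return selector.selectedAttributes(); // indices of kept attributes (incl. class)
    }
}
```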

**Figure 6.** Distributions of EEG features by frequency bands and electrodes.

Figure 7 illustrates the distribution of EEG features by frequency band for each electrode separately. We find that all selected Alpha band features are concentrated on FP1. In contrast, as the right pie chart shows, all selected Gamma band features occur on FP2.

**Figure 7.** Distributions of EEG features by frequency bands for each electrode.

Based on our analysis of Table 9 and Figures 6 and 7, we conclude that EEG and GSR provide useful indicators for the classification of boredom. First, the standard deviation of MV is strongly correlated with boredom, because this feature was selected from both datasets containing GSR. Second, the Gamma, Alpha, and Beta bands correlate strongly with boredom, because they were selected more frequently than the Delta and Theta bands. Third, each frequency band is associated with a specific electrode location in boredom classification; for example, all selected Gamma band features belonged to FP2 and all selected Alpha band features belonged to FP1.

## **6. Discussion**

In our experiment, the MLP models' performances were more stable than those of the other models. As Table 8 shows, the RF model for the EEG-GSR combined dataset produced a maximum accuracy of 89.29%, the highest of all accuracies; however, its minimum accuracy was 66.07%, the lowest among all the models on the same dataset. An important property of a good classification model is the robustness of its performance over multiple executions. To evaluate this for the selected models, we executed 1000 iterations of 10-fold cross validation with different random seeds, so that the training and testing splits changed randomly. A good model should produce good classification results for different training and testing datasets. From this perspective, the MLP models classified boredom more robustly than the other models. Furthermore, the MLP model on the EEG-GSR combined dataset (mean accuracy of 79.98%) outperformed the other MLP models in terms of mean accuracy. Therefore, our analysis suggests that MLP is generally recommended for classifying boredom from EEG and GSR.

Comparing our main results with the previous research presented in Table 2, our model's performance is better than that of Jaques et al. [13] and lower than that of Jang et al. [10]. However, as Table 1 shows, previous studies did not utilize both EEG and GSR for classifying boredom, so a direct comparison between our model and previous studies' performance is not reasonable. Furthermore, we collected EEG and GSR data from 28 participants, whereas most previous studies, with the exception of Jang et al. [10], Jaques et al. [13], and Seo et al. [7], collected data from fewer than 28 participants (see Table 1); thus, our model is based on data acquired from a sufficient number of participants.

Moreover, this study executed 1000 iterations of 10-fold cross validation on each model to reduce the effect of randomness on the results and to produce more reliable performance scores. Among previous studies, Jaques et al. [13], Jang et al. [10], and Seo et al. [7] also validated their models with 10-fold cross validation; however, they did not mention the number of repetitions, so we assume that cross validation was executed only once. Shen et al. [2] separated their data into training and testing sets but did not consider the random effect. Giakoumis et al. [9] validated their model with leave-one-out cross validation, which validates a model without random effects; however, as mentioned above, a direct comparison of that study to ours is not reasonable because we used EEG and GSR datasets, while Giakoumis et al. [9] used ECG and GSR datasets.

We note that the proposed MLP model's performance (79.98%) is lower than that of the model proposed in our previous study (86.73%) [7]. One reason for this difference is that, as explained in the paragraph above as well as in Sections 4.1 and 4.2.1, the experimental setting and the way we evaluated the models' performance were modified from the previous study. These changes were made to improve the robustness and generalizability of the results. For example, in our previous study [7], we acquired multiple samples from each participant's data by splitting them into one-second windows and used these for training; in the current study, we acquired only one sample from each participant's dataset by increasing the window size, thus aiming to increase the independence between samples. This helps to reduce overfitting and achieve more generalizable results. Moreover, the previous study conducted only one iteration of cross validation, whereas, in the current study, mean accuracies were recorded after 1000 iterations of 10-fold cross validation to increase the robustness of the results. This approach provides more reliable performance scores, especially in applications where the number of available samples is relatively small, as in this study. Although these steps decreased the accuracy of the final model, the generalizability and reliability of the results were increased.

Another novelty of this study is the identification of correlations between EEG, GSR, and boredom through the interpretation of features. As explained in Section 2, these correlations were not revealed by previous studies. Moreover, our findings align with the suggestion of Bench and Lench [34] that boredom should increase autonomic nervous system activity, which directly relates to EEG and GSR as data sources. In our feature refinement results, the WSE algorithm retained both EEG and GSR features in the best-performing model to increase performance, suggesting that the combination of these data correlates with boredom. In particular, Gamma band features were selected for the combined EEG-GSR and EEG datasets. This indicates that the Gamma band may correlate with boredom, whereas the Alpha, Beta, and Delta bands have weaker correlations, and the Theta band shows no correlation with boredom at all. Furthermore, the WSE algorithm selected the standard deviation of MV among the GSR features from both the combined EEG-GSR and GSR datasets. Consequently, the standard deviation of MV can also serve as an indicator of boredom.
