*3.7. Classification Results*

Once all classifiers had been trained and tested with their default parameters, cross-validation and grid search were applied to select the set of hyperparameters that yielded the maximum accuracy. From the candidate hyperparameters considered for each algorithm, the values selected for the most accurate classifiers are shown in Tables 6–8 for the K-nearest neighbors, random forest, and multilayer perceptron, respectively. These classifiers were then used to compare accuracy on the testing set in later stages. Note that the optimal parameters varied with the working fluid, as the results for ethanol could not be extrapolated to those for FC-72. Nevertheless, the differences were slight, especially for classifiers that do not depend on a large number of hyperparameters, such as K-nearest neighbors. For this classifier, the training set was split even though it was not strictly necessary; this was done for the sake of consistency when comparing all three classifiers.
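The selection procedure described above can be sketched as follows, assuming scikit-learn is used; the dataset, split ratio, and candidate parameter values here are illustrative placeholders, not the study's actual data or the tuned values of Tables 6–8.

```python
# Sketch of cross-validated grid search for hyperparameter selection (assumed
# scikit-learn workflow; data and parameter grid are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder two-class dataset standing in for the flow-pattern features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Candidate hyperparameters for KNN (illustrative values only).
param_grid = {"n_neighbors": [3, 5, 7, 9], "weights": ["uniform", "distance"]}

# 5-fold cross-validated grid search, fit on the training set only.
search = GridSearchCV(KNeighborsClassifier(), param_grid,
                      cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_   # selected hyperparameters
best_cv_score = search.best_score_  # best mean cross-validated accuracy
```

The same pattern applies to the random forest and MLP classifiers, with their own parameter grids.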

**Table 6.** Optimal hyperparameters for K-nearest neighbors classifier.


**Table 7.** Optimal hyperparameters for random forest classifier.


To assess the performance of the individual classifiers after identifying the best set of hyperparameters, normalized confusion matrices were used to visualize the distribution of correct and incorrect classifications on the testing set. A confusion matrix reports the fraction of correctly and incorrectly labeled points over the total number of points with each true label. Thus, a matrix entry of 1 for a specific label indicates that all points with that true label were assigned the expected label. Normalized values were used to make the visual results more interpretable.
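A minimal sketch of this row normalization, assuming scikit-learn's `confusion_matrix`; the labels and predictions below are toy placeholders for the two flow patterns:

```python
# Row-normalized confusion matrix: each row is divided by the number of
# points with that true label, so a diagonal entry of 1 means every point
# of that class was classified correctly.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["slug/plug", "slug/plug", "semi-annular", "slug/plug", "semi-annular"]
y_pred = ["slug/plug", "slug/plug", "slug/plug",    "slug/plug", "semi-annular"]

cm = confusion_matrix(y_true, y_pred,
                      labels=["slug/plug", "semi-annular"],
                      normalize="true")
# cm[0] -> [1.0, 0.0]: all slug/plug points labeled correctly
# cm[1] -> [0.5, 0.5]: half of the semi-annular points mislabeled
```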

**Table 8.** Optimal hyperparameters for multilayer perceptron classifier.


The confusion matrices for the three algorithms with ethanol as the working fluid are shown in Table 9. In general, all algorithms performed similarly, with the classification accuracy for slug/plug flow significantly higher than that for semi-annular flow. This could be due to increased observation errors when classifying semi-annular flow, or to the innate nature of this flow pattern, which may have led to more erroneous observations.

**Table 9.** Confusion matrix results for ethanol.


In the case of ethanol, the multilayer perceptron exhibited a slightly higher fraction of correct classifications for both slug/plug and semi-annular flow. These small differences among the three classifiers suggest that, given the available data and selected input features (limited by design and thus open to improvement), all algorithms reached a similar level of accuracy, with the highest provided by the MLP.

In the case of FC-72, the highest fraction of correct classifications was also found using the MLP classifier. The major difference across classifiers was seen in the slug/plug category. The confusion matrices for FC-72 are illustrated in Table 10. As in the case of ethanol, the MLP classifier tended to present the highest accuracy among the algorithms with the testing set, after selecting the most suitable set of hyperparameters.

**Table 10.** Confusion matrix results for FC-72.


The overall classification performance is reflected in the accuracy score. These values are reported in Tables 11 and 12 for ethanol and FC-72, respectively, and correspond to the classifiers that presented the highest cross-validated score during hyperparameter selection. In agreement with the confusion matrices, the results indicate that the MLP provided the highest performance, while the random forest algorithm showed the lowest. In this work, the accuracy score was chosen as the selection criterion for the most suitable classification method; however, it is acknowledged that additional criteria, such as computational time or performance on larger datasets, could also be considered.
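The final comparison step can be sketched as follows, again assuming scikit-learn; the dataset and the hyperparameter values are placeholders, not the tuned values from Tables 6–8.

```python
# Illustrative test-set accuracy comparison across the three classifier
# families discussed in the text (placeholder data and hyperparameters).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "MLP": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                         random_state=1),
}

# Fit each classifier on the training set and score it on the held-out set.
scores = {name: accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
          for name, clf in classifiers.items()}
best = max(scores, key=scores.get)  # classifier with highest test accuracy
```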

**Table 11.** Accuracy score for each algorithm: ethanol.


**Table 12.** Accuracy score for each algorithm: FC-72.


An alternative and complementary method for evaluating the performance of each classification algorithm is the analysis of learning curves. A learning curve shows the sensitivity of a particular performance metric (e.g., accuracy score or mean squared error) to the size of the training set. This allows the user to identify (i) whether more data samples should be included in the training set, and (ii) whether the classifier under study suffers from a bias error or a variance error. Generally speaking, a high bias error indicates that the classifier is too simple with respect to the training data, leading to underfitting, whereas a high variance error indicates that the classifier is overly sensitive to the particular training samples, performing well on seen data but poorly on unseen data, leading to overfitting [49].

Both concepts are related, as models with high bias tend to present low variance and vice versa, exhibiting a tradeoff that should always be taken into account [49]. To evaluate this tradeoff in both the training and validation stages, learning curves were created for each classifier and working fluid. Once the hyperparameters of each algorithm had been selected via grid search, training subsets of different sizes were drawn (as proportions of the initial training set size; see Table 4), and the training accuracy score was estimated for each size. For the validation curve, cross-validation was used once again (via *k*-fold cross-validation, as described in Section 3.5), and accuracy scores were likewise calculated for each training set size. The learning curves for all three algorithms and both working fluids are depicted in Figure 5.
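This construction maps directly onto scikit-learn's `learning_curve` utility; the sketch below uses a placeholder dataset and an untuned KNN classifier for illustration.

```python
# Sketch of learning-curve construction: for each training-set size, fit on
# that subset and score both the subset itself (training curve) and the
# held-out folds (cross-validation curve).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the training set
    cv=5, scoring="accuracy")

train_mean = train_scores.mean(axis=1)  # training curve
val_mean = val_scores.mean(axis=1)      # cross-validation curve
# A persistent gap between the two curves signals variance error;
# two curves plateauing at a low score signal bias error.
```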

Figure 5 reveals that the training score for the KNN classifier was equal to 1 for both working fluids (panels a and d, respectively), denoting perfect training accuracy. Although this might seem like a successful result, there was a relatively large gap between the training score and the validation score. This gap is indicative of variance error, as the classifier performed well only during training. The accuracy score in both datasets remained within the range of 83% to 100% for both working fluids, suggesting a small bias error in both stages. A similar behavior was seen for the random forest classifier, although in this case the training accuracy varied differently as the training set size increased: a minimum accuracy score was reached with ethanol as the working fluid, whereas an oscillating trend was observed with FC-72. The accuracy score on the validation set improved as the number of data samples increased. In general, the random forest algorithm presented a lower bias error than KNN. The MLP classifier, on the other hand, presented a smaller gap between the training and validation stages, with the accuracy score for both stages within the range of 84% to 85%, indicating a good balance between variance and bias errors. This suggests that the tradeoff for this method was the most balanced.

**Figure 5.** Training and cross-validation curves: (**a**) KNN with ethanol; (**b**) MLP with ethanol; (**c**) random forest with ethanol; (**d**) KNN with FC-72; (**e**) MLP with FC-72; (**f**) random forest with FC-72.
