**4. Results and Discussion**

Considering only the right answers in measuring the effectiveness of the model may lead to an erroneous perception of the machine behavior. For this reason, a confusion matrix was used to assess the performance of the supervised learning algorithms. In contrast to other metrics, a confusion matrix is capable of distinguishing between different types of errors. It is a square matrix where each row represents the actual classes, while each column refers to predicted classes. As a result, its main diagonal reflects the correct predictions, and the rest of its cells represent misclassifications.
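The layout described above can be sketched in a few lines. The snippet below is a minimal illustration with invented labels (not data from this study), showing how rows index actual classes, columns index predicted classes, and the main diagonal collects the correct predictions:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Square matrix: rows = actual classes, columns = predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Three roughness classes A, B, C encoded as 0, 1, 2 (illustrative labels only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 1]
cm = confusion_matrix(y_true, y_pred, 3)
# cm.trace() counts correct predictions; off-diagonal cells are misclassifications.
```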

In addition to the accuracy of the classifiers, two other measures are provided by a confusion matrix, namely precision and recall. Precision is used to obtain the percentage of correct predictions in every class, meaning the degree of reliability, while recall represents the fraction of samples which were correctly recognized, that is, the model's detection capability. Both measures were calculated using Equations (4) and (5). In these equations, TP (True Positive) refers to the number of predictions where the classifier correctly predicts the positive class as positive, TN (True Negative) indicates the number of predictions where the classifier correctly predicts the negative class as negative, FP (False Positive) denotes the samples incorrectly associated with a class, and FN (False Negative) represents the samples belonging to a specific class that were wrongly labelled.

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{4}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{5}$$

Combining both previous indicators, the F-score can be obtained according to Equation (6), which evaluates the harmonic mean of precision and recall.

$$\text{F} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{6}$$
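Equations (4)-(6) can be computed per class directly from the confusion matrix: the diagonal gives TP, each column total minus TP gives FP, and each row total minus TP gives FN. The sketch below uses an invented matrix, not the results of this study:

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall and F-score for each class of a confusion matrix
    (rows = actual classes, columns = predicted), per Equations (4)-(6)."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as the class but actually another
    fn = cm.sum(axis=1) - tp   # belonging to the class but labelled another
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

cm = np.array([[50, 5, 0],
               [4, 40, 10],
               [1, 9, 45]])
p, r, f = per_class_metrics(cm)
```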

As discussed, the approach consisted of developing an effective model to predict the average surface roughness and to extrapolate the results to complementary variables. To this end, 153 simulations were carried out based on the procedure indicated in Figure 3. This led us to optimize the process parameters and maximize the accuracy of the predictive model. All the results are set out below, in a detailed description of the optimized model.

The model was trained with images obtained in JPG format with dimensions of 1148 × 1076 pixels, which were converted into RGB. These pictures were not split, since it was found that increasing the number of training images by fragmentation does not justify the deviation between the measured roughness and the value of the fragmented photograph. Additionally, features were extracted from each image through SURF (Speeded-Up Robust Features) object recognition. At this point, the appropriate number of features was analyzed to optimize the accuracy of the model. Hence, models were trained using 2000, 1500, 1000, and 500 attributes, among other values.
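As a rough illustration of how a fixed feature count (500 vs. 2000) shapes the training input, the sketch below samples random patches from a grayscale frame and describes each by simple intensity and gradient statistics. This is only a schematic, NumPy-only stand-in: the study itself used SURF descriptors, which typically require the opencv-contrib package, and the function name and statistics here are invented for illustration:

```python
import numpy as np

def patch_features(img, patch=16, n_features=500, seed=0):
    """Schematic stand-in for SURF: sample `n_features` square patches from a
    grayscale image and describe each by simple intensity/gradient statistics.
    (Illustrative only; the actual study extracted SURF descriptors.)"""
    rng = np.random.default_rng(seed)
    h, w = img.shape
    ys = rng.integers(0, h - patch, n_features)
    xs = rng.integers(0, w - patch, n_features)
    feats = []
    for y, x in zip(ys, xs):
        p = img[y:y + patch, x:x + patch].astype(float)
        gy, gx = np.gradient(p)               # local intensity gradients
        feats.append([p.mean(), p.std(), np.abs(gx).mean(), np.abs(gy).mean()])
    return np.asarray(feats)                  # shape: (n_features, 4)

# Synthetic frame with the 1148 x 1076 pixel geometry quoted in the text.
img = np.random.default_rng(1).integers(0, 256, (1076, 1148))
F = patch_features(img, n_features=500)
```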

In Table 3, a selection of the training tests carried out is shown, specifically those classified with the Medium Gaussian SVM algorithm using three classes and a photograph division from 1 into 4 (2 × 2). The number of features is correlated with the difficulty of classifying the photograph correctly; that is, the higher the number of features, the more demanding the requirements to be fulfilled. Thus, as can be appreciated, the accuracy obtained decreases as the number of features increases.

**Table 3.** Comparison between the number of features extracted and the model accuracy with a medium Gaussian SVM classifier.


Finally, an initial value of 500 features was selected. In this way, a thorough analysis could be performed to assess the impact of each feature on the prediction capacity, so that counterproductive features could be removed.

Under those conditions, all classifiers were trained. The supervised learning algorithms used to train the different models are shown in Figure 4.


**Figure 4.** Compilation of classifiers used.

Regarding the outcome evaluation, given the limited number of training data available, a fivefold cross-validation was used. Consequently, the best performance was obtained in Model 141 with a support vector machine with a quadratic kernel, the confusion matrix of which is shown in Figure 5. This model achieved an accuracy of 69.4%.
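The fivefold scheme can be sketched as follows: the samples are shuffled and split into 5 folds, each fold serving once as the validation set while the remaining 4 train the model. This is a generic index-splitting illustration (the helper name is invented), not the exact pipeline used in the study:

```python
import numpy as np

def five_fold_indices(n_samples, seed=0):
    """Shuffle sample indices and yield 5 (train, validation) splits;
    each fold is used exactly once for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for k in range(5):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, val

# With a limited dataset, every observation is validated exactly once.
splits = list(five_fold_indices(100))
```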

**Figure 5.** Confusion matrices of Predictive Model 141 for average surface roughness, Ra.

According to the recall matrix, the confusion is greater when distinguishing images belonging to Classes B and C. Based on percentages, 23% of Class B images were incorrectly associated with Class C, whereas 19% of Class C samples were wrongly linked to Class B. Therefore, the predictive model displayed a better detection capability for roughness values in the first interval (Class A). Additionally, the precision matrix reflects a similar performance to that of the recall matrix, and so Class A presents the greatest reliability.

As can be seen from Figure 5, the intermediate class presented lower reliability and recall than the other ones, since it was more difficult to extract specific features for this class. This was a recurring issue that, among other problems, may be associated with a narrow roughness range and an insufficient database.

On the one hand, the non-uniform distribution of roughness values reduced the differences between categories. As a result, there are fewer class-specific characteristics, which leads to further confusion in the classification. On the other hand, the inaccuracy of the measuring instruments gains importance, since a small error, of around 10 µm, might result in an incorrect classification.

Once the average surface roughness had been studied, new predictive models were built to assess the methodology on complementary parameters, such as the maximum peak-to-valley height and the arithmetic mean waviness. Of these, we should highlight the following model, trained to predict the average primary surface.

Model 169 was trained and validated using the same settings as the previous one. In this case, an accuracy of 80% was reached with a support vector machine with a cubic kernel. Figure 6 depicts the model confusion matrix.

**Figure 6.** Confusion matrices of Predictive Model 169 for the average primary surface measurement.

As can be seen, recall and precision are approximately 80% in all classes. This improvement can be explained by the greater coherence between photographs and parameters, given that the images used represent an overlapping of roughness and waviness. Given this better model performance, not only was the effectiveness of machine learning algorithms in this field demonstrated, but the application of filters to the samples is also considered for further work.

After running 169 simulations, changing the processing parameters to build predictive models for surface quality measurement, the best results were those reported in Table 4.

**Table 4.** Results of the models developed. Intervals for the arithmetic roughness, Ra, are: 3 classes—class A: [0.495, 0.799] µm; class B: (0.799, 1.11] µm; class C: (1.11, 2.81] µm. 4 classes—class A: [0.407, 0.7] µm; class B: (0.7, 0.95] µm; class C: (0.95, 1.15] µm; class D: (1.15, 2.81] µm.
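The three-class intervals above can be turned into labels with a simple binning step. The sketch below uses `numpy.digitize` with the 3-class Ra edges from Table 4; the helper name is invented for illustration:

```python
import numpy as np

# Three-class interval edges for Ra from Table 4 (µm):
# A: [0.495, 0.799], B: (0.799, 1.11], C: (1.11, 2.81].
edges = [0.799, 1.11]

def ra_class(ra):
    """Map a measured Ra value (µm) to its class label.

    right=True makes each interval closed on the right, matching the
    (lower, upper] notation of Table 4."""
    return "ABC"[np.digitize(ra, edges, right=True)]
```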


The prevalence of the support vector machine as the main classifier supports the findings reported in the literature [22,24,26]. Additionally, it is clear that this methodology is more appropriate for average parameter measurement.

The development of this work was subject to limitations that explain the results obtained, such as the small number of training images, considering the difficulty of feature extraction in these types of images, the narrowness of the working range, and the non-uniform distribution of the roughness values. Some of these are now discussed with a view to achieving better predictive models in future works. Firstly, the working range must be taken into account. Because of the narrowness of the average surface roughness interval, specific feature extraction for each class becomes a more complex task. Moreover, a strong adhesive mechanism existed in the tribology system modelled. This involved a non-uniform surface in all the cases considered herein. Thus, the surfaces presented scattered areas with removal of material and ploughing with material accumulations, besides the typical erosive-abrasive wear, with no clear tendency across the different SPIF process conditions (Figure 7). Figure 7 depicts the surface of a part measured in a 45° direction with respect to the rolling one, in different areas. In Figure 7a, the plastic deformation caused by the overlap between passes of the tool appears as a regular macro-ploughing effect, indicated by arrows in the image. Figure 7b shows some hints of peeling pits (some of which are marked inside circles), while in Figure 7d, some re-adhesion of previously removed material can be observed. All these phenomena contrast with an almost uniform worn surface (Figure 7c), showing only a micro-ploughing phenomenon with some isolated areas of peeling pits.

**Figure 7.** Surfaces of a part obtained by SPIF process at 400 °C with a WC tool of 12 mm diameter; surfaces correspond to 45° with respect to the rolling direction and from different zones of the part; (**a**–**d**) photographs correspond to zones 1 to 4 according to the reference of Figure 2. Zone 4 is the deepest.

Only one major adhesive mechanism was confirmed with temperature, but its influence on the part does not always occur in the same area (Figure 7a–d). Some evidence of this can be observed in Figure 8, where the re-adhesion process is clearly noticeable in different areas of the images (marked areas). Moreover, the peeling–adhesion phenomena can dominate the topography of a study area, as can be seen in Figure 8a.

**Figure 8.** (**a**) Surface of a part obtained by SPIF process at 300 °C with a WC tool of 12 mm diameter; inner area 2 in the rolling direction. (**b**) Surface of a part obtained at 300 °C with a steel tool of 12 mm diameter; inner area 2 in the rolling direction. (**c**) Surface of a part obtained at 400 °C with a WC tool of 12 mm diameter; inner area 5 in a 45° rolling direction. Inner areas according to the reference of Figure 2.

Increasing the number of observations, and thus the database size, would allow us not only to develop a margin of separation between classes, but also to define each category more accurately. While our work sets no margin, studies such as that conducted by Abu-Mahfouz et al. [22] established a distance of up to 0.18 µm between adjacent classes. As a result, classifiers based on support vector machines using polynomial kernels achieved an 81.25% success rate.

In short, a non-uniform distribution of training data demands a greater level of accuracy in both the measurement and the photography. The rigor required could be reached through the following steps:


Secondly, another limitation is the dataset size. The small amount of training data is a common problem in studies using machine learning algorithms. According to the literature, the number of samples required varies with the purpose of the application. Whereas experiments using processing parameters utilize considerably fewer than a hundred observations, those using images to predict variables demand hundreds of photographs. This is evidenced by the work of Koblar et al. [23], in which 300 pictures were used to determine whether parts were suitable for commercialization in accordance with the difference between the highest peak and the lowest valley.

It is also necessary to emphasize the importance of a uniform distribution of images, which allows a suitable classification strategy to be used without class imbalance. This depends solely on the manufacturing system since, in many cases, the process conditions affect both the range and the typology of the surface finish of the parts.

Finally, together with these limitations, the nature of the images must be considered. Roughness and waviness overlap in each photograph, so it is important to develop a filtering methodology to ensure that every picture faithfully represents the measured value.
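The separation mentioned above can be sketched on a 1D profile: a low-pass filter recovers the long-wavelength waviness, and the residual is the roughness. The snippet below uses a plain moving average as a simplified stand-in for the Gaussian profile filter of ISO 16610; the function name, cutoff, and synthetic profile are illustrative assumptions:

```python
import numpy as np

def split_profile(z, cutoff_pts=51):
    """Split a measured primary profile into waviness and roughness by
    moving-average smoothing (simplified stand-in for an ISO 16610
    Gaussian filter): the smoothed trace is the waviness, the residual
    is the roughness."""
    kernel = np.ones(cutoff_pts) / cutoff_pts
    waviness = np.convolve(z, kernel, mode="same")
    roughness = z - waviness
    return waviness, roughness

# Synthetic profile: long-wavelength waviness plus fine roughness texture.
x = np.linspace(0, 10, 2000)
z = 0.5 * np.sin(0.8 * x) + 0.05 * np.sin(40 * x)
w, r = split_profile(z)
```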
