**3. Data Preparation and Model Training**

Building a database is highly time-consuming, not only because the collected images must be processed, but also because predictive models must be trained with them. For this purpose, it is first necessary to understand how the model operates.

Two types of problems can generally be distinguished when a supervised learning approach is used: classification algorithms categorize unknown data into known discrete groups called classes, whereas regression algorithms predict a continuous value.

Initially, this work seeks to categorize images into several classes according to their surface roughness. Figure 3 depicts a block diagram where data processing and the training procedure are shown in detail. At this second stage, the classification learner from MATLAB software (2019b, MathWorks, Natick, MA, USA) was used.

**Figure 3.** Data processing and model training.

Firstly, the classes must be defined. The intervals chosen for the arithmetic roughness, Ra, were class A [0.495, 0.799] µm, class B (0.799, 1.11] µm, and class C (1.11, 2.81] µm, defining three classes. To this end, the surface quality value must be retrieved from the name with which each image was identified; this makes it possible to set roughness ranges that avoid class imbalance. As the number of groups increases, the accuracy demanded of the model grows, so the work began with three classes. Given the database used, the trained models can only solve classification tasks; addressing regression tasks would require a much larger number of images.
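The labeling step above can be sketched as follows (in Python for illustration, since the original workflow uses MATLAB; the `Ra<value>` filename pattern assumed by `ra_from_filename` is hypothetical, as the actual naming scheme is not specified in the text):

```python
import re

def ra_from_filename(name):
    """Extract the Ra value encoded in an image file name.
    The 'Ra<value>' pattern is only illustrative: the actual
    naming scheme of the database is not given in the text."""
    m = re.search(r"Ra(\d+\.?\d*)", name)
    if m is None:
        raise ValueError(f"no Ra value found in {name!r}")
    return float(m.group(1))

def ra_class(ra_um):
    """Map an arithmetic roughness Ra (in µm) to its class label,
    using the intervals defined in the text:
    A = [0.495, 0.799], B = (0.799, 1.11], C = (1.11, 2.81]."""
    if not 0.495 <= ra_um <= 2.81:
        raise ValueError(f"Ra = {ra_um} µm is outside the studied range")
    if ra_um <= 0.799:
        return "A"
    if ra_um <= 1.11:
        return "B"
    return "C"
```

Values falling outside the studied range are rejected rather than silently assigned, which keeps the class populations consistent with the intervals above.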

Secondly, the photographs were processed and distributed into the specified classes. The aim of image processing is to enhance the picture features that facilitate classification; coloring and dimensions are some of the characteristics analyzed. According to the literature, machine learning algorithms based on computer vision demand a large number of samples. For this reason, the possibility of splitting each image into smaller sub-images, each associated with the average roughness value of the original, was studied in order to enlarge the database.
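A minimal sketch of that splitting scheme (the tile size is a free parameter, not a value reported in the text; each tile inherits the roughness class of its source image, multiplying the number of training samples):

```python
import numpy as np

def split_into_tiles(image, tile_h, tile_w):
    """Split a 2-D grayscale image into non-overlapping tiles.
    Edge rows/columns that do not fill a whole tile are discarded."""
    h, w = image.shape
    tiles = []
    for r in range(0, h - tile_h + 1, tile_h):
        for c in range(0, w - tile_w + 1, tile_w):
            tiles.append(image[r:r + tile_h, c:c + tile_w])
    return tiles
```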

The image features were then extracted using a speeded-up robust features (SURF) detector (2019b, MathWorks, MA, USA) in order to build a visual vocabulary. The number of visual words was tuned to optimize the accuracy of the predictive model. In addition, following the supervised learning method, the class name was appended to the feature table as the response variable.
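The visual-vocabulary idea can be illustrated with a plain k-means loop over local descriptors (a Python stand-in for MATLAB's bag-of-features pipeline; a real implementation would use library routines, and the descriptor dimensionality here is arbitrary):

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster local feature descriptors (e.g. SURF vectors) into k
    'visual words' with a basic k-means loop."""
    descriptors = np.asarray(descriptors, dtype=float)
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each descriptor to its nearest center.
        d = np.linalg.norm(descriptors[:, None, :] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = descriptors[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers

def encode(descriptors, centers):
    """Histogram of nearest visual words: turns a variable number of
    descriptors into one fixed-length feature vector per image."""
    descriptors = np.asarray(descriptors, dtype=float)
    d = np.linalg.norm(descriptors[:, None, :] - centers[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(centers))
    return hist / hist.sum()
```

Each image's histogram becomes one row of the feature table, with its class label in the final column.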

Next, the Classification Learner, included in the MATLAB toolboxes (2019b, MathWorks, MA, USA), was invoked: a software tool capable of training supervised learning algorithms from the table of characteristics and assessing the results obtained. A portion of the feature table created for training is shown in Table 2, where the class column can be seen.


**Table 2.** Part of the feature table used by the classification learner.

The support vector machine (SVM) is a supervised learning algorithm applied in countless fields to solve classification and regression problems [29]. Furthermore, following the literature [22,24,26], it has proven to be an effective technique to classify the surface roughness using higher dimensional data.

The operating principle of this classifier is based on finding an optimal hyperplane ensuring the best separation between classes, that is to say, maximizing the margin between the boundary function and the closest samples. The separating hyperplane can be defined by Equation (1), in terms of its normal vector w and intercept b, where x_i denotes a sample vector. The superscript T denotes the transpose.

$$w^T \cdot x\_i + b = 0 \tag{1}$$
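As a toy illustration of Equation (1), a sample is classified by the sign of the decision function w^T x + b (the values of w and b below are arbitrary examples, not fitted parameters):

```python
import numpy as np

# Arbitrary example hyperplane: w^T x + b = 0 with w = (1, -1), b = 0.5.
w = np.array([1.0, -1.0])
b = 0.5

def decision(x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return np.sign(w @ x + b)
```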

In mathematical terms, this question becomes a quadratic programming optimization problem whose objective function, whose first term is the inverse of the margin, is to be minimized (Equation (2)). The second term of Equation (2) introduces the slack (adjustment) variables, *ξi*, where *C* is a balancing factor.

$$\Phi(w, \xi) = \frac{1}{2}||w||^2 + C \cdot \sum\_{i=1}^{n} \xi\_i \tag{2}$$

In our view, the distribution of the samples, as well as the number of classes, demands that this issue be addressed as a higher-dimensional classification with non-linearly separable data. To tackle this challenge, the original space must be mapped into a higher-dimensional one, in which an appropriate separation function can be constructed. This transformation is achieved with the help of kernel functions, and its effect on the problem is reflected in the SVM dual formulation (Equation (3)), where a Lagrangian is used to solve the quadratic programming problem and the coefficients α are the Lagrange multipliers.

$$\Theta(\alpha) = \sum\_{i} \alpha\_i - \frac{1}{2} \cdot \sum\_{i} \sum\_{j} \alpha\_i \alpha\_j y\_i y\_j \cdot \mathcal{K}(\mathbf{x}\_i, \mathbf{x}\_j) \tag{3}$$
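The kernel trick underlying Equation (3) can be verified numerically: for a degree-2 polynomial kernel, K(x, z) = (x·z)² equals the ordinary inner product after an explicit feature map φ (the 2-D map below is a standard textbook example, not taken from the paper):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D input (no bias term):
    (x1^2, x2^2, sqrt(2)*x1*x2), so that phi(x).phi(z) = (x.z)^2."""
    x1, x2 = x
    return np.array([x1 * x1, x2 * x2, np.sqrt(2.0) * x1 * x2])
```

The SVM never evaluates φ itself; it only needs the kernel values, which is what makes the higher-dimensional mapping computationally affordable.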

Kernel functions are chosen according to the nature of the problem. The suitability of different types in relation to surface roughness prediction was discussed in the work by Abu-Mahfouz et al. [22], where a linear kernel was shown to be the most promising function. Nevertheless, all kernel functions implemented in MATLAB® [30] were tested in order to find the one yielding the best results, i.e., the highest accuracy.
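For reference, the most common kernel families offered by such toolboxes can be written as follows (the parameterizations here are generic defaults, not the values used in the study):

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return x @ z

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x . z + c)^d"""
    return (x @ z + coef0) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / (2 * sigma^2)), the Gaussian/RBF kernel."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))
```

Model selection then amounts to training one SVM per candidate kernel and keeping the one with the best validation accuracy.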

Finally, the validation procedure depends on the database size. In this work, two techniques were used to avoid overfitting: holdout validation and cross validation. In holdout validation, a portion of the data is selected to train the predictive model, whereas the remaining samples are used as a test set; for this reason, holdout validation is suitable for large data sets. Cross validation, on the other hand, splits the initial samples into folds: each fold serves in turn as the validation set while the model is trained on the observations outside it, and the test error is computed as the average of the errors over all folds. Therefore, according to the number of classes and images, different validation methods were used.
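The k-fold splitting described above can be sketched as follows (a generic index-level sketch, independent of the fold counts actually used in the study):

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.
    Each fold serves once as the validation set while the model is
    trained on the remaining samples; the reported error is the mean
    over the k folds."""
    # Distribute any remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test
        start += size
```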
