### *3.2. Evaluation Metrics*

In each learning configuration, the performance of the model was evaluated using the following metrics:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FP} + \text{FN}} \tag{5}$$

$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} \tag{6}$$

$$\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{7}$$

$$\text{F1 Score} = 2\left(\frac{1}{\text{Recall}} + \frac{1}{\text{Precision}}\right)^{-1} \tag{8}$$

*TP* (True Positives) refers to the number of defect images correctly classified as defects.

*TN* (True Negatives) refers to the number of background images correctly classified as background.

*FP* (False Positives) refers to the number of background images incorrectly classified as defects.

*FN* (False Negatives) refers to the number of defect images incorrectly classified as background.
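Equations (5)–(8) can be sketched directly from the four confusion-matrix counts; the counts below are illustrative placeholders, not values from the paper.

```python
# Illustrative confusion-matrix counts (placeholders, not paper results).
tp, tn, fp, fn = 90, 80, 10, 20

accuracy = (tp + tn) / (tp + tn + fp + fn)   # Eq. (5)
precision = tp / (tp + fp)                   # Eq. (6)
recall = tp / (tp + fn)                      # Eq. (7)
f1 = 2 / (1 / recall + 1 / precision)        # Eq. (8): harmonic mean of precision and recall

print(f"Accuracy={accuracy:.3f}, Precision={precision:.3f}, "
      f"Recall={recall:.3f}, F1={f1:.3f}")
```

Note that Equation (8) is equivalent to the more common form F1 = 2·Precision·Recall / (Precision + Recall).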

The Root Mean Squared Error (RMSE) was also used to assess the model's performance in the three different training schemes. It is defined by Equation (9):

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (1 - y_i)^2} \tag{9}$$

where *yᵢ* is the calculated probability that image *i* (from the testing subset) belongs to its ground-truth class, and *n* is the number of images in the testing subset.
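A minimal sketch of Equation (9), assuming a list of per-image ground-truth-class probabilities (the values below are illustrative placeholders):

```python
import math

# Placeholder probabilities: each entry is the model's predicted probability
# that the corresponding test image belongs to its ground-truth class.
probs = [0.95, 0.80, 0.99, 0.60]

n = len(probs)
rmse = math.sqrt(sum((1 - y) ** 2 for y in probs) / n)  # Eq. (9)
print(f"RMSE = {rmse:.4f}")
```

A perfectly confident, always-correct classifier would give probabilities of 1 everywhere and an RMSE of 0.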

### *3.3. Weakly Supervised Semantic Segmentation*

Based on the best-performing learning scheme, feature maps from the last convolutional layer of the trained model were used to provide visual explanations of the classification results using Grad-CAM and Grad-CAM++. The implementation of these interpretation techniques was based on the publicly available repository in [55]. Pixel-level heatmaps were generated for the test images, and a threshold of 0.5 was applied to each heatmap to localize the regions with a target-class probability above 50%.
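The thresholding step can be sketched as follows; the heatmap here is a random placeholder standing in for a real Grad-CAM output, normalized to [0, 1] before binarization.

```python
import numpy as np

# Placeholder for a Grad-CAM heatmap (a real one would come from the
# gradients of the target class w.r.t. the last convolutional layer).
rng = np.random.default_rng(0)
heatmap = rng.random((224, 224))

# Min-max normalize to [0, 1], then binarize at 0.5 to obtain a coarse
# segmentation mask of the localized region.
heatmap = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
mask = heatmap > 0.5

print("segmented pixels:", int(mask.sum()))
```

The resulting boolean mask is what turns a classification explanation into a weakly supervised segmentation output.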
