*3.2. Loss Function*

The loss function evaluates the difference between the model's predictions and the ground truth. Different models use different loss functions, and, in general, a better-suited loss function yields more accurate predictions. The main microstructure of high-carbon steel is sorbite, whose content is generally above 50% and exceeds 70% in most samples. The proportion of sorbite in a metallographic image is therefore highly unbalanced relative to that of the background; in the sample shown in Figure 1, the sorbite content (the dark region) is about 97%. For the data set of this paper, Table 1 gives the proportions of samples with different sorbite contents; samples with a sorbite content above 80% account for 56.5% of the total. In general, unbalanced samples cause a model to focus on predicting pixels as the dominant class while "disregarding" the minority class, which degrades the model's ability to generalize to test data. An appropriate loss function, or a combination of loss functions, is therefore needed to handle this sample imbalance.

**Table 1.** Proportions of samples with different sorbite contents in the data set.


The detection of sorbite content is essentially a binary classification problem with a significant imbalance between positive and negative samples, in which the large number of background pixels degrades the model's segmentation accuracy. Focal loss was therefore selected as the semantic segmentation loss function in this paper. It was originally proposed by Lin et al. [18] to address the performance problems caused by class imbalance and by differences in classification difficulty in the image domain. Focal loss adds a parameter γ to the cross-entropy loss function and constructs a modulating factor (**1** − **p**(**x**))<sup>γ</sup> to mitigate the sample imbalance. The loss function is calculated as follows:

$$\text{FL}(\mathbf{p}(\mathbf{x})) = -(\mathbf{1} - \mathbf{p}(\mathbf{x}))^{\gamma} \log(\mathbf{p}(\mathbf{x})) \tag{4}$$

where, for an accurately classified sample, **p**(**x**) tends to 1 and the modulating factor tends to 0; for an inaccurately classified sample, **1** − **p**(**x**) tends to 1 and the modulating factor tends to 1. Compared with the cross-entropy loss, focal loss is therefore almost unchanged for inaccurately classified samples and is reduced for accurately classified samples. Overall, this is equivalent to increasing the weight of inaccurately classified samples in the loss function. **p**(**x**) also reflects the difficulty of classification: the greater **p**(**x**), the higher the classification confidence and the easier the sample is to classify; the smaller **p**(**x**), the lower the confidence and the harder the sample is to classify. Focal loss is thus equivalent to increasing the weight of difficult samples in the loss function, biasing the loss toward difficult samples, which helps improve accuracy on them.
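The behavior described above can be sketched directly from Eq. (4). The snippet below is a minimal illustrative implementation (function name and the choice γ = 2 are ours, not from the paper); it shows how the modulating factor suppresses the loss of easy, confidently classified samples relative to hard ones.

```python
import numpy as np

def focal_loss(p, gamma=2.0, eps=1e-7):
    """Focal loss of Eq. (4): -(1 - p)^gamma * log(p).

    p     : predicted probability of the true class, in (0, 1).
    gamma : focusing parameter; gamma = 0 recovers cross-entropy.
    """
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return -((1.0 - p) ** gamma) * np.log(p)

# An easy sample (p = 0.9) is down-weighted far more strongly
# than a hard sample (p = 0.1):
easy = focal_loss(0.9)   # modulating factor (0.1)^2 = 0.01
hard = focal_loss(0.1)   # modulating factor (0.9)^2 = 0.81
```

With γ = 0 the modulating factor is 1 and the expression reduces to the ordinary cross-entropy term −log(**p**(**x**)), which makes the down-weighting effect of γ > 0 explicit.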

In addition, the imbalance in region size between the foreground and the background of the sorbite image can be handled by the Dice loss [19]. Dice loss is a region-based loss: the loss at a given pixel depends not only on that pixel's predicted value but also on the values of the other pixels. The loss function is:

$$\text{Dice Loss} = \mathbf{1} - \frac{\mathbf{2}|\mathbf{X} \cap \mathbf{Y}|}{|\mathbf{X}| + |\mathbf{Y}|} \tag{5}$$

where **X** represents the target segmentation and **Y** the predicted segmentation; the intersection in **Dice Loss** can be understood as a mask operation. Therefore, regardless of the image size, a positive-sample region of fixed size contributes the same loss, so its supervisory contribution to the network does not change with image size. Dice loss training tends to mine foreground regions and thus suits the small-foreground situation in this paper. Its training, however, is prone to instability, especially for small targets, and gradient saturation can occur in extreme cases. Considering the distribution of sorbite content in the samples, this paper therefore combines Dice loss with focal loss.
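A minimal sketch of Eq. (5) and of the combination described above is given below. The function names, the soft-probability formulation, and the equal weighting of the two terms are illustrative assumptions on our part; the paper does not specify how the two losses are weighted.

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Dice loss of Eq. (5) on soft predictions.

    pred   : per-pixel foreground probabilities.
    target : per-pixel ground-truth labels (0 or 1).
    """
    inter = np.sum(pred * target)                    # |X ∩ Y| as a mask operation
    return 1.0 - (2.0 * inter) / (np.sum(pred) + np.sum(target) + eps)

def combined_loss(pred, target, alpha=0.5, gamma=2.0, eps=1e-7):
    """Weighted sum of Dice loss and mean focal loss.

    alpha = 0.5 is an illustrative choice, not a value from the paper.
    """
    # probability assigned to the true class of each pixel
    p_true = np.where(target == 1, pred, 1.0 - pred)
    p_true = np.clip(p_true, eps, 1.0 - eps)
    focal = np.mean(-((1.0 - p_true) ** gamma) * np.log(p_true))
    return alpha * dice_loss(pred, target) + (1.0 - alpha) * focal
```

Because the Dice term is computed over the whole region rather than per pixel, it counteracts the dominance of background pixels, while the focal term keeps per-pixel gradients well behaved on hard examples.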
