*3.1. Network Architecture*

In this paper, the residual neural network ResNet [9] was selected as the backbone network for feature extraction, and DeepLabv3+ [10] and U-Net++ [11] were used as semantic segmentation models for segmenting and quantifying the sorbite content.

The ResNet architecture improves upon VGG-19 by adding residual units through shortcut connections. ResNet-34 adopts 3 × 3 filters throughout and follows two design rules: first, layers producing output feature maps of the same size use the same number of filters; second, when the feature map size is halved, the number of filters is doubled to preserve the per-layer time complexity. The network ends with a global average pooling layer and a 1000-dimensional fully connected layer with softmax. The other residual network variants are derived from this basic structure.
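To make these two design rules concrete, the following minimal PyTorch sketch (illustrative only, not the original implementation; all names are our own) shows a ResNet-34-style basic block with 3 × 3 filters, an identity shortcut, and a projection shortcut that doubles the channels when the feature map is halved:

```python
import torch
from torch import nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions plus a shortcut connection (ResNet-34 style)."""
    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)
        # When the feature map is halved (stride 2), the shortcut is projected
        # with a 1x1 convolution so that the channel counts match.
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))  # residual addition

# Halving the spatial size while doubling the channels, per the design rules:
block = BasicBlock(in_ch=64, out_ch=128, stride=2)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 128, 28, 28])
```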

The U-Net++ network structure, shown in Figure 2, consists of several parts: convolution units, down-sampling and up-sampling modules, and skip connections between the convolution units. In the U-Net structure, node $X^{0,4}$ forms a skip connection only with node $X^{0,0}$, whereas in the U-Net++ structure, node $X^{0,4}$ combines the outputs of the four convolution units $X^{0,0}$, $X^{0,1}$, $X^{0,2}$, and $X^{0,3}$ at the same layer, where each node $X^{i,j}$ represents one convolution on down-sampled or deconvolution up-sampled features. The U-Net++ network has a nested structure with dense skip paths, which helps aggregate features of different semantic scales in the decoder subnetwork, and it has achieved excellent performance in other fields [12,13]. In this work, after grayscale processing, the sorbite differed markedly from the other microstructural constituents, which made this architecture intuitively well suited to the problem.

**Figure 2.** Schematic diagram of the U-Net++ network structure [11], with permission from Springer, 2023.

Starting from node $X^{0,0}$ in the first layer of the model, the output of each node is calculated in turn according to the following formula:

$$x^{i,j} = \begin{cases} \mu\left(x^{i-1,j}\right), & j = 0 \\ \mu\left(\left[\left[x^{i,k}\right]_{k=0}^{j-1}, T\left(x^{i+1,j-1}\right)\right]\right), & j > 0 \end{cases} \tag{1}$$

where $\mu$ denotes the convolution unit, $[\,\cdot\,]$ denotes feature concatenation, $T$ denotes deconvolution up-sampling, and $x^{i,j}$ denotes the output of node $X^{i,j}$, in which $i$ indexes the down-sampling layer along the encoder and $j$ indexes the convolution layer along the dense block of the skip path.
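As an illustration of Equation (1), the following sketch computes one node of the $j > 0$ branch in PyTorch. The exact composition of the convolution unit $\mu$ is an assumption of this example, since this section does not fix it:

```python
import torch
from torch import nn

def conv_unit(in_ch: int, out_ch: int) -> nn.Module:
    """mu in Equation (1): a small convolution block (a common choice;
    the exact composition is an assumption of this sketch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        nn.ReLU(inplace=True),
    )

def up(in_ch: int, out_ch: int) -> nn.Module:
    """T in Equation (1): deconvolution up-sampling that doubles the size."""
    return nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)

# Computing node x^{0,1} (j > 0 branch): concatenate x^{0,0} with the
# up-sampled x^{1,0}, then apply the convolution unit mu.
x00 = torch.randn(1, 32, 64, 64)   # output of node X^{0,0}
x10 = torch.randn(1, 64, 32, 32)   # output of node X^{1,0}, one level deeper

t = up(64, 32)(x10)                                   # T(x^{1,0}) -> (1, 32, 64, 64)
x01 = conv_unit(64, 32)(torch.cat([x00, t], dim=1))   # mu([x^{0,0}, T(x^{1,0})])
print(x01.shape)  # torch.Size([1, 32, 64, 64])
```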

The skip connections introduce high-resolution information from the image into the up-sampling results, thereby ensuring high segmentation accuracy. Taking the first layer of the model as an example, the skip connection results are shown in Figure 3.

**Figure 3.** Schematic diagram of the U-Net++ skip connection results [11], with permission from Springer, 2023.

Deep supervision was introduced into the U-Net++ network model. The perception of shallow image features can be enhanced by deconvolution up-sampling of the results $x^{i,0}$ obtained from down-sampling at each level, after which the final up-sampled segmentation results $x^{0,j}$ corresponding to each level are added to the training loss calculation [14]. The segmentation loss function designed accordingly is given by the following formula:

$$\text{Loss} = \frac{1}{J} \sum_{j=1}^{J} L_j\left(x^{0,j}\right) \tag{2}$$

where Loss is the total segmentation loss, $L_j$ is the loss function used to calculate the segmentation loss of $x^{0,j}$, and $J$ is the number of nodes in the first layer of the model excluding the down-sampling node $X^{0,0}$.
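A minimal sketch of Equation (2) follows. Binary cross-entropy is used for the per-node loss $L_j$ purely as a placeholder, since this section does not specify it:

```python
import torch
from torch import nn

# Placeholder for L_j in Equation (2); the paper does not fix the per-node
# loss in this section, so binary cross-entropy is an assumption here.
criterion = nn.BCEWithLogitsLoss()

def deep_supervision_loss(outputs: list, target: torch.Tensor) -> torch.Tensor:
    """outputs = [x^{0,1}, ..., x^{0,J}], the supervised first-layer nodes."""
    losses = [criterion(x0j, target) for x0j in outputs]
    return sum(losses) / len(losses)  # (1/J) * sum_j L_j(x^{0,j})

# Four supervised outputs (J = 4 for the U-Net++ structure in Figure 2):
outs = [torch.randn(1, 1, 64, 64) for _ in range(4)]
mask = torch.randint(0, 2, (1, 1, 64, 64)).float()
print(deep_supervision_loss(outs, mask))
```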

Compared with the U-Net++ network, the DeepLabv3+ network greatly reduces GPU memory consumption at run time under the same settings while slightly improving performance, which also makes it suitable for this problem. The structure of the DeepLabv3+ network is shown in Figure 4. The network adopts an encoder-decoder structure built on the DeepLabv3 network: DeepLabv3 serves as the encoder to extract and fuse multi-scale features, and a simple decoder is added to further merge low-level features with high-level features, improving the accuracy of the segmentation boundary and recovering more detail. The result is a new network that combines atrous spatial pyramid pooling (ASPP) [15] with an encoder-decoder structure.

**Figure 4.** Schematic diagram of the DeepLabv3+ network structure [10], with permission from Springer, 2023.
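The decoder-side fusion described above can be sketched as follows; the channel sizes and module names are assumptions of this example rather than the authors' implementation. The high-level ASPP output is up-sampled to the resolution of the low-level features, concatenated with them, and refined to sharpen the segmentation boundary:

```python
import torch
from torch import nn
import torch.nn.functional as F

class DecoderSketch(nn.Module):
    """Merges low-level encoder features with the high-level ASPP output,
    in the spirit of the DeepLabv3+ decoder (channel sizes are assumptions)."""
    def __init__(self, low_ch: int = 256, aspp_ch: int = 256, num_classes: int = 2):
        super().__init__()
        self.reduce = nn.Conv2d(low_ch, 48, 1)   # shrink low-level channels
        self.refine = nn.Sequential(
            nn.Conv2d(aspp_ch + 48, 256, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, num_classes, 1),      # per-pixel class scores
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        # Up-sample the high-level features to the low-level resolution,
        # concatenate, and refine to recover boundary detail.
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        return self.refine(torch.cat([self.reduce(low), high], dim=1))

dec = DecoderSketch()
low = torch.randn(1, 256, 128, 128)   # early-backbone features (fine detail)
high = torch.randn(1, 256, 32, 32)    # ASPP output (coarse semantics)
print(dec(low, high).shape)           # torch.Size([1, 2, 128, 128])
```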

Atrous convolution is the core of the DeepLab family of models [10,16,17]; it is a convolution method that enlarges the receptive field and facilitates the extraction of multi-scale information. Atrous convolution inserts holes into an ordinary convolution kernel, with an added dilation rate parameter controlling the size of the receptive field. As shown in Figure 5, taking a 3 × 3 convolution as an example, the grey cells represent the 3 × 3 convolution kernel, and the receptive field of the ordinary convolution is 3. When rate = 2, the receptive field of the atrous convolution is 5, an increase of 2 over the ordinary convolution; when rate = 3, the receptive field is 7, an increase of 4. Owing to the gridding effect of atrous convolution, some image information is lost after the atrous convolution operation.
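The receptive-field values quoted above follow from the effective kernel size of a dilated filter:

$$k_{\text{eff}} = k + (k - 1)(\text{rate} - 1)$$

so that for $k = 3$ the effective size is 3, 5, and 7 at rate = 1, 2, and 3, respectively.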

Specifically, for two-dimensional signals, for each position $i$ on the output feature map $y$ and convolution filter $w$, atrous convolution is applied to the input feature map $x$ as follows:

$$y[i] = \sum_{k} x[i + \text{rate} \cdot k]\, w[k] \tag{3}$$
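The following sketch demonstrates Equation (3) using PyTorch's built-in dilation parameter; setting the padding equal to the rate to preserve the output size is a choice of this example, not a requirement of the method:

```python
import torch
from torch import nn

# A 3x3 atrous convolution per Equation (3): with dilation rate r, the filter
# taps are spaced r pixels apart, so the effective receptive field is
# 3 + 2*(r - 1) in each dimension (3, 5, 7 for r = 1, 2, 3).
x = torch.randn(1, 1, 32, 32)
for rate in (1, 2, 3):
    # padding=rate keeps the output the same spatial size as the input
    conv = nn.Conv2d(1, 1, kernel_size=3, dilation=rate, padding=rate)
    print(rate, conv(x).shape)  # all (1, 1, 32, 32)
```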

Atrous spatial pyramid pooling (ASPP) applies atrous convolutions with different dilation rates in parallel and fuses the results, which compensates for the defects of a single atrous convolution, captures multi-scale context, reduces the probability of information loss, and helps improve the accuracy of convolutional neural networks.
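A minimal ASPP sketch along these lines is given below; the dilation rates (6, 12, 18) follow the common DeepLabv3 configuration and should be read as assumptions rather than the authors' exact settings:

```python
import torch
from torch import nn
import torch.nn.functional as F

class ASPPSketch(nn.Module):
    """Parallel atrous convolutions at several rates, plus image-level pooling,
    concatenated and fused by a 1x1 convolution (rates are assumptions)."""
    def __init__(self, in_ch: int = 256, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        branches = [nn.Conv2d(in_ch, out_ch, 1)]  # 1x1 branch
        branches += [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r)
                     for r in rates]
        self.branches = nn.ModuleList(branches)
        self.image_pool = nn.Sequential(          # image-level context
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1)
        )
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=x.shape[2:],
                               mode="bilinear", align_corners=False)
        # Fuse the multi-scale context captured at the different rates.
        return self.project(torch.cat(feats + [pooled], dim=1))

print(ASPPSketch()(torch.randn(1, 256, 32, 32)).shape)  # (1, 256, 32, 32)
```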

**Figure 5.** Schematic diagram of the atrous convolution receptive field.
