## *2.3. Image Dataset*

The apple images used in this research were captured by the machine vision system shown in Figure 1. Before capture, the apples were placed on separate fruit trays, which moved with the conveyor belt. When an apple passed through the lighting chamber, the camera on top of the chamber automatically captured its image under the control of a hardware trigger signal. The grading software then read the image from the camera buffer and saved it. In total, three thousand apple images were obtained as the dataset for this research. All apple images were 400 pixels × 336 pixels in size. Before training, the resize function of OpenCV was used to resize the input images to 512 × 512 pixels.

An open-source annotation tool, LabelMe, was used to semantically label the captured apple defect images and establish a standard semantic label dataset. Meanwhile, LabelImg was used to mark the stem, calyx and defect regions in the apple images. In total, 2400 images were selected as the training set of the network and the remaining 600 as the validation set.
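The 2400/600 split can be sketched as follows. The text does not say how the images were selected, so the seeded random shuffle here is an assumption, and the file names are hypothetical placeholders.

```python
import random


def split_dataset(paths, n_train=2400, seed=0):
    """Shuffle the paths reproducibly and split them into train and validation lists."""
    shuffled = list(paths)
    # A seeded shuffle is an assumption; the paper does not describe the selection rule.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the 3000 captured images.
paths = [f"apple_{i:04d}.png" for i in range(3000)]
train, val = split_dataset(paths)
print(len(train), len(val))  # 2400 600
```

Fixing the seed keeps the split identical across runs, which makes validation results comparable between experiments.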

## *2.4. Apple Surface Defect Detection Based on BiSeNet V2*

To obtain a lightweight network model with fewer parameters, many researchers have sought a balance among the amount of computation, the number of parameters and accuracy, hoping to achieve high detection accuracy with as few computations and parameters as possible [19]. In semantic segmentation, the computation cost can be decreased by reducing either the image size or the complexity of the model.

Reducing the image size directly reduces the amount of computation, but the image loses many details, which lowers accuracy. Reducing the complexity of the model, in turn, weakens its feature extraction ability, which lowers segmentation accuracy. Therefore, it is quite challenging to apply a lightweight model to a semantic segmentation task while maintaining both accuracy and real-time performance.

The BiSeNet network strikes a reasonable balance between real-time performance and accuracy [20]. It was therefore used in this research, and its architecture is shown in Figure 2.

**Figure 2.** The architecture of the BiSeNet V2 network.

The BiSeNet V2 network is divided into three main components: the two-pathway backbone (green dashed box), which consists of a detail branch (purple cubes) and a semantics branch (pink cubes); the booster component (blue dashed box); and the aggregation layer (red dashed box). C1, C2 and C3 indicate the channels of the detail branch. The context embedding block, which serves as the output of the semantics branch, is in its last stage. Down and up represent the down-sampling and up-sampling operations, respectively. The sigmoid function and the elementwise product are represented by ϕ and ⊗, respectively.

The detail branch is characterized by shallow layers and wide channel dimensions; its small receptive field preserves spatial detail, so it generates high-resolution feature representations and captures low-level detail. The semantics branch, with deep layers and narrow channel dimensions, has a large receptive field and captures high-level categorical semantics. The aggregation layer compensates for the gaps in semantic level and resolution between the two branches. The initialization parameters of the BiSeNet V2 network are shown in Table 1.
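To make the roles of ϕ and ⊗ concrete, the following is a deliberately simplified sketch of one guidance path: a low-resolution semantic map is up-sampled, passed through a sigmoid, and used to gate a high-resolution detail map by elementwise product. It is not the network's full aggregation layer, since it omits the convolutions and any other paths; the function names, tensor shapes and nearest-neighbour up-sampling are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """The sigmoid function, written as phi in the figure description."""
    return 1.0 / (1.0 + np.exp(-x))


def upsample_nearest(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour up-sampling of a (C, H, W) array along H and W."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def gate_detail_with_semantics(detail, semantic, factor):
    """Gate detail features with sigmoid-activated, up-sampled semantic features."""
    gate = sigmoid(upsample_nearest(semantic, factor))
    return detail * gate  # elementwise product


# Toy feature maps: high-resolution detail, low-resolution semantics.
detail = np.ones((2, 8, 8), dtype=np.float32)
semantic = np.zeros((2, 2, 2), dtype=np.float32)
fused = gate_detail_with_semantics(detail, semantic, factor=4)
print(fused.shape)  # (2, 8, 8)
```

With an all-zero semantic map the gate equals 0.5 everywhere, so every detail value is halved; larger semantic responses push the gate toward 1 and let the corresponding detail features through.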

**Table 1.** The initialization parameters of the BiSeNet V2 network.


Because defects were the region of interest in the apple images, and to ensure real-time detection, each apple image was segmented into only two classes: defect region and background region. The segmentation result of BiSeNet V2 was represented by a binary image $I_B$. The gray value of the defect region was set to $B_V$, where $B_V \neq 0$, and the gray value of the background region was set to 0. In practical applications, there might be multiple defect regions in an apple image. Therefore, $R_B$ ($R_B \in \{R_{b1}, R_{b2}, \ldots, R_{bn}\}$) was used to store the position values of the different defect regions, where $n$ was the total number of defects obtained by the BiSeNet V2 model in the apple image.
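One way to obtain the individual defect regions from the binary image is connected-component labelling. The sketch below is an assumption about how this could be done, not the authors' implementation: it uses 4-connectivity, represents each region's position as a bounding box, and takes 255 as the non-zero defect gray value. The function name `extract_defect_regions` is hypothetical.

```python
from collections import deque

import numpy as np


def extract_defect_regions(mask: np.ndarray) -> list:
    """Return one dict (bounding box and pixel area) per 4-connected non-zero region."""
    height, width = mask.shape
    visited = np.zeros((height, width), dtype=bool)
    regions = []
    for r0 in range(height):
        for c0 in range(width):
            if mask[r0, c0] == 0 or visited[r0, c0]:
                continue
            # Breadth-first flood fill starting from an unvisited defect pixel.
            queue = deque([(r0, c0)])
            visited[r0, c0] = True
            rows, cols = [], []
            while queue:
                r, c = queue.popleft()
                rows.append(r)
                cols.append(c)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and mask[nr, nc] != 0 and not visited[nr, nc]:
                        visited[nr, nc] = True
                        queue.append((nr, nc))
            regions.append({
                # (min_row, min_col, max_row, max_col), all inclusive
                "bbox": (min(rows), min(cols), max(rows), max(cols)),
                "area": len(rows),
            })
    return regions


DEFECT_VALUE = 255  # assumed non-zero gray value for defect pixels
mask = np.zeros((6, 8), dtype=np.uint8)
mask[1:3, 1:3] = DEFECT_VALUE  # a 2 x 2 defect
mask[4, 5:8] = DEFECT_VALUE    # a 1 x 3 defect
regions = extract_defect_regions(mask)
print(len(regions))  # 2
```

The length of the returned list corresponds to the number of defects, and each entry carries the position and area of one region. For full-size images, a library routine such as OpenCV's `connectedComponentsWithStats` would be considerably faster than this pure-Python loop.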

The overall goal of this study was to achieve quick and accurate online grading of defective apples. Therefore, it was necessary to further calculate the area and number of defects on each defective apple. Finally, the grade of the apple could be determined by comparing the defect information with the grading standard.
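The comparison between defect information and a grading standard can be sketched as a simple rule. The grading standard itself is not given in this section, so the thresholds, grade labels and function name below are purely hypothetical placeholders that show the shape of the decision only.

```python
def grade_apple(defect_areas_px, max_defects, max_total_area_px):
    """Assign a grade from the number of defects and their total area in pixels.

    The thresholds and grade labels are hypothetical, not taken from any standard.
    """
    n_defects = len(defect_areas_px)
    total_area = sum(defect_areas_px)
    if n_defects == 0:
        return "defect-free"
    if n_defects <= max_defects and total_area <= max_total_area_px:
        return "minor defects"
    return "reject"


# Hypothetical limits: at most 2 defects and 50 defect pixels in total.
print(grade_apple([], max_defects=2, max_total_area_px=50))            # defect-free
print(grade_apple([4, 3], max_defects=2, max_total_area_px=50))        # minor defects
print(grade_apple([40, 30], max_defects=2, max_total_area_px=50))      # reject
```

In practice the pixel areas would first be converted to physical units using the camera calibration, since grading standards are normally stated in physical area rather than pixels.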
