#### 2.2.3. Contour-Resampling-Based Fully Connected Network

Because the information density of the binarized contour image samples is extremely low, using a CNN to classify such images wastes computational capacity, so an alternative approach was explored.

Each contour image sample in the dataset can be represented as a set of contour-pixel coordinates containing only twice as many elements as there are contour pixels (one horizontal and one vertical coordinate per pixel). Counting the contour pixels in several 480 × 640 contour maps showed that each contour contains fewer than 800 pixels, whereas the full contour map contains up to 307,200 pixels. Therefore, the set of contour point coordinates of each image sample can be used as the model input, and a fully connected neural network can likewise solve the orientation recognition problem for garlic clove contour images. Although this method adds a step of extracting contour points from the captured image, the additional computation is very small, and with the help of OpenCV the extraction process is easy to implement.

#### Uniform Input Size

It is very difficult to implement variable-length input for a neural network. Because each contour image sample contains a different number of contour pixels, the number of contour points per sample must first be unified. Hence, an equidistant sampling method was used to sample a fixed number of point coordinates from the contour of each image; a polygon was then drawn through the sampled points and its ability to reconstruct the original sample was assessed visually (Figure 12). It was found that when 50 contour points were sampled, the resulting polygon was very close to the shape of the original sample. Using this sampling method, sets of 50, 100, and 200 contour points were collected from all contour image samples and combined with the orientation labels of the samples to form the training dataset of the fully connected model. A fully connected model with three Hidden layers of 512, 256, and 128 neurons was used for testing. No significant difference in recognition rate was found between 200, 100, and 50 sampling points, but fewer sampling points effectively reduce the number of parameters and the computation of the model, so 50 points is preferable.

**Figure 12.** The ability of contour points with different sampling rates to restore the original contour.
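The equidistant sampling step can be sketched as follows; the circular contour here is a hypothetical stand-in for an ordered contour returned by OpenCV.

```python
import numpy as np

def resample_contour(points: np.ndarray, m: int = 50) -> np.ndarray:
    """Equidistantly sample m points from an ordered [n, 2] contour.

    Indices are spaced evenly along the ordered point sequence, so the
    polygon through the sampled points approximates the original shape.
    """
    n = len(points)
    idx = np.linspace(0, n, num=m, endpoint=False).astype(int)
    return points[idx]

# Hypothetical ordered contour: 400 points on a circle.
theta = np.linspace(0, 2 * np.pi, 400, endpoint=False)
circle = np.stack([320 + 100 * np.cos(theta), 240 + 100 * np.sin(theta)], axis=1)

sampled = resample_contour(circle, 50)
print(sampled.shape)  # (50, 2)
```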

The contour point set returned by the `findContours` function of OpenCV is a matrix of shape [*n*, 2], where *n* is the number of contour points; after sampling, the matrix has shape [*m*, 2], where *m* is the number of sampling points and the dimension of length 2 holds the horizontal and vertical coordinates of each contour point. For the fully connected model discussed in this section, the input of each layer should be a one-dimensional vector, so the contour point set must be flattened. There are two ways to flatten it: from the point dimension, or from the coordinate dimension. The first method was chosen (corresponding to the Keras Flatten layer with the data\_format parameter set to "channels\_last").
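The difference between the two flattening orders can be seen on a toy [*m*, 2] point set (a sketch; the values are arbitrary):

```python
import numpy as np

pts = np.array([[1, 10], [2, 20], [3, 30]])  # toy sampled contour, m = 3

# Flatten from the point dimension (the chosen method, matching Keras
# Flatten with data_format="channels_last"): the coordinates of each
# point stay adjacent.
by_point = pts.reshape(-1)

# Flatten from the coordinate dimension: all x values first, then all y.
by_coord = pts.T.reshape(-1)

print(by_point.tolist())  # [1, 10, 2, 20, 3, 30]
print(by_coord.tolist())  # [1, 2, 3, 10, 20, 30]
```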

#### Structure of Fully Connected Model

Testing several fully connected models defined with the Keras API showed that when the model had fewer than three Hidden layers, adding fully connected layers was effective; beyond three layers, adding more did not significantly improve the recognition rate. Using more neurons can improve model performance, but it greatly increases the number of parameters, making the model bloated. Preliminary test results for typical models are shown in Table 3. The model with 4096, 2048, and 1024 neurons in the Hidden layers reached an accuracy of 0.97893; the model with 1024, 512, and 256 neurons reached 0.97465; and the model with 512, 256, and 128 neurons reached 0.97241.

**Table 3.** Overview of fully connected models.


The structure of the fully connected model is shown in Figure 13. Each fully connected layer includes a batch normalization [23] layer. It is particularly noteworthy that adding a batch normalization layer after the Flatten layer greatly improves the convergence speed of the model.

**Figure 13.** Fully connected model. Note: both the Flatten and Hidden layers are followed by a Batch Normalization layer, but only the Batch Normalization layers of the Hidden layers apply the activation function. N1, N2, and N3 are the to-be-determined numbers of Hidden layer neurons.
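A minimal Keras sketch of the Figure 13 topology, assuming 50 sampled points (input shape [50, 2]), hidden sizes N1 = 512, N2 = 256, N3 = 128, and four orientation classes; ReLU stands in here for the h-swish activation discussed later.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_fc_model(m=50, hidden=(512, 256, 128), classes=4):
    inp = layers.Input(shape=(m, 2))
    x = layers.Flatten(data_format="channels_last")(inp)
    # Batch Normalization directly after Flatten (no activation), which the
    # text reports greatly speeds up convergence.
    x = layers.BatchNormalization()(x)
    for n in hidden:
        x = layers.Dense(n, use_bias=False)(x)
        x = layers.BatchNormalization()(x)  # BN of Hidden layers applies the
        x = layers.Activation("relu")(x)    # activation (stand-in for h-swish)
    out = layers.Dense(classes, activation="softmax")(x)
    return models.Model(inp, out)

model = build_fc_model()
print(model.output_shape)  # (None, 4)
```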

#### 2.2.4. Model Optimization

In order to obtain faster computing speed and higher accuracy, the three deep learning models were extensively optimized; the optimization directions include model lightweighting and training tuning.

#### Implementation of Lightweight Convolutional Model

It was found that when the input image of MobileNetV3 was reduced to 60 × 80, the recognition rate of the model decreased significantly, whereas in the corresponding test of the naive CNN model, the 60 × 80 input did not greatly reduce the performance of the model.

The stride of the first standard convolutional layer of the MobileNetV3 model is 2; when it was modified to 1, performance on the small 60 × 80 input improved. Removing the 1 × 1 standard convolutional layer before the Global Average Pooling [24] layer did not reduce the recognition rate, but it reduced the number of parameters and the computation of the model and improved its convergence speed.

For the naive CNN model, the actual receptive field of a 3 × 3 convolutional kernel at an input size of 60 × 80 is larger than that of a 5 × 5 kernel at an input size of 120 × 160. Since the edge of the garlic contour image is empty background, the convolutional layers in the first two convolutional-pooling modules could be switched to valid padding. With these adjustments and the halved input size, two convolutional-pooling modules could be removed while reaching the same feature map size, which greatly reduces the number of parameters and computations in the model. Since the model input is a single-channel image, using standard convolution at the input adds very little computation and very few parameters compared with depthwise separable convolution, and testing showed this change slightly improves performance. The naive CNN model structure obtained after these optimizations is shown in Figure 14.

**Figure 14.** Naive CNN model after lightweighting.
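The lightweighting ideas above can be sketched in Keras as follows; this is only an illustration of the principles (standard convolution at the single-channel input, valid padding in the first blocks, depthwise separable convolutions deeper in), not the exact layer and filter counts of Figure 14.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_light_cnn(classes=4):
    return models.Sequential([
        layers.Input(shape=(60, 80, 1)),
        # Standard convolution at the single-channel input (cheap here) with
        # valid padding, since the image border is empty background.
        layers.Conv2D(16, 3, padding="valid", activation="relu"),
        layers.MaxPooling2D(),
        layers.SeparableConv2D(32, 3, padding="valid", activation="relu"),
        layers.MaxPooling2D(),
        layers.SeparableConv2D(64, 3, padding="same", activation="relu"),
        layers.GlobalAveragePooling2D(),
        layers.Dense(classes, activation="softmax"),
    ])

model = build_light_cnn()
print(model.output_shape)  # (None, 4)
```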

#### Model Training and Tuning

Three optimizers, Adam [25], Nadam [26], and SGD, were tested in model training. The final convergence results of the Adam optimizer were unstable across repeated training runs of the same model. Nadam and SGD were more stable than Adam, but Nadam had the greatest computational complexity of the three and was the slowest. SGD is theoretically less efficient than Adam and Nadam, but the fully connected model proposed in Section 2.2.3 converged stably even with the SGD learning rate set to 1.0 or higher, so both convergence speed and convergence stability could be guaranteed. In addition, SGD had the lowest computational complexity of the three and the fastest computation.
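The stable large-learning-rate setting reported above corresponds to a one-line Keras optimizer configuration (a sketch; the model itself is assumed to be defined elsewhere):

```python
import tensorflow as tf

# SGD at the unusually large learning rate the text reports as stable for
# the fully connected model of Section 2.2.3.
optimizer = tf.keras.optimizers.SGD(learning_rate=1.0)
# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])
print(float(optimizer.learning_rate))  # 1.0
```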

Five activation functions, Tanh, ReLU6 [27], GELU [28], Swish [29], and h-swish [17], were tested; the convergence curves for the fully connected model are shown in Figure 15. With the SGD optimizer, Swish and h-swish performed best (1000-epoch validation set accuracies of 0.97611 and 0.97586, respectively). Because h-swish has lower computational complexity than Swish and is more reliable under weight quantization, the fully connected model uses the h-swish activation function.

**Figure 15.** Convergence curves of different activation functions. Note: the above convergence curves were all measured on the fully connected model proposed in Section 2.2.3, and the range shown in the figure is 100 to 1000 epochs.
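For reference, h-swish replaces the sigmoid in Swish with a cheap piecewise-linear ReLU6 approximation; a NumPy sketch:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, the piecewise-linear
    # approximation of Swish used in MobileNetV3.
    return x * relu6(x + 3.0) / 6.0

# Matches the identity for x >= 3 and zero for x <= -3.
print(float(h_swish(np.float64(4.0))))   # 4.0
print(float(h_swish(np.float64(-4.0))))  # -0.0
```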

For the garlic seed contour dataset, the training loss of all three model types approaches 0 late in training and training set accuracy approaches 100%, yet validation set accuracy differs: an overfitting phenomenon on which the loss flooding method [30] has a significant effect. The idea of this method is to keep the training loss above a threshold delta, so that the model keeps learning and may converge to a better-performing state. The optimization landscape may contain many local optima, and the random-walk behavior induced by loss flooding requires the optimizer to take large enough steps to escape them. When the model reaches a good state, the weights need to be saved promptly so the state is not missed. In the later stages of a finite number of training iterations, the probability that the random walk finds a better state becomes very low, but continuing to train with a smaller learning rate can often improve performance again in the short term, so combining learning rate decay with the loss flooding method is very effective. To ensure the model is in a good state when learning rate decay is triggered, a routine was written to reload the weights saved at the most recent improvement each time the learning rate decays.
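The flooding objective itself is a one-line transform of the loss; a small sketch, with the delta value assumed:

```python
# Sketch of loss flooding [30]: the effective objective is
# |loss - delta| + delta. Above the flood level delta, gradients are
# unchanged; below it, the sign flips and the optimizer "walks" back up,
# which keeps the training loss hovering around delta.
def flooded_loss(loss: float, delta: float = 0.1) -> float:
    return abs(loss - delta) + delta

print(round(flooded_loss(0.5), 2))   # 0.5  -> unchanged above the flood level
print(round(flooded_loss(0.02), 2))  # 0.18 -> pushed back above delta
```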

The LSR [31] method was also used. Applied alone with a label\_smoothing parameter of 0.2, it yielded a validation set accuracy comparable to the loss flooding method with a delta of 0.1. When the loss flooding method was combined with LSR, with the delta and label\_smoothing parameters set to 0.7 and 0.2, the validation set accuracy improved slightly, although chance effects cannot be ruled out.
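The LSR transform can be sketched directly; the four classes match the four orientation categories, and eps = 0.2 matches the label\_smoothing setting above.

```python
import numpy as np

def smooth_labels(one_hot: np.ndarray, eps: float = 0.2) -> np.ndarray:
    # LSR: mix the one-hot target with a uniform distribution over K classes:
    # (1 - eps) * one_hot + eps / K.
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

target = np.array([0.0, 0.0, 1.0, 0.0])
print(smooth_labels(target))  # [0.05 0.05 0.85 0.05]
```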

On top of the above methods, L1/L2 regularization and Dropout [32] regularization were also tried. L1/L2 regularization was effective for the convolutional models but not for the fully connected models. Dropout regularization is simple and blunt, but it significantly improved the performance of the fully connected model.

#### *2.3. Application Method in Embedded System*

The application of deep learning models in seeders requires some additional support and control programs. First, the deep learning model gives a direction judgment for any input image, including frames in which no garlic seed is present; therefore, to avoid meaningless direction judgments and device actuation, each collected frame must first be checked for the presence of a garlic seed. Second, since the deep learning models constructed in this paper classify based on the garlic seed contour image or its sampled point set, an additional program is required to extract the binarized contour image or to resample the contour points. A flow chart of the complete orientation judgment process is shown in Figure 16.

**Figure 16.** Flow chart of orientation judgment procedure.

Under backlight illumination, whether a garlic seed is passing can be judged by monitoring the change in the average pixel value of the central area of the camera's field of view. Figure 17 shows the relationship between this average pixel value and the position of a garlic seed in the field of view. When a garlic seed passes through the field of view, multiple frames are captured; among the frames containing the single garlic seed, the frame with the lowest average pixel value in the central area is selected as the optimal image frame. This process is shown in Figure 18.

**Figure 17.** Relationship between the pixel mean value of the central area of the field of view and the position of garlic seeds.

**Figure 18.** Frame retrieval flow chart.
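The frame-selection logic can be sketched with synthetic frames; the central-region fraction and the pixel values are assumptions for illustration:

```python
import numpy as np

def central_mean(frame: np.ndarray, frac: float = 0.5) -> float:
    """Mean pixel value of the central region of a grayscale frame."""
    h, w = frame.shape
    dh, dw = int(h * frac / 2), int(w * frac / 2)
    return float(frame[h // 2 - dh:h // 2 + dh, w // 2 - dw:w // 2 + dw].mean())

# Three synthetic 480 x 640 backlit frames; the seed (dark region) occupies
# the most central area in frame 1, which should therefore be selected.
frames = [np.full((480, 640), 250.0) for _ in range(3)]
frames[0][200:280, 300:340] = 30.0   # seed entering the field of view
frames[1][180:300, 280:360] = 30.0   # seed centered
frames[2][220:260, 310:330] = 30.0   # seed leaving

best = min(range(len(frames)), key=lambda i: central_mean(frames[i]))
print(best)  # 1
```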

After obtaining the optimal image frame, brightness compensation is performed, the image is binarized, and the contours are extracted from the binarized image. The output of the contour extraction algorithm is a set of contour points. For the CNN model that takes the contour image as input, the set of contour points needs to be drawn as a contour image. For the fully connected model that takes the set of contour points as input, the number of contour point coordinates must be reduced by sampling to the number required by the model input.

#### **3. Results and Discussion**

#### *3.1. Model Test and Result*

After the series of optimization operations, some typical models were retrained; their performance is shown in Table 4. Of these models, the transfer learning model based on MobileNetV3-Large has the highest validation set recognition rate, 98.71%, but compared with the other models in Table 4 it is too bloated. The recognition rate of the standard-width MobileNetV3-Small model is second only to MobileNetV3-Large, but its parameter count and computation are still too large. The naive CNN model in Figure 14 performs better than MobileNetV3-Small with a reduced width factor and close to the standard-width MobileNetV3-Small, while having the lowest number of parameters among all the models in the table. The fully connected model with 512, 256, and 128 neurons in the Hidden layers achieves almost the same accuracy as the naive CNN model at extremely low computational and parameter cost; it is the fastest and the most cost-effective in application.

**Table 4.** Performance of the optimized model.


The \* in Table 4 indicates that the item was not tested.

The last column of Table 4 shows the macro F1 scores of the models. The F1 scores of the four models are almost equal to their accuracies, indicating that the recognition rates of the models across the four orientation categories are very balanced. The ROC curves and AUC values of the models also support this: the ROC curves of the four orientations are almost identical, overlapping in the graph and difficult to distinguish, as shown in Figure 19. Meanwhile, the macro-average AUC and the AUC of each class are close to 1, indicating that the models recognize each orientation class very well.

Based on the program flow introduced in Section 2.3, representative experimental models were selected and converted to TFLite format for speed testing on an OrangePi 3 LTS. The test results are shown in Table 5. For the three CNN models in the table, their own computation is the dominant cost, so adding the complete processing flow has little impact on their speed. The fully connected model itself involves simple computation and fast inference, but the support program's ability to supply input data to the model is limited; it finally reached a speed of about 151.40, which is still more than 50% faster than the fastest CNN model.

**Figure 19.** ROC and AUC for models with recognition rates over 98%.


**Table 5.** Inference speed of the model on OrangePi 3 LTS.
