2.2.2. Naive CNN Model

Compared with classic CNN models such as AlexNet [18] and VGGNet [19], MobileNetV3 has a relatively complex structure and incorporates several advanced designs: depthwise separable convolution [20], the inverted residual block [21], the squeeze-and-excitation block [22], and the h-swish activation function [17]. With these designs, MobileNetV3 achieves excellent classification performance on large-scale natural image datasets. The garlic contour image, however, is very different from a natural image: as a binary image, its content density and information density are very low. To explore which of MobileNetV3's designs contribute most to the classification of garlic contour images, a set of experiments was conducted. Modified models that separately applied the squeeze-and-excitation module, the h-swish activation function, and the 5 × 5 convolution kernel were trained, and this study found that, among the three, the 5 × 5 convolution kernel had the greatest impact on model performance, while squeeze-and-excitation and h-swish had little effect. After establishing the importance of convolution kernel size, a series of naive CNN models with a VGG-like structure were constructed and compared with MobileNetV3 to analyze the impact of the inverted residual block on model performance and to further verify the importance of convolution kernel size.
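For reference, the h-swish activation used by MobileNetV3 replaces the sigmoid in swish with a cheap piecewise-linear ReLU6 approximation. A minimal NumPy sketch (the function names here are illustrative, not from the study's code):

```python
import numpy as np

def relu6(x):
    # ReLU6 clamps activations to the range [0, 6]
    return np.minimum(np.maximum(x, 0.0), 6.0)

def h_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6, a hardware-friendly
    # approximation of swish(x) = x * sigmoid(x) that avoids
    # computing an exponential
    return x * relu6(x + 3.0) / 6.0
```

For large positive inputs h-swish behaves like the identity, and it is exactly zero for x ≤ −3, which is what makes it inexpensive on mobile hardware.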

To make the training results comparable, these models were trained under the same conditions as the MobileNetV3 transfer learning experiments. The performance achieved after full convergence is shown in Table 2. Because these models have a simple structure, a naive CNN model with performance similar to MobileNetV3 requires far fewer parameters and far less computation. This seems to indicate that a structurally simple model is better suited to judging the direction of garlic clove contour images, but the naive CNN models still do not match the performance achieved by MobileNetV3-Large transfer learning under the same training strategy.
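The reduction in parameters comes largely from depthwise separable convolution: a k × k standard convolution needs k²·C_in·C_out weights, whereas the depthwise-plus-pointwise factorization needs only k²·C_in + C_in·C_out. A small sketch (the layer sizes are illustrative, not taken from Table 2):

```python
def conv2d_params(k, c_in, c_out):
    # Standard k x k convolution: every output channel has a full
    # k*k*c_in kernel (bias terms omitted for simplicity).
    return k * k * c_in * c_out

def separable_conv2d_params(k, c_in, c_out):
    # Depthwise separable convolution: one k x k filter per input
    # channel, followed by a 1 x 1 pointwise convolution.
    return k * k * c_in + c_in * c_out

# A 5 x 5 convolution over 64 channels, the kernel size favored
# by the experiments:
standard = conv2d_params(5, 64, 64)             # 102,400 weights
separable = separable_conv2d_params(5, 64, 64)  # 5,696 weights
print(standard, separable, standard / separable)
```

This roughly 18× saving is why enlarging the kernel from 3 × 3 to 5 × 5 adds little cost when the convolution is depthwise separable (compare models 2 and 3 in Table 2).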


**Table 2.** List of Naive CNN models.

Note: \* Indicates that after global average pooling the feature map size is already 1 × 1, so calculating Out Stride is no longer meaningful.

The performance of the naive CNN models offers some guidance for model optimization. Comparing model 1 and model 2 in Table 2, doubling the number of channels in every convolutional layer produces a large performance gap: ensuring sufficient model width is one of the key factors for improving performance, but it is costly, as the number of parameters and the amount of computation increase substantially. Comparing model 2 and model 3 further verifies that the 5 × 5 convolution kernel is more efficient than the 3 × 3 kernel, and thanks to depthwise separable convolution the increase in parameters and computation is small. Comparing model 4 and model 5 shows that max-pooling is a more reliable way to down-sample the feature map than a convolutional layer with a stride of two. Comparing model 3 and model 6 shows that the position of the down-sampling layers within the model also affects performance: in general, the lower layers of the model (close to the input) do not need many stacked convolutional layers, while the higher layers (close to the output) benefit from stacking more.
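The design rules above can be sketched as a small VGG-style network. The following PyTorch code is an illustrative reconstruction under stated assumptions, not one of the exact models in Table 2: the layer counts, channel widths, and the two-class output are assumptions made for the example.

```python
import torch
import torch.nn as nn

def separable_conv(c_in, c_out, k=5):
    # 5 x 5 depthwise convolution followed by a 1 x 1 pointwise
    # convolution: the efficient kernel choice found in the experiments.
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),
        nn.Conv2d(c_in, c_out, 1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class NaiveCNN(nn.Module):
    """VGG-style stack: few convolutions near the input, more near the
    output, with max-pooling (not strided convolution) for down-sampling."""
    def __init__(self, num_classes=2):  # two orientation classes (assumed)
        super().__init__()
        self.features = nn.Sequential(
            separable_conv(1, 32),       # binary contour image: 1 channel
            nn.MaxPool2d(2),
            separable_conv(32, 64),
            nn.MaxPool2d(2),
            separable_conv(64, 128),
            separable_conv(128, 128),
            nn.MaxPool2d(2),
            separable_conv(128, 256),
            separable_conv(256, 256),
            separable_conv(256, 256),    # deeper stack near the output
            nn.AdaptiveAvgPool2d(1),     # global average pooling -> 1 x 1
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.classifier(x)
```

A forward pass on a batch of single-channel contour images, e.g. `NaiveCNN()(torch.randn(4, 1, 96, 96))`, yields a `(4, 2)` logit tensor.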
