1. Introduction
The development of agriculture affects the world’s economy and social progress. In particular, the identification of seed varieties is of great significance for the promotion of agricultural development. In recent years, with the development of deep learning technology, the automatic detection of various seeds has been widely applied to crop seed variety classification.
The kidney bean, also known as the string bean, is native to the Americas, in countries such as Mexico. It is rich in nutrients, including protein, vitamins, and many minerals. With the development of breeding technology, an increasing number of kidney bean seed varieties now exist, and many varieties have similar colors and sizes. This situation may lead to varieties being mixed during seed production [1], which directly affects the sowing, harvesting, transportation, and storage of seeds. In such situations, the varieties of seeds need to be identified. To date, few studies have concerned the identification of kidney bean seeds; however, classification studies of other crops can serve as a reference.
Traditional detection methods, such as the molecular identification of DNA markers, can be used to achieve seed classification. However, molecular identification with DNA markers may damage the test object [2], making it unsuitable for the online inspection of seeds on a conveyor belt. In contrast, computer vision techniques can obtain a large amount of information from an image without touching the object, thereby accomplishing nondestructive detection [3,4,5]. Traditional computer vision techniques detect seeds by manually extracting features from images [6,7]. More recently, deep learning methods in the field of computer vision have been used to detect seeds. Deep learning bypasses the manual extraction step and completes the detection of seeds by automatically extracting the features in the image [7,8].
Deep learning is a process that mimics the learning of the human brain, and it originates from artificial neural networks [9]. Deep learning can extract the features of an image through multiple convolutional layers to classify the image. Currently, deep learning is widely used in agriculture to classify crops. In [10,11,12], various popular and improved deep learning classification networks are applied to the task of classifying maize seeds, wheat seeds, and other seeds. Although all of these methods achieve high accuracy rates, most research on seed classification uses image classification networks, which assign a single label to the entire image; the output of the network involves only the class information of the seeds. This approach can complete the classification when there is only one class of seeds in an image. To adapt to dynamic inspection on a conveyor and improve efficiency, multiple classes of seeds should be classified simultaneously in one image [13,14]. In comparison to classification networks, detection networks in deep learning separate the selected targets from the background and assign class and location information to each target. In this way, each seed can be selected and given class and location information, and multiple classes of objects in an image can be successfully classified [15,16,17,18]. Among target-detection algorithms, Yolov3 is a single-stage target-detection algorithm [15]. Considering the requirements for speed and accuracy, the present study used the Yolov3 algorithm for the detection of kidney bean seeds.
Target-detection networks have many parameters. When a model needs to meet speed requirements, its detection accuracy may be degraded; conversely, false and missed detections of seeds can occur when the model does not meet the speed requirements. To solve this problem, the compression of the model is an effective method [19,20,21,22]. The core idea of this solution is the reduction in the model’s parameters and computational complexity [23,24,25]. Model compression can be divided into the categories of shallow and deep compression. Shallow compression includes pruning algorithms [26] and knowledge distillation [27]. Deep compression includes the quantization of the model [28] and the design of lightweight network structures [29]. By comparison, pruning algorithms can increase speed with a smaller reduction in accuracy; therefore, pruning is used in the present study [30]. Inspired by [31], the accuracy of the pruned neural network is improved further by incorporating knowledge distillation into the fine-tuning step. Therefore, in the present study, the pruned Yolov3 is fine-tuned with the aid of knowledge distillation.
Fast and accurate networks are more suitable for accomplishing the detection of seeds in practice. In the present study, the Yolov3 network is trained and improved to meet the requirement of rapid detection, and the improved Yolov3 network is compared to other networks. The contributions of this study are the following: (1) a dataset of kidney bean seeds is obtained by using a self-built image acquisition system, and the Yolov3 model is implemented for the accurate detection of kidney bean seeds; (2) the Yolov3 network is compressed by pruning the unimportant components of the network, improving the detection speed of the network; (3) during compression, the knowledge distillation algorithm is used to assist fine-tuning, allowing the pruned network to further recover its mAP.
2. Materials and Methods
2.1. Production of the Dataset
For the present study, the seeds were purchased from Heilongjiang Junyi Liability Co., Ltd. The producers of the seeds are Heilongjiang Quanfu Seedling Co. and Heilongjiang Junyan Agricultural Development Co. The dataset included the varieties ‘bayuelv’, ‘dazihua’, ‘juguan’, ‘taikongdou’, ‘cuiguan’, ‘hongguan’, ‘yuguan’, ‘qingguan’, ‘fengguan’ and ‘jinguan’. The classes of seeds mentioned above are all common varieties and present different colors and sizes. One bag of seeds was selected for each category, and the net content of each bag was 200 g. To meet the requirements of the experiments, the seeds were all over 95% variety purity, with no obvious defects or breakage. The features of these seeds were extracted by the deep learning network.
The image acquisition system is presented in Figure 1; it includes a mobile phone, a ring-light source with an adjustable light intensity, and the seeds. The phone used was a Xiaomi MIX2. To mimic the light conditions on an assembly line, an LED ring-light source was placed at the center of the set-up so that the seeds could be imaged clearly, and the light intensity was kept constant via the use of a light-source controller. A cardboard box was placed around the equipment to prevent the interference of natural light. The whiteboard placed underneath the seeds mimicked the monotonous color of a conveyor-belt surface. The phone was placed at a distance of about 10.5 cm from the whiteboard. Subsequently, 1292 images were captured, and the ratio of the training set to the validation and test sets was 8:1:1. The environment was kept consistent during shooting, and the photographed seeds were not in contact with each other, as can be observed in Figure 2. In the dataset, each seed was captured once or twice. The dataset included images of one type of seed, as shown in Figure 2a–j, as well as combinations of multiple types of seeds, as shown in Figure 2k–m. The images had a resolution of 600 × 800 pixels and were saved in the “JPG” format. When the dataset was fed into the network, all the images were scaled to 416 × 416 pixels, which reduced the training time and memory usage. After the long side was scaled to 416 pixels, the short side was scaled by the same factor; the purpose of this was to prevent distortion of the image and to accomplish equal scaling.
The different kinds of kidney bean seeds were labeled using the labeling tool LabelImg; the weights were obtained by training and evaluated on the test set. TXT files were then generated by the LabelImg software, containing the seed ID and coordinate location information, as presented in Figure 3. In Figure 3d, each row represents one labeled seed and its generated bounding box. The numbers 0 to 3 represent the four categories of seeds shown. The other four values in each row represent the center coordinates of the labeled box relative to the image, and the width and height of the labeled box relative to the image.
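A row in this TXT format can be read back as follows; `parse_yolo_label` is a hypothetical helper for illustration, not part of the authors’ pipeline:

```python
def parse_yolo_label(line):
    """Parse one row of a LabelImg-generated Yolo-format TXT file:
    class ID followed by the box center (x, y) and its width and height,
    all normalized to the image dimensions."""
    fields = line.split()
    return {
        "class_id": int(fields[0]),
        "x_center": float(fields[1]),
        "y_center": float(fields[2]),
        "width": float(fields[3]),
        "height": float(fields[4]),
    }

# Example row: a seed of class 2 centered in the image.
label = parse_yolo_label("2 0.5 0.5 0.1 0.15")
print(label["class_id"], label["width"])
```

Because the coordinates are normalized, the same label file remains valid after the images are rescaled to 416 × 416 pixels.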
2.2. Yolov3 Network Structure
In the present study, a deep learning target-detection algorithm was used to improve the efficiency of detection, as it can detect multiple types of seeds simultaneously.
Yolo (You Only Look Once) is an end-to-end neural network, meaning that the image needs to be observed only once to obtain information about the targets; thus, it has been commonly used for target detection [32]. The algorithm treats target detection as a regression problem over target area and category predictions, yielding the position, size, and category of each target object in the image. The Yolov3 model incorporates the advantages and avoids the disadvantages of other target-detection algorithms, such as SSD and Faster R-CNN: its detection accuracy is comparable to that of SSD, while its detection speed is superior to both SSD and Faster R-CNN [33].
The network structure of Yolov3 is presented in Figure 4. The Yolov3 model is an improvement of the Yolov2 model, mainly through the optimization of the network structure. Darknet53, an improvement of the Darknet19 network, serves as the backbone feature extractor, and multi-scale features are used for object detection [33]. Its 52 convolutional layers extract the features of the images in the network framework. To prevent the degradation of the model during the training of a deep network, Yolov3 adds shortcut links between layers, borrowed from ResNet; the last layer is fully connected. Yolov3 can thus achieve the localization and classification of targets in images. The Darknet-53 network downsamples the image five times and also uses residual connections, which allows the network to continue to converge rapidly in the deeper layers. The features extracted from the input image pass through a Yolo layer to obtain a feature map with a resolution of 13 × 13. The 13 × 13 feature map is then processed by a DBL operation and upsampled to obtain a 26 × 26 feature map, which has the same resolution as the feature map from the fourth downsampling stage; the two are merged. Finally, the 26 × 26 feature map is upsampled and merged with the feature map of the penultimate downsampling stage to obtain a 52 × 52 feature map [34]. In this way, multi-scale prediction is achieved: targets of different sizes can be predicted on feature maps of different resolutions. As presented in Figure 4, the Yolov3 network structure consists of three basic components: the CBL, the Res unit, and ResX.
CBL: This component consists of a Convolutional layer, a batch normalization (BN) layer, and the Leaky ReLU activation function.
Res unit: This component borrows the residual structure from the ResNet network. The residual structure allows the network to converge even at very deep levels, so the model can continue training.
ResX: This component consists of a CBL and X residual units, and it is the large building block of Yolov3. The CBL component acts as a downsampling stage before each Res module. Each ResX contains 1 + 2 × X convolutional layers, so the entire backbone network contains a total of 52 convolutional layers.
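As a sanity check on the layer count, the tally can be reproduced with the standard Darknet-53 residual stage sizes (assumed here to be X = 1, 2, 8, 8, 4) plus the initial CBL layer:

```python
# Residual unit counts X of the five ResX stages in Darknet-53
# (standard values, assumed here for illustration).
res_blocks = [1, 2, 8, 8, 4]

# Each ResX holds one downsampling CBL plus 2 convolutional layers per
# Res unit (1 + 2*X); one additional CBL precedes the first ResX stage.
backbone_convs = 1 + sum(1 + 2 * x for x in res_blocks)
print(backbone_convs)  # -> 52
```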
As a lite version of the Yolov3 model, the Yolov3-tiny model has also been widely used for its quicker run speed; however, its detection accuracy is lower. Its backbone network consists of seven 3 × 3 convolutional layers and six pooling layers; the first five pooling layers have a stride of 2 and the last has a stride of 1. It also uses multi-scale predictions, with resolutions of 26 × 26 and 13 × 13. For the same image, the detection result of Yolov3-tiny is weaker than that of Yolov3 in terms of position and confidence [35].
2.3. Yolov3 Model Compression
The actual online crop detection work often faces hardware limitations [36]. Only simple and efficient algorithms can be widely used in practical detection environments. Although the Yolov3 network can accurately detect kidney bean seeds, it has a high number of parameters. Thus, it is necessary to compress the model, which is beneficial in a real-world detection environment. Model pruning is a commonly used model compression method. It introduces sparsity into the dense connections of deep neural networks: the number of non-zero weights is reduced by directly assigning zero values to the ‘unimportant’ weights.
The model pruning method can be divided into four steps [37]: (1) analyzing the importance of each neuron in the trained model; (2) removing neurons with a low activation from the model for model inference; (3) fine-tuning the model to improve the accuracy of the pruned model; and (4) testing the fine-tuned model to determine whether the pruned model meets the requirements. The flow chart is presented in Figure 5.
2.3.1. Sparsity Training
Prior to pruning the Yolov3 model, it is necessary to identify the parts that are of least importance to the model; therefore, sparsity training is required. Each convolutional layer in the Yolov3 network is followed by a batch normalization layer, which enhances the convergence and generalization of the network. To obtain importance scores, the scaling factor in the batch normalization layer is used as the measure of the importance of channels and layers, which simplifies the identification of the unimportant parts of the network [38].
The principle of the batch normalization layer is as follows. First, the mean and variance of the output data of the previous layer are computed:

μ = (1/m) ∑ xi (1)

σ² = (1/m) ∑ (xi − μ)² (2)

In Equations (1) and (2), m is the number of samples of images in one training step, i.e., the batch size. Following the normalization process, the normalized value x̂i can be obtained. The batch normalization principle can improve training efficiency and prevent problems such as gradient explosion.
x̂i = (xi − μ) / √(σ² + ε) (3)

yi = γ · x̂i + β (4)

In Equations (3) and (4), ε is an extremely small value added to prevent the denominator from being 0, γ is the scaling factor, and β is the translation factor. In the current paper, γ is used as the measure of channel importance. The L1 norm is calculated over the γ values of the channels, as shown in Equation (5):

L(γ) = ∑ |γ| (5)
The sparsity training loss function adds this regularization term to the Yolov3 loss function, as shown in Equation (6):

Loss = Lossyolo + s · ∑ |γ| (6)

where Lossyolo denotes the Yolov3 loss function and s denotes the sparsity factor.
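A minimal sketch of the regularization term described above, written in plain Python rather than any particular deep learning framework; `bn_gammas` stands in for the scaling factors collected from the network’s BN layers:

```python
def sparsity_penalty(bn_gammas, s):
    """L1 regularization term: s times the sum of |gamma| over every
    channel scaling factor in the batch normalization layers."""
    return s * sum(abs(g) for layer in bn_gammas for g in layer)

def total_loss(loss_yolo, bn_gammas, s):
    # Loss = Loss_yolo + s * sum(|gamma|)
    return loss_yolo + sparsity_penalty(bn_gammas, s)

# Two BN layers with three channels each; sparsity factor s = 0.01.
gammas = [[0.9, -0.2, 0.05], [1.1, 0.0, -0.4]]
print(round(total_loss(2.0, gammas, 0.01), 4))  # -> 2.0265
```

During sparsity training, minimizing this penalty pushes unimportant scaling factors toward zero, which is what makes the later ranking step meaningful.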
2.3.2. Model Pruning
After the sparsity training, many scaling factors of the model were close to zero. The sparsity scaling factors γ are then ranked from smallest to largest; the closer a scaling factor is to zero, the less important the channel. Pruning the unimportant channels simplifies the model and reduces the model volume. The channel pruning process is presented in Figure 6: the channels whose scaling factors are close to zero, along with the corresponding weights, are pruned [39]. After sorting the scaling factors from small to large, the pruning rate needs to be set; the higher the pruning rate, the smaller the model size.
To ensure the integrity of the Yolov3 network structure, only the shortcut layers in the backbone network were pruned in the present study. The γ values of each layer were ranked, and the layers with the smallest values were selected for layer pruning.
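The ranking-and-thresholding step for channel pruning can be sketched as follows, assuming the pruning rate gives the fraction of channels to remove; `channel_prune_mask` is a hypothetical helper for illustration:

```python
def channel_prune_mask(gammas, prune_rate):
    """Rank the BN scaling factors from smallest to largest and mark
    the lowest `prune_rate` fraction of channels for removal."""
    ranked = sorted(abs(g) for g in gammas)
    n_pruned = int(len(ranked) * prune_rate)
    threshold = ranked[n_pruned - 1] if n_pruned > 0 else float("-inf")
    return [abs(g) > threshold for g in gammas]  # True = keep channel

# Ten channels, pruning rate 0.5: the five smallest |gamma| are pruned.
gammas = [0.01, 0.9, 0.02, 0.8, 0.001, 0.7, 0.03, 0.6, 0.005, 0.5]
mask = channel_prune_mask(gammas, 0.5)
print(mask.count(True))  # -> 5
```

The surviving mask is then used to slice the corresponding convolutional weights, which is where the actual reduction in parameters and computation comes from.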
2.3.3. Knowledge-Distillation Assisted Fine-Tuning
Large pruning leads to a considerable loss of model accuracy, so the pruned model needs to be fine-tuned. In the present study, knowledge distillation was used during fine-tuning to improve the accuracy as much as possible. Knowledge distillation is based on the idea of transfer learning: the student model is trained with the knowledge obtained from a relatively high-precision teacher model, which accelerates the convergence of the student model [31]. Since the models before and after pruning have structural similarities, they are well suited for fine-tuning with the help of knowledge distillation.
In traditional target-detection networks, the data are usually labeled with hard labels of “0” and “1”. However, hard labels contain limited information and do not capture the relationships between classes. In contrast, knowledge distillation uses the soft labels of the teacher network to label the data. The soft labels of the teacher network carry a considerable amount of information and can offer more information to the student network, enabling more effective training of the student network and improving the generalization ability of the model. Soft labels are present as probabilities in the category vector. The knowledge distillation algorithm introduces a temperature coefficient, T, into the softmax function. Appropriately increasing the value of T results in a more even target category probability distribution, so the information provided by the soft labels is amplified. In this way, the student model can focus more on the soft labels, further improving the training effect of the student model. The target category probability formula of the teacher network with the introduction of T is presented in Equation (7):

qi = exp(vi/T) / ∑j exp(vj/T), j = 1, …, N (7)

where vi is the output of the teacher model for class i, N is the total number of labels, and qi is the value of the soft label output by the softmax function of the teacher model for class i at a temperature of T.
Similarly, the target class probability formula for the student model network with the introduction of T is presented in Equation (8):

pi = exp(zi/T) / ∑j exp(zj/T), j = 1, …, N (8)

where zi is the output of the student model for class i, N is the total number of labels, and pi is the value of the soft label output by the softmax function of the student model for class i at a temperature of T. The values qi in Equation (7) and pi in Equation (8) are used in Equations (9)–(11) to calculate the loss.
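The effect of the temperature coefficient can be illustrated with a small sketch (the teacher and student softmax share this form; the logit values below are illustrative):

```python
import math

def soft_targets(logits, T):
    """Softmax with temperature T: larger T yields a more even
    probability distribution over the classes."""
    exps = [math.exp(v / T) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [6.0, 2.0, 1.0]
sharp = soft_targets(logits, T=1)  # close to a hard one-hot label
soft = soft_targets(logits, T=4)   # flatter, more informative soft label
print(round(sharp[0], 3), round(soft[0], 3))
```

At T = 1 almost all probability mass sits on the top class; at a higher temperature the smaller classes receive visibly more mass, which is the extra inter-class information the student learns from.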
The flow of knowledge distillation is presented in Figure 7. The loss function expresses the degree of difference between the prediction and the actual data; it measures how well the model performs predictions and is obtained during the process of model training. Knowledge distillation is essentially also the process of training a model; therefore, the loss of knowledge distillation needs to be calculated. Figure 7 also illustrates the process of calculating the loss function for knowledge distillation. The loss function of knowledge distillation consists of two parts, as shown in Equation (9):

LossKD = α · Losssoft + β · Losshard (9)

The knowledge distillation loss LossKD is obtained by weighting Losssoft and Losshard; the two weighting factors α and β sum to 1. In Figure 7, LossKD is the total loss, Losssoft is the distillation loss, and Losshard is the student loss.
For the same input dataset, the teacher and student models each produce soft labels through the softmax layer. At the same temperature, the cross-entropy loss between the two sets of soft labels is Losssoft. This part of the loss represents the information provided by the teacher model to the student model. The formula for Losssoft is as follows:

Losssoft = −∑i qi · log(pi), i = 1, …, N (10)

where qi and pi are the values of the soft labels presented in Equations (7) and (8), N is the total number of labels, and Losssoft is one component of the LossKD presented in Equation (9).
The cross-entropy loss between the softmax output of the student model and the ground truth, computed under the condition that T is 1, is Losshard. The teacher model also carries misclassification information, and the error information received by the student model can be effectively reduced by introducing the ground truth value:

Losshard = −∑i ci · log(pi), i = 1, …, N (11)

where pi is presented in Equation (8) (computed with T = 1), ci is the ground truth value of class i, and Losshard is one component of the LossKD presented in Equation (9).
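The combined distillation loss described above can be sketched as follows; the logits, temperature, and weighting values are illustrative, not taken from the study:

```python
import math

def softmax(logits, T=1.0):
    exps = [math.exp(v / T) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, ground_truth, T, alpha):
    """Knowledge-distillation loss:
    Loss_KD = alpha * Loss_soft + beta * Loss_hard, with alpha + beta = 1."""
    beta = 1.0 - alpha
    q = softmax(teacher_logits, T)       # teacher soft labels
    p_soft = softmax(student_logits, T)  # student soft labels
    p_hard = softmax(student_logits, 1)  # student output at T = 1
    # Cross-entropy between the two soft-label distributions.
    loss_soft = -sum(qi * math.log(pi) for qi, pi in zip(q, p_soft))
    # Cross-entropy between the student output and the ground truth.
    loss_hard = -sum(ci * math.log(pi) for ci, pi in zip(ground_truth, p_hard))
    return alpha * loss_soft + beta * loss_hard

loss = kd_loss([5.0, 1.0, 0.5], [3.0, 1.5, 0.2], [1, 0, 0], T=4, alpha=0.7)
print(loss > 0)  # -> True
```

Note that some distillation formulations additionally scale the soft-label term by T²; this sketch follows the two-term weighted sum given in the text.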
4. Discussion
The mAP reached 99.19% during the basic training phase, so the model could focus on the feature information of the seeds during training. In the process of sparsity training, different sparsity factors were chosen, and the results differed. With sparsity factors s of 0.01 and 0.005, the scaling factors γ of the BN layers were rapidly compressed; with a sparsity factor s of 0.001, the scaling factors γ were compressed more slowly. After 300 epochs of sparsity training with sparsity factors s of 0.01 and 0.001, the scaling factors γ could be compressed to values close to 0, thereby revealing the insignificant parts of the network. However, the sparsity factor s of 0.001 produced a lower loss value, which allowed the model to be trained more effectively. In the pruning process, a pruning rate of 0.93 was selected to remove as many of the model’s unimportant parameters as possible. If the pruning rate were increased further, the mAP of the model would be completely lost, because too many scaling factors that were close to, but not equal to, 0 would be pruned out; it is difficult to recover the model’s mAP by fine-tuning when the pruning rate is too high. Finally, the knowledge distillation algorithm was introduced into the fine-tuning process, so the pruned model received help from the teacher model and obtained more information from its soft labels. Thus, using knowledge distillation to assist fine-tuning increased the model’s mAP further, and the compression of the Yolov3 network was completed. Finally, the improved network used in the present study was compared to the Yolov3, Yolov4, and Yolov3-tiny networks; the comparison showed that the improved network had advantages in both speed and accuracy.
In future short-term plans, the research will focus on the following factors: (1) adding different combinations of seeds and using different light-intensity adjustments to enrich the dataset and improve the generalization ability of the model; (2) deploying the trained models on embedded devices to form a highly portable, low-power, low-cost detection system; and (3) replacing the backbone with a lightweight network and employing model quantization to obtain an even smaller model. Ensuring accuracy while reducing the model’s inference time is another aspect on which the present study will focus in the future.