**1. Introduction**

In the field of intelligent agriculture, for instance, plant protection and precision farming, there has been steady progress in agricultural image processing, e.g., the classification of crop pests and harvest yield forecasting. These advances are catalyzed by a variety of computerized models covering a wide range of technologies, such as machine learning, deep learning, transfer learning, few-shot learning, and so on. For instance, several machine learning methods were adopted in crop pest classification [1,2]. Convolutional neural networks were used to diagnose and identify plant diseases from leaf images [3,4]. Deep learning networks showed powerful and excellent performance on several agricultural applications, such as plant identification [5], crop classification [6,7], fruit classification [8], weed classification [9], animal classification [10], quality evaluation [11], and field pest classification [12,13]. Transfer learning helped fine-tune pre-trained models to reduce the difficulty of model training [14,15]. Few-shot learning reduced the requirements on the scale of the training dataset [16]. There are also some related agricultural research surveys [17–19], providing more comprehensive views.

Although the abovementioned methods achieved good performance on some specific tasks, they are still far from true intelligence in this area. Specifically, one deep neural network is designed for a specific task with a static evaluation protocol. The entire dataset is split into two parts: a training set used for learning and a testing set used for accuracy evaluation. Once the training period completes, the structure and parameters of the model are fixed, and no new knowledge can be learned. This is quite different from how humans learn.

Biological learning continually acquires new skills (tasks) and accumulates knowledge throughout the lifetime [20]. We can also incorporate new information to expand our cognitive abilities without seriously breaking past memories, which results from a good balance between synaptic plasticity and stability [21,22]. The basic principle of deep learning networks is error back-propagation and gradient descent. However, from the perspective of biological cognition, our learning process is more likely based on similarity matching, rather than error back-propagation or gradient descent in the brain [23]. So, the bio-inspired work in this article centers on metric learning and continual learning. Metric learning aims to learn the internal similarity between paired input data [24], which is suitable for classification and pattern recognition in agriculture. Continual learning requires the designed model to continuously learn new tasks without forgetting old ones, that is, to keep good performance on both new and old tasks.

Continual learning is an approach inspired by biological factors of the mammalian brain. In this topic, the most important issue is the stability–plasticity dilemma. In detail, plasticity means integrating novel information to incrementally refine and transfer knowledge, while stability aims not to catastrophically interfere with consolidated knowledge. For a stable continual learning process, two types of plasticity are required: Hebbian plasticity for positive feedback instability and compensatory homeostatic plasticity, which stabilizes neural activity [25]. So far, the main methods to realize continual learning can be divided into three categories: (1) store and replay, including previous data or memory; the limitation is that the storage of old information leads to large working-memory requirements; (2) regularization approaches, such as learning without forgetting (LWF) and elastic weight consolidation (EWC), which alleviate catastrophic forgetting by imposing constraints on the update of the neural weights; however, the additional loss terms for protecting consolidated knowledge lead to a trade-off between the performance on old and novel tasks; and (3) dynamic architectures, which change the structure of the network in response to new information, e.g., re-training with an increased number of neurons or network layers; the obvious limitation is the continuously growing complexity of the architecture as the number of learnt tasks increases.

In this study, in order to imitate human learning and memory patterns and maintain good performance on both new and old tasks, we propose an artificial neural network (ANN)-based continual classification method via memory storage and retrieval, including a convolutional neural network (CNN) and a generative adversarial network (GAN). Looking at ourselves, how do we remember past events? We only keep the most important information in our brain, throwing out the details and abstracting the inner relationships. These life experiences inspire us to find a way to abstract and preserve prior knowledge in memory. Inspired by this, we used the GAN to extract the central information from old tasks and generate abstracted images as memory. For the similarity matching tasks in agriculture, this has a good effect on both new and old tasks, alleviating the forgetting problem. The main contributions of this work are as follows:


Clearly, there are so many possible applications of this proposed approach in the field of agriculture, for instance, intelligent fruit picking robots, which can recognize and pick different kinds of fruits, and plant protection through automatic identification of diseases and pests, which can continuously improve the detection range to show the ability to upgrade the developed model.

*Agriculture* **2020**, *10*, x FOR PEER REVIEW 3 of 15

### **2. Materials and Methods**

### *2.1. Crop Pest and Plant Leaf Datasets*

The typical deep neural networks require large amounts of data to train the model, while the metric learning method only needs few raw data. In this research work, we collected two common cross-domain agricultural datasets: crop pests and plant leaves. The crop pest dataset was collected from the open dataset [26], which provides images of important insects in agriculture with natural scenes and complex backgrounds, close to the real world. The plant leaf dataset was collected from the famous open dataset (PlantVillage). Generally, image preprocessing for deep neural networks includes target cropping, background operations, gray transforms, etc. Here, we used the raw images to make this study closer to practical application.

In the crop pest dataset, there are 10 categories and the number of samples in each class is 20. The total number of samples is 200, which is a small dataset compared to that required for traditional deep learning models based on back-propagation error. Some samples of the crop pest dataset are shown in Figure 1.

**Figure 1.** Samples of the crop pest dataset (from [26]).

The plant leaf dataset also includes 10 classes, and the number of samples in each class is 20. The sizes of these two datasets are the same. Some samples of the plant leaf dataset are shown in Figure 2.



**Figure 2.** Samples of the plant leaf dataset (from PlantVillage).

### *2.2. Classification with Metric Learning Based on CNN*

Metric learning learns the inner similarity between paired input data using a distance metric, aimed at distinguishing and classifying. The typical metric learning model is the Siamese network [27]. The Siamese network basically consists of two symmetrical neural networks sharing the same weights and architecture, which are joined together at the end by an energy function. During the training period of the Siamese network, the inputs are a pair of images, and the objective is to distinguish whether the input paired images are similar or dissimilar. The workflow of the Siamese network is shown in Figure 3.

**Figure 3.** The workflow of the Siamese network.

As shown in Figure 3, there are four blocks; we consider them one by one. Block 1 consists of the input paired images X1 and X2, fed to network A and network B, respectively. They may or may not come from the same category.

For block 2, there are two convolutional neural networks (CNNs), named network A and network B. Their role is to generate the embeddings (feature vectors) for the input paired images. Since the inputs of the model are images, we used a CNN to generate the embeddings. Note that the role of the CNNs here is only to extract features, not to classify; this differs from traditional deep learning classification models. The two CNNs in the Siamese network are required to share weights and structure, which means the two CNNs, in fact, have the same topology, as shown in Figure 4.

**Figure 4.** The topology of the used convolutional neural network (CNN).

Here, the shared structure and parameters of the CNN are shown in Table 1. Specifically, the output shapes of the layers in the CNN, and the size and number of kernels used in the convolutional and max-pooling layers, are included. The programming tool used was 'Jupyter Notebook', a popular web-based interactive computing environment. We implemented the functions in Python with a TensorFlow backend. Our program files and the image datasets used were uploaded to ZENODO.org, which is free and open for other researchers [28].


**Table 1.** The CNN structure and parameters in the Siamese network.
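As a concrete illustration, a shared feature extractor of this kind can be sketched in Keras. The layer sizes below are illustrative assumptions rather than the exact parameters of Table 1; the two-dimensional final dense layer is also an assumption, chosen so that the embeddings can be plotted directly, as in the embedding-distribution figures later in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_embedding_cnn(input_shape=(64, 64, 3), embedding_dim=2):
    """Shared feature extractor for both branches of the Siamese network.

    Layer sizes are illustrative stand-ins for Table 1. The final Dense
    layer outputs the embedding; there is no softmax, because the CNN
    only extracts features and does not classify.
    """
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(embedding_dim),  # embedding fed to the energy function
    ])
```

Because the two branches must share weights, the same `build_embedding_cnn()` instance would be applied to both inputs rather than building two separate copies.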

Then, for block 3, the embedding refers to the output of the last dense layer of the CNN, as shown in Figure 4. Network A and network B generate the embeddings for the input images X1 and X2, respectively. These embeddings are fed to block 4, the energy function, which gives the similarity between the paired inputs. The Euclidean distance is adopted as the energy function, which is the most common way to measure the distance between two embeddings in a high-dimensional space. The expression of block 4, the energy function, can be written as Equation (1):


$$E(\mathbf{X}_1, \mathbf{X}_2) = \|f_{N_A}(\mathbf{X}_1) - f_{N_B}(\mathbf{X}_2)\|_2 \tag{1}$$
 

The value of *E* represents the similarity between the outputs of the two networks: if *X*<sub>1</sub> and *X*<sub>2</sub> are similar (from the same category), the value of *E* will be small; otherwise, the value of *E* will be large if the inputs are dissimilar (from different categories).
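Equation (1) can be implemented as a small function over the two branch embeddings. This is a minimal sketch; the epsilon clamp before the square root is a standard numerical-stability detail, not something stated in the text.

```python
import tensorflow as tf

def euclidean_distance(embeddings):
    """Energy function E(X1, X2) of Equation (1): the L2 distance
    between the embeddings produced by networks A and B."""
    e1, e2 = embeddings
    squared = tf.reduce_sum(tf.square(e1 - e2), axis=1, keepdims=True)
    # Clamp before sqrt to avoid NaN gradients at exactly zero distance.
    return tf.sqrt(tf.maximum(squared, 1e-12))
```

In a Keras model, this function would typically be wrapped in a `Lambda` layer that joins the outputs of the two shared-weight branches.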

To train the Siamese network well, the loss function is very important. The loss function guides the iteration of the parameters of the CNNs in the Siamese network. Since the goal of the Siamese network is to understand the similarity between the paired input images, we used the contrastive loss function, expressed as Equation (2):

$$\text{Contrastive Loss} = Y \times E^2 + (1 - Y) \times [\max(\text{margin} - E, 0)]^2 \tag{2}$$

where *E* is the energy function and *Y* is the true label, which is 1 if the two input images are from the same category and 0 if they are from different categories, so that the *E*<sup>2</sup> term of Equation (2) is active for similar pairs and the margin term for dissimilar pairs. Some examples of the input pairs are shown in Figure 5.


**Figure 5.** Examples of the input pairs.

In Equation (2), the term margin is used to set the threshold, that is, when input pairs are dissimilar, the Siamese network needs to hold their distance greater than the margin; otherwise, there will be a loss during the training period. Here, the margin was set as 1. When the training period is done, the distribution of embeddings will have a group effect, where different groups represent different categories.
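A minimal sketch of Equation (2) with margin = 1, under the convention that the label is 1 for a same-category pair:

```python
import tensorflow as tf

def contrastive_loss(y_true, distance, margin=1.0):
    """Equation (2): y_true = 1 for a same-category pair, 0 otherwise;
    `distance` is the energy E of Equation (1)."""
    y_true = tf.cast(y_true, distance.dtype)
    similar_term = y_true * tf.square(distance)
    dissimilar_term = (1.0 - y_true) * tf.square(tf.maximum(margin - distance, 0.0))
    return tf.reduce_mean(similar_term + dissimilar_term)
```

A similar pair is penalized by its squared distance, while a dissimilar pair is only penalized when its distance falls below the margin, which produces the grouping effect described above.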

### *2.3. Continual Classification with Metric Learning Based on CNN and GAN*

From the bio-inspired perspective, we aimed for the model to be more flexible and able to handle continual tasks. Continual learning, also called lifelong learning, differs from transfer learning and other traditional networks. A typical deep neural network is designed for some specific task, e.g., crop pest classification. After the training period, the weights and structure of the designed model are fixed, with excellent performance on that specific task. However, if we want the model to perform another new task directly, e.g., plant leaf classification, it will perform very badly unless it is trained again from scratch or uses transfer learning. If we train the model on the new dataset, the distribution of weights will change to ensure good performance on the new task. Since the weights of the network are modified, the network loses the ability to recognize the old task; in other words, it forgets the old knowledge. For transfer learning, the problem of forgetting old knowledge still exists. Obviously, the traditional learning approach has very poor flexibility.

If we want a model that can continually learn new tasks without forgetting old knowledge, it should have some bio-inspired ability, such as memory. In this study, we proposed a continual classification method based on memory storage and retrieval to maintain a good performance on both new and old tasks. Looking at ourselves, how do we remember past events? We only keep the most important information in our brain, throwing out the details and abstracting the inner relationships. These life experiences inspire us to find a way to abstract and preserve prior knowledge in memory.

Here, we used the GAN to perform information abstraction and memory storage. A GAN is a technique that learns to generate new data with the same statistics as the raw dataset, and it consists of two parts: a generator and a discriminator. The basic workflow of the GAN is shown in Figure 6.



**Figure 6.** The workflow of the generative adversarial network (GAN).

The generator and discriminator are both deep convolutional neural networks, and their structures are shown in Table 2.


**Table 2.** The generator and discriminator in GAN.

The GAN chains the generator and discriminator together, expressed as Equation (3):

$$\text{GAN}(\mathbf{X}) = \text{discriminator}(\text{generator}(\mathbf{X})) \tag{3}$$


The generator and discriminator contest with each other in a game. We trained the discriminator using samples of raw and generated images with the corresponding labels, like any regular image classification model. To train the generator, we started with random noise and used the gradients of the generator's weights, which means, at every step, moving the weights of the generator in a direction that makes the discriminator more likely to classify the images decoded by the generator as "real". In other words, we trained the generator to fool the discriminator.

Since the GAN can carry out memory storage for old tasks, the workflow of our proposed continual metric learning method is shown in Figure 7, which is mainly based on memory storage and retrieval.

When the first task comes, the task data will be organized as pairs and fed to the metric learning model (Siamese network). The output result is the similarity between input pairs, that is, whether the input images are from the same category or not. Besides, the task data will also be fed to the GAN after data augmentation, due to the small scale of the raw database. Then, after a number of iterations, the GAN generates abstracted images that represent the most important information of the old tasks. We call this process memory storage. When the second task comes, the new task data and the data from memory will be mixed together and fed to the metric learning model. We call this process memory retrieval.
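The memory retrieval step can be sketched as follows. The function name, the generator interface, and the way memory labels are assigned are all illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import numpy as np

def retrieve_and_mix(new_images, new_labels, generator, latent_dim,
                     old_labels, n_memory, rng=None):
    """Memory retrieval: draw abstracted old-task images from the trained
    generator and mix them with the new task's data before training the
    Siamese network again. All names here are illustrative."""
    rng = rng or np.random.default_rng(0)
    noise = rng.normal(size=(n_memory, latent_dim))
    memory_images = np.asarray(generator(noise))   # abstracted old-task images
    memory_labels = rng.choice(old_labels, size=n_memory)
    images = np.concatenate([new_images, memory_images])
    labels = np.concatenate([new_labels, memory_labels])
    order = rng.permutation(len(images))           # shuffle new + memory data
    return images[order], labels[order]
```

The mixed set then replaces the plain new-task set when building training pairs, so the Siamese network keeps seeing (abstracted) old-task samples while learning the new task.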


**Figure 7.** The workflow of continual metric learning.



### **3. Results**

### *3.1. Single Task Experiment with the Basic CNN Model*

In order to test the performance of the metric learning model on similarity matching for a single task, we carried out experiments on the crop pest dataset and the plant leaf dataset, respectively. For these two datasets, we prepared the input data as paired images. In detail, the total number of input pairs was 10,000, which may have contained a small number of duplicates because of the random combinations. We split the training set and testing set by the ratio of 8:2, that is, 2000 input pairs were used to test the accuracy. During training, 25% of the training data were taken out for the validation set. In summary, there were 6000 pairs for training, 2000 pairs for validation, and 2000 pairs for testing.
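The pairing and splitting procedure described above can be sketched as follows. These are hypothetical helpers, assuming the same-category label convention of Equation (2) (1 for a matching pair); duplicates may occur in the random combinations, as noted in the text.

```python
import numpy as np

def make_pairs(images, labels, n_pairs=10000, rng=None):
    """Randomly combine images into labeled pairs.
    Label 1 = same category, 0 = different category."""
    rng = rng or np.random.default_rng(0)
    idx1 = rng.integers(0, len(images), size=n_pairs)
    idx2 = rng.integers(0, len(images), size=n_pairs)
    pair_labels = (labels[idx1] == labels[idx2]).astype("float32")
    return (images[idx1], images[idx2]), pair_labels

def split_8_2_with_val(pairs, pair_labels):
    """8:2 train/test split, then 25% of the training part for validation:
    6000 / 2000 / 2000 for 10,000 pairs."""
    n = len(pair_labels)
    n_test = n // 5                # 20% for testing
    n_val = (n - n_test) // 4      # 25% of the remaining 80% for validation
    n_train = n - n_test - n_val
    a, b = pairs
    train = ((a[:n_train], b[:n_train]), pair_labels[:n_train])
    val = ((a[n_train:n_train + n_val], b[n_train:n_train + n_val]),
           pair_labels[n_train:n_train + n_val])
    test = ((a[n_train + n_val:], b[n_train + n_val:]),
            pair_labels[n_train + n_val:])
    return train, val, test
```

With 200 images (10 classes × 20 samples) and 10,000 random pairs, this yields the 6000/2000/2000 partition stated above.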

For the crop pest dataset, the loss and accuracy of the CNN model are shown in Figure 8.

**Figure 8.** The loss and accuracy on the crop pest dataset: (**a**) loss; (**b**) accuracy.

It is shown that the variation trend of the training loss is consistent with that of the validation loss, and the variation trend of the training accuracy is consistent with that of the validation accuracy. This indicates that there is no overfitting problem in the training. The testing accuracy is 100%, which means the model can distinguish the input paired images well. The distribution of embeddings from the crop pest dataset is shown in Figure 9.


**Figure 9.** The distribution of embeddings from the crop pest dataset.

Through the distribution of the model's output embeddings, it can be seen that the metric Through the distribution of the model's output embeddings, it can be seen that the metric learning model has good ability for similarity matching on the single task, that is, the images from the same category gather while those from different categories are far away from each other. learning model has good ability for similarity matching on the single task, that is, the images from the same category gather while those from different categories are far away from each other. Similar experiments on the other dataset were also carried out. The loss and accuracy of the CNN

learning model has good ability for similarity matching on the single task, that is, the images from the same category gather while those from different categories are far away from each other. Similar experiments on the other dataset were also carried out. The loss and accuracy of the CNN model on the plant leaf dataset is shown in Figure 10. The variation trends of the training loss and Similar experiments on the other dataset were also carried out. The loss and accuracy of the CNN model on the plant leaf dataset is shown in Figure 10. The variation trends of the training loss and training accuracy are consistent with those of the validation loss and validation accuracy, which indicates that there is also no overfitting problem in the training period. The distribution of the model's output embeddings of images from the plant leaf dataset is shown as Figure 11, which also shows the good ability of the similarity matching on a single task to distinguish the input paired images well. model on the plant leaf dataset is shown in Figure 10. The variation trends of the training loss and training accuracy are consistent with those of the validation loss and validation accuracy, which indicates that there is also no overfitting problem in the training period. The distribution of the model's output embeddings of images from the plant leaf dataset is shown as Figure 11, which also shows the good ability of the similarity matching on a single task to distinguish the input paired images well.
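Embedding distributions like those in Figures 9 and 11 are typically obtained by projecting the high-dimensional embeddings to two dimensions before plotting. The paper does not say which projection it used; the PCA-via-SVD sketch below is one common choice, shown here only as an assumption.

```python
import numpy as np

def project_2d(embeddings):
    """Project (n, d) embeddings onto their top-2 principal
    components so per-class clusters can be inspected in a
    2-D scatter plot."""
    X = embeddings - embeddings.mean(axis=0)     # center the data
    # right singular vectors of the centered data = principal axes
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T
```

Plotting the two columns of the result, colored by category, would reproduce the kind of cluster-versus-spread picture the text describes for the single-task case.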

(**a**) loss (**b**) accuracy

**Figure 10.** The loss and accuracy on the plant leaf dataset.



**Figure 11.** The distribution of embeddings from the plant leaf dataset.

*3.2. Continual Tasks Experiment with the Basic CNN Model* 

As mentioned earlier, we hope that the model can be more flexible and able to handle continuous tasks, accumulating knowledge like humans to perform well on both old and new tasks. So, we carried out the experiments on sequential tasks to verify the continual performance of the CNN model, namely, the basic metric learning model.

For these two datasets, two occurring orders exist, that is, from the crop pest task to the plant leaf task, and the opposite one. For the first case, the testing accuracy of the two tasks is shown in Figure 12.
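The sequential-task protocol can be sketched as follows: train on each task in turn and, after every stage, re-test on all tasks seen so far, measuring forgetting as the drop from a task's best earlier accuracy. `run_sequential` and `forgetting` are illustrative helpers, not the authors' code, and the toy model in the usage example deliberately mimics a network that remembers only its latest task.

```python
def run_sequential(model, tasks, evaluate, train):
    """Train on tasks one after another; after each stage, record
    test accuracy on every task seen so far."""
    history = []
    for stage, task in enumerate(tasks):
        train(model, task)
        history.append({t["name"]: evaluate(model, t)
                        for t in tasks[: stage + 1]})
    return history

def forgetting(history, task_name):
    """Extent of catastrophic forgetting: best earlier accuracy on a
    task minus its accuracy after the final stage."""
    seen = [h[task_name] for h in history if task_name in h]
    return max(seen) - seen[-1]
```

With a dummy model that scores 1.0 on its most recent task and chance level (0.5 for paired inputs) on everything else, this protocol reproduces the roughly 50% forgetting pattern reported below.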

**Figure 12.** The testing accuracy of the first case.

At the first stage, the model has a good performance on the crop pest dataset, which was verified in Section 3.1. However, it has a very bad performance on the other dataset. The reason is that the other dataset is an unknown task and has never been seen before; this result is understandable and acceptable.


At the second stage, the model begins to learn the plant leaf task. Note that the model also learnt the crop pest task in the past. After the training period, the testing accuracy on the plant leaf task increases to 100% while that on the crop pest dataset decreases to nearly 50%, which is almost a blind guess. So, the extent of catastrophic forgetting for the crop pest task is nearly 50%. The new distribution of output embeddings from the old crop pest task is shown in Figure 13, which indicates that the basic metric learning model has lost the ability to distinguish the similarity between input paired images. The extracted features (embeddings) of samples from different categories are mixed, and cannot be separated. This is an undesired forgetting problem!

**Figure 13.** The distribution of embeddings from the old pest task.

For the second case, from the plant leaf task to the crop pest task, the experimental result of the testing accuracy is shown in Figure 14. The testing accuracy on the plant leaf task decreases from 100% to 60%, which means the extent of catastrophic forgetting for the plant leaf task is 40%. We found that, regardless of the occurring order of sequential tasks, the basic metric learning model does have a serious forgetting problem, as shown in Figures 12 and 14. In other words, after new learning, the basic metric learning model can no longer do the previous task well, due to the forgetting.

**Figure 14.** The testing accuracy of the second case.

The distribution of embeddings from the old plant leaf task is shown in Figure 15, which is very mixed and chaotic, losing the ability to distinguish and classify the similarity between input paired images.

**Figure 15.** The distribution of embeddings from the old leaf task.

*3.3. Continual Tasks Experiment with Our Proposed Method* 

As known, due to the forgetting problem, the basic CNN model cannot balance new and old tasks. Taking the sequential tasks from the crop pest dataset to the plant leaf dataset as an example,

