*Article* **Evaluation of Convolutional Neural Networks' Hyperparameters with Transfer Learning to Determine Sorting of Ripe Medjool Dates**

**Blanca Dalila Pérez-Pérez <sup>1</sup>, Juan Pablo García Vázquez <sup>1,\*</sup> and Ricardo Salomón-Torres <sup>2,\*</sup>**

\* Correspondence: Tel.: +52-686-191-2431 (J.P.G.V.); +52-653-534-4255 (R.S.-T.)

**Abstract:** Convolutional neural networks (CNNs) have proven their efficiency in various applications in agriculture. In crops such as the date palm, they have mainly been used to identify and sort ripe fruits. The aim of this study was to evaluate the performance of eight different CNNs, considering transfer learning for their training, as well as five hyperparameters. The CNN architectures evaluated were VGG-16, VGG-19, ResNet-50, ResNet-101, ResNet-152, AlexNet, Inception V3, and a CNN built from scratch. The hyperparameters analyzed were the number of layers, the number of epochs, the batch size, the optimizer, and the learning rate. Accuracy and processing time were considered to determine the performance of the CNN architectures in the classification of ripe dates of the Medjool cultivar. The model obtained from the VGG-19 architecture with a batch size of 128 and the Adam optimizer with a learning rate of 0.01 presented the best performance, with an accuracy of 99.32%. We conclude that the VGG-19 model can be used to build computer vision systems that help producers improve their sorting process to detect the Tamar stage of a Medjool date.

**Keywords:** *Phoenix dactylifera* L.; Medjool dates; image classification; convolutional neural networks; deep learning; transfer learning

### **1. Introduction**

The date palm fruit (*Phoenix dactylifera* L.) is a berry composed of a fleshy mesocarp, covered by a thin epicarp, with an endocarp surrounding the seed [1]. The name of this fruit, "date," comes from the Greek word "Daktylos," which means "finger" [2]. This fruit has been a primary source of food in several countries in the Middle East, playing an essential role in the economy, society, and environment [3].

This fruit's growth presents a progressive maturity level in four stages known by their Arabic names: Kimri, Khalal, Rutab, and Tamar. At its first stage of growth (Kimri), the fruit is small, green, and of a hard texture. In its second stage (Khalal), the fruit reaches its maximum size and changes its green color to yellow or red. In the third stage (Rutab), the fruit loses weight and moisture, turning brown. In the last stage (Tamar), the fruit is ripe and ready to be harvested [4].

According to Food and Agriculture Organization of the United Nations data, the world's largest date producers are Egypt, Saudi Arabia, Iran, Algeria, and Iraq, which together accounted for 66% of world production in 2018 [5]. However, despite not being a crop native to the American continent, the date has also become a priority fruit for cultivation in southern California and Arizona in the United States and in northwestern Mexico, where high-quality dates, such as the Medjool cultivar, are grown [6].

Date palm producers face several challenges concerning harvesting, sorting, and packaging because these tasks are mainly performed manually [7]. Therefore, many employers hire workers for these activities, which involve long working hours. People performing repetitive tasks make mistakes in the inspection of the fruit's quality attributes, such as color (maturity level), size, and texture.

**Citation:** Pérez-Pérez, B.D.; García Vázquez, J.P.; Salomón-Torres, R. Evaluation of Convolutional Neural Networks' Hyperparameters with Transfer Learning to Determine Sorting of Ripe Medjool Dates. *Agriculture* **2021**, *11*, 115. https://doi.org/10.3390/agriculture11020115

Academic Editors: Sebastian Kujawa and Gniewko Niedbała. Received: 31 December 2020; Accepted: 25 January 2021; Published: 1 February 2021.

In the Medjool date harvesting process in particular, fruit pickers shake the palm bunch so that the ripe dates fall into containers. This can damage the texture of the ripe fruit, and fruits in other ripening stages may also be harvested. The dates are placed in trays, from which the immature ones are extracted and grouped into other trays to dry in the sun until they reach full maturity. Small or damaged fruits, in contrast, are commonly separated to develop date by-products or for animal consumption. Finally, the sorted Medjool dates (those with the required degree of maturity) are packed.

To automate the processes related to harvesting, sorting, and packaging of dates, there has recently been interest in exploring convolutional neural networks (CNNs) [8,9]. CNNs have shown exceptional accuracy in classifying fruits and vegetables according to several quality parameters, such as color (maturity level), shape, texture, and size [10–14].

Regarding dates, we identified some studies that use machine learning algorithms and image processing techniques to sort date palm fruits or to detect their different maturity stages [15–17]. Further, there are research works that propose using CNNs [8,9,18]. However, these studies do not present models to detect the maturity stage of the Medjool date.

The main contribution of this article is the identification of the hyperparameters that most influence the training of a CNN architecture using transfer learning for sorting ripe Medjool dates. To achieve this, we compared the performance of eight CNN architectures: two versions of the Visual Geometry Group (VGG) architecture from Oxford University, VGG-16 and VGG-19; three versions of the Residual Network architecture from Microsoft Research, ResNet-50, ResNet-101, and ResNet-152; AlexNet; Inception Version 3; and a CNN built from scratch. The hyperparameters analyzed were the number of layers, the number of epochs, the batch size, the optimizer, and the learning rate.
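To make the scale of such an evaluation concrete, the hyperparameter combinations can be enumerated as a grid. The sketch below is illustrative only: the candidate values in `search_space` are hypothetical examples, not the exact grids evaluated per architecture in this study.

```python
from itertools import product

# Hypothetical candidate values for each hyperparameter; the actual
# grids evaluated per architecture in the study may differ.
search_space = {
    "epochs": [10, 20, 30],
    "batch_size": [32, 64, 128],
    "optimizer": ["sgd", "adam", "rmsprop"],
    "learning_rate": [0.001, 0.01, 0.1],
}

# Every combination of hyperparameter values yields one training configuration.
configurations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(configurations))  # 3 * 3 * 3 * 3 = 81 configurations
```

Even this modest grid produces 81 training runs per architecture, which is why accuracy and processing time were both relevant performance criteria.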

### **2. Materials and Methods**

### *2.1. Image Acquisition*

The images corresponding to ripe and unripe dates in trays were taken in September 2020, during the first round of harvest of Medjool dates in a plantation located in Colonia La Herradura (32°36′56″ N, 115°15′36″ W) in the Mexicali Valley, Mexico. The images were acquired with three different cameras, using natural light, between 8:00 a.m. and 2:00 p.m. We used an 18-megapixel Canon EOS Rebel T6 and the cameras of the Samsung SM-N950F and SM-N960F smartphones, each of which has a 12-megapixel dual camera.

### *2.2. Image Data Set*

The image data set contained 1002 images in JPG format of different sizes (5184 × 3456, 4449 × 3071, and 4376 × 3375 pixels). The network architectures were trained with JPG images because, in real scenarios, they are fed with low-quality images. We refer to low quality as images with blur, noise, poor contrast, or compression artifacts. We considered that if an architecture is trained with this type of image, the resulting system will be able to classify the Medjool date in images with these features. Further, a study shows that the performance of convolutional neural networks is minimally affected by using the JPG format [19].

The image data set was distributed as follows: 501 images of ripe dates and 501 images of unripe dates on trays (Figure 1). The dates in trays were previously classified as ripe or unripe by experts.
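A balanced two-class set like this is typically partitioned with stratification so that both classes keep the same proportions in the training and validation subsets. The sketch below is a minimal illustration of such a split; the 80/20 fraction, the seed, and the file names are assumptions for illustration, not the proportions reported in this study.

```python
import random

def stratified_split(ripe, unripe, train_frac=0.8, seed=42):
    """Shuffle each class separately and cut it at the same fraction,
    so the train/validation subsets stay balanced between classes."""
    rng = random.Random(seed)
    train, val = [], []
    for group in (ripe, unripe):
        items = list(group)
        rng.shuffle(items)
        cut = int(len(items) * train_frac)
        train += items[:cut]
        val += items[cut:]
    return train, val

# Hypothetical file names standing in for the 501 + 501 tray images.
ripe = [f"ripe_{i}.jpg" for i in range(501)]
unripe = [f"unripe_{i}.jpg" for i in range(501)]
train, val = stratified_split(ripe, unripe)
print(len(train), len(val))  # 800 202
```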

**Figure 1.** Example of dataset images. (**A**–**C**) Trays with ripe dates. (**D**–**F**) Trays with unripe dates.

### *2.3. Convolutional Neural Networks*

The convolutional neural network (CNN) is a type of artificial neural network in which neurons correspond to receptive fields, similar to the neurons in the primary visual cortex of a biological brain [20]. CNNs are also similar to ordinary neural networks such as the multilayer perceptron: they are composed of neurons with learnable weights and biases. Each neuron receives some input, performs a dot product, and then applies an activation function [21]. Like a multilayer perceptron, a CNN has a loss or cost function on the last layer, which is fully connected.

Figure 2 presents a CNN structure, which consists of three blocks. The first is the input, an image. Next is the feature extraction block, which consists of convolutional and pooling layers. Finally, the third block performs classification and consists of fully connected layers and a softmax. The structure of the convolutional network changes as the number of convolution and pooling layers increases.
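The three blocks above can be sketched with toy operations on plain lists. This is a didactic miniature, not any of the architectures evaluated: one convolution and one pooling step stand in for the feature extraction block, and a softmax over two scores stands in for the classification block. The image values and kernel are arbitrary.

```python
import math

def conv2d(image, kernel):
    """Valid 2-D convolution (implemented as cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(len(image[0]) - kw + 1)]
            for i in range(len(image) - kh + 1)]

def max_pool2x2(fmap):
    """2 x 2 max pooling with stride 2: keeps the strongest response per window."""
    return [[max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap[0]) - 1, 2)]
            for i in range(0, len(fmap) - 1, 2)]

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    return [e / sum(exps) for e in exps]

image = [[1, 2, 0, 1, 2],
         [0, 1, 3, 1, 0],
         [2, 1, 0, 2, 1],
         [1, 0, 1, 3, 0],
         [2, 1, 2, 0, 1]]
kernel = [[1, -1], [1, -1]]  # toy 2 x 2 filter

features = max_pool2x2(conv2d(image, kernel))       # feature extraction block
probs = softmax([features[0][0], features[1][1]])   # toy two-class classification block
```

A 5 × 5 input convolved with a 2 × 2 kernel yields a 4 × 4 feature map, which pooling halves to 2 × 2, mirroring how each block in Figure 2 progressively shrinks the spatial size.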

**Figure 2.** Representation of a basic convolutional neural network (CNN).

The main difference between convolutional neural networks and ordinary neural networks is that the former explicitly assume that the inputs are images [21], allowing specific encoding properties in the architecture, gaining efficiency, and reducing the number of parameters in the network. In this way, CNNs can model complex variations and behaviors, giving quite accurate predictions. This study considered eight CNN architectures: VGG-16, VGG-19, Inception V3, ResNet-50, ResNet-101, ResNet-152, AlexNet, and one CNN built from scratch.

### 2.3.1. VGG-16 and VGG-19 Architectures

VGG is the abbreviation of the Visual Geometry Group [22]. The VGG model was developed by Simonyan and Zisserman [23]. VGG uses 3 × 3 convolutional layers stacked on top of each other in increasing depth, with the reduction of volume size handled by max pooling. Two fully connected layers, each with 4096 nodes, are then followed by a softmax classifier [23]. The number 16 or 19 refers to the number of weight layers, which makes these networks considered deep.
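The design rationale behind stacking small 3 × 3 kernels can be checked with a short receptive-field calculation: two stacked 3 × 3 convolutions see the same input region as a single 5 × 5 kernel, but with fewer parameters and more nonlinearities. The sketch below uses the standard receptive-field recurrence, assuming stride 1 as in the VGG convolutional stacks.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of conv layers:
    r grows by (k - 1) * jump per layer, and the jump is
    multiplied by each layer's stride."""
    strides = strides or [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convolutions cover the same region as one 5x5 kernel...
print(receptive_field([3, 3]))     # 5
# ...and three stacked 3x3 convolutions match one 7x7 kernel,
print(receptive_field([3, 3, 3]))  # 7
# with fewer weights: 2 * (3*3) = 18 vs. 5*5 = 25 per input/output channel pair.
```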

### 2.3.2. Inception V3 Architecture (GoogLeNet V3)

This architecture was born with the name GoogLeNet, but subsequent updates have been called Inception vN, where N refers to the version number put out by Google [22]. The basic Inception module [24] consists of four branches concatenated in parallel: a 1 × 1 convolution followed by two 3 × 3 convolutions; a 1 × 1 convolution followed by a 3 × 3 convolution; a pooling operation followed by a 1 × 1 convolution; and, finally, a 1 × 1 convolution. Inception consists of 10 modules, although these modules change slightly as the network gets deeper. Five of the modules are modified to reduce the computational cost by replacing the n × n convolutions with two convolutions, a 1 × 7 followed by a 7 × 1. The last two modules replace the two 3 × 3 convolutions of the first branch with two convolutions each, a 1 × 3 followed by a 3 × 1, this time in parallel. In total, Inception V3 has 42 layers with parameters.

### 2.3.3. ResNet-50, ResNet-101, and ResNet-152 Architectures (Residual Neural Network)

ResNet [25] does not have a fixed depth; its depth depends on the number of consecutive modules used. However, increasing the network's depth to obtain greater precision makes the network more difficult to optimize. ResNet addresses this problem by fitting a residual mapping in place of the original mapping and adding shortcut connections between layers. These connections skip several layers and perform an identity mapping or a 1 × 1 convolution. The basic block of this network is called the residual block. When the network has 50 or more layers, the block is composed of three sequential convolutions, a 1 × 1, a 3 × 3, and a 1 × 1, and a connection that links the input of the first convolution to the output of the third convolution. This study used three models with this architecture, ResNet-50, ResNet-101, and ResNet-152, composed of 50, 101, and 152 layers, respectively.
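The essence of the residual block, y = F(x) + x, can be shown without any deep-learning framework. In the sketch below, three arbitrary list transforms stand in for the 1 × 1, 3 × 3, 1 × 1 convolutions of a bottleneck block; the stand-in functions are illustrative, not learned layers.

```python
def bottleneck_residual_block(x, f1x1a, f3x3, f1x1b):
    """Residual block on a plain 'feature vector': compute the residual
    mapping F(x) through three stand-in transforms, then add the input
    back through the identity shortcut (y = F(x) + x)."""
    fx = f1x1b(f3x3(f1x1a(x)))             # residual mapping F(x)
    return [a + b for a, b in zip(fx, x)]  # shortcut connection: F(x) + x

# Toy stand-in transforms (real blocks use learned convolutions with ReLU).
halve = lambda v: [t * 0.5 for t in v]
shift = lambda v: [t + 1.0 for t in v]

x = [2.0, 4.0]
y = bottleneck_residual_block(x, halve, shift, halve)
# F(x): [2,4] -> [1,2] -> [2,3] -> [1.0, 1.5]; plus x -> [3.0, 5.5]
print(y)  # [3.0, 5.5]
```

Note that if F(x) collapses to zero, the block reduces to the identity, which is exactly why adding residual blocks does not make the network harder to optimize.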

### 2.3.4. AlexNet

This architecture consists of five convolutional layers and three fully connected layers. Some convolutional layers (layers 1, 2, and 5) are followed by max-pooling layers. The Rectified Linear Unit (ReLU) nonlinearity is applied to the output of every convolutional and fully connected layer. The fully connected layers have 4096 neurons each [26]. To avoid overfitting, a regularization method known as dropout is used, which consists of "turning off" neurons with a predetermined probability during training.
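The two element-wise operations named above, ReLU and dropout, are simple enough to sketch directly. The dropout variant shown is the common "inverted" form, where surviving activations are rescaled during training so no rescaling is needed at inference; this is a standard implementation choice, not a detail stated in [26].

```python
import random

def relu(v):
    """Rectified Linear Unit applied element-wise: max(0, t)."""
    return [max(0.0, t) for t in v]

def dropout(v, p, training=True, rng=random):
    """'Turn off' each neuron with probability p during training.
    Surviving activations are scaled by 1/(1-p) (inverted dropout),
    so the expected activation is unchanged and inference needs no scaling."""
    if not training:
        return list(v)
    keep = 1.0 - p
    return [t / keep if rng.random() < keep else 0.0 for t in v]

acts = relu([-2.0, 0.5, 3.0])   # [0.0, 0.5, 3.0]
print(dropout(acts, p=0.5))      # roughly half the activations zeroed
```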

### 2.3.5. CNN from Scratch

The CNN that we built from scratch was composed of four alternating convolutional and max-pooling layers, with a dropout layer after every other convolutional-pooling pair. After the last pooling layer, we attached a fully connected layer with 256 neurons, another dropout layer, and, finally, a softmax classification layer for our classes. The loss function was cross-entropy, which is widely used with convolutional neural networks, particularly for image classification [27]. In order to compare the performance of a network that learns from scratch against architectures that start from transfer learning, this convolutional network was trained from scratch.
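The cross-entropy loss used here has a compact form for a single sample: the negative log of the probability the softmax assigned to the true class. The sketch below illustrates why it suits classification; the probability values are arbitrary examples.

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy loss for one sample: -log of the softmax
    probability assigned to the true class."""
    return -math.log(probs[true_class])

# A confident correct prediction gives a small loss...
print(cross_entropy([0.05, 0.95], true_class=1))  # ~0.051
# ...while a confident wrong prediction is penalized heavily.
print(cross_entropy([0.95, 0.05], true_class=1))  # ~3.0
```

This asymmetry (near-zero loss when confident and correct, rapidly growing loss when confident and wrong) is what drives the softmax outputs toward the expert-assigned ripe/unripe labels during training.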
