Article

Strawberry Defect Identification Using Deep Learning Infrared–Visible Image Fusion

Yuze Lu, Mali Gong, Jing Li and Jianshe Ma

1 Key Laboratory of Photonic Control Technology, Ministry of Education, Tsinghua University, Beijing 100083, China
2 International Joint Research Center for Smart Agriculture and Water Security of Yunnan Province, Yunnan Agricultural University, Kunming 650201, China
3 Division of Advanced Manufacturing, Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China
* Authors to whom correspondence should be addressed.
Agronomy 2023, 13(9), 2217; https://doi.org/10.3390/agronomy13092217
Submission received: 29 June 2023 / Revised: 6 August 2023 / Accepted: 19 August 2023 / Published: 24 August 2023
(This article belongs to the Section Precision and Digital Agriculture)

Abstract

Detecting multiple types of strawberry defects and the ripeness stage is highly challenging because of color diversity and visual similarity. Images from hyperspectral near-infrared (NIR) information sources are also limited by their low spatial resolution. In this study, an accurate fusion method for RGB images (with a spatial resolution of 2048 × 1536 pixels) and NIR images (ranging from 700 to 1100 nm in wavelength, covering 146 bands, and with a spatial resolution of 696 × 700 pixels) was proposed to improve the detection of defects and features in strawberries. The fusion method was based on a pretrained VGG-19 model. The high-frequency parts of the original RGB and NIR image pairs were filtered and fed into the pretrained VGG-19 simultaneously. The high-frequency features were extracted and output at the ReLU layers; the $l_1$-norm was used to fuse multiple feature maps into one feature map, and area pixel averaging was introduced to avoid the effect of extreme pixels. The high- and low-frequency parts of the RGB and NIR images were then summed into one image according to their information weights. In the validation section, the detection dataset included an expanded set of 4000 RGB images and 4000 NIR images (with a training-to-testing ratio of 4:1) from 240 strawberry samples labeled as mud contaminated, bruised, both defects, defect-free, ripe, half-ripe, and unripe. The detection neural network YOLOv3-tiny operated on RGB-only, NIR-only, and fused image input modes, achieving the highest mean average precision of 87.18% for the proposed method. Finally, the effects of different RGB and NIR weights on the detection results were also studied. This research demonstrated that the proposed fusion method can greatly improve the defect and feature detection of strawberry samples.

1. Introduction

Ripeness is a crucial factor for determining the best harvest time for fruits, and it significantly influences taste. During the ripening period, the skin of many fruits softens, and surface damage occurs easily before and after picking, affecting the quality of sale. Judging ripeness and surface defects is a complex, comprehensive task. A reliable, rapid, and non-destructive measure would help to improve work efficiency and reduce labor costs. With the development of computer technology, optical and imaging techniques have made these procedures much simpler.
Dhiman et al. [1] comprehensively summarized that RGB is currently the most commonly used color space for crop prediction and detection work. May et al. [2] used the RGB color model and a fuzzy logic technique to assess the ripeness of oil palm fruit. Pardede et al. [3] studied the features extracted from the RGB, HSV, HSL, and L*a*b* color spaces with the help of a support vector machine (SVM). Anraeni et al. [4] achieved strawberry ripeness detection with RGB features and the k-Nearest Neighbor (k-NN) method, which performed well in the non-strawberry category. However, traditional fruit surface imaging methods only represent external information and are not sensitive to the internal information under the skin. Cubero et al. [5] and Costa et al. [6] summarized many commercial vision systems, but these measures only imitate human vision; therefore, they only make judgments on the parts exposed to the cameras. Moreover, because strawberries are herbaceous plants, the fruit grows close to the ground, which can easily cause mud contamination, and the multiple colors of contaminants can make detection difficult. It has been proven that spectroscopy techniques can be used to evaluate the quality and ripeness of fruits and vegetables.
Hyperspectral imaging (HSI) is very useful for detecting bruises under the skin, and it can be used to evaluate changes in fruit quality (e.g., soluble solids content (SSC), moisture content (MC), and firmness). Each pixel within a hyperspectral image contains abundant spectra, so a whole hyperspectral image is a three-dimensional cube and can carry more information than a visible image. Wei et al. [7] extracted hyperspectral features from wavelengths between 400 and 1000 nm to classify the ripeness of persimmon fruit. They built a dataset of different ripeness levels with the help of linear discriminant analysis (LDA) and reached a correct classification rate of 95.3%. Guo et al. [8] chose optimal wavelengths by principal component analysis (PCA) loading and fed the features into an SVM to assess the ripeness of strawberries. Khodabakhshian et al. [9] and Benelli et al. [10] developed different regression models to detect fruit quality attributes and used classification algorithms to evaluate the grade of maturity.
Although HSI can provide richer spectral information, it still suffers from many challenges, including the need to select appropriate spectral bands in advance, the dependence of the results on the selected region of interest (ROI), the low spatial resolution, the slow imaging speed, the unintuitive imaging results, etc. In addition, similar spectral curves might sometimes be ambiguous. Elmasry et al. [11] found that the sharply changing spectral curve between 600 and 800 nm is optimal for evaluating the ripeness of strawberries, while Liu et al. [12] considered that the similarities within the same spectral range could be used to evaluate fungal contamination of strawberries. Unfortunately, similar situations are often encountered in reality.
Furthermore, NIR imaging also provides spectral information that is absent in RGB images, and its imaging speed is faster than that of HSI. Luo et al. [13] proved that the best wavelengths for apple bruise detection lie within the NIR range and that it is feasible to conduct detection using the reflectance difference between the selected wavelengths. Wu et al. [14] used spectral preprocessing and the least squares support vector machine algorithm to detect tomato surface bruising within the 600–1600 nm wavelength range. In a narrower spectral range, Wang et al. [15] reported that the 700–1000 nm spectral wavelengths could be used to separate the severity of chilling injury symptoms in kiwifruits. Additionally, NIR wavelengths have been applied to tasks including defect detection on chilling-injured nectarines [16] and hailstorm-damaged olives [17], as well as the assessment of sunburn [18] and bruise susceptibility [19] in apples, all of which achieved good results.
Unfortunately, current NIR-based methods almost always acquire hyperspectra across the entire NIR range, from which some specific spectra in the ROI are selected for analysis, and the results are likewise limited by the selected area. This results in poor visibility of the final acquired image. Moreover, the presence of filters or spectroscopic prisms in NIR or hyperspectral cameras severely weakens the ability of CCDs to capture the light intensity, which is also a major factor causing spatial resolution degradation. Taken together, image fusion work is necessary to obtain a more visual and intuitive whole image.
Deep learning methods have gained popularity recently. Gao et al. [20] built a real-time HSI system with a pretrained AlexNet convolutional neural network (CNN) to evaluate the early ripe and ripe stages of strawberries. They used two wavelengths (528 and 715 nm) of the spectrum as the dataset in the laboratory and obtained an accuracy of 98.6% on the test dataset. Su et al. [21] developed one-dimensional and three-dimensional hyperspectral information [12] and fed it into 1D and 3D ResNet models to assess the ripeness and SSC of strawberries. They achieved an accuracy of over 84% but were limited by the insufficient capacity of the dataset. Gulzar [22] used a transfer learning technique to classify fruits, which not only improved the accuracy but also effectively solved the challenge of having insufficient training data. However, the problem of spatial resolution degradation was still unresolved, which could negatively affect the performance of neural networks. More importantly, when faced with the co-existence of multiple conditions, such as bruises, diseases, maturity judgments, dirt masking, etc., it is particularly important to avoid the influence of extraneous conditions on the results in a reasonable way. Established experiments have rarely discussed the co-existence of multiple conditions; they are often designed with only one or two variables, which is unlikely in reality.
The main objective of this study was to fuse the RGB and NIR information into a single image with greater visualization and to evaluate strawberry ripeness and surface quality, making use of both spectral and spatial information. In this experiment, we proposed a fusion method for RGB and NIR images. Firstly, the high-frequency information of the RGB image and NIR image was extracted using filters, and the pretrained VGG-19 network was utilized to process the high-frequency part. Next, the high-frequency feature maps were extracted from the ReLU layers of the network, and the pixels of the different high-frequency feature maps were averaged and fused into a single image. Finally, the high- and low-frequency images were fused together. The resulting fused image contained both spatially resolved information and spectral information, which improved the accuracy of the subsequent detection processes.

2. Materials and Methods

2.1. Strawberry Samples

A total of 240 strawberry (Fragaria × ananassa Duch.) samples (‘Hongyan’) were harvested by professional farmers from a local fruit growing base in Shenzhen, Guangdong province, China. These strawberry samples were divided equally into three groups: ripe stage, half-ripe stage, and unripe stage. The ripeness of the samples was judged according to the color of the skin [23,24,25]. After being washed, wiped, and numbered individually, the samples were stored in a thermostat where the relative humidity was set to 70% and the temperature was set to 15–22 °C. Within each group, 35 to 50 strawberries were artificially contaminated with mud, bumped, and rubbed. The firmness of the samples differed depending on the ripeness stage; therefore, the bruises differed after the same level of treatment, which closely reproduced the damage states encountered during real transport. These strawberries were stored in a refrigerator at 5 °C for 1–3 days, during which they were randomly selected and photographed. The strawberry samples showed different damage states, which enriched our dataset.

2.2. Infrared Image and RGB Image Acquisition

2.2.1. Infrared–RGB Imaging System

Images of the strawberry samples were acquired by the hyperspectral reflectance system shown in Figure 1. The core component of the system was a push-broom hyperspectral imager (GaiaField-V10E, Dualix Instruments, Chengdu, China), which contained a hyperspectral imaging part and an RGB imaging part. The former covered the Vis/NIR wavelengths ranging from 400 nm to 1100 nm with a resolution of 2.8 nm. The hyperspectral CCD contained 696 × 700 pixels, and each spectral image contained up to 256 bands (wavelengths). The RGB imaging lens was placed right next to the hyperspectral imaging part; the compact design allowed the two images to be captured without too much difference in the viewing angle. The RGB imaging CCD contained 2048 × 1536 pixels. Two 250 W tungsten halogen lamps (Projection Lamp Type 13162, Philips, Germany) were fixed at the top. The rotatable platform, connected to a stepping motor, allowed the imager to acquire information from multiple viewpoints.

2.2.2. Image Acquisition and Preprocessing

The imaging system could acquire RGB images and NIR (700–1100 nm) images at the same time. Due to the much longer wavelength of NIR, the resolution of the NIR images was dramatically reduced. The acquisition range of the hyperspectral camera was fixed at 700–1100 nm [26], and 146 bands (channels) were contained in each image cube. The raw RGB sample images, which included different defects and ripeness stages, are shown in Figure 2. It should be noted that although the hyperspectral camera was used in this experiment to acquire images in the NIR range, in the following work, the spectral image cube was summed into a single-channel gray-scale map, which is exactly equivalent to the spectral integration performed by an NIR camera. This was carried out because a suitable NIR camera was not available in the laboratory and because the small distance between the two lenses of the hyperspectral camera facilitated subsequent image registration.
The initial images were corrected using a dark (black) reference image and a white reference image captured in the same darkroom environment. The dark reference image was captured by covering the camera with a completely opaque lens cap and turning off the illumination; the white reference image was captured by shooting a standard whiteboard with the illumination on. The final corrected images were calculated by Equation (1):
$$I_c = \frac{I_{initial} - I_{dark}}{I_{white} - I_{dark}} \tag{1}$$
where $I_c$ is the image after correction, $I_{initial}$ is the raw image output by the imager, $I_{dark}$ is the dark reference image, and $I_{white}$ is the white reference image. The whole correction procedure was executed by the software bundled with the imager.
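For illustration, a minimal NumPy sketch of the correction in Equation (1) is given below; it assumes the three images are already loaded as float arrays of identical shape, and the small epsilon term is our addition to avoid division by zero, not part of the original procedure.

```python
import numpy as np

def correct_reflectance(i_initial: np.ndarray,
                        i_dark: np.ndarray,
                        i_white: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
    """Apply I_c = (I_initial - I_dark) / (I_white - I_dark) pixel-wise."""
    corrected = (i_initial - i_dark) / (i_white - i_dark + eps)
    return np.clip(corrected, 0.0, 1.0)  # keep the corrected reflectance in [0, 1]
```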
Common NIR images have a single channel, and each image cube should be transformed into the same form. This procedure was based on Equation (2):
$$I_{fused}(x,y) = \frac{\sum_{i=1}^{N} I_c^{i}(x,y)}{N} \tag{2}$$
where $I_{fused}$ is the single-channel gray-scale image after spectral fusion, $(x,y)$ is the spatial coordinate within each image, $N$ is the number of channels within the NIR range, which was set to 146, and $i$ represents the $i$-th channel in each image cube.
After obtaining the three-channel RGB image and the single-channel NIR gray-scale image, registration was performed between the NIR and RGB images because of their slightly different shooting views. The cross-correlation method [27] was used to align and calibrate the pixels of the two source images. According to the similarity between the RGB image and the NIR image, each pixel was matched to avoid blurring and ghosting.
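As a rough illustration of this step, the sketch below estimates a single global translation between the two views with normalized cross-correlation (template matching), in the spirit of [27]; the per-pixel matching used in the paper is more elaborate, and the template size and interpolation choices here are assumptions.

```python
import cv2
import numpy as np

def align_nir_to_rgb(rgb_bgr: np.ndarray, nir: np.ndarray) -> np.ndarray:
    """Resize the NIR image to the RGB size and shift it onto the RGB view."""
    rgb_gray = cv2.cvtColor(rgb_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    h, w = rgb_gray.shape
    nir_f = cv2.resize(nir.astype(np.float32), (w, h))

    # Use the central region of the NIR image as a template and find its best
    # match in the RGB image via normalized cross-correlation.
    template = nir_f[h // 4: 3 * h // 4, w // 4: 3 * w // 4]
    response = cv2.matchTemplate(rgb_gray, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(response)

    # Translation that moves the NIR content onto the matched RGB position.
    dx, dy = max_loc[0] - w // 4, max_loc[1] - h // 4
    shift = np.float32([[1, 0, dx], [0, 1, dy]])
    return cv2.warpAffine(nir_f, shift, (w, h))
```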

2.2.3. Dataset Preparation

The data included RGB images and the corresponding NIR images, which were captured from the same scene and registered. With each exposure, the rotatable platform rotated at a random angle. Each strawberry was photographed three to five times. The total set of samples included 1000 RGB images and 1000 NIR images from 240 samples. Before being divided into different sets, all image pairs were fused by the proposed method.
All fused images were divided into three groups according to the ripeness stages. As mentioned above, a certain number of samples within each group were randomly contaminated, bumped, and rubbed, so each fused image contained the specific features of the corresponding samples. The fused images were manually labeled with the tool LabelImg (URL: https://github.com/tzutalin/labelImg, accessed on 1 October 2022); 80% of the fused images were randomly selected to build the training set, and the remaining images were employed as the testing set. To overcome the issue of having insufficient data, an expansion procedure was performed before training the deep learning detection network, as follows: (1) flipping each fused image horizontally; (2) scaling each fused image by a random factor of 0.8–1.5; (3) panning each fused image in one of four directions (up, down, right, or left). A sketch of these operations follows this paragraph. By using the above three approaches, the size of the dataset was expanded from 1000 raw fused images to 4000 fused images. The augmented images were likewise divided into training and testing sets at a ratio of 80% to 20%; after removing invalid images, the augmented training set contained 3150 fused images, and the augmented testing set contained 783 fused images. Detailed information about the training set and testing set is given in Table 1 and Table 2.
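A minimal OpenCV sketch of the three augmentation operations is shown below; the panning range is an assumption, and in practice the bounding-box annotations would have to be transformed together with the images.

```python
import random
import cv2
import numpy as np

def augment(image: np.ndarray):
    h, w = image.shape[:2]

    flipped = cv2.flip(image, 1)                       # (1) horizontal flip

    scale = random.uniform(0.8, 1.5)                   # (2) random 0.8-1.5x scaling
    scaled = cv2.resize(image, None, fx=scale, fy=scale)

    dx = random.randint(-w // 10, w // 10)             # (3) panning; the +/-10% range
    dy = random.randint(-h // 10, h // 10)             #     is an assumption
    shift = np.float32([[1, 0, dx], [0, 1, dy]])
    panned = cv2.warpAffine(image, shift, (w, h))

    return flipped, scaled, panned
```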
Labels, including the three ripeness stages, mud pollution, and mechanical damage, were manually placed on fused image samples, as shown in Figure 3. The background and ROI did not need to be predefined because the neural network only paid attention to the whole sample regions after training.

2.3. Proposed Framework

The convolutional neural network (CNN) is a very effective machine learning method. Owing to its high adjustability and generality, it learns from large amounts of data, extracts data features, and thus identifies complex patterns [28]. The creation of the CNN relieved researchers of the need to solve difficult mathematical formulas and physical problems. VGG-19 [29] and other VGG-based models have been proven to be commonly used and reliable network models for image analysis and recognition. They were developed by the Visual Geometry Group at the University of Oxford. Their simple network structure can be adapted for secondary development according to different tasks, which is the main reason why VGG-series networks are favored by researchers in the agriculture and biology domains [30]. Previous works validated VGG-19 for vegetable texture recognition [31] and the classification of fruit ripeness [32]. Moreover, in the domain of RGB research images, the image sources are easier to acquire and annotate. Mamat et al. [33] used YOLOv5 to annotate fruit varieties, which made data acquisition more convenient. However, the advantages of RGB-NIR fused images have not been fully exploited. In the following sections, a pretrained VGG-19-based network is used to extract image features, fuse the RGB and NIR images, and classify the different groups of strawberry samples.

2.3.1. Fusion of the Low-Frequency Part

The convolutional kernels in VGG-19 can be seen as different types of filters, and the pretrained model automatically adjusts the parameters of each convolutional kernel. In this experiment, the pretrained model of ImageNet [34] was implemented. The high-frequency (HF) part of the images contained much richer information than the low-frequency (LF) part; therefore, the method described in [35] was used to decompose each source image pair into LF and HF parts. The LF parts were fused directly by the weighted-averaging strategy; the fused LF part was calculated by Equation (3):
$$F_{LF}(x,y) = \alpha_{rgb} I_{LF}^{rgb}(x,y) + \alpha_{nir} I_{LF}^{nir}(x,y) \tag{3}$$
where $(x,y)$ denotes the corresponding position of the image intensity in the LF RGB image $I_{LF}^{rgb}$ and the LF NIR image $I_{LF}^{nir}$, and $\alpha_{rgb}$ and $\alpha_{nir}$ represent the weights of $I_{LF}^{rgb}$ and $I_{LF}^{nir}$, respectively. In the following experiment, changes in the two weight ratios had little effect on the final results, so $\alpha_{rgb}$ and $\alpha_{nir}$ were both set to 0.5.
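The sketch below illustrates the decomposition and the weighted low-frequency fusion of Equation (3) on single-channel float images; a simple box filter stands in for the guided-filter-style decomposition of [35], and the kernel size is an assumption.

```python
import cv2
import numpy as np

def decompose(image: np.ndarray, ksize: int = 31):
    """Split an image into low-frequency (base) and high-frequency (detail) parts."""
    low = cv2.blur(image.astype(np.float32), (ksize, ksize))
    high = image.astype(np.float32) - low
    return low, high

def fuse_low_frequency(lf_rgb: np.ndarray, lf_nir: np.ndarray,
                       alpha_rgb: float = 0.5, alpha_nir: float = 0.5) -> np.ndarray:
    # F_LF = alpha_rgb * I_LF^rgb + alpha_nir * I_LF^nir  (Equation (3))
    return alpha_rgb * lf_rgb + alpha_nir * lf_nir
```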

2.3.2. Texture Extraction of the High-Frequency Part

For the HF contents $I_{HF}^{rgb}$ and $I_{HF}^{nir}$, the proposed VGG-19-based framework was used to extract texture features. Both forms of images were processed in the same way by the network.
In the pretrained VGG-19 network, the parameters were adjusted to their optimal values and fixed. The number of channels in the $i$-th layer of VGG-19 is $M = 64 \times 2^{i-1}$, $i \in \{1, 2, 3, 4\}$. The feature maps of the input images were extracted by the convolutional kernels and were output at the activation layers, denoted by $ReLU_1$, $ReLU_2$, $ReLU_3$, $ReLU_4$, and $ReLU_5$:
$$T_{rgb,nir}^{i} = ReLU_i\!\left(I_{HF;rgb,nir}^{i}\right) \tag{4}$$
where $I_{HF;rgb,nir}^{i}$ is the high-frequency RGB or NIR image fed into the $i$-th activation layer, $ReLU_i(\cdot)$ is the ReLU activation function applied within the $i$-th layer, and $T_{rgb,nir}^{i}$ is the image feature extracted and output by the $i$-th layer. $T_{rgb,nir}^{i}$ is a multidimensional vector whose depth depends on the number of channels of its layer. The structure of the neural network and the output features are shown in Figure 4.
The abundance of information can be measured by the pixel intensity within the images, and each input image corresponds to hundreds of feature maps. Using the $l_1$-norm [36], the feature maps corresponding to each input image were added up with Equation (5), and the information characteristics of each feature map were emphasized. This also took full advantage of the multiple convolutional kernels in the neural network when extracting features.
$$\hat{T}_{rgb,nir}^{i} = \sum_{m=1}^{M} T_{rgb,nir}^{i,m} \tag{5}$$
where $T_{rgb,nir}^{i,m}$ is the $m$-th feature map of the RGB or NIR image output by the $i$-th layer, whose number of channels is $M = 64 \times 2^{i-1}$. After adding up all the feature maps along the channel dimension, $\hat{T}_{rgb,nir}^{i}$ is the total output feature map of each activation layer. To avoid the effect of extreme pixels on the fused image, area pixel averaging was introduced to make the algorithm robust, as in Equation (6).
$$\bar{T}_{rgb,nir}^{i}(x,y) = \frac{\displaystyle\sum_{\Delta x=-r}^{r} \sum_{\Delta y=-r}^{r} \hat{T}_{rgb,nir}^{i}(x+\Delta x, y+\Delta y)}{(2r+1)^2} \tag{6}$$
where $r$ is the area size. In this paper, $r = 1$, so more texture information was preserved while guaranteeing the robustness of the algorithm.
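A hedged PyTorch sketch of this high-frequency texture extraction is given below: the HF image is passed through a pretrained VGG-19, the output of one ReLU per convolutional block is taken (the specific layer indices are our choice, not stated in the paper), the feature maps are summed over the channel dimension as in Equation (5), and 3 × 3 area averaging implements Equation (6) with r = 1.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Pretrained ImageNet weights, feature extractor only, inference mode.
vgg_features = vgg19(pretrained=True).features.eval()
RELU_INDICES = [1, 6, 11, 20, 29]  # first ReLU of each of the five conv blocks (assumption)

@torch.no_grad()
def hf_activity_maps(hf_image: torch.Tensor) -> list:
    """hf_image: (1, 3, H, W) float tensor; a single-channel HF image can be
    repeated to three channels (ImageNet normalization is omitted here)."""
    maps, x = [], hf_image
    for idx, layer in enumerate(vgg_features):
        x = layer(x)
        if idx in RELU_INDICES:
            summed = x.abs().sum(dim=1, keepdim=True)                            # Eq. (5): channel-wise l1 sum
            averaged = F.avg_pool2d(summed, kernel_size=3, stride=1, padding=1)  # Eq. (6), r = 1
            maps.append(averaged)
            if len(maps) == 5:
                break
    return maps
```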

2.3.3. Fusion of RGB-NIR Images

As the network became deeper, the size of the feature maps output by the activation functions became smaller because of the pooling operators. An up-sampling operator was therefore introduced to restore the output feature maps of each layer to the size of the original images $I_{HF}^{rgb}$ and $I_{HF}^{nir}$.
In each output layer, the up-sampled $\bar{T}_{rgb,nir}^{i}$ maps were weighted and added, and the outputs of all five layers were summed as well, according to Equation (7), to obtain the fused HF part:
$$F_{HF}(x,y) = \sum_{i=1}^{N} \left( \alpha_{rgb} \bar{T}_{rgb}^{i}(x,y) + \alpha_{nir} \bar{T}_{nir}^{i}(x,y) \right) \tag{7}$$
where $\bar{T}_{rgb}^{i}$ and $\bar{T}_{nir}^{i}$ in Equation (7) denote the up-sampled feature maps, and $N = 5$ is the number of activation layers. The extraction and processing flow of the HF feature images is shown in Figure 5.
With the fused LF part and the fused HF part obtained, the final fused RGB-NIR image was generated by Equation (8):
$$F(x,y) = F_{LF}(x,y) + F_{HF}(x,y) \tag{8}$$
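The following sketch combines Equations (7) and (8), taking Equation (7) literally: each layer's feature map is up-sampled back to the source resolution, the RGB and NIR maps are mixed with their weights, and the result is added to the fused low-frequency part. All tensors are assumed to be single-channel (1, 1, H, W) PyTorch tensors, and the bilinear interpolation mode is our choice.

```python
import torch
import torch.nn.functional as F

def fuse_rgb_nir(lf_fused: torch.Tensor,
                 hf_maps_rgb: list, hf_maps_nir: list,
                 alpha_rgb: float = 0.5, alpha_nir: float = 0.5) -> torch.Tensor:
    """lf_fused: (1, 1, H, W); hf_maps_*: per-layer maps from hf_activity_maps()."""
    size = lf_fused.shape[-2:]
    hf_fused = torch.zeros_like(lf_fused)
    for t_rgb, t_nir in zip(hf_maps_rgb, hf_maps_nir):
        t_rgb_up = F.interpolate(t_rgb, size=size, mode="bilinear", align_corners=False)
        t_nir_up = F.interpolate(t_nir, size=size, mode="bilinear", align_corners=False)
        hf_fused = hf_fused + alpha_rgb * t_rgb_up + alpha_nir * t_nir_up  # Eq. (7)
    return lf_fused + hf_fused                                             # Eq. (8)
```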

2.4. Detection Evaluation

In order to test the performance of fused images in the detection process, we introduced evaluation metrics to measure the results. The detection of strawberry ripeness stages, defect situations, and mud contamination depending on different image inputs could be evaluated by the precision (P) and recall (R), which were calculated with Equations (9) and (10) [37]:
$$P = \frac{TP}{TP + FP} \tag{9}$$
$$R = \frac{TP}{TP + FN} \tag{10}$$
where TPs, FPs, FNs, and TNs represent true positives, false positives, false negatives, and true negatives, respectively. The relationships among the evaluation symbols, detection results, and labels are shown in Table 3.
To further evaluate the results of the detection model, according to Rezatofighi et al. [38], the intersection over union (IoU) was used to define the overlapping area of the prediction box and ground truth box. In this paper, if the IoU was greater than 0.7, the prediction result was considered to be a TP. Otherwise, the prediction result was considered to be a FP. Based on the IoU, the P and R were calculated, and the average precision (AP) and mean average precision (mAP) were calculated to further assess the performance with different image inputs. Greater values of P, R, AP, and mAP indicate a better detection performance. The AP and mAP are calculated respectively as shown in Equations (11) and (12).
$$AP = \sum_{n=1}^{N} \left( R_n - R_{n-1} \right) P_n \tag{11}$$
$$mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i \tag{12}$$
where $n$ is the index of the different confidence threshold points on the P-R curve, $P_n$ is the corresponding precision value, and $R_n$ is the corresponding recall value. $K$ is the total number of sample classes, and $AP_i$ indicates the average precision of the $i$-th class.
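For reference, the sketch below shows plain NumPy versions of these metrics: the IoU of two boxes given as (x1, y1, x2, y2), the AP of Equation (11) computed from ordered precision-recall points, and the mAP of Equation (12); the box format and the small stabilizing constant are assumptions.

```python
import numpy as np

def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(recalls, precisions) -> float:
    """Equation (11): sum of (R_n - R_{n-1}) * P_n with R_0 = 0."""
    r = np.concatenate(([0.0], np.asarray(recalls, dtype=float)))
    p = np.asarray(precisions, dtype=float)
    return float(np.sum((r[1:] - r[:-1]) * p))

def mean_average_precision(ap_per_class) -> float:
    """Equation (12): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```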

3. Results and Discussion

3.1. Evaluation of the Fused Images

One of the key points of this study was to propose an image fusion algorithm that facilitates the detection of fruit quality. The fused images must be evaluated, as their quality directly affects the subsequent detection performance. Four quality metrics, namely the mutual information (MI, based on an information measure) [39], multi-scale structural similarity (MS-SSIM, based on a structural similarity measure) [40], standard deviation (SD, based on image feature evaluation) [41], and modified fusion artifacts measure ($N_{abf}$, based on noise evaluation) [42], were employed to assess the fusion performance.
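Two of these metrics are simple enough to sketch directly; the snippet below computes the mutual information between a source image and the fused image from their joint histogram, and the standard deviation of the fused image. MS-SSIM and $N_{abf}$ rely on the dedicated formulations in [40,42] and are not reproduced here; in fusion evaluation, MI is typically reported as the sum of MI(fused, RGB) and MI(fused, NIR), which we assume here as well.

```python
import numpy as np

def mutual_information(img_a: np.ndarray, img_b: np.ndarray, bins: int = 256) -> float:
    """MI estimated from the joint gray-level histogram of two images."""
    joint, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def fusion_mi(fused: np.ndarray, rgb_gray: np.ndarray, nir: np.ndarray) -> float:
    return mutual_information(fused, rgb_gray) + mutual_information(fused, nir)

def standard_deviation(fused: np.ndarray) -> float:
    return float(np.std(fused))
```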
In order to compare the fusion performance of the proposed method with that of established algorithms, three typical fusion methods, DenseFuse [43], FusionGAN [44], and U2Fusion [45], were introduced and compared with our proposed method. The compared methods are widely used in many fields [46,47,48], including object detection, recognition, and tracking. From the image data, 25 RGB and NIR image pairs and their fused images calculated using the different fusion methods were randomly selected. These selected images were not labeled, as only the fusion performance was verified in this section. The visible fusion results of the different fusion methods are shown in Figure 6.
The average values of the four quality metrics for the proposed and comparative methods are shown in Table 4.
From Table 4, the proposed fusion method obtained the best values of MI, MS-SSIM, SD, and $N_{abf}$ among the compared methods. A reliable fusion result was essential for the subsequent detection.
One of the strengths of the proposed method is that the convolutional kernels of the pretrained network are learned automatically and fixed at their optimal values, so they act as a rich and varied bank of filters. This makes the method more efficient and more flexible than traditional manually defined filtering algorithms. First, with the more varied feature outputs, the final fused images contain more complete information; therefore, the MI and MS-SSIM indexes were the largest among the compared methods. Second, the multiple feature output layers and the area pixel averaging strategy reduced the influence of extreme pixels, so few subjective filtering artifacts and little noise were introduced; hence, the SD and $N_{abf}$ indexes performed well. The proposed fusion method obtained a good fusion performance, restored more content, and had high structural integrity and low noise, which was very convenient for the following detection and classification work.

3.2. Performance of Fused Image in Detection

3.2.1. Preparation of Detection Network and Dataset

Detection neural networks incorporate part of the functionality of classification networks and are widely used for recognizing fruit and vegetable characteristics. In this field, the YOLO series of networks is very popular and is constantly being updated [49,50,51]. Additionally, Yang et al. [52] and Tian et al. [53] compared multiple types of YOLO networks with other detection networks in detail, reaching the conclusion that the YOLO series networks perform better in complex detection environments. YOLOv3-tiny [54] achieves a balance between speed and accuracy; it has the advantages of a smaller parameter volume, faster calculation speed, and high accuracy, which were validated in [52]. In this paper, YOLOv3-tiny was employed to train and test on the fused images generated by the proposed method and was compared with the results obtained from the general image inputs. For the parameter settings used in training and testing, we referred to Yang et al. [52] and Shi et al. [54].
The training and testing sets of YOLOv3-tiny were divided as shown in Table 1 and Table 2. The hyperparameters are demonstrated in Table 5, and the loss curves are shown in Figure 7.
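For readers who wish to reproduce the detection step, one possible way to run a trained YOLOv3-tiny model on the fused images is via OpenCV's DNN module, as sketched below; the configuration, weight, and class-name file names are placeholders, and the confidence handling follows common practice rather than the exact settings used in this work.

```python
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3-tiny-strawberry.cfg",      # placeholder file names
                                 "yolov3-tiny-strawberry.weights")
class_names = open("strawberry.names").read().splitlines()

def detect(fused_bgr: np.ndarray, conf_threshold: float = 0.5):
    h, w = fused_bgr.shape[:2]
    blob = cv2.dnn.blobFromImage(fused_bgr, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    outputs = net.forward(net.getUnconnectedOutLayersNames())

    detections = []
    for out in outputs:
        for row in out:                       # row = [cx, cy, bw, bh, objectness, class scores...]
            scores = row[5:]
            cls = int(np.argmax(scores))
            conf = float(scores[cls] * row[4])
            if conf >= conf_threshold:
                cx, cy, bw, bh = row[0] * w, row[1] * h, row[2] * w, row[3] * h
                detections.append((class_names[cls], conf,
                                   (cx - bw / 2, cy - bh / 2, bw, bh)))
    return detections
```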

3.2.2. Comparison of the Fused Image Input with the Traditional Input

Traditional input methods feed either the RGB images or the NIR images directly into the network. The labeling processes for the RGB and NIR images were the same as those for the fused images; the manually labeled anchor box coordinates and corresponding labels on the fused images could be transferred directly to the RGB and NIR images. The network parameter settings for the training and testing processes were the same for all three contrasting experiments. In the randomly selected valid testing set, 255 images/samples were in the fully ripe stage, 283 images were labeled as half-ripe, and 245 images were labeled as unripe. More detailed labels are shown in Table 2.
For the detection results based on the fused images, Table 6 shows the performance on the three ripeness stages and four quality situations. For brevity, Con, B, BD, and DF are used to denote contamination, bruises, both defects, and defect-free, and R, HR, and U are used to denote the ripe, half-ripe, and unripe stages, respectively. In the ripe stage, the proposed method achieved values of 84.75–91.58% P and 82.46–93.62% R; in the half-ripe stage, the proposed method achieved values of 77.94–86.15% P and 82.14–86.81% R; and in the unripe stage, the proposed method achieved values of 79.59–86.59% P and 75.00–88.75% R. It is worth emphasizing that, compared with the ripe stage, the P and R of bruises in the half-ripe and unripe stages decreased significantly, which in turn caused the results for the both-defects category to decrease as well. This is because, as the strawberry samples ripen, their skin becomes softer and more vulnerable to damage [55]; under the same bumping and rubbing treatments, the bruises on the skin of the ripe strawberries were larger and more visible. Furthermore, the discrimination of the sample ripeness stages is closer to a classification process, and the output of YOLOv3-tiny is the ripeness level of the whole strawberry sample, without using the IoU to measure it. After calculation, the proposed method achieved values of 92.61% P and 93.33% R for the ripe stage, 88.42% P and 89.05% R for the half-ripe stage, and 92.12% P and 90.61% R for the unripe stage. Another reason for the lower P and R values in the half-ripe group was that the boundary between the ripe and half-ripe stages, as well as between the unripe and half-ripe stages, was not strictly defined, which could cause some misjudgment.
Compared with the results for the fused images, the P and R values of the RGB-only-based group and the NIR-only-based group were calculated in the same way. As shown in Table 7, the P and R values of the RGB-only-based group were, overall, obviously lower than those of the NIR-only-based and fused-image-based groups. In particular, the detection performance on contaminated and bruised samples was the worst, with weighted average P and R values of only 75.72% and 76.21%, respectively. According to these results, it was difficult to distinguish well between contaminated and bruised samples based only on the RGB images; the various contaminant colors and bruises could become mixed up, making the detection network ineffective. However, the results of ripeness detection were outstanding and were even comparable to the fused-image-based results. This is because the anthocyanins that accumulate during ripening give ripe strawberries their red color [56], which is easily reflected and detected in RGB images.
For the NIR-only-based group, the overall performance was between those of the RGB-only- and fused-image-based groups. Compared with the proposed method, the R values for the quality situations in the NIR-only-based group were smaller, which was caused by the misjudgment of ripeness in each situation; this also decreased the P of the other situations within each group. The NIR-only-based group achieved a good performance for the classification of contaminated and bruised strawberries; especially in the half-ripe and unripe stages, the stains (or bruises) on the harder skins were better suited to analysis by NIR images. The lower P values needed to be compensated for by the excellent ripeness judgment of the RGB-only-based method. This provided clear evidence of the need for the proposed method.
Given the evidence, it can be seen that bruises and contamination on the skin of the samples were distinguished mainly by the reflectance of the target regions within the NIR wavelength range, while the classification of the ripeness stages was strongly associated with the RGB images. Traditional methods divide the images into several ROIs and calculate the mean relative reflectance spectra [57] within each ROI. Each bruise or stain has different characteristics in different spectral ranges; however, in the visualized images, the mean relative reflectance can be completely represented by the gray-scale values in the NIR image [58]. This confirmed the feasibility of using the whole NIR image for classification. The detection results of six samples using the three methods are shown in Figure 8, which allows us to visualize the detection performance of the different methods.
The AP values of the four quality situations and ripeness stage detection are shown in Figure 9. The proposed method (mAP of 87.18%) has 5.42% and 3.61% higher performance levels in terms of the mAP metrics than the RGB-only-based method (mAP of 81.74%) and the NIR-only-based method (mAP of 83.57%), respectively. This is caused by the uneven performances of the two traditional methods. The proposed method properly combines the advantages of the two traditional ways and thus yields good results.
However, in the comparison experiments, the initial settings of the fusion weights for RGB and NIR were 0.5 and 0.5, but the optimal weighting ratio is not necessarily 1:1 in different practical situations and tasks. In the next experiments, the performance of the proposed method was studied under different RGB and NIR weighting ratios.

3.2.3. Comparison of Different RGB and NIR Weights

In the previous discussions and experiments, the fusion weights of both RGB and NIR were set to 0.5, i.e., the high- and low-frequency RGB and NIR parts had the same proportion in the final fused image. Although the proposed method obtained excellent results on our self-made dataset when compared with the RGB-only- and NIR-only-based groups, it is clearly not realistic for RGB and NIR to always have the same weights in all tasks. In order to study the specific performance of fused images with different RGB and NIR weights, a new experiment was designed to investigate which detection tasks the different RGB/NIR weight ratios are best suited to. The locations of the anchor boxes and the corresponding labels were fixed, as in the previous experiment. The weight of RGB was used as the main reference and was changed from 0.1 to 0.9 (with the corresponding NIR weight changing from 0.9 to 0.1) in steps of 0.05.
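The sweep itself is straightforward to script; the sketch below iterates over the RGB weight, re-fuses the image pairs, and re-evaluates the detector, where `fuse_pair` and `evaluate_aps` are hypothetical stand-ins for the fusion and evaluation routines described above.

```python
import numpy as np

def sweep_weights(image_pairs, fuse_pair, evaluate_aps):
    """image_pairs: list of (rgb, nir) arrays; fuse_pair(rgb, nir, a_rgb, a_nir)
    and evaluate_aps(fused_set) are placeholders for the steps described above."""
    results = []
    for alpha_rgb in np.arange(0.10, 0.95, 0.05):        # RGB weight from 0.1 to 0.9
        alpha_nir = 1.0 - alpha_rgb
        fused_set = [fuse_pair(rgb, nir, alpha_rgb, alpha_nir) for rgb, nir in image_pairs]
        quality_ap, ripeness_ap = evaluate_aps(fused_set)
        results.append((round(float(alpha_rgb), 2), quality_ap, ripeness_ap))
    return results
```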
According to Section 3.2.2, we initially learned that good detection results of quality situations and ripeness stages were not available at the same time in the RGB-only-based and NIR-only-based groups. This meant that the two detection tasks mainly relied on the information from the different sources of images. It was essential to measure the quality situation detection and ripeness stage detection separately to obtain the optimal weighting matches for different tasks. Based on this division method, the graph of results with the changes in the RGB/NIR weights is demonstrated in Figure 10. As the weights of NIR and RGB in the fused image varied, both the ripeness AP and quality AP exhibited significant changes. We observed that the ripeness AP reached its maximum value when the NIR weight was 0.35. Meanwhile, the quality AP achieved its highest value when the NIR weight ranged from 0.35 to 0.45. To show the results more briefly, the detection performance is shown in Figure 11 with a weight ratio of 0.1 as the minimum scale of variation.
Taking the RGB weight as the main reference, both AP curves reach a maximum within the tested range, and the trends of the two tasks do not change monotonically. The detection of quality situations obtained the best performance at an RGB weight of about 0.4, while the best detection of the ripeness stages was obtained at around 0.7. This means that it is impossible to find a single weight suitable for all detection tasks; the assignment of weights can only be chosen based on the main purpose of the task, so it is appropriate to discuss separately the tasks to which the two information sources apply. The quality AP decreased rapidly and obviously after reaching its maximum at a weight of around 0.4; this was because the various colors of contamination and bruises confused the detection model in the RGB color space, and the NIR information was necessary for detection. However, before the quality AP reached its maximum value, the overall spatial resolution of the fused image was too low because of the excessive NIR component, which was also detrimental to the detection results [42,43,47]. This is an important reason why the mean relative reflectance of the ROI is usually chosen as the object of study in NIR analyses. RGB images are much more sensitive to colors than NIR images, so when detecting the ripeness stage, especially when ripeness can be judged well and accurately by the degree of red (ripe) versus green or white (unripe), the overall ripeness AP rose as the RGB weight increased. The maximum ripeness AP appeared at a weight of around 0.7, and the AP did not fall back significantly as the weight increased further; the small decrease that occurred might be due to ambiguous definitions and errors introduced when producing the sample set.
Taken together, this section of study supports the idea that the use of fixed weights of RGB and NIR in fused images for strawberry sample feature detection is not always the best option. In order to cope with multiple complex tasks, a system with the ability to choose weights flexibly is essential.

4. Conclusions

Strawberries are fruits with soft skin, which can easily be damaged or stained, thus affecting their commercial value. Many researchers have detected the ripeness and defects of strawberries based on RGB or NIR images. However, many limitations are often encountered: it is difficult to detect defects under the skin and contamination with visually similar colors in RGB images, while NIR images can capture the reflectance differences of defects and contamination but suffer from a severe lack of spatial resolution. In order to combine the advantages of NIR images in detecting feature changes under the skin with the high spatial resolution of RGB images for non-destructive strawberry detection, a neural network based on the pretrained VGG-19 was used to fuse the images. Using a careful design, we processed the low- and high-frequency parts of the images separately, preserving the detailed parts of the fused image to a greater extent. The fused images not only retained the high spatial resolution of the RGB images but also contained richer spectral information. This also effectively avoided the regional limitations of the mean-relative-reflectance method applied to each ROI, effectively improving the objectivity of the detection results. Compared with the current mainstream algorithms, the proposed fusion method achieved the best structural similarity and information content and minimal noise introduction, and the detection based on the fused images performed 5.42% and 3.61% better (in mAP) than the RGB-only-based and NIR-only-based groups, respectively.
In addition, in order to fit multiple detection objects and tasks, the proposed model can also modify the weights of the RGB and NIR parts. To cope with the different sensitivity levels of the information sources for different detection objects, specific weights need to be determined according to the practical situation in real applications; the method proposed in this paper provides this flexibility, which can greatly facilitate multi-task detection. Because of hardware limitations, the computational speed still needs to be improved when dealing with a large number of samples at the same time. However, we are optimistic that faster computing platforms and better optimization algorithms will be developed. These techniques will facilitate real-time detection and will also allow the model to detect multiple varieties and characteristics of fruits simultaneously.

Author Contributions

Y.L. was primarily responsible for conceiving the method and writing the source code and the paper. M.G. designed the experiments and revised the paper. J.L. and J.M. provided the experimental equipment. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are available upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dhiman, P.; Kaur, A.; Balasaraswathi, V.; Gulzar, Y.; Alwan, A.A.; Hamid, Y. Image Acquisition, Preprocessing and Classification of Citrus Fruit Diseases: A Systematic Literature Review. Sustainability 2023, 15, 9643. [Google Scholar] [CrossRef]
  2. May, Z.; Amaran, M. Automated ripeness assessment of oil palm fruit using RGB and fuzzy logic technique. In Proceedings of the 13th WSEAS International Conference on Mathematical and Computational Methods in Science and Engineering, Angers, France, 17–19 November 2011; pp. 52–59. [Google Scholar]
  3. Pardede, J.; Husada, M.G.; Hermana, A.N.; Rumapea, S.A. Fruit Ripeness Based on RGB, HSV, HSL, L*a*b* Color Feature Using SVM. In Proceedings of the IEEE 2019 International Conference of Computer Science and Information Technology (ICoSNIKOM), Medan, Indonesia, 28–29 November 2019; pp. 1–5. [Google Scholar]
  4. Anraeni, S.; Indra, D.; Adirahmadi, D.; Pomalingo, S. Strawberry Ripeness Identification Using Feature Extraction of RGB and K-Nearest Neighbor. In Proceedings of the 2021 3rd East Indonesia Conference on Computer and Information Technology (EIConCIT), Surabaya, Indonesia, 9–11 April 2021; pp. 395–398. [Google Scholar]
  5. Cubero, S.; Aleixos, N.; Moltó, E.; Gómez-Sanchis, J.; Blasco, J. Advances in machine vision applications for automatic inspection and quality evaluation of fruits and vegetables. Food Bioprocess Technol. 2011, 4, 487–504. [Google Scholar] [CrossRef]
  6. Costa, C.; Antonucci, F.; Pallottino, F.; Aguzzi, J.; Sun, D.W.; Menesatti, P. Shape analysis of agricultural products: A review of recent research advances and potential application to computer vision. Food Bioprocess Technol. 2011, 4, 673–692. [Google Scholar] [CrossRef]
  7. Wei, X.; Liu, F.; Qiu, Z.; Shao, Y.; He, Y. Ripeness classification of astringent persimmon using hyperspectral imaging technique. Food Bioprocess Technol. 2014, 7, 1371–1380. [Google Scholar] [CrossRef]
  8. Guo, C.; Liu, F.; Kong, W.; He, Y.; Lou, B. Hyperspectral imaging analysis for ripeness evaluation of strawberry with support vector machine. J. Food Eng. 2016, 179, 11–18. [Google Scholar]
  9. Khodabakhshian, R.; Emadi, B. Application of Vis/SNIR hyperspectral imaging in ripeness classification of pear. Int. J. Food Prop. 2017, 20, S3149–S3163. [Google Scholar] [CrossRef]
  10. Benelli, A.; Cevoli, C.; Fabbri, A.; Ragni, L. Ripeness evaluation of kiwifruit by hyperspectral imaging. Biosyst. Eng. 2022, 223, 42–52. [Google Scholar] [CrossRef]
  11. ElMasry, G.; Wang, N.; ElSayed, A.; Ngadi, M. Hyperspectral imaging for nondestructive determination of some quality attributes for strawberry. J. Food Eng. 2007, 81, 98–107. [Google Scholar] [CrossRef]
  12. Liu, Q.; Sun, K.; Zhao, N.; Yang, J.; Zhang, Y.; Ma, C.; Pan, L.; Tu, K. Information fusion of hyperspectral imaging and electronic nose for evaluation of fungal contamination in strawberries during decay. Postharvest Biol. Technol. 2019, 153, 152–160. [Google Scholar] [CrossRef]
  13. Luo, X.; Takahashi, T.; Kyo, K.; Zhang, S. Wavelength selection in vis/NIR spectra for detection of bruises on apples by ROC analysis. J. Food Eng. 2012, 109, 457–466. [Google Scholar] [CrossRef]
  14. Wu, G.; Wang, C. Investigating the effects of simulated transport vibration on tomato tissue damage based on vis/NIR spectroscopy. Postharvest Biol. Technol. 2014, 98, 41–47. [Google Scholar] [CrossRef]
  15. Wang, Z.; Künnemeyer, R.; McGlone, A.; Burdon, J. Potential of Vis-NIR spectroscopy for detection of chilling injury in kiwifruit. Postharvest Biol. Technol. 2020, 164, 111160. [Google Scholar] [CrossRef]
  16. Lurie, S.; Vanoli, M.; Dagar, A.; Weksler, A.; Lovati, F.; Zerbini, P.E.; Spinelli, L.; Torricelli, A.; Feng, J.; Rizzolo, A. Chilling injury in stored nectarines and its detection by time-resolved reflectance spectroscopy. Postharvest Biol. Technol. 2011, 59, 211–218. [Google Scholar] [CrossRef]
  17. Moscetti, R.; Haff, R.P.; Monarca, D.; Cecchini, M.; Massantini, R. Near-infrared spectroscopy for detection of hailstorm damage on olive fruit. Postharvest Biol. Technol. 2016, 120, 204–212. [Google Scholar] [CrossRef]
  18. Grandón, S.; Sanchez-Contreras, J.; Torres, C.A. Prediction models for sunscald on apples (Malus domestica Borkh.) cv. Granny Smith using Vis-NIR reflectance. Postharvest Biol. Technol. 2019, 151, 36–44. [Google Scholar] [CrossRef]
  19. Jian, Y.; Jiyu, G.; Qibing, Z. Predicting bruise susceptibility in apples using Vis/SWNIR technique combined with ensemble learning. Int. J. Agric. Biol. Eng. 2017, 10, 144–153. [Google Scholar] [CrossRef]
  20. Gao, Z.; Shao, Y.; Xuan, G.; Wang, Y.; Liu, Y.; Han, X. Real-time hyperspectral imaging for the in-field estimation of strawberry ripeness with deep learning. Artif. Intell. Agric. 2020, 4, 31–38. [Google Scholar] [CrossRef]
  21. Su, Z.; Zhang, C.; Yan, T.; Zhu, J.; Zeng, Y.; Lu, X.; Gao, P.; Feng, L.; He, L.; Fan, L. Application of hyperspectral imaging for maturity and soluble solids content determination of strawberry with deep learning approaches. Front. Plant Sci. 2021, 12, 1897. [Google Scholar] [CrossRef]
  22. Gulzar, Y. Fruit image classification model based on MobileNetV2 with deep transfer learning technique. Sustainability 2023, 15, 1906. [Google Scholar] [CrossRef]
  23. Fazari, A.; Pellicer-Valero, O.J.; Gómez-Sanchıs, J.; Bernardi, B.; Cubero, S.; Benalia, S.; Zimbalatti, G.; Blasco, J. Application of deep convolutional neural networks for the detection of anthracnose in olives using VIS/NIR hyperspectral images. Comput. Electron. Agric. 2021, 187, 106252. [Google Scholar] [CrossRef]
  24. Jiang, H.; Zhang, C.; Liu, F.; Zhu, H.; He, Y. Identification of strawberry ripeness based on multispectral indexes extracted from hyperspectral images. Guang Pu Xue Yu Guang Pu Fen Xi = Guang Pu 2016, 36, 1423–1427. [Google Scholar] [PubMed]
  25. Weng, S.; Yu, S.; Dong, R.; Pan, F.; Liang, D. Nondestructive detection of storage time of strawberries using visible/near-infrared hyperspectral imaging. Int. J. Food Prop. 2020, 23, 269–281. [Google Scholar] [CrossRef]
  26. Jiang, J.; Liu, F.; Xu, Y.; Huang, H. Multi-spectral RGB-NIR image classification using double-channel CNN. IEEE Access 2019, 7, 20607–20613. [Google Scholar] [CrossRef]
  27. Sarvaiya, J.N.; Patnaik, S.; Bombaywala, S. Image registration by template matching using normalized cross-correlation. In Proceedings of the IEEE 2009 International Conference on Advances in Computing, Control, and Telecommunication Technologies, Bangalore, India, 28–29 December 2009; pp. 819–822. [Google Scholar]
  28. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
  29. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  30. Aggarwal, S.; Gupta, S.; Gupta, D.; Gulzar, Y.; Juneja, S.; Alwan, A.A.; Nauman, A. An artificial intelligence-based stacked ensemble approach for prediction of protein subcellular localization in confocal microscopy images. Sustainability 2023, 15, 1695. [Google Scholar] [CrossRef]
  31. Verdú, S.; Barat, J.M.; Grau, R. Laser scattering imaging combined with CNNs to model the textural variability in a vegetable food tissue. J. Food Eng. 2023, 336, 111199. [Google Scholar] [CrossRef]
  32. Magabilin, M.C.V.; Fajardo, A.C.; Medina, R.P. Optimal Ripeness Classification of the Philippine Guyabano Fruit using Deep Learning. In Proceedings of the IEEE 2022 Second International Conference on Power, Control and Computing Technologies (ICPC2T), Raipur, India, 1–3 March 2022; pp. 1–5. [Google Scholar]
  33. Mamat, N.; Othman, M.; Abdulghafor, R.; Alwan, A.; Gulzar, Y.; Malaysia, U.; Sultan, J.; Petra, Y. Enhancing Image Annotation Technique of Fruit Classification Using a Deep Learning Approach. Sustainability 2023, 15, 901. [Google Scholar] [CrossRef]
  34. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  35. Li, S.; Kang, X.; Hu, J. Image fusion with guided filtering. IEEE Trans. Image Process. 2013, 22, 2864–2875. [Google Scholar]
  36. Li, H.; Wu, X.J. Multi-focus image fusion using dictionary learning and low-rank representation. In Proceedings of the International Conference on Image and Graphics, Shanghai, China, 13–15 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 675–686. [Google Scholar]
  37. Goutte, C.; Gaussier, E. A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In Proceedings of the European Conference on Information Retrieval, Santiago de Compostela, Spain, 21–23 March 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 345–359. [Google Scholar]
  38. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  39. Qu, G.; Zhang, D.; Yan, P. Information measure for performance of image fusion. Electron. Lett. 2002, 38, 313–315. [Google Scholar] [CrossRef]
  40. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the IEEE Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar]
  41. Li, H.; Wu, X.J.; Kittler, J. RFN-Nest: An end-to-end residual fusion network for infrared and visible images. Inf. Fusion 2021, 73, 72–86. [Google Scholar] [CrossRef]
  42. Li, H.; Wu, X.J.; Durrani, T.S. Infrared and visible image fusion with ResNet and zero-phase component analysis. Infrared Phys. Technol. 2019, 102, 103039. [Google Scholar] [CrossRef]
  43. Li, H.; Wu, X.J. DenseFuse: A fusion approach to infrared and visible images. IEEE Trans. Image Process. 2018, 28, 2614–2623. [Google Scholar] [CrossRef] [PubMed]
  44. Ma, J.; Yu, W.; Liang, P.; Li, C.; Jiang, J. FusionGAN: A generative adversarial network for infrared and visible image fusion. Inf. Fusion 2019, 48, 11–26. [Google Scholar] [CrossRef]
  45. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
  46. Han, J.; Bhanu, B. Fusion of color and infrared video for moving human detection. Pattern Recognit. 2007, 40, 1771–1784. [Google Scholar] [CrossRef]
  47. Kong, S.G.; Heo, J.; Abidi, B.R.; Paik, J.; Abidi, M.A. Recent advances in visual and infrared face recognition—A review. Comput. Vis. Image Underst. 2005, 97, 103–135. [Google Scholar] [CrossRef]
  48. Liu, H.; Sun, F. Fusion tracking in color and infrared images using joint sparse representation. Sci. China Inf. Sci. 2012, 55, 590–599. [Google Scholar] [CrossRef]
  49. Wang, Z.; Jin, L.; Wang, S.; Xu, H. Apple stem/calyx real-time recognition using YOLO-v5 algorithm for fruit automatic loading system. Postharvest Biol. Technol. 2022, 185, 111808. [Google Scholar] [CrossRef]
  50. Wang, D.; He, D. Channel pruned YOLO V5s-based deep learning approach for rapid and accurate apple fruitlet detection before fruit thinning. Biosyst. Eng. 2021, 210, 271–281. [Google Scholar] [CrossRef]
  51. Mirhaji, H.; Soleymani, M.; Asakereh, A.; Mehdizadeh, S.A. Fruit detection and load estimation of an orange orchard using the YOLO models through simple approaches in different imaging and illumination conditions. Comput. Electron. Agric. 2021, 191, 106533. [Google Scholar] [CrossRef]
  52. Yang, Y.; Liu, Z.; Huang, M.; Zhu, Q.; Zhao, X. Automatic detection of multi-type defects on potatoes using multispectral imaging combined with a deep learning model. J. Food Eng. 2023, 336, 111213. [Google Scholar] [CrossRef]
  53. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
  54. Shi, R.; Li, T.; Yamaguchi, Y. An attribution-based pruning method for real-time mango detection with YOLO network. Comput. Electron. Agric. 2020, 169, 105214. [Google Scholar] [CrossRef]
  55. Liu, Q.; Sun, K.; Peng, J.; Xing, M.; Pan, L.; Tu, K. Identification of bruise and fungi contamination in strawberries using hyperspectral imaging technology and multivariate analysis. Food Anal. Methods 2018, 11, 1518–1527. [Google Scholar] [CrossRef]
  56. Griesser, M.; Hoffmann, T.; Bellido, M.L.; Rosati, C.; Fink, B.; Kurtzer, R.; Aharoni, A.; Munoz-Blanco, J.; Schwab, W. Redirection of flavonoid biosynthesis through the down-regulation of an anthocyanidin glucosyltransferase in ripening strawberry fruit. Plant Physiol. 2008, 146, 1528–1539. [Google Scholar] [CrossRef] [PubMed]
  57. Ariana, D.P.; Lu, R.; Guyer, D.E. Near-infrared hyperspectral reflectance imaging for detection of bruises on pickling cucumbers. Comput. Electron. Agric. 2006, 53, 60–70. [Google Scholar] [CrossRef]
  58. Bennedsen, B.; Peterson, D. Performance of a system for apple surface defect identification in near-infrared images. Biosyst. Eng. 2005, 90, 419–431. [Google Scholar] [CrossRef]
Figure 1. Infrared–RGB imaging system.
Figure 2. The strawberry samples.
Figure 3. Strawberry samples with manual labels.
Figure 4. Structure of the used VGG-19 framework and the output feature maps.
Figure 5. Processing, extraction, and fusion flow of RGB and NIR high-frequency feature images.
Figure 6. Fusion results of the proposed method and compared methods.
Figure 7. Training and testing loss curves of the YOLOv3-tiny detection network.
Figure 8. Detection results of six randomly selected samples with the three methods.
Figure 9. Results of the three input methods on the quality situations and ripeness stage detection.
Figure 10. Results of different RGB/NIR weights on quality situations and ripeness stage detection.
Figure 11. Detection performance of different RGB/NIR weights on quality situations and ripeness stage detection.
Table 1. Detailed labels for each ripeness stage of the strawberry training set.

Types            Ripe    Half-Ripe    Unripe
Total            1034    1124         992
Contamination    224     278          253
Bruises          231     234          214
Both defects     180     269          203
Defect-free      399     343          322
Table 2. Detailed labels for each ripeness stage of the strawberry testing set.

Types            Ripe    Half-Ripe    Unripe
Total            255     283          245
Contamination    57      68           62
Bruises          56      56           52
Both defects     47      68           51
Defect-free      95      91           80
Table 3. Confusion matrix for the detection results.

Labels      Predicted    Evaluation Symbol
Positive    Positive     TP
Positive    Negative     FN
Negative    Positive     FP
Negative    Negative     TN
Table 4. The averaged values of the four quality metrics of the strawberry fused images.

Methods      MI           MS-SSIM     SD           N_abf
DenseFuse    12.95475     0.93145     65.12324     0.08453
FusionGAN    12.65822     0.77246     57.23540     0.06904
U2Fusion     13.55489     0.92492     64.01243     0.30145
Proposed     14.65096 *   0.93342 *   69.16351 *   0.05675 *
Note: The best results are denoted with *.
Table 5. Key hyperparameters of the YOLOv3-tiny detection network.

Parameter        Value     Parameter           Value
Batch size       64        Optimizer           Adam
Epochs           100       Regularization      L2 regularization
Learning rate    0.001     Input image size    416 × 416
Table 6. Detection results of the four quality situations and ripeness stages for strawberry samples in the testing set by YOLOv3-tiny based on fused images. Rows give the true labels; columns give the predicted labels.

True label    Predicted R          Predicted HR         Predicted U          P (%)    R (%)
              Con  B    BD   DF    Con  B    BD   DF    Con  B    BD   DF
R   Con       47   3    2    0     1    1    0    0     2    1    0    0     90.38    82.46
R   B         1    50   1    0     2    2    0    0     0    0    0    0     84.75    89.29
R   BD        1    1    44   0     0    0    1    0     0    0    0    0     86.27    93.62
R   DF        0    1    0    87    0    1    1    4     0    0    0    1     91.58    91.58
HR  Con       2    1    0    0     59   1    2    0     2    1    0    0     85.51    86.76
HR  B         0    2    1    0     2    46   2    0     1    1    1    0     77.97    82.14
HR  BD        1    1    2    0     2    2    56   0     0    1    3    0     86.15    82.35
HR  DF        0    0    0    6     0    1    0    79    1    0    0    4     85.87    86.81
U   Con       0    0    0    0     2    2    1    0     52   3    2    0     85.25    83.87
U   B         0    0    0    0     0    1    0    3     1    39   2    6     79.59    75.00
U   BD        0    0    0    0     1    2    2    0     2    3    41   0     83.67    80.39
U   DF        0    0    1    2     0    0    0    6     0    0    0    71    86.59    88.75
Table 7. The values of P and R for the four quality situations and ripeness stages in the strawberry sample testing set by YOLOv3-tiny based on the RGB-only and NIR-only methods.

                         RGB-Only            NIR-Only
                         P (%)    R (%)      P (%)    R (%)
R    Con                 79.59    68.42      86.00    75.44
R    B                   76.36    75.00      83.93    83.93
R    BD                  70.91    82.98      78.43    85.11
R    DF                  88.17    86.32      90.22    87.37
R    Ripeness stage      91.67    90.59      89.56    87.45
HR   Con                 78.13    73.53      80.30    77.94
HR   B                   69.35    76.79      75.81    83.93
HR   BD                  74.32    80.88      80.30    77.94
HR   DF                  88.24    82.42      82.02    80.22
HR   Ripeness stage      87.37    87.99      83.75    83.75
U    Con                 80.33    79.03      82.26    82.26
U    B                   75.00    69.23      74.14    82.69
U    BD                  75.93    80.39      82.69    84.31
U    DF                  86.75    90.00      87.34    86.25
U    Ripeness stage      91.87    92.24      86.85    88.98
Ripe91.8792.24 86.8588.98