1. Introduction
The agriculture industry grows annually in several ways to meet human needs. For example, according to the Mexican government agency Secretaría de Agricultura, Ganadería, Desarrollo Rural, Pesca y Alimentación (SAGARPA), berry crops (blackberry, raspberry, blueberry, cranberry, strawberry) across the Mexican territory reported annual increases close to 22% between 2003 and 2016 [1], reaching 390,239 tons produced by the end of 2017. The harvesting tasks required to collect such large production volumes are mainly based on manual labor, so the workforce must grow in proportion to the industry for production to remain sustainable [2]. However, contrary to this requirement, the labor force arriving in this industrial sector exhibits a decreasing trend, causing significant losses of up to 12% of the berries produced, mainly because the shelf life of harvested berries is shorter than that of other types of fruit [3,4]. This is not an isolated problem exclusive to berry production in Mexico, but a continuing trend around the world that was exacerbated by the COVID-19 travel restrictions [5,6,7,8]. Monetary losses due to spoiled, unharvested fruit are a solid motivation to implement innovative processes in the agriculture industry. Among the enabling technologies of precision and smart agriculture, robotic systems are considered one of the most promising approaches to address labor shortages in seeding, fertilizing, and harvesting tasks [9,10,11]. Robotic systems for agriculture can be implemented with unmanned aerial vehicles, unmanned ground vehicles, manipulator arms, or as subsystems used in harvesting [12]. Such subsystems involve specific developments in the gripper, the vehicle, control strategies, navigation systems, and artificial vision systems.
One of the most significant issues in automated harvesting lies in the artificial vision systems, since the predominant problem is the variable illumination conditions that affect the detection and identification of fruits and their maturity stages. Illumination problems affect the harvest success rate because the robot must be able to locate fruit in challenging environments [13,14]. Furthermore, maturity identification is severely affected, as berries do not ripen during the storage and transportation periods after picking, which can cause the delivery of unripe fruit [15]. In fact, the detection and classification of fruits in underexposed images and low-light conditions is a challenge that is not commonly addressed in the literature; most works assume normal conditions in greenhouses and orchards without controlled light [16,17,18].
However, the enhancement of low-light conditions, underexposed images, and visibility problems such as fog is more commonly addressed in image processing, independently of the application domain, where common approaches are image fusion and image enhancement [19,20,21]. Algorithms that enhance images, whether or not they fuse information from different sources, find applications in a wide range of fields, including fruit ripeness classification, surveillance, aerial photography for vegetation analysis, autonomous navigation, and biomedical applications, among others [22,23,24]. For example, focusing on low-illumination changes, Dong et al. proposed an algorithm to improve low illumination in videos based on a modification of a dehazing algorithm [20], and Hau Ngo et al. used a nonlinear expansion function to fuse Visible Spectrum (VIS) and Near InfraRed (NIR) images, obtaining a more intense image in dark zones while reducing the light intensity in overexposed regions [25]. On the other hand, current work that fuses information from different sources aims to extract information such as texture, edges, or light quantity to improve the details of the VIS image, either by adding these characteristics or by combining several filters such as edge-preserving, bilateral, and disparity filters. Bennett et al. employed a linear mapping of luminance and features extracted from the IR spectrum to reduce image noise, improve sharpness, and improve edge definition while preserving the original illumination level of the VIS image [26]; however, the algorithm produces color changes after processing, which leads to another challenge: color reconstruction. Vanmali et al. 2015 used a light transmission model to recover light scattered from the NIR [19], while Vanmali et al. 2017 used a Laplacian–Gaussian pyramid filter and multiresolution fusion to improve image visibility [27]. The main drawback of these works is the oversaturation of colors, which generates fused images with unreal colors. Herrera et al. 2019 and Herrera et al. 2021 addressed this problem by fusing the information in the luminance space and adding only the information extracted from morphological operations, such as the top-hat transform [21,28]. Furthermore, although some works combine the benefits of both approaches (Mohamed et al. 2019), long processing times limit their use in real-time applications [29].
Therefore, in this work, a fusion method is proposed that adds vegetation-related information to the image, complementing the visual information with that of the NIR band to improve the contrast between the background and the blackberries. The process uses the Normalized Difference Vegetation Index (NDVI) to enhance objects in the scene, adaptively focusing on fruits and vegetation; a minimal sketch of the standard NDVI computation is given below. As a result, the method produces an enhanced image with more contrast between regions of low-contrast objects, resulting in a more effective classification between ripe and unripe blackberries under uncontrolled light conditions. The results obtained with the proposed method include a reduction in processing times compared to the methods in the literature and improvements in contrast and colorfulness while maintaining saturation in low-exposure images.
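The sketch assumes registered single-channel NIR and red images scaled to [0, 1]; it shows only the standard index definition, not the fusion pipeline itself, whose mapping of the index into fusion weights is described in Section 2. Python is used for illustration throughout, although the experiments were run in MATLAB.

```python
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Normalized Difference Vegetation Index, in [-1, 1].

    nir and red are registered single-channel images scaled to [0, 1];
    vegetation yields high values, while soil and background yield low ones.
    The eps term avoids division by zero in dark regions.
    """
    nir = nir.astype(np.float64)
    red = red.astype(np.float64)
    return (nir - red) / (nir + red + eps)
```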
The remainder of the paper is organized as follows. Section 2 presents the materials and methods considered in this work, including the theoretical basis required for the evaluation of fusion methods and image classification models, as well as the proposal to deal with blackberry classification in uncontrolled light environments. Section 3 explains the experimental setup in detail and presents and analyzes the corresponding results. Finally, Section 4 presents the conclusions and future work.
3. Experiments and Results
The images for the validation of the proposed algorithm were acquired in a blackberry greenhouse in Ciudad Guzmán, the county seat of Zapotlán el Grande, in the state of Jalisco, Mexico. Weather conditions such as temperature and sunlight were uncontrolled factors. However, to obtain different sunlight conditions, the images were taken between 11:00–13:00 and 17:00–19:00, capturing normal- and low-light scenes and thus allowing evaluation of the proposed method under different illumination circumstances. Although the pictures were taken in a greenhouse facility, there was no artificial illumination and the sunlight intensity was not measured; the crops were not completely covered, since there was only a plastic ceiling.
Several cameras and frameworks were tested to determine their performance and the feasibility of implementation according to the scope of the work. After this analysis, we selected the Intel® RealSense™ Model D435f depth camera (Intel, Mountain View, CA, USA) for image acquisition. Its features include one RGB sensor and two Infra Red (IR) sensors with an IR pass filter that improves depth quality by increasing the relative strength of the textured IR projector pattern. In addition, the filter transmits NIR light and absorbs visible light, which helps alleviate visible reflections and allows the camera to work under sunlight. A total of 370 registered image pairs were taken, 185 in the VIS and 185 in the IR spectrum. A tripod with a height of 135 cm was used to control the height and shot angles (67.5°, 90°, and 112.5°) used to visualize the fruits. Camera-to-plant distances ranged from 20 cm to 60 cm, with the same shot angles applied horizontally.
The images were processed on the MATLAB 2023a (The MathWorks, Inc., Natick, MA, USA) platform, due to the availability of state-of-the-art methods, on a computer with a Ryzen 5 CPU at 3.6 GHz, 32 GB of RAM, and an RTX 2060 Xtreme with 6 GB of GDDR6 RAM and a core clock of 1845 MHz. The fusion methods were run with default parameters, and the values of the two parameters of the proposed method were selected by qualitative analysis to improve visualization while maintaining image quality. The first parameter enhances the vegetation without significantly modifying the background, improving the contrast between the leaves and the fruits; however, for values near 1, the edges blur. The second parameter lightens the image, allowing visualization of dark areas without blurring, but it may also overexpose the brightened areas.
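As a reference for the acquisition setup, the following is a minimal sketch of capturing one registered VIS/IR pair from the D435f with the pyrealsense2 SDK; the stream resolution and frame rate are assumptions, since the image size used in this work is not reported here.

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
# Resolution and frame rate are assumed values for illustration.
config.enable_stream(rs.stream.color, 1280, 720, rs.format.bgr8, 30)
config.enable_stream(rs.stream.infrared, 1, 1280, 720, rs.format.y8, 30)
pipeline.start(config)
try:
    frames = pipeline.wait_for_frames()
    vis = np.asanyarray(frames.get_color_frame().get_data())     # BGR color image
    ir = np.asanyarray(frames.get_infrared_frame(1).get_data())  # 8-bit IR image
finally:
    pipeline.stop()
```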
The contrast, entropy, colorfulness, and saturation were calculated for all images in the dataset to compare the proposed and reference methods. The comparison between the fusion methods and the proposed method is shown for representative images in Figure 5. The main objective of the fusion and enhancement method is to increase the difference between the berries and the background by combining information from three sources: the original image, an enhanced version, and information from the NIR band. Fusing the three sources while adapting the weights according to the atmospheric light improves the contrast of low-illumination images; in other words, the difference between regions in the image increases, which translates into a better visualization of the objects in this kind of image. Thus, an increase in contrast is expected while entropy is maintained, since the proposed method increases the differences among regions without focusing on details, to maintain a low complexity. Comparisons among the methods were made using the Tukey multiple comparison test. The measured contrast values shown in Figure 6 confirm the hypothesis that contrast increased significantly, not only compared to the original image (difference = 2.083, T-value = −19.96, p-value = 0.00), but also compared to [19] (difference = 4.56, T-value = 0, p-value = 0) and [21] (difference = 3.12, T-value = 9.94, p-value = 0.00), the most significant improvements. This increase in contrast provides evidence that the visual improvement may translate into better discrimination between regions and elements in the image, given the increased distance between pixel values. On the other hand, the visual comparison presented in Figure 5 shows that the differences can, on the contrary, be perceived as a loss of shadows between vegetation regions: [19,21,27] obtain a better perception of the colors, but lose the shadows between the bright and dark regions of the vegetation. The proposed method obtained a slightly reduced mean in the entropy measure (difference = 0.122, T-value = 4.37, p-value = 0.00). Note that the proposed method fuses the information from the NIR while maintaining the relationship between the red, green, and blue channels, without focusing on details as the fusion method of [28] does; [28] extracts the details with the top-hat transform before the fusion step, which could be incorporated in an extension of the method. In contrast, maintaining the relationship between channels avoided a stronger entropy reduction such as that of [21]: although the luminance of the image increased, fusion in [21] is performed more straightforwardly, which can be seen as a greater difference in the means with respect to the original images (difference = 0.283, T-value = 10.06, p-value = 0.00). Unlike other fusion strategies, the proposed method does not include information on the ridges, edges, and corrugated areas of the NIR band; for example, the greenhouse tarpaulin appears more corrugated in [21,28], whereas with the proposed method it remains the same as in the original image, because this region is not enhanced.
However, most of the compared methods significantly modify color, producing images with colors different from the original [19,21,27], while the proposed method changes the color of the images consistently, as can be distinguished perceptually in the comparison presented in Figure 5. The proposed method significantly increased the colorfulness (difference = 6.66, T-value = 10.09, p-value = 0.00), but, unlike the reference methods, the saturation remains the same as in the original images (difference = 0.0147, T-value = 1.46, p-value = 0.766). In [19,21,27], the change in these two metrics (see Figure 7) can make unripe blackberries look unnatural.
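The theoretical basis for the evaluation metrics is given in Section 2; the sketch below uses common definitions from the literature (RMS contrast, gray-level histogram entropy, Hasler–Süsstrunk colorfulness, and mean HSV saturation), which may differ in detail from the formulations used in this work.

```python
import cv2
import numpy as np

def image_metrics(img_bgr: np.ndarray) -> dict:
    """Four quality metrics for an 8-bit BGR image (common definitions)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)

    # RMS contrast: standard deviation of the gray-level intensities.
    contrast = gray.std()

    # Shannon entropy of the gray-level histogram.
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))

    # Colorfulness (Hasler & Suesstrunk, 2003).
    b, g, r = cv2.split(img_bgr.astype(np.float64))
    rg, yb = r - g, 0.5 * (r + g) - b
    colorfulness = (np.sqrt(rg.std() ** 2 + yb.std() ** 2)
                    + 0.3 * np.sqrt(rg.mean() ** 2 + yb.mean() ** 2))

    # Mean saturation from the HSV representation, scaled to [0, 1].
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    saturation = hsv[..., 1].mean() / 255.0

    return {"contrast": contrast, "entropy": entropy,
            "colorfulness": colorfulness, "saturation": saturation}
```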
The results show that the proposed method can obtain images with greater contrast between regions. These more colorful images can improve the visual perception of elements in the image without oversaturating it or producing unnatural colors. This is because the proposed method controls these aspects through its two parameters and the transmission map: when the image is more colorful and brighter, the transmission map drives the fusion to take more information from the original image and less from the NIR, and vice versa, in an adaptive process that regulates the fused contribution depending on the light in the scene, as illustrated in the sketch below.
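The following is a schematic illustration of this adaptive behavior only, not the exact formulation of the proposed method: a per-pixel transmission map t in [0, 1] (high in bright, well-exposed regions) weights the original image against the enhanced and NIR-derived information, and the equal split between the two auxiliary sources is a placeholder assumption.

```python
import numpy as np

def adaptive_fusion(vis, enhanced, nir_info, t):
    """Illustrative convex blend controlled by a transmission map.

    vis, enhanced, nir_info: HxWx3 float arrays in [0, 1].
    t: HxW transmission map in [0, 1]; bright scenes keep more of the
    original image, dark scenes take more enhanced/NIR information.
    """
    t = np.clip(t, 0.0, 1.0)[..., None]     # broadcast over color channels
    aux = 0.5 * enhanced + 0.5 * nir_info   # placeholder 50/50 split
    return t * vis + (1.0 - t) * aux
```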
In addition to image quality, experiments were performed to understand how the fusion strategies can help in a classification task among the environment and unripe and ripe blackberries. The results obtained with the original images and with the fusion methods were compared to determine the effect of each strategy on the classification rates. We expected improved classification rates in low-light images, because the proposed method increases contrast and colorfulness, which may help a classification method under challenging light, such as underexposed images or insufficient ambient light.
The blackberries in the images were annotated with bounding boxes (a square shape around each fruit) and a label assigning a maturity stage to the fruit: unripe, nearly ripe, or ripe. The unripe level was assigned to all blackberries with a green or orange color; the nearly ripe stage to fruits with a predominantly dark color but with orange patches; and the ripe stage only to dark fruits without orange spots. The square patches of the fruits were then cropped, resized to 224 × 224 pixels, and grouped according to the method used to process them and their maturity stage, producing a dataset of 2203 images, depicted in Table 1. The dataset was balanced at validation to maintain the same class proportions in the training and validation sets. Additionally, a data augmentation process was applied to obtain more variability: a random reflection, a random rotation between −180 and 180 degrees, a random translation on the X and Y axes between −15 and 15 px, and a random shear between −10 and 10 px. Data augmentation was applied to the original images before each training iteration, so that the probability of training with the same image decreases. The images were also labeled according to the overall illumination of the scene: the mean luminance of each image in the CIE L*a*b* color space was calculated, and a threshold of 40 on the mean of the L* channel divided the images with low (underexposed) illumination from those with normal illumination. Examples of low- and normal-illumination images before and after processing with the fusion methods are shown in Figure 5.
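A minimal sketch of this labeling step follows, assuming the threshold of 40 is expressed on the standard 0–100 L* scale (OpenCV stores L* rescaled to 0–255 for 8-bit images).

```python
import cv2
import numpy as np

L_THRESHOLD = 40.0  # mean L* below this -> "low illumination"

def illumination_label(img_bgr: np.ndarray) -> str:
    """Label an 8-bit BGR image as low- or normal-illumination
    from the mean luminance of the CIE L*a*b* L channel."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    mean_l = lab[..., 0].mean() * (100.0 / 255.0)  # rescale to 0-100
    return "low" if mean_l < L_THRESHOLD else "normal"
```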
Object detection in computer vision is a task that involves locating each object of interest in an image, in our case blackberries. The goal of a detection model is to find the boundaries of each object by determining a bounding box around it, which can be defined as (x, y, w, h), where x and y are the center of the bounding box and w and h are its width and height, respectively. A detection pipeline typically takes an image as input, extracts features that allow the model to identify where an object is, proposes possible objects, and finally classifies the proposals into predefined categories or classes, for example, background, ripe blackberry, or unripe blackberry. Experiments were conducted with the classification stage alone, to isolate it from the influence of the feature extraction and region proposal stages of a detection model. In this manner, the analysis can help to understand whether the fusion methods are useful not only in harvesting tasks, but also in post-harvest processes such as classification before sending the fruits to storage or stores. In addition, an experiment was performed to understand whether the fusion methods help to discriminate between background and fruits (the proposal stage) by training and evaluating the classification models with two classes, fruit and no fruit, independently of ripeness stage. K-fold validation was used for repeatability to validate an SVM model with default parameters for multiclass classification, with features extracted from a ResNet50 model pre-trained on ImageNet with frozen weights; freezing the weights avoids introducing variability from the training and feature extraction steps, allowing the effect of the fusion methods to be determined with more confidence, taking the original image set as a baseline (a sketch of this pipeline is shown below). The database was divided into fruit and no fruit, in which the unripe, nearly ripe, and ripe blackberries were grouped as the positive class and the background images without fruits as the negative one. This experiment determines the ability of a classification model to discriminate blackberries, independently of their ripeness, from other vegetation such as leaves, stems, and flowers, which is a difficult task in low-light conditions or when the exposure time of the camera is not appropriately selected. The results suggest that a classification model trained with images processed with the proposed method performed better for both types of illumination: low and normal.
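A sketch of this evaluation pipeline, under stated assumptions, follows: the number of folds and the data-loading step are placeholders, since K is not preserved in the text, and the torchvision/scikit-learn stack stands in for the MATLAB implementation actually used.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Frozen ResNet50 pre-trained on ImageNet, used only as a feature extractor.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()  # drop the head -> 2048-D feature vectors
resnet.eval()
for p in resnet.parameters():
    p.requires_grad = False

preprocess = T.Compose([
    T.ToTensor(),  # patches are already 224 x 224, per the annotation step
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_features(patches):
    """patches: list of 224x224 RGB uint8 arrays (cropped fruit patches)."""
    batch = torch.stack([preprocess(p) for p in patches])
    return resnet(batch).numpy()

# Hypothetical usage; load_patches is a placeholder for the annotated dataset.
# patches, labels = load_patches(...)      # labels: 1 = fruit, 0 = no fruit
# feats = extract_features(patches)
# scores = cross_val_score(SVC(), feats, labels, cv=5, scoring="f1")  # K assumed
```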
The F1 scores achieved by the classification model for the two types of illumination are reported in Table 2. On the contrary, the method with which the classifier model had the lowest performance was Herrera et al. 2019 [21]. The proposed method represents an improvement of up to 18% with respect to the F1-measure obtained with untreated images as the training dataset under low-illumination conditions, and of 14% under normal illumination. It is important to note that the standard deviation obtained with the data of the proposed method was the lowest of all methods for both low and normal illumination; this is the main improvement of the model, yielding more robust classification rates independently of the light in the scene. On the other hand, the highest standard deviations for low and normal illumination were obtained by Vanmali et al. 2017 [27]. Then, to isolate the effect of the fusion methods on distinguishing unripe from ripe fruits, background images without fruit were discarded and only those with fruits were kept in this experiment, considering ripe fruits as the positive class and nearly ripe and unripe fruits as the negative class. The classification model again achieved the highest F1 scores using images processed with the proposed fusion method as input, for both the low- and normal-illumination subsets. In contrast, the model with the lowest F1 scores was that trained with Herrera et al. 2019 [21]. As in the first experiment, the proposed method proved to be the most robust to illumination circumstances, allowing the classifier to obtain the smallest standard deviation in both conditions. Instead, the method with the most variation in the low-illumination scenario was Herrera et al. 2021 [28], while under normal illumination it was Vanmali et al. 2015 [19].
The results obtained with the proposed method also imply a gain in both low and normal illumination compared to the model trained with unprocessed images. Another experiment then evaluated the overall performance of the classification model trained with the images processed by each fusion method, using the complete database and considering each illumination split for the analysis. It consists of four classes: background, unripe blackberry, nearly ripe blackberry, and ripe blackberry. The proposed method obtained a mean F1-score of 0.935 ± 0.060, the highest classification rate with the lowest dispersion, making it more robust to changes in illumination and improving the classification rate by 7.27% with respect to training the classifier without image fusion. The improvement in mean and dispersion may be due to the adaptive enhancement of the image, which depends on the estimated atmospheric light and on the vegetation index that controls the amount of information fused from the enhanced and NIR images. The F1 score was used to evaluate the classification models because it accounts for the correct classification of samples under unbalanced datasets, or when there is only one class of interest and several negative classes, i.e., when evaluating the ability of the model to distinguish unripe, nearly ripe, and ripe fruits from each other, or fruits from the background independently of ripeness stage. A Convolutional Neural Network (CNN) architecture was selected to assess the effectiveness of a computational model performing inference in a limited environment with low computing power and energy constraints. The selected architecture is a common, simple, and effective design for a typical image classification task (see Figure 8). It contains a series of convolutional and batch normalization layers with ReLU activation followed by max pooling; this series of layers enables the classifier to learn regularized, spatially invariant, hierarchical features from the data, which is essential in datasets with light variations, as in our case. The first layer is an input layer of size 224 × 224 × 3, covering the red, green, and blue channels of the image to be classified. The architecture then has three consecutive sets of convolutional layers with a filter size of 5 × 5 and max pooling layers with a filter size of 2 × 2. The convolutional layers have an increasing number of filters: eight for the first set, sixteen for the second, and thirty-two for the third. The activation function for the convolutional layers is ReLU, with stride 1 for the convolutions and stride 2 for the max pooling layers. A final convolutional layer with eight filters and a filter size of 5 × 5 produces 2048 features connected to a fully connected output layer with five neurons and a Softmax function. The CNN was trained with the stochastic gradient descent with momentum optimizer, 100 maximum epochs, a validation patience of 5, a validation frequency of every 10 iterations, and a mini-batch size of 32 images per iteration, with a hold-out validation strategy of 70% (1542 images) for training and 30% (661 images) for testing.
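A PyTorch sketch of this architecture follows; the padding and the placement of the final pooling are assumptions based on one plausible reading of the description (the adaptive pooling to 16 × 16 is chosen so that 16 × 16 × 8 = 2048 features, as stated), and the softmax is applied implicitly through the cross-entropy loss, as is idiomatic in PyTorch.

```python
import torch
import torch.nn as nn

class BerryCNN(nn.Module):
    """Conv-BatchNorm-ReLU-MaxPool stacks with 8/16/32 filters, a final
    8-filter convolution, and a 5-way output, as described in the text."""

    def __init__(self, num_classes: int = 5):
        super().__init__()

        def block(c_in: int, c_out: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=5, stride=1, padding=2),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2, stride=2),
            )

        self.features = nn.Sequential(
            block(3, 8),    # 224 -> 112
            block(8, 16),   # 112 -> 56
            block(16, 32),  # 56 -> 28
            nn.Conv2d(32, 8, kernel_size=5, stride=1, padding=2),
            nn.ReLU(inplace=True),
            nn.AdaptiveMaxPool2d((16, 16)),  # assumption: 16*16*8 = 2048
        )
        self.classifier = nn.Linear(2048, num_classes)  # softmax via loss fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.flatten(self.features(x), 1))

model = BerryCNN()
logits = model(torch.randn(1, 3, 224, 224))  # -> shape (1, 5)
```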
The results, depicted in Table 3, show the effectiveness of the fusion methods in improving classification rates under both light conditions, underexposed and normal. The proposed method consistently obtained the highest overall F1 measure among the fusion methods under low- and normal-illumination conditions. It achieved the highest F1 scores for fruit classification and ripeness assessment, with improvements of 2.23% and 5.72% in low and normal illumination, respectively, in comparison with the unprocessed images (0.941 ± 0.081 and 0.926 ± 0.021), suggesting that the method is effective in improving the quality of the input images for both the frozen ResNet50 and the trained CNN model. These results suggest that the fusion methods, and more specifically the proposed one, can enhance fruit classification rates in challenging conditions, which can reduce the amount of spoiled fruit and improve harvesting efficiency in this application.
Additionally, the images were processed to determine the processing time and the frames per second attainable in the MATLAB environment. Although these results are comparable among methods, note that the times could be improved with an implementation in a more efficient programming language; the obtained values are used as a reference for the comparison. The average execution time of the proposed method corresponds to at least 15 fps, compared to 2 fps for Herrera et al. 2019 [21] and Herrera et al. 2021 [28], 1 fps for Vanmali et al. 2017 [27], 1 fps for Vanmali et al. 2015 [19], and 0.06 fps for Mohamed et al. 2019 [29]. This frame rate means that the method is 7 times faster than Herrera et al. 2019 and Herrera et al. 2021 [21,28], 13 times faster than Vanmali et al. 2017 [27], and 249 times faster than Mohamed et al. 2019 [29]. The mean time for each method is presented in Table 4, along with the T-values and p-values of the Tukey multiple comparisons.
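For reproducibility, the sketch below shows how such pairwise Tukey comparisons can be computed with statsmodels; the data are synthetic placeholders, and the significance level of 0.05 is an assumption, since the value used in this work was not preserved in the text.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic placeholder data: one measurement per image (e.g., execution
# time in seconds) and the method that produced it.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.45, 0.05, 185),   # e.g., a reference method
                         rng.normal(0.06, 0.01, 185)])  # e.g., the proposed method
groups = np.array(["reference"] * 185 + ["proposed"] * 185)

result = pairwise_tukeyhsd(endog=values, groups=groups, alpha=0.05)
print(result.summary())  # pairwise mean differences, adjusted p-values, CIs
```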