Article

A Novel Two-Stage Approach for Automatic Extraction and Multi-View Generation of Litchis

1 College of Electronic Engineering (College of Artificial Intelligence), South China Agricultural University, Guangzhou 510642, China
2 National Center for International Collaboration Research on Precision Agricultural Aviation Pesticide Spraying Technology, Guangzhou 510642, China
3 Center for International Cooperation and Disciplinary Innovation of Precision Agricultural Aviation Applied Technology (‘111 Center’), Guangzhou 510642, China
4 Guangdong Laboratory for Lingnan Modern Agriculture, Guangzhou 510642, China
5 National Key Laboratory of Green Pesticide, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(7), 1046; https://doi.org/10.3390/agriculture14071046
Submission received: 28 May 2024 / Revised: 25 June 2024 / Accepted: 26 June 2024 / Published: 29 June 2024
(This article belongs to the Section Digital Agriculture)

Abstract

Obtaining consistent multi-view images of litchis is crucial for various litchi-related studies, such as data augmentation and 3D reconstruction. This paper proposes a two-stage model that integrates the Mask2Former semantic segmentation network with the Wonder3D multi-view generation network. This integration aims to accurately segment and extract litchis from complex backgrounds and generate consistent multi-view images of previously unseen litchis. In the first stage, the Mask2Former model is utilized to predict litchi masks, enabling the extraction of litchis from complex backgrounds. To further enhance the accuracy of litchi branch extraction, we propose a novel method that combines the predicted masks with morphological operations and the HSV color space. This approach ensures accurate extraction of litchi branches even when the semantic segmentation model’s prediction accuracy is not high. In the second stage, the segmented and extracted litchi images are passed as input into the Wonder3D network to generate multi-view images of the litchis. After comparing different semantic segmentation and multi-view synthesis networks, the Mask2Former and Wonder3D networks demonstrated the best performance. The Mask2Former network achieved a mean Intersection over Union (mIoU) of 79.79% and a mean pixel accuracy (mPA) of 85.82%. The Wonder3D network achieved a peak signal-to-noise ratio (PSNR) of 18.89 dB, a structural similarity index (SSIM) of 0.8199, and a learned perceptual image patch similarity (LPIPS) of 0.114. Combining the Mask2Former model with the Wonder3D network resulted in an increase in PSNR and SSIM scores by 0.21 dB and 0.0121, respectively, and a decrease in LPIPS by 0.064 compared to using the Wonder3D model alone. Therefore, the proposed two-stage model effectively achieves automatic extraction and multi-view generation of litchis with high accuracy.

1. Introduction

The litchi (Litchi chinensis Sonn.) is a tropical to subtropical fruit cultivated extensively in over 20 countries worldwide. It is commonly consumed in fresh or processed forms and has emerged as one of the most favored fruits, owing to its delightful taste, enticing color, and high nutritional value [1]. In 2019, global litchi production reached approximately 4 million tons. China stands as the primary producer with the largest cultivation area and output worldwide, contributing over 50% of the global production and planting area on average annually [2]. With the rapid advancement of computer vision and artificial intelligence technologies, an increasing number of researchers are turning their attention to the automation and intelligent analysis of litchi. In order to address the challenges encountered by existing object detection algorithms in identifying small-sized and densely distributed litchi fruits due to complex and dynamic orchard environments, Ref. [3] proposed an optimized YOLOv7-Litchi detection algorithm. This algorithm incorporates the CNeB module from ConvNeXt into the backbone network, reducing information loss through a reverse bottleneck structure and larger convolutional kernel sizes, thereby amplifying global interaction representation and enhancing network performance for improved detection of litchi fruits in complex environments. In order to achieve accurate detection of litchi fruits in natural environments, Ref. [4] proposed a method for mature litchi recognition based on red-green-blue depth (RGB-D) cameras, which can be utilized for estimating litchi fruit yield. In order to advance the development of automatic harvesting technology and improve the efficiency and accuracy of litchi picking, Ref. [5] addressed the issue of inaccurate litchi segmentation under natural conditions by proposing an improved DeepLabv3+-based method for litchi image segmentation. This method replaces the backbone network of DeepLabv3+ with dilated residual networks to enhance the model’s feature extraction capabilities. Additionally, it combines cross-entropy loss and Dice coefficient loss as the loss functions to focus the model more on the litchi branch regions. This approach aligns with techniques used in other domains, such as lagoon water quality monitoring based on digital image analysis and machine learning estimators, demonstrating the versatility and effectiveness of machine learning in various fields [6]. Ref. [7] proposed a deep learning-based instance segmentation method for litchi tree crown segmentation from images captured by unmanned aerial vehicles (UAVs), which is of significant importance for precise orchard management. In the field of 3D reconstruction, research on litchis is relatively limited. However, in practice, we can accurately reconstruct the 3D model of litchis by fusing multi-view images. This approach not only helps us to better understand the morphological structure of litchis but also provides strong scientific evidence for variety improvement and cultivation management, further promoting the development of the litchi industry. Similar techniques have been applied in other contexts, such as real-time pattern-recognition of GPR images with YOLO v3 implemented by Tensorflow and detecting the material of hard objects buried in tillage soil using FDTD models [8,9]. All the aforementioned studies on litchis rely on an essential aspect—the litchi image dataset. 
In object detection and image segmentation, a large number of litchi images are needed to train models to improve their generalization ability and robustness. In 3D reconstruction, acquiring multi-view images is crucial. Traditional methods for obtaining multi-view images often rely on expensive equipment and complex operations, leading to inefficiency and high costs. Therefore, there is a need for methods to expand the dataset.
In agricultural image analysis, to better accomplish various recognition tasks such as image classification, image segmentation, object detection, and localization, large-scale and high-quality datasets can significantly enhance model performance. However, obtaining large-scale and high-quality datasets often requires a considerable amount of human resources. Image augmentation plays a crucial role in improving model performance. In addition to traditional data augmentation techniques, in 2014, a generative adversarial network (GAN) was first proposed in the field of computer vision [10]. GAN, as one of the most important research approaches in artificial intelligence, has garnered widespread attention for its remarkable data generation capabilities [11]. When exploring the feasibility of automating litchi defect surface detection, Ref. [12] employed a generative adversarial network (GAN) based on transformer architecture as a data augmentation strategy. This approach effectively augmented the original training set, providing a more diverse range of samples, thus successfully addressing the issue of imbalanced distribution within the original dataset. Ref. [13] proposed a spectral sample augmentation technique based on a K-conditioned boundary equilibrium generative adversarial network (KC-BEGAN). This approach addresses the challenge of obtaining a sufficient number of spectral samples suitable for deep learning due to environmental constraints, equipment limitations, and labor costs. In recent years, with the introduction of denoising diffusion probabilistic models (DDPM) in image generation tasks [14], the diffusion models have once again garnered widespread attention from researchers. Many research approaches based on diffusion models have demonstrated better performance in computer vision compared to generative adversarial networks (GANs). The paper by OpenAI [15] also provides evidence of diffusion models surpassing GANs. Ref. [16] conducted novel research on the efficacy of diffusion models in generating weed images to enhance weed recognition. Experiments on two large-scale public multi-class weed datasets indicate that, compared to GANs (BigGAN, StyleGAN2, StyleGAN3), diffusion models strike the best balance between sample fidelity and diversity, achieving the highest Fréchet inception distance. Through the integration of artificially generated weed images produced by Stable Diffusion technology with advanced convolutional neural network (CNN) models, Ref. [17] significantly improved the accuracy of weed detection and classification. This approach overcomes the limitations of weed recognition systems caused by the scarcity of real image data. This approach not only enhances the efficiency of weed management but also provides robust technological support for the development of automated weed management systems. With the continuous advancement of technology and the enhancement of data processing capabilities, diffusion models have evolved into large-scale diffusion models, capable of handling massive amounts of data, thus further improving the quality and efficiency of generative models. In recent years, the rise of deep learning technologies has significantly altered the landscape of 3D reconstruction. These advanced techniques bring an unprecedented combination of efficiency and precision to model creation. 
By integrating the core principles of multi-view 3D reconstruction with the exceptional capabilities of deep learning, a new frontier has been opened, allowing for the comprehensive and precise reconstruction of 3D scenes by fully leveraging the rich information contained in multi-view images [18]. Ref. [19] leveraged geometric prior knowledge learned by large-scale diffusion models to achieve the capability of altering object camera viewpoints solely based on a single RGB image. This addresses the challenge of performing novel viewpoint synthesis in under-constrained settings and enables the ability to conduct 3D reconstruction from a single image, significantly surpassing existing single-view 3D reconstruction and novel viewpoint synthesis models. Ref. [20] achieved the capability to accurately and comprehensively reconstruct a wide range of objects from a single RGB image by utilizing a series of visual language models and the Segment Anything object segmentation model. This addresses the challenges of diversity and complexity in 3D reconstruction in real-world scenarios, potentially contributing to the field of 3D reconstruction. Ref. [21] employed image-conditioned diffusion models and pre-trained 2D generative priors to generate high-quality, 3D-consistent multi-view images from a single view, addressing issues such as texture degradation and geometric misalignment. This approach enables the generation of high-quality, diverse 3D assets. Ref. [22] utilized a cross-domain diffusion model to efficiently generate high-fidelity textured meshes from single-view images, addressing issues of time-consuming optimization and geometric shape inconsistency in existing methods. This approach achieves high-quality, consistent, and efficient single-view reconstructions. In delving into the current state of research on single-view 3D reconstruction, it is evident that while traditional 3D reconstruction techniques, such as sensor scanning, have achieved significant advancements, they often rely on expensive equipment and complex operational procedures, resulting in relatively low efficiency. In contrast, 3D reconstruction methods based on multi-view generative networks bring innovative progress to this field due to their higher flexibility and efficiency. It is noteworthy that with the continuous development of large-model technology, large models often possess powerful prior knowledge and generalization capabilities. However, despite the significant achievements of multi-view generative networks, their application in the agricultural domain remains scarce. This study aims to fill this gap by applying advanced large-scale diffusion models to the litchi domain in agriculture, thereby promoting the progress of litchi-related research as a foundational study.
The aim of this study is to apply large-scale diffusion models in agriculture. Based on existing research, we propose a novel litchi view generation model that combines Mask2Former [23] and Wonder3D. This model can be used to enhance litchi datasets, facilitate the 3D reconstruction of litchis, assist in litchi harvesting, and support various litchi-related research endeavors, thereby providing substantial support for fundamental research on litchis.

2. Materials and Methods

2.1. Image Data Collection

The image dataset for this study was collected on 3 June 2021, at Litchi Expo in Conghua District, Guangzhou City, Guangdong Province, China (23°58′96″ N, 113°62′51″ E). The weather conditions were suitable with clear skies, making it an ideal period for capturing images of ripe and vibrant-colored litchi fruits. We employed smartphones as the image capture devices to ensure clarity and color fidelity. A total of 458 images of litchis were collected, each with a resolution of 1440 × 1080 pixels, aiming to preserve image details adequately for subsequent processing. The litchi image dataset collected in this session exhibits high diversity. In terms of complexity, it encompasses images of single litchi fruits (Figure 1a), depicting scenarios where litchi fruits hang independently on branches with relatively clear features of both fruits and branches, facilitating easier extraction by models. Additionally, it includes images of multiple litchi fruits (Figure 1b), presenting challenges in segmentation due to occlusions between fruits and branches. Furthermore, images of litchi clusters (Figure 1c) were captured, depicting branches laden with more than five fruits arranged in clusters. These images depict fruits growing in clusters, intertwined with each other, presenting the highest number of fruits and branches compared to single and multiple fruits, with complex backgrounds and a relatively distant shooting distance, posing greater challenges to semantic segmentation models. This categorization considers the arrangement and quantity of fruits, providing more diverse training data for models. Regarding lighting conditions, the images are categorized into sunny (Figure 1d), where litchi fruits directly exposed to sunlight exhibit vivid skin colors and higher glossiness; shaded (Figure 1e), where fruits in insufficient light appear relatively darker; and shadowed (Figure 1f), where fruits under shadows due to foliage or other objects exhibit lower saturation, transitioning from vibrant to deeper colors as they move from sunlight to complete shadow, presenting a challenge for semantic segmentation. This categorization considers the different manifestations of litchi fruits under varying lighting conditions, including front-lit, back-lit, and shaded situations. Such diversity aids in the adaptability and robustness of models under different lighting conditions. Furthermore, the dataset is categorized based on varieties into Guiwei (Figure 1g), characterized by oval or nearly spherical fruits with light red skins; Dahongpao (Figure 1h), featuring elongated oval or elliptical fruits with bright red skins; and Feizixiao (Figure 1i), exhibiting nearly circular or oval fruits with light red skins tinged with green. This variety enhances the ability of semantic segmentation models to identify and segment litchi fruits of different varieties. To further enhance the robustness of our model, we also captured images of some litchi fruits from multiple angles. The richness and representativeness of these diversities contribute to enhancing the robustness and generalization capabilities of semantic segmentation models trained on our litchi image dataset.

2.2. Data Annotation

Litchi images often contain numerous complex and irrelevant background features, which can interfere with the performance of the generation model. Therefore, we employ semantic segmentation networks to isolate the crucial parts of the litchis, using the LabelMe tool to annotate the data for semantic segmentation. LabelMe produces label files in JSON format, and each annotation is converted into a PNG label map for training using Python scripts. The specific label descriptions for semantic segmentation are shown in Table 1.
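As a concrete illustration of this conversion step, the following is a minimal sketch rather than the authors’ actual script; the class-to-value mapping and folder names are hypothetical, and the real class definitions are those listed in Table 1.

```python
import json
from pathlib import Path

from PIL import Image, ImageDraw

# Hypothetical class-to-pixel-value mapping (0 = background); the actual
# classes and their descriptions are those given in Table 1.
LABEL_MAP = {"litchi": 1, "branch": 2}

def labelme_json_to_png(json_path: Path) -> None:
    """Rasterize LabelMe polygon annotations into a single-channel PNG label map."""
    with open(json_path, "r", encoding="utf-8") as f:
        ann = json.load(f)

    w, h = ann["imageWidth"], ann["imageHeight"]
    mask = Image.new("L", (w, h), 0)  # start from an all-background label map
    draw = ImageDraw.Draw(mask)

    for shape in ann["shapes"]:
        value = LABEL_MAP.get(shape["label"])
        if value is None or shape.get("shape_type", "polygon") != "polygon":
            continue
        polygon = [tuple(pt) for pt in shape["points"]]
        draw.polygon(polygon, outline=value, fill=value)

    mask.save(json_path.with_suffix(".png"))

if __name__ == "__main__":
    for jp in Path("annotations").glob("*.json"):  # hypothetical folder layout
        labelme_json_to_png(jp)
```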

2.3. Data Augmentation

In this study, we conducted a thorough cleaning and selection process on the collected litchi images to remove low-quality ones, such as excessively blurry, severely overexposed, heavily occluded, or irrelevant images. Subsequently, 412 images were retained for the development of the semantic segmentation model. The total sample was divided into a training dataset and a validation dataset in the ratio of 8:2. To augment the training data and enhance the model’s learning accuracy and generalization ability in complex orchard environments, various transformations were applied to the input images during training, including random scaling, flipping, brightness adjustment, and color changes. Specifically, random flipping simulated litchis captured from different angles, while brightness and color augmentation simulated variations in light intensity and scenes. These augmentation techniques were applied to the training dataset in each training iteration to enrich the image data.

2.4. Semantic Segmentation Network Architecture

In this study, we introduced Mask2Former [23] for segmenting the key parts of litchis. Mask2Former is a highly capable image segmentation model built upon a transformer architecture, able to accomplish image segmentation tasks efficiently and accurately. The core idea of Mask2Former is to unify semantic segmentation, instance segmentation, and even panoptic segmentation into a single framework, loss, and training process, thereby enabling Mask2Former to handle various image segmentation tasks and enhancing the model’s versatility and flexibility. The network architecture is illustrated in Figure 2. Initially, the input image undergoes deep feature extraction via a backbone network, yielding four layers of features. These features are then passed to the pixel decoder, where the features at 1/32, 1/16, and 1/8 resolutions are fed into the transformer decoder as keys and values to compute cross-attention with query features. The output of each layer’s computation is combined recursively with the features at the next resolution. The pixel decoder repeats this process to gradually restore the feature resolution, eventually producing feature maps at 1/4 of the original image resolution. This operation empowers the model to excel in handling small objects. Mask2Former integrates a backbone network for feature extraction, a pixel decoder, and a transformer decoder. This architecture allows the model to fully utilize multi-scale features in the image, thereby enhancing segmentation accuracy. Whether dealing with large or small objects, Mask2Former effectively performs segmentation. In litchi image segmentation tasks, Mask2Former demonstrates significant advantages over other segmentation networks. Litchis and litchi branches often exhibit varying sizes and shapes in images. Mask2Former, through multi-scale feature extraction, accurately captures these variations, thereby achieving precise segmentation of litchis and litchi branches.
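For illustration, the sketch below shows one way to run semantic inference with a Mask2Former (Swin-S, ADE20K pre-trained) checkpoint through the Hugging Face transformers port. This toolchain is an assumption made for demonstration, not necessarily the framework used in this study, and such a checkpoint would still need to be fine-tuned on the litchi classes as described in Section 3.1.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# ADE20K-pretrained Swin-S checkpoint; in this study such a model serves as the
# starting point for transfer learning on the litchi dataset.
ckpt = "facebook/mask2former-swin-small-ade-semantic"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt).eval()

image = Image.open("litchi.jpg")                      # hypothetical input image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-pixel class map resized back to the original image resolution
seg_map = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
```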

2.5. Combining Semantic Segmentation with HSV Color Space for Extracting Litchi Branches

Semantic segmentation can isolate litchis and their supporting branches from complex environments. However, our experiments revealed that while semantic segmentation performs well in predicting litchi fruits, the intricate growth patterns and interwoven structures of the branches complicate their precise segmentation. This leads to suboptimal prediction accuracy for the supporting branches, resulting in incomplete extraction, which significantly affects the subsequent multi-view generation of litchis. To address this issue, we propose a method that combines the segmentation masks predicted by semantic segmentation with morphological dilation operations and the HSV color space to efficiently extract litchi-supporting branches.

2.5.1. Establishing the Litchi Branch Dataset

The HSV (Hue, Saturation, Value) color space is a color space model proposed by A.R. Smith in 1978 [24]. This color space has been widely utilized in image analysis research in the agricultural field. The effectiveness of this approach lies in the HSV color space’s ability to more intuitively reflect the color characteristics of images, particularly the hue information. This is crucial in many agricultural applications for distinguishing between different crops, soil conditions, or disease states. To establish the threshold range for litchi branches in the HSV color space, this study constructed a dataset comprising 100 images containing only litchi branches (without fruits or background), based on previously captured litchi data (Figure 3). This dataset facilitates the subsequent creation of HSV histograms for litchi branches. By observing the distribution of histograms, the HSV threshold range for litchi branches was determined.

2.5.2. Denoising Diffusion Probabilistic Models (DDPM)

The diffusion model first appeared in [25], laying important theoretical groundwork for subsequent research in generative models. As research progressed, the diffusion model garnered widespread attention and application, with denoising diffusion probabilistic models (DDPM) [14] emerging as a significant branch of the diffusion model, acclaimed for its integration of deep learning techniques. DDPM consists of two components: the diffusion process and the denoising process. In essence, the diffusion process of DDPM involves adding noise to input data, which, after T iterations, is assumed to conform to a known distribution (such as a Gaussian distribution). The denoising process, on the other hand, iteratively computes the posterior distribution through neural networks, gradually removing noise to recover images that adhere to the original input data distribution.
The diffusion process is defined as a Markov chain with transition probability $q$. Through $T$ iterations, Gaussian noise is gradually added to the input distribution. The specific process is as follows:
q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) \sim \mathcal{N}\left(x_t;\ \sqrt{\alpha_t}\, x_{t-1},\ (1 - \alpha_t) I\right)   (1)
The parameter $\alpha_t$, which lies within the range $(0,1)$, dictates the variance of the noise added at each iteration. Moreover, to ensure its boundedness as $t$ approaches infinity, $\alpha_t$ should gradually decrease over time. Here, $x_0$ represents the input distribution, while $x_t$ denotes the noisy distribution around $x_0$ at time $t$. Consequently, given an input $x_0$, the distribution at any moment $t$ of the diffusion process can be computed from Equation (1). Equation (2) gives the diffusion outcome at an arbitrary time $t$:
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, z, \quad z \sim \mathcal{N}(0, I), \qquad q(x_t \mid x_0) \sim \mathcal{N}\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\right), \quad t \in (0, T)   (2)
where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.
From Equation (2), it can be observed that the diffusion process does not involve unknown parameters, and the additive noise distribution at each moment can be directly computed from the available information.
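For example, the closed-form sampling of Equation (2) can be written in a few lines; the sketch below assumes a 1-D tensor alphas holding the per-step values of $\alpha_t$:

```python
import torch

def diffuse(x0: torch.Tensor, t: int, alphas: torch.Tensor) -> torch.Tensor:
    """Sample x_t directly from x_0 via Equation (2), without iterating over steps."""
    alpha_bar_t = torch.cumprod(alphas, dim=0)[t]             # \bar{alpha}_t
    z = torch.randn_like(x0)                                   # z ~ N(0, I)
    return torch.sqrt(alpha_bar_t) * x0 + torch.sqrt(1.0 - alpha_bar_t) * z
```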
The denoising process of DDPM begins with random noise conforming to a standard Gaussian distribution and gradually recovers the input distribution through iteration. Building upon the assumptions made in the diffusion process, the aim of the denoising process is to obtain the posterior distribution $p_\theta(x_{t-1} \mid x_t)$ through iterative solutions, eventually leading to $p_\theta(x_0 \mid x_1)$. Based on the Markov chain and diffusion assumptions, sampling from $x_T$ (assumed to follow standard Gaussian noise) allows the distribution of $x_{0:T}$ to be expressed as Equation (3).
p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p(x_T) \sim \mathcal{N}(0, I)   (3)
Through appropriate mathematical computations, it can be derived that given $x_0$ and $x_t$, the posterior distribution of $x_{t-1}$ satisfies $p_\theta(x_{t-1} \mid x_t, x_0) \sim \mathcal{N}\left(x_{t-1};\ \mu_\theta,\ \sigma_\theta^2 I\right)$. Leveraging the properties of Bayes’ theorem and Markov chains, Equations (4) and (5) give the mean and variance of the posterior distribution, respectively.
\mu_\theta = \frac{\sqrt{\bar{\alpha}_{t-1}}\,(1 - \alpha_t)}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t   (4)
\sigma_\theta^2 = \frac{(1 - \bar{\alpha}_{t-1})(1 - \alpha_t)}{1 - \bar{\alpha}_t}   (5)
From Equations (4) and (5), it is evident that the variance of this posterior distribution is known, while the mean involves the unknowns $x_0$ and $x_t$. However, from Equation (2), we have:
x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, z}{\sqrt{\bar{\alpha}_t}}   (6)
Substituting Equation (6) into Equation (4), we obtain:
\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, z_t\right)   (7)
Hence, $z_t$ can be predicted through a neural network. Subsequently, the mean of the posterior Gaussian distribution can be computed. By associating it with the known variance, the posterior distribution at each moment can be calculated. Ultimately, after iterating $T$ steps, the network outputs a pseudo-image that satisfies the input distribution.
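The denoising loop described by Equations (3)–(7) can be sketched as follows. This is a minimal illustration assuming a noise-prediction network eps_model (a hypothetical interface) already trained in the standard DDPM fashion; it is not the training code used for the litchi branch generator in Section 3.3.1.

```python
import torch

def ddpm_sample(eps_model, shape, alphas, device="cpu"):
    """Minimal DDPM ancestral sampling loop following Equations (3)-(7).

    eps_model: network predicting the noise z_t from (x_t, t)  -- hypothetical interface
    alphas:    1-D tensor of per-step alpha_t values, t = 1..T
    """
    alphas = alphas.to(device)
    alpha_bars = torch.cumprod(alphas, dim=0)              # \bar{alpha}_t
    T = alphas.shape[0]

    x = torch.randn(shape, device=device)                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        z_pred = eps_model(x, torch.full((shape[0],), t, device=device))

        alpha_t = alphas[t]
        alpha_bar_t = alpha_bars[t]
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0, device=device)

        # Posterior mean, Equation (7)
        mean = (x - (1 - alpha_t) / torch.sqrt(1 - alpha_bar_t) * z_pred) / torch.sqrt(alpha_t)
        # Posterior variance, Equation (5)
        var = (1 - alpha_bar_prev) * (1 - alpha_t) / (1 - alpha_bar_t)

        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(var) * noise                   # sample x_{t-1}
    return x                                                  # approximates the input data distribution
```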

2.5.3. Expanding the Litchi Branch Dataset Using DDPM

It is well known that generative adversarial networks (GANs) [10] are notoriously difficult to train, often facing challenges such as mode collapse and vanishing gradients. In comparison, diffusion models offer more stable training and the ability to generate a wider variety of samples [15]. In order to better analyze the distribution patterns of litchi branches in the HSV color space, this study employed DDPM to augment the existing litchi branch dataset (Figure 4).

2.5.4. Extracting Crucial Litchi Components Using Semantic Segmentation Results and the HSV Color Space

In order to accurately extract litchi fruits and their associated branches from cluttered outdoor environments, we first trained a semantic segmentation model. This model predicts masks for litchi fruits and branches in the images, which are then overlaid onto the original images to clearly delineate the regions to be extracted. During testing, we found a relatively high recognition rate for litchi fruits, but the recognition rate for litchi branches was generally lower, with frequent gaps in the predicted results. To address this issue, we employed a dilation operation to process the branch masks, filling in the gaps in the model predictions and making the branch masks more complete. However, dilation also introduced a potential problem: it could extend into small areas of the background that were not originally intended for extraction. To mitigate this, we divided the branch masks into two parts for processing. Firstly, we extracted the branches directly predicted by the semantic segmentation model, i.e., the original masks. Secondly, for the newly added areas after dilation, we applied a threshold based on the HSV color space to extract only those regions that matched the threshold, thus avoiding the inclusion of irrelevant background. Experimental validation showed that this approach not only maintained a high extraction efficiency but also significantly improved the accuracy of extracting litchi-bearing branches. This method is applicable not only to litchi fruit and branch extraction but also provides new insights into target segmentation and extraction in similar scenarios.

2.6. Multi-View Generation Network Architecture

With the advancement of computer vision and image processing technologies, the transition from single-view images to multi-view consistent image transformation has become a research focus. However, existing methods face challenges in achieving geometric and texture consistency. Wonder3D [22] utilizes cross-domain diffusion models and multi-view cross-domain attention mechanisms to efficiently generate multi-view consistent images from single-view images. Through this approach, Wonder3D ensures that the generated color images and normal maps maintain consistency in both geometry and appearance, providing a reliable foundation for subsequent 3D reconstruction. By combining domain switchers and geometry-aware normal fusion algorithms, Wonder3D further enhances the consistency and fidelity of the generated multi-view images. Experimental results demonstrate that the Wonder3D method outperforms existing methods significantly in terms of image generation quality, robustness, and efficiency, offering an innovative and effective solution for generating consistent multi-view images from single-view images (Figure 5).

2.7. Evaluation Metrics

Evaluation metrics are used to measure the performance of each trained model on the test set. For semantic segmentation, two metrics are employed: mean Intersection over Union (mIoU) and mean pixel accuracy (mPA). mIoU represents the average Intersection over Union score calculated when predicting annotated images for each class, as shown in Equation (8). mPA is used to compute the number of pixels correctly classified relative to the total number of pixels, as depicted in Equations (9) and (10).
mIoU = \frac{1}{k} \sum_{i=1}^{k} \frac{p_{ii}}{\sum_{j=1}^{k} p_{ij} + \sum_{j=1}^{k} p_{ji} - p_{ii}}   (8)
PA = \frac{\sum_{i=1}^{k} p_{ii}}{\sum_{i=1}^{k} \sum_{j=1}^{k} p_{ij}}   (9)
mPA = \frac{1}{k} \sum_{i=1}^{k} PA_i   (10)
where $p_{ij}$ is the number of class $i$ pixels predicted as class $j$, $p_{ii}$ is the number of class $i$ pixels predicted as class $i$, and $k$ is the number of semantic segmentation classes.
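For reference, Equations (8)–(10) can be computed from a class confusion matrix; the sketch below is a minimal numpy version assuming pred and gt are integer label maps of the same shape (segmentation toolboxes typically report these metrics directly).

```python
import numpy as np

def confusion_matrix(pred, gt, k):
    """k x k matrix whose entry (i, j) counts class-i pixels predicted as class j."""
    valid = (gt >= 0) & (gt < k)
    return np.bincount(k * gt[valid].astype(int) + pred[valid].astype(int),
                       minlength=k * k).reshape(k, k)

def miou_mpa(pred, gt, k):
    cm = confusion_matrix(pred, gt, k)
    tp = np.diag(cm)                                    # p_ii
    iou = tp / (cm.sum(axis=1) + cm.sum(axis=0) - tp)   # per-class IoU, Equation (8)
    pa = tp / cm.sum(axis=1)                            # per-class pixel accuracy PA_i
    # Classes absent from gt produce NaN and are ignored by nanmean.
    return np.nanmean(iou), np.nanmean(pa)              # mIoU (8) and mPA (10)
```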
For multi-view generation, we utilize three metrics: peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) [26], and learned perceptual image patch similarity (LPIPS) [27]. PSNR measures the quality loss of an image by calculating the peak signal-to-noise ratio between the original image and the image after compression or processing. The formula for PSNR is given by Equation (11).
PSNR = 10 \cdot \log_{10}\left(\frac{MAX^2}{MSE}\right)   (11)
where MAX represents the maximum possible pixel value (usually 255), and MSE (mean squared error) denotes the average of the squared differences between corresponding pixel values of the original and processed images.
SSIM is employed to compare the structural similarity between two images, taking into account brightness, contrast, and structure. The formula for SSIM is given by Equation (12).
SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}   (12)
In this context, $x$ and $y$ represent two images, $\mu_x$ and $\mu_y$ denote their respective pixel-wise means, $\sigma_x^2$ and $\sigma_y^2$ represent their pixel-wise variances, and $\sigma_{xy}$ is the covariance between the two images. The constants $C_1$ and $C_2$ are included for numerical stability during computation. The structural similarity index (SSIM) ranges between $-1$ and $1$, where a value closer to $1$ indicates greater similarity between the two images.
LPIPS is an image perceptual similarity metric learned through neural network techniques, utilized to quantify the perceptual similarity between two images. It evaluates the perceptual differences between two images by training a neural network. Typically, this network employs convolutional neural network (CNN) architecture and is trained on large-scale datasets to learn the perceptual similarity between images. The formula for LPIPS is commonly represented as Equation (13).
LPIPS(x, y) = \sum_{i} w_i \cdot d_i(x, y)   (13)
In this context, $x$ and $y$ represent two images, $d_i(x, y)$ denotes the feature distance computed at the $i$-th layer of the neural network, and $w_i$ represents the weight associated with that layer.
The advantage of LPIPS lies in its ability to capture perceptual differences between images, rather than solely focusing on pixel-level disparities. This characteristic enables LPIPS to better reflect human subjective perception across various image processing tasks.
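As an illustration of how a generated view can be scored against its real counterpart, the sketch below uses numpy for PSNR, scikit-image for SSIM, and the lpips package for LPIPS; this tooling is an assumption made for demonstration, not necessarily what was used to produce Table 4.

```python
import numpy as np
import torch
import lpips                                        # pip install lpips
from skimage.metrics import structural_similarity

_lpips_net = lpips.LPIPS(net="alex")                # AlexNet-based perceptual metric

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Equation (11): PSNR in dB for uint8 images."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def _to_tensor(img: np.ndarray) -> torch.Tensor:
    """H x W x 3 uint8 -> 1 x 3 x H x W tensor scaled to [-1, 1], as lpips expects."""
    return torch.from_numpy(img).permute(2, 0, 1).float().unsqueeze(0) / 127.5 - 1.0

def evaluate_pair(real: np.ndarray, fake: np.ndarray):
    """Score one real/generated view pair with PSNR, SSIM, and LPIPS."""
    p = psnr(real, fake)
    s = structural_similarity(real, fake, channel_axis=2)          # Equation (12)
    with torch.no_grad():
        l = _lpips_net(_to_tensor(real), _to_tensor(fake)).item()  # Equation (13)
    return p, s, l
```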

3. Results

3.1. Model Training

This study was conducted and analyzed on the following hardware setup: an Intel(R) Xeon(R) E5-2680 v4 @ 2.40 GHz CPU, 32 GB of RAM, an NVIDIA RTX A4000 graphics card (NVIDIA, Santa Clara, CA, USA), the Ubuntu 20.04.2 LTS operating system, CUDA version 11.6 (NVIDIA Corporation, Santa Clara, CA, USA), and the PyTorch 1.10 deep learning framework. For training the Mask2Former model, transfer learning was employed, utilizing a pre-trained model trained for 160 k iterations on the ADE20K dataset. The input image size was set to 512 × 512, with a batch size of 2 and a learning rate of 1 × 10−4. The model underwent training for 20,000 iterations. To validate Wonder3D’s zero-shot capability, we loaded the officially released model, which was trained on the LVIS subset of the Objaverse dataset [28]. This subset contains approximately 30,000 objects, and the model was trained with an input image size of 320 × 320 and a batch size of 512 for 30 k iterations.

3.2. Comparison of Different Semantic Segmentation Models

This study employs Mask2Former for litchi image segmentation. We selected comparison models that have been the most representative in recent years. These models represent a mix of traditional CNN-based approaches (DeepLabV3+ and PSPNet), modern transformer-based methods (SegFormer), and innovative segmentation techniques (KNet and Mask2Former). Each method has its strengths, and this selection allows for a more comprehensive comparison of different models’ performance on litchi segmentation. We conducted a comparative analysis of the Mask2Former model’s performance across different input resolutions, and the results are summarized in Table 2. As evident from the table, the model achieved the highest mIoU value when the input resolution was set to 512 × 512. Although a resolution of 1024 × 1024 can provide more details, it may also introduce irrelevant features and affect the generalization ability of the model. Our pre-trained model is trained on a resolution of 512 × 512, so it is more suitable for this resolution. Despite the model’s slightly higher mPA (mean pixel accuracy) at a resolution of 1024 × 1024, we ultimately opted for 512 as the input resolution. This decision was made with the consideration of computational resource consumption and efficiency in practical applications. By selecting 512 × 512 as the input resolution, we aimed to optimize the utilization of computational resources, thereby enhancing training speed and reducing computational costs while maintaining model performance.
To further enhance the robustness and accuracy of the semantic segmentation model for the task of litchi segmentation, we applied a series of data augmentation techniques during training. These techniques include Random Resize, where images were scaled with a random ratio ranging from 0.5 to 2.0, simulating different physical sizes and shooting distances to help the model adapt to litchis of various sizes and distances. We also used Random Crop, where images were randomly cropped to a size of 512 × 512 with a maximum class ratio of 0.75, encouraging the model to focus on different parts of the image during training and enhancing its robustness to partial occlusions and edge regions. Additionally, we applied Random Flip, where images were horizontally flipped with a probability of 0.5, aiding the model in adapting to different perspectives and orientations. PhotoMetricDistortion was also used to adjust brightness, contrast, saturation, and hue, simulating various lighting conditions to help the model maintain stable performance under changing illumination. We conducted ablation experiments using Mask2Former with a Swin-S backbone to demonstrate the effectiveness of these preprocessing operations. We trained the model twice on the same dataset, once with the augmentation techniques and once without, keeping all other conditions identical. After 20,000 iterations, the model with data augmentation achieved an mIoU of 79.79% and an mPA of 85.82%, while the model without data augmentation achieved an mIoU of 75.64% and an mPA of 80.40%. The experimental results clearly indicate that data augmentation significantly improves model performance. The primary reason is the high diversity within the litchi dataset, which demands a robust model. These data augmentation techniques effectively enhance the model’s robustness. Consequently, we will employ these data augmentation preprocessing methods in our subsequent experiments.
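The transform names above (RandomResize, RandomCrop, RandomFlip, PhotoMetricDistortion) match those in the MMSegmentation toolbox; assuming that toolchain, which the paper does not explicitly name, the augmentation pipeline could be declared roughly as follows (the resize base scale is an illustrative value, not one reported in the paper).

```python
# Hypothetical MMSegmentation-style training pipeline reproducing the
# augmentations described above; exact keys may differ between versions.
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='RandomResize', scale=(2048, 512), ratio_range=(0.5, 2.0), keep_ratio=True),
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),
    dict(type='RandomFlip', prob=0.5),
    dict(type='PhotoMetricDistortion'),   # brightness, contrast, saturation, hue jitter
    dict(type='PackSegInputs'),
]
```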
To validate the advantages of using Swin-S as the backbone of the Mask2Former model, training was conducted with DeepLabV3+, PSPNet, SegFormer, KNet, and Mask2Former with Swin-L as the backbone. Both the training and test sets are consistent with Mask2Former (S). The training results are compared with those obtained from Mask2Former (S), as shown in Table 3. Table 3 illustrates that the Mask2Former model with Swin-S as the backbone outperforms network models with ResNet-101, MIT-B5, and Swin-L as backbones in terms of detection performance. Similarly, when comparing different models using Mask2Former, the detection performance of the Swin-S backbone, which has fewer parameters, is higher than that of Swin-L with more parameters. The proposed method achieves higher mIoU and mPA values than Mask2Former with Swin-L as the backbone, with improvements of 1.15% and 2.47%, respectively, indicating excellent performance of the proposed model. This is attributed to Mask2Former’s combination of backbone networks for feature extraction, pixel decoder, and transformer decoder, allowing the model to fully leverage multiscale features in images, thereby enhancing segmentation accuracy. Mask2Former demonstrates effective segmentation for both large and small objects. PA is an important metric for evaluating semantic segmentation models, with PA values for each class shown in Figure 6. The results indicate that although the segmentation performance of the Mask2Former (S) model surpasses that of other models, the segmentation performance for litchi branches is poor. This could be attributed to the complex growth patterns and intertwined structures of litchi branches, which increase the difficulty of precise segmentation. Therefore, secondary processing of segmented images is necessary.
The semantic segmentation results of the Mask2Former (S) model are illustrated in Figure 7. As depicted, the segmentation model divides the litchi images into three categories: background (purple), litchi fruits (green), and branches (yellow). From the segmentation results, it is evident that the Mask2Former (S) model accurately segments litchi fruits from the field images, with clear segmentation edges and good adaptability and robustness. However, there are slight deficiencies in segmenting the litchi branches, indicating the need for further optimization and improvement to enhance its performance.

3.3. Experiment on HSV Color Space Thresholding for Litchi Branches

3.3.1. Results of Litchi Branch Generation by DDPM

To better analyze the distribution patterns of litchi branches in the HSV color space, we employed DDPM to augment the litchi branch dataset. In this study, we utilized the PyTorch framework, trained the network for 400 epochs on a dataset consisting of 100 litchi branch images, with a batch size of 4, a learning rate of 1 × 10−4, and the Adam optimizer. A total of 100 litchi branch images were generated, from which we selected and retained 48. These 148 litchi branch images were used to test the HSV color space thresholds of litchi branches. Figure 8 displays some of the generated litchi branch data.

3.3.2. Histogram of Average Hue Values for Litchi Branches

We generated a histogram of the average hue values using the litchi branch dataset. The vertical axis (Y-axis) represents the frequency of hue values falling within a given H-value interval; in the averaged histogram, it signifies the average count over all images within the same H-value interval. The horizontal axis (X-axis) represents the hue value, i.e., the H (Hue) channel of the HSV color space. In this experiment, the H values were evenly divided into 180 intervals, with each interval representing one bin. In OpenCV’s 8-bit representation, the H channel ranges from 0 to 179 (hue in degrees halved). Here, we excluded the case of H = 0 (i.e., excluding the background color and considering only the litchi branch main body), so only the bins above 0 were considered.
In this study, we only considered the hue (H) thresholds and did not take into account saturation (S) and value (V) thresholds. Firstly, saturation indicates the degree to which a color is close to the spectral color; the higher the saturation, the darker the color and the closer it is to the spectral color, while the lower the saturation, the lighter the color and the closer it is to white. Due to the presence of lighting and shadow, the same color may exhibit different saturations under different lighting conditions. If saturation is used as one of the segmentation criteria, the segmentation results may be affected by changes in lighting, leading to instability. Secondly, value determines the brightness of colors in the color space. Under different lighting conditions, the same color’s brightness may also vary. If brightness is used as one of the segmentation criteria, it will similarly be affected by changes in lighting, leading to unstable segmentation results. Therefore, setting only hue (H) thresholds without considering saturation (S) and value (V) thresholds can better cope with the effects of lighting and shadow, thereby improving segmentation robustness. This approach simplifies the segmentation process and maintains segmentation result stability, making it more suitable for litchi branch segmentation tasks in different scenarios.
In analyzing the distribution characteristics of hue values for litchi branches (Figure 9), we observed that the majority of litchi branch hue values are concentrated within the range of 0 to 25. Based on this observation and our assessment of background hue values in the experiment, we found that background hue values basically do not fall within the threshold interval of 0 to 35. To enhance the robustness of image processing, we decided to set a broader range of hue value thresholds, specifically from 0 to 35, to ensure more accurate extraction of litchi branches during the image segmentation process while avoiding misidentifying the background as litchi branches.
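The averaged hue histogram and the adopted threshold can be reproduced with OpenCV along the following lines; this is a sketch under the stated setup (branch-only images on a black background), with a hypothetical dataset path.

```python
import glob

import cv2
import numpy as np

def average_hue_histogram(image_paths):
    """Average H-channel histogram over a set of branch-only images (black background)."""
    acc = np.zeros(180, dtype=np.float64)
    for path in image_paths:
        bgr = cv2.imread(path)
        hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
        acc += cv2.calcHist([hsv], [0], None, [180], [0, 180]).ravel()
    acc /= len(image_paths)
    acc[0] = 0            # drop the H = 0 bin (background), as described above
    return acc

hist = average_hue_histogram(glob.glob("branch_dataset/*.png"))  # hypothetical path
low, high = 0, 35          # hue threshold range finally adopted for litchi branches
```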

3.4. Comparison of Improved Litchi Branch Extraction Methods

Improving the extraction method for litchi branches allows for more accurate identification and extraction of the litchi main body. This prevents the inclusion of background or other irrelevant parts in the generated multi-view images, thus enhancing the quality of the generated images. Moreover, it enables more precise control over the parts included in the final image. For instance, when using the original litchi image as input for Wonder3D, the model fails to accurately segment the litchi and consistently cuts off litchi branches, which is an undesired outcome. By providing accurately segmented litchi main bodies as input, the model can more effectively process image data, accelerate algorithm execution, and enhance the efficiency of generated image production.
When extracting litchis and branches, we initially employed a semantic segmentation algorithm to process the original image and obtain masks for litchis and branches. The purpose of this step was to assign each pixel in the image to different categories, enabling precise localization and identification of objects. Subsequently, to address the issue of low accuracy in identifying litchi branches, we introduced an analysis of the HSV color space. Based on the observations above, we set the hue value threshold to 0–35. Leveraging the predicted results of the litchi branch mask, we determined the surrounding position of the litchi branch mask. Through dilation operations, we expanded the area of the litchi branch mask outward to form a larger region, serving as a reference for restricting the HSV detection range. This effectively avoids outputting irrelevant objects. Subsequently, the predicted branch mask is outputted along with pixels within the 0–35 hue value range as the output for litchi branches. Meanwhile, to avoid affecting the litchi itself, we excluded the area covered by the litchi mask. The following flowchart (Figure 10) illustrates the detailed process of extracting litchis and branches.
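A minimal OpenCV sketch of this extraction logic (the process in Figure 10) is given below. It assumes 8-bit binary masks for the predicted fruit and branch regions; the structuring-element size and number of dilation iterations are illustrative choices rather than values reported in the paper.

```python
import cv2
import numpy as np

def extract_litchi(image_bgr, fruit_mask, branch_mask,
                   hue_range=(0, 35), kernel_size=15, iterations=2):
    """Combine the predicted masks with dilation and an HSV hue threshold.

    image_bgr:   original image (H x W x 3, uint8)
    fruit_mask:  binary mask of litchi fruits predicted by Mask2Former
    branch_mask: binary mask of litchi branches predicted by Mask2Former
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    dilated = cv2.dilate(branch_mask, kernel, iterations=iterations)

    # Region newly added by dilation (excluding the original branch prediction)
    added = cv2.subtract(dilated, branch_mask)

    # Keep only the added pixels whose hue falls in the branch range (0-35);
    # S and V are left unconstrained, as only the hue threshold is used.
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    hue_ok = cv2.inRange(hsv, (hue_range[0], 0, 0), (hue_range[1], 255, 255))
    added_branch = cv2.bitwise_and(added, hue_ok)

    # Final branch mask = original prediction + color-verified added area,
    # excluding anything already covered by the fruit mask.
    branch_full = cv2.bitwise_or(branch_mask, added_branch)
    branch_full = cv2.bitwise_and(branch_full, cv2.bitwise_not(fruit_mask))

    litchi_mask = cv2.bitwise_or(fruit_mask, branch_full)
    return cv2.bitwise_and(image_bgr, image_bgr, mask=litchi_mask)
```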
Figure 11 and Figure 12 compare the extraction effects of litchi branches before and after improvement, respectively. Through comparative experiments, we found that using the previous litchi extraction method, due to insufficient segmentation performance for the litchi branches, instances of missed detection occurred in practical operations. This indicates that when attempting to extract litchis using conventional methods, litchi branches are often found to be missing, or the connection between litchi branches and litchi fruits is incorrect. However, by introducing the improved extraction method, significant changes were observed. The improved model, after predicting the litchi branch mask, could easily cover the entire litchi branch through dilation operations. Although over-segmentation may occur in this process, where some background areas are incorrectly included, we cleverly utilized the HSV color space thresholding to successfully isolate these over-segmented background areas. Adopting the improved method resulted in a significant improvement in the segmentation effect for litchi branches. In summary, combining semantic segmentation masks with HSV color space and dilation operations to extract litchi branches is a feasible approach. By handling the boundaries between masks and utilizing range information around litchi masks, we can effectively extract litchi branches and minimize the occurrences of false positives and missed detections.

3.5. Comparison of Different Multi-View Generation Models

3.5.1. Qualitative Comparison

In Figure 13, we present the generation results of Wonder3D, Zero123++, Zero-1-to-3, and Syncdreamer [19] using litchi single-view RGB images as inputs. The litchi images are sourced from the litchi dataset used in our semantic segmentation research. By applying our proposed method, the extraction of litchi demonstrates outstanding performance, maintaining the integrity of both fruits and branches without extraneous interference. Upon observing the contents depicted in the figures, we notice that Zero-1-to-3, while exhibiting satisfactory results in detail texture and visual coherence in generating new viewpoint litchi images, suffers from geometric misalignment issues, with instances of branch misplacement. This is attributed to its independent processing of each view, resulting in a lack of multi-view consistency. Conversely, Syncdreamer’s generated results show relatively good structural consistency due to its introduction of a volume attention scheme to enhance multi-view image consistency. However, it exhibits texture distortion in the details, likely due to the model’s sensitivity to the pitch of input images. For objects like litchi, surface texture details are crucial, making such texture distortion unacceptable for litchi image generation tasks. On the other hand, Zero123++ demonstrates more consistent and higher-quality new viewpoint image generation capabilities. It inherits the texture details from Zero-1-to-3 while possessing the structural consistency of Syncdreamer. Nevertheless, upon close examination, we still identify some issues with Zero123++, as shown in Figure 14, where although it demonstrates a degree of consistency, geometric misalignment persists, especially evident when handling multiple litchi fruits. In contrast, Wonder3D, leveraging its robust cross-domain diffusion model and multi-view cross-domain attention mechanism, handles such scenarios better, maintaining high consistency in color and geometric shape across multiple litchi views. The morphology and texture details generated by Wonder3D exhibit notable consistency across different viewpoints, making it the most suitable choice for litchi new viewpoint generation tasks.
Furthermore, we tested Wonder3D under various challenging conditions, including overexposure, partial shading, and occlusion by leaves. These results are illustrated in Figure 15.
In Figure 15a, we examine the impact of overexposure. The results reveal that while the model accurately predicts the overall litchi contour, there are noticeable color discrepancies. Specifically, in viewpoints unfamiliar to the model, the litchi color is inaccurately predicted. This is likely due to overexposure, which can cause loss of detail in the highlighted regions and introduce noise that adversely affects subsequent image processing stages, thereby altering the color representation. Figure 15b demonstrates the model’s performance under partial shading. The results indicate that partial shading does not significantly affect the model’s predictions. The shading is not severe, merely reducing brightness while preserving details. Consequently, the model’s predictions remain relatively unaffected by such minor lighting variations, likely because it can “understand” that these changes are due to shading rather than inherent object or environmental alterations. Figure 15c,d illustrate the scenarios where the litchi is occluded by leaves, showing images with and without the leaves. The model’s performance in these occluded scenarios is generally suboptimal. Although the model can predict the spatial relationship between the leaves and the litchi, both scenarios exhibit texture distortion. Several factors might contribute to this: the information limitation due to the single input image, which restricts the model’s ability to directly observe the hidden details when leaves occlude parts of the litchi, making accurate multi-view prediction challenging, and the influence of the training data. The performance of the model is heavily reliant on the training data. If the training dataset lacks sufficient samples of litchi occluded by leaves or the diversity of such samples is inadequate, the model may not learn the necessary detail features. This deficiency can hinder the model’s ability to generate accurate details for occluded parts.
In summary, to ensure the multi-view generation model can accurately and qualitatively predict multiple viewpoints of litchi images, we strongly recommend using images of unobstructed and complete litchi as inputs. Doing so will significantly enhance the model’s performance in generating precise and comprehensive multi-view images.

3.5.2. Quantitative Comparison

In the field of image processing and computer vision, quantitative comparison is a crucial step in assessing algorithm performance and result quality. To evaluate the performance of different models in litchi new viewpoint generation tasks, we constructed a simulated litchi orchard environment in the laboratory and captured a total of 200 images of litchis from various angles. For each litchi image sample, we captured images from six different viewpoints. The azimuth angles of these six views were 0°, 45°, 90°, 180°, −90°, and −45°. Using the same captured single-view RGB image as input to the four models, we compared the “fake images” of litchis generated by each model from different viewpoints with the “real images” of the corresponding viewpoints captured with a camera. We conducted extensive numerical evaluations using three metrics covering different aspects of image similarity: PSNR, SSIM, and LPIPS. Table 4 presents the performance of the Wonder3D, Zero123++, Zero-1-to-3, and Syncdreamer models in generating new viewpoint images of litchis, using litchi images extracted with the method proposed in this paper as input. It can be observed that Wonder3D exhibits excellent performance in generating new viewpoint images of litchis. The performance difference between Zero-1-to-3 and Zero123++ is minimal, with both models yielding satisfactory results overall. However, the quantitative performance of the Syncdreamer model is not ideal, despite maintaining good consistency in generating new viewpoint images. Syncdreamer often exhibits texture distortion in litchi images, suggesting that it may not be suitable for the current litchi new viewpoint generation task. Considering the comprehensive analysis above, we ultimately selected the Wonder3D model for our litchi new viewpoint generation task.

3.6. Litchi Multi-View Generation Network Based on the Two-Stage Model

Based on the aforementioned studies, this research proposes a two-stage litchi multi-view generation model based on Mask2Former and Wonder3D (Figure 16). Given an outdoor litchi image as input, the image undergoes semantic segmentation to generate a predicted mask, followed by our proposed method for precise litchi image segmentation to extract the desired litchi image. This segmented litchi image is then fed into Wonder3D to achieve multi-view generation.
Table 5 presents a comparison between the results obtained using the combined approach and those from a single model, based on 50 image pairs. The data indicate (Table 5) that after adopting our proposed combined approach, the PSNR (peak signal-to-noise ratio) increased by 0.21 dB, the SSIM (structural similarity index) increased by 0.0121, and LPIPS (learned perceptual image patch similarity) decreased by 0.060. This is because using raw outdoor litchi images as inputs to Wonder3D sometimes results in inaccurate litchi segmentation by invoking third-party segmentation models for foreground segmentation. Additionally, there is a tendency to clip litchi branches (Figure 17). The integrity of litchi branches is crucial for subsequent research tasks such as the localization of litchi-picking robots and 3D reconstruction of litchi fruits. Obtaining high-quality segmentation results often requires manual intervention, which is not desirable. This indicates the effectiveness of the proposed two-stage model.

4. Discussion

We propose a two-stage litchi precise extraction and multi-view generation model based on Mask2Former and Wonder3D. Despite the satisfactory performance of the proposed two-stage model in generating new perspectives of litchis, the current framework still possesses some unresolved limitations. For instance, the model can only generate six views, which may impact the accuracy of the three-dimensional reconstruction of litchis due to the limited number of views. Furthermore, the current framework has constraints regarding the integrity of captured litchis. In cases of overexposure, partial shading, and leaf occlusion, the model’s performance is adversely affected. Overexposure results in color discrepancies, partial shading can reduce detail accuracy, and leaf occlusion leads to texture distortion. Therefore, high-quality, unoccluded images are essential for optimal model performance in multi-view generation.
To address these issues, future models can undertake several measures. For litchi extraction, enhancing the dataset of the semantic segmentation model or optimizing the model for improved segmentation accuracy can be beneficial. Regarding occlusion issues, incorporating an image inpainting network to repair and complete small occluded areas can ensure consistency in the reconstructed litchi. Additionally, for multi-view generation and 3D reconstruction of litchis, collecting and constructing a 3D dataset of litchis, and applying transfer learning to Wonder3D or other superior models, can enhance the model’s generalization capability in litchi reconstruction tasks to adapt to various types of litchis.

5. Conclusions

In this study, we propose a two-stage litchi precise extraction and multi-view generation model, which combines the Mask2Former semantic segmentation network and the Wonder3D multi-view generation network to achieve accurate segmentation extraction and multi-view generation of field litchi images. In the semantic segmentation model, the litchi fruits and branches are segmented by the Mask2Former network, with mIoU and mPA values of 79.79% and 85.82%, respectively. For the relatively low accuracy of litchi branch segmentation, we propose a method for litchi branch extraction, which effectively improves the segmentation accuracy by combining the semantic segmentation-predicted mask with morphological dilation and HSV color space. Based on the segmented litchi images, we use Wonder3D for multi-view generation. Compared to using Wonder3D alone, the Mask2Former + Wonder3D model improves PSNR and SSIM by 0.21 dB and 0.0121, respectively, while reducing LPIPS by 0.064. The proposed two-stage-model-based method for generating multiple views from a single field litchi image achieves a PSNR, SSIM, and LPIPS of 18.89, 0.8199, and 0.114, respectively. The results demonstrate that Wonder3D has good generalization ability for litchis, providing high-quality geometric shapes. The proposed two-stage model can be used for litchi dataset expansion, litchi 3D reconstruction, litchi picking, etc., providing innovative methods and technical support for litchi research, promoting the development of smart agriculture, and having enormous potential.

Author Contributions

Conceptualization, Y.L. (Yuanhong Li) and J.W.; Data curation, M.L.; Formal analysis, H.S.; Funding acquisition, Y.L. (Yubin Lan); Investigation, J.W.; Methodology, Y.L. (Yuanhong Li) and J.W.; Project administration, Y.L. (Yuanhong Li); Resources, Y.L. (Yubin Lan); Software, M.L.; Supervision, Y.L. (Yuanhong Li); Validation, J.W., H.S. and J.L.; Visualization, M.L.; Writing—original draft, Y.L. (Yuanhong Li) and J.W.; Writing—review & editing, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Plan Project (2023YFD2000200), the Laboratory of Lingnan Modern Agriculture Project (NT2021009), the ‘111 Center’ (D18019), and the National Natural Science Foundation of China (Grant No. 32301708).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Part of the original data of this study can be found at https://github.com/Jingbaostudy/Dataset.git (accessed on 27 May 2024).

Acknowledgments

We would like to thank all reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, L.; Wang, K.; Wang, K.; Zhu, J.; Hu, Z. Nutrient Components, Health Benefits, and Safety of Litchi (Litchi Chinensis Sonn.): A Review. Compr. Rev. Food Sci. Food Saf. 2020, 19, 2139–2163.
  2. Wene, Q.I.; Houbin, C.; Tao, L.; Fengxian, S. Development Status, Trend and Suggestion of Litchi Industry in Mainland China. Guangdong Agric. Sci. 2019, 46, 132–139.
  3. Li, C.; Lin, J.; Li, Z.; Mai, C.; Jiang, R.; Li, J. An Efficient Detection Method for Litchi Fruits in a Natural Environment Based on Improved YOLOv7-Litchi. Comput. Electron. Agric. 2024, 217, 108605.
  4. Yu, L.; Xiong, J.; Fang, X.; Yang, Z.; Chen, Y.; Lin, X.; Chen, S. A Litchi Fruit Recognition Method in a Natural Environment Using RGB-D Images. Biosyst. Eng. 2021, 204, 50–63.
  5. Xie, J.; Jing, T.; Chen, B.; Peng, J.; Zhang, X.; He, P.; Yin, H.; Sun, D.; Wang, W.; Xiao, A.; et al. Method for Segmentation of Litchi Branches Based on the Improved DeepLabv3+. Agronomy 2022, 12, 2812.
  6. Li, Y.; Wang, X.; Zhao, Z.; Han, S.; Liu, Z. Lagoon Water Quality Monitoring Based on Digital Image Analysis and Machine Learning Estimators. Water Res. 2020, 172, 115471.
  7. Mo, J.; Lan, Y.; Yang, D.; Wen, F.; Qiu, H.; Chen, X.; Deng, X. Deep Learning-Based Instance Segmentation Method of Litchi Canopy from UAV-Acquired Images. Remote Sens. 2021, 13, 3919.
  8. Li, Y.; Zhao, Z.; Luo, Y.; Qiu, Z. Real-Time Pattern-Recognition of GPR Images with YOLO v3 Implemented by Tensorflow. Sensors 2020, 20, 6476.
  9. Li, Y.; Zhao, Z.; Xu, W.; Liu, Z.; Wang, X. An Effective FDTD Model for GPR to Detect the Material of Hard Objects Buried in Tillage Soil Layer. Soil Tillage Res. 2019, 195, 104353.
  10. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Networks. arXiv 2014, arXiv:1406.2661.
  11. Pan, Z.; Yu, W.; Yi, X.; Khan, A.; Yuan, F.; Zheng, Y. Recent Progress on Generative Adversarial Networks (GANs): A Survey. IEEE Access 2019, 7, 36322–36333.
  12. Wang, C.; Xiao, Z. Lychee Surface Defect Detection Based on Deep Convolutional Neural Networks with GAN-Based Data Augmentation. Agronomy 2021, 11, 1500.
  13. Huang, Y.; Chen, Z.; Liu, J. Limited Agricultural Spectral Dataset Expansion Based on Generative Adversarial Networks. Comput. Electron. Agric. 2023, 215, 108385.
  14. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2020.
  15. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. In Advances in Neural Information Processing Systems; The MIT Press: Cambridge, MA, USA, 2021.
  16. Chen, D.; Qi, X.; Zheng, Y.; Lu, Y.; Huang, Y.; Li, Z. Synthetic Data Augmentation by Diffusion Probabilistic Models to Enhance Weed Recognition. Comput. Electron. Agric. 2024, 216, 108517.
  17. Moreno, H.; Gómez, A.; Altares-López, S.; Ribeiro, A.; Andújar, D. Analysis of Stable Diffusion-Derived Fake Weeds Performance for Training Convolutional Neural Networks. Comput. Electron. Agric. 2023, 214, 108324.
  18. Wu, J.; Wyman, O.; Tang, Y.; Pasini, D.; Wang, W. Multi-View 3D Reconstruction Based on Deep Learning: A Survey and Comparison of Methods. Neurocomputing 2024, 582, 127553.
  19. Liu, Y.; Lin, C.; Zeng, Z.; Long, X.; Liu, L.; Komura, T.; Wang, W. SyncDreamer: Generating Multiview-Consistent Images from a Single-View Image. arXiv 2024, arXiv:2309.03453.
  20. Shen, Q.; Yang, X.; Wang, X. Anything-3D: Towards Single-View Anything Reconstruction in the Wild. arXiv 2023, arXiv:2304.10261.
  21. Shi, R.; Chen, H.; Zhang, Z.; Liu, M.; Xu, C.; Wei, X.; Chen, L.; Zeng, C.; Su, H. Zero123++: A Single Image to Consistent Multi-View Diffusion Base Model. arXiv 2023, arXiv:2310.15110.
  22. Long, X.; Guo, Y.-C.; Lin, C.; Liu, Y.; Dou, Z.; Liu, L.; Ma, Y.; Zhang, S.-H.; Habermann, M.; Theobalt, C.; et al. Wonder3D: Single Image to 3D Using Cross-Domain Diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
  23. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-Attention Mask Transformer for Universal Image Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
  24. Smith, A.R. Color Gamut Transform Pairs. ACM SIGGRAPH Comput. Graph. 1978, 12, 12–19.
  25. Sohl-Dickstein, J.; Weiss, E.A.; Maheswaranathan, N.; Ganguli, S. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 7–9 July 2015.
  26. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  27. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018.
  28. Deitke, M.; Schwenk, D.; Salvador, J.; Weihs, L.; Michel, O.; VanderBilt, E.; Schmidt, L.; Ehsani, K.; Kembhavi, A.; Farhadi, A. Objaverse: A Universe of Annotated 3D Objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
Figure 1. Figures illustrating dataset quantities (a–c), lighting conditions (d–f), and varieties (g–i).
Figure 2. Mask2Former network architecture.
Figure 3. Litchi branch dataset.
Figure 4. The directed graphical model considered in DDPM.
Figure 5. Wonder3D takes a single image as input and uses text embeddings from the CLIP model, camera parameters from multiple views, and a domain switcher to generate consistent multi-view normal maps and color images.
Figure 6. Comparison of PA values for each category in the semantic segmentation models.
Figure 7. Semantic segmentation effect of the Mask2Former segmentation model.
Figure 8. Litchi branches generated by DDPM.
Figure 9. Histogram of the average distribution of litchi branch color values.
Figure 10. Segmentation process flowchart for litchis and branches.
Figure 11. The litchi extraction method before improvement.
Figure 12. The improved litchi extraction method.
Figure 13. Qualitative comparison of Wonder3D with various methods on the single-image-to-multi-view task.
Figure 14. Comparison of the generation performance of Wonder3D and Zero123++ on multiple litchi fruits.
Figure 15. Performance of the multi-view generation model under various conditions: overexposure (a), shadow (b), and occlusion by leaves (c,d).
Figure 16. Two-stage litchi multi-view generation model.
Figure 17. Effectiveness of the proposed litchi segmentation method.
Table 1. Semantic segmentation label description.

Label        Explanation
Background   Image background
Litchi       Litchi fruit
Branch       Litchi branch
Table 2. Effect of input resolution and data augmentation on model performance.

Model         Backbone   Input Resolution   Data Augmentation   mIoU (%)   mPA (%)
Mask2Former   Swin-S     256 × 256          Yes                 78.52      83.36
Mask2Former   Swin-S     1024 × 1024        Yes                 79.04      87.39
Mask2Former   Swin-S     512 × 512          Yes                 79.79      85.82
Mask2Former   Swin-S     512 × 512          No                  75.64      80.40
Table 3. Comparison of the segmentation models.

Model         Backbone     mIoU (%)   mPA (%)
DeepLabV3+    ResNet-101   73.96      77.55
PSPNet        ResNet-101   73.29      79.03
SegFormer     MIT-B5       74.45      80.48
KNet          Swin-L       76.44      81.26
Mask2Former   Swin-L       78.64      83.35
Mask2Former   Swin-S       79.79      85.82
Table 4. Quantitative comparison of multi-view generation methods.

Method        PSNR (dB) ↑   SSIM ↑   LPIPS ↓
SyncDreamer   8.01          0.4578   0.492
Zero123       14.16         0.7411   0.283
Zero123++     16.51         0.7959   0.199
Wonder3D      18.89         0.8199   0.114
Table 5. Quantitative comparison between Wonder3D alone and the proposed Mask2Former + Wonder3D pipeline on 50 image pairs.

Method                   PSNR (dB) ↑   SSIM ↑   LPIPS ↓
Wonder3D                 19.24         0.8231   0.168
Mask2Former + Wonder3D   19.45         0.8352   0.108