Article

Semi-Self-Supervised Domain Adaptation: Developing Deep Learning Models with Limited Annotated Data for Wheat Head Segmentation

by Alireza Ghanbari 1, Gholam Hassan Shirdel 1 and Farhad Maleki 2,*
1 Department of Mathematics and Computer Sciences, Faculty of Sciences, University of Qom, Qom 3716146611, Iran
2 Department of Computer Science, University of Calgary, Calgary, AB T2N 1N4, Canada
* Author to whom correspondence should be addressed.
Algorithms 2024, 17(6), 267; https://doi.org/10.3390/a17060267
Submission received: 19 March 2024 / Revised: 4 June 2024 / Accepted: 6 June 2024 / Published: 17 June 2024
(This article belongs to the Special Issue Efficient Learning Algorithms with Limited Resources)

Abstract

Precision agriculture involves the application of advanced technologies to improve agricultural productivity, efficiency, and profitability while minimizing waste and environmental impacts. Deep learning approaches enable automated decision-making for many visual tasks. However, in the agricultural domain, variability in growth stages and environmental conditions, such as weather and lighting, presents significant challenges to developing deep-learning-based techniques that generalize across different conditions. The resource-intensive nature of creating extensive annotated datasets that capture these variabilities further hinders the widespread adoption of these approaches. To tackle these issues, we introduce a semi-self-supervised domain adaptation technique based on deep convolutional neural networks with a probabilistic diffusion process, requiring minimal manual data annotation. Using only three manually annotated images and a selection of video clips from wheat fields, we generated a large-scale computationally annotated dataset of image–mask pairs and a large dataset of unannotated images extracted from video frames. We developed a two-branch convolutional encoder–decoder model architecture that uses both synthesized image–mask pairs and unannotated images, enabling effective adaptation to real images. The proposed model achieved a Dice score of 80.7% on an internal test dataset and a Dice score of 64.8% on an external test set composed of images from five countries and spanning 18 domains, indicating its potential to develop generalizable solutions that could encourage the wider adoption of advanced technologies in agriculture.

1. Introduction

Precision agriculture refers to the use of advanced technologies, including GPS guidance, control systems, sensors, robotics, drones, autonomous vehicles, and variable rate technology, to optimize farm management. It aims to reduce costs and increase yields in farming while ensuring sustainability and environmental protection [1]. Deep learning (DL) methodologies can offer efficient, automated, and data-driven decision-making in agricultural practices. DL has demonstrated significant advancements in visual data analysis across various tasks, including image classification [2,3], object detection [4,5], semantic segmentation [6,7,8], and instance segmentation [9,10,11]. Automated visual monitoring of agricultural fields can aid in the early identification of issues such as pest infestations, diseases, and nutrient deficiencies while optimizing resource utilization. Consequently, this approach facilitates timely interventions, which can lead to enhanced crop yields, improved quality of harvests, and increased overall operational efficiency, thereby ensuring sustainability.
However, the wide adoption of DL approaches for crop monitoring faces significant challenges. Agricultural fields are constantly changing environments. For example, a crop field in the early growth stages is substantially different from the same field in the later stages of growth. Additionally, these systems must operate accurately under various weather and lighting conditions. This challenges the generalizability of DL models because models trained on data from a specific growth stage of a field might not generalize well to the same crop at different growth stages. In the DL context, this phenomenon is known as a distribution shift [12,13]. One could develop a large-scale dataset encompassing various crop growth stages and environmental conditions to alleviate this issue. However, this approach presents challenges to data collection and annotation. Collecting data from various growth stages of crop fields under various weather and lighting conditions is time-consuming. Further, annotating agricultural images is particularly challenging, often requiring pixel-level annotation. These images frequently contain numerous objects of interest—e.g., wheat spikes in a wheat field—making the data annotation process laborious.
Environmental conditions—such as climate, soil quality, geographical location, and pest and disease pressure—significantly contribute to data variation in crop datasets. Variations in temperature, precipitation, humidity, soil types, regional microclimates, and pest and disease prevalence collectively influence crop growth and yield outcomes. Further, phenotypic varieties introduce considerable data variation into crop datasets, attributed to genetic diversity, breeding practices, and different responses to environmental stresses. The unique traits of various genetic varieties, including their drought and pest resistance, yield potential, and nutritional content, and the specific characteristics of hybrid and traditional crops, significantly influence growth and development patterns. In addition, growth stages contribute to data variation in crop datasets, with data collected at various points in the crop lifecycle, from germination to harvesting phases, being substantially different. As such, deep learning models developed only using datasets generated from one snapshot of crop fields might not generalize well when applied to other fields with different environmental, phenotypic, and growth stages. Therefore, there is a pressing need to develop automated or semi-automated methodologies that enhance the performance of deep learning models on new datasets without requiring substantial manual annotation.
Domain adaptation techniques refer to the methodologies used to alleviate distribution shifts and can be divided into supervised, semi-supervised, self-supervised, and unsupervised approaches, depending on whether and how annotated data are used for domain adaptation [14,15]. Supervised domain adaptation relies on labeled data from both the source and target domains. Semi-supervised domain adaptation combines labeled data from the source domain with both labeled and unlabeled data from the target domain. Self-supervised domain adaptation generates its supervisory signals computationally from the unlabeled data themselves rather than from manual annotation. Finally, unsupervised domain adaptation operates solely with unlabeled data in the target domain, focusing on learning features applicable across domains.
The proposed method utilizes video clips of wheat fields and background video clips, along with only three manually annotated images. Acquiring video clips of fields and background scenes is straightforward and involves low data acquisition costs. Manual annotation of a few images is also achievable in a short period of time. Therefore, the proposed approach substantially decreases the time and effort needed to develop deep learning models for crop monitoring in agriculture. The closest work to our approach relies on extra manually annotated images and semi-automated image annotation for domain adaptation to improve model performance. In contrast, our approach does not use manually annotated data for domain adaptation. Our approach benefits from an architecture designed to bridge the domain gap. Following a multi-task learning paradigm, the proposed architecture utilizes both image reconstruction and mask prediction tasks and is trained end to end to facilitate the full utilization of both synthetic annotated data and real unannotated data, thereby enhancing the generalizability of the resulting model.
In this paper, we develop a semi-self-supervised domain adaptation approach based on a probabilistic diffusion model that operates without manual annotation of data from the target domain, thereby accelerating the model development process. Self-supervised learning refers to methodologies where supervisory signals are generated computationally from the input data, enabling learning data representations and extracting informative features without the need for manual annotation [16]. This approach allows for utilizing large-scale but unannotated datasets and has recently led to substantial progress in various image-processing tasks [16,17]. Deep diffusion probabilistic models, as self-supervised techniques, have shown remarkable results in image-processing tasks, especially in generative AI [18,19,20]. In the following, we provide a mathematical description of these models.

1.1. Background

A Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$ that satisfies the Markov property, which states that the probability of the system transitioning to the next state depends solely on the current state of the system and not the preceding events/states. The Markov property can be mathematically expressed as follows:
$$P(X_{n+1} = x \mid X_0 = x_0, X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) = P(X_{n+1} = x \mid X_n = x_n)$$
In this equation, $P$ denotes a probability distribution for the state of the system, $X_i$ is a vector-valued variable representing the state at the $i$-th step, and $x, x_0, x_1, x_2, \ldots, x_n$ are the specific state vector values of the system.
Given a data point $x_0$ from the actual data distribution, we can define a Markov chain, referred to as a forward diffusion process, by iteratively adding a multivariate Gaussian noise vector to $x_0$. More specifically, at state $X_{t-1}$, where the system has a specific state of $x_{t-1}$, we add Gaussian noise, characterized by a mean vector of $\mu_t$ and a covariance matrix of $\Sigma_t^2$, to $x_{t-1}$ to transition to a state $X_t$ with a specific value of $x_t$. The probability distribution for this transition can be described as $q(X_t = x_t \mid X_{t-1} = x_{t-1}) = \mathcal{N}(x_t;\ \mu_t = \sqrt{1-\beta_t}\, x_{t-1},\ \Sigma_t^2 = \beta_t I)$, where $\beta_t$ is a scalar, $I$ is an $m \times m$ identity matrix, and $m$ is the cardinality of $x_i$ $(0 \le i \le T)$. Considering the Markov chain property, we have
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}) \tag{1}$$
where $q(x_{1:T} \mid x_0)$ is a short notation for $q(x_1, x_2, \ldots, x_T \mid x_0)$. A state $x_t$ can be achieved by iteratively applying the Gaussian noise to $x_0$ $t$ times. Instead of an iterative process, this could also be achieved in a single step, utilizing a reparametrization trick, in which $x_t$ is rewritten as
$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon \tag{2}$$
where $\epsilon$ is an $m$-dimensional vector with standard normal distribution $\mathcal{N}(0, I)$. It can be inferred that this reparametrization preserves the distribution of $x_t$, i.e., $x_t \sim \mathcal{N}(\mu_t = \sqrt{1-\beta_t}\, x_{t-1},\ \Sigma_t^2 = \beta_t I)$. By changing the notation to $\alpha_t = 1 - \beta_t$, Equation (2) can be written as:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon \tag{3}$$
As Equation (3) is a recursive equation, we can expand it further:
$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\, \epsilon\right) + \sqrt{1-\alpha_t}\, \epsilon = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon + \sqrt{1-\alpha_t}\, \epsilon \tag{4}$$
Since $\epsilon$ has a standard normal distribution $\mathcal{N}(\mu = 0, \Sigma^2 = I)$, $\sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon$ has a normal distribution $\mathcal{N}(\mu = 0, \Sigma^2 = \alpha_t (1-\alpha_{t-1}) I)$ and $\sqrt{1-\alpha_t}\, \epsilon$ has a normal distribution $\mathcal{N}(\mu = 0, \Sigma^2 = (1-\alpha_t) I)$. Consequently, their summation, i.e., $\sqrt{\alpha_t (1-\alpha_{t-1})}\, \epsilon + \sqrt{1-\alpha_t}\, \epsilon$, has a normal distribution $\mathcal{N}(\mu = 0, \Sigma^2 = (1 - \alpha_t \alpha_{t-1}) I)$. Therefore, Equation (4) can be written as:
$$x_t = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}}\, \epsilon \tag{5}$$
By expanding Equation (5) further, we obtain the following equation:
$$x_t = \sqrt{\alpha_t \alpha_{t-1} \cdots \alpha_1}\, x_0 + \sqrt{1 - \alpha_t \alpha_{t-1} \cdots \alpha_1}\, \epsilon \tag{6}$$
Defining $\bar{\alpha}_t = \alpha_t \alpha_{t-1} \cdots \alpha_1$, Equation (6) can be written concisely as follows:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon \tag{7}$$
Given $x_0$, Equation (7) allows us to generate $x_t$ in a computationally efficient manner in just one step, rather than iteratively adding noise in $t$ steps. This is achieved by sampling from the Gaussian distribution $\mathcal{N}(x_t;\ \mu_t = \sqrt{\bar{\alpha}_t}\, x_0,\ \Sigma_t^2 = (1-\bar{\alpha}_t) I)$. $\bar{\alpha}_t$ is defined using $\alpha_1, \ldots, \alpha_t$, where $\alpha_i = 1 - \beta_i$ $(1 \le i \le t)$. The $\beta_i$ values are often defined using a linear [21] or cosine scheduler [21,22], with the cosine schedule leading to superior results, as it smoothly transitions the noise added to the data across the diffusion steps, which is essential for controlling the quality and characteristics of the generated data in diffusion models. A cosine scheduler can be defined as follows:
$$\beta_t = \beta_{\max} - \frac{1}{2}\left(\beta_{\max} - \beta_{\min}\right)\left(1 + \cos\left(\frac{t}{T}\pi\right)\right)$$
where $\beta_t$ represents the noise level at step $t$ (higher $\beta_t$ values correspond to higher noise levels); $t = 1$ and $t = T$ result in the minimum and maximum noise levels, respectively; $T$ is the total number of steps in the diffusion process; and $t$ is the current time step. All experiments in this study were conducted using $\beta_{\min} = 0.0001$ and $\beta_{\max} = 0.02$.
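To make the forward process concrete, the following is a minimal PyTorch sketch (not the authors' released code) of the cosine β scheduler above and the one-step sampling of Equation (7); the number of diffusion steps and the dummy image shape are illustrative assumptions.

```python
import math
import torch

# Cosine beta scheduler: beta_t = beta_max - 0.5*(beta_max - beta_min)*(1 + cos(t*pi/T)),
# using the beta_min/beta_max values reported in the text.
def cosine_beta_schedule(T: int, beta_min: float = 0.0001, beta_max: float = 0.02) -> torch.Tensor:
    t = torch.arange(1, T + 1, dtype=torch.float32)
    return beta_max - 0.5 * (beta_max - beta_min) * (1.0 + torch.cos(t / T * math.pi))

# One-step forward diffusion (Equation (7)): x_t = sqrt(a_bar_t)*x_0 + sqrt(1 - a_bar_t)*eps
def sample_x_t(x0: torch.Tensor, t: int, alpha_bar: torch.Tensor) -> torch.Tensor:
    eps = torch.randn_like(x0)            # standard normal noise
    a_bar_t = alpha_bar[t - 1]            # cumulative product of alpha_i up to step t
    return torch.sqrt(a_bar_t) * x0 + torch.sqrt(1.0 - a_bar_t) * eps

T = 1000                                  # assumed number of diffusion steps
betas = cosine_beta_schedule(T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

x0 = torch.rand(1, 3, 256, 256)           # dummy normalized image
x_t = sample_x_t(x0, t=250, alpha_bar=alpha_bar)  # noise-augmented image at step 250
```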
As $T$ approaches infinity, $x_T$ converges to an isotropic Gaussian distribution, meaning that all components of $x_T$ are normally distributed and decorrelated. Therefore, by having a procedure for learning the reverse distribution $q(x_{t-1} \mid x_t)$, we can start by sampling $x_T$ from a normal distribution and follow the reverse distribution to arrive at $x_0$. It should be noted that in the forward diffusion process, noise is added at each step; that is, $x_t$ is derived by adding noise to $x_{t-1}$ according to the distribution $q(x_t \mid x_{t-1})$. In the backward diffusion process, however, the aim is to reverse the forward process by removing the noise added to $x_t$ to recover $x_{t-1}$. This is achieved by a reverse transition using a distribution denoted as $q(x_{t-1} \mid x_t)$.
In practical terms, the true reverse distribution $q(x_{t-1} \mid x_t)$ is unknown and intractable to compute directly, as accurately estimating it would require complex calculations involving the entire data distribution. Instead, we approximate $q(x_{t-1} \mid x_t)$ using a parameterized model, a deep neural network, denoted as $p_\theta(x_{t-1} \mid x_t)$. Given that $q(x_{t-1} \mid x_t)$ is also Gaussian for small enough values of $\beta_t$ in the forward diffusion process, we can choose $p_\theta(x_{t-1} \mid x_t)$ to be a Gaussian distribution. In this case, the neural network is trained to parameterize the mean and variance of this Gaussian distribution.

2. Data and Methodology

This section provides an in-depth explanation of the data used in this research, along with detailed descriptions of the preprocessing methods, the model architecture, and the processes for model training and evaluation. The source code for this study is available at https://github.com/ARGhanbari/DualBranchSSLDomainAdaptation (accessed on 29 April 2024).

2.1. Data

Figure 1 illustrates the three manually annotated images ( I η , I ζ , and I τ ) used for synthesizing computationally annotated datasets. Utilizing the methodology presented in our previous work [7], we computationally synthesize three datasets: D η , a set of 8000 images derived from I η ; D η + ζ , a set of 16,000 images derived from I η and I ζ ; and D ζ + τ , a set of 4000 images derived from I ζ and I τ . Figure 2 illustrates examples of the synthesized images and their corresponding segmentation masks. In addition to these manually and computationally annotated images, for the training process, we utilize 10,592 unannotated image frames extracted from two video clips of wheat fields, with 5296 frames from each. Hereafter, we refer to these datasets as D ρ 1 and D ρ 2 . We refer to the dataset resulting from combining D ρ 1 and D ρ 2 as D ρ .
We also use two sets of Ψ and Γ as the internal and external test sets, as introduced by Najafian et al. [7]. The set Ψ comprises 100 image frames, which were randomly selected from a video clip of a wheat field and have been manually annotated to serve as an internal test set. The set Γ represents a subset of 365 manually annotated images from the GWHD dataset [23], which includes images from five countries and spans 18 domains, covering various growth stages of wheat.

2.2. Model Architecture

Figure 3 illustrates the convolutional neural network (CNN)-based model architecture employed in this research. This architecture is designed to facilitate representation learning through a diffusion process, in addition to performing the primary task of segmentation. The segmentation task is executed using computationally synthesized images along with their corresponding masks. Meanwhile, the representation learning aspect enables adaptation to real images using solely unannotated data, thereby reducing the reliance on manual annotation.
We use an encoder to develop a joint representation of both synthesized and real images. Additionally, the model architecture comprises an image decoder and a mask decoder. The former is designed to reconstruct the image from the output of the encoder, thereby enforcing learning features from real images. On the other hand, the mask decoder is tasked with developing a segmentation mask given the output of the encoder.
In implementing the encoder and decoders, we leveraged residual building blocks [24], each comprising two pairs of convolution and GroupNorm layers, supported by the Swish activation function [25] and integrated skip connections. Figure 4 shows a residual building block.
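As an illustration of this building block, the sketch below shows one possible PyTorch implementation of a residual block with two convolution and GroupNorm pairs, Swish (SiLU) activation, and a skip connection; the kernel sizes, group count, and activation placement are assumptions rather than the released configuration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two conv + GroupNorm pairs with Swish activation and a skip connection (cf. Figure 4)."""
    def __init__(self, in_channels: int, out_channels: int, groups: int = 8):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, out_channels),
            nn.SiLU(),  # Swish activation
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.GroupNorm(groups, out_channels),
            nn.SiLU(),
        )
        # 1x1 projection so the skip connection matches the output channel count
        self.skip = (nn.Identity() if in_channels == out_channels
                     else nn.Conv2d(in_channels, out_channels, kernel_size=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x) + self.skip(x)
```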
Figure 5 illustrates the encoder, consisting of one convolutional layer followed by 12 ResNet building blocks organized in six levels. Additionally, two skip-connection-free ResNet blocks serve as the network bottleneck. Moreover, down-sampling operations are performed after each pair of ResNet blocks using a convolution layer with a kernel size of three and a stride of two. The Swish activation function [25] is also consistently applied as the non-linearity across the encoder layers.
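Building on the residual block sketched above, a rough outline of such an encoder could look as follows; the channel progression and base width are illustrative assumptions, and, unlike the skip-connection-free bottleneck blocks described in the text, the bottleneck here reuses the same residual block for brevity.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Input conv, six levels of two residual blocks each with stride-2 down-sampling, two bottleneck blocks."""
    def __init__(self, in_channels: int = 3, base_channels: int = 32, levels: int = 6):
        super().__init__()
        layers = [nn.Conv2d(in_channels, base_channels, kernel_size=3, padding=1)]
        ch = base_channels
        for _ in range(levels):
            layers += [ResidualBlock(ch, ch), ResidualBlock(ch, ch)]
            layers += [nn.Conv2d(ch, ch * 2, kernel_size=3, stride=2, padding=1)]  # down-sampling
            ch *= 2
        layers += [ResidualBlock(ch, ch), ResidualBlock(ch, ch)]  # bottleneck
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```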
Synthetic images undergo a series of image augmentations [26]; then, they are fed into the encoder. The outputs of the encoder are subsequently passed to a mask decoder to generate accurate masks corresponding to the input images. To calculate the segmentation error, we use a linear combination of binary cross entropy (BCE) loss [27] and Dice loss [27] (see Equation (8)).
$$L_{BCE} = -\frac{1}{B \times N} \sum_{j=1}^{B} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$$
$$L_{Dice} = 1 - \frac{2}{B} \sum_{j=1}^{B} \frac{|M_j^s \cap \hat{M}_j^s|}{|M_j^s| + |\hat{M}_j^s|}$$
$$L_{Seg} = \lambda_{BCE} \times L_{BCE} + \lambda_{Dice} \times L_{Dice} \tag{8}$$
where $y_i$ is the true label of pixel $i$, with a value of 1 if the pixel belongs to a wheat head and 0 otherwise; $\hat{y}_i$ represents the model prediction of whether pixel $i$ is from a wheat head; $N$ is the number of pixels per image; and $B$ is the batch size. $M_j^s$ is the ground-truth mask for the synthetic image $I_j^s$, and $\hat{M}_j^s$ is the model prediction for $I_j^s$.
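A minimal sketch of this combined loss is shown below; the soft-Dice formulation over sigmoid probabilities and the small epsilon term are common differentiable approximations and are assumptions here, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(pred_logits: torch.Tensor,
                      target_masks: torch.Tensor,
                      lambda_bce: float = 1.0,
                      lambda_dice: float = 1.0,
                      eps: float = 1e-6) -> torch.Tensor:
    """Weighted sum of BCE and Dice losses (cf. Equation (8))."""
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_masks)

    # Soft Dice, computed per image and averaged over the batch
    probs = torch.sigmoid(pred_logits)
    dims = (1, 2, 3)
    intersection = (probs * target_masks).sum(dim=dims)
    denom = probs.sum(dim=dims) + target_masks.sum(dim=dims)
    dice = 1.0 - (2.0 * intersection / (denom + eps)).mean()

    return lambda_bce * bce + lambda_dice * dice
```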
We also generate noise-augmented images by applying Gaussian noise to the real image $I_j$ over $t$ $(1 \le t \le T)$ timesteps, as detailed in Equation (7). These noise-augmented images are subsequently routed through the encoder, followed by the image decoder, aiming to reconstruct the original images. The error for this branch is calculated by comparing the original image with the reconstructed image using a linear combination of MSE [28], SSIM [29], and a perceptual loss function [30]. The perceptual loss function calculates the mean absolute error between the pretrained ResNet18 output feature maps for the original image $I_j$ and the reconstructed image $\hat{I}_j$. We used the ResNet18_Weights.IMAGENET1K_V1 weights from the Torchvision Python package [31]. We used the feature maps generated by the convolutional layer before the final average pooling layer of the ResNet18 model.
$$L_{MSE} = \frac{1}{B \times N} \sum_{j=1}^{B} \sum_{i=1}^{N} (x_{ji} - \hat{x}_{ji})^2$$
$$L_{SSIM}(I_j, \hat{I}_j) = 1 - \frac{(2\mu_{I_j}\mu_{\hat{I}_j} + c_1)(2\sigma_{I_j\hat{I}_j} + c_2)}{(\mu_{I_j}^2 + \mu_{\hat{I}_j}^2 + c_1)(\sigma_{I_j}^2 + \sigma_{\hat{I}_j}^2 + c_2)}$$
$$L_{Perceptual} = \frac{1}{B} \sum_{j=1}^{B} \left\| \theta(I_j) - \theta(\hat{I}_j) \right\|_2^2$$
$$L_{Rec} = \lambda_{MSE} \times L_{MSE} + \lambda_{SSIM} \times L_{SSIM} + \lambda_{Perceptual} \times L_{Perceptual} \tag{9}$$
where $B$ is the batch size; $N$ is the number of pixels in an image $I_j$; $x_{ji}$ is the actual value of the $i$-th pixel of $I_j$; $\hat{x}_{ji}$ is the predicted value for the $i$-th pixel of $I_j$; $\mu_{I_j}$ and $\mu_{\hat{I}_j}$ are the mean intensity values of images $I_j$ and $\hat{I}_j$, respectively; $\sigma_{I_j}^2$ and $\sigma_{\hat{I}_j}^2$ are the variances of images $I_j$ and $\hat{I}_j$, respectively; $\sigma_{I_j\hat{I}_j}$ is the covariance between images $I_j$ and $\hat{I}_j$; and $c_1$ and $c_2$ are constants introduced to stabilize the division. $\theta(I_j)$ and $\theta(\hat{I}_j)$ are the ResNet18 features extracted from images $I_j$ and $\hat{I}_j$, respectively.
These losses focus on variations between the reconstructed image and the original noise-free image from different perspectives: MSE loss focuses on pixel-level variations, SSIM loss focuses on local variations, and perceptual loss focuses on high-level global variations. The reconstruction loss is calculated as a linear combination of these loss functions, as described in Equation (9), where λ M S E , λ S S I M , and λ P e r c e p t u a l are constant values and model hyperparameters.
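The sketch below illustrates one way this combined reconstruction loss could be implemented; computing SSIM globally per image (rather than over local windows) and using squared differences between ResNet18 feature maps for the perceptual term are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

# Frozen ImageNet-pretrained ResNet18 truncated before its average pooling layer,
# so it outputs the final convolutional feature maps.
_resnet = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
_features = nn.Sequential(*list(_resnet.children())[:-2])
for p in _features.parameters():
    p.requires_grad_(False)

def reconstruction_loss(x: torch.Tensor, x_hat: torch.Tensor,
                        lambda_mse: float = 1.0, lambda_ssim: float = 1.0,
                        lambda_perc: float = 1.0,
                        c1: float = 0.01 ** 2, c2: float = 0.03 ** 2) -> torch.Tensor:
    """Weighted sum of MSE, (global) SSIM, and perceptual losses (cf. Equation (9))."""
    mse = F.mse_loss(x_hat, x)

    # Global per-image SSIM from means, variances, and covariance
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), x_hat.mean(dim=(1, 2, 3))
    var_x, var_y = x.var(dim=(1, 2, 3)), x_hat.var(dim=(1, 2, 3))
    cov = ((x - mu_x[:, None, None, None]) * (x_hat - mu_y[:, None, None, None])).mean(dim=(1, 2, 3))
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    ssim_loss = (1.0 - ssim).mean()

    # Perceptual loss on ResNet18 feature maps
    perc = F.mse_loss(_features(x_hat), _features(x))

    return lambda_mse * mse + lambda_ssim * ssim_loss + lambda_perc * perc
```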

2.3. Model Training and Evaluation

Considering the size of the datasets, we train our models over a total of 50 training epochs. For each epoch, we iterate over all the data in the datasets. At each iteration, we select a batch of 32 synthesized image–mask pairs and a separate batch of 32 real images from our datasets. The image–mask pairs are fed into the segmentation branch of our model, which consists of an encoder and a mask decoder. The segmentation loss is then calculated according to Equation (8). Similarly, the real images are fed into the reconstruction branch of our model, comprising the encoder (shared with the segmentation branch) and an image decoder, where the reconstruction loss is calculated as per Equation (9). The total loss is computed by summing the segmentation loss and the reconstruction loss. Figure 3 illustrates this process.
This dual-stream approach ensures that the encoder effectively attends to both types of data, with each decoder specializing in a task corresponding to its branch—namely, segmentation or reconstruction. Model updates are orchestrated using the AdamW optimizer [32], with the learning rate set at 0.0001 . For all experiments, the coefficients λ M S E , λ S S I M , λ P e r c e p t u a l , λ B C E , and λ D i c e are set to 1. After training for 50 epochs, we selected the model with the highest Dice score as the best model. During the evaluation process, we assessed the model performance using the Dice score and IoU. As a baseline for comparison, we used a model developed in [7].
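A condensed sketch of one training iteration of this dual-branch setup is given below; it reuses the helpers from the earlier snippets (sample_x_t, alpha_bar, T, segmentation_loss, reconstruction_loss) and assumes that encoder, mask_decoder, and image_decoder are the three modules of Figure 3 and that synthetic_loader and real_loader yield the two batch types. These names are illustrative, not taken from the released code.

```python
import torch
from torch.optim import AdamW

# One optimizer over all three modules, as both branches share the encoder.
optimizer = AdamW(list(encoder.parameters())
                  + list(mask_decoder.parameters())
                  + list(image_decoder.parameters()), lr=1e-4)

for (synth_imgs, synth_masks), real_imgs in zip(synthetic_loader, real_loader):
    # Segmentation branch: synthesized image-mask pairs
    seg_logits = mask_decoder(encoder(synth_imgs))
    loss_seg = segmentation_loss(seg_logits, synth_masks)

    # Reconstruction branch: noise-augmented real images (forward diffusion, Equation (7))
    t = torch.randint(1, T + 1, (1,)).item()
    noisy = sample_x_t(real_imgs, t, alpha_bar)
    recon = image_decoder(encoder(noisy))
    loss_rec = reconstruction_loss(real_imgs, recon)

    # Total loss is the sum of both branches; the shared encoder receives gradients from both tasks.
    loss = loss_seg + loss_rec
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```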

3. Results

Table 1 presents a quantitative evaluation of the performance of our developed models, namely, model F η + ρ 1 trained on D η and D ρ 1 and model F η + ζ + ρ trained on D η + ζ and D ρ . These models were trained using both synthesized image–mask pairs and unannotated real images extracted from video clips of wheat fields. The performance of the baseline model S from Najafian et al. [7] is also reported in Table 1. The results demonstrate that the developed models, F η + ρ 1 and F η + ζ + ρ , consistently outperformed the baseline model S . These models rely on computationally annotated synthetic images and real unannotated images extracted from the video frames of wheat fields.
Figure 6 presents a qualitative evaluation of the performance of model F η + ζ + ρ , identified as the best-performing model, and model S on several different domains. In general, model F η + ζ + ρ demonstrates superior performance compared to model S . We also reported a case, as shown in the middle column, where model S performs better.
Table 2 presents the performance of the proposed models across various domains of the GWHD dataset. The results indicate that the proposed models generally outperform model S in most domains, exhibiting a lower variance and greater stability.

4. Discussion

Precision agriculture aims to integrate advanced technologies to address various challenges in the agricultural sector, enhancing productivity, efficiency, and profitability while minimizing waste and environmental impacts. Deep learning approaches play a crucial role in enabling automated decision-making capabilities. By automating the processing of visual data from agricultural fields, these decision-making processes can be significantly improved and scaled up.
However, the development of DL-based techniques encounters challenges stemming from the dynamic nature of agricultural fields, characterized by diverse growth stages, inconsistent weather conditions, and variable lighting. As such, models developed based on one snapshot of these fields are often not generalizable across different growth stages or environmental conditions. Furthermore, developing large-scale annotated datasets across various growth stages and environmental conditions is time-consuming and expensive. These challenges hinder the development of generalizable DL-based solutions for various agricultural tasks. Consequently, developing methodologies that allow DL-based solutions to generalize across different conditions (domains) could facilitate the widespread adoption of these technologies in the agricultural sector.
In response to these challenges, we developed a semi-self-supervised domain adaptation technique that employs deep convolutional neural networks and a probabilistic diffusion process. This approach was developed with minimal data annotation, requiring only three manually annotated images. We classify this method as semi-supervised because it leverages a small amount of annotated data (three images) alongside a significant volume of unannotated data. Additionally, this method can be considered self-supervised, as it utilizes computationally annotated data for model development.
The proposed model demonstrated substantial improvements in performance compared to the baseline model presented in recent work by Najafian et al. [7], while also showing a lower variance in model performance across 18 different domains from the GWHD dataset. These 18 domains represent various growth stages of wheat fields and different environmental conditions.
In this study, we devised a two-phase model training process. Initially, model F η + ρ 1 was developed using a large number of computationally annotated samples synthesized from a single manually annotated image (with a separate synthesized set used for validation), showcasing a nearly 20 % improvement over a recent work by Najafian et al. [7] on the target domain. Subsequently, in the second phase, model F η + ζ + ρ leveraged additional computationally generated data, further increasing the performance difference to 28 % .
As presented in Table 1, our final model, F η + ζ + ρ , shows a 28.1 % improvement in the Dice score and a 25.2 % improvement in the IoU when compared to model S from Najafian et al. The increased performance showcases the utility of the proposed semi-self-supervised domain adaptation technique.
Label noise poses a significant challenge in semi-supervised domain adaptation, whether it stems from manual annotation errors or errors introduced by the use of non-curated pseudo-labeled data for model fine-tuning. The presence of label noise in manually annotated data can severely degrade the performance of deep learning models by introducing incorrect information during the training phase, leading to suboptimal decision boundaries and poor generalization to the target domain. Utilizing pseudo-labels without any form of curation may also further exacerbate this issue, as these uncurated labels can propagate errors and amplify the negative impact of label noise. Adopting systematic approaches, such as the Label Recovery and Trajectory Designable Network (LRTDN) proposed by Yang et al. [33], could alleviate this challenge. The LRTDN aims to address the noise in pseudo-labels by incorporating a robust annotation check module that identifies and corrects label anomalies, thereby enhancing the reliability of the training data. This ensures that the model learns from more accurate labels, ultimately improving its performance in the presence of annotation noise.
In agricultural fields, pixel-level annotation poses unique challenges. The presence of many small regions of interest, such as wheat heads, necessitates meticulous annotation. Moreover, the lack of contrast between these regions and the background further complicates the task. For instance, accurately differentiating wheat heads from the background requires significant attention and accuracy during manual annotation. Consequently, pixel-level accurate annotation of a single image may take several hours, making developing large-scale, pixel-level, accurately annotated images costly and tedious. Furthermore, agricultural data are substantially variable due to variations in environmental conditions, phenotypic varieties, and crop growth stages. As such, it is impractical to develop a large-scale annotated dataset that represents all these variations across the crop’s lifecycle. By utilizing only a small number of manually annotated images from the target domain, the proposed method accelerates the development of deep learning models for various fields. Additionally, the resulting model could facilitate the development of further annotated samples from the target domain by making predictions using the initial model for samples from the target domain and then applying corrections if needed. This semi-automated approach substantially accelerates the creation of annotated samples from the target domain that could be used to further improve the performance of the developed model in the target domain.
In this research, we adopted an extreme strategy by utilizing only two annotated images for training and one for validation. However, we recommend incorporating more manually annotated samples from diverse domains to further enhance the robustness of the proposed approach. Additionally, in this research, we primarily relied on default hyperparameters. Tuning the model hyperparameters could lead to further improvements in model performance.

5. Conclusions

This research introduces a semi-self-supervised domain adaptation methodology characterized by a dual-stream encoder–decoder model architecture, specifically designed for the downstream task of semantic segmentation of wheat heads. We synthesized a large-scale, computationally annotated dataset from three manually annotated images. We also used unannotated image frames extracted from wheat field videos captured at two different growth stages. The model architecture and training strategy, utilizing both image segmentation and reconstruction, were strategically designed to mitigate the challenges associated with domain shifts while minimizing the need for extensive data annotation. The external evaluation of our proposed approach on a subset of the GWHD dataset demonstrated a substantial improvement in model performance over a recent work. This underscores the utility of our proposed approach in alleviating domain shift, allowing for the development of generalizable models with minimal manual data annotation, which, in turn, could enable the widespread adoption of DL-based approaches in the agricultural sector.

Author Contributions

The methodology of this study was designed and implemented by A.G., who also authored the initial draft of the manuscript. Both F.M. and G.H.S. edited and refined subsequent versions of the paper. F.M. contributed to the ideation of the research. F.M. and G.H.S. provided supervision throughout the study. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

We utilized publicly accessible data published in a prior study, accessible at https://www.cs.usask.ca/ftp/pub/whs/ (accessed on 29 April 2024). We also utilized an extra wheat field video and its accompanying annotated image frame in this work, which can be accessed at https://www.cs.usask.ca/ftp/pub/sslwhda/ (accessed on 29 April 2024).

Acknowledgments

We acknowledge Keyhan Najafian for providing access to the dataset used in this study and for granting access to the baseline model employed for model evaluation.

Conflicts of Interest

The authors declare that there are no conflicts of interest to disclose.

References

  1. Oliver, M.A.; Bishop, T.F.; Marchant, B.P. Precision Agriculture for Sustainability and Environmental Protection; Routledge: Abingdon, UK, 2013. [Google Scholar]
  2. Tan, M.; Le, Q. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  3. Najafian, K.; Jin, L.; Kutcher, H.R.; Hladun, M.; Horovatin, S.; Oviedo-Ludena, M.A.; De Andrade, S.M.P.; Wang, L.; Stavness, I. Detection of Fusarium Damaged Kernels in Wheat Using Deep Semi-Supervised Learning on a Novel WheatSeedBelt Dataset. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 660–669. [Google Scholar]
  4. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  5. Najafian, K.; Ghanbari, A.; Stavness, I.; Jin, L.; Shirdel, G.H.; Maleki, F. A Semi-Self-Supervised Learning Approach for Wheat Head Detection using Extremely Small Number of Labeled Samples. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 1342–1351. [Google Scholar]
  6. Mo, Y.; Wu, Y.; Yang, X.; Liu, F.; Liao, Y. Review the state-of-the-art technologies of semantic segmentation based on deep learning. Neurocomputing 2022, 493, 626–646. [Google Scholar] [CrossRef]
  7. Najafian, K.; Ghanbari, A.; Sabet Kish, M.; Eramian, M.; Shirdel, G.H.; Stavness, I.; Jin, L.; Maleki, F. Semi-Self-Supervised Learning for Semantic Segmentation in Images with Dense Patterns. Plant Phenomics 2023, 5, 0025. [Google Scholar] [CrossRef] [PubMed]
  8. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  9. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar] [CrossRef]
  10. Hafiz, A.M.; Bhat, G.M. A survey on instance segmentation: State of the art. Int. J. Multimed. Inf. Retr. 2020, 9, 171–189. [Google Scholar] [CrossRef]
  11. Champ, J.; Mora-Fallas, A.; Goëau, H.; Mata-Montero, E.; Bonnet, P.; Joly, A. Instance segmentation for the fine detection of crop and weed plants by precision agricultural robots. Appl. Plant Sci. 2020, 8, e11373. [Google Scholar] [CrossRef] [PubMed]
  12. Sinha, S.; Gehler, P.; Locatello, F.; Schiele, B. TeST: Test-Time Self-Training Under Distribution Shift. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2759–2769. [Google Scholar]
  13. Hu, Q.; Guo, Y.; Xie, X.; Cordy, M.; Papadakis, M.; Ma, L.; Le Traon, Y. CodeS: Towards code model generalization under distribution shift. In Proceedings of the International Conference on Software Engineering (ICSE): New Ideas and Emerging Results (NIER), Melbourne, Australia, 14–20 May 2023. [Google Scholar]
  14. Hwang, D.; Misra, A.; Huo, Z.; Siddhartha, N.; Garg, S.; Qiu, D.; Sim, K.C.; Strohman, T.; Beaufays, F.; He, Y. Large-scale ASR Domain Adaptation using Self- and Semi-supervised Learning. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6627–6631. [Google Scholar]
  15. Pan, F.; Shin, I.; Rameau, F.; Lee, S.; Kweon, I.S. Unsupervised Intra-domain Adaptation for Semantic Segmentation through Self-Supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 3764–3773. [Google Scholar]
  16. Rani, V.; Nabi, S.T.; Kumar, M.; Mittal, A.; Kumar, K. Self-supervised Learning: A Succinct Review. Arch. Comput. Methods Eng. 2023, 30, 2761–2775. [Google Scholar] [CrossRef] [PubMed]
  17. Zbontar, J.; Jing, L.; Misra, I.; LeCun, Y.; Deny, S. Barlow Twins: Self-Supervised Learning via Redundancy Reduction. In Proceedings of the 38th International Conference on Machine Learning, Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; Proceedings of Machine Learning Research (PMLR). Volume 139, pp. 12310–12320. [Google Scholar]
  18. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  19. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  20. Croitoru, F.A.; Hondru, V.; Ionescu, R.T.; Shah, M. Diffusion Models in Vision: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10850–10869. [Google Scholar] [CrossRef] [PubMed]
  21. Chen, T. On the Importance of Noise Scheduling for Diffusion Models. arXiv 2023, arXiv:2301.10972. [Google Scholar]
  22. Nichol, A.Q.; Dhariwal, P. Improved Denoising Diffusion Probabilistic Models. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual Event, 18–24 July 2021; pp. 8162–8171. [Google Scholar]
  23. David, E.; Serouart, M.; Smith, D.; Madec, S.; Velumani, K.; Liu, S.; Wang, X.; Pinto, F.; Shafiee, S.; Tahir, I.S.; et al. Global Wheat Head Detection 2021: An Improved Dataset for Benchmarking Wheat Head Detection Methods. Plant Phenomics 2021, 22, 9846158. [Google Scholar] [CrossRef] [PubMed]
  24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  25. Ramachandran, P.; Zoph, B.; Le, Q.V. Searching for Activation Functions. arXiv 2017, arXiv:1710.05941. [Google Scholar]
  26. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and flexible image augmentations. Information 2020, 11, 125. [Google Scholar] [CrossRef]
  27. Wazir, S.; Fraz, M.M. HistoSeg: Quick attention with multi-loss function for multi-structure segmentation in digital histology images. In Proceedings of the 2022 12th International Conference on Pattern Recognition Systems (ICPRS), Saint-Etienne, France, 7–10 June 2022; pp. 1–7. [Google Scholar]
  28. Beheshti, S.; Hashemi, M.; Sejdic, E.; Chau, T. Mean Square Error Estimation in Thresholding. IEEE Signal Process. Lett. 2010, 18, 103–106. [Google Scholar] [CrossRef]
  29. Hore, A.; Ziou, D. Image Quality Metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369. [Google Scholar]
  30. Johnson, J.; Alahi, A.; Fei-Fei, L. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14. Springer: Cham, Switzerland, 2016; pp. 694–711. [Google Scholar]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–12 December 2019; pp. 8024–8035. [Google Scholar]
  32. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
  33. Yang, B.; Lei, Y.; Li, X.; Li, N.; Nandi, A.K. Label Recovery and Trajectory Designable Network for Transfer Fault Diagnosis of Machines with Incorrect Annotation. IEEE/CAA J. Autom. Sin. 2024, 11, 932–945. [Google Scholar] [CrossRef]
Figure 1. Three manually annotated image–mask pairs were utilized for data synthesis. We developed two training sets by synthesizing computationally annotated images using manually annotated images from the left ( I η ) and the middle ( I ζ ), producing 8000 images based on I η and 8000 images based on I ζ . Hereafter, we refer to the 8000 images developed based on I η as dataset D η . We refer to the set comprising the whole 16,000 images as D η + ζ . Additionally, we created a validation set by synthesizing 4000 images, with 2000 from the image on the right ( I τ ) and 2000 images based on I ζ . Hereafter, we refer to this set of 4000 images as D ζ + τ . Dataset D ζ + τ was made to allow for a balanced representation of wheat field images from the early and late growth stages. All computationally annotated samples were synthesized following the methodology described by Najafian et al. [7].
Figure 2. Examples of computationally synthesized images and their corresponding segmentation masks.
Figure 3. Schematic representation of the model architecture. The encoder focuses on developing a joint image representation for both synthesized and real images, while the mask decoder aims at generating segmentation masks and the image decoder aims at reconstructing the real images, forcing the encoder to adapt to the real images.
Figure 4. A ResNet block comprises three groups of operations, including convolution, GroupNorm layers, and the Swish activation function for nonlinearity. It also incorporates skip connections to enhance feature propagation.
Figure 5. Encoder model architecture designed by combining convolutional layers, ResNet blocks, and GroupNorm layers. Also, in each of the two decoding streams, we utilize concatenation instead of addition.
Figure 6. Showcasing the prediction performance of model F η + ζ + ρ (highlighted in a red box in the upper row) in comparison with the results obtained by model S [7] (highlighted in a blue box in the lower row) on samples from the Global Wheat Head Detection dataset [23].
Table 1. The performance of the trained models was evaluated on our internal and external test sets using the IoU and Dice scores. Model F η + ρ 1 was trained on D η and D ρ 1 and model F η + ζ + ρ is the result of fine-tuning model F η + ρ 1 on datasets D η + ζ and D ρ . We also compared the performance of these two models with the model developed in [7]. All of these models rely exclusively on synthesized image–mask pairs and/or unannotated images.
Model | Evaluation Method | Dice | IoU
S [7] | Internal test set Ψ | 0.709 | 0.565
F η + ρ 1 | Internal test set Ψ | 0.773 | 0.638
F η + ζ + ρ | Internal test set Ψ | 0.807 | 0.686
S [7] | External test set Γ | 0.367 | 0.274
F η + ρ 1 | External test set Γ | 0.551 | 0.427
F η + ζ + ρ | External test set Γ | 0.648 | 0.526
Table 2. The performance of the models on each of the 18 domains of the GWHD dataset. Model S, which was trained using the synthesized dataset, was reported in [7]. Model F η + ρ 1 was developed using dataset D η —consisting of 8000 computationally annotated synthesized images—and dataset D ρ 1 , which includes 5296 unannotated real images. Model F η + ζ + ρ was developed using dataset D η + ζ —consisting of 16,000 computationally annotated synthesized images—and dataset D ρ , which includes 10,592 unannotated real images. We trained model F η + ρ 1 from scratch, while model F η + ζ + ρ resulted from fine-tuning model F η + ρ 1 using dataset D η + ζ and dataset D ρ .
Domain | Model | Dice Score | Domain | Model | Dice Score
Algorithms 17 00267 i001 | S | 0.731 | Algorithms 17 00267 i002 | S | 0.711
  | F η + ρ 1 | 0.660 |   | F η + ρ 1 | 0.644
  | F η + ζ + ρ | 0.692 |   | F η + ζ + ρ | 0.759
Algorithms 17 00267 i003 | S | 0.848 | Algorithms 17 00267 i004 | S | 0.290
  | F η + ρ 1 | 0.815 |   | F η + ρ 1 | 0.193
  | F η + ζ + ρ | 0.857 |   | F η + ζ + ρ | 0.271
Algorithms 17 00267 i005 | S | 0.309 | Algorithms 17 00267 i006 | S | 0.601
  | F η + ρ 1 | 0.764 |   | F η + ρ 1 | 0.599
  | F η + ζ + ρ | 0.812 |   | F η + ζ + ρ | 0.769
Algorithms 17 00267 i007 | S | 0.240 | Algorithms 17 00267 i008 | S | 0.583
  | F η + ρ 1 | 0.503 |   | F η + ρ 1 | 0.795
  | F η + ζ + ρ | 0.431 |   | F η + ζ + ρ | 0.748
Algorithms 17 00267 i009 | S | 0.794 | Algorithms 17 00267 i010 | S | 0.156
  | F η + ρ 1 | 0.715 |   | F η + ρ 1 | 0.286
  | F η + ζ + ρ | 0.698 |   | F η + ζ + ρ | 0.356
Algorithms 17 00267 i011 | S | 0.389 | Algorithms 17 00267 i012 | S | 0.582
  | F η + ρ 1 | 0.479 |   | F η + ρ 1 | 0.492
  | F η + ζ + ρ | 0.605 |   | F η + ζ + ρ | 0.667
Algorithms 17 00267 i013 | S | 0.509 | Algorithms 17 00267 i014 | S | 0.884
  | F η + ρ 1 | 0.655 |   | F η + ρ 1 | 0.648
  | F η + ζ + ρ | 0.625 |   | F η + ζ + ρ | 0.805
Algorithms 17 00267 i015 | S | 0.859 | Algorithms 17 00267 i016 | S | 0.501
  | F η + ρ 1 | 0.674 |   | F η + ρ 1 | 0.488
  | F η + ζ + ρ | 0.671 |   | F η + ζ + ρ | 0.686
Algorithms 17 00267 i017 | S | 0.539 | Algorithms 17 00267 i018 | S | 0.629
  | F η + ρ 1 | 0.586 |   | F η + ρ 1 | 0.496
  | F η + ζ + ρ | 0.579 |   | F η + ζ + ρ | 0.520
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
