1. Introduction
Manual crop growth monitoring and modeling have been two key ingredients of agricultural practice since the origin of agricultural science, allowing farmers to guide their yields and manage their limited resources as rationally as possible [1]. From ancient times, farmers monitored crop health and growth across stages through direct observation and manual field measurements [2,3]. This, however, is laborious, time-consuming, and prone to human error. Demand for more efficient monitoring methods grew as agriculture scaled up and diversified. From there, the advancement of statistical modeling, and later of computational tools, offered a further opportunity to predict crop yields from manually collected crop data [1,4,5,6,7]. Nevertheless, the primary hurdle remained unchanged: the requirement for a substantial workforce to accurately label and monitor individual plants and rows of crops [8]. The wide range of field conditions, along with the subjective nature of human observation, has produced data that are typically unreliable and inconsistent. As a result, automated and precise alternative approaches are needed.
Smart agriculture technologies currently lack uniformity and compatibility. Developing an integrated framework that enables diverse systems and devices to function together is essential for wider adoption and improved effectiveness. An important challenge is the assimilation of data from diverse sensors and the guarantee of compatibility between different systems and platforms [6]. Standardization of data analysis and interoperability across multiple platforms is necessary for smooth operation. To make models more accurate, they need to be consistently integrated and calibrated; differences in the default parametrization used by different models can otherwise lead to errors [7,8,9]. Efficient collaboration between different autonomous systems is imperative for the successful implementation of autonomous farming operations. By incorporating chemical soil health indicators and crop quality into a hierarchical multinomial logistic regression, a robust model was created that could accurately forecast crop growth from vegetation indices [10]. Additional comparative studies are required to determine the reliability and precision of unmanned aerial vehicle (UAV) sensors relative to conventional ground-based techniques for crop growth analysis over time.
When each crop row is treated as a separate class, temporal crop characterization for plant growth becomes easier, more precise, and more cost-effective. This approach is therefore preferable to classifying individual crop specimens, which suffers from spectral overlap caused by intra-crop occlusions [11]. It has also been demonstrated that an instance segmentation pipeline trained on nadir RGB images for predicting and analyzing plant shapes can be ensembled with other machine learning (ML) models to perform similar and additional analyses over ortho-mosaics, suiting different agricultural environments [12]. Another work on combinational machine learning coupled classification and regression trees to improve the accuracy of ground coverage (GC) estimation from UAV imagery for wheat phenotyping [13]. The capacity of deep learning techniques to greatly enhance the precision and effectiveness of plant segmentation in high-throughput phenotyping, opening possibilities for automated crop annotation based on transfer learning, was demonstrated by [14].
The issue of manual annotation was underscored in research centered on a proximally sensed multispectral imaging environment containing a heterogeneous mixture of wheat and horseradish; the dataset was quantitatively inadequate due to human limitations [15]. The Grounded SAM [16] is an architecture that combines Grounding DINO [17], an open-set object detector, with the Segment Anything Model (SAM) [18] to perform precise detection and segmentation of objects based on arbitrary text inputs. It is a breakthrough in open-world visual perception and provides a robust, flexible framework for a range of visual tasks through the integration of specialized models. It utilizes the Recognize Anything Model (RAM) [19] and image-caption models such as BLIP [20] to automatically provide comprehensive image annotations, greatly minimizing the need for manual annotation. The Grounded SAM can be extended by integrating further models, such as faster inference models and high-quality mask generators, to enhance its capabilities. Ensembling the SAM with Explainable Contrastive Language–Image Pretraining (ECLIP) was demonstrated for plant recognition and phenological characterization, using B-spline curves to quantify plant dimensions [21]. A similar study, addressing sample scarcity in crop mapping from medium-resolution satellite imagery, devised an automated SAM-based sample generation framework to improve crop mapping accuracy and eliminate manual annotation [22]. The generated samples significantly improved crop mapping accuracy in both research areas, particularly in locations with clearly defined parcel borders. This automated workflow mitigated the problem of limited sample availability, yielding crop mapping solutions that are more dependable and scalable. A practical use case of the Grounded SAM was also demonstrated in automated weed annotation over proximally sensed multispectral images [23].
Transfer learning is a machine learning technique in which a pre-trained model developed for one detection or segmentation task serves as the starting point for a deep learning model on a subsequent task. It has been shown to improve performance and reduce training time, leading to efficient utilization of resources. One study integrated Mask R-CNN, an instance segmentation technique, with transfer learning using VGG16 for classification, resulting in improved system robustness and accuracy [24]. Another group applied advanced deep learning techniques to improve automated monitoring and management of lettuce seedlings in controlled agricultural settings, using an enhanced Mask R-CNN model pre-trained on the COCO dataset [25]. The model's learning was then transferred to CB-Net to improve the accuracy and efficiency of seedling segmentation and growth trait estimation, ultimately facilitating automatic sorting of lettuce seedlings [26]. Transfer learning has been shown to facilitate environmental and agricultural monitoring in areas such as land cover mapping, vegetation monitoring, crop yield mapping, and water resource management [27].
Transfer learning, coupled with partially supervised techniques, makes use of bounding box annotations to efficiently enable one-shot image segmentation. These methods greatly decrease the dependency on detailed pixel-level annotations while still achieving high segmentation accuracy. Box2Mask introduces a novel method for instance segmentation that integrates the traditional level-set evolution model with deep neural network training, allowing precise mask prediction using only bounding box supervision [28]. Another work introduced image-aligned style modification to reinforce dual-mode iterative learning for one-shot robust segmentation of brain tissues, achieving notable performance improvements over existing methods [29]. An application in wildlife monitoring and precision livestock farming was demonstrated through a one-shot learning-based approach to segmenting animal videos using only one labeled frame [30]. Integrating a learning approach with noisy annotations in a single framework was observed to improve the segmentation model; pseudo masks created from unlabeled volumes supplied extra unannotated data throughout training. One-shot localization and weakly supervised segmentation were found to reduce the need for substantial annotation, achieving high segmentation accuracy in complicated medical imaging tasks using only one annotated image and a few unannotated volumes [31].
In a nutshell, the SAM has been found robust when coupled with one-shot learning approaches such as PerSAM [32] and PerSAM-F, which enhance its segmentation capabilities in the remote sensing field. These techniques enable the SAM to personalize and refine its segmentation abilities with minimal input. The SAM's performance was assessed and verified using multi-scale imaging, showcasing its ability to process images of varying resolutions and scales, a critical requirement for remote sensing applications [33]. The Grounded SAM can improve crop growth analysis by offering accurate, automated segmentation and comprehensive monitoring capabilities. These features enable informed decision making in crop management, health assessment, and yield forecasting, thereby promoting sustainable agricultural practices. Our study concentrates on temporal crop growth analysis of cauliflower using segmentation masks derived from automatically annotated multi-date aerial images and ortho-mosaics, observing the growth stages of individual crop rows and validating them through statistical examination. The principal idea is to automatically segment these two sets of multi-date images and ortho-mosaics using the Grounded SAM over the inference of a trained YOLOv8x model on a particular date's imagery, which was observed to exhibit the best training parameters. The predicted multi-date mask instances are used to calculate the mean crop pixel count for every crop row as an individual class across both datasets. The pixel counts of the individual crop rows across dates in both datasets are then used to observe temporal growth patterns and the correlation between the type of observation, aerial or ortho-mosaic, and the date of observation.
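The per-row statistic described above can be sketched as follows; this is a minimal illustration assuming each crop row's predicted mask instances are available as boolean NumPy arrays (the function and variable names are ours, not the study's code):

```python
import numpy as np

def mean_row_pixel_count(row_masks):
    """Mean number of segmented crop pixels over the mask instances
    predicted for one crop row on one observation date.

    row_masks: list of boolean arrays, one per predicted mask instance.
    """
    counts = [int(mask.sum()) for mask in row_masks]
    return sum(counts) / len(counts)

# Toy 4x4 masks with 3 and 5 foreground pixels, respectively.
m1 = np.zeros((4, 4), dtype=bool)
m1[0, :3] = True
m2 = np.zeros((4, 4), dtype=bool)
m2[1, :] = True
m2[2, 0] = True

print(mean_row_pixel_count([m1, m2]))  # 4.0
```

Tracking this mean per row and per date yields the time series analyzed in Section 3.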
This article is organized as follows.
Section 2 describes the two kinds of datasets, multi-date aerial imagery and ortho-mosaics, detailing the image acquisition and the annotation of images. It also explains how the dataset was used to train YOLOv8x-seg and how the optimal inference was transferred to the Grounded SAM for automatic annotation and segmentation of crop images. Subsections within this section explain the process of transfer learning, the conversion of COCO-format annotations to PASCAL VOC format, and the creation of instance segmentation masks from bounding box detections.
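At its core, the COCO-to-PASCAL VOC annotation conversion mentioned above is a coordinate-format change; a minimal sketch (the full conversion also rewrites the surrounding annotation files, which is omitted here):

```python
def coco_bbox_to_voc(bbox):
    """Convert a COCO bounding box [x_min, y_min, width, height]
    into PASCAL VOC corner format [x_min, y_min, x_max, y_max]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]

# A 30x40 box anchored at (10, 20).
print(coco_bbox_to_voc([10, 20, 30, 40]))  # [10, 20, 40, 60]
```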
Section 3 presents the results and discussion, examining relative crop growth rates from multi-date aerial imagery and ortho-mosaics to detect growth patterns and variations over time. This section also comprises a time-series analysis, a Pearson correlation analysis, and a two-way ANOVA test that assess the consistency and statistical significance of the observed growth patterns, validating the efficacy of the automated annotation techniques for improving decision making in precision agriculture.
Section 4 concludes the study, summarizing the key findings, addressing the limitations, and suggesting directions for future research.
This work presents an automated segmentation workflow for multi-modal crop growth monitoring of cauliflower using deep learning models, namely YOLOv8 and the Grounded SAM. A single-shot transfer learning technique was employed to enhance Grounding DINO's ability to generate bounding boxes around cauliflowers identified in multi-date aerial images and ortho-mosaics. The Grounded SAM was supported by a pre-trained YOLOv8x-seg model that extracted features from input images, providing comprehensive image detail for further training. The experiment also demonstrates the potential for automated temporal crop growth monitoring based on the segmented pixels of the cauliflower crop rows over a period, showing a high degree of correlation between aerial imagery and ortho-mosaics, validated through statistical techniques such as Pearson correlation and ANOVA. The use of transfer learning improved the pipeline's accuracy, yielding more true positives and facilitating the generation of segmentation masks for temporal growth analysis.
3. Results and Discussion
3.1. Relative Crop Growth Rate Analysis
Initial observations indicate that throughout the period from 8 October to 21 October, the aerial imagery generally exhibits higher growth percentages in most rows than the ortho-mosaics. Both datasets exhibited variations, with peaks observed between 29 October and 11 November for most crop rows. Both datasets, ortho-mosaics and aerial imagery, exhibit a consistent decline towards the end of the studied period, 18–25 November, with their values converging to similar levels. Despite some individual variations, the datasets generally exhibit consistent relative patterns across crop rows; specifically, Row 1 and Row 2 consistently display higher initial values followed by large drops over time.
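The row-wise growth percentages discussed in this section can be computed as the percent change in mean segmented pixel count between consecutive observation dates; a minimal sketch of this standard percent-change form (the exact normalisation behind Tables 3 and 4 may differ, and the numbers below are illustrative):

```python
def relative_growth_percent(pixel_counts):
    """Percent change in mean segmented pixel count between
    consecutive dates for one crop row.

    pixel_counts: chronologically ordered mean pixel counts.
    """
    return [
        100.0 * (curr - prev) / prev
        for prev, curr in zip(pixel_counts, pixel_counts[1:])
    ]

# e.g. a row growing from 500 to 700 to 770 segmented pixels
print(relative_growth_percent([500, 700, 770]))  # [40.0, 10.0]
```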
3.1.1. Multi-Date Aerial Imagery
The relative growth analysis of cauliflower crop rows across multi-date aerial imagery was calculated as percentages to reveal distinct trends for each row (Table 3). Row 1 begins with a high growth percentage of approximately 140%, experiences a sharp decline, then fluctuates, peaking around 29 October–11 November, and finally drops to around 40%. Row 2 starts at 100% and exhibits minor fluctuations but generally trends downward, ending just below the initial value. Row 3 starts at 80%, shows a significant initial drop, then fluctuates before stabilizing slightly above the starting value. Row 4 starts and ends around 80%, maintaining a stable pattern with minor fluctuations. Row 5 and Row 6 both start at 100%, exhibit minor fluctuations, and end slightly below their starting values. Lastly, Row 7 starts at 80%, experiences fluctuations, and ends slightly below the initial value. Overall, while individual rows display unique patterns, most exhibit a general downward trend with notable fluctuations (Figure 5).
3.1.2. Multi-Date Ortho-Mosaics
The relative growth analysis of cauliflower crop rows across multi-date ortho-mosaic imagery highlights significant variations across the observed periods (Table 4). Row 1 starts at a high growth percentage of around 120%, consistently declines, and ends slightly below 40%. Row 2 begins at approximately 60%, with fluctuations leading to a steady decline, ending below 40%. Row 3 starts at 40%, increases slightly, then steadily decreases, ending just above 20%. Row 4 starts at 80%, experiences a sharp drop and fluctuations, ending slightly above the initial value. Row 5 begins around 60%, shows a peak around 29 October–11 November, and ends slightly below the starting value. Row 6 starts at 100%, drops sharply, and fluctuates, ending slightly below the starting value. Row 7 begins at 120%, drops significantly, and stabilizes around 40%. Overall, each row demonstrates unique patterns, but the general trend across all rows indicates a decline in growth percentages over time, with periods of significant fluctuation (Figure 6).
3.2. Time Series Analysis
A multi-modal time-series analysis was conducted on the multi-date aerial imagery and ortho-mosaics using an averaging function. This function computes the mean value for each crop row across the observation dates, thereby providing the average growth over time of the crop rows in both formats: aerial imagery and ortho-mosaics.
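The averaging step can be sketched as follows; a minimal illustration assuming the growth values are held in a mapping from row label to a date-ordered list (the data layout and values shown are illustrative, not the study's own format):

```python
def row_means(growth):
    """Mean growth value per crop row across all observation dates.

    growth: mapping of row label -> list of values ordered by date.
    """
    return {row: sum(vals) / len(vals) for row, vals in growth.items()}

obs = {"Row 1": [1.4, 1.0, 1.2], "Row 2": [1.0, 0.9, 0.8]}
print(row_means(obs))
```

Applying the same function to the aerial and ortho-mosaic series separately gives the two sets of averaged curves compared below.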
3.2.1. Multi-Date Aerial Imagery
The time-series analysis for multi-date aerial images reveals a steady, rising trend in the growth of cauliflower crop rows within the studied timeframe (Figure 7). The growth rates of the rows differ, with Row 1 and Row 2 experiencing much greater increases than the other rows by the end of the period. The growth rates indicate periods of fast expansion, notably from 8 October to 21 October, and an overall pattern of consistent growth with occasional changes. Nevertheless, there is a noticeable decrease in the rate of increase towards the end of the time frame for certain rows, indicating a potential saturation point. In general, the data (Table 5) show a distinct trend of steady growth with significant periods of rapid expansion, which could be linked to particular circumstances or interventions.
3.2.2. Multi-Date Ortho-Mosaics Imagery
The time-series analysis of growth observed across multiple rows and dates in the ortho-mosaics reveals a consistent upward trend, indicating steady growth over the observed period (Figure 8). Rows exhibit varying rates of growth, with Row 7 showing the highest increase by the end of the period. The growth rates indicate significant growth between 8 October and 21 October and between 11 November and 18 November, suggesting intervals of rapid growth. However, some rows exhibit a deceleration or slight decline towards the end of the period, indicating a possible plateau or saturation point. Overall, the data demonstrate a clear pattern of consistent growth with notable periods of rapid increase, which may be influenced by specific conditions or interventions (Table 6).
3.3. Pearson Correlation Analysis
The Pearson correlation coefficients for each crop row between the aerial imagery and ortho-mosaic data were calculated (Table 7). The correlation values for the growth observations from aerial imagery and ortho-mosaics demonstrate consistently high positive correlations across all rows, indicating strong agreement between the two measurement methods. Row 1 (Figure 9a) exhibits an extremely high correlation of 0.9735, suggesting nearly identical measurements, similar to Row 3 with a correlation of 0.9737 (Figure 9c) and Row 4 with a correlation of 0.976 (Figure 9d). Row 2 also shows a strong positive correlation of 0.9457 (Figure 9b), slightly lower than Row 1. Row 5 has the lowest coefficient of 0.943 (Figure 9e); it still indicates a strong positive correlation, suggesting only minor discrepancies. Row 6 has a near-perfect correlation of 0.9938 (Figure 9f), implying almost identical measurements, while Row 7 shows the highest correlation of 0.9982 (Figure 9g), indicating virtually identical growth measurements. The correlations for all rows are very high, mostly above 0.95, which quantifies the strong linear relationship observed in the scatterplots. While the overall trend is very strong, there are slight variations in the degree of correlation across rows. For instance, Row 2 and Row 5 have slightly lower correlation coefficients than the others, indicating minor discrepancies between the two methods in these rows, possibly due to localized factors affecting the crop growth measurements. Across all rows, the consistent pattern is that crop growth observed through ortho-mosaics tends to be higher than that observed through aerial imagery, suggesting that ortho-mosaics provide a more sensitive or higher-resolution method for detecting crop growth variations over time.
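The row-wise coefficients follow the standard Pearson formula; a self-contained sketch (the two series shown are illustrative, not the measured values from Table 7):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

# Toy aerial vs ortho-mosaic growth series for one crop row.
aerial = [1.0, 1.4, 1.9, 2.6, 3.1, 3.3]
ortho = [1.1, 1.5, 2.1, 2.8, 3.4, 3.6]
print(round(pearson_r(aerial, ortho), 4))
```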
3.4. Two-Way ANOVA Test
The results of the two-way ANOVA indicate that both the date of observation and the type of observation, i.e., aerial imagery vs. ortho-mosaics, significantly affect the observed growth (Table 8). Specifically, the effect of the date on growth is highly significant (F(5, 72) = 80.88, p < 0.001), as is the effect of the observation type (F(1, 72) = 123.13, p < 0.001). Furthermore, there is a significant interaction between the date and the type of observation (F(5, 72) = 7.05, p < 0.001), indicating that the influence of the date on growth depends on whether the data were collected through aerial imagery or ortho-mosaics. These findings suggest that both the timing and the method of observation are crucial in assessing growth patterns.
Effect of Date: The sum of squares (SS) for the effect of date is 46.6858, representing the total variability in growth due to different dates. With degrees of freedom (dF) equal to 5 (the number of dates minus one), the F-value is 80.8775. This high F-value indicates that the variability in growth attributed to the date is much larger than the random error. The p-value for this factor is extremely small, at 3.86 × 10⁻²⁸, indicating that the effect of the date on growth is statistically significant: changes in date have a significant impact on growth measurements.
Effect of Observation Type: For the effect of type, the sum of squares is 14.2154, representing the total variability in growth due to the type of observation (aerial vs. ortho-mosaics). The degrees of freedom for this factor equal one (the number of types minus one). The F-value is very high at 123.1324, indicating that the variability in growth due to the type is significantly larger than the random error. The p-value is 3.02 × 10⁻¹⁷, which is extremely small. This result suggests that the type of observation has a statistically significant effect on growth.
Interaction between date and type of observation: The interaction between date and type has a sum of squares of 4.0712, representing the variability in growth due to the interaction between these two factors. The degrees of freedom for this interaction are 5. The F-value for the interaction effect is 7.053, indicating a significant interaction between date and type. The p-value for this interaction is 2.04 × 10⁻⁵, a small value indicating that the interaction effect on growth is statistically significant. This means that the effects of the date and the type of observation on growth are inter-related.
Residual: The residual sum of squares of 8.3122 represents the unexplained variability, or random error, in the data. The degrees of freedom for the residuals are 72, calculated as the total number of observations minus the total degrees of freedom for the factors and their interaction. F-values and p-values are not applicable to the residuals, as they constitute the error term.
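As a consistency check, the reported F-values follow directly from the tabulated sums of squares and degrees of freedom (mean square = SS / dF; F = MS_factor / MS_residual):

```python
# Reproduce the F-values of Table 8 from the reported SS and dF.
ss = {"date": 46.6858, "type": 14.2154, "interaction": 4.0712, "residual": 8.3122}
df = {"date": 5, "type": 1, "interaction": 5, "residual": 72}

ms_resid = ss["residual"] / df["residual"]  # residual mean square
for factor in ("date", "type", "interaction"):
    f_value = (ss[factor] / df[factor]) / ms_resid
    print(f"{factor}: F = {f_value:.2f}")
# prints: date: F = 80.88, type: F = 123.13, interaction: F = 7.05
```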
4. Conclusions
The study highlighted the effectiveness of using multiple sets of aerial images and ortho-mosaics taken at different dates to monitor the growth of the cauliflower crop. The integration of YOLOv8x for approximate bounding box detection around the cauliflowers was followed by precise segmentation with the Grounded SAM to obtain segmentation masks for each crop row across different dates. This combination leverages YOLOv8's detection strength and the Grounded SAM's segmentation accuracy, improving the overall performance of leaf- and canopy-bound precise segmentation over individual crop rows.
Both datasets yielded consistent and comparable information regarding the relative growth rates and patterns of the crop rows across time. The aerial imagery showed clear patterns with fluctuations and overall decreases in relative growth percentages, while the ortho-mosaics displayed notable variations with a general decrease over time. The time-series analysis revealed intervals of rapid expansion and consistent growth, interspersed with intermittent fluctuations. Both approaches indicated the possibility of reaching saturation points near the conclusion of the period. The Pearson correlation analysis revealed robust positive correlations between the two datasets, indicating a high level of agreement in growth measurements. However, it was observed that the ortho-mosaics exhibited more sensitivity in detecting observed growth differences. The two-way ANOVA test verified that both the date of observation and the manner of observation had a substantial impact on the observed growth patterns.
However, some limitations were observed, such as slight inconsistencies between the datasets in particular rows. These may be attributable to localized factors, like the removal of overlapping weeds and of crops damaged by attacks from predatory birds and foxes, impacting the assessments of crop growth. To enhance future analyses, it would be advantageous to address these inconsistencies by including additional locally specific data points, improving the accuracy of the measurement techniques, and performing the experiments in a controlled environment.
Automated annotation greatly speeds up the examination of crop growth by quickly processing vast amounts of visual data to create segmentation masks. This automation minimizes the requirement for human involvement, enabling faster detection and measurement of crop phenology and growth patterns. Through the utilization of deep learning methods, automatic annotations can provide a uniform and unbiased evaluation, improving the overall effectiveness and precision of crop development monitoring. This not only expedites the analysis process but also facilitates more frequent and meticulous observations, resulting in enhanced agricultural practices and well-informed, data-driven decision making.
This study is useful for interpreting information from segmentation masks derived from automatically annotated crop phenology. It provides a reliable method for monitoring and analyzing crop growth patterns over time. The noteworthy findings from both datasets emphasize the potential of incorporating these methods into automated agricultural monitoring systems.