1. Introduction
The United Nations 2030 Agenda establishes a global imperative to reconfigure our society, guided by 17 Sustainable Development Goals (SDGs). These goals, from poverty eradication to sustainable agricultural practices, outline a path toward a more equitable and resilient future. In particular, SDG 2 emphasizes the need to eradicate hunger, ensure food availability, improve nutrition, and promote sustainable agriculture [
1]. This goal highlights the importance of implementing productive and sustainable agricultural practices that are resilient to climate variations. In this framework, Opuntia spp. (nopal) emerges as a paradigmatic crop. Native to Mexico and prevalent in arid and semi-arid regions, it is an essential source of nutrition for many communities and symbolizes resilience and adaptability in the face of climate challenges [2], making its sustainable cultivation crucial for feeding countless people and preserving vulnerable ecosystems.
The cultivation of
Opuntia spp. stands out for its resistance to adverse conditions and its versatility in food and industry; its cultivation is a traditional practice in several regions, especially in Mexico, thanks to its multiple uses and benefits [
3].
Opuntia spp. are essential as food, fodder, and industrial products; they play a crucial role in soil and water conservation, carbon fixation, and biodiversity [
4]. With the growing interest in products derived from
Opuntia spp., farmers and researchers seek to optimize their production and management.
In 2022, in Mexico, the annual per capita consumption of
Opuntia spp. was 6.2 kg. A total of 12,491 hectares were cultivated, with a production of 872 thousand tons and an average yield of 70.5 tons per hectare, generating a value of MXN 2981 million (USD 175.3 million) at an average price of MXN 3417 per ton. Additionally, 7.4% of the production was exported, reaching a value of USD 26.2 million [
5].
In agriculture, precision and efficiency in measuring vegetation area are crucial for optimal crop management. Although Opuntia spp. cultivation is known for its resistance and versatility, its effective management requires precise monitoring. Traditional manual techniques, while effective, are laborious and prone to errors. In contrast, emerging technologies such as remote sensing and artificial intelligence offer promising solutions to overcome these limitations. Integrating high-resolution orthophotos with advanced semantic segmentation techniques promises to revolutionize how we measure and manage vegetation area in Opuntia spp. cultivation. This research seeks to explore and validate the effectiveness of these advanced techniques, specifically in the agricultural region of Tulancingo, Hidalgo, Mexico.
The integration of advanced technologies allows agricultural practices to be adapted to the specific conditions of each plot, thus optimizing resource use and improving productivity. Remote sensing, an essential tool for this approach, facilitates the monitoring of the health and condition of vegetation through remote sensors.
Precision agriculture (PA) has been established as a promising solution for increasing crop efficiency and productivity. Based on observation, measurement, and action on intra-field variability, precision agriculture seeks to optimize crop yields and minimize production costs [
6].
Using platforms such as satellites, airplanes, and unmanned aerial vehicles (UAVs), remote sensing has proven a valuable tool in quantifying vegetation for precision agriculture [
7,
8]. These sensors collect data in various spectral bands, allowing detailed analysis of plant health, water content, and plant density, among other critical factors [
9].
In viticulture, for example, the development of specific vegetation indices has changed how vineyard health is monitored and how vineyard management is optimized [
8]; with the data, winegrowers can make informed decisions about critical practices such as irrigation, fertilization, and pest control.
A study by Sebastian Candiago et al. [
10] highlights the value of UAVs for rapidly collecting field data in precision agriculture applications. Equipped with multispectral cameras, these UAVs produce high-resolution images that, once processed, result in precise orthoimages of the studied areas. Such orthoimages facilitate the extraction of vegetation indices (VI) that evaluate the vigor of vegetation in various crops, thus demonstrating the considerable potential of high-resolution data from UAVs and photogrammetric techniques in agriculture.
Furthermore, integrating machine learning techniques with remote sensing has significantly strengthened the ability to predict and manage agronomic aspects with greater precision [
11]. These modern methods, in conjunction with data obtained from remote sensors, provide a more detailed and quantitative perspective of agricultural fields, facilitating more effective and sustainable management of agriculture.
Advanced hyperspectral technologies have enabled innovative studies and practical applications in agriculture, ranging from regional scales through satellite platforms to the field level, using airplanes, drones, and ground platforms [
12]. Integrating AI into these systems has further revolutionized the field of remote sensing and smart agriculture, opening new avenues to optimize agricultural production.
Moreover, the combination of AI and remote sensing has emerged as an effective technological solution to support the improvement of agricultural yields, offering advanced methods for data analysis to increase productivity, anticipate possible threats, and alleviate the workload of farmers [
13]. In this environment, image analysis has become an invaluable tool, allowing farmers to access detailed information about their crops and supporting them to make data-driven decisions.
The application of advanced technologies in agriculture, as demonstrated in recent studies on broccoli growth monitoring, underlines the importance of adaptability and precision in crop management [
14]. The present
Opuntia spp. study seeks to expand the application of these technologies, demonstrating how the integration of UAV imagery and deep learning can optimize precision agriculture for specific crops. The ability to accurately identify the status and growth of individual plants can revolutionize cactus pear management, enabling a more informed response to environmental variations and improving sustainability and efficiency.
Traditional vegetation segmentation methods, while widely used, present several significant limitations that must be addressed to reach more effective solutions in agricultural research. For example, threshold-based methods are sensitive to lighting conditions and to variations in plant color and texture, resulting in inaccurate segmentations under different environmental conditions [
15]. Furthermore, manual segmentation and semi-automatic methods require much time and human effort, making them unfeasible for large crop areas. On the other hand, methods that use specific image characteristics, such as texture and color, may not generalize well to different types of crops and different growth phases of the same plant, limiting their applicability to various agricultural situations [
16]. Compared to more advanced deep learning techniques, these methods may not be robust enough to handle the complexities and variability of crops [
17]. Although convolutional neural networks (CNNs) have proven to be very effective for image segmentation, they require a significant amount of computational resources for training and inference, including the need for high-end GPUs, which may not be accessible to all farmers [
18]. Additionally, CNNs and other deep learning techniques require large volumes of labeled data to train adequately. Collecting and labeling this data can be expensive and time-consuming [
19]. Another significant area for improvement is that methods developed in a specific context may not directly apply to other regions with different soil conditions, climates, and crop types. Generalizability is a significant challenge [
20]. Furthermore, many studies are conducted in controlled environments or with limited datasets, which may not fully reflect the diversity and complexity of real-world agricultural scenarios [
21].
Semantic segmentation, an image processing technique that partitions an image into segments corresponding to objects or specific semantic categories, is particularly relevant in agriculture, where it can be used to identify and quantify different elements in a field, such as individual plants, areas affected by diseases or pests, and areas that require irrigation [
22,
23].
In the realm of semantic segmentation, models such as DeepLabV3+ [
24], UNet [
25], and UNet Style Xception [
26] have proven to be highly effective, not only in medical but also in agricultural applications, thanks to their ability to capture features at different scales. These models combine techniques such as dilated convolutions and decoder stages to improve segmentation accuracy, and they are widely adopted for their effectiveness.
The advantages of AI and semantic segmentation in agriculture are notable. These technologies provide a detailed and quantitative interpretation of agricultural data, allowing farmers to make decisions based on accurate information [
23]. Additionally, image interpretation automation minimizes manual inspections, representing significant savings in time and resources.
Recent research on plant counting using aerial imagery offers a helpful parallel to our study. This approach—initially applied to tobacco crops—combines deep learning techniques and semantic segmentation to count plants from aerial images accurately, thus demonstrating the efficiency of this technology in optimizing agricultural management [
27].
Another notable study used UAVs to capture images and applied the UNet Style Xception model to successfully segment the parasitic species Cuscuta spp. in chili tree crops. This method improved the accuracy in identifying affected areas and facilitated the creation of orthomosaics useful for decision making in crop management [
28]. Although the results from using UAV data were promising, it is essential to note that the study did not include a validation of these findings. This limitation highlights the need to perform a more thorough validation of the model, encompassing not only individual images but also a broader view of the entire crop to ensure the applicability and accuracy of the method in broader agricultural contexts.
The versatility of these techniques to adapt to different crop types highlights their potential to improve accuracy in measuring and managing vegetation areas, a crucial aspect of precision agriculture. In particular, precision agriculture can help identify areas of optimal growth, optimize resource use, and improve both the quality and quantity of production for Opuntia spp. cultivation. In this context, the present study contributes significantly to advancing precision agriculture. The main contributions of this research include:
Detailed comparison: A comprehensive comparison of the performance of different semantic segmentation architectures is provided, including DeepLabV3+, UNet, and UNet Style Xception.
Validation in a specific context: The research validates the effectiveness of advanced deep learning techniques in the particular context of the agricultural region of Tulancingo, Hidalgo, Mexico.
Optimization of vegetation measurement: The study proposes a methodology to carry out a more precise and efficient measurement of the Opuntia spp. vegetation area, overcoming the limitations of traditional manual techniques.
Integration of orthophotos and AI: The combination of high-resolution orthophotos obtained by UAV with advanced semantic segmentation techniques stands out.
Contribution to crop analysis techniques: This study significantly contributes to developing and improving deep-learning-based crop analysis techniques.
One of the main innovations of this work lies in overcoming the difficulty of segmenting images with resolutions greater than 2000 pixels, especially orthomosaics, using well-known segmentation methods. This is a significant contribution, given that, to our knowledge, no previous studies or methods have addressed this problem. Likewise, we highlight the need for manual quantification for pixel counting, given that current neural networks have limitations in handling extremely large images; this aspect is crucial to guarantee precision in measuring the area of pixels and vegetation and underlines the relevance of our proposed methods. The advances presented not only improve the accuracy and efficiency of vegetation analysis but also open new possibilities for agricultural management and monitoring.
With these contributions, a technological tool is presented to carry out measurements that facilitate the analysis, measurement, and management of the vegetation area in
Opuntia spp. cultivation to promote more sustainable and efficient agriculture. The following sections detail the study area, data acquisition, and the most widely used semantic segmentation techniques. In Section 3, we describe the processing of orthomosaic images of Opuntia spp. cultivation and the deep learning techniques used to segment areas of interest. In Section 4, we present a comparative analysis of the semantic segmentation architectures and the measurement of the Opuntia spp. vegetation area. Finally, we report the conclusions of the study.
2. Materials and Methods
2.1. Study Area
The object of study focuses on a nopal cultivation (
Opuntia spp.) located at the geographical coordinates 20.12555 latitude and −98.37867 longitude, in the municipality of Tulancingo, state of Hidalgo, Mexico. The exact location is illustrated in
Figure 1. The crop field is located near the Tulancingo–Huauchinango highway, approximately 1 km from the Polytechnic University of Tulancingo.
The investigation focuses on a specific crop area comprising around 539 m². This delimitation is established to have a representative sample that enables an exhaustive evaluation of data acquisition and processing techniques.
Accessibility and proximity to the laboratory justify the location selection, since they simplify the logistics of data collection and subsequent analysis. Furthermore, this proximity allows for a detailed characterization of soil, vegetation, and other relevant elements. Choosing a representative area therefore helps ensure that the research findings are relevant to Opuntia spp. crops in the Tulancingo region.
2.2. Data Acquisition
Meticulous and detailed data acquisition is essential to thoroughly understanding the development of
Opuntia spp. cultivation from its initial stage to maturity. This process allows us to capture the crop’s physical evolution and identify the management practices that optimize its growth and health. In practice, the stems are arranged in central furrows during planting, facilitating their growth and management. As the stems take root, they develop shoots that become new cladodes. Furthermore, the technical management of
Opuntia spp. cultivation involves various field practices, mainly focused on maintaining the health of the plantation; this includes pruning, sprouting, and rejuvenation training to improve the quality and quantity of products available for the market [
29,
30]. All considerations mentioned are essential for the success of
Opuntia spp. cultivation in several regions, including Tulancingo.
Integrating advanced technology into this process requires choosing tools that operate with maximum efficiency and precision, capture fine details of the crop, and adapt to specific field conditions. We used the DJI Mavic 2 Mini (DJI, Shenzhen, China) drone for data acquisition, which weighs just 249 g. The drone is equipped with a 1/2.3″ 12-megapixel CMOS image sensor capable of capturing high-quality images. Its compact design is ideal for flights near structures and vegetation. It also features multiple sensors, including GPS, for precise navigation; the battery offers up to 25 min of continuous flight, while its intuitive remote control and the Dronelink mobile app (Dronelink, Austin, TX, USA) simplify flight mission planning and execution.
Combining high resolution with maneuverability and adaptability becomes essential in research methodology. The flight plan is meticulously executed over the area of interest at a height of 4 m, with the primary objective of evaluating and validating the robustness of the semantic segmentation models used in this research. The chosen height significantly improves the quality and reliability of the results obtained in the study.
Figure 2 shows a diagram of the flight carried out over the crop.
Precise flight trajectory plans are designed with Dronelink software (Version 3.4.0) with 75% lateral and 70% vertical overlap, ensuring comprehensive coverage of the area of interest. All flights take place around 11:00 a.m. to ensure stable lighting conditions.
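As a rough illustration of how these overlap settings translate into distances on the ground, the following sketch estimates the footprint of a single nadir photograph and the resulting spacing between neighbouring photographs. It assumes a flat field and a simple pinhole camera, and it reads the "vertical" overlap as the along-track overlap. The two field-of-view angles are hypothetical placeholders, not specifications of the drone used in this study.

```python
import math


def footprint_m(height_m: float, fov_deg: float) -> float:
    """Ground length covered along one image axis for a nadir shot."""
    return 2.0 * height_m * math.tan(math.radians(fov_deg) / 2.0)


def spacing_m(height_m: float, fov_deg: float, overlap: float) -> float:
    """Distance between neighbouring photos that yields the given overlap."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    return footprint_m(height_m, fov_deg) * (1.0 - overlap)


# Flight height and overlaps come from the text; both FOV angles are assumed.
HEIGHT_M = 4.0
ACROSS_FOV_DEG = 70.0  # hypothetical across-track field of view
ALONG_FOV_DEG = 55.0   # hypothetical along-track field of view

line_spacing = spacing_m(HEIGHT_M, ACROSS_FOV_DEG, 0.75)  # between flight lines
shot_spacing = spacing_m(HEIGHT_M, ALONG_FOV_DEG, 0.70)   # between exposures
print(f"line spacing: {line_spacing:.2f} m, shot spacing: {shot_spacing:.2f} m")
```

At a flight height of only 4 m the footprint is small, so high overlap values imply closely spaced exposures and a correspondingly large number of images.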
Furthermore, three flight missions are carried out with different configurations to capture images from various angles and perspectives. The settings include a vertical orientation of 90° directly above the crop, a tilt of 45° to obtain lateral images of the crop, and a tilt of 135° to capture an additional perspective, which facilitates a more complete and detailed evaluation of the Opuntia spp. crop and adequately documents all relevant characteristics. During the flights, 443 high-resolution aerial photographs were taken.
Table 1 details the images captured for each flight.
2.3. Semantic Segmentation
Semantic segmentation constitutes one of the main techniques in image analysis, allowing the identification and classification of each pixel according to predefined semantic categories. This approach is particularly valuable in photogrammetric studies aimed at distinguishing between different features and objects present in aerial images, especially in applications related to managing and exploiting plantations. In the present research, leading architectures in semantic segmentation are implemented, such as UNet, DeepLabV3+, and UNet Style Xception. The following subsections concisely describe these architectures, highlighting their contributions and applicability in semantic segmentation.
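The per-pixel classification described above can be sketched in a few lines: given one score map per class, each pixel receives the label of the class with the highest score. The two classes and the score values below are hypothetical and serve only to illustrate the idea; the architectures described next produce such score maps from the input image.

```python
def label_map(scores):
    """Return a 2D label map from scores[c][y][x] (score of class c at pixel y, x)."""
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(n_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Two hypothetical classes: 0 = background, 1 = Opuntia spp.
background = [[0.9, 0.2], [0.6, 0.1]]
opuntia = [[0.1, 0.8], [0.4, 0.9]]

mask = label_map([background, opuntia])
print(mask)  # [[0, 1], [0, 1]]
```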
2.3.1. UNet
The UNet architecture for segmentation has its characteristic “U”-shaped design, integrating encoding and decoding sections to process and reconstruct images with high precision [
25]. It excels at detecting fine details, which are crucial for identifying affected crop areas. Its distinctive design, depicted in
Figure 3, is based on a symmetric topology with an encoder (contraction path) and a decoder (expansion path) connected by a central bridge.
The encoder is composed of repeated blocks of two convolutions, each followed by a ReLU (rectified linear unit) activation function, and a max pooling operation with stride 2, which progressively reduces the spatial resolution of the feature maps, allowing the network to learn more abstract and robust representations of the input data.
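To make the down-sampling step concrete, the sketch below applies 2×2 max pooling with stride 2 to a small single-channel feature map stored as a nested list. It is a minimal illustration with made-up values, not the implementation used in this study, where a deep learning framework would perform this operation on multi-channel tensors.

```python
def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    height, width = len(feature_map), len(feature_map[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    return [
        [
            max(
                feature_map[y][x], feature_map[y][x + 1],
                feature_map[y + 1][x], feature_map[y + 1][x + 1],
            )
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
pooled = max_pool_2x2(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```

Each pooling step halves the height and width, which is why the decoder must later restore the resolution.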
In contrast, the decoder is structured in blocks of up-convolutions that increase the resolution of the feature maps, followed by concatenations with the corresponding encoder features, technically called “copy and crop”. This process recovers the location information lost during pooling and combines it with higher-level features to facilitate accurate segmentation.
Finally, each expansive step is completed with two convolutions and ReLU, and the sequence culminates in a convolution layer that maps the high-dimensional features to the desired classification per pixel, resulting in the segmentation prediction image.
The strength of UNet lies in its ability to operate with a limited number of training samples and still produce highly accurate segmentations, highlighting its usefulness in applications where data can be sparse or extremely varied.
2.3.2. DeepLabV3+
The DeepLabV3+ [
24] architecture, illustrated in
Figure 4, is a significant innovation for semantic segmentation, designed to improve accuracy at object edges and in small-scale regions. This network expands the context capture capability and enhances contour accuracy by implementing atrous convolution modules in its encoder-decoder structure.
The DeepLabV3+ encoder starts with an atrous convolution that adjusts the network’s field of view to the image features and continues with parallel atrous convolution blocks with different dilation rates, allowing the network to capture context at multiple scales without losing image resolution. These atrous convolutions, also known as dilated convolutions, process the image at various spatial scales and allow explicit control of the resolution at which features are computed, preserving the edge information of the objects.
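The effect of the dilation rate can be illustrated with a one-dimensional sketch: the same three-weight kernel covers a wider span of the input as the rate grows, without adding parameters. The code below is a minimal example with made-up values, written as a cross-correlation without padding; it is not the DeepLabV3+ implementation.

```python
def dilated_conv1d(signal, kernel, rate=1):
    """Valid (unpadded) 1D atrous convolution, computed as a cross-correlation."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    span = (len(kernel) - 1) * rate + 1  # input samples covered by the kernel
    return [
        sum(w * signal[start + i * rate] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
kernel = [1, 0, -1]

print(dilated_conv1d(signal, kernel, rate=1))  # [-2, -2, -2, -2, -2]
print(dilated_conv1d(signal, kernel, rate=2))  # [-4, -4, -4]
```

With rate 1 the kernel spans three samples; with rate 2 it spans five. The output shrinks here only because no padding is applied; with suitable padding the output keeps the input length, which is how the resolution is preserved in practice.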
The decoder uses upsampling operations to increase the resolution of the extracted features. These upsampled features are concatenated with low-level features from the encoder to preserve detailed image information. Then, a series of convolutions is applied to refine the combined features, followed by a second upsampling operation to reach the resolution of the original image.
Finally, the architecture applies a convolution to obtain the segmentation prediction. This last layer projects the multidimensional feature tensor into the segmentation label space, producing the final prediction image.
The architecture is distinguished by its ability to handle objects of various scales and its efficiency in precisely segmenting fine and complex details, as required in detecting specific areas of vegetation in Opuntia spp., making it a preferred option for applications that demand high fidelity in segmentation.
2.3.3. UNet Style Xception
In the development of advanced techniques for image segmentation, the UNet Style Xception architecture is presented as a significant innovation. Combining UNet with the Xception architecture, using separable convolutions [
31], results in a highly efficient and accurate system that improves the network’s encoding, decoding, and feature detection. The original UNet is characterized by its contraction path, which captures the overall context of the image, and an expansion path that reconstructs the segmentation output from those encoded features. Xception, for its part, introduces separable convolution modules, which allow more efficient handling of spatial and channel characteristics.
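The efficiency gain of separable convolutions can be seen by counting weights. A standard convolution learns one k×k filter per input-output channel pair, whereas a depthwise separable convolution learns one k×k filter per input channel followed by a 1×1 pointwise convolution. The sketch below compares both counts, ignoring biases, for a hypothetical layer; the channel sizes are illustrative and not taken from the model in this study.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a standard k x k convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights of a depthwise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
std = standard_conv_params(3, 64, 128)
sep = separable_conv_params(3, 64, 128)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```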
In the contraction path, the Xception-Style architecture improves context capture by efficiently processing low- and high-dimensional features. This results in an improved ability to identify and encode relevant image information. On the other hand, the expansion path benefits from the accuracy of separable convolutions, which leads to a more accurate reconstruction of the segmented image. Additionally, feature detection is refined, allowing finer segmentation and greater model generalization.
Prior results with the Xception architecture suggest that its integration with UNet is theoretically sound and empirically supported. Such a fusion extends the capabilities of the UNet convolutional neural network and opens new possibilities in remote sensing and geospatial analysis.
Figure 5 shows the structure of UNet Style Xception in the segmentation of
Opuntia spp. vegetation, emphasizing its effectiveness in the face of the complexity of aerial images.
2.4. Generation of Orthomosaics
Implementing advanced technologies has become a fundamental pillar for optimizing crop production and management in agriculture. One of these technologies is the generation of orthomosaics, which has transformed how farmers visualize and understand their farmland. Orthomosaics result from processing and integrating aerial images and offer a detailed, panoramic perspective of the terrain, thus supporting informed and accurate decision making.
An orthomosaic is a georeferenced and orthorectified image, the result of the integration of multiple aerial photographs. Mathematically, orthorectification can be described as a transformation $T$ that corrects the geometric distortions of an image $I$ to obtain a planar representation $I'$:
$$I' = T(I)$$
The orthomosaic generation process comprises several essential stages that together produce a high-quality final result. Initially, we capture images using aerial platforms such as drones. The aerial photographs must overlap adequately along the x and y axes to ensure complete terrain coverage.
After capturing, we proceed to image processing. This step involves meticulous alignment and correction, in which keypoint-matching algorithms identify corresponding regions between neighboring images. In this process, WebODM (Version 2.5.0, Open Drone Map, open-source), software specialized in aerial image processing, plays a crucial role in facilitating the generation of orthomosaics, 3D models, and elevation maps. Its web-based interface simplifies project management and visualization, while its modular architecture promotes integration with a variety of tools and plugins [
32].
Finally, we integrate the georeferenced images to create a continuous orthomosaic. The process eliminates redundancies and ensures a smooth transition between images, producing a cohesive and detailed orthomosaic.
Figure 6 illustrates image captures taken over a specific area and the corresponding orthomosaic generated with WebODM.
This example illustrates how we transformed individual aerial images into a unified orthomosaic representation. This transformation provides a detailed, panoramic view of the crop field, enabling applications such as plant health monitoring, crop planning, resource management, and canopy disease detection. In addition, it enables the training of machine learning algorithms to identify growth patterns or detect variations in vegetation. With technological advances and the continuous improvement of aerial image capture and processing techniques, orthomosaics are consolidated as essential tools in modern agriculture.
4. Results
4.1. Comparison of Semantic Segmentation Architectures
The study performs a detailed evaluation of three architectures—UNet, DeepLabV3+, and UNet Style Xception—with a specific focus on the
Opuntia spp. semantic segmentation task. Performance is measured using the RMSE and IoU metrics, and multiple models are trained for each architecture to obtain a robust and reliable evaluation of its performance. Of the three architectures evaluated, UNet Style Xception is the most effective in accurately identifying
Opuntia spp. plants based on the metrics presented in
Table 2. The results offer valuable guidance for the informed selection of the most appropriate architecture in contexts that require precise semantic segmentation for
Opuntia spp. identification.
One consideration in evaluating semantic segmentation architectures is analyzing computation times on GPU and CPU, as shown in
Table 2. Computation times indicate each model’s operational efficiency, with significant implications for its practical application in real-time scenarios and limited computing environments. In this study, it is observed that UNet Style Xception not only excels in accuracy, as indicated by the RMSE (1696.5831) and IoU (0.6733) metrics, but it also excels in efficiency, exhibiting the lowest processing times on GPU (0.03038 s) and CPU (0.06013 s). The comparison details that, while DeepLabV3+ and UNet offer competitive times, the optimization of UNet Style Xception positions it as the preferred architecture for contexts that demand high precision and operational efficiency.
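For reference, the two metrics can be sketched as follows. IoU is computed between a predicted and a reference binary mask. RMSE is shown in its generic form between paired predicted and reference values; the assumption that these values are per-image vegetation pixel counts is ours and is not stated in this section. All masks and numbers below are made up for illustration.

```python
import math


def iou(pred, ref):
    """Intersection over union of two binary masks given as 2D lists of 0/1."""
    inter = union = 0
    for row_p, row_r in zip(pred, ref):
        for p, r in zip(row_p, row_r):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    return inter / union if union else 1.0  # two empty masks agree fully


def rmse(pred_values, ref_values):
    """Root mean square error between paired predictions and references."""
    if len(pred_values) != len(ref_values) or not pred_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(pred_values, ref_values))
    return math.sqrt(squared / len(pred_values))


pred = [[1, 1, 0], [0, 1, 0]]
ref = [[1, 0, 0], [0, 1, 1]]
print(iou(pred, ref))  # 0.5

# Hypothetical per-image vegetation pixel counts (predicted vs. reference).
print(rmse([105, 195], [100, 200]))  # 5.0
```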
Figure 9 shows the orthomosaics generated from segmented images using the three architectures proposed in this research work. The generation of orthomosaics is carried out using the open-source WebODM platform, where the results reveal not only the ability of each architecture to capture precise details of the
Opuntia spp. but also their uniqueness in spatial representation and texture fidelity. The DeepLabV3+ application exhibits coherence in the three-dimensional reconstruction of the
Opuntia spp., highlighting its efficiency in capturing details. On the other hand, the UNet architecture, with its layers of contraction and expansion, demonstrated the ability to preserve the topology of the
Opuntia spp. in the resulting orthomosaic. Likewise, the UNet Style Xception variant presents an orthomosaic that combines the high precision of predictions with the ability to discern subtle nuances in the structure of the
Opuntia spp.
In this work, binary masks generated from orthomosaics are considered very useful for precisely quantifying the plant mass of the
Opuntia spp. cultivation because they show the general context of the study area.
Figure 10 shows the results of the binary masks of the three study architectures.
The masks are the direct result of the semantic segmentation of the images. They are essential for counting pixels and, therefore, for quantifying in detail the presence of Opuntia spp. in the region of interest, which allows a quantitative analysis and approximate measurements of the Opuntia spp. plant cover.
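The conversion from a pixel count to a ground area can be sketched as follows: the number of vegetation pixels in the binary mask is multiplied by the ground area that one pixel represents. The ground sampling distance (GSD) used here is a hypothetical value, and the tiny mask is made up; neither comes from this study.

```python
def vegetation_area_m2(mask, gsd_m):
    """Area covered by non-zero mask pixels, given the GSD in metres per pixel."""
    if gsd_m <= 0:
        raise ValueError("gsd_m must be positive")
    vegetation_pixels = sum(1 for row in mask for value in row if value)
    return vegetation_pixels * gsd_m ** 2


mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
GSD_M = 0.002  # hypothetical: each pixel covers 2 mm x 2 mm on the ground

area = vegetation_area_m2(mask, GSD_M)
print(f"{area:.6f} m²")
```

Because the area scales with the square of the GSD, an error in the assumed GSD affects the measured area quadratically.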
4.2. Measurement of the Plant Mass Area of the Opuntia spp. Cultivation
Measuring from an orthomosaic is challenging because orthomosaics are generally large images that exceed 4K resolution, so they cannot be fed to a deep neural network for semantic segmentation due to storage and processing capacity limits. Therefore, we adopt strategies such as the one presented by Gutierrez-Lazcano et al. [28], which divides the image into four parts that are subsequently reduced to a size of 160 pixels, facilitating the implementation and training of different semantic segmentation models. However, it is essential to measure segmentation performance on orthomosaics, since they give a general view of the crop area to be analyzed for decision making. Our analysis proposal uses the LabelMe software (Version 5.4.1) to manually segment the original orthophoto.
Figure 11 shows the results of the reference segmentation from the original orthophoto image. This process allowed the generation of a detailed binary mask, essential for accurately quantifying the
Opuntia spp. area in the region of interest.
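One generic way to keep each piece of a large image within a size that a network can process is to cover it with a grid of non-overlapping tiles. The sketch below only computes the tile boxes; the image dimensions and tile size are hypothetical, and this is an illustration of the general idea rather than the exact procedure followed in this work or in [28].

```python
def tile_boxes(width, height, tile):
    """Yield (left, top, right, bottom) boxes that cover the image without overlap."""
    if tile <= 0:
        raise ValueError("tile must be positive")
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            # Boxes on the right and bottom edges are clipped to the image size.
            yield left, top, min(left + tile, width), min(top + tile, height)


# Hypothetical orthomosaic of 5000 x 3000 pixels split into 2000-pixel tiles.
boxes = list(tile_boxes(5000, 3000, 2000))
print(len(boxes))  # 6

covered = sum((right - left) * (bottom - top) for left, top, right, bottom in boxes)
print(covered == 5000 * 3000)  # True
```

Edge tiles may be smaller than the nominal size, so they need padding or resizing before being passed to a model with a fixed input size.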
Manual segmentation guarantees a careful and precise delimitation of the areas occupied by the Opuntia spp. in the orthophoto, thus providing a reliable basis for comparing the regions obtained through the three models analyzed. By using a manually segmented mask, we not only establish a reliable benchmark for evaluating automatic segmentation techniques but also enable the validation of the accuracy and effectiveness of such architectures in identification and quantification.
The models are evaluated in detail to quantify the area of
Opuntia spp. vegetation in each orthomosaic.
Table 3 presents the quantitative results, where a significant agreement between the automatic measurements and manual quantification is evident. However, the UNet Style Xception architecture stands out notably, as it exhibited greater precision in the delimitation and quantification of
Opuntia spp. areas compared to the other two architectures. This finding suggests adaptability and generalization capacity in the semantic segmentation task of
Opuntia spp. vegetation.
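As a hedged illustration of how a vegetation area can be derived from a binary mask and compared with the manual reference, the sketch below counts vegetation pixels and scales them by the ground area of one pixel. The parameter `gsd_m` (ground sampling distance, in metres per pixel) and both function names are hypothetical. This section does not report the ground sampling distance or the exact formula used for Table 3.

```python
def vegetation_area_m2(mask, gsd_m):
    """Return the area covered by vegetation pixels, in square metres.

    mask  : 2D list of 0/1 values, where 1 marks a vegetation pixel.
    gsd_m : assumed ground sampling distance in metres per pixel (hypothetical).
    """
    vegetation_pixels = sum(1 for row in mask for value in row if value)
    return vegetation_pixels * gsd_m ** 2


def relative_error(estimated, reference):
    """Relative deviation of an automatic estimate from the manual reference."""
    return abs(estimated - reference) / reference


# Toy masks: the manual reference has 8 vegetation pixels, the prediction 6.
manual_mask = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
]
predicted_mask = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
]
manual_area = vegetation_area_m2(manual_mask, gsd_m=0.5)
predicted_area = vegetation_area_m2(predicted_mask, gsd_m=0.5)
area_error = relative_error(predicted_area, manual_area)
```

Because both masks share the same pixel size, the relative error depends only on the pixel counts. The ground sampling distance matters only when the area is reported in physical units.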
In the existing literature [
33,
34,
35,
36,
37,
38], various semantic segmentation models based on deep learning have been applied to identify vegetation in aerial images. However, these studies are generally limited to smaller images and do not address the segmentation of large orthomosaics. Our approach instead introduces an innovative methodology that allows the accurate and efficient segmentation of large orthomosaics, which is crucial for the extensive management of crops such as
Opuntia spp.
Table 4 compares recent approaches to segmenting vegetation areas, listing the resolution of the segmented images and the architectures used; none of these approaches segments high-resolution images.
4.3. Discussion of Results
The results obtained show the potential of deep learning architectures, specifically DeepLabV3+, UNet, and UNet Style Xception, for semantic segmentation and quantification of vegetative areas in Opuntia spp. cultivation. Through comparative analysis, this work highlights the significant advantages of these technologies over traditional manual methods, offering a more precise and efficient approach. However, selecting the optimal model depends on several factors, including the desired accuracy and resource limitations: while UNet is distinguished by its ability to segment fine details, DeepLabV3+ and UNet Style Xception demonstrate greater adaptability to scale variations. Despite their strengths, these architectures face challenges related to image resolution, the specificity of the application context, and operational complexity.
The comparative analysis of the orthomosaics generated by the UNet, DeepLabV3+, and UNet Style Xception architectures, presented in
Figure 9a–c, provides a visual reference in terms of detail and edge accuracy. Meanwhile,
Figure 10a–c uses binary masks to quantify the vegetation area. The precision of UNet Style Xception is reflected in its higher agreement with the manual segmentation presented in
Figure 11b and the binary mask in
Figure 11c, which shows that its results are very close to the manual reference. The analysis highlights the ability of the UNet Style Xception architecture to handle large images and achieve accurate results.
The performance of the trained UNet, DeepLabV3+, and UNet Style Xception models for image segmentation is compared in
Table 2. The IoU values (>0.5) and the RMSE values, which fall within a homogeneous range, indicate that the three models can effectively segment small images, with variations in performance and execution times. The best results were obtained with the UNet Style Xception architecture.
Table 3 presents the comparative performance of the trained UNet, DeepLabV3+, and UNet Style Xception models in quantifying the vegetation area in the orthomosaic. The RMSE values of the three architectures are homogeneous, indicating that all three models achieve automatic quantification of the vegetation area, while the UNet Style Xception model, with an IoU of 0.5051 (greater than 0.5), shows the best performance for this task.
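For readers unfamiliar with the two metrics, the following sketch shows one common way to compute them. The IoU is computed between a predicted and a reference binary mask. The RMSE is written as a generic function over paired values (for example, per-image vegetation pixel counts), which is an assumption: this section does not state which quantities the reported RMSE was computed over. The function names are hypothetical.

```python
import math


def iou(predicted_mask, reference_mask):
    """Intersection over union of two binary masks given as 2D lists of 0/1."""
    intersection = 0
    union = 0
    for pred_row, ref_row in zip(predicted_mask, reference_mask):
        for pred, ref in zip(pred_row, ref_row):
            if pred and ref:
                intersection += 1
            if pred or ref:
                union += 1
    # Two empty masks agree perfectly; this convention is a choice of this sketch.
    return intersection / union if union else 1.0


def rmse(predicted_values, reference_values):
    """Root mean squared error between paired numeric values."""
    squared_errors = [(p - r) ** 2 for p, r in zip(predicted_values, reference_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: 2 overlapping pixels out of 4 pixels marked in either mask.
predicted = [[1, 1, 0], [0, 1, 0]]
reference = [[1, 0, 0], [0, 1, 1]]
toy_iou = iou(predicted, reference)
toy_rmse = rmse([10, 14], [13, 10])
acceptable = toy_iou >= 0.5  # the 0.5 level used in the text as a reference
```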
The study analyzes the DeepLabV3+, standard UNet, and UNet Style Xception architectures. Although all three showed competitive performance, the UNet Style Xception architecture excelled in accuracy and computational efficiency. The quantitative results presented in
Table 2 indicate that UNet Style Xception obtained a lower RMSE (1696.5831) and a higher IoU (0.6733), demonstrating its robustness in handling fine details and complex structures within the vegetation. Furthermore, it presented the lowest processing times on GPU (0.03038 s) and CPU (0.06013 s), underscoring its advantage in the specific
Opuntia spp. segmentation task.
The superior performance of the UNet Style Xception architecture in this study can be attributed to its use of separable convolutions (SeparableConv), which improve efficiency and accuracy. A SeparableConv operator decomposes a standard convolution into two simpler operations: a depthwise convolution that filters each channel spatially, followed by a pointwise (1 × 1) convolution that combines the channels. This significantly reduces the number of calculations required and allows faster processing without compromising accuracy [
31]. This additional efficiency is essential in segmenting vegetation areas, where it is crucial to distinguish between different plant textures and details. This results in better detection and segmentation of complex structures within orthomosaic images, facilitating greater precision in the delimitation of vegetative areas.
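The saving offered by separable convolutions can be made concrete by counting weights. The sketch below compares a standard convolution with a depthwise separable one for the same kernel size and channel counts, ignoring biases. The layer dimensions are illustrative and are not taken from the models trained in this study; the function names are hypothetical.

```python
def standard_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a standard 2D convolution (bias terms ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_weights(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (bias terms ignored).

    Depthwise step: one kernel_size x kernel_size filter per input channel.
    Pointwise step: a 1 x 1 convolution mapping in_channels to out_channels.
    """
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Illustrative layer: 3 x 3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_weights(3, 64, 128)      # 73728 weights
separable = separable_conv_weights(3, 64, 128)    # 8768 weights
ratio = separable / standard                      # equals 1/128 + 1/9
```

For this illustrative layer, the separable version needs roughly 12% of the weights of the standard convolution. The number of multiplications per output position shrinks by the same factor.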
A significant limitation of the proposed method is its dependence on the resolution and quality of the captured images. Low-resolution images can degrade the accuracy of semantic segmentation and complicate the identification and quantification of small vegetative areas, because the fine details that segmentation relies on are lost. The performance of UNet, DeepLabV3+, and UNet Style Xception varies with image resolution; UNet stands out for its ability to segment fine details, while DeepLabV3+ and UNet Style Xception demonstrate greater adaptability to scale variations.
Environmental and climatic conditions can also significantly affect the results of semantic segmentation and quantification of vegetative areas. Variability in lighting conditions and the presence of shadows represent a challenge for the generalization of the models. The flights were scheduled around 11:00 a.m. to ensure stable lighting conditions and minimize the impact of shadows on the images. Nevertheless, the adaptability of the proposed method to different agricultural conditions is a key strength that supports its potential in precision agriculture.
Additionally, weather conditions, such as cloud cover and wind, can influence the quality and clarity of captured images. To mitigate these effects, flight plans were designed with 75% lateral overlap and 70% vertical overlap, ensuring exhaustive coverage of the study area. Furthermore, validating the results against orthomosaics, and not only against individual images, makes it possible to confirm the accuracy of the segmentation and correct possible errors caused by non-ideal capture conditions, providing a reliable basis for comparing and analyzing the segmentation models. Although these models have proven effective in the specific region studied, their applicability to other agricultural conditions requires additional validation. Computational efficiency is another crucial consideration when scaling to larger datasets. Advanced data augmentation and processing techniques are essential to maintaining data integrity and balancing computational efficiency with the quality of results. Finally, the complexity and operational costs of implementing the proposal on a larger scale in precision agriculture must be considered.
Furthermore, despite this study’s relatively small dataset, several strategies have been implemented to mitigate bias and enhance the model’s generalization ability. The data augmentation scheme, which includes a wide range of transformations such as flips and rotations, has significantly increased the diversity of the training images, thereby bolstering the robustness of the model. Moreover, detailed manual segmentation of the original orthophotos has been used as a reference to validate the accuracy of the automatic models, providing a solid basis for evaluating the effectiveness of semantic segmentation techniques. Together, these approaches help ensure that the results are consistent and reproducible while improving the model’s ability to generalize to previously unseen conditions.
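A minimal sketch of flip and rotation augmentation for segmentation data is given below. The key point is that every transformation is applied identically to the image and to its mask so that the labels stay aligned. The specific set of six transformations and the function names are assumptions for illustration; the exact augmentation scheme used in this study is not detailed in this section.

```python
def flip_horizontal(grid):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid):
    """Mirror a 2D grid top to bottom."""
    return [row[:] for row in grid[::-1]]


def rotate_90(grid):
    """Rotate a 2D grid by 90 degrees clockwise."""
    return [list(column) for column in zip(*grid[::-1])]


def augment_pair(image, mask):
    """Return (image, mask) pairs with the same transformation applied to both."""
    transforms = [
        lambda g: [row[:] for row in g],          # identity (copy)
        flip_horizontal,
        flip_vertical,
        rotate_90,
        lambda g: rotate_90(rotate_90(g)),        # 180 degrees
        lambda g: rotate_90(rotate_90(rotate_90(g))),  # 270 degrees clockwise
    ]
    return [(transform(image), transform(mask)) for transform in transforms]


# Toy 2 x 2 example: the mask marks the pixel whose image value is 1.
toy_image = [[1, 2], [3, 4]]
toy_mask = [[1, 0], [0, 0]]
augmented = augment_pair(toy_image, toy_mask)
```

In every augmented pair, the mask still marks the pixel with value 1, which is the property that keeps training labels valid after augmentation.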
Future research could also explore other advanced segmentation models, such as Segment Anything, an emerging approach that promises greater adaptability and accuracy across segmentation tasks. It is designed to segment any object in an image without training a specific model for each object class, which can be advantageous in agricultural contexts where the variability of vegetative characteristics is high. The flexibility and generalizability of Segment Anything could provide a robust framework for orthomosaic segmentation in diverse agricultural settings, potentially offering improvements in the accuracy and efficiency of the segmentation process.