UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery

Wang, Shaohua; Wang, Yan; Yue, Jianwei; Liang, Haojian; Zhang, Zihan; Li, Bojun

doi:10.3390/app15094813

Open AccessArticle

UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery

by

Shaohua Wang

^1,2,3

,

Yan Wang

⁴

,

Jianwei Yue

^4,*,

Haojian Liang

¹

,

Zihan Zhang

⁴ and

Bojun Li

⁵

¹

State Key Laboratory of Remote Sensing and Digital Earth, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100101, China

²

Key Laboratory of Earth Observation of Hainan Province, Hainan Aerospace Information Research Institute, Sanya 572029, China

³

University of Chinese Academy of Sciences, Beijing 101408, China

⁴

Faculty of Geographical Science, Beijing Normal University, Beijing 100875, China

⁵

College of Resource Environment and Tourism, Capital Normal University, Beijing 100048, China

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(9), 4813; https://doi.org/10.3390/app15094813 (registering DOI)

Submission received: 12 February 2025 / Revised: 2 March 2025 / Accepted: 15 March 2025 / Published: 26 April 2025

(This article belongs to the Special Issue Geospatial Insights: Unleashing the Power of Big Data and GeoAI, 2nd Edition)

Download

Browse Figures

Versions Notes

Abstract

:

This study introduces a novel deep learning-based model, UniU-Net, designed to achieve the high-precision segmentation of areca palms in remote sensing imagery. UniU-Net incorporates an auxiliary encoder and a unified attention fusion module (UAFM), enhancing the model’s anti-overfitting capabilities to improve its overall segmentation performance. Specifically, the primary and auxiliary encoders, through isomorphic parallel processing, leverage the principles of structural reparameterization to enhance the model’s effective learning of areca palm features while reducing the risk of overfitting. The UAFM utilizes a spatial attention mechanism to facilitate the effective fusion of multi-scale features. This architecture enables the model to capture intricate morphological details and accurately delineate the boundaries of areca palms, even under complex and heterogeneous environmental conditions such as mixed vegetation and varying illumination. To validate the effectiveness of UniU-Net, comprehensive experiments were conducted on a specialized areca palm dataset, demonstrating superior performance compared to several state-of-the-art semantic segmentation models. The proposed method achieves significant improvements in key evaluation metrics, such as the F1-score and intersection over union (IoU), highlighting its robustness and precision in automated areca palm extraction tasks. The integration of advanced attention mechanisms not only enhances the model’s ability to focus on relevant regions but also improves the segmentation accuracy in challenging scenarios. Beyond the specific application of areca palm segmentation, the methodologies introduced in this study hold substantial practical significance for broader agricultural applications, such as precision farming and crop monitoring.

Keywords:

areca palm extraction; remote sensing; deep learning; attention mechanism; UniU-Net

1. Introduction

Areca palms primarily grow in tropical and subtropical regions, spanning countries and regions across South Asia, Southeast Asia, and the Pacific Islands. Their fruit is extensively utilized in food processing and medicinal applications. As a pharmaceutical agent, the areca nut exhibits anti-parasitic properties, stimulates digestion, regulates the nervous and cardiovascular systems, and demonstrates antioxidant activity, making it a commonly used traditional Chinese medicine [1]. In the food sector, the areca nut’s distinct flavor and satisfying chewing quality have rendered it a daily staple for many individuals [2]. Furthermore, areca nut fiber not only offers nutritional and therapeutic value but also possesses favorable physical properties, suitable for industrial applications such as textiles. It can serve as a substitute for synthetic fibers, thereby contributing to the development of a green economy and sustainability [3]. In China, Hainan Province currently produces more than 90% of the country’s areca nuts, making it the largest cultivation base nationwide. This production plays an essential role in the province’s agricultural sector and has become a significant source of income for local farmers [4].

With the rapid growth of the areca nut industry, the cultivated area of areca palms in Hainan Province continues to expand, underscoring the importance of managing these plantations in a scientific and rational manner to promote sustainable agricultural development. Obtaining accurate and effective information on the spatial distribution of areca palms is crucial in optimizing cultivation strategies, forecasting yields, and monitoring pest outbreaks. Traditional manual extraction methods—such as visual interpretation, relying on remote sensing experts’ knowledge and experience to identify areca palms from image features—remain highly accurate but are also time-consuming and labor-intensive [5].

In recent years, deep learning techniques have gained widespread recognition for their exceptional feature extraction capabilities and powerful generalization performance. Deep convolutional neural networks (DCNNs) are particularly adept in processing two-dimensional data [6], achieving remarkable results in tasks such as semantic segmentation [7,8], change detection [9,10], and object detection [11,12]. Among these studies, U-Net has been extensively researched and improved due to its stable segmentation performance. These advances have opened up new opportunities for the use of deep learning in remote sensing. For instance, Weijia Li and colleagues [13] first applied a convolutional neural network (CNN) within a deep learning framework, optimizing the parameters and merging sampling strategies to significantly improve the oil palm detection accuracy in high-resolution remote sensing imagery. Similarly, Ulku et al. [14] introduced a lightweight spatiotemporal feature extraction module into U-Net and DeepLabv3+, complemented by a hierarchical loss function and time series data, thereby enhancing the tree species segmentation performance. Meanwhile, Guo et al. [15] proposed the ME-Net model, which incorporates a global attention module (GAM), a multi-scale context embedding module (MCE), and a boundary fitting unit (BFU) to innovatively refine mangrove extraction using Sentinel-2A data, resulting in the more precise identification of mangrove distributions. Gibril et al. [16] investigated the capability of deep vision transformers in extracting date palm trees from multi-scale, multi-source, very-high-spatial-resolution (VHSR) images and found that the Segformer model outperformed other convolutional neural network models in terms of accuracy and efficiency. However, it still faces certain challenges when dealing with varying lighting conditions and background complexity. Qin et al. [17] proposed a deep learning-based method for the identification of pine wilt nematode-infested trees by fusing multispectral and visible light images captured by unmanned aerial vehicles (UAVs). This approach enhanced the rapid and accurate detection of pine wilt disease in complex terrain environments. However, there remains room for optimization in terms of computational efficiency. Despite these advancements, existing methods often focus on specific tree species and still face challenges posed by unbalanced datasets, varying illumination, and complex climatic conditions, thus limiting their applicability in accurately extracting areca palms.

To address these challenges, this study puts forward a novel deep learning model designed to achieve high-precision areca palm extraction. First, we build a thematic dataset on the areca nut distribution using imagery obtained from UAV remote sensing, laying the groundwork for automated and efficient extraction. Next, we propose a deep learning network named UniU-Net, which integrates an auxiliary encoder and a feature fusion module, thus exhibiting clear advantages when handling complex environmental backgrounds. Finally, we introduce a feature fusion mechanism known as the UAFM, which employs spatial attention to weight the input features derived from both the main and auxiliary encoders. This approach maximizes the use of effective features and significantly improves the overall segmentation performance.

The main contributions of this study are as follows.

This study introduces a well-annotated distribution dataset for areca palm trees, which can provide data support for the rapid extraction of areca palm trees and related research.
This study proposes a deep learning model named UniU-Net for the high-precision extraction of areca palm trees. The model integrates a primary–auxiliary encoder structure and a feature fusion module, which reduces the risk of overfitting during training and enhances the ability to learn multi-scale features, thereby achieving the high-precision extraction of areca palm trees.
This study introduces a feature fusion module called the unified attention fusion module (UAFM). This module utilizes spatial attention to weight the input features from both the primary and auxiliary encoders, thereby enhancing the model’s capability for multi-feature learning.

2. Related Works

In recent years, the rapid advancement of computer technology has given rise to a multitude of machine learning algorithms, thereby offering new pathways for automated remote sensing applications. Machine learning algorithms capitalize on data and experience to uncover patterns, regularities, and correlations hidden in large datasets, which are then utilized for tasks such as prediction, classification, and optimization. Depending on the learning approach, machine learning can generally be categorized into supervised learning, unsupervised learning, and reinforcement learning. Traditional machine learning methods—such as decision trees, support vector machines (SVM), K-means clustering, and principal component analysis—have been widely applied in image segmentation to extract target objects from imagery [18]. For instance, leveraging remote sensing data from Sentinel-2 and Landsat-8, Aqil Tariq [19] and colleagues developed classification models based on decision trees and random forest algorithms, using NDVI time series and phenological parameters. Their approach successfully differentiated various crops and their rotation patterns, significantly enhancing the accuracy of agricultural land use maps, cropping systems, and crop types. Similarly, Haoran Lin [20] conducted a comparative study of four machine learning algorithms—random forest (RF), SVM, K-nearest neighbors (KNN), and extreme gradient boosting (XGBoost)—combining high-resolution multispectral imagery with LiDAR data to classify tree species across regions at different elevations.

Despite the extensive use of traditional machine learning methods, they still exhibit certain limitations when applied to complex, high-dimensional remote sensing data [21]. These limitations include the strong dependence on handcrafted features and insufficient capacity to capture non-linear feature relationships [22]. With the rapid evolution of deep learning—especially the advent and widespread application of convolutional neural networks (CNNs)—remote sensing image classification has undergone a transformative leap. Through multi-layer convolutional structures, CNNs can autonomously extract spatial characteristics and spectral information from data, offering superior adaptability and representational power for large-scale remote sensing tasks. Their outstanding performance in image classification, object detection, and semantic segmentation surpasses that of traditional methods and provides more effective means for the detailed classification of intricate terrain types.

Building on these advancements, Hae Gwang Park [23] and collaborators introduced a multi-channel CNN-based object detection method to identify trees suspected of infection by pine wilt disease (PWD). The model employs a two-stage network structure consisting of a region proposal network (RPN) and a Fast R-CNN, where the RPN generates candidate regions and the Fast R-CNN classifies and refines the detection predictions. Through this two-stage optimization, the loss function converges to a minimal value, enhancing the overall detection accuracy. Likewise, Ferreira [24] utilized a fully convolutional neural network (combining ResNet-18 with a DeepLabv3+ architecture) and morphological operations to classify and extract the crowns of palm trees in the Amazon Basin, achieving high classification accuracy. In another study, Shijie Yan [25] proposed a gated recurrent convolutional neural network (G-RecConNN) that couples a CNN for feature extraction with a GRU to capture temporal dependencies in image sequences, thus facilitating crop disease classification and early prediction. Extending this body of work, Zhangxi Ye [26] designed the U2-Net model based on U-Net, incorporating an encoder–decoder structure with residual connections. Its dual nested U-shaped architecture adeptly extracts multi-scale features and, when applied to UAV-based visible-light imagery, attains high-precision olive tree crown contour detection. Furthermore, Lou [27] employed three popular deep learning object detection algorithms—Faster R-CNN, YOLOv3, and SSD—to identify and measure tree crown widths in two loblolly pine plantations (one younger, one more mature) in Eastern Texas, achieving an R² value of 0.94.

Although areca palms hold considerable economic value, research on the extraction of their distribution patterns remains scarce. This study thus endeavors to fill this gap by proposing an innovative deep learning model to enable automatic and efficient areca palm extraction, leveraging cutting-edge technology to advance precision agriculture in regions where areca cultivation is prominent.

3. Materials and Methods

3.1. Study Area

The study area is located in an areca palm plantation in Beida Town, Wanning City, Hainan Province (18°54′41.66″ N, 110°17′46.01″ E; see Figure 1) and is representative of a typical tropical monsoon climate zone. The mean annual temperature is 23.6 °C, with the coldest month averaging 18.7 °C and the hottest month averaging 28.5 °C. The annual precipitation reaches about 2200 mm, and the yearly sunshine exceeds 1800 h. Situated in hilly terrain, the area is primarily composed of red and sandy loam soils. Wanning City boasts the largest areca plantation area in Hainan Province, covering 18,138 hectares in 2018—equivalent to 16.5% of the province’s total planting area. These climatic and soil conditions offer a favorable environment for areca growth, making the region particularly suitable for research on areca cultivation and extraction.

3.2. Data Acquisition and Processing

In this study, the experimental data were obtained from UAV-based remote sensing imagery. A DJI Phantom 4 Pro V2.0 quadcopter served as the UAV platform, with a total weight of 1.375 kg (including battery and propellers), a maximum horizontal flight speed of 72 km/s, and a maximum flight altitude of 6000 m. The payload was a MicaSense RedEdge-M multispectral camera developed by Micasense (USA). This camera simultaneously captures data across five bands, covering visible, near-infrared, and red-edge wavelengths, as detailed in Table 1. Data acquisition was conducted under favorable lighting conditions and wind speeds below Force 3. Before and after each flight, a calibration reflectance panel was placed on the ground and oriented as vertically as possible to the camera in order to facilitate relative radiometric calibration, thus ensuring accurate reflectance data.

To acquire stable imagery, the flight route was meticulously planned in advance, enabling the UAV to follow a predetermined flight path that covered the entire study area. The flight altitude was set to 60 m, with a cruising speed of 7 m/s; the imagery had a spatial resolution of approximately 4 cm, and the side overlap and forward overlap were 80% and 70%, respectively. Following data collection, the images were mosaicked using the Pix4D Mapper software V4.5.6 and then subjected to geometric correction, radiometric calibration, and clipping as part of the preprocessing workflow [28].

In the field of quantitative remote sensing research, a series of vegetation indices have been proposed for the extraction of tree vegetation. To investigate whether these vegetation indices could be integrated with deep learning models to achieve better segmentation results, this study selected three relevant vegetation indices based on the characteristics of the research area: the ratio vegetation index (RVI) [29], the difference vegetation index (DVI) [30], and the modified soil-adjusted vegetation index (MSAVI) [31]. These indices were calculated separately and incorporated as new spectral features of the training data input into the model. The RVI enhances the radiation difference between vegetation and soil, enabling it to characterize biomass information across varying vegetation coverage. The DVI is highly sensitive to changes in the soil background, making it more effective in distinguishing vegetation from water bodies. The MSAVI reflects information on ground soil and vegetation coverage under the influence of soil background factors, allowing for the accurate identification of areas with low vegetation coverage. The calculation formulas for these three vegetation indices are presented in Table 2. To ensure the separability of spectral features, this study performed normalization when fusing the original data with the vegetation index data.

3.3. UniU-Net

To enhance the automation of areca extraction and reduce the reliance on domain expertise, this study builds upon the classic U-Net framework [32] to propose a newly redesigned deep learning model named UniU-Net, as shown in Figure 2. Given the multispectral nature of the input data, a single convolutional encoder may not fully capture the distinguishing features of areca palms. Although increasing the depth of a single encoder could improve the classification accuracy, doing so also raises the risk of overfitting and introduces structural redundancy.

Grounded in the concept of structural reparameterization, a parallel design can improve the comprehensiveness of feature learning and mitigate overfitting. Moreover, because of the additive nature of convolutions, a parallel layout often delivers faster inference than a serial design. Therefore, UniU-Net incorporates an auxiliary encoder during the encoding phase to further enhance the model’s capacity to learn areca-related features. This auxiliary encoder maintains the same module design and layer depth as the primary encoder, ensuring a balanced structure while providing complementary feature extraction.

The convolutional module in UniU-Net follows the standard U-Net convolutional block design, using a kernel size of 3 × 3. The calculation process of convolution is as shown in Equation (1), where

y_{i j}

represents the output result of the convolution, and

f_{k}

denotes the convolution operation with a kernel size of

k

. Recognizing that a binary classification task such as areca extraction is prone to overfitting, dropout layers have been added to the convolutional modules. Moreover, the conventional ReLU activation has been replaced by LeakyReLU in order to mitigate potential issues with vanishing or exploding gradients during training. The structural combination of the Conv-block is shown in Figure 3a.

y_{i j} = f_{k} (\{X_{i + δ i, j + δ j}\})

(1)

In the decoding phase, a standard U-Net uses skip connections to concatenate multi-scale features from the encoding phase with upsampled features, thereby enhancing the recovery of image details. Numerous studies have validated the effectiveness of these skip connections. Accordingly, UniU-Net retains this strategy. However, at each hierarchical level in the encoding phase, both the primary encoder and the auxiliary encoder produce intermediate feature maps of identical dimensions. Determining which of these features should be used—and in which proportions—becomes crucial for the final result. To address this, UniU-Net introduces a unified attention fusion module (UAFM) that assigns weights to the respective feature maps of the primary and auxiliary encoders during decoding. A detailed explanation of this module can be found in Section 3.4. Ultimately, by applying convolutional and upsampling operations, the decoder restores the features to the original input size and yields per-pixel classification results.

3.4. Unified Attention Fusion Module

The integration of multi-level features is crucial in achieving high segmentation accuracy. In remote sensing imagery, areca palm trees are typically scattered in distribution. The attention mechanism enhances the model’s feature learning for these sparsely distributed areca palm trees through global associations, thereby improving the model’s segmentation capabilities. Drawing on insights from previous research [33], this study introduces a unified attention fusion module (UAFM) that employs spatial attention to effectively merge feature representations from both the primary and auxiliary encoders.

As depicted in Figure 3b, the overall framework of the UAFM applies an attention module to obtain a weight α. This weight is then fused with the input features through pixel-wise multiplication and summation. In more concrete terms,

F_{m a i n}

represents the input features from the primary encoder, and

F_{a u x}

denotes the input features from the auxiliary encoder. The attention module calculates a proportion α for the primary encoder’s features and 1 − α for the auxiliary encoder’s features. These weighted features are summed to yield the final output feature map. The entire process is formulated as in Equations (2) and (3).

α = A t t e n t i o n (F_{m a i n}, F_{a u x})

(2)

F_{o u t} = F_{m a i n} \cdot α + F_{a u x} \cdot (1 - α)

(3)

The attention module is designed as a plug-and-play component that can be replaced with various attention mechanisms—most commonly spatial [34] or channel attention [35]. In this study, we will address the choice of attention mechanism in the experimental section. For illustration, Figure 4 presents the spatial attention approach. Given the input features

F_{m a i n}, F_{a u x} ϵ R^{C \times H \times W}

, the module first computes the mean and maximum values along the channel dimension, yielding four feature maps of size

R^{1 \times H \times W}

. These four maps are then concatenated to form a new feature map of size

R^{4 \times H \times W}

. A convolution operation followed by a sigmoid function is applied to this new map, producing the final attention weights

α ϵ R^{1 \times H \times W}

. Equations (4) and (5) formally describe this procedure.

F_{c a t} = C o n c a t (M a x (F_{m a i n}), M e a n (F_{m a i n}), M a x (F_{a u x}), M e a n (F_{a u x}))

(4)

α = S i g m o i d (C o n v (F_{c a t}))

(5)

4. Experiments

(1) Training Settings: This study implements the network model using the PyTorch 1.7 framework and conducts experiments on an NVIDIA RTX A3000 GPU with 12 GB of memory. As momentum-based stochastic gradient descent (SGD) has proven more effective for model optimization [36], the optimizer is configured with a momentum value of 0.9 and weight decay of

1 \times 10^{- 3}

. The initial learning rate is set to 0.01, the batch size is fixed at 8, and the training process spans a maximum of 50 epochs.

(2) Loss Function: Owing to the class imbalance present in the dataset, the training could be adversely affected if a suitable loss function is not chosen. To address this issue, this study adopts a combined loss function that integrates the Dice loss and cross-entropy loss. The joint formulation can be summarized as

L_{c e} = - \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{K} y_{k}^{(n)} \log {\hat{y}}_{k}^{(n)}

(6)

L_{d i c e} = 1 - \frac{2}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{K} \frac{{\hat{y}}_{k}^{(n)} y_{k}^{(n)}}{{\hat{y}}_{k}^{(n)} + y_{k}^{(n)}}

(7)

L = L_{c e} + L_{d i c e}

(8)

where N denotes the number of samples, and K denotes the number of classes. The terms

y^{(n)}

and

{\hat{y}}^{(n)}

represent the ground-truth label and the predicted outcome, respectively.

(3) Evaluation Metrics: In this experiment, we employ three metrics commonly used in remote sensing segmentation tasks—accuracy (Acc), the intersection over union (IoU), and the F1-score—to evaluate the performance of our model. Each of these metrics is derived from the confusion matrix, which consists of four key components: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).

The formula for the calculation of Acc is expressed as follows:

A c c = \frac{T P + T N}{T P + F P + T N + F N}

(9)

The formula for the calculation of the IoU is expressed as follows:

I o U = \frac{T P}{T P + F P + T N}

(10)

The formula for the calculation of F1 is expressed as follows:

P r e c i s i o n = \frac{T P}{T P + F P}

(11)

R e c a l l = \frac{T P}{T P + F N}

(12)

F 1 = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

(13)

5. Results

5.1. Comparison with Other Methods

In order to assess the performance of the proposed UniU-Net, we compared it against several existing methods, including the classic U-Net [32] and ResUnet, as well as five improved models: MAResUNet [37], ABCNet [38], BANet [39], CREVNet [40], and LiViTNet [41]. To ensure consistent and comparable results, all of these networks were trained using the same training strategy and were not subjected to any form of pre-training.

Table 3 summarizes the extraction accuracy achieved by various methods, while Figure 4 contrasts the segmentation maps generated by different models. Overall, UniU-Net yields the highest segmentation accuracy in terms of the F1-score and IoU, although its accuracy (Acc) is marginally lower than that of U-Net. More specifically, UniU-Net attains an F1-score of 79.20%, surpassing the second-best model, U-Net, by 1.30%. In terms of the IoU, UniU-Net achieves 65.56%, outperforming the next-best model by 1.77%.

From the segmentation predictions illustrated in Figure 5, UniU-Net provides the most coherent results, exhibiting sharper boundaries and more realistic details. In the first and second rows, other methods produce coarser segmentation around tree leaves and branches, whereas UniU-Net aligns more closely with the ground-truth labels. In the third and fourth rows, competing methods display noticeable under-segmentation, resulting in discontinuous and blurry edges. By contrast, UniU-Net maintains a high level of stability, delineating the areca palms with continuous and distinct boundaries. On the other hand, constrained by the low data resolution and limited spectral bands, all models exhibit instances of individual trees being indistinguishable in areas where they are densely distributed. Comparatively, UniU-Net demonstrates the smallest amount of adhesion and offers better differentiation.

The parameters and inference speed of the model are also crucial metrics in practical applications. We compared the parameter counts and inference speeds of the different methods, as detailed in Table 4. The results indicate that the UniU-Net proposed in this study, despite having the largest number of parameters, achieves the second-best inference speed at 53 FPS. Benefiting from structural reparameterization theory, maintaining the same structure for the auxiliary encoder as the primary encoder not only reduces the risk of model overfitting but also optimizes the inference speed during prediction, allowing UniU-Net to stand out.

Overall, UniU-Net precisely identifies and delineates the true edges of the areca palms, offering a clear advantage in terms of segmentation accuracy compared with existing approaches.

5.2. Ablation Study

To determine the optimal hyperparameters for training, we conducted ablation experiments on the momentum and weight decay values in the training method, with the experimental results illustrated in Figure 6. The results demonstrate that UniU-Net achieves a significant advantage in the segmentation of betel nut trees when the momentum value is between 0.8 and 0.9. Additionally, a weight decay value of 0.001 yields superior overall training performance compared to 0.0001. Consequently, this study selects a momentum value of 0.9 and weight decay of 0.001 as the hyperparameters for the SGD optimizer during training.

Taking into account the spectral characteristics of areca palm trees and research in the field of quantitative remote sensing, this study calculated the DVI, RVI, and MSAVI vegetation indices as additional spectral features for the initial data and conducted ablation experiments on the model performance. The experimental results are presented in Table 5. When using only the spectral features of the vegetation indices as input data, the overall segmentation performance of the model was relatively poor. This was attributed to the minimal absolute quantity of spectral features, which provided limited learning features for the model. However, after integrating the spectral features of the vegetation indices with the original data, UniU-Net showed slight improvements in both the F1 and IoU metrics. This enhancement is a result of the additional feature information provided by the spectral features of the vegetation indices. However, the extent of the performance improvement was not significant.

To evaluate the performance gain due to UniU-Net’s improved design components, we conducted a series of ablation experiments. These experiments primarily focused on examining how the auxiliary encoder and the choice of attention mechanism within the UAFM module affected the overall model performance.

Table 6 presents a comparison of the evaluation metrics obtained after the ablation of different components, while Figure 5 shows the resulting areca segmentation maps. As is evident from the table, incorporating the auxiliary encoder into U-Net yields improvements of 0.35% in the F1-score and 0.63% in the IoU, underscoring the auxiliary encoder’s clear contribution to the model’s overall performance. After adding the UAFM, the use of channel attention resulted in a slight performance decrease; however, employing spatial attention led to an additional gain, improving the F1 and IoU by 0.20% and 0.12%, respectively, over the result achieved with only the auxiliary encoder.

A possible explanation for this outcome is that the spatial resolution of the UAV images—captured for the areca dataset—is relatively high compared with their spectral resolution. Consequently, spatial attention better captures the relationships among individual trees in the scene. For this reason, we opt for spatial rather than channel attention within the UAFM. From the three sets of comparative segmentation results in Figure 7, it is evident that integrating the auxiliary encoder and UAFM significantly enhances the model’s ability to learn and accurately delineate the fine-grained structures of areca branches and leaves, resulting in more complete and clearer segmented boundaries.

6. Discussion

In this study, the proposed UniU-Net model demonstrated exceptional performance in segmenting areca palms from remote sensing images, particularly in challenging environments, where traditional methods often struggle. By incorporating an auxiliary encoder and the UAFM, UniU-Net significantly improves the ability to capture detailed morphological features of areca palms and excels in preserving fine structural details along the boundaries of individual trees. This enhancement in performance is evident in the model’s superiority over existing models, such as U-Net and ResUnet, as measured by both the F1 and the IoU metrics. The combination of the auxiliary encoder, UAFM, and spatial attention mechanism enhances the model’s feature extraction capabilities and improves its ability to capture spatial relationships between individual trees.

Despite these successes, several limitations must be addressed. First, the relatively small dataset used in this study limits the model’s generalization ability. To improve the robustness, future work could expand the study area to include more diverse areca palm distributions and morphological characteristics. Additionally, the use of long temporal sequences to observe pathological changes, such as yellowing disease, would increase the model’s practical applicability for disease detection. Second, while UniU-Net performed well on the areca palm dataset, its adaptability to other types of remote sensing data needs further evaluation. Optical satellite images (e.g., Landsat, Sentinel-2) provide richer spectral information that could enhance tree extraction by offering additional spectral features, while SAR data can penetrate cloud cover and vegetation canopies to capture structural information, such as tree heights. The integration of both optical and SAR data could offer more comprehensive support for tree extraction and monitoring applications.

Moreover, recent advancements in network architectures, such as Mamba, have shown strong compatibility with long temporal and multi-modal data, which could be beneficial for future studies. To further extend the application of UniU-Net, the combination of transfer learning techniques and multi-modal data fusion could enhance the model’s performance with fewer computational resources. In subsequent research, it will be valuable to explore additional attention mechanisms, as well as multi-modal data fusion techniques, to improve the accuracy and robustness of areca palm extraction models, ultimately advancing their practical use in remote sensing applications.

7. Conclusions

This study presents the UniU-Net model, which demonstrates significant potential for the high-precision segmentation of areca palms in remote sensing imagery. By incorporating an auxiliary encoder and a unified attention fusion module (UAFM), the model provides a novel solution to the challenges of accurately extracting tree boundaries and spatial relationships, particularly in complex environments. The experimental results show that UniU-Net achieves superior performance over existing models, marking a substantial advancement in automated tree extraction techniques. The contributions of this research extend beyond the immediate task of areca palm segmentation. The model’s innovative use of auxiliary encoding and attention mechanisms offers valuable insights that could enhance the segmentation accuracy in other remote sensing applications. As remote sensing datasets continue to expand, UniU-Net has the potential to be applied to a broader range of tree species and environmental settings, providing a foundation for future advancements in automated tree extraction and land cover classification.

In future work, three key areas offer the potential to further enhance the model’s capabilities and applications. Firstly, expanding the dataset is considered essential to improve the model’s generalization ability. By including a greater number of diverse geographic regions, tree species, and environmental conditions, the model’s robustness and applicability across different scenarios will be strengthened. Secondly, the fusion of multi-modal remote sensing data, particularly the combination of optical imagery with SAR data, will provide richer spatial and structural information, thus enhancing the model’s performance and accuracy. Finally, the incorporation of more advanced models and techniques, such as transfer learning and refined attention mechanisms, will improve the efficiency and scope of the model’s applications, enabling it to tackle more complex remote sensing tasks.

In conclusion, this study offers a novel and effective approach for areca palm segmentation in remote sensing imagery. The findings provide a valuable contribution to the field of remote sensing and automated tree extraction. Moving forward, with continued research and refinement, UniU-Net has the potential to significantly improve the accuracy, efficiency, and applicability of remote sensing technologies across a variety of environmental and agricultural monitoring tasks.

Author Contributions

Conceptualization, S.W. and J.Y.; methodology, Y.W.; software, Y.W., Z.Z. and B.L.; validation, Y.W., Z.Z. and B.L.; resources, S.W. and H.L.; data curation, Y.W.; writing—original draft preparation, Z.Z., B.L. and Y.W.; writing—review and editing, H.L.; visualization, H.L. and Y.W.; supervision, S.W. and J.Y.; project administration, S.W., H.L. and J.Y.; funding acquisition, S.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Natural Science Foundation of Hainan Province of China (Grant E3D1HN03), China; the talent introduction program Youth Project of the Chinese Academy of Sciences (E43302020D, E2Z10501), China; and the innovation group project of the Key Laboratory of Remote Sensing and Digital Earth of the Chinese Academy of Sciences (E33D0201-5), China.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this study.

References

Peng, W.; Liu, Y.-J.; Wu, N.; Sun, T.; He, X.-Y.; Gao, Y.-X.; Wu, C.-J. Areca catechu L. (Arecaceae): A Review of Its Traditional Uses, Botany, Phytochemistry, Pharmacology and Toxicology. J. Ethnopharmacol. 2015, 164, 340–356. [Google Scholar] [CrossRef] [PubMed]
Koeslulat, E.E.; Yulni, T.; Njurumana, G.N. Product Adaptation to Enhance the Market of Areca Nut (Areca Catechu). AIP Conf. Proc. 2024, 2957, 050043. [Google Scholar]
Wang, Z.; Guo, Z.; Luo, Y.; Ma, L.; Hu, X.; Chen, F.; Li, D. A Review of the Traditional Uses, Pharmacology, and Toxicology of Areca Nut. Phytomedicine 2024, 134, 156005. [Google Scholar] [CrossRef]
Jin, Y.; Guo, J.; Ye, H.; Zhao, J.; Huang, W.; Cui, B. Extraction of Arecanut Planting Distribution Based on the Feature Space Optimization of PlanetScope Imagery. Agriculture 2021, 11, 371. [Google Scholar] [CrossRef]
Giri, C.; Pengra, B.; Long, J.; Loveland, T.R. Next Generation of Global Land Cover Characterization, Mapping, and Monitoring. Int. J. Appl. Earth Obs. Geoinf. 2013, 25, 30–37. [Google Scholar] [CrossRef]
Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A Survey of Deep Neural Network Architectures and Their Applications. Neurocomputing 2017, 234, 11–26. [Google Scholar] [CrossRef]
He, J.; Deng, Z.; Qiao, Y. Dynamic Multi-Scale Filters for Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3562–3572. [Google Scholar]
Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
Shi, W.; Zhang, M.; Ke, H.; Fang, X.; Zhan, Z.; Chen, S. Landslide Recognition by Deep Convolutional Neural Network and Change Detection. IEEE Trans. Geosci. Remote Sens. 2021, 59, 4654–4672. [Google Scholar] [CrossRef]
Suh, J.W.; Zhu, Z.; Zhao, Y. Monitoring Construction Changes Using Dense Satellite Time Series and Deep Learning. Remote Sens. Environ. 2024, 309, 114207. [Google Scholar] [CrossRef]
He, Z.; Zhang, L. Multi-Adversarial Faster-RCNN for Unrestricted Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6668–6677. [Google Scholar]
Zhu, X.; Pang, J.; Yang, C.; Shi, J.; Lin, D. Adapting Object Detectors via Selective Cross-Domain Alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 687–696. [Google Scholar]
Li, W.; Fu, H.; Yu, L.; Cracknell, A. Deep Learning Based Oil Palm Tree Detection and Counting for High-Resolution Remote Sensing Images. Remote Sens. 2016, 9, 22. [Google Scholar] [CrossRef]
Ulku, I.; Akagunduz, E.; Ghamisi, P. Deep Semantic Segmentation of Trees Using Multispectral Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7589–7604. [Google Scholar] [CrossRef]
Guo, M.; Yu, Z.; Xu, Y.; Huang, Y.; Li, C. ME-Net: A Deep Convolutional Neural Network for Extracting Mangrove Using Sentinel-2A Data. Remote Sens. 2021, 13, 1292. [Google Scholar] [CrossRef]
Gibril, M.B.A.; Shafri, H.Z.M.; Al-Ruzouq, R.; Shanableh, A.; Nahas, F.; Al Mansoori, S. Large-Scale Date Palm Tree Segmentation from Multiscale UAV-Based and Aerial Images Using Deep Vision Transformers. Drones 2023, 7, 93. [Google Scholar] [CrossRef]
Qin, B.; Sun, F.; Shen, W.; Dong, B.; Ma, S.; Huo, X.; Lan, P. Deep Learning-Based Pine Nematode Trees’ Identification Using Multispectral and Visible UAV Imagery. Drones 2023, 7, 183. [Google Scholar] [CrossRef]
Pu, R. Mapping Tree Species Using Advanced Remote Sensing Technologies: A State-of-the-Art Review and Perspective. J. Remote Sens. 2021, 2021, 9812624. [Google Scholar] [CrossRef]
Tariq, A.; Yan, J.; Gagnon, A.S.; Riaz Khan, M.; Mumtaz, F. Mapping of Cropland, Cropping Patterns and Crop Types by Combining Optical Remote Sensing Images with Decision Tree Classifier and Random Forest. Geo-Spat. Inf. Sci. 2023, 26, 302–320. [Google Scholar] [CrossRef]
Lin, H.; Liu, X.; Han, Z.; Cui, H.; Dian, Y. Identification of Tree Species in Forest Communities at Different Altitudes Based on Multi-Source Aerial Remote Sensing Data. Appl. Sci. 2023, 13, 4911. [Google Scholar] [CrossRef]
Jiang, H.; Peng, M.; Zhong, Y.; Xie, H.; Hao, Z.; Lin, J.; Ma, X.; Hu, X. A Survey on Deep Learning-Based Change Detection from High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 1552. [Google Scholar] [CrossRef]
Osco, L.P.; Marcato Junior, J.; Marques Ramos, A.P.; De Castro Jorge, L.A.; Fatholahi, S.N.; De Andrade Silva, J.; Matsubara, E.T.; Pistori, H.; Gonçalves, W.N.; Li, J. A Review on Deep Learning in UAV Remote Sensing. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102456. [Google Scholar] [CrossRef]
Park, H.G.; Yun, J.P.; Kim, M.Y.; Jeong, S.H. Multichannel Object Detection for Detecting Suspected Trees with Pine Wilt Disease Using Multispectral Drone Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 8350–8358. [Google Scholar] [CrossRef]
Ferreira, M.P.; Almeida, D.R.A.D.; Papa, D.D.A.; Minervino, J.B.S.; Veras, H.F.P.; Formighieri, A.; Santos, C.A.N.; Ferreira, M.A.D.; Figueiredo, E.O.; Ferreira, E.J.L. Individual Tree Detection and Species Classification of Amazonian Palms Using UAV Images and Deep Learning. For. Ecol. Manag. 2020, 475, 118397. [Google Scholar] [CrossRef]
Yan, S.; Jing, L.; Wang, H. A New Individual Tree Species Recognition Method Based on a Convolutional Neural Network and High-Spatial Resolution Remote Sensing Imagery. Remote Sens. 2021, 13, 479. [Google Scholar] [CrossRef]
Ye, Z.; Wei, J.; Lin, Y.; Guo, Q.; Zhang, J.; Zhang, H.; Deng, H.; Yang, K. Extraction of Olive Crown Based on UAV Visible Images and the U2-Net Deep Learning Model. Remote Sens. 2022, 14, 1523. [Google Scholar] [CrossRef]
Lou, X.; Huang, Y.; Fang, L.; Huang, S.; Gao, H.; Yang, L.; Weng, Y.; Hung, I.-K. Measuring Loblolly Pine Crowns with Drone Imagery through Deep Learning. J. For. Res. 2022, 33, 227–238. [Google Scholar] [CrossRef]
Wei, P.; Xu, X.G.; Li, Z.; Yang, G.; Li, Z.H.; Feng, H.K.; Chen, G.; Fan, L.; Wang, Y.; Liu, S. Remote sensing estimation of nitrogen content in summer maize leaves based on multispectral images of UAV. Trans. Chin. Soc. Agric. Eng. 2019, 35, 126–133. [Google Scholar] [CrossRef]
Huete, A.R.; Jackson, R.D. Suitability of spectral indices for evaluating vegetation characteristics on arid rangelands. Remote Sens. Environ. 1987, 23, 213–232. [Google Scholar] [CrossRef]
Jordan, C.F. Derivation of leaf area index from quality of light on the forest floor. Ecology 1969, 50, 663–666. [Google Scholar] [CrossRef]
Qi, J.; Chehbouni, A.; Huete, A.R.; Kerr, Y.H.; Sorooshian, S. A modified soil adjusted vegetation index. Remote Sens. Environ. 1994, 48, 119–126. [Google Scholar] [CrossRef]
Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer International Publishing: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
Peng, J.; Liu, Y.; Tang, S.; Hao, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Yu, Z.; Du, Y.; et al. PP-LiteSeg: A Superior Real-Time Semantic Segmentation Model. arXiv 2022, arXiv:2204.02681. [Google Scholar]
Jaderberg, M.; Simonyan, K.; Zisserman, A. Spatial transformer networks. arXiv 2015, arXiv:1506.02025. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin transformer embedding UNet for remote sensing image semantic segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408715. [Google Scholar] [CrossRef]
Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8009205. [Google Scholar] [CrossRef]
Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Wang, L.; Atkinson, P.M. ABCNet: Attentive Bilateral Contextual Network for Efficient Semantic Segmentation of Fine-Resolution Remotely Sensed Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 181, 84–98. [Google Scholar] [CrossRef]
Wang, L.; Li, R.; Wang, D.; Duan, C.; Wang, T.; Meng, X. Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images. Remote Sens. 2021, 13, 3065. [Google Scholar] [CrossRef]
Zheng, K.; Li, Q.; Wang, Z.; An, J.; Huang, F.; Liu, M.; Bao, S. CREVNet: A Transformer and CNN-based network for accurate segmentation of ice shelf crevasses. IEEE Geosci. Remote Sens. Lett. 2024, 21, 2001105. [Google Scholar] [CrossRef]
Tong, L.; Li, T.; Zhang, Q.; Zhang, Q.; Zhu, R.; Du, W.; Hu, P. LiViT-Net: A U-Net-like, lightweight Transformer network for retinal vessel segmentation. Comput. Struct. Biotechnol. J. 2024, 24, 213–224. [Google Scholar] [CrossRef]

Figure 1. Study area and areca palm distribution.

Figure 2. Architecture of UniU-Net.

Figure 3. (a) Architecture of the Conv-block. (b) Architecture of the UAFM. (c) Structure of the attention module.

Figure 4. The process of the spatial attention mechanism.

Figure 5. Examples of semantic segmentation results of different methods. (a) U-Net. (b) ResU-Net. (c) MAResU-Net. (d) ABCNet. (e) BANet. (f) CREVNet. (g) LiViTNet. (h) UniU-Net (ours).

Figure 6. Comparison of results regarding hyperparameter settings.

Figure 7. Comparison of segmentation results for different components. (a) U-Net. (b) U-Net + Aux. (c) U-Net + Aux. + CA. (d) UniU-Net.

Table 1. Parameters of MicaSense RedEdge-M multispectral.

Waveband	Central Wavelength/nm	Spectral Bandwidth/nm	Panel Reflectivity
Blue	475	20	0.49
Green	560	20	0.49
Red	668	10	0.49
Near-infrared	840	40	0.49
Red edge	717	10	0.49

Table 2. Formulas for vegetation indices.

Vegetation Index	Formula
RVI	$B_{N I R} / B_{R}$
DVI	$B_{N I R} - B_{R}$
MSAVI	$\frac{1}{2} [(2 B_{N I R} + 1) - \sqrt{{(2 B_{N I R} + 1)}^{2} - 8 (B_{N I R} - B_{R})}]$

B_{N I R}

,

B_{R}

represent the spectral NIR band and red band.

Table 3. Comparison of segmentation results of different methods.

Method	Acc (%)	F1 (%)	IoU (%)
U-Net	92.33	77.90	63.79
ResUnet	91.72	76.18	61.52
MAResUNet	92.20	77.57	63.35
ABCNet	91.46	75.62	60.79
BANet	91.07	73.75	58.42
CREVNet	91.51	76.34	61.73
LiViTNet	91.59	77.65	63.47
UniU-Net (ours)	92.30	79.20	65.56

Table 4. Number of parameters and inference speeds of different methods.

Method	Param. (M)	Speed (FPS)
ResU-Net	23.1	55
MAResU-Net	26.3	33
ABCNet	13.5	48
BANet	12.7	41
CREVNet	1.3	49
LiViTNet	6.8	39
UniU-Net (ours)	33.5	53

Table 5. Comparison of segmentation results on different datasets s.

Dataset	Acc (%)	F1 (%)	IoU (%)
origin	92.30	79.20	65.56
index	91.24	75.25	60.32
merge	92.28	79.23	65.61

Table 6. Ablation results for the different components.

Method	Auxiliary Encoder	UAFM		F1 (%)	IoU (%)
Method	Auxiliary Encoder	SA	CA	F1 (%)	IoU (%)
U-Net				77.90	63.79
U-Net + Aux.	√			78.25	64.42
U-Net + Aux. + CA	√		√	77.74	63.58
UniU-Net (ours)	√	√		78.45	64.54

SA represents spatial attention; CA represents channel attention.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, S.; Wang, Y.; Yue, J.; Liang, H.; Zhang, Z.; Li, B. UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery. Appl. Sci. 2025, 15, 4813. https://doi.org/10.3390/app15094813

AMA Style

Wang S, Wang Y, Yue J, Liang H, Zhang Z, Li B. UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery. Applied Sciences. 2025; 15(9):4813. https://doi.org/10.3390/app15094813

Chicago/Turabian Style

Wang, Shaohua, Yan Wang, Jianwei Yue, Haojian Liang, Zihan Zhang, and Bojun Li. 2025. "UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery" Applied Sciences 15, no. 9: 4813. https://doi.org/10.3390/app15094813

APA Style

Wang, S., Wang, Y., Yue, J., Liang, H., Zhang, Z., & Li, B. (2025). UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery. Applied Sciences, 15(9), 4813. https://doi.org/10.3390/app15094813

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

UniU-Net: A Unified U-Net Deep Learning Approach for High-Precision Areca Palm Segmentation in Remote Sensing Imagery

Abstract

1. Introduction

2. Related Works

3. Materials and Methods

3.1. Study Area

3.2. Data Acquisition and Processing

3.3. UniU-Net

3.4. Unified Attention Fusion Module

4. Experiments

5. Results

5.1. Comparison with Other Methods

5.2. Ablation Study

6. Discussion

7. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI