Article

Attention-Driven and Hierarchical Feature Fusion Network for Crop and Weed Segmentation with Fractal Dimension Estimation

Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro, 1-gil, Jung-gu, Seoul 04620, Republic of Korea
*
Author to whom correspondence should be addressed.
Fractal Fract. 2025, 9(9), 592; https://doi.org/10.3390/fractalfract9090592
Submission received: 5 August 2025 / Revised: 3 September 2025 / Accepted: 8 September 2025 / Published: 10 September 2025

Abstract

In precision agriculture, semantic segmentation enables precise disease monitoring, targeted herbicide application, and accurate crop–weed differentiation. This enhances yield; reduces the overuse of herbicides, water, and fertilizers; lowers labor costs; and promotes sustainable farming. Deep-learning-based methods are particularly effective for crop and weed segmentation and achieve promising results. Typically, segmentation is performed using homogeneous data (the same dataset is used for training and testing). However, previous studies on crop and weed segmentation with heterogeneous data (i.e., different datasets for training and testing) remain inaccurate. The proposed framework uses patch-based augmented limited training data within a heterogeneous environment to resolve the problems of degraded accuracy and the use of extensive data for training. We propose an attention-driven and hierarchical feature fusion network (AHFF-Net) comprising a flow-constrained convolutional block, a hierarchical multi-stage fusion block, and an attention-driven feature enhancement block. These blocks independently extract diverse fine-grained features and enhance the learning capabilities of the network. AHFF-Net is also combined with an open-source large language model (LLM)-based pesticide recommendation system built on Large Language Model Meta AI (LLaMA). Additionally, a fractal dimension estimation method is incorporated into the system to provide valuable insights into the spatial distribution characteristics of crops and weeds. We conducted experiments using three publicly available datasets: BoniRob, the Crop/Weed Field Image Dataset (CWFID), and Sunflower. In each experiment, we trained on one dataset and tested on another, and then reversed the pairing in a second experiment. The highest mean intersection over union (mIOU) of 65.3% and F1 score of 78.7% were achieved when training on the BoniRob dataset and testing on CWFID. This demonstrates that our method outperforms other state-of-the-art approaches.

1. Introduction

The global increase in food demand is driven by the growth of the world population, which is projected to exceed nine billion by 2050 [1]. This growth necessitates efforts to enhance the food supply by increasing crop productivity. In recent years, deep learning and computer vision techniques have advanced rapidly, playing key roles in applications such as medical imaging, image enhancement, building monitoring, biomedical engineering, and precision agriculture [2,3,4,5,6,7]. In parallel, fractal-related studies have also gained attention in various deep-learning-based computer vision tasks. Although many studies employ similar fractal dimension (FD) estimation techniques, their implementation and interpretation vary significantly across domains such as medical image analysis and agricultural applications [8,9,10,11]. Currently, field phenotyping systems for precision agriculture constitute a prominent method to address these challenges and increase crop yields [12]. Field phenotyping uses many advanced tools such as wearable sensors [13], ground robots [14], and computer vision technologies such as semantic segmentation [15]. Semantic-segmentation-based plant phenotyping and semantic segmentation of crops and weeds have been performed in previous studies [16,17]. Conventional practices, which rely on extensive manpower, incur high labor costs, overuse fertilizers and herbicides, and depend on blanket spraying, have proven ineffective in satisfying the growing demand for crops. Some prior deep learning methods, such as CG-Net and Unet++, have also shown low performance. To overcome these problems, it is important to decrease manpower and utilize advanced methods of spraying herbicides [18].
In precision agriculture, robots and wearable sensor devices are used to spray herbicides on targeted weeds, monitor farms, and collect data to mitigate the factors that affect crop yield. Such farming problems are mainly resolved by accurately detecting crops and weeds [19]. Two methods are widely used for accurately detecting crops and weeds. The first is box-based detection [20,21], which has the drawback of overlooking certain regions of crops and weeds. The other is pixel-based detection and is commonly known as semantic segmentation [22,23]. It precisely detects regions of crop and weed objects at the pixel level. Therefore, it is necessary to correctly differentiate the segments of crops and weeds at the pixel level because these crop and weed objects have irregular shapes. Figure 1 compares the shapes of “sugar beets” (the BoniRob dataset) [24], the Crop/Weed Field Image Dataset (CWFID) [25], and the Sunflower dataset [26].
Many machine learning (ML)- and convolutional neural network (CNN)-based methods have been used for semantic segmentation. These typically require large amounts of training data for efficient training. When considering a large training dataset, a significant number of annotations are necessary. These should be generated by agricultural experts. This process requires considerable time and effort. Owing to these constraints, sufficient training data may not be available in certain cases. This can result in a decrease in semantic segmentation accuracy. To address this issue, researchers have used small training datasets with promising outcomes. In refs. [27,28], classification and segmentation problems in the field of healthcare were addressed with limited training data, yielding satisfactory results. In the agricultural sector, previous research [22] has employed a similar concept of small training data. In general, better testing results can be achieved with homogeneous data (using the same dataset for training and testing). However, the performance tends to decrease significantly in heterogeneous environments (using different datasets for training and testing). The authors of [22] demonstrated that the concept of small training data effectively contributed to achieving good testing performance for crops and weeds even in heterogeneous environments. In imagery data, factors like complex backgrounds, lighting, and crop–weed density can affect a method's performance, leading to poor feature extraction and lower accuracy [29]. These challenges emphasize the need to develop a model that can effectively segregate crops and weeds even in the presence of complex backgrounds and heterogeneous environments. We propose a novel approach, the attention-driven and hierarchical feature fusion network (AHFF-Net), to segregate crops and weeds using the concept of patch-based augmented small training data. This method effectively detects regions of crops and weeds in complex backgrounds and heterogeneous environments.

1.1. Considering Homogeneous Dataset Environments

The methods in this category typically exhibit a higher performance because the data distribution for training and testing originates from the same dataset. Numerous studies have been conducted across various domains using homogeneous datasets, although not specifically for the agricultural sector. In this subsection, we classify crop and weed semantic segmentation into three categories: ML-based methods, single-stage CNN-based methods, and multi-stage CNN-based methods.

1.1.1. ML-Based Methods

The conventional ML-based methods were popular prior to the advancements and widespread use of deep learning for crop and weed segmentation. In ref. [30], principal component analysis (PCA) and a random forest (RF) algorithm were used to optimize feature selection for effective weed detection and maize crop differentiation. A vegetation detection system using Markov random fields (MRFs) for smoothing and random forest classification (RFC) was proposed in [31], enabling precise weed identification and robotic spraying. Another study [32] detected four common weed species in sugar beet fields using shape features evaluated with an artificial neural network (ANN) and a support vector machine (SVM). An automated method [33] used support vector data description (SVDD) and color indices to segregate maize crops and weeds, applying excess green [34] and Otsu thresholding for binary mask generation. Morphological operations and feature extraction methods like local binary patterns and Gabor filters [35] further improved accuracy through classifier fusion. Zhao et al. [36] also developed an accurate SVM-based system for cabbage–weed identification with real-time spraying, achieving efficient pesticide usage.

1.1.2. Single-Stage CNN-Based Methods

The single-stage CNN method proposed in [37] utilizes data augmentation by extracting patches, normalizing them, and feeding them into a CNN. Furthermore, using the segmentation results from a ResNet-50-based CNN, the study [38] proposed a blob extraction and region of interest (ROI) technique. These ROIs were then classified as crops or weeds using a VGG-16 [39] two-class classifier and achieved better results. A modified U-Net with diverse augmentation strategies enhanced pixel-level classification (PLC) in [40]. This approach effectively addressed the challenges of limited training data and multiple data combinations for augmentation methods to achieve a higher performance. Another study introduced DeepVeg [41], which segmented healthy and damaged crops and weeds using semantic masks, effectively learning from class imbalance and overlap. In ref. [42], the authors modified Enet [43] and SegNet [44] to develop a deep learning model for crop and weed segmentation with high accuracy. They replaced SegNet convolutional layers (CLs) with residual blocks, which improved both precision and efficiency by reducing inference time. A U-Net with ResNet-50 and a conditional random field (CRF) refined segmentation from multispectral images in [45], achieving robustness but with a high computational cost. It combined RGB and filtered near-infrared (NIR) inputs and applied the normalized difference vegetation index (NDVI) for normalization. In the study [46], U-Net [47] and Unet++ [48] were employed for early-stage weed detection; however, challenges such as missed detection of small objects, low contrast, and high computational cost persisted. To address these issues, multi-stage CNN-based methods have been proposed.

1.1.3. Multi-Stage CNN-Based Methods

A multi-stage MTS-CNN method [49] uses stage-wise training with two models based on the U-Net architecture [47] and its modified version [50], resulting in improved segmentation performance. The cascaded encoder–decoder network (CED-Net) [51], another U-Net-based multi-stage model, separates weed and crop segmentation into different stages to improve performance. A two-stage model in [52] uses a lightweight U-Net for object–background separation and a VGG-16 backbone for detailed multiclass segmentation. Although these methods exhibit high accuracies in homogeneous dataset environments, they experience performance degradation in heterogeneous dataset environments.

1.2. Considering Heterogeneous Dataset Environments

In heterogeneous dataset environments, crop and weed segmentation is grouped further into two types: CNN with image-based augmentation (IMBA) methods and CNN with patch-based augmentation methods. To our knowledge, only one study has utilized a CNN with the IMBA method to address crop and weed semantic segmentation using heterogeneous data in the agricultural sector.

1.2.1. CNN with Image-Based Augmentation Methods

In ref. [22], researchers proposed a framework for heterogeneous data environments by utilizing IMBA methods for CNNs. Initially, Dataset-1 was used, with 70% of its data allocated for training using a conventional semantic segmentation model. After the model was trained and saved, Dataset-2 was pre-processed using the Reinhard transformation to adjust its visual properties to resemble those of Dataset-1. The transformed Dataset-2 was then split into two equal parts: 50% for the small training data and 50% for testing. They selected one training dataset from a small training data split, applied IMBA, and fine-tuned the model using augmented training data. Finally, semantic segmentation of crops and weeds was performed using the testing split of Dataset-2. This study demonstrated the applicability of the framework to heterogeneous datasets.

1.2.2. CNN with Patch-Based Augmentation Methods

Unlike ref. [22], our study employs patch-based augmentation and proposes AHFF-Net. It is a deep-learning-based modified version of U-Net designed to improve the semantic segmentation results for crops and weeds in heterogeneous environments. The strengths and weaknesses of previous studies and those of the proposed model are outlined in Table 1. The key contributions of this study are as follows:
  • This study proposes AHFF-Net, which incorporates patch-based augmentation to improve the accuracy compared to previous methods. This method effectively addresses the challenges of a heterogeneous dataset environment for semantic segmentation of crops and weeds.
  • AHFF-Net includes a progressive encoder-stage refinement block (PERB) as part of the encoder. The PERB enhances feature extraction by capturing more sophisticated details such as the edges and textures of crops and weeds through deeper convolutions while preserving the feature map size. Additionally, the PERB improves low-level feature retention and helps prevent the loss of critical region-level information on crops and weeds. This enables the progressive refinement of spatial features before downsampling at each encoder stage. This is particularly favorable in heterogeneous field environments, where the high intraclass variability (e.g., varying crop textures) and low interclass separability (e.g., weeds replicating crop morphology) challenge conventional encoders. By enriching the representation before resolution reduction, PERB preserves the fine-grained cues essential for distinguishing visually similar classes. The symmetrical design also facilitates a smoother gradient flow and stabilizes the training. Thereby, it improves the generalization of the model to unseen or complex conditions that agricultural segmentation tasks generally involve.
  • AHFF-Net employs a hierarchical multi-stage fusion block (HMFB) to capture and combine diverse features across multiple semantic levels. Each stage within the HMFB independently learns unique representations by focusing on fine-grained spatial details and high-level contextual information. These features are combined hierarchically. This results in a rich multiscale representation that strengthens the model’s capability to handle variations in object size, shape, and appearance during the semantic segmentation of crops and weeds in heterogeneous environments. Moreover, the HMFB includes residual connections that help preserve low-level spatial details that are generally lost in deeper layers. These low-level details are important for accurately identifying the small or thin structures of crops and weeds, particularly when their visual patterns differ across heterogeneous environments.
  • The proposed attention-driven feature enhancement block (AFEB) in the decoder of AHFF-Net uses an attention mechanism to focus on critical regions of crops and weeds. It thereby enhances the segmentation performance. This functionality is particularly advantageous in heterogeneous environments because the attention mechanism adaptively highlights the most informative spatial features while suppressing irrelevant or noisy background information. By selectively emphasizing discriminative features, the AFEB contributes to more accurate semantic segmentation across field conditions. Furthermore, this framework integrates an FD estimation component to extract critical information on the spatial distribution patterns of crops and weeds. AHFF-Net is also combined with an open-source large language model (LLM)-based pesticide recommendation system built on Large Language Model Meta AI (LLaMA). To support transparency and reproducibility, the implementation of AHFF-Net has been made publicly available on GitHub (https://github.com/) [53].
Table 1. Summarized comparisons of the proposed and related works on crop and weed segmentation.
Category | Method type | Method | Strengths | Limitations
Considering homogeneous dataset | ML-based | RF algorithm with PCA was used to distinguish maize crop from weeds [30] | Effective use of hyperspectral data by leveraging rich spectral information, particularly red-edge NIR bands, to improve weed–crop discrimination | This method was validated only under controlled conditions, not in real field scenarios
Considering homogeneous dataset | ML-based | This method used both RFC- and MRF-based vegetation detection [31] | It detected both local and object-based features and was also evaluated on a real sugar beet field | For image acquisition, the system requires a specific setup (e.g., a four-channel RGB+NIR camera and artificial halogen lighting), which limits its flexibility in varying field environments
Considering homogeneous dataset | ML-based | ANN and SVM were evaluated based on shape features [32] | Focusing on shapes and patterns, it precisely detected the weeds and crops | In a sugar beet field, it was evaluated on only four types of common weeds
Considering homogeneous dataset | ML-based | Background was separated from crops and weeds, and further classified using SVDD [33] | Reduced computation time by using color index features instead of shape-based features | Threshold values may change and become inaccurate if the colors of crops and weeds vary from green
Considering homogeneous dataset | ML-based | This approach was based on a multiple classifier system (MCS) [35] | It showed better performance with selection-based MCS, which utilized multiple classifiers instead of a single classifier | The model became too heavy due to the combination of multiple classifiers
Considering homogeneous dataset | ML-based | SVM with radial basis function (RBF) kernel [36] | Real-time applicability achieved through integrated image acquisition, decision-making, and nozzle control for in-field operation | Tiny cabbage plants were often missed by the targeted pesticide spraying mechanism, reducing overall detection accuracy
Considering homogeneous dataset | Single-stage CNN-based | CNN model with ROI and two-class classifiers [37] | ROI-based accurate segmentation method for crops and weeds | Light variations across different datasets can reduce overall accuracy
Considering homogeneous dataset | Single-stage CNN-based | A modified U-Net with data augmentations [40] | This study compared various input sizes for models and augmentation techniques to identify the optimal learning methods for PLC | Crops and weeds that are too small may not be detected in the augmented patches
Considering homogeneous dataset | Single-stage CNN-based | DeepVeg segmentation focused on damaged crops alongside healthy crops and weeds [41] | This lightweight model effectively addresses complex backgrounds and imbalanced classes | Image segmentation is challenging due to unclear boundaries, poor lighting, low contrast, and limited boundary information, resulting in errors and imprecise region definitions
Considering homogeneous dataset | Single-stage CNN-based | A modified Enet- and SegNet-based model taking 14 input channels [42] | 14 transformed images were concatenated as input to improve segmentation performance and reduce environmental impact | Generating 14 RGB-based channels increased computational costs and risked overfitting due to isolated features
Considering homogeneous dataset | Single-stage CNN-based | U-Net with a ResNet-50 backbone and CRF as post-processing [45] | The U-Net with a ResNet-50 backbone and CRF method exhibited enhanced performance across diverse environmental images | Processing took longer because the relationships between all pixels needed to be analyzed
Considering homogeneous dataset | Single-stage CNN-based | This method utilizes the conventional U-Net and U-Net++ architectures with resizing [46] | Weeds were detected during their early growth stages to enable timely and effective intervention | The study utilized a small dataset, limiting its representation of the diversity and complexity of real-world agricultural environments
Considering homogeneous dataset | Multi-stage CNN-based | A multi-stage MTS-CNN method for crop and weed segmentation [49] | It reduced the disparity between crop and weed segmentation accuracy while enhancing overall segmentation performance | The second stage depends on the first stage; poor first-stage training hampers second-stage performance
Considering homogeneous dataset | Multi-stage CNN-based | Modified U-Net based on conventional U-Net [50] | To address the challenges of data labeling and limited data availability, augmentation is employed, which enables accurate segmentation even in complex backgrounds | The algorithm is limited to green bristlegrass, restricting its general applicability to other weeds
Considering homogeneous dataset | Multi-stage CNN-based | A novel four-stage model, called CED-Net [51], with each stage based on a modified U-Net encoder–decoder structure | A lightweight multi-stage model was developed, where training for each stage focused on a specific weed or crop separately | An error in any of the four stages impacted the overall performance
Considering homogeneous dataset | Multi-stage CNN-based | A two-stage encoder–decoder-based semantic segmentation model [52] | It focused on objects rather than the background and delivered good results | Including the background yields varying results across different datasets
Considering heterogeneous dataset | CNN with image-based augmentation | A framework for crop and weed semantic segmentation in heterogeneous environments using conventional CNN methods [22] | Limited training data was used for segmentation and achieved good results on heterogeneous datasets | Overlapping of crop and weed plants causes misclassification and lowers the performance
Considering heterogeneous dataset | CNN with patch-based augmentation | AHFF-Net (proposed method) | It improves crop and weed segmentation accuracy across diverse shapes and heterogeneous data environments using small training data | Segregating very small crops and weeds poses a significant challenge in heterogeneous data environments
The remainder of this paper is organized as follows: Section 2 outlines the proposed method, Section 3 presents the experimental results, and Section 4 provides the discussion. Section 5 summarizes the study and presents future work directions.

2. Material and Methods

2.1. Experimental Setup

We conducted our experiments using three publicly available datasets: the BoniRob dataset [24], the Crop/Weed Field Image Dataset (CWFID) [25], and the Sunflower dataset [26]. These are commonly used for crop and weed segmentation tasks because they provide images in conjunction with their corresponding annotated masks. The BoniRob dataset consists of 496 images captured by a camera on a farming robot, with a pixel resolution of 1296 × 966. The CWFID contains 60 images captured using a JAI AD-130GE camera (JAI A/S, Kushima, Japan) with a resolution of 1296 × 966 pixels. The Sunflower dataset includes 172 images collected during the emergence phase with a resolution of 1296 × 964 pixels. Sample images from these datasets, in conjunction with their corresponding masks, are shown in Figure 1. We started with a single training image and used data augmentation to generate an additional image. These two images were used to create 512 × 512 patches. We selected five patches based on high SSIM and used them for patch-based augmentation. This is explained in Section 2.2.2. To train the AHFF-Net, all the datasets were resized to an input image resolution of 512 × 512 by bilinear interpolation. The experiments were performed on a desktop system running Windows 10 and equipped with an Intel Core i5-2320 CPU (Intel Corporation, Santa Clara, CA, USA) with a 3.00 GHz processor, 16 GB of RAM, and a GeForce GTX 1070 graphics processing unit (GPU) with 8 GB of memory manufactured by NVIDIA Corporation (Santa Clara, CA, USA). We developed the model using PyTorch version 1.13.0 and Python version 3.8.

2.2. Overview of the Proposed Method

This subsection provides an overview of the proposed method. It details the steps involved, including dataset pre-processing and patch-based augmentation of small training data for crop and weed segmentation, as shown in Figure 2. For training, we employed the proposed AHFF-Net. It was initially trained on Dataset-1. Upon completing the training, we performed a pre-processing step (discussed in detail in the next subsection). This pre-processing stage involves adjusting the visual properties of Dataset-2 to align them with those of Dataset-1. This ensures consistency and enhances the generalizability of the model across datasets. After pre-processing Dataset-2, we augmented a single training image by an in-plane rotation of 180°. These two images (the original and the rotated version) were then used to generate patches. Five patches with a high structural similarity index measure (SSIM) were selected and used to fine-tune the AHFF-Net on Dataset-2, which had been trained from scratch on Dataset-1. This patch-based augmentation approach enabled us to maximize the learning potential from the limited training data of crops and weeds. Thus, it effectively addressed the challenges posed by a heterogeneous dataset environment. Finally, the fine-tuned AHFF-Net was evaluated using the transformed testing data of crops and weeds to calculate the segmentation results. This systematic framework optimized the model performance through pre-processing, patch-based augmentation, and fine-tuning within heterogeneous datasets.

2.2.1. Pre-Processing

The variations in visual attributes such as contrast, intensity, and illumination across different datasets generally result in a lower performance, particularly in heterogeneous environments. This issue is addressed using color transfer techniques that adjust the colors of one image (the input image) to match those of another (the target image). Various transformation methods have been developed to enhance the visual attributes of the data. In our study, we used the Reinhard transformation (RHT) [54]. It operates in the lαβ color space [55] by adjusting the standard deviation and mean of the image. In the lαβ color space, "l" represents the brightness channel. It captures the intensity independent of color. The "α" channel captures the balance between yellow and blue hues. Meanwhile, the "β" channel reflects the relationship between red and green shades. The RHT uses the mapping function $\mu$ with parameters $\pi$ to transform $data_A$ into the pre-processed $data_A^*$, as shown in Equation (1). These $\pi$ parameters include the standard deviation and mean values of the input and target images. The pre-processing approach used in our study was inspired by a previously used technique explained in detail in [22]. That study [22] also showed that, on the same datasets, the Reinhard transformation yields better performance, albeit at the cost of some additional processing time.
$data_A^* = \mu(data_A, \pi)$ (1)
Sample pre-processed images with input and target images are depicted in Figure 3.
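To make the pre-processing step concrete, the following is a minimal sketch of Reinhard-style statistics matching, assuming OpenCV and NumPy are available. OpenCV's CIELAB conversion is used here as a practical stand-in for the lαβ space of [55], and the function name reinhard_transfer is illustrative rather than part of the released implementation.

```python
import cv2
import numpy as np

def reinhard_transfer(src_bgr: np.ndarray, ref_bgr: np.ndarray) -> np.ndarray:
    """Adjust the color statistics of src_bgr (Dataset-2 image) to match ref_bgr (Dataset-1 target)."""
    # CIELAB is used as a practical stand-in for the decorrelated lαβ color space.
    src = cv2.cvtColor(src_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(ref_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)

    out = np.empty_like(src)
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std() + 1e-6
        # Scale each channel by the ratio of standard deviations and shift to the target mean.
        out[..., c] = (src[..., c] - s_mean) * (r_std / s_std) + r_mean

    out = np.clip(out, 0, 255).astype(np.uint8)
    return cv2.cvtColor(out, cv2.COLOR_LAB2BGR)
```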

2.2.2. Patch-Based Augmentation

Data augmentation is a technique widely used in deep learning to enhance the diversity of training data, reduce overfitting, and improve model generalization. By introducing variations such as in-plane rotations, flips, and intensity variations, data augmentation enables models to better learn invariant and robust features. These data augmentation methods ultimately yield an improved performance. Building on this principle, patch-based augmentation provides additional advantages by breaking down images into smaller overlapping patches. This approach enables the models to focus on localized details and better capture spatial variations.
In our study, we used patch-based augmentation to extract the patches. We selected different step sizes for the height and width to extract patches that matched the 512 × 512 input size of the network. We avoided padding (which adds additional pixels) and cropping (which removes important information) because both methods can adversely impact the performance. Instead, the step size was adjusted to ensure that 512 × 512 patches were extracted without altering or losing any part of the input image. Sample input images of the extracted patches are shown in Figure 4.
After extracting the patches, we calculated the SSIM [56] of each patch and selected those with high SSIM scores to ensure superior structural similarity. The mathematical expression for SSIM is given in Equation (2). Herein, $\sigma_{xy}$, $\mu_x$, $\sigma_x$, $\mu_y$, and $\sigma_y$ indicate the covariance of x and y, the mean of x, the standard deviation of x, the mean of y, and the standard deviation of y, respectively. The constants p and q are used to prevent the instability caused by division with near-zero denominators. This approach increases the amount of training data and preserves important features of the input image. This, in turn, yields an improved model accuracy. Leveraging patch-based augmentation, we effectively enhance the capability of the model to learn from high-quality localized patterns, thereby contributing to its overall performance.
$SSIM(x, y) = \dfrac{(2\mu_x\mu_y + p^2)(2\sigma_{xy} + q^2)}{(\mu_x^2 + \mu_y^2 + p^2)(\sigma_x^2 + \sigma_y^2 + q^2)}$ (2)
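A minimal sketch of this patch extraction and SSIM-based selection is given below, assuming scikit-image (version 0.19 or later) and uint8 RGB inputs. The paper does not state what each patch's SSIM is computed against; comparing each patch with the full image resized to the patch size is an assumption made here purely for illustration.

```python
import numpy as np
import cv2
from skimage.metrics import structural_similarity as ssim

def extract_patches(img: np.ndarray, patch: int = 512):
    """Tile the image with patch x patch windows whose start positions are spread so
    that the whole image is covered without padding or cropping."""
    h, w = img.shape[:2]
    ys = np.linspace(0, h - patch, max(int(np.ceil(h / patch)), 1)).round().astype(int)
    xs = np.linspace(0, w - patch, max(int(np.ceil(w / patch)), 1)).round().astype(int)
    return [img[y:y + patch, x:x + patch] for y in ys for x in xs]

def select_top_patches(patches, full_image: np.ndarray, k: int = 5):
    """Keep the k patches with the highest SSIM; the reference used here (the full
    image resized to the patch size) is an assumed choice."""
    ref = cv2.resize(full_image, patches[0].shape[1::-1])
    scores = [ssim(p, ref, channel_axis=-1) for p in patches]
    best = np.argsort(scores)[::-1][:k]
    return [patches[i] for i in best]
```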

2.3. AHFF-Net Architecture

The proposed AHFF-Net has a U-shaped encoder–decoder architecture. It uses U-Net as the backbone with three new blocks: the PERB, HMFB, and AFEB. An architectural representation of the proposed AHFF-Net with the combination of blocks is shown in Figure 5.

We consider an input image of size 512 × 512 × 3 for the encoder part. We process it with a convolutional layer (CL) with a kernel size of 3 × 3, followed by a rectified linear unit (ReLU) activation function, batch normalization (BN), and the PERB. The output of the PERB is followed by a max pooling layer (MPL) with a kernel size of 2 × 2 and a stride of two. This is repeated four times in the encoder. The bottleneck starts with two CLs, each followed by ReLU and BN. The output of these two CLs is passed through the HMFB, and the resultant is upsampled by a 2 × 2 up-convolutional layer (UCL) as the start of the decoder.

In the decoder, the upsampled feature map is concatenated with the feature map from the encoder, followed by two CLs, each with ReLU and BN. This is repeated until the final stage of the decoder. In the final stage of the decoder, the resultant of these two CLs is followed by the AFEB and a 1 × 1 CL as the final layer, with an output size of 512 × 512 × 3, matching the input of the encoder part. Furthermore, the mathematical representation of AHFF-Net, including the first encoder stage ($End_1$, representing the layers up to and including the first MPL), the first decoder stage ($Dec_1$, representing the first UCL with subsequent layers), and the bottleneck layer ($Bottleneck$), is given in Equations (3)–(5):
$End_1 = M_1(B_1(CN_1))$, where $CN_1 = Conv(Y, X_1)$, $B_1 = PERB(CN_1)$, $M_1 = MaxPool(B_1)$ (3)
$Dec_1 = DC_1(C_{concat}(DM_{up1}(UP_1), ES))$, where $UP_1 = UpConv(D, Z_{up})$, $DM_{up1} = Match(size(ES), size(UP_1))$, $C_{concat} = Concatenation(DM_{up1}, ES)$, $DC_1 = Conv(Conv(C_{concat}), N_{conv})$ (4)
$Bottleneck = BT(BC)$, where $BC = Conv(Conv(Y), F_{conv})$, $BT = HMFB(BC)$ (5)
In $End_1$, $Y$ denotes the input feature and $X_1$ denotes the convolutional weight tensor. After convolution, the PERB is applied. ReLU and BN are applied after each convolution operation in all the encoder stages. In $Dec_1$, $D$ represents the tensor from the previous layer, and $Z_{up}$ in $UpConv(D, Z_{up})$ represents the weight tensor used with the UCL operation. $ES$ in $Match(size(ES), size(UP_1))$ is the encoder-stage skip connection, whose size is defined as $size(ES)$. The size of $UP_1$ is adjusted to match the size of $ES$ to ensure that the subsequent operations proceed smoothly. $C_{concat}$ represents the concatenation of $DM_{up1}$ and $ES$. It forms the final tensor from the previous layers. $N_{conv}$ denotes the weight tensor of the convolution operations. Finally, after two convolutional operations in the decoder, the output is fed into $DC_1$. The remaining stages of the decoder proceed using a similar approach. For the $Bottleneck$, $Y$ is the tensor from the previous layer, and $F_{conv}$ is the weight tensor used for the convolutional operations. After two convolutional operations, the output is fed into $BC$. Subsequently, the HMFB is applied to the output of the previous layer, and the final resultant tensor is fed into $BT$ to obtain the output (denoted as $Bottleneck$).
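For orientation, the following condensed PyTorch sketch assembles the layout described above, assuming the PERB, HMFB, and AFEB modules sketched in the following subsections. The channel widths and the exact number of convolutions per stage are illustrative assumptions, not values reported in the paper.

```python
import torch
import torch.nn as nn

def crb(c_in, c_out):
    # 3 x 3 convolution followed by ReLU and batch normalization, the per-stage layer order described above.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.ReLU(inplace=True),
                         nn.BatchNorm2d(c_out))

class AHFFNetSketch(nn.Module):
    def __init__(self, n_classes=3, widths=(64, 128, 256, 512)):
        super().__init__()
        enc, c_in = [], 3
        for c in widths:                               # four encoder stages: CL + PERB, then max pooling
            enc.append(nn.Sequential(crb(c_in, c), PERB(c)))
            c_in = c
        self.encoder = nn.ModuleList(enc)
        self.pool = nn.MaxPool2d(2, 2)
        mid = widths[-1] * 2
        self.bottleneck = nn.Sequential(crb(widths[-1], mid), crb(mid, mid), HMFB(mid))
        ups, decs, c = [], [], mid
        for w in reversed(widths):                     # decoder stages: up-conv, concat skip, two CLs
            ups.append(nn.ConvTranspose2d(c, w, 2, stride=2))
            decs.append(nn.Sequential(crb(w * 2, w), crb(w, w)))
            c = w
        self.ups, self.decoders = nn.ModuleList(ups), nn.ModuleList(decs)
        self.afeb = AFEB(widths[0])                    # attention block before the final 1 x 1 CL
        self.head = nn.Conv2d(widths[0], n_classes, 1)

    def forward(self, x):
        skips = []
        for stage in self.encoder:
            x = stage(x)
            skips.append(x)
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.ups, self.decoders, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
        return self.head(self.afeb(x))                 # 512 x 512 map with n_classes channels
```

A forward pass with a 512 × 512 × 3 input returns a three-channel map of the same spatial size, matching the description above.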

2.3.1. PERB

We placed the PERB before the MPL after each stage in the encoder part of the AHFF-Net. The PERB enhances feature extraction by capturing sophisticated details of crops and weeds, such as edges and textures. More detailed features can be extracted from crops and weeds using deeper convolutions while preserving the feature map size. The PERB consists of three 3 × 3 CLs. Here, the first and final CLs are followed by a BN, and the middle CL is followed by a BN with an activation function, ReLU. This is shown in Figure 6.
The mathematical expression for the PERB is given by Equation (6). The input feature vector for the PERB is $F_{PERB(inp)} \in \mathbb{R}^{H \times W \times C}$. The output is represented as $F_{PERB(out)} \in \mathbb{R}^{H \times W \times C}$. The intermediate layers of the PERB are expressed in three steps. In step 1, convolution and BN are applied to $F_{PERB(inp)}$, and the resultant output is denoted as $A$. In step 2, $A$ is considered as the input, and another CL (followed by BN with the ReLU activation function) is applied to produce output $B$. Finally, in step 3, $B$ is considered as the input, and a CL with BN is applied to produce output $C$. $C$ becomes the final output of the PERB and is denoted as $F_{PERB(out)}$.
$F_{PERB(out)} \in \mathbb{R}^{H \times W \times C} = C(B(A))$, where $A = BN(Conv(F_{PERB(inp)}))$, $B = ReLU(BN(Conv(A)))$, $C = BN(Conv(B))$ (6)
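A minimal PyTorch sketch of the PERB following Equation (6) is shown below. The class name PERB and the channel-preserving three-convolution design follow the description above, while the padding choice is an implementation assumption.

```python
import torch.nn as nn

class PERB(nn.Module):
    """Progressive encoder-stage refinement block: three 3 x 3 convolutions that keep
    the spatial size and channel count, with BN after each and ReLU only after the middle one."""
    def __init__(self, channels: int):
        super().__init__()
        self.step1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.BatchNorm2d(channels))                  # A = BN(Conv(x))
        self.step2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.BatchNorm2d(channels),
                                   nn.ReLU(inplace=True))                     # B = ReLU(BN(Conv(A)))
        self.step3 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                   nn.BatchNorm2d(channels))                  # C = BN(Conv(B))

    def forward(self, x):
        return self.step3(self.step2(self.step1(x)))
```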

2.3.2. HMFB

The HMFB is placed in the bottleneck layer and contains multiple stages. Here, each stage independently learns diverse and richer feature information of crops and weeds. The stages are combined hierarchically. After combining all the stages, a residual connection is applied to recover the previously degraded features. This process provides fine-grained patterns and detailed feature representations of crops and weeds. The structure of the HMFB is shown in Figure 7.
The HMFB comprises four stages. At each stage, a 3 × 3 CL (followed by BN and ReLU) is applied to the input of the HMFB. The outputs of the first two stages are concatenated channel-wise, and the resultant feature vector is concatenated with the output of the third stage. Finally, the concatenated result of the three stages is concatenated channel-wise with the output of the fourth stage. After all the stages of concatenation, element-wise addition is applied between the output after the fourth stage and the residual from the input. The resultant output after the element-wise addition is then convolved with a 1 × 1 CL to produce the final output layer of the HMFB. A mathematical representation of the HMFB is presented in Equations (7)–(13).
$F_1 \in \mathbb{R}^{H \times W \times \frac{C}{4}} = ReLU(BN(Conv(F_{HMFB(inp)})))$ (7)
$F_2 \in \mathbb{R}^{H \times W \times \frac{C}{4}} = ReLU(BN(Conv(F_{HMFB(inp)})))$ (8)
$F_3 \in \mathbb{R}^{H \times W \times \frac{C}{4}} = ReLU(BN(Conv(F_{HMFB(inp)})))$ (9)
$F_4 \in \mathbb{R}^{H \times W \times \frac{C}{4}} = ReLU(BN(Conv(F_{HMFB(inp)})))$ (10)
$F \in \mathbb{R}^{H \times W \times C} = F_4 © F_3 © F_1 © F_2$ (11)
$R \in \mathbb{R}^{H \times W \times C} = F_{HMFB(inp)}$ (12)
$F_{HMFB(out)} \in \mathbb{R}^{H \times W \times C} = Conv(F \oplus R)$ (13)
The input feature vector for the HMFB is denoted as $F_{HMFB(inp)} \in \mathbb{R}^{H \times W \times C}$. The output feature vector is denoted as $F_{HMFB(out)} \in \mathbb{R}^{H \times W \times C}$. The output of the first stage, $F_1 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, is computed from $F_{HMFB(inp)}$ by applying a 3 × 3 CL, BN, and ReLU as the activation function. In the output of $F_1$, we obtain $\frac{C}{4}$ channels, whereas $C$ is the total number of channels considered as the input. Similarly, the output of the second stage, $F_2 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, is calculated by applying the same operations to the input $F_{HMFB(inp)}$, also obtaining $\frac{C}{4}$ channels. In a similar manner, $F_3 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$ and $F_4 \in \mathbb{R}^{H \times W \times \frac{C}{4}}$ are computed at the third and fourth stages, respectively. The concatenation process begins with the channel-wise concatenation (denoted as ©) of $F_1$ and $F_2$, yielding $\frac{C}{2}$ channels. The result of this concatenation ($F_1$ © $F_2$) is concatenated with $F_3$, increasing the number of channels to $\frac{3C}{4}$. Then, the subsequent result ($F_3$ © $F_1$ © $F_2$) is concatenated with $F_4$, making the number of channels equal to the input $C$. The final output from all the stages is denoted as $F \in \mathbb{R}^{H \times W \times C}$. A residual connection is applied from $F_{HMFB(inp)}$, and element-wise addition (denoted as $\oplus$) is performed between $F$ and the residual $R$, as shown in Equation (13). After this operation, a 1 × 1 CL is applied to produce the final output of the HMFB (denoted as $F_{HMFB(out)}$).
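A minimal PyTorch sketch of the HMFB following Equations (7)–(13) is shown below: four parallel 3 × 3 conv–BN–ReLU branches with C/4 channels each, channel-wise concatenation in the order of Equation (11), a residual addition with the block input, and a final 1 × 1 convolution. The class name HMFB and the padding choice are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class HMFB(nn.Module):
    """Hierarchical multi-stage fusion block (sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "C must be divisible by 4 so each branch outputs C/4 channels"
        def branch():
            return nn.Sequential(nn.Conv2d(channels, channels // 4, 3, padding=1),
                                 nn.BatchNorm2d(channels // 4),
                                 nn.ReLU(inplace=True))
        self.branches = nn.ModuleList(branch() for _ in range(4))   # F1 ... F4
        self.fuse = nn.Conv2d(channels, channels, 1)                # final 1 x 1 CL

    def forward(self, x):
        f1, f2, f3, f4 = (b(x) for b in self.branches)
        fused = torch.cat([f4, f3, f1, f2], dim=1)                  # F4 © F3 © F1 © F2 (Equation (11))
        return self.fuse(fused + x)                                 # residual addition, then 1 x 1 conv
```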

2.3.3. AFEB

The AFEB is placed before the final 1 × 1 CL in the AHFF-Net decoder. The AFEB is based on an attention mechanism that focuses on the critical regions of crops and weeds. In segmentation tasks, attention mechanisms are particularly advantageous because certain features are more important than others. In the AFEB, the input is passed through a 3 × 3 CL followed by a BN. The resultant is split into two paths: one is passed through a 1 × 1 CL and a sigmoid function, whereas the other is used as a residual connection. The outputs of the sigmoid and residual operations are combined using an element-wise multiplication operation. This output functions as an attention mechanism within the AFEB. Furthermore, the output of the element-wise multiplication is passed through a 3 × 3 CL followed by a BN to produce the final output. The complete structure of the AFEB is shown in Figure 8. The mathematical details are provided in Equations (14)–(16).
$R \in \mathbb{R}^{H \times W \times C} = BN(Conv(F_{AFEB(inp)}))$ (14)
$D \in \mathbb{R}^{H \times W \times C} = Sigmoid(Conv(R))$ (15)
$F_{AFEB(out)} \in \mathbb{R}^{H \times W \times C} = BN(Conv(R \otimes D))$ (16)
In the mathematical equations of the AFEB, $F_{AFEB(inp)} \in \mathbb{R}^{H \times W \times C}$ is used as the input vector. Meanwhile, $F_{AFEB(out)} \in \mathbb{R}^{H \times W \times C}$ represents the output feature vector of the AFEB. In Equation (14), $F_{AFEB(inp)}$ is passed through a 3 × 3 CL and BN to produce $R \in \mathbb{R}^{H \times W \times C}$. $R$ is also used as a residual connection. Concurrently, it is passed through a 1 × 1 CL and a sigmoid function. The output of the sigmoid and the residual skip connection are combined through an element-wise multiplication operation $R \otimes D$. Finally, the result of this operation is passed through a 3 × 3 CL and BN to calculate $F_{AFEB(out)} \in \mathbb{R}^{H \times W \times C}$, which is the output feature vector of the AFEB (Equation (16)).
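A minimal PyTorch sketch of the AFEB following Equations (14)–(16) is given below: a 3 × 3 conv–BN produces R, a 1 × 1 convolution with a sigmoid produces the attention map D, and the gated product R ⊗ D passes through a final 3 × 3 conv–BN. The class name AFEB and the padding choice are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class AFEB(nn.Module):
    """Attention-driven feature enhancement block (sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.pre = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.BatchNorm2d(channels))            # R = BN(Conv(x))
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())                        # D = Sigmoid(Conv1x1(R))
        self.post = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels))            # BN(Conv(R * D))

    def forward(self, x):
        r = self.pre(x)
        return self.post(r * self.gate(r))                             # element-wise gating of R by D
```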

2.4. Loss Function

The Dice loss function [57] was used as the loss function for training the AHFF-Net, as shown in Equation (17):
$L_{dice} = 1 - \dfrac{1}{classes}\sum_{cl=1}^{classes} \dfrac{2\sum_{i=1}^{H \times W} (M_{cl}^{i} \cdot N_{cl}^{i})}{\sum_{i=1}^{H \times W} (M_{cl}^{i} + N_{cl}^{i})}$ (17)
where the number of classes ($classes$) used in all the experiments was three: weed, crop, and background. The ground truth label map and the predicted segmentation output are denoted as $N \in \mathbb{R}^{H \times W \times classes}$ and $M \in \mathbb{R}^{H \times W \times classes}$, respectively. For a specific class $cl$, $N_{cl} \in \mathbb{R}^{H \times W}$ and $M_{cl} \in \mathbb{R}^{H \times W}$ represent the ground truth and predicted values, respectively.
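A minimal PyTorch sketch of Equation (17) is shown below, assuming logits of shape (B, 3, H, W) and one-hot targets of the same shape; the small eps term is an added numerical-stability assumption not present in the equation.

```python
import torch

def dice_loss(logits: torch.Tensor, target_onehot: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Multi-class Dice loss over three classes (background, weed, crop)."""
    probs = torch.softmax(logits, dim=1)                 # M: predicted per-class probability maps
    dims = (0, 2, 3)                                     # sum over batch and spatial dimensions
    intersection = (probs * target_onehot).sum(dims)     # sum of M * N per class
    denom = probs.sum(dims) + target_onehot.sum(dims)    # sum of M + N per class
    dice_per_class = (2 * intersection + eps) / (denom + eps)
    return 1 - dice_per_class.mean()                     # 1 minus the average Dice over classes
```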

2.5. Evaluation Metrics

We evaluated the semantic segmentation accuracy using the intersection over union (IOU) for all three classes (background, weed, and crop), the mean IOU (mIOU), Recall, Precision, and F1 score (the harmonic mean of Precision and Recall). The mathematical expressions for all the evaluation metrics are provided in Equations (18)–(22). These metrics were used to assess the semantic segmentation performance of the proposed AHFF-Net. The number of classes (Cl) was set to three. True positives (TPs) denote cases in which the true label matches the prediction, whereas true negatives (TNs) denote those in which a false label matches the prediction. False positives (FPs) occur when false labels are predicted incorrectly as true labels, and false negatives (FNs) occur when a true label is predicted incorrectly as a false label. The IOU for class j (IOU_j) is defined in Equation (19). Recall and Precision were calculated for each class, and the average values were computed. The average values were then used to calculate the F1 score. This provided a comprehensive evaluation of the performance of the model.
$IOU = \dfrac{TP}{TP + FN + FP}$ (18)
$mIOU = \dfrac{\sum_{j=1}^{Cl} IOU_j}{Cl}$ (19)
$Recall = \dfrac{TP}{TP + FN}$ (20)
$Precision = \dfrac{TP}{TP + FP}$ (21)
$F1\ score = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$ (22)
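The metrics in Equations (18)–(22) can be computed from a confusion matrix, as in the following NumPy sketch, assuming integer label maps with classes {0: background, 1: weed, 2: crop}; the function name is illustrative.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, n_classes: int = 3) -> dict:
    """Per-class IoU, mIoU, class-averaged recall/precision, and F1 score."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (gt.ravel(), pred.ravel()), 1)         # rows: ground truth, columns: prediction

    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp

    iou = tp / np.maximum(tp + fp + fn, 1)               # Equation (18), per class
    recall = (tp / np.maximum(tp + fn, 1)).mean()        # class-averaged recall, Equation (20)
    precision = (tp / np.maximum(tp + fp, 1)).mean()     # class-averaged precision, Equation (21)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)   # Equation (22)
    return {"IoU": iou, "mIoU": iou.mean(), "Recall": recall,
            "Precision": precision, "F1": f1}
```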

2.6. FD Estimation

The FD is a mathematical measure for the quantitative analysis of the irregularity of different structures. It measures the complexity of a shape or pattern, especially in fractal shapes and geometric structures that possess self-similarity across different scales. These repeating patterns at multiple scales can be effectively quantified using the FD. The FD is a numerical value, usually varying between 1 and 2, that indicates the complexity level [58]. An increased FD value reflects a more complex structure. One of the most commonly used methods to calculate the FD is the box-counting algorithm [59]. The pseudo-code to measure the FD using the box-counting algorithm is provided in Algorithm 1.
Algorithm 1. Pseudo-code to measure FD
Input: image (input image path)
Output: Fractal dimension (FD) value
1:         Read the input image and further convert it into grayscale
2:         Set the maximum box size to a power of 2 that covers the image dimensions
                  q = 2^⌈log(max(size(image)))/log(2)⌉
            Add padding if required to match the dimensions
3:         Initialize the number of boxes
4:         Calculate the number of boxes Y(q) to minimum pixels
5:         Decrease the box size by 2 and recalculate Y(q) iteratively
                  while q > 1
6:         Calculate log(Y(q)) and log(1/q) for each q
7:         Draw a fitted line to the points (log(Y(q)) and log(1/q))
8:         FD value is the slope of the fitted line
            Return Fractal Dimension Value
This algorithm enables the calculation of the FD. Owing to its flexibility, this technique is widely used in many fields, including agriculture [11,22]. The formula for calculating the FD using this method is given below:
$FD = \lim_{q \to 0} \dfrac{\log(Y(q))}{\log(1/q)}$
Here, $Y(q)$ represents the total number of boxes of size $q$ needed to completely cover the curve, $q$ denotes the box size, and FD is the fractal dimension that characterizes the complexity of the curve.
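A minimal NumPy sketch of this box-counting estimate is given below, assuming a binary (or thresholded grayscale) mask as input. Padding to the next power of two and halving the box size follow Algorithm 1, while the threshold value and the inclusion of every scale in the fit are illustrative assumptions.

```python
import numpy as np

def fractal_dimension(mask: np.ndarray, threshold: float = 0.5) -> float:
    """Estimate the fractal dimension of a 2-D mask via box counting."""
    binary = mask > threshold
    # Pad to a square whose side is the next power of two (step 2 of Algorithm 1).
    side = int(2 ** np.ceil(np.log2(max(binary.shape))))
    padded = np.zeros((side, side), dtype=bool)
    padded[:binary.shape[0], :binary.shape[1]] = binary

    sizes, counts = [], []
    q = side
    while q >= 1:
        # Count boxes of size q that contain at least one foreground pixel (Y(q)).
        blocks = padded.reshape(side // q, q, side // q, q).any(axis=(1, 3))
        sizes.append(q)
        counts.append(blocks.sum())
        q //= 2

    sizes = np.array(sizes, dtype=float)
    counts = np.maximum(np.array(counts, dtype=float), 1)
    # Fit log(Y(q)) against log(1/q); the slope of the fitted line is the FD.
    slope, _ = np.polyfit(np.log(1.0 / sizes), np.log(counts), 1)
    return float(slope)
```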

3. Experimental Results

3.1. Training Details

The proposed AHFF-Net utilized the BoniRob, CWFID, and Sunflower public datasets, with input images of size 512 × 512. Initially, the model was trained on Dataset-1 using 70% of its images with the Adam optimizer (AO) and an initial learning rate (LER) of 1 × 10−5 for 150 epochs. The learning rate was reduced gradually during training by a cosine annealing (CA) strategy [60]. CA helps the model converge smoothly, improves generalization, and ensures better optimization without abrupt learning rate reductions. After training with Dataset-1, Dataset-2 was pre-processed and divided into two equal parts: one used for testing and the other as small training data. A single image from the small training split was augmented by a 180° rotation, and the resulting two small training images were further augmented using the patch-based technique described in Section 2.2.2. These two small training images, in conjunction with the five patch-based augmented images, were used for training with the same AO, LER, and CA strategies. The test data remained unaltered. Subsequently, Dataset-2 was replaced with Dataset-3, and the segmentation results were calculated for crops and weeds in heterogeneous environments. Similarly, for each experiment, the model was first trained on one dataset, and the other two datasets were pre-processed and used for testing. Table 2 shows the training, validation, and testing data split for all the experiments.
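The training configuration described above can be sketched in PyTorch as follows, reusing the dice_loss sketch from Section 2.4. The train_loader, model, and device are assumed to be defined elsewhere; this is an illustrative outline rather than the released training script.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, epochs: int = 150, device: str = "cuda"):
    """Train with Adam (initial LR 1e-5) and a cosine-annealed learning rate, as described above."""
    model = model.to(device)
    optimizer = Adam(model.parameters(), lr=1e-5)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

    for epoch in range(epochs):
        model.train()
        for images, masks_onehot in train_loader:
            images, masks_onehot = images.to(device), masks_onehot.to(device)
            optimizer.zero_grad()
            loss = dice_loss(model(images), masks_onehot)   # Dice loss sketch from Section 2.4
            loss.backward()
            optimizer.step()
        scheduler.step()                                     # cosine annealing of the learning rate
    return model
```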

3.2. Testing of Proposed Method

3.2.1. Performance Evaluation with CWFID Following BoniRob Dataset Training

Ablation Study
We performed eight ablation studies with multiple potential scenarios while testing the CWFID, following training with the BoniRob dataset. We conducted a block-by-block ablation study that included individual combinations of two blocks (Table 3). It is noteworthy that the base model without any of the proposed blocks achieved the highest IOU (C) for the crop class. This may be because it was challenging for the base model to separate visually similar weeds from crops. As a result, certain weed regions may have been labeled incorrectly as crops. This would have caused an increase in IOU (C) at the cost of a reduction in IOU (W). When the proposed blocks were introduced, the attention and feature representations of the model became distributed more uniformly across all the classes. Among the individual blocks, the overall segmentation performance (in terms of F1 score and mIOU) achieved using only the PERB was better than that obtained using only the HMFB or AFEB. This is likely because the PERB enhances the initial feature representation and helps better distinguish between classes in coarse segmentation. However, the PERB alone is insufficient for capturing fine-grained boundaries, as reflected by the reduced IOU (C) for the crop class. Compared with the use of a single block, the combination of two blocks resulted in better segmentation accuracy. This improvement can be attributed to the effect of the blocks: one block (e.g., PERB) provides robust base features, and the second block (e.g., HMFB or AFEB) complements it by enhancing contextual understanding or focusing on semantically important spatial regions. This complementary behavior yields a better class balance and more accurate segmentation results. We observed that the three proposed blocks yielded the best segmentation results for the semantic segmentation of crops and weeds.
Subsequently, we performed experiments by applying IMBA and patch-based augmentation using the proposed AHFF-Net. Table 4 shows that the AHFF-Net trained with patch-based augmentation achieved superior performance in all the metrics compared with the same model trained using IMBA. Patch-based augmentation preserves the fine-grained spatial details and increases the diversity of localized training. In contrast, IMBA applies global transformations that may not preserve the small-scale structure and distribution of critical features, particularly in complex scenes with dense crop and weed mixtures.
We then investigated the effects of using different numbers of patches and compared the segmentation results. As shown in Table 5, the use of five patches for augmentation yielded the best overall segmentation performance. It achieved the highest mIOU (0.653) and F1 score (0.787). When four patches were used, the performance decreased noticeably (mIOU = 0.623 and F1 score = 0.760). This reduction can be attributed to the limited diversity of the augmented training samples. However, when six patches were used, a marginal performance reduction (compared with the use of five patches) was observed. Although the use of more patches increases the number of training samples, it may introduce redundant or highly overlapping regions. This results in less informative or repetitive features. Moreover, although the overall performance was lower for six patches, the IOU for the crop class was marginally higher than that for the proposed method using five patches. This may be because an increase in the number of patches results in an enhanced exposure of the model to small crop structures, thereby enabling more precise boundary learning for crops.
Additionally, we conducted separate ablation studies for the PERB, HMFB, and AFEB. For the PERB, we performed an ablation study based on the number of convolutions and the convolutional manner (either parallel or sequential). This is explained in Table 6 and Table 7. As shown in Table 6, using only two CLs resulted in the highest IOU for the crop class. However, it yielded the lowest IOU for weeds and overall performance (mIOU = 0.636, F1 score = 0.773). This may be because a shallower configuration has an inadequate capacity to extract sufficient semantic and contextual features, particularly for less distinct or more complex classes. Although the limited depth may enable the model to focus on clearer structures, it fails to capture the fine-grained variable patterns required for accurately segmenting weed regions. This results in imbalanced class performance. In contrast, using four convolutions marginally improved IOU (C) to 0.353. However, it still underperformed in terms of IOU (W) (=0.598), mIOU (=0.645), and F1 score (=0.781) compared with the proposed three-CL design. This indicates that adding more depth may cause redundancy, oversmoothing, or increased difficulty in optimizing feature representations, particularly in the early encoder stages. The proposed three-convolution configuration achieved the best overall results (mIOU = 0.653, F1 score = 0.787) by achieving an optimal balance between low-level detail preservation and mid-level semantic abstraction. This enabled robust performance across all the classes.
As shown in Table 7, the proposed sequential configuration achieved the best overall segmentation performance, with mIOU = 0.653 and F1 score = 0.787. In contrast, the parallel convolution variant yielded an IOU for the crop class (IOU (C) = 0.371) that was marginally higher than the 0.338 in the sequential case. However, it experienced significant degradation in weed segmentation (IOU (W) = 0.554) and overall performance (mIOU = 0.637, F1 score = 0.774). This indicates that parallel convolutions enhance the spatial diversity and potentially capture more localized features that are advantageous for crops. However, sequential convolution provides a more structured and hierarchical flow of features. This helps in learning stable representations and improves interclass separation, thereby yielding a better generalization across all classes.
For the HMFB, we conducted an ablation study with and without residual connections, as listed in Table 8. The proposed design with residual connections yielded the best performance across most key metrics: mIOU = 0.653 and F1 score = 0.787. In contrast, removing the residual connection caused an observable performance reduction, with IOU (W) reducing to 0.564, mIOU decreasing to 0.631, and F1 score reducing to 0.772. It is noteworthy that IOU (C) increased marginally without a residual connection. This was possibly because the model focused more narrowly on prominent and conveniently identifiable regions in the absence of competing fine-grained details preserved by residual connections. However, residual connections are critical in deep structures such as the HMFB. This is because they help mitigate gradient vanishing, maintain feature continuity, and enable multiscale fusion of low- and high-level features, which is particularly advantageous for challenging weed segmentation.
Table 9 lists the ablation of the HMFB based on the number of convolutions. Using three CLs resulted in the lowest IOU (C) of 0.299. This indicates that the depth was insufficient to model spatially diverse features and multiscale information, particularly for crops with varying shapes and sizes. However, an increase in the depth to five convolutions lowered the mIOU and F1 score to 0.634 and 0.776, respectively. This reduction is attributed to overfitting or excessive feature compression, which weakens the network's capability to generalize, particularly in underrepresented crop and weed classes. The proposed four-convolution setup provides the best overall performance by effectively integrating multiscale contextual information without complicating the feature space. This balanced configuration enhances both weed and crop segmentation and ensures stable generalization across all metrics.
For the AFEB, we performed ablation studies with and without an attention mechanism, as shown in Table 10. The proposed method (with attention) provided the most consistent and balanced results across most metrics. When attention was removed, the model showed a higher crop IOU (IOU (C) = 0.371). However, it was affected significantly in terms of weed segmentation (IOU (W) = 0.562) and overall accuracy (mIOU = 0.640, F1 score = 0.779). This pattern indicates that without attention, the model tends to overfit the dominant crop class. It leverages its abundant and spatially consistent features while omitting less frequent and spatially dispersed weed regions. The attention mechanism effectively reallocates the focus of the model to class-specific and informative regions. This enhances the detection of small and complex objects such as weeds. This, in turn, results in an improved class balance, particularly under conditions of spatial complexity and class imbalance (common in agricultural segmentation tasks). Through these ablation studies, we assessed the impact of various blocks on the performance. The results emphasize the significance of specific model design selections (such as the number of convolutions and incorporation of attention mechanisms) in enhancing the segmentation accuracy.
Comparison of the Proposed and State-of-the-Art (SOTA) Methods
This subsection describes the experiments conducted using the CWFID as a testing dataset, following training with the BoniRob dataset. This study compared the SOTA methods with the proposed AHFF-Net, as shown in Table 11. In all the metrics except for IOU (C), AHFF-Net surpassed the other SOTA methods in terms of the segmentation results for crops and weeds. The second-best model was U-Net [61]. It showed a 2.9% lower mIOU and 2.8% lower F1 score than the proposed AHFF-Net. U-Net [61] achieved a marginally higher IOU (C) for the crop class (0.350) than AHFF-Net (0.338). This can be attributed to the structural simplicity of U-Net, which tends to extract low-level features (such as shape and texture) effectively and to merge the tight boundaries of crops and weeds. U-Net lacks deeper refinement modules. Consequently, it may rely more significantly on early features and label mixed regions as crops. This, in turn, can result in a marginally higher IOU (C). In contrast, AHFF-Net marginally forfeited IOU (C) in favor of improved weed segmentation and class balance. Thereby, it achieved superior overall performance. The lower performance of the SOTA models can be attributed to the similar shapes of crops and weeds in the other datasets. However, AHFF-Net performed better at differentiating between crops and weeds in this heterogeneous dataset environment because of its use of patch-based augmentation, sequential convolutions, attention mechanisms, and residual connections. Patch-based augmentation provides the network with localized and balanced training samples. This is essential for learning the fine-grained details of small and irregular crop and weed structures that may be underrepresented or vary across fields in heterogeneous environments. The use of sequential convolutions facilitates deeper and more hierarchical feature extraction. This enables the model to capture semantic transitions and maintain spatial consistency across varying resolutions and object-scale conditions typical of heterogeneous agricultural scenes. The attention mechanism embedded in the AFEB module enables the model to focus selectively on class-discriminative regions. This effectively reduces misclassification in cases where crops and weeds have similar visual characteristics or overlap spatially. Furthermore, the HMFB is enhanced with residual connections. It combines multilevel contextual features while preserving low-level spatial indications. This combined mechanism ensures that both detailed textures and broader semantic relationships are retained. This, in turn, improves the model's capability to accurately distinguish classes across diverse and unseen heterogeneous environmental data. These features enable AHFF-Net to achieve an accuracy higher than those of other SOTA models.
Figure 9 shows visualization examples of the semantic segmentation performance of the SOTA methods and the proposed AHFF-Net. In the visualization, crops are represented by red pixels, weeds by blue pixels, and the background by black pixels. We also visualized the error pixels: crops detected incorrectly as weeds or background are shown in yellow, weeds detected incorrectly as crops or background are shown in orange, and background detected incorrectly as crops or weeds is shown in gray. From the visual analysis, it is evident that AHFF-Net (o) produced the most accurate and balanced segmentation, with minimal misclassification and well-preserved class boundaries. It accurately captured both large crop regions and scattered weed structures, which were generally misclassified by the other models; boundary quality and object shape integrity were particularly well preserved in the AHFF-Net results. In contrast, earlier models such as Unet++ (h), reduced U-Net (i), and DWUNet (j) showed substantial misclassification, particularly in dense weed regions, as evidenced by the orange and yellow artifacts. These errors reflect an inability to distinguish spatially similar crop and weed regions, which generally results in overlapping misclassifications. MSFCA-Net (k) and SEG-Net (m) performed moderately but displayed boundary degradation and loss of structural detail in fine plant branches. DeepLab V3 Plus (l) demonstrated improved crop segmentation but failed to detect certain weed patches, whereas CG-Net (n) underperformed significantly, showing substantial background confusion and broken object continuity. Overall, we verified that AHFF-Net outperformed the SOTA models in the semantic segmentation of crops and weeds.
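The error-colored overlays in Figures 9-14 can be generated directly from a predicted label map and its ground truth. The sketch below follows the color coding described above, assuming class indices 0 = background, 1 = weed, and 2 = crop (the weed/crop indices match those used in Figure 16); the exact RGB values are illustrative.

```python
import numpy as np

# Color coding used in Figures 9-14: red = crop, blue = weed, black = background,
# yellow = crop predicted as weed/background, orange = weed predicted as
# crop/background, gray = background predicted as crop/weed.
PALETTE = {
    "crop": (255, 0, 0), "weed": (0, 0, 255), "background": (0, 0, 0),
    "crop_err": (255, 255, 0), "weed_err": (255, 165, 0), "bg_err": (128, 128, 128),
}

def error_overlay(pred, gt):
    """Build an RGB visualization of correct and misclassified pixels.

    pred, gt: (H, W) integer arrays with 0 = background, 1 = weed, 2 = crop.
    """
    vis = np.zeros((*gt.shape, 3), dtype=np.uint8)
    vis[(gt == 2) & (pred == 2)] = PALETTE["crop"]
    vis[(gt == 1) & (pred == 1)] = PALETTE["weed"]
    vis[(gt == 0) & (pred == 0)] = PALETTE["background"]
    vis[(gt == 2) & (pred != 2)] = PALETTE["crop_err"]   # crop missed
    vis[(gt == 1) & (pred != 1)] = PALETTE["weed_err"]   # weed missed
    vis[(gt == 0) & (pred != 0)] = PALETTE["bg_err"]     # background missed
    return vis
```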

3.2.2. Performance Evaluation with BoniRob Following CWFID Training

This subsection describes the experiments conducted using the BoniRob dataset as the testing dataset after training on the CWFID. We compared the SOTA methods with the proposed AHFF-Net, as shown in Table 12. In all metrics except IOU(B), Precision, and Recall, AHFF-Net surpassed the other SOTA methods in crop and weed segmentation, showing that it differentiated crops and weeds effectively in the heterogeneous environment. The second-best model was U-Net [61], with a 1.9% lower mIOU and a 2.2% lower F1 score than the proposed AHFF-Net. Notwithstanding its overall advantages, AHFF-Net reported slightly lower values in a few individual metrics. U-Net [61] achieved the highest IOU for the background (IOU(B) = 0.975) as well as the highest Precision (0.755). This can be attributed to U-Net's tendency to focus on lower-level features: its encoder-decoder structure with direct skip connections preserves low-level information (such as large, uniform background areas) effectively. Background regions generally have less texture and a more consistent appearance, which enables cleaner segmentation of these areas through U-Net's structural design. U-Net also tends to favor high-confidence areas and avoid uncertain regions near object boundaries or transitions, which can inflate both IOU(B) and Precision. In contrast, AHFF-Net distributes its focus more evenly across all classes because of its attention mechanism and patch-based learning, which may slightly reduce background accuracy while improving overall weed and crop detection. Similarly, reduced U-Net [62] reported the highest Recall (0.803), likely because of its aggressive prediction strategy that favors sensitivity over specificity. Although this helped identify more correctly labeled crop and weed pixels, it also produced more false positives, trading Precision (0.626) for Recall. By contrast, AHFF-Net maintained a better balance between Precision (0.742) and Recall (0.796), reflecting its strong generalization and class-wise discrimination. Although U-Net and reduced U-Net outperformed AHFF-Net in a few individual metrics, the comprehensive performance of AHFF-Net, reflected in the highest mIOU (0.661), IOU(W) (0.334), IOU(C) (0.674), and F1 score (0.768), demonstrated its effectiveness in balanced and context-aware segmentation of crops and weeds.
Visual examples of semantic segmentation performance on the BoniRob evaluation dataset for the proposed AHFF-Net and the SOTA methods are shown in Figure 10. The visualization confirms that AHFF-Net outperformed the SOTA models in crop and weed segmentation, with minimal misclassified pixels, particularly when distinguishing overlapping or tightly clustered weeds and crops. In contrast, models such as Unet++ (h) and reduced U-Net (i) displayed significant misclassifications, particularly in background regions (gray) and between crops and weeds (orange/yellow areas), indicating confusion between spatially close or visually similar classes. Models such as DWUNet (j), MSFCA-Net (k), and SEG-Net (m) also showed scattered misclassifications, particularly in boundary regions and small weed structures. DeepLab V3 Plus (l) improved on the conventional U-Net-based models but still exhibited errors around crop edges and isolated weed patches. CG-Net (n) misclassified a large portion of the image, particularly mislabeling background and crops, which compromises its effectiveness in real-world applications.

3.2.3. Performance Evaluation with Sunflower Following BoniRob Dataset Training

In this subsection, the experiments were conducted using the Sunflower dataset as the testing dataset after training on the BoniRob dataset. We compared the SOTA methods with the proposed AHFF-Net, as shown in Table 13. In all metrics except IOU(W), Precision, and Recall, AHFF-Net surpassed the other SOTA methods in crop and weed segmentation, differentiating crops and weeds better in this heterogeneous environment. The second-best model was U-Net [61], with a 1.4% lower mIOU and a 1.7% lower F1 score than the proposed AHFF-Net. Although AHFF-Net achieved the highest overall performance (mIOU = 0.601, F1 score = 0.707), a few models marginally outperformed it in isolated metrics. For instance, Unet++ [46] reported the highest IOU for weeds (IOU(W) = 0.604), which can be attributed to its densely nested skip connections that enhance spatial detail and support fine-grained segmentation of thin weed structures. However, this came at the cost of poor crop segmentation (IOU(C) = 0.144) and overall balance, resulting in a lower mIOU (0.580) and F1 score (0.687) than AHFF-Net. Similarly, CG-Net [67] achieved the highest Precision (0.800) because of its conservative prediction strategy, which likely reduced false positives by focusing on confident predictions. However, this strategy tends to suppress uncertain or ambiguous regions, such as overlapping or low-contrast crop and weed structures, resulting in a lower Recall (0.572); its limited exploration of harder-to-detect regions causes many true positive pixels to be missed, which lowers the F1 score (0.667). Meanwhile, modified U-Net [40] recorded the highest Recall (0.700) by assigning class labels more broadly, which helped it detect more relevant regions but also led to less accurate boundaries and drops in Precision (0.651) and F1 score (0.674).
Visual examples of semantic segmentation performance on the Sunflower evaluation dataset for the proposed AHFF-Net and the SOTA methods are shown in Figure 11. The visualization shows that AHFF-Net (o) achieved the most accurate and clean segmentation, with well-preserved boundaries and minimal misclassification, especially in complex regions where crops and weeds densely overlapped. In contrast, models such as Unet++ (h) and reduced U-Net (i) exhibited significant confusion, with prominent orange and yellow misclassified pixels indicating difficulty in accurately distinguishing between crops and weeds; these models generally tended to mislabel crops as weeds and vice versa. Other models, including DWUNet (j), MSFCA-Net (k), and SEG-Net (m), showed scattered misclassifications, particularly around object boundaries and within dense weed areas, reflecting limitations in capturing the fine-grained structural details of crops and weeds. Although DeepLab V3 Plus (l) performed moderately better, it still struggled to preserve class boundaries consistently, particularly in regions with overlapping class distributions. CG-Net (n) demonstrated significant misclassification, particularly large gray and yellow regions, indicating poor differentiation between crops and background as well as failure to detect small weed clusters. Overall, these results confirm that AHFF-Net significantly outperformed the other SOTA models in semantic segmentation accuracy for crops and weeds, particularly in heterogeneous and challenging agricultural environments.

3.2.4. Performance Evaluation with BoniRob Following Sunflower Dataset Training

In this subsection, the experiments were conducted using the BoniRob dataset as the testing dataset after training on the Sunflower dataset. We compared the SOTA methods with the proposed AHFF-Net, as shown in Table 14. In all metrics except IOU(B), AHFF-Net surpassed the other SOTA methods in crop and weed segmentation, differentiating crops and weeds effectively in the heterogeneous environment. The second-best model was the modified U-Net [40], with a 1.6% lower mIOU and a 1.9% lower F1 score than the proposed AHFF-Net. Although AHFF-Net achieved the highest overall scores (mIOU = 0.595, F1 score = 0.704), Unet++ [46] marginally outperformed it in IOU for the background. This can be attributed to Unet++'s densely connected architecture, which reinforces the preservation of low-level features such as soil and background textures. As a result, Unet++ may have placed greater emphasis on detecting well-distributed and visually consistent background regions, leading to a small gain in IOU(B). However, this focus generally came at the expense of segmenting more complex or overlapping structures such as weeds, as is evident in its lower IOU(W) = 0.128, IOU(C) = 0.486, and F1 score = 0.631.
Visual examples of semantic segmentation performance on the BoniRob evaluation dataset for the proposed AHFF-Net and the SOTA methods are shown in Figure 12. The visualization shows that AHFF-Net (o) achieved accurate semantic segmentation of crops and weeds, with minimal confusion among classes, particularly in regions with overlapping plants and background; the boundaries between classes were well preserved, and misclassified pixels were minimal. In comparison, Unet++ (h), reduced U-Net (i), and DWUNet (j) exhibited noticeable confusion, particularly yellow and orange misclassifications where crops and weeds were mislabeled, along with irregular shapes and inconsistent coverage of crop regions. MSFCA-Net (k) and SEG-Net (m) displayed poor segmentation in weed regions and frequently confused the background with plant structures, leading to scattered gray pixels. DeepLab V3 Plus (l) provided marginally better boundary awareness but still misclassified small weed clusters. Notably, CG-Net (n) performed the worst among all the models, generating large gray and yellow misclassifications and failing to properly detect crop regions. These results again confirm that AHFF-Net outperformed the SOTA models in the semantic segmentation of crops and weeds.

3.2.5. Performance Evaluation with Sunflower Following CWFID Training

In this subsection, the experiments were conducted using the Sunflower dataset as the testing dataset after training on the CWFID. We compared the SOTA methods with the proposed AHFF-Net, as shown in Table 15. In all metrics except IOU(W), Precision, and Recall, AHFF-Net surpassed the other SOTA methods in crop and weed segmentation, differentiating crops and weeds effectively in the heterogeneous environment. The second-best model was U-Net [61], with a 1% lower mIOU and a 1.9% lower F1 score than the proposed AHFF-Net. While AHFF-Net achieved the highest overall performance (mIOU = 0.592, F1 score = 0.701), a few models reported higher scores in individual metrics. DeepLab V3 Plus [65] achieved the highest IOU for weeds (IOU(W) = 0.622) and the highest Precision (0.746). This can be attributed to its dilated convolution design, which is effective in capturing multi-scale contextual features and may enhance weed boundary detection in structured scenes. However, this generally comes with trade-offs in detecting finer structures such as crops, reflected in its significantly lower IOU(C) = 0.098 and F1 score = 0.663. This suggests that although DeepLab V3 Plus excelled at isolating certain classes such as weeds, it lacked the balance needed for high overall performance. Similarly, reduced U-Net [62] recorded the highest Recall (0.766), likely because of its more inclusive prediction behavior, which favors broader class coverage. This strategy may help identify more vegetation regions but generally sacrifices spatial precision, resulting in lower Precision (0.579) and F1 score (0.659). In contrast, AHFF-Net maintained a well-rounded segmentation strategy with high scores across all key metrics while avoiding excessive focus on any single class.
Visual examples of semantic segmentation performance on the Sunflower evaluation dataset for the proposed AHFF-Net and the SOTA methods are shown in Figure 13. From the visualization, AHFF-Net (o) delivered the most accurate and refined segmentation, particularly in distinguishing scattered weed regions with minimal misclassified pixels; the boundaries between classes were clear and consistent, even in areas with overlapping vegetation. In contrast, Unet++ (h) and reduced U-Net (i) showed substantial misclassifications, particularly orange and yellow pixels indicating confusion between crops and weeds, and also suffered from inconsistent boundary preservation and poor segmentation of background regions. DWUNet (j) and MSFCA-Net (k), while showing improved weed detection compared with the U-Net-based models, still produced scattered errors near object edges and small plant patches. DeepLab V3 Plus (l) and SEG-Net (m) exhibited gray and yellow areas, particularly in boundary regions, signifying difficulty in separating background and crop features. CG-Net (n) failed to accurately segment the crop area, resulting in more noise and incorrect classification. Overall, AHFF-Net's superiority over the other SOTA models was clearly visible in the semantic segmentation of crops and weeds.

3.2.6. Performance Evaluation with CWFID Following Sunflower Dataset Training

In this subsection, the experiments were conducted using the CWFID as the testing dataset after training on the Sunflower dataset. We compared several SOTA methods with the proposed AHFF-Net, as shown in Table 16. In all metrics except IOU(W) and Precision, AHFF-Net surpassed the other SOTA methods in crop and weed segmentation, again differentiating crops and weeds effectively in the heterogeneous environment. The second-best model was the reduced U-Net [62], with a 0.5% lower mIOU and a 2.1% lower F1 score than the proposed AHFF-Net, further confirming AHFF-Net's superior performance. Although AHFF-Net achieved the best overall results (mIOU = 0.595, F1 score = 0.731), reduced U-Net [62] recorded the highest IOU for weeds (IOU(W) = 0.567). This may be attributed to its tendency to place stronger focus on broad weed regions, resulting in more complete labeling of larger weed areas. However, this generally comes at the cost of performance in the other classes, as reflected in its lower IOU(C) = 0.234 and IOU(B) = 0.970, as well as a marginally lower F1 score = 0.710 than the proposed AHFF-Net. In contrast, AHFF-Net maintained a more balanced segmentation across all classes, with IOU(C) = 0.319, the highest among all models. Similarly, CG-Net [67] achieved the highest Precision (0.756), which suggests a more conservative prediction approach that can yield clear segmentation boundaries and few misclassified pixels in confident regions. However, it also results in a lack of coverage, particularly for less distinct classes such as weeds, as seen in CG-Net's lower IOU(W) = 0.409 and IOU(C) = 0.081 and its significantly lower Recall (0.534) and F1 score (0.625).
Visual examples of semantic segmentation performance with the CWFID evaluation dataset for the proposed AHFF-Net and SOTA methods are shown in Figure 14. From the visualization, we can confirm that AHFF-Net (o) again provided the most balanced segmentation, with clearly outlined weed and crop regions and significantly fewer misclassified pixels. In contrast, CG-Net (n) showed more errors, particularly around the middle section of the plant rows. For example, one can observe that the leaf structure visible in the ground truth (d) and correctly segmented by AHFF-Net (o) was poorly segmented in CG-Net (n), with parts either missing or misclassified (seen as orange or gray regions). This highlights AHFF-Net’s superior ability to preserve class boundaries and shape consistency even in cluttered scenes. Several other models, such as Unet++ (h), DWUNet (j), and SEG-Net (m), showed similar patterns of confusion, where large areas of weeds were misclassified as crops (orange), and vice versa. DeepLab V3 Plus (l) and MSFCA-Net (k) also exhibited performance degradation at the boundaries, with scattered background misclassifications and unclear plant structures. U-Net (f) and modified U-Net (g) provided marginally better performance compared to earlier variants but still struggled to resolve overlapping vegetation accurately. Importantly, while CG-Net (n) appeared visually closer to AHFF-Net at a glance, a closer inspection revealed clear segmentation gaps, particularly in small weed structures and finer crop boundaries. AHFF-Net (o), by contrast, achieved higher fidelity, accurately identifying even subtle vegetative elements. Overall, AHFF-Net successfully outperformed the SOTA models in terms of the semantic segmentation performance of crops and weeds.

3.3. Evaluation by FD Estimation

To estimate the approximate fractal dimensions of different shapes, we applied the box-counting technique and evaluated it on the CWFID. The results are summarized in Table 17, which presents the FD values and the spatial distributions of crops and weeds across the field. Specifically, the FD values shown in the 1st–4th rows of Table 17 correspond to the 1st–4th row images in Figure 15. Higher FD values for crops and weeds indicate greater structural complexity, suggesting that agricultural experts or autonomous vehicle systems should pay closer attention to distinguishing them accurately. This estimation also supports precision agriculture by enabling targeted weed removal through selective spraying in areas with dense or complex weed growth, ultimately contributing to increased crop yield. Prior segmentation studies [8,22,68] likewise demonstrate the practical contribution of FD estimation, and fractal analysis in general provides a useful approach for quantifying the complexity and irregularity of medical and agricultural imaging data, as well as data from various other applications.
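A minimal box-counting implementation of the FD estimation used here is sketched below. It crops the binary class mask to a power-of-two square, counts occupied boxes over dyadic box sizes, and fits the slope of log N(s) versus log(1/s); the box-size schedule and cropping strategy are assumptions, and the paper's implementation may differ in these details.

```python
import numpy as np

def box_counting_fd(binary_mask, min_box=2):
    """Estimate the fractal dimension of a binary mask by box counting.

    binary_mask: 2-D boolean array (True where the class of interest is present).
    Returns the slope of log(N(s)) versus log(1/s).
    """
    mask = np.asarray(binary_mask, dtype=bool)
    size = 2 ** int(np.floor(np.log2(min(mask.shape))))
    mask = mask[:size, :size]                 # crop to a power-of-two square
    sizes, counts = [], []
    s = size
    while s >= min_box:
        # Count boxes of side s that contain at least one foreground pixel.
        view = mask.reshape(size // s, s, size // s, s)
        occupied = int(view.any(axis=(1, 3)).sum())
        if occupied > 0:
            sizes.append(s)
            counts.append(occupied)
        s //= 2
    slope, _ = np.polyfit(np.log(1.0 / np.array(sizes)), np.log(np.array(counts)), 1)
    return slope

# Example: FD of a weed mask from a predicted label map (class 1 assumed = weed).
# fd_weed = box_counting_fd(pred == 1)
```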
Figure 15. Visual examples of crops and weeds from the CWFID for estimating FD values: (a) crop and weed segmentation results; (b) weed segmentation result; (c) crop segmentation result.
Figure 16. Sample images with Grad-CAM visualization examples at last 1 × 1 convolutional layer. (a) Input image, (b) ground truth mask (black, gray, and white pixels denote background, weeds, and crops, respectively), (c) Grad-CAM of weed (class 1), (d) Grad-CAM of crop (class 2), and (e) segmented results. Each row presents different examples.

3.4. Comparisons of Algorithm Complexities

In this subsection, we evaluate the complexity of AHFF-Net by measuring the number of model parameters, inference time, floating-point operations (FLOPs), and GPU memory requirements. The inference time was assessed on both a desktop computer and the Jetson TX2 embedded system. On the desktop, the inference time was measured using two different GPUs: an NVIDIA GeForce GTX 1070 (NVIDIA Corporation, Santa Clara, CA, USA) [69] and an NVIDIA GeForce RTX 4080 SUPER (NVIDIA Corporation, Santa Clara, CA, USA) [70]. The Jetson TX2 system, powered by an NVIDIA Pascal™-family GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 256 CUDA cores, was selected for testing [71]; it consumes less than 7.5 W and provides 8 GB of memory shared between the GPU and CPU.
The Jetson TX2 was selected for this evaluation because of its relevance to real-world agricultural applications: crop and weed segmentation algorithms are typically executed onboard farming robots using embedded systems. The desktop computer specifications are outlined in Section 2.1. The inference times measured on the desktop computer and the Jetson TX2 embedded system are listed in Table 18. On the desktop equipped with the NVIDIA GeForce GTX 1070 GPU, the proposed AHFF-Net achieved an inference time of 417.96 ms. This was slower than reduced U-Net (411.98 ms), U-Net (412.05 ms), and DWUNet (416.37 ms) but faster than the other SOTA methods. On the more powerful NVIDIA GeForce RTX 4080 SUPER, AHFF-Net achieved a significantly lower inference time of 29.75 ms, marginally slower than SEG-Net (27.72 ms) and CG-Net (26.78 ms) but faster than the other SOTA methods. On the Jetson embedded system, the inference time of AHFF-Net was 908.22 ms. Although this was higher than those of most SOTA methods, it was still lower than those of DeepLab V3 Plus and MSFCA-Net, keeping AHFF-Net competitive for onboard deployment.
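The inference times above can be measured with a GPU-synchronized timing loop such as the sketch below, assuming a PyTorch model and an assumed input resolution. Warm-up iterations and explicit torch.cuda.synchronize() calls are needed so that queued GPU kernels are fully included in the measurement.

```python
import time
import torch

@torch.no_grad()
def measure_inference_time(model, input_shape=(1, 3, 512, 512),
                           warmup=10, runs=100, device="cuda"):
    """Average per-image inference time in milliseconds on a given device.

    input_shape is an assumption; substitute the resolution actually used.
    """
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                 # warm-up excludes initialization cost
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()            # wait for queued GPU kernels to finish
    return (time.perf_counter() - start) * 1000.0 / runs
```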
Table 19 compares the GPU memory requirements, numbers of parameters, and FLOPs of the proposed AHFF-Net with those of the other SOTA methods. Although AHFF-Net had more parameters than most SOTA methods, it had fewer than DeepLab V3 Plus. Similarly, its GPU memory requirements were lower than those of MSFCA-Net, DeepLab V3 Plus, and SEG-Net but higher than those of the other SOTA methods. U-Net, modified U-Net, MSFCA-Net, and SEG-Net had higher FLOPs than AHFF-Net, whereas the other SOTA methods had lower FLOPs. Although the algorithm complexity of the proposed method was not the lowest (Table 18 and Table 19), the segmentation accuracies of AHFF-Net were higher than those of all the SOTA methods (Table 11, Table 12, Table 13, Table 14, Table 15 and Table 16). There is always a trade-off between accuracy and computational cost, and we managed this trade-off by prioritizing segmentation accuracy.
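For the parameter counts and GPU memory figures of the kind reported in Table 19, a minimal sketch is shown below; it counts trainable parameters and records the peak memory allocated during one forward pass. FLOPs are typically obtained from a separate profiler (e.g., thop or fvcore), which is not shown here, and the input resolution is an assumption.

```python
import torch

def model_footprint(model, input_shape=(1, 3, 512, 512), device="cuda"):
    """Trainable parameter count and peak GPU memory (MB) for one forward pass."""
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    torch.cuda.reset_peak_memory_stats(device)
    model = model.to(device).eval()
    with torch.no_grad():
        model(torch.randn(*input_shape, device=device))
    peak_mb = torch.cuda.max_memory_allocated(device) / (1024 ** 2)
    return n_params, peak_mb
```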

4. Discussion

We performed Student's t-test [72] for statistical analysis and calculated both the p-value and Cohen's d-value [73]. For this analysis, we computed the mean and standard deviation of the F1 score; the F1 scores of the proposed AHFF-Net and the second-best U-Net [61] were obtained from Table 11 and Table 13. We obtained a p-value of 0.038, which verifies that our method outperformed the second-best U-Net at the 95% confidence level. In general, Cohen's d-value indicates the effect size: 0.2 indicates a small effect, 0.5 a medium effect, and ≥0.8 a large effect. The measured Cohen's d-value was 2.17, reflecting a large effect size. Based on these results, we verified that the proposed AHFF-Net statistically outperformed the other SOTA methods in semantic segmentation accuracy.
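The p-value and Cohen's d reported above can be computed as in the following sketch, which uses an independent two-sample t-test and a pooled-standard-deviation form of Cohen's d. The example inputs are placeholders, not the values from Table 11 or Table 13.

```python
import numpy as np
from scipy import stats

def compare_f1_scores(f1_proposed, f1_baseline):
    """Two-sample t-test p-value and Cohen's d between two sets of F1 scores."""
    a = np.asarray(f1_proposed, dtype=float)
    b = np.asarray(f1_baseline, dtype=float)
    t_stat, p_value = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt(((len(a) - 1) * a.std(ddof=1) ** 2 +
                         (len(b) - 1) * b.std(ddof=1) ** 2) /
                        (len(a) + len(b) - 2))
    cohens_d = (a.mean() - b.mean()) / pooled_sd
    return p_value, cohens_d

# Hypothetical example (values for illustration only, not from the paper):
# p, d = compare_f1_scores([0.79, 0.71], [0.76, 0.69])
```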
We applied gradient-weighted class activation mapping (Grad-CAM) [74] for a more detailed analysis. Grad-CAM visualizes important features in yellow and red (indicating high confidence), whereas unimportant features appear in blue, reflecting lower confidence in the predictions. Figure 16 presents the Grad-CAM analysis of the proposed AHFF-Net at the final 1 × 1 convolutional layer. It demonstrates a high level of confidence in the extraction of important features for crops and weeds in the semantic segmentation task. The Grad-CAM images also verify that the proposed AHFF-Net accurately distinguishes between crops and weeds in a heterogeneous dataset environment.
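A compact Grad-CAM sketch for a segmentation network is given below. It hooks a chosen layer, backpropagates an aggregated class score (here the sum of that class's logits over all pixels), and weights the activations by their spatially pooled gradients; the layer choice and the class-score aggregation are assumptions and may differ from the exact procedure used for Figure 16.

```python
import torch
import torch.nn.functional as F

def grad_cam_segmentation(model, image, target_layer, target_class):
    """Grad-CAM heat map for one class of a semantic segmentation network.

    image: (1, 3, H, W) tensor; target_layer: the module to hook (e.g., a late
    convolutional layer); target_class: class index (assumed 1 = weed, 2 = crop).
    """
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(image)                        # (1, C, H, W)
        score = logits[:, target_class].sum()        # aggregate class score
        model.zero_grad()
        score.backward()
        weights = gradients["value"].mean(dim=(2, 3), keepdim=True)
        cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                            align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-9)
    finally:
        h1.remove()
        h2.remove()
    return cam.squeeze().detach()
```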
Examples of incorrect semantic segmentation results predicted by the proposed AHFF-Net are shown in Figure 17. The incorrect segmentation can be attributed to the similarity in shape and color between crops and weeds, which hinders their differentiation, particularly for objects with thin regions; the presence of other objects with shapes similar to those of crops and weeds further contributes to segmentation errors. In the first row, although the central crop structure was well identified, surrounding weeds with similar leaf shapes and orientations were partially misidentified as crops (orange regions), and a few finer weed structures faded into the background (black). This indicates that, even with attention mechanisms, the model may struggle to maintain clear boundaries in complex arrangements. Despite these similarities in the shape, color, and overall appearance of crops and weeds (Figure 17), the proposed method still provided satisfactory segmentation performance. The second row shows a case where low contrast between the plants and the soil (upper-right part of the image) resulted in inferior segmentation: the model successfully captured the main crop in the lower part of the image, but many small weed patches were missed, likely owing to weak feature information and insufficient contextual cues from the surroundings. In the third row, the non-vegetative elements shown in cyan introduced additional confusion; although the crop was mostly detected correctly, parts of the background were misclassified (gray regions), and several peripheral weeds were mislabeled as crops or left unsegmented. This indicates that unseen or irregular objects in the field can disturb the learned class boundaries of the model. Moreover, overlapping crop and weed plants lower performance and lead to misclassification, a limitation common to all the SOTA methods.
In our study, we utilized LLaMA [75] to recommend pesticides for early-stage weed management, introducing a data-driven approach to improve decision-making (Table 20). Unlike conventional methods that depend heavily on manual observations and predefined guidelines, LLaMA enables the analysis of weed species profiles, environmental conditions, and pesticide efficacy records. Our observations showed that LLaMA effectively identified suitable pesticides with enhanced accuracy, which can help farmers make informed decisions, minimize environmental risks, and promote sustainable agricultural practices. In real applications, even non-experts can determine the appropriate pesticide by supplying the LLaMA model with the input image and the segmentation mask produced by our method. However, if only the raw input image is fed to LLaMA, it may fail to differentiate crops from weeds because of their shared attributes (such as shape, color, and overlapping regions), ultimately leading to incorrect pesticide recommendations. The segmentation output of the proposed method helps LLaMA distinguish and identify the pixels belonging to weeds and crops, resulting in accurate pesticide recommendations. The proposed system also supports practitioners in in-field weed control and optimizes pesticide usage with enhanced accuracy. In this study, the proposed method showed satisfactory performance on three different datasets with extensive variability, and it can be integrated into a vision-based, robot-assisted system to discriminate between crops and weeds. Although our framework was tested with limited data, it is scalable to larger datasets and other applications; therefore, the proposed method is expected to cope well with field variability and scale.
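The coupling between the segmentation output and the LLM can be as simple as summarizing the predicted mask into a text prompt, as in the sketch below. The prompt wording, the coverage statistics chosen, and the query_llama() call are illustrative placeholders; the interface to the locally deployed LLaMA 3.2 model is not specified here.

```python
import numpy as np

def build_pesticide_prompt(pred_mask, fd_weed=None):
    """Summarize a segmentation mask into a text prompt for the LLM.

    pred_mask: (H, W) array with 0 = background, 1 = weed, 2 = crop (assumed).
    """
    weed_cover = float(np.mean(pred_mask == 1)) * 100.0
    crop_cover = float(np.mean(pred_mask == 2)) * 100.0
    prompt = (
        "A field image was segmented into crops and weeds. "
        f"Weed coverage: {weed_cover:.1f}% of pixels, crop coverage: {crop_cover:.1f}%."
    )
    if fd_weed is not None:
        prompt += f" Estimated fractal dimension of the weed regions: {fd_weed:.2f}."
    prompt += (" Recommend a suitable early-stage herbicide/pesticide and an "
               "application strategy that minimizes harm to the crop.")
    return prompt

# response = query_llama(prompt)   # hypothetical call to the deployed LLaMA model
```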

5. Conclusions

This study introduced AHFF-Net, a novel method that uses a small amount of training data together with patch-based augmentation to enhance the accuracy of semantic segmentation of crops and weeds. Three open crop and weed databases were used to evaluate the performance of AHFF-Net, verifying that our method outperformed the SOTA methods. The architecture of AHFF-Net features the PERB, HMFB, and AFEB, which together improve feature extraction by capturing fine details such as edges and textures. These blocks also enable the model to learn richer and more diverse feature information, capture fine-grained patterns, and apply an attention mechanism to critical regions within crops and weeds in heterogeneous database environments. In addition, we integrated an FD estimation approach into our system to provide valuable insights into the distributional characteristics of crops and weeds, which also confirmed AHFF-Net's ability to extract the features essential for distinguishing between them. We verified through a t-test and Cohen's d-value that the proposed AHFF-Net outperformed previous methods, and we validated its capability to extract key features essential for accurate segmentation of crops and weeds using Grad-CAM. We also verified that our method can be applied to the Jetson TX2 embedded system for farming robots. Furthermore, we integrated LLaMA to generate pesticide recommendations specific to the types of weeds identified in the segmented images. Notwithstanding a few segmentation inaccuracies (mainly owing to the visual similarities between crops and weeds, overlapping thin structures, and background clutter), LLaMA demonstrated the capability to provide context-aware and effective recommendations. The segmentation results revealed key challenges where crops and weeds share similar shapes, colors, and thin structures, resulting in boundary ambiguity and inaccurate predictions. Even with attention mechanisms, the model found it challenging to handle fine weed details, low-contrast regions, and irregular non-vegetative elements that disrupt class boundaries. These issues highlight the limitations of the model in capturing complex spatial and contextual variations under real field conditions.
In future studies, we plan to focus on pre-processing methods to address these issues and to consider more diverse datasets. We will also evaluate generative adversarial networks and diffusion-model-based techniques to address the difficulty of segmenting small and thin crops and weeds with similar shapes and colors in heterogeneous data environments. Additionally, we will study methods for creating more lightweight models with lower computational requirements and fewer parameters, while maintaining accuracy, based on knowledge distillation.

Author Contributions

Conceptualization, R.A.; methodology, R.A.; data curation, J.S.K. and M.S.J.; validation, H.A.H.G., M.H.T. and M.I.; supervision, K.R.P.; writing—original draft preparation, R.A.; writing—review and editing, K.R.P. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Ministry of Science and ICT (MSIT), Korea, through the Information Technology Research Center (ITRC) Support Program under Grant IITP-2025-RS-2020-II201789 and by the Artificial Intelligence Convergence Innovation Human Resources Development supervised by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under Grant IITP-2025-RS-2023-00254592.

Data Availability Statement

The data presented in this study are available in [https://github.com/] at [https://github.com/iamrehanch/AHFF-Net_for_Semantic_Segmentation_of_Crops_and_Weeds], reference number [53]. The datasets were derived from the following resources available in the public domain: [http://www.ipb.uni-bonn.de/data/sugarbeets2016/, https://github.com/cwfid, and https://sites.google.com/diag.uniroma1.it/image-synthesis/], all accessed on 4 August 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhang, Q.; Ying, Y.; Ping, J. Recent Advances in Plant Nanoscience. Adv. Sci. 2022, 9, 2103414.
  2. Kong, L.; Huang, M.; Zhang, L.; Chan, L.W.C. Enhancing Diagnostic Images to Improve the Performance of the Segment Anything Model in Medical Image Segmentation. Bioengineering 2024, 11, 270.
  3. Usman, M.; Sultan, H.; Hong, J.S.; Kim, S.G.; Akram, R.; Gondal, H.A.H.; Tariq, M.H.; Park, K.R. Dilated Multilevel Fused Network for Virus Classification Using Transmission Electron Microscopy Images. Eng. Appl. Artif. Intell. 2024, 138, 109348.
  4. Liu, L.; Guo, Z.; Liu, Z.; Zhang, Y.; Cai, R.; Hu, X.; Yang, R.; Wang, G. Multi-Task Intelligent Monitoring of Construction Safety Based on Computer Vision. Buildings 2024, 14, 2429.
  5. Sultan, H.; Owais, M.; Nam, S.H.; Haider, A.; Akram, R.; Usman, M.; Park, K.R. MDFU-Net: Multiscale Dilated Features Up-sampling Network for Accurate Segmentation of Tumor from Heterogeneous Brain Data. J. King Saud Univ.-Comput. Inf. Sci. 2023, 35, 101560.
  6. Li, Z.; Xiang, L.; Sun, J.; Liao, D.; Xu, L.; Wang, M. A Multi-Level Knowledge Distillation for Enhanced Crop Segmentation in Precision Agriculture. Agriculture 2025, 15, 1418.
  7. Tang, Z.; Sun, J.; Tian, Y.; Xu, J.; Zhao, W.; Jiang, G.; Deng, J.; Gan, X. CVRP: A Rice Image Dataset with High-Quality Annotations for Image Segmentation and Plant Phenomics Research. Plant Phenom. 2025, 7, 100025.
  8. Arsalan, M.; Haider, A.; Hong, J.S.; Kim, J.S.; Park, K.R. Deep Learning-Based Detection of Human Blastocyst Compartments with Fractal Dimension Estimation. Fractal Fract. 2024, 8, 267.
  9. Rai, H.M.; Omkar Lakshmi Jagan, B.; Rao, N.T.; Mohammed, T.K.; Agarwal, N.; Abdallah, H.A.; Agarwal, S. Deep Learning for Leukemia Classification: Performance Analysis and Challenges Across Multiple Architectures. Fractal Fract. 2025, 9, 337.
  10. Liu, X.; Gao, N.; He, S.; Wang, L. Application of Fractional Fourier Transform and BP Neural Network in Prediction of Tumor Benignity and Malignancy. Fractal Fract. 2025, 9, 267.
  11. Tariq, M.H.; Sultan, H.; Akram, R.; Kim, S.G.; Kim, J.S.; Usman, M.; Gondal, H.A.H.; Seo, J.; Lee, Y.H.; Park, K.R. Estimation of Fractal Dimensions and Classification of Plant Disease with Complex Backgrounds. Fractal Fract. 2025, 9, 315.
  12. Jiang, Y.; Li, C. Convolutional Neural Networks for Image-Based High-Throughput Plant Phenotyping: A Review. Plant Phenom. 2020, 2020, 4152816.
  13. Zhang, C.; Kong, J.; Wu, D.; Guan, Z.; Ding, B.; Chen, F. Wearable Sensor: An Emerging Data Collection Tool for Plant Phenotyping. Plant Phenom. 2023, 5, 0051.
  14. Al-Ghaili, A.M.; Gunasekaran, S.S.; Jamil, N.; Alyasseri, Z.A.A.; Al-Hada, N.M.; Ibrahim, Z.-A.B.; Bakar, A.A.; Kasim, H.; Hosseini, E.; Omar, R.; et al. A Review on Role of Image Processing Techniques to Enhancing Security of IoT Applications. IEEE Access 2023, 11, 101924–101948.
  15. Li, D.; Li, J.; Xiang, S.; Pan, A. PSegNet: Simultaneous Semantic and Instance Segmentation for Point Clouds of Plants. Plant Phenom. 2022, 2022, 9787643.
  16. Rawat, S.; Chandra, A.L.; Desai, S.V.; Balasubramanian, V.N.; Ninomiya, S.; Guo, W. How Useful Is Image-Based Active Learning for Plant Organ Segmentation? Plant Phenom. 2022, 2022, 9795275.
  17. Yun, C.; Kim, Y.H.; Lee, S.J.; Im, S.J.; Park, K.R. WRA-Net: Wide Receptive Field Attention Network for Motion Deblurring in Crop and Weed Image. Plant Phenom. 2023, 5, 0031.
  18. Parven, A.; Meftaul, I.M.; Venkateswarlu, K.; Megharaj, M. Herbicides in Modern Sustainable Agriculture: Environmental Fate, Ecological Implications, and Human Health Concerns. Int. J. Environ. Sci. Technol. 2024, 22, 1181–1202.
  19. Gupta, S.K.; Yadav, S.K.; Soni, S.K.; Shanker, U.; Singh, P.K. Multiclass Weed Identification Using Semantic Segmentation: An Automated Approach for Precision Agriculture. Ecol. Inform. 2023, 78, 102366.
  20. Hasan, A.S.M.M.; Diepeveen, D.; Laga, H.; Jones, M.G.K.; Sohel, F. Object-Level Benchmark for Deep Learning-Based Detection and Classification of Weed Species. Crop Prot. 2024, 177, 106561.
  21. Li, W.; Zhang, Y. DC-YOLO: An Improved Field Plant Detection Algorithm Based on YOLOv7-Tiny. Sci. Rep. 2024, 14, 26430.
  22. Akram, R.; Hong, J.S.; Kim, S.G.; Sultan, H.; Usman, M.; Gondal, H.A.H.; Tariq, M.H.; Ullah, N.; Park, K.R. Crop and Weed Segmentation and Fractal Dimension Estimation Using Small Training Data in Heterogeneous Data Environment. Fractal Fract. 2024, 8, 285.
  23. Liu, Y.; Liu, M.; Zhao, X.; Zhu, J.; Wang, L.; Ma, H.; Zhang, M. Real-time Semantic Segmentation Network for Crops and Weeds Based on Multi-branch Structure. IET Comput. Vis. 2024, 18, 1313–1324.
  24. Chebrolu, N.; Lottes, P.; Schaefer, A.; Winterhalter, W.; Burgard, W.; Stachniss, C. Agricultural Robot Dataset for Plant Classification, Localization and Mapping on Sugar Beet Fields. Int. J. Robot. Res. 2017, 36, 1045–1052.
  25. Haug, S.; Ostermann, J. A Crop/Weed Field Image Dataset for the Evaluation of Computer Vision Based Precision Agriculture Tasks. In Proceedings of the Computer Vision—ECCV 2014 Workshops, Zurich, Switzerland, 6–7 and 12 September 2014; pp. 105–116.
  26. Fawakherji, M.; Potena, C.; Pretto, A.; Bloisi, D.D.; Nardi, D. Multi-Spectral Image Synthesis for Crop/Weed Segmentation in Precision Farming. Robot. Auton. Syst. 2021, 146, 103861.
  27. Nguyen, D.T.; Nam, S.H.; Batchuluun, G.; Owais, M.; Park, K.R. An Ensemble Classification Method for Brain Tumor Images Using Small Training Data. Mathematics 2022, 10, 4566.
  28. Wahid, A.; Mahmood, T.; Hong, J.S.; Kim, S.G.; Ullah, N.; Akram, R.; Park, K.R. Multi-Path Residual Attention Network for Cancer Diagnosis Robust to a Small Number of Training Data of Microscopic Hyperspectral Pathological Images. Eng. Appl. Artif. Intell. 2024, 133, 108288.
  29. Abdalla, A.; Cen, H.; Wan, L.; Rashid, R.; Weng, H.; Zhou, W.; He, Y. Fine-Tuning Convolutional Neural Network with Transfer Learning for Semantic Segmentation of Ground-Level Oilseed Rape Images in a Field with High Weed Pressure. Comput. Electron. Agric. 2019, 167, 105091.
  30. Gao, J.; Nuyttens, D.; Lootens, P.; He, Y.; Pieters, J.G. Recognising weeds in a maize crop using a random forest machine-learning algorithm and near-infrared snapshot mosaic hyperspectral imagery. Biosyst. Eng. 2018, 170, 39–50.
  31. Lottes, P.; Hörferlin, M.; Sander, S.; Stachniss, C. Effective Vision-Based Classification for Separating Sugar Beets and Weeds for Precision Farming: Effective Vision-Based Classification. J. Field Robot. 2017, 34, 1160–1178.
  32. Bakhshipour, A.; Jafari, A. Evaluation of Support Vector Machine and Artificial Neural Networks in Weed Detection Using Shape Features. Comput. Electron. Agric. 2018, 145, 153–160.
  33. Zheng, Y.; Zhu, Q.; Huang, M.; Guo, Y.; Qin, J. Maize and Weed Classification Using Color Indices with Support Vector Data Description in Outdoor Fields. Comput. Electron. Agric. 2017, 141, 215–222.
  34. Wu, X.; Xu, W.; Song, Y.; Cai, M. A Detection Method of Weed in Wheat Field on Machine Vision. Procedia Eng. 2011, 15, 1998–2003.
  35. Kamath, R.; Balachandra, M.; Prabhu, S. Paddy Crop and Weed Discrimination: A Multiple Classifier System Approach. Int. J. Agron. 2020, 2020, 6474536.
  36. Zhao, X.; Wang, X.; Li, C.; Fu, H.; Yang, S.; Zhai, C. Cabbage and weed identification based on machine learning and target spraying system design. Front. Plant Sci. 2022, 13, 924973.
  37. Ahmed, A.; Rafique, A.A. Deep Network for Smart Precision Agriculture through Segmentation and Classification of Crops. In Proceedings of the International Bhurban Conference on Applied Sciences and Technology, Islamabad, Pakistan, 16–20 August 2022; pp. 502–507.
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  39. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556v6.
  40. Brilhador, A.; Gutoski, M.; Hattori, L.T.; de Souza Inácio, A.; Lazzaretti, A.E.; Lopes, H.S. Classification of Weeds and Crops at the Pixel-Level Using Convolutional Neural Networks and Data Augmentation. In Proceedings of the IEEE Latin American Conference on Computational Intelligence, Guayaquil, Ecuador, 11–15 November 2019; pp. 1–6.
  41. Das, M.; Bais, A. DeepVeg: Deep Learning Model for Segmentation of Weed, Canola, and Canola Flea Beetle Damage. IEEE Access 2021, 9, 119367–119380.
  42. Milioto, A.; Lottes, P.; Stachniss, C. Real-Time Semantic Segmentation of Crop and Weed for Precision Agriculture Robots Leveraging Background Knowledge in CNNs. In Proceedings of the IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018; pp. 2229–2235.
  43. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147v1.
  44. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
  45. Sahin, H.M.; Miftahushudur, T.; Grieve, B.; Yin, H. Segmentation of Weeds and Crops Using Multispectral Imaging and CRF-Enhanced U-Net. Comput. Electron. Agric. 2023, 211, 107956.
  46. Fathipoor, H.; Shah-Hosseini, R.; Arefi, H. Crop and Weed Segmentation on Ground-Based Images using Deep Convolutional Neural Network. In Proceedings of the ISPRS Annals of the Photogrammetry Remote Sensing and Spatial Information Sciences, Tehran, Iran, 13 January 2023; pp. 195–200.
  47. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention, Munich, Germany, 5–9 October 2015; pp. 234–241.
  48. Zhou, Z.; Siddiquee, M.M.R.; Tajbakhsh, N.; Liang, J. UNet++: A Nested U-Net Architecture for Medical Image Segmentation. In Proceedings of the Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, Granada, Spain, 20 September 2018; pp. 3–11.
  49. Kim, Y.; Park, K.R. MTS-CNN: Multi-Task Semantic Segmentation-Convolutional Neural Network for Detecting Crops and Weeds. Comput. Electron. Agric. 2022, 199, 107146.
  50. Zou, K.; Chen, X.; Wang, Y.; Zhang, C.; Zhang, F. A Modified U-Net with a Specific Data Argumentation Method for Semantic Segmentation of Weed Images in the Field. Comput. Electron. Agric. 2021, 187, 106242.
  51. Khan, A.; Ilyas, T.; Umraiz, M.; Mannan, Z.I.; Kim, H. CED-Net: Crops and Weeds Segmentation for Smart Farming Using a Small Cascaded Encoder-Decoder Architecture. Electronics 2020, 9, 1602.
  52. Moazzam, S.I.; Nawaz, T.; Qureshi, W.S.; Khan, U.S.; Tiwana, M.I. A W-Shaped Convolutional Network for Robust Crop and Weed Classification in Agriculture. Precis. Agric. 2023, 24, 2002–2018.
  53. AHFF-Net for Semantic Segmentation of Crops and Weeds in Heterogeneous Dataset Environment. Available online: https://github.com/iamrehanch/AHFF-Net_for_Semantic_Segmentation_of_Crops_and_Weeds (accessed on 2 January 2025).
  54. Reinhard, E.; Adhikhmin, M.; Gooch, B.; Shirley, P. Color Transfer between Images. IEEE Comput. Graph. Appl. 2001, 21, 34–41.
  55. Ruderman, D.L.; Cronin, T.W.; Chiao, C.C. Statistics of Cone Responses to Natural Images: Implications for Visual Coding. J. Opt. Soc. Am. A 1998, 15, 2036–2045.
  56. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  57. Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Jorge Cardoso, M. Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. In Proceedings of the International Conference on Communication, Control and Information Sciences, Québec City, QC, Canada, 9 September 2017; pp. 240–248.
  58. Rezaie, A.; Mauron, A.J.; Beyer, K. Sensitivity analysis of fractal dimensions of crack maps on concrete and masonry walls. Autom. Constr. 2020, 117, 103258.
  59. Wu, J.; Jin, X.; Mi, S.; Tang, J. An effective method to compute the box-counting dimension based on the mathematical definition and intervals. Results Eng. 2020, 6, 100106.
  60. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv 2016, arXiv:1608.03983v5.
  61. Bhatti, M.A.; Syam, M.S.; Chen, H.; Hu, Y.; Keung, L.W.; Zeeshan, Z.; Ali, Y.A.; Sarhan, N. Utilizing Convolutional Neural Networks (CNN) and U-Net Architecture for Precise Crop and Weed Segmentation in Agricultural Imagery: A Deep Learning Approach. Big Data Res. 2024, 36, 100465.
  62. Arun, R.A.; Umamaheswari, S.; Jain, A.V. Reduced U-Net Architecture for Classifying Crop and Weed Using Pixel-Wise Segmentation. In Proceedings of the IEEE International Conference for Innovation in Technology, Bangluru, India, 6–8 November 2020; pp. 1–6.
  63. Habib, M.; Sekhra, S.; Tannouche, A.; Ounejjar, Y. New Segmentation Approach for Effective Weed Management in Agriculture. Smart Agric. Technol. 2024, 8, 100505.
  64. Yang, Q.; Ye, Y.; Gu, L.; Wu, Y. MSFCA-Net: A Multi-Scale Feature Convolutional Attention Network for Segmenting Crops and Weeds in the Field. Agriculture 2023, 13, 1176.
  65. Wang, A.; Xu, Y.; Wei, X.; Cui, B. Semantic Segmentation of Crop and Weed Using an Encoder-Decoder Network and Image Enhancement Method Under Uncontrolled Outdoor Illumination. IEEE Access 2020, 8, 81724–81734.
  66. Kamath, R.; Balachandra, M.; Vardhan, A.; Maheshwari, U. Classification of Paddy Crop and Weeds Using Semantic Segmentation. Cogent Eng. 2022, 9, 2018791.
  67. Wu, T.; Tang, S.; Zhang, R.; Cao, J.; Zhang, Y. CGNet: A Light-Weight Context Guided Network for Semantic Segmentation. IEEE Trans. Image Process. 2021, 30, 1169–1179.
  68. Sultan, H.; Ullah, N.; Hong, J.S.; Kim, S.G.; Lee, D.C.; Jung, S.Y.; Park, K.R. Estimation of fractal dimension and segmentation of brain tumor with parallel features aggregation network. Fractal Fract. 2024, 8, 357.
  69. NVIDIA GeForce GTX 1070. Available online: https://www.nvidia.com/en-us/geforce/10-series/ (accessed on 20 February 2025).
  70. NVIDIA GeForce RTX 4080 SUPER. Available online: https://www.nvidia.com/en-us/geforce/graphics-cards/40-series/ (accessed on 20 February 2025).
  71. Jetson TX2. Available online: https://developer.nvidia.com/embedded/jetson-tx2 (accessed on 20 February 2025).
  72. Mishra, P.; Singh, U.; Pandey, C.M.; Mishra, P.; Pandey, G. Application of student's t-test, analysis of variance, and covariance. Ann. Card. Anaesth. 2019, 22, 407–411.
  73. Cohen, J. A power primer. Psychol. Bull. 1992, 112, 155–159.
  74. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626.
  75. Large Language Model Meta AI (LLaMA) 3.2. Available online: https://www.llama.com (accessed on 2 January 2025).
Figure 1. Examples of sample images (top) and ground truth masks (bottom) (crops in green pixels, weeds in red pixels, background in black pixels) for crop and weed datasets: (a) BoniRob dataset, (b) CWFID, (c) Sunflower dataset.
Figure 2. Overview of the proposed method. (SSIM denotes structural similarity index measure).
Figure 3. Sample images from the different datasets showing input images (upper row), target images (middle row), and transformed images (lower row). (a) Transformed image from CWFID (input image) to BoniRob dataset (target image), (b) transformed image from Sunflower dataset (input image) to BoniRob dataset (target image), (c) transformed image from BoniRob dataset (input image) to CWFID (target image), (d) transformed image from Sunflower dataset (input image) to CWFID (target image), (e) transformed image from BoniRob dataset (input image) to Sunflower dataset (target image), (f) transformed image from CWFID (input image) to Sunflower dataset (target image).
Figure 4. Illustration of patch-based augmentation. (a) Input image to generate patches, (b–d) patch-based augmented images.
Figure 5. Proposed AHFF-Net architecture.
Figure 6. Proposed PERB structure.
Figure 7. Proposed HMFB structure.
Figure 8. Proposed AFEB structure.
Figure 9. Visualization examples of semantic segmentation results for the proposed AHFF-Net and SOTA methods. (a) Original image, (b) target image, (c) pre-processed image, and (d) ground truth label. Resulting images from (e) U-Net (RHT + small training data + IMBA), (f) U-Net, (g) modified U-Net, (h) Unet++, (i) reduced U-Net, (j) DWUNet, (k) MSFCA-Net, (l) DeepLab V3 Plus, (m) SEG-Net, (n) CG-Net, and (o) AHFF-Net (proposed). (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds detected incorrectly as crops or the background, crops detected incorrectly as weeds or the background, and background detected incorrectly as crops or weeds, respectively, in (e–o)).
Figure 10. Visualization examples of semantic segmentation results for the proposed AHFF-Net and SOTA methods. (a) Original image, (b) target image, (c) pre-processed image, and (d) ground truth label. Resulting images from (e) U-Net (RHT + small training data + IMBA), (f) U-Net, (g) modified U-Net, (h) Unet++, (i) reduced U-Net, (j) DWUNet, (k) MSFCA-Net, (l) DeepLab V3 Plus, (m) SEG-Net, (n) CG-Net, and (o) AHFF-Net (proposed). (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds detected incorrectly as crops or the background, crops detected incorrectly as weeds or the background, and background detected incorrectly as crops or weeds, respectively, in (e–o)).
Figure 11. Visualization examples of semantic segmentation results for the proposed AHFF-Net and SOTA methods. (a) Original image, (b) target image, (c) pre-processed image, and (d) ground truth label. Resulting images from (e) U-Net (RHT + small training data + IMBA), (f) U-Net, (g) modified U-Net, (h) Unet++, (i) reduced U-Net, (j) DWUNet, (k) MSFCA-Net, (l) DeepLAB V3 Plus, (m) SEG-Net, (n) CG-Net, and (o) AHFF-Net (proposed). (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds detected incorrectly as crops or the background, crops detected incorrectly as weeds or the background, and background detected incorrectly as crops or weeds, respectively, in (e–o)).
Figure 12. Visualization examples of semantic segmentation results for the proposed AHFF-Net and SOTA methods. (a) Original image, (b) target image, (c) pre-processed image, and (d) ground truth label. Resulting images from (e) U-Net (RHT + small training data + IMBA), (f) U-Net, (g) modified U-Net, (h) Unet++, (i) reduced U-Net, (j) DWUNet, (k) MSFCA-Net, (l) DeepLAB V3 Plus, (m) SEG-Net, (n) CG-Net, and (o) AHFF-Net (proposed). (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds detected incorrectly as crops or the background, crops detected incorrectly as weeds or the background, and background detected incorrectly as crops or weeds, respectively, in (e–o)).
Figure 13. Visualization examples of semantic segmentation results for the proposed AHFF-Net and SOTA methods. (a) Original image, (b) target image, (c) pre-processed image, and (d) ground truth label. Resulting images from (e) U-Net (RHT + small training data + IMBA), (f) U-Net, (g) modified U-Net, (h) Unet++, (i) reduced U-Net, (j) DWUNet, (k) MSFCA-Net, (l) DeepLAB V3 Plus, (m) SEG-Net, (n) CG-Net, and (o) AHFF-Net (proposed). (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds detected incorrectly as crops or the background, crops detected incorrectly as weeds or the background, and background detected incorrectly as crops or weeds, respectively, in (e–o)).
Figure 14. Visualization examples of semantic segmentation results for the proposed AHFF-Net and SOTA methods. (a) Original image, (b) target image, (c) pre-processed image, and (d) ground truth label. Resulting images from (e) U-Net (RHT + small training data + IMBA), (f) U-Net, (g) modified U-Net, (h) Unet++, (i) reduced U-Net, (j) DWUNet, (k) MSFCA-Net, (l) DeepLAB V3 Plus, (m) SEG-Net, (n) CG-Net, and (o) AHFF-Net (proposed). (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds detected incorrectly as crops or the background, crops detected incorrectly as weeds or the background, and background detected incorrectly as crops or weeds, respectively, in (e–o)).
Figure 17. Examples with incorrect segmentation results predicted by the proposed AHFF-Net. (a) Input image, (b) RHT image, (c) ground truth mask, and (d) segmentation result. (Red, blue, black, orange, yellow, and gray denote crops, weeds, background, weeds incorrectly detected as crops or the background, crops incorrectly detected as weeds or the background, and background incorrectly detected as crops or weeds, respectively).
Table 2. Training, validation, and testing data split for all the experiments.
Dataset-1 | Training | Validation | Dataset-2 | Testing
BoniRob | 400 | 30 | CWFID | 30
CWFID | 45 | 5 | BoniRob | 246
Sunflower | 120 | 17 | BoniRob | 246
BoniRob | 400 | 30 | Sunflower | 86
CWFID | 45 | 5 | Sunflower | 86
Sunflower | 120 | 17 | CWFID | 30
Table 3. Block by block comparisons of the semantic segmentation result (B, W, and C denote background, weed, and crop, respectively).
PERB | HMFB | AFEB | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
0.624 | 0.985 | 0.538 | 0.350 | 0.761 | 0.759 | 0.759
0.626 | 0.986 | 0.600 | 0.289 | 0.772 | 0.756 | 0.763
0.625 | 0.986 | 0.598 | 0.288 | 0.770 | 0.752 | 0.759
0.624 | 0.986 | 0.592 | 0.294 | 0.770 | 0.753 | 0.761
0.633 | 0.986 | 0.578 | 0.334 | 0.787 | 0.765 | 0.774
0.632 | 0.986 | 0.622 | 0.290 | 0.784 | 0.761 | 0.772
0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
“√” indicates that the corresponding experiment includes the respective combination of blocks.
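The per-class IOU, mIOU, precision, recall, and F1 score reported in Tables 3–16 can be derived from a class-wise confusion matrix. The following is a minimal NumPy sketch of one common way to compute them; the exact averaging used by the authors (e.g., class-averaged versus pixel-level precision and recall) is an assumption here, not a reproduction of their code.

import numpy as np

def confusion_matrix(pred, gt, num_classes=3):
    # Accumulate a num_classes x num_classes confusion matrix (rows: ground truth, cols: prediction).
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(
        num_classes * gt[mask].astype(int) + pred[mask].astype(int),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp  # predicted as class c but belonging to another class
    fn = cm.sum(axis=1) - tp  # belonging to class c but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1e-9)  # per-class IOU: (B, W, C)
    precision = float((tp / np.maximum(tp + fp, 1e-9)).mean())
    recall = float((tp / np.maximum(tp + fn, 1e-9)).mean())
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return {"mIOU": iou.mean(), "IOU": iou, "Precision": precision, "Recall": recall, "F1": f1}

# Toy example with random 3-class label maps (0 = background, 1 = weed, 2 = crop).
pred = np.random.randint(0, 3, (512, 512))
gt = np.random.randint(0, 3, (512, 512))
print(segmentation_metrics(confusion_matrix(pred, gt)))

With this macro-averaged formulation, the reported F1 scores are consistent with 2·Precision·Recall/(Precision + Recall) applied to the tabulated precision and recall values.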
Table 4. Result comparison of semantic segmentation with IMBA and patch-based augmentation (B, W, and C denote background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
With IMBA | 0.644 | 0.986 | 0.62 | 0.326 | 0.797 | 0.765 | 0.780
With patch-based augmentation (proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
Table 5. Results of semantic segmentation with different numbers of patches in patch-based augmentation (B, W, and C denote background, weed, and crop, respectively).
Number of Patches | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
4 | 0.623 | 0.986 | 0.548 | 0.335 | 0.772 | 0.751 | 0.760
5 (Proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
6 | 0.646 | 0.986 | 0.583 | 0.369 | 0.792 | 0.764 | 0.778
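Patch-based augmentation as compared in Tables 4 and 5 can be realized in several ways; the sketch below shows one simple interpretation in which a fixed number of random crops are taken from each image/label pair and resized back to the network input size. The crop scale, output size, and random placement are illustrative assumptions rather than the authors' exact settings.

import numpy as np
import cv2  # OpenCV is used here only for resizing

def patch_augment(image, label, num_patches=5, scale=0.6, out_size=(512, 512)):
    # Take num_patches random crops from an image/label pair and resize them back
    # to the network input size; labels use nearest-neighbor interpolation so that
    # class indices are preserved.
    h, w = image.shape[:2]
    ph, pw = int(h * scale), int(w * scale)
    patches = []
    for _ in range(num_patches):
        y = np.random.randint(0, h - ph + 1)
        x = np.random.randint(0, w - pw + 1)
        img_p = cv2.resize(image[y:y + ph, x:x + pw], out_size, interpolation=cv2.INTER_LINEAR)
        lbl_p = cv2.resize(label[y:y + ph, x:x + pw], out_size, interpolation=cv2.INTER_NEAREST)
        patches.append((img_p, lbl_p))
    return patches

# Example with a dummy image/label pair.
image = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
label = np.random.randint(0, 3, (720, 1280), dtype=np.uint8)
augmented = patch_augment(image, label, num_patches=5)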
Table 6. Ablation study results for different numbers of convolutions in PERB (B, W, and C denote background, weed, and crop, respectively).
Number of Convolutions | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
2 | 0.636 | 0.985 | 0.562 | 0.359 | 0.783 | 0.765 | 0.773
3 (Proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
4 | 0.645 | 0.985 | 0.598 | 0.353 | 0.795 | 0.769 | 0.781
Table 7. Ablation study results comparing sequential and parallel convolutional manners in PERB (B, W, and C denote background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
With sequential convolutions (proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
With parallel convolutions | 0.637 | 0.985 | 0.554 | 0.371 | 0.785 | 0.764 | 0.774
Table 8. Ablation study results with and without residual connection in HMFB (B, W, and C denote background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
With residual connection (proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
Without residual connection | 0.631 | 0.986 | 0.564 | 0.341 | 0.792 | 0.754 | 0.772
Table 9. Ablation study results for different numbers of convolutions in HMFB (B, W, and C denote background, weed, and crop, respectively).
Number of Convolutions | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
3 | 0.636 | 0.985 | 0.623 | 0.299 | 0.796 | 0.761 | 0.778
4 (Proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
5 | 0.634 | 0.986 | 0.579 | 0.335 | 0.79 | 0.763 | 0.776
Table 10. Ablation study results with and without attention in AFEB (B, W, and C denote background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
With attention (Proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
Without attention | 0.64 | 0.986 | 0.562 | 0.371 | 0.778 | 0.781 | 0.779
Table 11. Comparison of semantic segmentation results for crops and weeds between the proposed AHFF-Net and various SOTA methods (B, W, and C denote background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
U-Net (RHT + small training data + IMBA) [22] | 0.620 | 0.986 | 0.524 | 0.349 | 0.749 | 0.762 | 0.752
U-Net [61] | 0.624 | 0.985 | 0.538 | 0.350 | 0.761 | 0.759 | 0.759
Modified U-Net [40] | 0.606 | 0.985 | 0.546 | 0.288 | 0.765 | 0.738 | 0.751
Unet++ [46] | 0.547 | 0.982 | 0.514 | 0.146 | 0.684 | 0.654 | 0.668
Reduced U-Net [62] | 0.560 | 0.978 | 0.482 | 0.220 | 0.680 | 0.691 | 0.685
DWUNet [63] | 0.535 | 0.979 | 0.447 | 0.178 | 0.646 | 0.655 | 0.650
MSFCA-Net [64] | 0.538 | 0.982 | 0.465 | 0.166 | 0.647 | 0.658 | 0.652
DeepLAB V3 Plus [65] | 0.597 | 0.974 | 0.556 | 0.262 | 0.712 | 0.713 | 0.712
SEG-Net [66] | 0.558 | 0.961 | 0.516 | 0.197 | 0.631 | 0.719 | 0.672
CG-Net [67] | 0.506 | 0.963 | 0.447 | 0.110 | 0.711 | 0.576 | 0.636
AHFF-Net (proposed) | 0.653 | 0.986 | 0.637 | 0.338 | 0.804 | 0.772 | 0.787
Table 12. Comparison of semantic segmentation results for crops and weeds between the proposed AHFF-Net and various SOTA methods (B, W, and C mean background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
U-Net (RHT + small training data + IMBA) [22] | 0.637 | 0.971 | 0.292 | 0.647 | 0.708 | 0.787 | 0.743
U-Net [61] | 0.642 | 0.975 | 0.301 | 0.649 | 0.755 | 0.738 | 0.746
Modified U-Net [40] | 0.584 | 0.966 | 0.277 | 0.509 | 0.644 | 0.778 | 0.704
Unet++ [46] | 0.576 | 0.958 | 0.230 | 0.541 | 0.633 | 0.789 | 0.702
Reduced U-Net [62] | 0.568 | 0.951 | 0.238 | 0.516 | 0.626 | 0.803 | 0.703
DWUNet [63] | 0.533 | 0.967 | 0.077 | 0.555 | 0.667 | 0.578 | 0.619
MSFCA-Net [64] | 0.570 | 0.972 | 0.225 | 0.513 | 0.652 | 0.723 | 0.685
DeepLAB V3 Plus [65] | 0.541 | 0.969 | 0.208 | 0.446 | 0.661 | 0.644 | 0.652
SEG-Net [66] | 0.491 | 0.951 | 0.205 | 0.319 | 0.565 | 0.652 | 0.605
CG-Net [67] | 0.475 | 0.930 | 0.211 | 0.285 | 0.553 | 0.735 | 0.631
AHFF-Net (proposed) | 0.661 | 0.974 | 0.334 | 0.674 | 0.742 | 0.796 | 0.768
Table 13. Comparison of semantic segmentation results for crops and weeds between the proposed AHFF-Net and various SOTA methods (B, W, and C mean background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
U-Net (RHT + small training data + IMBA) [22] | 0.581 | 0.993 | 0.503 | 0.248 | 0.679 | 0.694 | 0.685
U-Net [61] | 0.587 | 0.992 | 0.514 | 0.256 | 0.686 | 0.695 | 0.690
Modified U-Net [40] | 0.575 | 0.990 | 0.577 | 0.158 | 0.651 | 0.700 | 0.674
Unet++ [46] | 0.580 | 0.991 | 0.604 | 0.144 | 0.694 | 0.682 | 0.687
Reduced U-Net [62] | 0.546 | 0.988 | 0.490 | 0.161 | 0.609 | 0.687 | 0.645
DWUNet [63] | 0.427 | 0.980 | 0.000 | 0.300 | 0.503 | 0.486 | 0.494
MSFCA-Net [64] | 0.583 | 0.992 | 0.520 | 0.236 | 0.685 | 0.693 | 0.688
DeepLAB V3 Plus [65] | 0.566 | 0.99 | 0.586 | 0.121 | 0.657 | 0.636 | 0.646
SEG-Net [66] | 0.505 | 0.934 | 0.544 | 0.037 | 0.571 | 0.699 | 0.628
CG-Net [67] | 0.554 | 0.986 | 0.559 | 0.116 | 0.800 | 0.572 | 0.667
AHFF-Net (proposed) | 0.601 | 0.993 | 0.512 | 0.300 | 0.724 | 0.691 | 0.707
Table 14. Comparison of semantic segmentation results for crops and weeds between the proposed AHFF-Net and various SOTA methods (B, W, and C mean background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
U-Net (RHT + small training data + IMBA) [22] | 0.561 | 0.973 | 0.172 | 0.538 | 0.659 | 0.671 | 0.664
U-Net [61] | 0.558 | 0.973 | 0.136 | 0.565 | 0.678 | 0.667 | 0.672
Modified U-Net [40] | 0.579 | 0.973 | 0.217 | 0.548 | 0.671 | 0.701 | 0.685
Unet++ [46] | 0.530 | 0.974 | 0.128 | 0.486 | 0.631 | 0.632 | 0.631
Reduced U-Net [62] | 0.514 | 0.963 | 0.154 | 0.426 | 0.588 | 0.656 | 0.620
DWUNet [63] | 0.424 | 0.966 | 0.077 | 0.230 | 0.521 | 0.501 | 0.510
MSFCA-Net [64] | 0.570 | 0.970 | 0.214 | 0.525 | 0.663 | 0.685 | 0.673
DeepLAB V3 Plus [65] | 0.471 | 0.971 | 0.124 | 0.319 | 0.580 | 0.558 | 0.568
SEG-Net [66] | 0.479 | 0.955 | 0.159 | 0.324 | 0.556 | 0.630 | 0.590
CG-Net [67] | 0.378 | 0.960 | 0.009 | 0.164 | 0.600 | 0.395 | 0.476
AHFF-Net (proposed) | 0.595 | 0.972 | 0.225 | 0.587 | 0.699 | 0.711 | 0.704
Table 15. Comparison of semantic segmentation results for crops and weeds between the proposed AHFF-Net and various SOTA methods (B, W, and C mean background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
U-Net (RHT + small training data + IMBA) [22] | 0.568 | 0.99 | 0.487 | 0.229 | 0.646 | 0.710 | 0.676
U-Net [61] | 0.582 | 0.991 | 0.514 | 0.241 | 0.686 | 0.678 | 0.682
Modified U-Net [40] | 0.569 | 0.986 | 0.509 | 0.211 | 0.617 | 0.759 | 0.680
Unet++ [46] | 0.580 | 0.991 | 0.539 | 0.209 | 0.683 | 0.682 | 0.682
Reduced U-Net [62] | 0.545 | 0.983 | 0.475 | 0.177 | 0.579 | 0.766 | 0.659
DWUNet [63] | 0.498 | 0.988 | 0.416 | 0.090 | 0.621 | 0.550 | 0.583
MSFCA-Net [64] | 0.578 | 0.992 | 0.498 | 0.243 | 0.678 | 0.673 | 0.675
DeepLAB V3 Plus [65] | 0.569 | 0.987 | 0.622 | 0.098 | 0.746 | 0.598 | 0.663
SEG-Net [66] | 0.437 | 0.930 | 0.344 | 0.035 | 0.468 | 0.675 | 0.552
CG-Net [67] | 0.441 | 0.962 | 0.278 | 0.082 | 0.599 | 0.594 | 0.596
AHFF-Net (proposed) | 0.592 | 0.992 | 0.513 | 0.272 | 0.694 | 0.710 | 0.701
Table 16. Comparison of semantic segmentation results for crops and weeds between the proposed AHFF-Net and various SOTA methods (B, W, and C mean background, weed, and crop, respectively).
Cases | mIOU | IOU (B) | IOU (W) | IOU (C) | Precision | Recall | F1 Score
U-Net (RHT + small training data + IMBA) [22] | 0.515 | 0.977 | 0.313 | 0.254 | 0.666 | 0.676 | 0.670
U-Net [61] | 0.580 | 0.978 | 0.571 | 0.189 | 0.699 | 0.690 | 0.694
Modified U-Net [40] | 0.581 | 0.975 | 0.496 | 0.273 | 0.675 | 0.727 | 0.700
Unet++ [46] | 0.569 | 0.976 | 0.485 | 0.246 | 0.695 | 0.715 | 0.704
Reduced U-Net [62] | 0.590 | 0.970 | 0.567 | 0.234 | 0.703 | 0.718 | 0.710
DWUNet [63] | 0.506 | 0.966 | 0.492 | 0.060 | 0.647 | 0.617 | 0.631
MSFCA-Net [64] | 0.539 | 0.982 | 0.560 | 0.076 | 0.640 | 0.644 | 0.641
DeepLAB V3 Plus [65] | 0.568 | 0.963 | 0.514 | 0.226 | 0.697 | 0.693 | 0.694
SEG-Net [66] | 0.483 | 0.857 | 0.523 | 0.067 | 0.554 | 0.731 | 0.627
CG-Net [67] | 0.480 | 0.950 | 0.409 | 0.081 | 0.756 | 0.534 | 0.625
AHFF-Net (proposed) | 0.595 | 0.982 | 0.485 | 0.319 | 0.714 | 0.750 | 0.731
Table 17. FD values of images from CWFID. The FD values in rows 1–4 are computed from the corresponding images in rows 1–4 of Figure 16.
Weed FD | Crop FD
1.60 | 1.09
1.57 | 1.16
1.58 | 1.10
1.56 | 1.11
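The FD values in Table 17 are consistent with a box-counting style estimate on the binary weed and crop masks. The sketch below shows a standard box-counting implementation in Python; the box sizes and fitting range are assumptions, not the authors' exact configuration.

import numpy as np

def box_counting_fd(mask, sizes=(2, 4, 8, 16, 32, 64)):
    # mask: 2D boolean array where True marks weed (or crop) pixels.
    counts = []
    for s in sizes:
        h, w = mask.shape
        hs, ws = h - h % s, w - w % s                      # trim to a multiple of s
        blocks = mask[:hs, :ws].reshape(hs // s, s, ws // s, s)
        counts.append(blocks.any(axis=(1, 3)).sum())       # boxes containing foreground
    # FD is the slope of log N(s) against log(1/s).
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(sizes)), np.log(np.maximum(counts, 1)), 1)
    return slope

# Example: FD of a random sparse "weed" mask.
print(round(box_counting_fd(np.random.rand(512, 512) > 0.9), 2))

Higher FD values for weeds than for crops, as in Table 17, indicate a more irregular and space-filling spatial distribution of the weed pixels.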
Table 18. Inference time of proposed method on desktop and Jetson embedded system (unit: ms).
Methods | Desktop (NVIDIA GeForce GTX 1070) | Desktop (NVIDIA GeForce RTX 4080 SUPER) | Jetson Embedded System
U-Net [22,61] | 412.05 | 41.50 | 629.12
Modified U-Net [40] | 426.51 | 41.08 | 472.55
Unet++ [46] | 443.83 | 40.61 | 502.86
Reduced U-Net [62] | 411.98 | 33.73 | 453.18
DWUNet [63] | 416.37 | 38.28 | 428.06
MSFCA-Net [64] | 442.06 | 45.07 | 1319.61
DeepLAB V3 Plus [65] | 456.16 | 42.43 | 977.56
SEG-Net [66] | 419.40 | 27.72 | 446.93
CG-Net [67] | 459.82 | 26.78 | 583.12
AHFF-Net (Proposed) | 417.96 | 29.75 | 908.22
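Per-image inference times such as those in Table 18 depend strongly on the measurement protocol (warm-up, synchronization, batch size). A minimal PyTorch timing sketch is shown below; the model, input shape, and iteration counts are placeholders, not the authors' measurement setup.

import time
import torch

def measure_inference_ms(model, input_size=(1, 3, 512, 512), warmup=10, runs=100, device="cuda"):
    # Average per-image forward time in milliseconds.
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):            # warm-up iterations are excluded from timing
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()       # wait for all CUDA kernels before stopping the clock
    return (time.perf_counter() - start) * 1000.0 / runs

# Example with a small stand-in network.
net = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.Conv2d(16, 3, 1))
if torch.cuda.is_available():
    print(f"{measure_inference_ms(net):.2f} ms per image")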
Table 19. Comparisons of the number of parameters, GPU memory, and FLOPs of AHFF-Net and the SOTA methods.
Methods | Number of Parameters (Unit: Mega) | GPU Memory Requirements (Unit: MB) | FLOPs (Unit: G)
U-Net [22,61] | 31.04 | 3352 | 378.10
Modified U-Net [40] | 31.03 | 3971 | 437.79
Unet++ [46] | 26.08 | 3426 | 147.75
Reduced U-Net [62] | 7.76 | 1906 | 96.25
DWUNet [63] | 3.26 | 1571 | 63.12
MSFCA-Net [64] | 30.88 | 6617 | 523.96
DeepLAB V3 Plus [65] | 54.68 | 137 | 201.6
SEG-Net [66] | 29.44 | 6647 | 339.6
CG-Net [67] | 0.50 | 3962 | 30.055
AHFF-Net (Proposed) | 47.88 | 5271 | 287.71
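The parameter counts in Table 19 can be reproduced directly from a model definition; the GPU memory and FLOPs figures additionally depend on the input resolution and the profiling tool used, which are not reproduced here. A minimal sketch for the parameter count (in millions) follows, using a small stand-in module since AHFF-Net itself is not included in this sketch.

import torch

def count_params_mega(model: torch.nn.Module) -> float:
    # Trainable parameters in millions (the "Mega" unit of Table 19).
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Example with a small stand-in model (AHFF-Net itself is not reproduced here).
demo = torch.nn.Sequential(torch.nn.Conv2d(3, 64, 3, padding=1), torch.nn.Conv2d(64, 3, 1))
print(f"{count_params_mega(demo):.2f} M parameters")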
Table 20. Pesticide recommendations for early-stage weeds by LLaMA.
Input Image | Segmentation Mask | Weed Morphology (Early Stage) | Weed Name | Pesticides
Fractalfract 09 00592 i001 | Fractalfract 09 00592 i002 | Narrow and pointed leaves that grow in clusters from the base, spreading outward in a pattern | Barnyard Grass (Echinochloa crus-galli) | Atrazine; Pendimethalin; Sulfometuron; Glyphosate *
Fractalfract 09 00592 i003 | Fractalfract 09 00592 i004 | Thin, branching stems with narrow, deeply lobed leaves forming a spreading growth habit | Knotweed (Polygonum aviculare) | Imazapyr; Triclopyr *; Dicamba *; Glyphosate *
Fractalfract 09 00592 i005 | Fractalfract 09 00592 i006 | Small, oval-shaped green leaves with smooth edges, arranged in opposite pairs on delicate stems | Chickweed (Stellaria media) | 2,4-D; Dicamba *; Sulfometuron; Triclopyr *
* Pesticides marked with * are generally used for multiple weeds.
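Table 20 illustrates the LLaMA-based recommendation stage. A minimal sketch of querying an open-source LLaMA checkpoint through the Hugging Face transformers text-generation pipeline is given below; the checkpoint name, prompt wording, and decoding settings are illustrative assumptions rather than the authors' configuration.

from transformers import pipeline

# Placeholder checkpoint: any instruction-tuned LLaMA-family model can be substituted.
generator = pipeline("text-generation", model="meta-llama/Llama-3.1-8B-Instruct", device_map="auto")

weed = "Barnyard Grass (Echinochloa crus-galli)"          # weed name produced by the recognition stage
morphology = "narrow, pointed leaves growing in clusters from the base"
prompt = (
    f"The segmented field image contains an early-stage weed identified as {weed} "
    f"with {morphology}. List suitable herbicides for controlling this weed."
)
print(generator(prompt, max_new_tokens=128)[0]["generated_text"])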