Article

MF-FusionNet: A Lightweight Multimodal Network for Monitoring Drought Stress in Winter Wheat Based on Remote Sensing Imagery

1 College of Computer and Information Engineering, Xinjiang Agricultural University, Urumqi 830052, China
2 Engineering Research Center of Intelligent Agriculture Ministry of Education, Urumqi 830052, China
3 Xinjiang Agricultural Informatization Engineering Technology Research Center, Urumqi 830052, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Agriculture 2025, 15(15), 1639; https://doi.org/10.3390/agriculture15151639
Submission received: 23 June 2025 / Revised: 20 July 2025 / Accepted: 28 July 2025 / Published: 29 July 2025

Abstract

To improve the identification of drought-affected areas in winter wheat, this paper proposes a lightweight network called MF-FusionNet based on multimodal fusion of RGB images and vegetation indices (NDVI and EVI). A multimodal dataset covering various drought levels in winter wheat was constructed. To enable deep fusion of modalities, a Lightweight Multimodal Fusion Block (LMFB) was designed, and a Dual-Coordinate Attention Feature Extraction module (DCAFE) was introduced to enhance semantic feature representation and improve drought region identification. To address differences in scale and semantics across network layers, a Cross-Stage Feature Fusion Strategy (CFFS) was proposed to integrate multi-level features and enhance overall performance. The effectiveness of each module was validated through ablation experiments. Compared with traditional single-modal methods, MF-FusionNet improved accuracy, recall, and F1-score by 1.35%, 1.43%, and 1.29%, respectively, reaching 96.71%, 96.71%, and 96.64%. This study provides a basis for real-time monitoring and precise irrigation management under winter wheat drought stress.

1. Introduction

Global climate change has intensified both the frequency and severity of drought events [1], significantly affecting crop growth and agricultural productivity. Enhancing irrigation efficiency has thus become a critical strategy for ensuring sustainable agricultural production [2]. As a major staple crop, winter wheat (Triticum aestivum L.) exhibits high sensitivity to water availability, particularly during key growth stages, such as stem elongation and grain filling [3]. China is among the countries most severely impacted by meteorological disasters, with frequent, intense, and diverse types of agricultural hazards contributing substantially to disaster-related yield reductions [4]. The main winter-wheat-producing regions in China are located in arid and semi-arid zones, where water scarcity and intensifying drought conditions pose serious limitations on production potential [5]. Therefore, timely acquisition of drought information and accurate identification of drought stress during critical phenological stages are essential for making informed irrigation decisions, mitigating drought progression, and safeguarding food security.
With advancements in remote sensing and artificial intelligence technologies, image-based crop drought monitoring has become increasingly diversified and intelligent. High-throughput imaging and advanced phenotyping techniques have significantly improved our ability to detect crop responses to drought stress with high precision [6]. Numerous studies have employed sophisticated imaging techniques, such as chlorophyll fluorescence [7], thermal infrared sensing [8], visible and near-infrared (VNIR) imaging [9], and hyperspectral sensors [10]. When combined with in situ phenotypic data, spectral vegetation indices provide quantitative assessments of drought severity, offering comprehensive insights into plant responses. Commonly used indices include the Normalized Difference Vegetation Index (NDVI), Water Shortage Index (WSII), Relative Leaf Water Content (RLWC), Equivalent Water Thickness (EWT), Water Stress Index (WSI), and Water Band Index (WBI) [11], each corresponding to specific metabolic responses sensitive to particular wavelengths. However, these approaches rely on complex environmental sensing equipment, have limited spatial coverage, and are highly sensitive to external factors, such as weather conditions, making them less suitable for large-scale, real-time monitoring.
Recent advances in high-resolution imaging and computer vision have shifted research attention toward drought stress recognition based on crop phenotypic imagery. Under drought stress, crops often exhibit visible symptoms, such as leaf yellowing, curling, and wilting—visual cues that serve as important input features for image-based recognition algorithms [12]. Traditional machine learning approaches, such as Support Vector Machines (SVMs), K-Nearest Neighbors (KNNs), and Back Propagation Neural Networks, have been widely used to extract color, texture, and shape features from plant images for drought classification [13]. Nevertheless, these methods are heavily dependent on handcrafted feature engineering, which limits their adaptability to image variations under complex field conditions and reduces generalization capacity. In recent years, deep learning has demonstrated superior capabilities in feature extraction and representation for agricultural remote sensing image analysis. Classic architectures, such as ResNet, DenseNet, and EfficientNet, have been successfully applied to crop drought monitoring tasks. For example, Shiya Gao et al. [14] proposed a phenotype-driven deep learning model, SAM-ResNet50, for automated high-throughput drought stress monitoring in birch seedlings, achieving a classification accuracy of 99.6%, which significantly outperformed traditional machine learning methods and provided technical support for drought-resilient birch breeding. Jiangyong An et al. [15] developed a field-based maize drought recognition and classification method using Deep Convolutional Neural Networks (DCNNs), which achieved superior performance on the overall dataset with drought detection and classification accuracies of 98.14% and 95.95%, respectively, compared to traditional methods, such as Gradient Boosting Decision Trees (GBDTs). Pooja Goyal et al. [16] designed a customized deep-learning-based CNN model for maize drought stress recognition, achieving a test set accuracy of 98.53%. Notably, their model maintained high performance while significantly reducing parameter size (to 650,000), making it suitable for deployment on resource-constrained devices and enabling real-time precision irrigation.
Despite the promising progress of deep learning, most existing studies still rely on single-modality image inputs and thus fail to fully capture both physiological and spectral traits of crops, which limits the accuracy and robustness of drought monitoring. In response, multimodal feature fusion has emerged as a promising solution. By integrating RGB imagery with hyperspectral or environmental sensor data, this approach provides a more comprehensive depiction of crop status from multiple dimensions, substantially enhancing both feature representation and recognition performance. For instance, Dongzi Yang et al. [17] fused RGB and hyperspectral images to develop a multimodal network with soft attention and bilinear fusion mechanisms for citrus disease identification, achieving an accuracy of 97.89%. LI Shanjun et al. [18] developed a non-destructive early pest detection method based on the fusion of X-ray and RGB images. Yao J et al. [19] proposed a multimodal deep learning model (S-DNet) that integrates image features and meteorological data to achieve high-accuracy monitoring of drought stress in winter wheat, with an average classification accuracy of 96.4%. However, despite these advancements in disease and pest detection, research on multimodal approaches for winter wheat drought stress remains limited. Existing studies often rely on single-modality data or simple fusion strategies, lacking systematic and in-depth integration of multisource information. Moreover, while meteorological data offer essential environmental context, their low spatial resolution prevents accurate characterization of within-field microenvironmental variation, making them insufficient for capturing the dynamic patterns of crop water stress. Given that numerical vegetation indices (e.g., NDVI and EVI) sensitively reflect physiological changes in crops, integrating these indices with RGB imagery in a multimodal data fusion framework may overcome the limitations of unimodal methods. Therefore, developing drought stress recognition approaches for winter wheat based on multimodal fusion of RGB images and vegetation indices holds both theoretical significance and practical value.
Based on this, the present study addresses the limitations of existing research by conducting the following investigations:
(1)
We propose a lightweight multimodal feature fusion network, MF-FusionNet, which effectively integrates visual information from images and non-visual numerical data. The network strikes a balance between computational efficiency and model compactness, enabling accurate identification and graded classification of drought stress in winter wheat.
(2)
We design a Lightweight Multimodal Fusion Block (LMFB) to achieve deep fusion between RGB images and numerical vegetation indices. This module adaptively enhances key channel features relevant to winter wheat drought stress, while effectively suppressing environmental noise and other interfering factors, thereby improving the discriminative capacity of the fused features.
(3)
We introduce a Cross-Stage Feature Fusion Strategy (CFFS) that combines channel alignment and layer-wise integration to effectively incorporate multi-scale spatial information. This allows for the collaborative representation of localized drought symptoms and overall canopy-level characteristics. Meanwhile, we embed a Dual-Coordinate Attention Feature Extraction module (DCAFE), which leverages multi-path pooling and coordinate attention mechanisms to enhance the encoding of directional and spatial positional information, thus improving the model’s sensitivity to leaf texture and drought-critical regions in winter wheat.
(4)
The recognition results produced by the model are mapped back to the spatial layout of the field to generate a visual drought severity map, which intuitively displays the drought stress distribution across different winter wheat regions and supports precision agriculture management and decision-making.

2. Materials and Methods

2.1. Study Area

The field experiments were conducted at the drip-irrigated winter wheat experimental station of Huaxing Farm, located in Daxiqu Town, Changji City, Xinjiang Uygur Autonomous Region, China (87°29′ E, 44°22′ N). The site experiences abundant sunshine, with an annual total of approximately 2700 h of sunlight. The thermal conditions are also favorable, with an accumulated temperature (≥10 °C) of about 3450 °C per year. The annual mean temperature is 6.8 °C, and the average temperature in July is 24.5 °C. The region receives an average annual precipitation of 190 mm and is characterized by a typical arid inland climate. A map of the study area and ground-level photographs of the winter wheat plots are shown in Figure 1. The experimental field measured 40 m × 28 m, and the winter wheat cultivar used was Xin Dong 22. It was sown on 15 October 2024 and harvested on 11 July 2025.

2.2. Experimental Design

To systematically investigate the phenotypic responses of winter wheat (Triticum aestivum L.) under varying drought stress conditions, a completely randomized block design was implemented in the experimental field. The soil type in the experimental field is light sandy loam soil. The site was divided into 15 independent plots, incorporating 5 levels of water treatments, each replicated 3 times to enhance data representativeness and the reliability of statistical analysis, as illustrated in Figure 2. During the early growth stages (from emergence to the beginning of stem elongation; BBCH 10-29 [20]), all experimental plots were uniformly managed under optimal irrigation conditions to ensure consistent crop establishment. Differential drought treatments were initiated from the jointing stage (BBCH 30 [20]) to simulate drought stress during critical phenological periods under field conditions.
The five water treatments were defined as follows: suitable moisture (WW1), mild drought (WW2), moderate drought (WW3), severe drought (WW4), and extreme drought (WW5), each with three replicates (e.g., WW1-1 to WW1-3). Soil moisture levels were regulated by controlled irrigation based on predefined thresholds of volumetric soil water content. The field capacity of the experimental soil was determined to be 22%. Accordingly, the soil moisture thresholds for each treatment were established as follows: In WW1 (suitable moisture), soil water content was maintained between 18.7% and 22.0%, corresponding to 85% to 100% of field capacity. In WW2 (mild drought), soil water content was maintained between 15.4% and 18.7%, corresponding to 70% to 85% of field capacity. In WW3 (moderate drought), soil water content was maintained between 12.1% and 15.4%, equivalent to 55% to 70% of field capacity. In WW4 (severe drought), soil water content was maintained between 8.8% and 12.1%, representing 40% to 55% of field capacity. In WW5 (extreme drought), soil water content was maintained below 8.8%, equivalent to less than 40% of field capacity. These thresholds were used to guide irrigation timing and amounts, ensuring consistent soil moisture conditions across treatments throughout the drought period. Soil moisture content was monitored periodically during the drought treatment phase. In each plot, three sampling points were randomly selected, avoiding plot edges and areas near drip irrigation lines to reduce spatial heterogeneity. The volumetric soil water content was measured using a calibrated soil moisture sensor, and the average value of the three readings was used to represent the soil moisture status of each plot. This approach ensured reliable and representative assessment of water availability for each treatment level. Throughout the experiment, the same winter wheat cultivar, sowing method, and field management practices were uniformly applied across all plots to isolate the effect of water availability as the primary variable influencing plant growth and drought response.
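As a minimal illustration of how the treatment thresholds above follow from the measured field capacity, the short script below recomputes the volumetric water content bands from the 22% field capacity value; the script and names are illustrative only, not part of the study's methodology.

```python
# Minimal sketch (illustrative, not from the study): deriving volumetric soil water
# content thresholds for each irrigation treatment from the measured field capacity.
FIELD_CAPACITY = 22.0  # volumetric soil water content at field capacity (%)

# Fractions of field capacity defining each treatment band (lower, upper)
TREATMENT_BANDS = {
    "WW1 (suitable moisture)": (0.85, 1.00),
    "WW2 (mild drought)":      (0.70, 0.85),
    "WW3 (moderate drought)":  (0.55, 0.70),
    "WW4 (severe drought)":    (0.40, 0.55),
    "WW5 (extreme drought)":   (0.00, 0.40),
}

for name, (lo, hi) in TREATMENT_BANDS.items():
    # e.g. WW1: 0.85 * 22.0 = 18.7 %, 1.00 * 22.0 = 22.0 %
    print(f"{name}: {lo * FIELD_CAPACITY:.1f}% to {hi * FIELD_CAPACITY:.1f}% volumetric water content")
```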

2.3. Data Acquisition

In this study, both visible (RGB) and multispectral images were synchronously acquired using an unmanned aerial vehicle (UAV) platform to enable the integrated extraction and analysis of multimodal remote sensing information for winter wheat. This provided a comprehensive dataset to support subsequent drought response monitoring and modeling efforts. Data collection was conducted between May and June 2025, covering key phenological stages ranging from non-stress conditions to varying levels of drought stress. The data acquisition was performed using the DJI Mavic 3 Multispectral UAV (DJI Innovations, Shenzhen, China), as shown in Figure 3. Image capture was scheduled between 12:00 p.m. and 2:00 p.m., during which the solar elevation angle consistently exceeded 30°.
Detailed specifications of the UAV platform and sensors are provided in Table 1. The UAV is equipped with a light intensity sensor and a real-time kinematic (RTK) positioning module. The light sensor was used to record ambient illumination changes, aiding in the radiometric correction of multispectral images, while the RTK module enhanced spatial positioning accuracy. Flight missions were executed according to pre-programmed flight paths. The UAV operated at a flight altitude of 12 m and a speed of 1 m per second, with a side overlap of 70% and a forward overlap of 80%, ensuring sufficient image redundancy for subsequent data processing, including orthomosaic generation. All flight parameters were kept constant throughout the experimental period to ensure data consistency and comparability.

2.4. Data Processing and Construction

2.4.1. UAV Image Processing and Construction

In this study, preprocessing of UAV-acquired visible (RGB) and multispectral images was conducted using DJI Terra 3.7.0 (DJI Innovations, Shenzhen, China) and QGIS 3.40.6 (QGIS.org Association, Berne, Switzerland; open-source software). The overall processing workflow is illustrated in Figure 4. The images collected by the UAV were first imported into DJI Terra for stitching and generation of raw multispectral imagery. Subsequently, radiometric calibration was performed using three standard gray calibration panels with known reflectance values (25%, 50%, and 75%) to mitigate the effects of illumination variability, resulting in reflectance images standardized under uniform lighting conditions. Based on the calibrated results, orthomosaics for each spectral band were generated and then layer-fused according to central wavelengths (G < R < RE < NIR) to construct a standardized multispectral image set. To align UAV imagery with the ground-based experimental design, spatial vector segmentation was carried out in QGIS. Specifically, the orthomosaics were segmented according to the 15 independently managed field plots, and corresponding image subsets were extracted to ensure that subsequent analyses could be accurately linked to specific water treatment gradients and plot identifiers. Given that deep learning models typically require fixed-size input and benefit from a large number of training samples to enhance generalization, a sliding-window cropping operation was applied to each plot’s imagery. The original images were divided into uniform patches of 224 × 224 pixels, which both preserved critical image details and significantly increased the sample size for model training. This approach not only standardized image scale but also enhanced the model’s ability to perceive spatial heterogeneity and localized phenotypic variations in the field, thereby providing more comprehensive feature information for drought stress identification.
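A minimal sketch of the sliding-window cropping step is given below. The 224 × 224 patch size follows the text; the non-overlapping stride and array shapes are assumptions for illustration, since the stride used in the study is not stated.

```python
# Minimal sketch (assumed workflow): sliding-window cropping of a plot orthomosaic
# into fixed-size 224 x 224 patches for model input.
import numpy as np

def sliding_window_patches(image: np.ndarray, patch_size: int = 224, stride: int = 224):
    """Yield patches of shape (patch_size, patch_size, channels) from a plot image."""
    h, w = image.shape[:2]
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            yield image[top:top + patch_size, left:left + patch_size]

# Example with a hypothetical plot orthomosaic of 1120 x 896 pixels and 3 bands
plot_image = np.zeros((1120, 896, 3), dtype=np.uint8)
patches = list(sliding_window_patches(plot_image))
print(len(patches))  # 5 * 4 = 20 patches of 224 x 224
```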

2.4.2. Vegetation Index Feature Processing and Construction

To enable the effective integration of RGB imagery with spectral index information, vegetation indices were calculated at the image patch level using the corresponding multispectral data. The mean value of each index within a patch was extracted as a structured numerical feature and subsequently input into the multimodal classification model. Compared to directly incorporating vegetation indices as additional two-dimensional image channels, the use of regionally averaged numerical features reduces the dimensionality and complexity of model inputs, avoids redundant computation caused by channel stacking, and improves training efficiency, which facilitates lightweight deployment. As a form of robust statistical representation, the regional average effectively captures the overall spectral response of crops and demonstrates strong sensitivity and discriminative power in detecting drought-induced spectral changes.
For each patch, vegetation indices were computed pixel-by-pixel based on the G, R, RE, and NIR bands from the multispectral images spatially aligned with the RGB imagery. The mean value within the patch was used as the final numerical feature for model input.
Let the pixel-level value of a vegetation index be denoted as $VI(x, y)$, where $(x, y)$ represents the pixel coordinates within a given patch of size $W \times H$. The mean vegetation index value $\overline{VI}$ for the patch is then calculated as follows:
$$\overline{VI} = \frac{1}{W \times H} \sum_{x=1}^{W} \sum_{y=1}^{H} VI(x, y)$$
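The sketch below illustrates this patch-level averaging for one index (NDVI, using the standard (NIR − R)/(NIR + R) definition); the band arrays and the small epsilon guard are assumptions for illustration, not the study's code.

```python
# Minimal sketch: pixel-wise NDVI followed by the regional mean used as the
# structured numerical feature for one patch (band arrays are assumed to be
# reflectance values spatially aligned with the RGB patch).
import numpy as np

def ndvi(nir: np.ndarray, red: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Pixel-level NDVI = (NIR - R) / (NIR + R)."""
    return (nir - red) / (nir + red + eps)

def patch_mean_index(index_map: np.ndarray) -> float:
    """Mean vegetation index over a W x H patch (equation above)."""
    return float(index_map.mean())

# Hypothetical 224 x 224 reflectance patches
nir = np.random.rand(224, 224)
red = np.random.rand(224, 224)
ndvi_mean = patch_mean_index(ndvi(nir, red))
```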
Vegetation indices provide critical remote sensing indicators that characterize key physiological parameters, such as plant water content, chlorophyll concentration, and canopy structure. These parameters are closely related to drought stress responses. A total of ten representative vegetation indices and their respective computational formulas are summarized in Table 2.

2.4.3. Multimodal Data Processing and Construction

In this study, the processed vegetation index features were integrated with their corresponding RGB image patches to construct multimodal samples. Each sample consisted of one RGB image and its associated vegetation index features, jointly reflecting the phenotypic status of winter wheat and its drought stress condition within the region. This fusion enriched the representational capacity of the dataset across multiple dimensions. Based on the water treatment gradient defined in the winter wheat experimental design, each sample was labeled with a drought stress level. To ensure effective training and generalization of the model, the dataset was rigorously partitioned along both spatial and temporal dimensions. Samples from the same field plot or collection period were not allowed to span across the training and validation sets, thus preventing data leakage and ensuring the reliability and accuracy of model evaluation results. In addition, to enhance the diversity of training samples and improve the robustness of the model, various data augmentation techniques were applied to the RGB image patches, including random rotation, horizontal flipping, and scaling. Meanwhile, the vegetation index features were kept unchanged to maintain the consistency and stability of the numerical input. The distribution of training and validation samples in the winter wheat drought classification dataset is detailed in Table 3.
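The following sketch shows one way such a multimodal sample could be organized so that augmentation touches only the RGB patch while the vegetation index vector stays fixed, as described above. The class name, field names, and the use of RandomResizedCrop to approximate "scaling" are assumptions, not the authors' implementation.

```python
# Minimal sketch (hypothetical class and field names): each sample pairs an RGB
# patch with its fixed vegetation-index vector; augmentation is applied to the
# image only, keeping the numerical input unchanged.
import torch
from torch.utils.data import Dataset
from torchvision import transforms

class WheatDroughtDataset(Dataset):
    def __init__(self, rgb_patches, vi_features, labels, train=True):
        self.rgb_patches = rgb_patches          # list of PIL images (224 x 224)
        self.vi_features = vi_features          # (N, 10) array of patch-mean indices
        self.labels = labels                    # drought level per sample (0-4)
        aug = [transforms.RandomRotation(15),
               transforms.RandomHorizontalFlip(),
               transforms.RandomResizedCrop(224, scale=(0.8, 1.0))] if train else []
        self.transform = transforms.Compose(aug + [transforms.ToTensor()])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        image = self.transform(self.rgb_patches[idx])                  # augmented RGB
        vi = torch.tensor(self.vi_features[idx], dtype=torch.float32)  # unchanged indices
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return image, vi, label
```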

2.5. MF-FusionNet Network Architecture

Under drought stress, winter wheat exhibits a series of subtle structural and physiological responses. For example, leaves may show slight curling or wrinkling on the surface, accompanied by physiological changes, such as reduced chlorophyll content, decreased stomatal conductance, lower leaf water potential, and diminished photosynthetic rate. These responses are often inconspicuous, making single-modality remote sensing data (e.g., visible imagery) insufficiently sensitive and robust for accurate drought stress identification. Although multimodal remote sensing offers advantages in improving monitoring accuracy, its practical application on large scales remains limited due to the high complexity of models and the immaturity of fusion strategies, which can lead to information redundancy and feature conflicts. To address these issues, this study proposes a lightweight multimodal feature fusion network, MF-FusionNet, designed for efficient identification and graded classification of drought stress in winter wheat. The overall architecture of the network is illustrated in Figure 5. The network consists of two main information extraction branches. The Fusion-StarNet branch is responsible for extracting salient visual features from RGB images, including spatial and textural patterns. In parallel, the vegetation index branch calculates multiple spectral vegetation indices (e.g., NDVI and EVI) based on multispectral data to characterize the physiological status of the crop. These two streams of modality-specific information are then effectively integrated through a feature fusion module, thereby enhancing the accuracy and robustness of drought stress recognition.

2.5.1. Fusion-StarNet Image Feature Extraction Network

Under drought stress, winter wheat typically exhibits phenotypic characteristics, such as leaf de-greening, tip burn, curling, yellowing, and canopy thinning. These drought-induced features are marked by strong spatial heterogeneity, diverse scales, and blurred boundaries, and are highly affected by natural factors, including illumination conditions, soil background, and crop density. As a result, traditional image analysis methods face significant challenges in extracting such fine-grained visual features. For instance, models often struggle to accurately localize drought-affected regions and distinguish between varying drought severity levels, thereby limiting both robustness and precision in drought stress identification. To address these limitations, we propose a lightweight image backbone network named Fusion-StarNet, which enhances the extraction quality of drought-related phenotypic features while maintaining low model complexity, which makes it suitable for deployment on edge devices, such as UAVs. Fusion-StarNet is built upon the StarNet architecture [30]. To further improve the model’s sensitivity to drought regions and adaptability to multi-scale variations, we introduce a novel feature extraction module called the Superstar Block, which significantly enhances the model’s focus on typical drought traits. Additionally, considering the scale-specific variation in drought features, we incorporate a Cross-Stage Feature Fusion Strategy (CFFS) to integrate semantic features between stages 2 and 3, and stages 3 and 4 through channel alignment and spatial resampling. This allows for the collaborative expression of high- and low-level features, thereby improving the model’s ability to distinguish among varying degrees of drought stress. To strengthen global feature perception, we also embed the Dual-Coordinate Attention Feature Extraction module (DCAFE) at the terminal stage of Fusion-StarNet. The overall network architecture (Figure 5) adopts a multistage cascaded structure, where each stage consists of convolutional downsampling and stacked lightweight Superstar Blocks. Cross-stage fusion branches are introduced between key stages. The final output preserves rich discriminative semantics along with strong spatial expressiveness, providing high-quality input for multimodal drought level classification.
(1)
Superstar Block Feature Extraction Module
Drought-related phenotypic features in winter wheat often appear as sparse, weakly structured patterns in complex field environments, particularly in remote sensing imagery. These are easily affected by shadows, soil reflectance, and varying lighting conditions, which can obscure salient regions and hinder accurate feature localization and extraction. To address these challenges, we propose the Superstar Block, a novel feature extraction module that integrates a Star Operation-based Star Block with a Dual-Coordinate Attention Feature Extraction module (DCAFE). This structure enhances both representational richness and salient region focusing.
In a single-layer neural network, the Star Operation is typically denoted as $(W_1^T X + B_1) * (W_2^T X + B_2)$, representing a fusion of two linear transformation features via elementwise multiplication [30]. In matrix form, it is written as $(W_1^T X) * (W_2^T X)$, where $W = [W, B]^T$ and $X = [X, 1]^T$ absorb the bias terms. We define $\omega_1, \omega_2, X \in \mathbb{R}^{(d+1) \times 1}$, where $d$ is the number of input channels, making it readily extensible to multiple output channels $W_1, W_2 \in \mathbb{R}^{(d+1) \times (d+1)}$ and multiple feature elements, where $X \in \mathbb{R}^{(d+1) \times n}$. The Star Operation can thus be rewritten as:
$$\omega_1^T x * \omega_2^T x = \left( \sum_{i=1}^{d+1} \omega_1^i x^i \right) * \left( \sum_{j=1}^{d+1} \omega_2^j x^j \right) = \sum_{i=1}^{d+1} \sum_{j=1}^{d+1} \omega_1^i \omega_2^j x^i x^j = \alpha_{(1,1)} x^1 x^1 + \cdots + \alpha_{(4,5)} x^4 x^5 + \cdots + \alpha_{(d+1, d+1)} x^{d+1} x^{d+1}$$
Here, $i, j$ index the channels, and $\alpha$ is the coefficient of each term:
$$\alpha_{(i,j)} = \begin{cases} \omega_1^i \omega_2^j & \text{if } i = j, \\ \omega_1^i \omega_2^j + \omega_1^j \omega_2^i & \text{if } i \neq j. \end{cases}$$
After rewriting the Star Operation from Equation (2), it is extended to a composition of $\frac{(d+2)(d+1)}{2}$ distinct terms, as shown in Equation (5). Notably, apart from the $\alpha_{(d+1,:)} x^{d+1} x$ term, each remaining component (involving $x$) exhibits a nonlinear relationship, indicating that they represent independent and implicit dimensions. Therefore, although the computationally efficient Star Operation is executed in $d$-dimensional space, it can achieve representation in a higher implicit feature space of roughly $\frac{(d+2)(d+1)}{2} \approx \left( \frac{d}{\sqrt{2}} \right)^2$ dimensions, given that $d$ is much greater than 2. This significantly expands feature dimensionality without introducing any additional computational burden per layer.
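A minimal sketch of the star operation in PyTorch is shown below: two parallel linear (1 × 1 convolution) branches are fused by elementwise multiplication. This is an interpretation of the idea for illustration; the channel expansion factor, activation placement, and module names are assumptions rather than the exact Superstar Block design.

```python
# Minimal sketch of a star operation: element-wise product of two linear branches,
# which implicitly spans a much higher-dimensional feature space (see text above).
import torch
import torch.nn as nn

class StarOperation(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.f1 = nn.Conv2d(dim, dim * expansion, kernel_size=1)  # W1^T X + B1
        self.f2 = nn.Conv2d(dim, dim * expansion, kernel_size=1)  # W2^T X + B2
        self.act = nn.ReLU6()
        self.proj = nn.Conv2d(dim * expansion, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Fuse the two branches by element-wise multiplication, then project back.
        return self.proj(self.act(self.f1(x)) * self.f2(x))

x = torch.randn(1, 16, 56, 56)
print(StarOperation(16)(x).shape)  # torch.Size([1, 16, 56, 56])
```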
To further improve the model’s focus on sparsely distributed and subtle drought phenotypes in winter wheat, the Superstar Block incorporates the Dual-Coordinate Attention Feature Extraction module (DCAFE) [31], as illustrated in Figure 6. This module enhances salient response regions through the parallel fusion of average and max pooling paths, improving directional sensitivity and spatial position encoding. The coordinate attention mechanism models horizontal and vertical information flow separately, enhancing the network’s ability to localize directional textures, such as leaf veins and tip burns. Moreover, multi-branch feature concatenation enhances spatial expressiveness, enabling more accurate targeting of drought-critical regions under complex environmental conditions.
Considering the image feature map $F \in \mathbb{R}^{H \times W \times C}$, the DCAFE module first performs average pooling operations with kernel sizes of $(H, 1)$ and $(W, 1)$, respectively, to extract spatial information along the horizontal and vertical coordinates across channel indices, as defined in Equation (7):
$$f_a^c = \delta \left( \mathrm{Conv} \left( \left[ \mathrm{XAvgPool}(F_h^c); \mathrm{YAvgPool}(F_w^c) \right] \right) \right),$$
where:
$$\mathrm{XAvgPool}(F_h^c) = \frac{1}{W} \sum_{i=0}^{W} F_c(h, i), \qquad \mathrm{YAvgPool}(F_w^c) = \frac{1}{H} \sum_{j=0}^{H} F_c(j, w), \qquad c = 1, 2, 3, \ldots, C$$
$$\delta = \mathrm{ReLU6}(x) = \min(6, \max(x, 0))$$
Here, $\mathrm{Conv}(\cdot)$ denotes a 1 × 1 convolution operation, and $[\cdot;\cdot]$ represents the concatenation of the two pooled feature vectors, which have sizes $H \times 1 \times C$ and $1 \times W \times C$. XAvgPool and YAvgPool denote average pooling operations along the height and width axes, respectively. These two direction-sensitive feature vectors aggregate the spatial coordinate information embedded along the X and Y directions. $\delta$ denotes the ReLU6 activation function, which has a constrained activation range and is thus more computationally efficient.
The obtained horizontal and vertical feature vectors are then concatenated to form an attention layer of size $(H + W)$. Assuming a feature map of size 64 × 64 × 96, the horizontal direction yields a 64 × 1 × 96 vector and the vertical a 1 × 64 × 96 vector. Concatenating the reshaped horizontal and vertical vectors along the spatial dimension (64 + 64 = 128) results in an attention feature map of size 1 × 128 × 96. After applying a 1 × 1 convolution operation, the number of channels is reduced to $C / D_r$, where $D_r = 32$ denotes the downsampling ratio, as shown in Equation (9):
$$C_{out} = \max \left( 8, \frac{C_{in}}{D_r} \right)$$
The intermediate feature vector $f_a \in \mathbb{R}^{(H+W) \times 1 \times C/D_r}$ is then split along the spatial dimension into two separate tensors, $f_h^a \in \mathbb{R}^{H \times 1 \times C/D_r}$ and $f_w^a \in \mathbb{R}^{1 \times W \times C/D_r}$. Next, two independent 1 × 1 convolution operations are applied to the horizontal and vertical coordinates to restore the original number of channels $C$, followed by a sigmoid activation function, as described in Equation (10):
$$s_h^a = \sigma \left( \mathrm{Conv}(f_h^a) \right) \qquad \text{and} \qquad s_w^a = \sigma \left( \mathrm{Conv}(f_w^a) \right)$$
Here, $s_h^a \in \mathbb{R}^{H \times 1 \times C}$ and $s_w^a \in \mathbb{R}^{1 \times W \times C}$ represent the attention weight matrices, and $\sigma$ denotes the sigmoid activation function. By elementwise multiplying these attention weights with the original feature map $F$, the final refined feature representation $Y$ is obtained, as formulated in Equation (11):
$$Y_a(i, j) = F(i, j) \times s_h^a(i) \times s_w^a(j)$$
Similarly, the DCAFE module also performs the coordinate attention (CA) [32] process using max pooling instead of average pooling, as shown in Equations (12)–(14). The final output is obtained by concatenating the attention-enhanced feature maps generated through both average pooling ($Y_a(i, j)$) and max pooling ($Y_m(i, j)$), as presented in Equation (15):
$$f_m^c = \delta \left( \mathrm{Conv} \left( \left[ \mathrm{XMaxPool}(F_h^c); \mathrm{YMaxPool}(F_w^c) \right] \right) \right)$$
$$s_h^m = \sigma \left( \mathrm{Conv}(f_h^m) \right) \qquad \text{and} \qquad s_w^m = \sigma \left( \mathrm{Conv}(f_w^m) \right)$$
$$Y_m(i, j) = F(i, j) \times s_h^m(i) \times s_w^m(j)$$
$$Y(i, j) = \mathrm{Concat} \left( Y_m(i, j), Y_a(i, j) \right)$$
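A minimal PyTorch sketch of this dual-path coordinate attention is given below. It is an approximation of the equations above, not the authors' exact implementation: the final 1 × 1 convolution that merges the two concatenated paths back to C channels is an assumption, and module names are illustrative.

```python
# Minimal sketch of DCAFE: coordinate attention computed once with average pooling
# and once with max pooling, then the two refined maps are concatenated and merged.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 32, pool: str = "avg"):
        super().__init__()
        Pool = nn.AdaptiveAvgPool2d if pool == "avg" else nn.AdaptiveMaxPool2d
        self.pool_h = Pool((None, 1))   # (B, C, H, 1): aggregate along width
        self.pool_w = Pool((1, None))   # (B, C, 1, W): aggregate along height
        mid = max(8, channels // reduction)          # C_out = max(8, C_in / Dr)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.ReLU6()
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        xh = self.pool_h(x)                          # (B, C, H, 1)
        xw = self.pool_w(x).permute(0, 1, 3, 2)      # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([xh, xw], dim=2)))   # (B, mid, H+W, 1)
        yh, yw = torch.split(y, [h, w], dim=2)
        s_h = torch.sigmoid(self.conv_h(yh))                      # (B, C, H, 1)
        s_w = torch.sigmoid(self.conv_w(yw.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * s_h * s_w                          # Y(i, j) = F(i, j) * s_h(i) * s_w(j)

class DCAFE(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.ca_avg = CoordAttention(channels, pool="avg")
        self.ca_max = CoordAttention(channels, pool="max")
        self.fuse = nn.Conv2d(channels * 2, channels, 1)  # merge the concatenated paths

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([self.ca_avg(x), self.ca_max(x)], dim=1))
```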
(2)
Cross-Stage Feature Fusion Strategy (CFFS)
In the context of drought classification for winter wheat, features extracted from different layers of a neural network vary significantly in both scale and semantic abstraction. For example, localized traits, such as leaf tip necrosis, represent fine-grained, low-level features, while global characteristics like sparse canopies or generalized yellowing reflect higher-level semantic information. To address this challenge, we propose a Cross-Stage Feature Fusion Strategy (CFFS) that aligns and incrementally integrates features from different network stages, as illustrated in Figure 7. This strategy enables effective fusion of multi-scale spatial information, achieving joint expression of detailed and holistic semantic features. The proposed CFFS effectively bridges the information bottleneck between shallow texture features and deep semantic representations. It enhances the model’s ability to accurately capture local drought indicators, such as leaf texture and yellowing edges, while preserving macro-level attributes like canopy structure and color dynamics.
The implementation is as follows: Let the output feature map from the current stage i be denoted as x, and the output from the previous stage as prev_feats. If the current stage is included in a predefined set of fusion stages, we first apply bilinear interpolation (F.interpolate) to resample prev_feats so that its spatial resolution matches that of x. A 1 × 1 convolution is then applied to align the number of channels, ensuring consistency in both spatial and channel dimensions.
After alignment, the resampled and channel-matched features from the previous stage are elementwise added to the current stage’s features to complete the fusion. Finally, a 3 × 3 convolution is applied to refine the fused features and enhance their representational capacity. This process can be formally described as in Algorithm 1.
Algorithm 1: Cross-Stage Feature Fusion Strategy
Input: Input feature map x, fusion stage index set F
Output: Merged feature representation x
prev_feats ← None
for i = 1 to N do
    x ← stage_i(x)
    if prev_feats ≠ None and i ∈ F then
        if size(prev_feats) ≠ size(x) then
            prev_feats ← Interpolate(prev_feats, target_size = size(x))   // interpolation adjustment
        end if
        prev_feats_aligned ← Conv1×1(prev_feats)   // channel alignment
        x ← x + prev_feats_aligned   // Feature fusion
        x ← fuse_convs(x)         //3 × 3Conv
    end if
    prev_feats ← x
end for
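The sketch below expresses the fusion step of Algorithm 1 in PyTorch, matching the bilinear F.interpolate, 1 × 1 channel-alignment convolution, elementwise addition, and 3 × 3 refinement described above. The function and argument names are hypothetical; the convolutions are passed in because their channel counts depend on the stages being fused.

```python
# Minimal sketch of one CFFS fusion step (interpretation of Algorithm 1).
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_stage_fuse(x: torch.Tensor, prev_feats: torch.Tensor,
                     align_conv: nn.Conv2d, fuse_conv: nn.Conv2d) -> torch.Tensor:
    if prev_feats.shape[-2:] != x.shape[-2:]:
        # Bilinear resampling so the spatial resolutions match
        prev_feats = F.interpolate(prev_feats, size=x.shape[-2:],
                                   mode="bilinear", align_corners=False)
    prev_aligned = align_conv(prev_feats)   # 1 x 1 conv: channel alignment
    x = x + prev_aligned                    # element-wise feature fusion
    return fuse_conv(x)                     # 3 x 3 conv: refine the fused features
```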

2.5.2. Vegetation Index Feature Extraction

Under drought stress, winter wheat often exhibits physiological changes, such as reduced chlorophyll content and water deficiency. Although these changes may not be visually apparent in RGB images, they can be effectively captured through vegetation indices that reflect spectral responses. For instance, indices such as NDVI and NDRE are sensitive to chlorophyll content, while NDWI is indicative of water status. Therefore, vegetation indices provide essential physiological information for drought identification.
To fully leverage this spectral-physiological information, we designed a vegetation index feature extraction method. Specifically, for each image patch, ten representative vegetation indices were computed from the corresponding multispectral data: NDVI, NDRE, NDWI, GVI, SAVI, EVI, GNDAVI, OSAVI, TVI, and NDCI. The regional average of each index was calculated to construct a structured spectral feature vector, serving as a compact and informative representation of the crop’s physiological state.

2.5.3. Lightweight Multimodal Fusion Block (LMFB)

To enable deep integration of RGB image features and vegetation index features for winter wheat drought monitoring, we propose a Lightweight Multimodal Fusion Block (LMFB). As illustrated in Figure 8, the module first projects both the RGB features and the vegetation index features into a unified feature space via linear transformations.
Next, a Lightweight Cross-Attention (LCA) mechanism [33] is introduced to model cross-modal interactions. By applying unidirectional attention to the mapped feature tensors, the model adaptively emphasizes salient channel features that are strongly associated with drought stress, while simultaneously suppressing environmental noise sources, such as soil reflectance and shadow variation. This enhances the discriminative power of the fused representation.
The resulting multimodal feature vector is then fed into a set of lightweight classifiers, which incorporate layer normalization, GELU activation, and dropout regularization to improve generalization. The final output is the predicted drought severity level for winter wheat.
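A minimal sketch of the LMFB pipeline described above is shown below: both modalities are linearly projected into a shared space, the vegetation index features attend to the image features through a unidirectional cross-attention step, and a lightweight head with layer normalization, GELU, and dropout produces the drought level. The embedding size, head count, and the use of nn.MultiheadAttention as a stand-in for the Lightweight Cross-Attention of [33] are assumptions for illustration.

```python
# Minimal sketch of the Lightweight Multimodal Fusion Block (hypothetical dimensions).
import torch
import torch.nn as nn

class LMFB(nn.Module):
    def __init__(self, img_dim: int, vi_dim: int = 10, embed_dim: int = 128,
                 num_classes: int = 5, dropout: float = 0.2):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # project RGB features
        self.vi_proj = nn.Linear(vi_dim, embed_dim)     # project vegetation indices
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                                batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, img_feat: torch.Tensor, vi_feat: torch.Tensor) -> torch.Tensor:
        q = self.vi_proj(vi_feat).unsqueeze(1)      # (B, 1, D) query from indices
        kv = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, D) key/value from image
        fused, _ = self.cross_attn(q, kv, kv)       # unidirectional cross-attention
        return self.classifier(fused.squeeze(1))    # drought severity logits
```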

2.5.4. Experimental Environment and Evaluation Metrics

The experimental training environment configuration is detailed in Table 4. During model training, all models were trained under identical hyperparameter settings: the batch size was set to 128, the number of epochs to 100, the optimizer used was AdamW, the initial learning rate was 0.001, and the loss function was cross-entropy loss.
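For reference, a minimal training loop reflecting these settings (AdamW, initial learning rate 0.001, cross-entropy loss, batch size 128, 100 epochs) is sketched below; the model and data loader names are placeholders.

```python
# Minimal sketch of the reported training configuration (placeholder model/loader).
import torch
import torch.nn as nn

def train(model, train_loader, device="cuda", epochs=100, lr=1e-3):
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(epochs):
        model.train()
        for images, vi_feats, labels in train_loader:   # batch size 128
            images, vi_feats, labels = (t.to(device) for t in (images, vi_feats, labels))
            optimizer.zero_grad()
            logits = model(images, vi_feats)
            loss = criterion(logits, labels)
            loss.backward()
            optimizer.step()
```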
To evaluate the performance of the classification models, the following metrics were used: accuracy (Acc), recall (R), F1-score (F1), the number of model parameters (Params), inference time (Inference), and computational complexity (GFLOPs). These indicators comprehensively assess the effectiveness, efficiency, and deployability of the network.
Accuracy refers to the proportion of correctly predicted drought category samples out of all samples. A higher value indicates better overall performance of the model:
$$Acc = \frac{TP + TN}{TP + TN + FP + FN}$$
Precision (also referred to as positive predictive value) indicates the probability that a sample predicted to belong to a specific drought category is indeed correct. In this study, a “positive sample” refers to a sample predicted to belong to a particular drought severity class:
$$Precision = \frac{TP}{TP + FP}$$
Recall measures the proportion of actual samples of a certain drought class that are correctly predicted as belonging to that class:
$$R = \frac{TP}{TP + FN}$$
F1-score is a harmonic mean of precision and recall, providing a balanced metric that reflects both the completeness and exactness of the model’s predictions:
$$F1 = \frac{2 \times Precision \times R}{Precision + R}$$
The metrics are based on the following definitions: TP (true positive) is the number of samples correctly predicted as belonging to a specific drought category, FP (false positive) is the number of samples incorrectly predicted as belonging to that drought category, FN (false negative) is the number of samples that actually belong to the drought category but were misclassified into others, and TN (true negative) is the number of samples correctly predicted as belonging to other drought categories.
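The short function below computes these metrics from the confusion counts defined above; the example counts are arbitrary, and macro-averaging across drought classes is assumed when reporting a single value per metric.

```python
# Minimal sketch: the reported metrics computed from per-class confusion counts.
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    acc = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return acc, precision, recall, f1

print(classification_metrics(tp=90, tn=880, fp=10, fn=20))
```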

3. Experiments and Results

3.1. Comparative Experiments on Visual Feature Extraction Networks

To evaluate the effectiveness of the proposed Fusion-StarNet in winter wheat drought stress classification tasks, we conducted a systematic performance comparison with several state-of-the-art lightweight neural networks. The benchmark models included MobileNetV3 [34], MobileNetV4 [35], ShuffleNetV2 [36], GhostNet [37], MobileViTv2 [38], StarNet [30], and a lightweight version of StarNet, StarNet_tiny. All models were trained under identical strategies and evaluated on the same winter wheat drought stress dataset. The experimental results are summarized in Table 5.
As shown in the table, Fusion-StarNet achieved the best overall performance with an accuracy of 95.36%, recall of 95.28%, and F1-score of 95.35%, outperforming all other lightweight networks. Importantly, it maintained a compact model size (1.21 M parameters) and low computational complexity (0.16 GFLOPs), achieving an excellent balance between accuracy and efficiency.
By contrast, although GhostNet_050 recorded the lowest GFLOPs (0.05) and the fastest inference speed, its recognition accuracy was significantly lower. Models such as ShuffleNet_v2 and MobileViTv2 also failed to surpass the performance of the StarNet family. Notably, compared with StarNet_s1, Fusion-StarNet achieved higher classification accuracy while retaining a lightweight structure, further confirming its suitability for winter wheat drought stress monitoring tasks.
To intuitively compare the comprehensive performance of each model across both accuracy and efficiency dimensions, a radar chart was constructed. Performance-related metrics (Acc, R, and F1) were normalized in ascending order, while resource-related metrics (Params, Inference, and GFLOPs) were normalized in descending order. Each polygonal line in the chart represents a model, facilitating visual comparison of performance–efficiency trade-offs.
As illustrated in Figure 9, Fusion-StarNet reached a normalized value of 1.0 for all three key classification metrics, accuracy, recall, and F1-score, indicating its superior recognition capability in drought stress identification. Meanwhile, it also demonstrated moderately strong performance across the three efficiency metrics (Params, Inference, and GFLOPs), resulting in a well-balanced and full radar profile. In contrast, models such as GhostNet_050 and MobileNetv3_small performed well in computational efficiency but showed evident weaknesses in classification accuracy.
These results demonstrate that Fusion-StarNet, while maintaining a lightweight architecture, achieved an excellent trade-off between model performance and computational efficiency, making it particularly well-suited for winter wheat drought stress monitoring applications.

3.2. Comparative Experiments Between DCAFE and Other Attention Mechanisms

To evaluate the effectiveness of the proposed Dual-Coordinate Attention Feature Enhancement (DCAFE) mechanism in winter wheat drought stress recognition tasks, we designed two sets of comparative experiments: (1) a performance comparison between DCAFE and several mainstream lightweight attention mechanisms, including ECA [39], SE [40], CBAM [41], and SimAM [42], and (2) an investigation of the effect of inserting DCAFE at different structural levels of the network on recognition accuracy. All experiments were conducted on the same drought-stress-labeled dataset.
As shown in Table 6, DCAFE achieved the best overall performance in winter wheat drought stress classification, with an accuracy of 93.81% and an F1-score of 93.26%. Despite a slight increase in computational complexity, the model demonstrated significantly improved capability in distinguishing between different drought severity levels.
To further validate the effectiveness of each attention mechanism in extracting drought-related features, we utilized Grad-CAM [43] to visualize the feature response regions that contributed to classification decisions. The comparative analysis of Grad-CAM heatmaps revealed the spatial attention characteristics of each mechanism.
As shown in Figure 10, the SE attention mechanism exhibited dispersed high-response regions, making it difficult to focus on drought-critical leaf areas. This suggests that SE emphasizes inter-channel weight distribution while neglecting spatial modeling, resulting in inadequate attention to salient drought regions. The ECA attention mechanism produced more concentrated responses at the image center, while edge responses remained weak. Although ECA improved channel feature expression via local convolution, it still lacked effective spatial feature extraction. Consequently, it showed some localized activation but limited overall coverage. In contrast, the CBAM attention mechanism yielded a more balanced heatmap, with clearly delineated red regions corresponding to the core drought stress zones, while also preserving edge features. The collaborative operation of its spatial and channel attention modules allowed it to effectively balance global and local feature extraction. The SimAM attention mechanism demonstrated strong overall activation with broad coverage; however, the hotspot areas appeared relatively scattered, indicating a lack of focused attention on key regions. This highlights SimAM’s limitations in spatial localization despite its global response strength. The DCAFE attention mechanism, by comparison, delivered the most ideal visualization performance. The red regions were highly concentrated on the most drought-affected leaf zones, while peripheral features were also well preserved. This indicates that DCAFE possesses superior spatial focusing ability and high sensitivity to domain-specific drought traits. In summary, DCAFE proved to be the most suitable attention mechanism for this task, effectively supporting the discrimination requirements of winter wheat drought stress classification due to its ability to focus on key regions and respond sensitively to stress-related features.
As further confirmed in Table 7, incorporating the DCAFE module at different stages of the network consistently improved the model’s classification performance for drought stress severity. Compared to the baseline model without DCAFE (accuracy: 92.08% and F1-score: 92.42%), all three insertion strategies led to varying degrees of improvement. When DCAFE was inserted only after Stage 4, the model achieved an accuracy of 93.22% and an F1-score of 93.27%. When DCAFE was inserted only after all Star Blocks, the accuracy reached 92.68% and the F1-score 92.78%. The best results were achieved when DCAFE was inserted after both Stage 4 and all Star Blocks, yielding an accuracy of 93.81% and an F1-score of 93.26%. These results demonstrate that the multi-level integration of DCAFE can more effectively enhance the model’s capability to discriminate drought severity in winter wheat, confirming its value as a high-impact attention mechanism in this application.

3.3. Comparative Experiments of CFFS at Different Network Stages

To further enhance the model’s ability to identify drought stress in winter wheat, a Cross-Stage Feature Fusion Strategy (CFFS) was introduced in this study. The goal was to integrate features across different layers of the network and improve the model’s sensitivity to both fine-grained and high-level drought-related features. We designed five different fusion schemes for comparison. The experimental results are summarized in Table 8.
As shown in the results, compared to the baseline model without CFFS (accuracy: 92.08% and F1-score: 92.42%), introducing CFFS at various stages generally improved classification performance, confirming the positive effect of cross-stage fusion on drought stress recognition. Applying CFFS at Stage 2 alone improved the accuracy to 94.01% and F1-score to 94.16%. When CFFS was applied at both Stages 2 and 3, performance further increased slightly, reaching 94.15% accuracy and 94.11% F1-score. These two configurations primarily integrate low- and mid-level texture and structural features, which are beneficial for detecting early morphological signs of drought stress. Introducing CFFS at Stages 3 and 4 resulted in the best overall performance, with an accuracy of 94.38% and an F1-score of 94.49%. This setup combines mid-level representations with higher-level semantic features, providing a stronger basis for distinguishing among drought severity levels with improved class balance and discriminative precision. Although applying CFFS at Stages 2, 3, and 4 further increased feature interaction, it led to a slight performance drop (accuracy: 93.94% and F1-score: 94.07%) and increased computational costs. This suggests that excessive fusion may introduce redundant information, potentially impairing classification accuracy. In conclusion, the CFFS was effective in improving the model’s performance. Among the evaluated configurations, the setup with fusion at Stages 3 and 4 achieved the best balance between accuracy, model stability, and computational efficiency, making it the most suitable default configuration for drought stress classification in winter wheat.

3.4. Ablation Study

To further validate the individual and combined contributions of the key improvements proposed in this study, namely, network modification (NM; a lightweight adjustment to the StarNet_s1 architecture by reducing base_dim from 32 to 16 and modifying depths to [1,1,4,3]), the Dual-Coordinate Attention Feature Enhancement (DCAFE) module, and the Cross-Stage Feature Fusion Strategy (CFFS), a comprehensive ablation study was conducted. All experiments were performed on the same dataset under identical training settings to evaluate the influence of each module on model performance and computational complexity. The results are summarized in Table 9.
The baseline model, which included none of the proposed modules, achieved an accuracy of 92.61%, recall of 92.48%, and F1-score of 92.61%, with 2.68 M parameters, 0.43 GFLOPs, and an inference time of 4.79 ms. The model exhibited relatively high computational complexity. After applying the network modification (NM), the number of parameters was substantially reduced from 2.68 M to 0.98 M, GFLOPs dropped to 0.14, and inference time slightly decreased to 4.62 ms, achieving effective model compression. However, this came at a minor cost to performance: accuracy and F1-score dropped slightly to 92.08% and 92.42%, respectively, representing a reduction of approximately 0.53% and 0.19% compared to the baseline. This suggests that while the NM strategy significantly improves model efficiency, it may marginally impact baseline predictive performance. Next, adding the DCAFE attention mechanism on top of the NM model led to a notable performance gain: accuracy increased to 93.81% (up 1.73%) and F1-score improved to 93.26% (up 0.84%). The increase in parameters and computation was marginal (to 1.01 M and 0.15 GFLOPs, respectively), and inference time increased slightly to 5.11 ms. These results demonstrate that DCAFE effectively boosted the model’s sensitivity to fine-grained drought stress features by emphasizing informative channels and spatial regions. Similarly, incorporating the CFFS module on top of the NM model resulted in even greater improvements: accuracy reached 94.38% (an increase of 2.3%) and F1-score rose to 94.49%. The parameter count increased to 1.18 M, GFLOPs to 0.16, and inference time decreased slightly to 4.72 ms, indicating that CFFS successfully integrated multi-scale, hierarchical features to enhance semantic understanding and feature expression. Most importantly, when NM, DCAFE, and CFFS were combined, the resulting Fusion-StarNet achieved the best overall performance, with an accuracy of 95.36% and an F1-score of 95.35%, improvements of 3.28% and 3.13%, respectively, over the NM-only model. Despite this gain, the model maintained a lightweight profile, with only 1.21 M parameters, 0.16 GFLOPs, and an inference time of 5.26 ms. These results verify the synergistic effect of the three modules, which not only substantially improved recognition accuracy but also preserved computational efficiency, demonstrating the effectiveness and rationality of the Fusion-StarNet design. To further illustrate the performance difference between the baseline model and the full Fusion-StarNet, a training curve of validation accuracy across epochs was plotted, providing a more intuitive comparison of model convergence and generalization behavior.
As shown in Figure 11, Fusion-StarNet’s accuracy was generally higher than that of the baseline, and its curve was smoother and converged faster, indicating that it was more stable during training and had stronger generalization capabilities. In contrast, the baseline model’s accuracy improved slowly and fluctuated significantly in the middle stage, indicating that Fusion-StarNet has obvious advantages in performance and stability.

3.5. Comparative Experiments on Different Fusion Strategies

To further evaluate the multimodal fusion effectiveness of the proposed MF-FusionNet in winter wheat drought stress identification, we designed and compared three representative multimodal fusion strategies: early fusion, intermediate fusion, and late fusion [44], based on the integration of RGB image data and numerical vegetation index information. Additionally, a baseline model using only the image modality (Only Image) was included for reference.
The experimental results are summarized in Table 10. The Only Image model, trained and inferred without any vegetation index input, already achieved strong performance, with 95.36% accuracy and an F1-score of 95.35%. However, it struggled to accurately classify samples near drought severity boundaries, indicating a limitation in capturing physiological cues through visual features alone. In the early fusion strategy, image and vegetation index features were directly concatenated at the input level. This approach resulted in a simple model architecture and fast inference time (5.15 ms). However, since semantic representations between modalities have not yet formed at this stage, the fused information can interfere with each other. Consequently, the model’s accuracy and F1-score dropped to 94.01% and 94.02%, respectively, suggesting that early fusion failed to effectively leverage the complementary nature of the modalities. In contrast, the late fusion strategy employed two independent branches to extract features from the image and vegetation index modalities separately, combining them only at the final decision layer. This avoided early interference between modalities. The model achieved 96.43% accuracy and 96.30% F1-score, outperforming both the Only Image and early fusion strategies. This indicates that independently extracted features, when fused later, can better preserve the integrity of each modality. However, the lack of interaction between semantic layers limits the model’s ability to capture fine-grained differences between drought severity levels. The intermediate fusion strategy performed feature fusion at the mid-level of the network, using attention mechanisms to enhance semantic-level interactions between modalities. This approach achieved the highest performance, with 96.71% accuracy and an F1-score of 96.64%. At the same time, it maintained low computational overhead: only 0.24 M additional parameters, an increase of just 0.03 ms in inference time, and no change in GFLOPs compared to the baseline. These results demonstrate that intermediate fusion strikes an optimal balance between accuracy and efficiency. By enabling effective semantic-level interaction between visual and physiological features, it significantly improved the model’s capacity to identify and discriminate drought stress in winter wheat.

3.6. Comparative Experiments on Intermediate Fusion Methods

To further evaluate the effectiveness of different intermediate fusion strategies within MF-FusionNet, we conducted a comparative experiment on fusion methods. Under the same backbone architecture, four multimodal feature fusion strategies were tested: (I) feature concatenation, (II) elementwise summation, (III) gating mechanism [45], and (IV) the proposed LMFB based on an attention mechanism. The goal was to investigate how different fusion techniques affect the recognition performance for drought stress in winter wheat.
As shown in Table 11, fusion method I yielded the poorest performance, with an accuracy of 93.43% and an F1-score of 93.72%. Although simple in design, this approach directly integrated image features and vegetation index features along the dimension axis, without accounting for semantic differences or complementary relationships between modalities. This often led to redundant or interfering information, making it difficult to capture the fine-grained physiological variations associated with different drought levels, thereby limiting the model’s discriminative capability. In contrast, fusion method II showed improved results (accuracy = 94.76% and F1 = 94.24%), as the elementwise addition compressed feature dimensions and yielded a more compact representation. However, it still failed to dynamically model the relative importance of each modality under varying drought conditions, resulting in a certain “averaging” effect that limited further gains in performance. Fusion method III, which introduces modality-aware information flow control, further enhanced performance. It improved the model’s ability to select relevant features and suppress redundant ones, leading to an accuracy of 95.12% and an F1-score of 95.24%. The best performance was achieved by the proposed LMFB, which facilitated deep interaction between visual and physiological features through lightweight-attention-based fusion. This strategy achieved an accuracy of 96.71% and an F1-score of 96.64%, outperforming methods I, II, and III by 3.28%, 2.40%, and 1.40%, respectively. In summary, the attention-guided Lightweight Multimodal Fusion Block (LMFB) strategy enabled more effective integration of image features and vegetation indices under drought stress, capturing both modality-specific and shared representations, and is particularly well-suited for fine-grained drought severity classification in winter wheat.

3.7. Visualization of Winter Wheat Drought Stress Using Remote Sensing

To intuitively reflect the spatial distribution of drought stress within winter wheat fields, a drought severity visualization map was constructed based on UAV-acquired remote sensing imagery. The results are presented in Figure 12. The key approach involved mapping the model-predicted drought severity levels back to their corresponding geographic positions, generating a spatially explicit heatmap that reflects intra-field drought variability. First, high-resolution RGB imagery captured by UAVs was used as the base layer, preserving original spatial fidelity and visual detail to ensure that the overlaid visualization aligns with the actual field background. Then, the drought severity label of each classified image patch was color-coded and overlaid as a semi-transparent layer on the base map, enabling intuitive interpretation of drought stress distribution across different field zones.
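A compact sketch of this overlay step is given below using NumPy and Matplotlib. The patch size, colour ramp, grid layout, and output file name are illustrative assumptions rather than the exact pipeline used to produce Figure 12.

```python
# Sketch of the overlay step: patch-level drought classes painted back onto the
# RGB orthomosaic as a semi-transparent layer (colours and file names are
# illustrative, not the authors' exact code).
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

PATCH = 224                                   # classified patch size in pixels
# class index -> RGBA colour, WW1 (suitable moisture) ... WW5 (extreme drought)
COLORS = np.array([[0.0, 0.6, 0.0, 0.45],
                   [0.6, 0.8, 0.0, 0.45],
                   [1.0, 0.8, 0.0, 0.45],
                   [1.0, 0.4, 0.0, 0.45],
                   [0.8, 0.0, 0.0, 0.45]])

def drought_overlay(rgb, pred_grid):
    """rgb: (H,W,3) orthomosaic; pred_grid: (rows,cols) predicted class per patch."""
    overlay = np.zeros((*rgb.shape[:2], 4))
    for r in range(pred_grid.shape[0]):
        for c in range(pred_grid.shape[1]):
            overlay[r*PATCH:(r+1)*PATCH, c*PATCH:(c+1)*PATCH] = COLORS[pred_grid[r, c]]
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.imshow(rgb)                            # original field imagery as the base layer
    ax.imshow(overlay)                        # semi-transparent severity layer on top
    ax.legend(handles=[Patch(color=COLORS[i][:3], label=f"WW{i+1}") for i in range(5)],
              loc="lower right")
    ax.set_axis_off()
    return fig

# Example with synthetic inputs (a real run would use the stitched UAV mosaic
# and the model's per-patch predictions):
rgb = np.random.rand(5 * PATCH, 8 * PATCH, 3)
pred_grid = np.random.randint(0, 5, size=(5, 8))
drought_overlay(rgb, pred_grid).savefig("drought_severity_map.png", dpi=200)
```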
As shown in the visualization results, the predicted drought severity levels demonstrated a clear spatial pattern, closely matching the field’s actual planting layout. The output exhibited strong spatial interpretability and continuity in drought classification. Although minor misclassifications were observed in certain boundary or transition areas, these discrepancies were localized and did not significantly affect the overall trend of drought severity classification. Considering the inherent gradational nature of drought stress, confusion between adjacent severity levels was within an acceptable margin. The final drought visualization map not only revealed the spatial heterogeneity of drought stress within the field but also provided actionable insights for agronomic technicians and farm managers, allowing them to implement targeted irrigation and regulation strategies. This enhances both the scientific basis and efficiency of moisture management at the field scale.

4. Discussion

The introduction of a multimodal feature fusion strategy in MF-FusionNet has significantly improved both the accuracy and robustness of winter wheat drought stress monitoring. RGB images contribute rich spatial and texture features, such as leaf color changes and canopy structure, which visually reflect the crop’s health status. Meanwhile, vegetation indices, such as NDCI and OSAVI, quantitatively represent vegetation vigor and chlorophyll content, enhancing the model’s ability to characterize drought-induced physiological responses. By deeply integrating such heterogeneous multisource features at the representation level, MF-FusionNet is able to effectively learn and combine spatial, spectral, and physiological dimensions of key indicators during training. This approach compensates for the limitations of unimodal inputs in drought classification tasks. As shown in Figure 13 (feature contribution analysis of vegetation indices), indices like OSAVI and NDCI exhibited the highest contribution weights, with a contribution value of 0.016, meaning that removing these features resulted in a 1.6% drop in model accuracy. This finding further validates the necessity and effectiveness of incorporating vegetation indices into the fusion framework.
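The reported contribution values can be read as occlusion-style importances: the accuracy lost when a given index is removed from the input. A hedged sketch of such a leave-one-index-out analysis is shown below; `evaluate`, the data loader interface, and the zeroing-out strategy are hypothetical placeholders, and the paper’s exact procedure may differ.

```python
# Sketch of an occlusion-style contribution analysis for the vegetation-index
# inputs: each index is removed (here, zeroed) in turn and the resulting drop in
# validation accuracy is taken as its contribution (0.016 would mean a 1.6-point drop).
import torch

INDEX_NAMES = ["NDVI", "NDWI", "NDRE", "GVI", "SAVI",
               "EVI", "GNDVI", "OSAVI", "TVI", "NDCI"]

@torch.no_grad()
def evaluate(model, loader, drop_idx=None):
    """Validation accuracy with one vegetation index optionally zeroed out."""
    model.eval()
    correct = total = 0
    for rgb, indices, labels in loader:
        if drop_idx is not None:
            indices = indices.clone()
            indices[:, drop_idx] = 0.0          # remove one physiological feature
        preds = model(rgb, indices).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

def index_contributions(model, loader):
    base = evaluate(model, loader)
    # contribution = accuracy lost when the index is removed
    return {name: base - evaluate(model, loader, drop_idx=i)
            for i, name in enumerate(INDEX_NAMES)}
```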
The outcomes of this study are especially important given the growing need for non-destructive, timely, and scalable methods for crop stress detection. By demonstrating that meaningful drought classification can be achieved through the synergistic application of RGB and spectral data, this study paves the way for the widespread use of cost-effective drone platforms in precision agriculture. Furthermore, the success of the fusion strategy offers new directions for integrating other data modalities (e.g., thermal or LiDAR) in future research, potentially enhancing the spatiotemporal resolution and physiological relevance of crop stress models.
Overall, this work provides a strong foundation for developing more intelligent, adaptable, and field-ready solutions for drought monitoring, with implications not only for wheat but also for a wide range of crops and environmental conditions.
Although single-date remote sensing imagery can achieve reasonably accurate classification of drought stress, drought is inherently a cumulative and staged phenomenon. Relying solely on single-temporal data makes it difficult to capture the dynamic evolution of drought stress. Incorporating multi-temporal remote sensing data would enable continuous monitoring throughout the growth cycle, from sowing to maturity, allowing for tracking of crop development and drought responses using time-series changes in indices, such as NDVI and OSAVI. This would support earlier detection of stress and assessment of drought persistence [46].
For instance, index decline rates and peak variations in time-series signals can serve as early warning indicators [47], enhancing the timeliness and sensitivity of monitoring. Additionally, crops exhibit different levels of drought sensitivity at different phenological stages [48]. Integrating phenological information into stage-specific modeling could improve the model’s ability to assess yield impacts and support more precise agricultural decision-making.
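As a simple illustration of such an indicator, the snippet below computes a per-patch NDVI decline rate from a short acquisition series and flags rapidly declining patches; the threshold is an arbitrary placeholder rather than a calibrated value from this study.

```python
# Illustrative early-warning signal: per-patch NDVI decline rate between
# consecutive UAV acquisitions (the -0.01/day threshold is a placeholder).
import numpy as np

def ndvi_decline_rate(ndvi_series, dates_doy):
    """ndvi_series: (T, rows, cols) patch-mean NDVI; dates_doy: (T,) day-of-year."""
    d_ndvi = np.diff(ndvi_series, axis=0)
    d_t = np.diff(np.asarray(dates_doy, dtype=float))[:, None, None]
    return d_ndvi / d_t                      # NDVI change per day for each patch

series = np.random.uniform(0.3, 0.8, size=(4, 5, 8))    # 4 dates, 5 x 8 patch grid
rates = ndvi_decline_rate(series, dates_doy=[95, 105, 115, 125])
early_warning = rates.min(axis=0) < -0.01                # > 0.01 NDVI loss per day
print(f"{early_warning.sum()} of {early_warning.size} patches flagged")
```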

5. Conclusions

To overcome the limitations of current winter wheat drought monitoring approaches, which are typically constrained by reliance on unimodal data, environmental interference, and limited feature representation, this study introduced a lightweight multimodal network, MF-FusionNet. MF-FusionNet demonstrated strong performance in classifying winter wheat drought stress, owing to its effective fusion of RGB imagery and vegetation indices via the LMFB module. The inclusion of DCAFE further enhanced the model’s ability to capture fine-grained spatial and directional features, improving robustness to field-level variability, while the CFFS enabled efficient integration of multi-scale semantic information, supporting more accurate discrimination of drought severity levels across the field. Considering the computational constraints commonly encountered in agricultural applications, MF-FusionNet maintained low model complexity while achieving high accuracy, making it suitable for deployment on edge devices. Experimental results demonstrated that MF-FusionNet surpassed existing methods across key performance metrics, achieving 96.71% accuracy, 96.71% recall, and an F1-score of 96.64%, improvements of 1.35%, 1.43%, and 1.29%, respectively, over traditional unimodal models. These findings provide a valuable basis for real-time drought monitoring and precise irrigation management in large-scale winter wheat cultivation, contributing to improved agricultural resilience under climate variability. Overall, this study presents a more efficient and practical solution for the precise identification and intelligent monitoring of drought stress in winter wheat.

Author Contributions

Q.G., methodology, writing—original draft preparation; B.H., writing—review and editing; P.C., validation, visualization; Y.W., data curation, formal analysis; J.Z., resources, conceptualization, supervision. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Science and Technology Innovation 2030 “New Generation Artificial Intelligence” Major Project (2022ZD0115805), the Xinjiang Uygur Autonomous Region Major Science and Technology Project “Research on Key Technologies for Farm Digitalization and Intelligentization” (2022A02011-2), and the Autonomous Region Postgraduate Research Innovation Project (XJ2025G135).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author. The data are not publicly available due to privacy.

Acknowledgments

The authors are very grateful to the editor and reviewers for their valuable comments and suggestions to improve the paper.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
LMFB: Lightweight Multimodal Fusion Block
CFFS: Cross-Stage Feature Fusion Strategy
DCAFE: Dual-Coordinate Attention Feature Extraction Module
NM: Network Modification
SE: Squeeze-and-Excitation
CBAM: Convolutional Block Attention Module
ECA: Efficient Channel Attention
SimAM: A Simple, Parameter-Free Attention Module

References

  1. Zhao, H.; Cai, D.H.; Wang, H.L.; Yang, Y.; Wang, R.; Zhang, K.; Qi, Y.; Zhao, F.; Chen, F.; Yue, P.; et al. Progress and prospect on impact of drought disaster on food security and its countermeasures. J. Arid Meteorol. 2023, 41, 187–206. [Google Scholar]
  2. Zhu, G.; Liu, Y.; Shi, P.; Jia, W.; Zhou, J.; Liu, Y.; Ma, X.; Pan, H.; Zhang, Y.; Zhang, Z.; et al. Stable water isotope monitoring network of different water bodies in Shiyang River basin, a typical arid river in China. Earth Syst. Sci. Data 2022, 14, 3773–3789. [Google Scholar] [CrossRef]
  3. Wu, H.; Yang, Z. Effects of Drought Stress and Postdrought Rewatering on Winter Wheat: A Meta-Analysis. Agronomy 2024, 14, 298. [Google Scholar] [CrossRef]
  4. Shah, S.; Depeng, W.; Shah, F.; Alharby, H.F.; Bamagoos, A.A.; Mjrashi, A.; Alabdallah, N.M.; AlZahrani, S.S.; AbdElgawad, H.; Adnan, M.; et al. Comprehensive Impacts of Climate Change on Rice Production and Adaptive Strategies in China. Front. Microbiol. 2022, 13, 926059. [Google Scholar] [CrossRef]
  5. Wu, Y.M.; Zhu, J.T.; Zhu, D.L.; Li, D. Meta-analysis on influencing factors of irrigated winter wheat yield and water use efficiency in China. J. Irrig. Drain. 2020, 39, 84–92. [Google Scholar]
  6. Xiao, X.; Ming, W.; Luo, X.; Yang, L.; Li, M.; Yang, P.; Ji, X.; Li, Y. Leveraging multisource data for accurate agricultural drought monitoring: A hybrid deep learning model. Agric. Water Manag. 2024, 293, 108692. [Google Scholar] [CrossRef]
  7. Zait, Y.; Shemer, O.E.; Cochavi, A. Dynamic responses of chlorophyll fluorescence parameters to drought across diverse plant families. Physiol. Plant. 2024, 176, e14527. [Google Scholar] [CrossRef] [PubMed]
  8. Das, S.; Christopher, J.; Apan, A.; Choudhury, M.R.; Chapman, S.; Menzies, N.W.; Dang, Y.P. UAV-Thermal imaging and agglomerative hierarchical clustering techniques to evaluate and rank physiological performance of wheat genotypes on sodic soil. ISPRS J. Photogramm. Remote Sens. 2021, 173, 221–237. [Google Scholar] [CrossRef]
  9. Das, S.; Christopher, J.; Choudhury, M.R.; Apan, A.; Chapman, S.; Menzies, N.W.; Dang, Y.P. Evaluation of drought tolerance of wheat genotypes in rain-fed sodic soil environments using high-resolution UAV remote sensing techniques. Biosyst. Eng. 2022, 217, 68–82. [Google Scholar] [CrossRef]
  10. Maji, A.K.; Das, S.; Marwaha, S.; Kumar, S.; Dutta, S.; Choudhury, M.R.; Arora, A.; Ray, M.; Perumal, A.; Chinusamy, V. Intelligent decision support for drought stress (IDSDS): An integrated remote sensing and artificial intelligence-based pipeline for quantifying drought stress in plants. Comput. Electron. Agric. 2025, 236, 110477. [Google Scholar] [CrossRef]
  11. Mucchiani, C.; Zaccaria, D.; Karydis, K. Assessing the potential of integrating automation and artificial intelligence across sample-destructive methods to determine plant water status: A review and score-based evaluation. Comput. Electron. Agric. 2024, 224, 108992. [Google Scholar] [CrossRef]
  12. Devi, S.; Singh, V.; Yashveer, S.; Poonia, A.K.; Paras; Chawla, R.; Kumar, D.; Akbarzai, D.K. Phenotypic, physiological and biochemical delineation of wheat genotypes under different stress conditions. Biochem. Genet. 2024, 62, 3305–3335. [Google Scholar] [CrossRef] [PubMed]
  13. Gupta, A.; Kaur, L.; Kaur, G. Drought stress detection technique for wheat crop using machine learning. PeerJ Comput. Sci. 2023, 9, e1268. [Google Scholar] [CrossRef] [PubMed]
  14. Gao, S.; Liang, H.; Hu, D.; Hu, X.; Lin, E.; Huang, H. SAM-ResNet50: A Deep Learning Model for the Identification and Classification of Drought Stress in the Seedling Stage of Betula luminifera. Remote Sens. 2024, 16, 4141. [Google Scholar] [CrossRef]
  15. An, J.; Li, W.; Li, M.; Cui, S.; Yue, H. Identification and Classification of Maize Drought Stress Using Deep Convolutional Neural Network. Symmetry 2019, 11, 256. [Google Scholar] [CrossRef]
  16. Goyal, P.; Sharda, R.; Saini, M.; Siag, M. A deep learning approach for early detection of drought stress in maize using proximal scale digital images. Neural Comput. Appl. 2024, 36, 1899–1913. [Google Scholar] [CrossRef]
  17. Yang, D.; Wang, F.; Hu, Y.; Lan, Y.; Deng, X. Citrus Huanglongbing Detection Based on Multi-Modal Feature Fusion Learning. Front. Plant Sci. 2021, 12, 809506. [Google Scholar] [CrossRef]
  18. Li, S.; Song, Z.; Liang, Q.; Meng, L.; Yu, Y.; Chen, Y. Nondestructive Detection of Citrus Infested by Bactrocera dorsalis Based on X-ray and RGB Image Data Fusion. Trans. Chin. Soc. Agric. Mach. 2023, 54, 385–392, (In Chinese with English abstract). [Google Scholar]
  19. Yao, J.; Wu, Y.; Liu, J.; Wang, H. Multimodal deep learning-based drought monitoring research for winter wheat during critical growth stages. PLoS ONE 2024, 19, e0300746. [Google Scholar] [CrossRef]
  20. Meier, U. Growth Stages of Mono- and Dicotyledonous Plants: BBCH Monograph; Open Agrar Repositorium: Quedlinburg, Germany, 2018. [Google Scholar] [CrossRef]
  21. Rouse, J.W., Jr.; Haas, R.H.; Deering, D.W.; Schell, J.A.; Harlan, J.C. Monitoring the Vernal Advancement and Retrogradation (Green Wave Effect) of Natural Vegetation; NASA/GSFC Type III, Final Report; Greenbelt, MD, USA, 1974. [Google Scholar]
  22. Xu, H. Modification of normalised difference water index (NDWI) to enhance open water features in remotely sensed imagery. Int. J. Remote Sens. 2006, 27, 3025–3033. [Google Scholar] [CrossRef]
  23. Boiarskii, B.; Hasegawa, H. Comparison of NDVI and NDRE indices to detect differences in vegetation and chlorophyll content. J. Mech. Contin. Math. Sci. 2019, 4, 20–29. [Google Scholar] [CrossRef]
  24. Jiang, L.; Kogan, F.N.; Guo, W.; Tarpley, J.D.; Mitchell, K.E.; Ek, M.B.; Tian, Y.; Zheng, W.; Zou, C.; Ramsay, B.H. Real-time weekly global green vegetation fraction derived from advanced very high resolution radiometer-based NOAA operational global vegetation index (GVI) system. J. Geophys. Res. Atmos. 2010, 115, D11. [Google Scholar] [CrossRef]
  25. Xue, J.; Su, B. Significant remote sensing vegetation indices: A review of developments and applications. J. Sens. 2017, 2017, 1353691. [Google Scholar] [CrossRef]
  26. Huete, A.; Didan, K.; Miura, T.; Rodriguez, E.P.; Gao, X.; Ferreira, L.G. Overview of the radiometric and biophysical performance of the MODIS vegetation indices. Remote Sens. Environ. 2002, 83, 195–213. [Google Scholar] [CrossRef]
  27. Gitelson, A.A.; Viña, A.; Arkebauer, T.J.; Rundquist, D.C.; Keydan, G.; Leavitt, B. Remote estimation of leaf area index and green leaf biomass in maize canopies. Geophys. Res. Lett. 2003, 30, 1248. [Google Scholar] [CrossRef]
  28. Broge, N.H.; Leblanc, E. Comparing prediction power and stability of broadband and hyperspectral vegetation indices for estimation of green leaf area index and canopy chlorophyll density. Remote Sens. Environ. 2001, 76, 156–172. [Google Scholar] [CrossRef]
  29. Mishra, S.; Mishra, D.R. Normalized difference chlorophyll index: A novel model for remote estimation of chlorophyll-a concentration in turbid productive waters. Remote Sens. Environ. 2012, 117, 394–406. [Google Scholar] [CrossRef]
  30. Ma, X.; Dai, X.; Bai, Y.; Wang, Y.; Fu, Y. Rewrite the Stars. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5694–5703. [Google Scholar]
  31. Gupta, S.; Tripathi, A.K. Flora-NET: Integrating dual coordinate attention with adaptive kernel based convolution network for medicinal flower identification. Comput. Electron. Agric. 2025, 230, 109834. [Google Scholar] [CrossRef]
  32. Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  33. Yan, Q.; Feng, Y.; Zhang, C.; Pang, G.; Shi, K.; Wu, P.; Dong, W.; Sun, J.; Zhang, Y. Hvi: A New Color Space for Low-Light Image Enhancement. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 5678–5687. [Google Scholar]
  34. Koonce, B. MobileNetV3. In Convolutional Neural Networks with Swift for Tensorflow: Image Recognition and Dataset Categorization; Apress: Berkeley, CA, USA, 2021; pp. 125–144. [Google Scholar]
  35. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B.; et al. MobileNetV4: Universal models for the mobile ecosystem. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024; pp. 78–96. [Google Scholar]
  36. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet V2: Practical Guidelines for Efficient CNN Architecture Design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  37. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar]
  38. Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar] [CrossRef]
  39. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient Channel Attention for Deep Convolutional Neural Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  42. Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. Simam: A simple, parameter-free attention module for convolutional neural networks. In International Conference on Machine Learning; PMLR: New York, NY, USA, 2021; pp. 11863–11874. [Google Scholar]
  43. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
  44. Yuan, Y.; Li, Z.; Zhao, B. A Survey of Multimodal Learning: Methods, Applications, and Future. ACM Comput. Surv. 2025, 57, 1–34. [Google Scholar] [CrossRef]
  45. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443. [Google Scholar] [CrossRef] [PubMed]
  46. Felix, M.J.B.; Main, R.; Watt, M.S.; Arpanaei, M.M.; Patuawa, T. Early Detection of Water Stress in Kauri Seedlings Using Multitemporal Hyperspectral Indices and Inverted Plant Traits. Remote Sens. 2025, 17, 463. [Google Scholar] [CrossRef]
  47. Candiago, S.; Remondino, F.; De Giglio, M.; Dubbini, M.; Gattelli, M. Evaluating multispectral images and vegetation indices for precision farming applications from UAV images. Remote Sens. 2015, 7, 4026–4047. [Google Scholar] [CrossRef]
  48. Ning, D.; Zhang, Y.; Li, X.; Qin, A.; Huang, C.; Fu, Y.; Gao, Y.; Duan, A. The effects of foliar supplementation of silicon on physiological and biochemical responses of winter wheat to drought stress during different growth stages. Plants 2023, 12, 2386. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Location of the experimental field.
Figure 2. Experimental groups with different irrigation treatments.
Figure 3. Data acquisition equipment. (a) DJI Mavic 3 Multispectral Edition components, (b) drone, and (c) drone remote control.
Figure 4. Data processing flow. (a) RGB and multispectral images taken by drones. (b) DJI Terra. (c) Image processing. (d) Spliced RGB images and multispectral images. (e) Use QGIS to divide the experimental plot. (f) Use a Python script to crop to 224 × 224 pixels. (g) RGB corresponds one-to-one with multispectral. (h) RGB corresponds to vegetation index.
Figure 5. MF-FusionNet network architecture.
Figure 6. DCAFE architecture.
Figure 7. Cross-Stage Feature Fusion Strategy architecture.
Figure 8. Lightweight Cross-Attention architecture.
Figure 9. Performance radar chart for different networks.
Figure 10. Heatmaps for different attention mechanisms. (a) Input image, (b) DCAFE mechanism, (c) SE mechanism, (d) ECA mechanism, (e) CBAM mechanism, and (f) SimAM mechanism.
Figure 11. Accuracy curve of the validation set before and after model improvement.
Figure 12. Visualization map of winter wheat drought severity.
Figure 13. Vegetation index feature contribution rate analysis chart.
Table 1. UAV platform and sensor parameter information.
Equipment | Parameters | Parameter Content
UAV platform | Model | DJI Mavic 3 Multispectral
UAV platform | Bare weight | 951 g
UAV platform | Operating temperature | −10 °C to 40 °C
UAV platform | Maximum wind resistance speed | 12 m/s
Sensor | Field of view (RGB) | 84°
Sensor | RGB camera | 4/3 CMOS, 20 megapixels
Sensor | Equivalent focal length (RGB) | 24 mm
Sensor | Multispectral sensor type | 1/2.8″ CMOS, 5 megapixels
Sensor | Field of view (multispectral) | 73.91°
Sensor | Equivalent focal length (multispectral) | 25 mm
Sensor | Spectral bands | G: 560 ± 16 nm; R: 650 ± 16 nm; RE: 730 ± 16 nm; NIR: 860 ± 26 nm
Table 2. Multispectral vegetation indices and their calculation formulas.
Vegetation Index | Definition | References
Normalized Difference Vegetation Index (NDVI) | NDVI = (NIR − R)/(NIR + R) | [21]
Normalized Difference Water Index (NDWI) | NDWI = (G − NIR)/(G + NIR) | [22]
Normalized Difference Red Edge (NDRE) | NDRE = (NIR − RE)/(NIR + RE) | [23]
Green Vegetation Index (GVI) | GVI = (2 × NIR − R)/(2 × NIR + R) | [24]
Soil-Adjusted Vegetation Index (SAVI) | SAVI = (NIR − R)/(NIR + R + 0.5) × 1.5 | [25]
Enhanced Vegetation Index (EVI) | EVI = (NIR − R)/(1 + NIR + 2.4 × R) × 2.5 | [26]
Green-Normalized Difference Vegetation Index (GNDVI) | GNDVI = (NIR − G)/(NIR + G) | [27]
Optimized Soil-Adjusted Vegetation Index (OSAVI) | OSAVI = (NIR − R)/(NIR + R + 0.16) | [25]
Triangular Vegetation Index (TVI) | TVI = √(NDVI + 0.5) | [28]
Normalized Difference Chlorophyll Index (NDCI) | NDCI = (RE − NIR)/(RE + NIR) | [29]
Note: The vegetation indices in the table are calculated based on multispectral bands, with the band symbols being G for green, R for red, RE for red edge, and NIR for near-infrared. The values represent pixel reflectance, and the mean value within the patch is taken as the input feature after calculation.
Table 3. Dataset distribution.
Type of Drought | Training Set | Validation Set
Suitable moisture (WW1) | 230 | 75
Mild drought (WW2) | 200 | 95
Moderate drought (WW3) | 190 | 80
Severe drought (WW4) | 215 | 105
Extreme drought (WW5) | 215 | 95
Total | 1050 | 450
Table 4. Experimental environment.
Name | Related Configurations
Operating system | Windows 11
Processor | Intel Core i7-14700HX
Graphics | NVIDIA GeForce GTX4060 8 GB
Deep learning framework | PyTorch 2.3
Programming language | Python 3.12
Table 5. Comparative experiment of lightweight classification networks.
Classification Model | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
MobileNetv3_small | 88.76 | 88.55 | 88.90 | 1.51 | 4.72 | 0.06
MobileNetv4_conv_small | 87.31 | 87.21 | 88.63 | 2.47 | 4.98 | 0.18
ShuffleNet_v2 | 91.46 | 91.43 | 91.66 | 1.26 | 4.2 | 0.15
GhostNet_050 | 91.86 | 91.56 | 91.68 | 1.31 | 5.22 | 0.05
MobileVitv2_050 | 90.82 | 90.89 | 90.15 | 1.1 | 8.32 | 0.36
StarNet_s1 | 92.61 | 92.48 | 92.61 | 2.68 | 4.79 | 0.43
StarNet_tiny | 92.08 | 92.24 | 92.42 | 0.98 | 4.62 | 0.14
Fusion-StarNet | 95.36 | 95.28 | 95.35 | 1.21 | 5.26 | 0.16
Table 6. Comparative experiments of different attention mechanisms.
Attention Mechanism Name | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
ECA | 93.57 | 93.52 | 93.59 | 0.98 | 4.91 | 0.14
SE | 93.38 | 93.2 | 93.57 | 0.99 | 4.55 | 0.14
CBAM | 92.98 | 93.00 | 92.98 | 0.99 | 5.07 | 0.14
SimAM | 93.37 | 93.52 | 93.63 | 0.98 | 4.61 | 0.14
DCAFE | 93.81 | 93.07 | 93.26 | 1.01 | 5.11 | 0.15
Table 7. Comparative experiments of different DCAFE insertion strategies.
Stage 4 | Star Block | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
 | | 92.08 | 92.24 | 92.42 | 0.98 | 4.62 | 0.14
 | | 93.22 | 93.03 | 93.27 | 0.99 | 4.63 | 0.14
 | | 92.68 | 92.64 | 92.78 | 1.01 | 4.63 | 0.14
 | | 93.81 | 93.07 | 93.26 | 1.01 | 5.11 | 0.15
√: indicates that the DCAFE module is introduced after this module.
Table 8. Comparative trials using CFFS at different stages.
Experiment Name | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
A | 92.08 | 92.24 | 92.42 | 0.98 | 4.62 | 0.14
B | 94.01 | 94.08 | 94.16 | 0.99 | 4.68 | 0.15
C | 94.15 | 94.13 | 94.11 | 1.03 | 4.7 | 0.16
D | 94.38 | 94.26 | 94.49 | 1.18 | 4.72 | 0.16
E | 93.94 | 94.01 | 94.07 | 1.19 | 4.84 | 0.16
A: No cross-stage feature fusion. B: Fusion applied only at Stage 2. C: Fusion applied at Stages 2 and 3. D: Fusion applied at Stages 3 and 4. E: Fusion applied at Stages 2, 3, and 4.
Table 9. Ablation study of the improvement process.
Network Modification | DCAFE | CFFS | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
 | | | 92.61 | 92.48 | 92.61 | 2.68 | 4.79 | 0.43
√ | | | 92.08 | 92.24 | 92.42 | 0.98 | 4.62 | 0.14
√ | √ | | 93.81 | 93.07 | 93.26 | 1.01 | 5.11 | 0.15
√ | | √ | 94.38 | 94.26 | 94.49 | 1.18 | 4.72 | 0.16
√ | √ | √ | 95.36 | 95.28 | 95.35 | 1.21 | 5.26 | 0.16
Table 10. Comparison of different fusion strategies.
Multimodal Fusion Strategy | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
Only Image | 95.36 | 95.28 | 95.35 | 1.21 | 5.26 | 0.16
Early Fusion | 94.01 | 93.78 | 94.02 | 1.30 | 5.15 | 0.17
Intermediate Fusion (Ours) | 96.71 | 96.71 | 96.64 | 1.45 | 5.29 | 0.16
Late Fusion | 96.32 | 96.43 | 96.30 | 1.52 | 5.68 | 0.17
Table 11. Comparative experiments of different intermediate fusion methods.
Fusion Method | Acc/% | R/% | F1/% | Params/M | Inference/ms | GFLOPs
I | 93.43 | 93.47 | 93.72 | 1.22 | 4.85 | 0.15
II | 94.76 | 94.38 | 94.24 | 1.24 | 4.93 | 0.15
III | 95.12 | 95.03 | 95.24 | 1.30 | 5.08 | 0.16
IV | 96.71 | 96.71 | 96.64 | 1.45 | 5.29 | 0.16
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
