A Survey of Methods for Addressing Imbalance Data Problems in Agriculture Applications

Miftahushudur, Tajul; Sahin, Halil Mertkan; Grieve, Bruce; Yin, Hujun

doi:10.3390/rs17030454

¹ Research Centre for Telecommunication, National Research and Innovation Agency (BRIN), Bandung 40135, Indonesia

² Department of Electrical and Electronic Engineering, The University of Manchester, Manchester M13 9PL, UK

* Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(3), 454; https://doi.org/10.3390/rs17030454

Submission received: 4 November 2024 / Revised: 23 January 2025 / Accepted: 25 January 2025 / Published: 29 January 2025

(This article belongs to the Section Remote Sensing in Agriculture and Vegetation)

Abstract

This survey explores recent advances in addressing class imbalance issues for developing machine learning models in precision agriculture, with a focus on techniques used for plant disease detection, soil management, and crop classification. We examine the impact of class imbalance on agricultural data and evaluate various resampling methods, such as oversampling and undersampling, as well as algorithm-level approaches, to mitigate this challenge. The paper also highlights the importance of evaluation metrics, including F1-score, G-mean, and MCC, in assessing the performance of machine learning models under imbalanced conditions. Additionally, the review provides an in-depth analysis of emerging trends in the use of generative models, like GANs and VAEs, for data augmentation in agricultural applications. Despite the significant progress, challenges such as noisy data, incomplete datasets, and lack of publicly available datasets remain. This survey concludes with recommendations for future research directions, including the need for robust methods that can handle high-dimensional agricultural data effectively.

Keywords: data imbalance; agriculture; machine learning; sampling techniques; precision farming

1. Introduction

The challenges of food security and sustainable agriculture are becoming increasingly prevalent due to steady population growth, drastic climate change, and resource depletion. As the global population continues to rise, demand for food increases, placing pressure on agricultural systems to produce more with fewer resources. Urbanization, which reduces agricultural land, compounded by climate change, exacerbates this challenge and affects crop yields. Additionally, the depletion of natural resources such as water and arable land further complicates the task of ensuring a stable and sustainable food supply. To overcome these growing challenges, a smart and effective approach to crop and resource management is essential. Precision Agriculture (PA) is a promising approach that leverages advanced technologies such as sensing, data analytics, and automation to optimize all aspects of agriculture [1]. Real-time crop monitoring, fertilizer scheduling, and irrigation management are some of the PA applications that help farmers make quick and precise decisions while reducing waste and minimizing environmental impact [2].

The use of sensory data is crucial for PA applications, as their success heavily relies on accurate and timely data collection and analysis. For this reason, sensors are key components for gathering real-time data on various aspects of the farm ecosystem [3,4]. Various sensors are used in PA applications to suit different needs. For example, below-surface soil moisture sensors, pH sensors, and electrical conductivity sensors are used to provide insights into water availability, nutrient levels and overall soil health [5,6,7,8,9]. Other sensors, such as spectral sensors and height sensors, are employed to monitor plant health, growth stages, and potential issues, including nutrient deficiencies and pests [1,10,11,12]. Environmental sensors are used to track weather patterns, including temperature, humidity, wind speed, and rainfall [13,14]. These sensors provide a wealth of data that are essential for informed irrigation decisions and growth predictions [15,16,17,18,19,20]. Additionally, high-resolution imagery sensors from satellites [21,22] or unmanned aerial vehicles (UAVs) [23,24] and close-range imagers [25] are increasingly used to gain in-depth insights into the conditions of crops, monitor their development, detect diseases and pests, and identify areas where further treatments are required [26,27,28]. However, as the volume of data obtained continues to expand, traditional analysis methods become increasingly inadequate [29]. Consequently, more and more data analysis tasks rely on machine learning (ML) to extract meaningful information from the data and to provide suitable and more sophisticated solutions [24].

The utilization of ML algorithms in PA applications offers significant advantages over conventional analysis methods, such as the ability to handle high-dimensional data without sacrificing accuracy. One example is monitoring plant health conditions using hyperspectral or multispectral imagery, which is processed to derive spectral Vegetation Indices (VIs) as indicators of plant health. One of the most common VIs is the Normalized Difference Vegetation Index (NDVI), which is related to vegetation greenness and can be used to detect changes in plant health [30]. However, the effectiveness of NDVI is limited by environmental factors, such as soil background effects, atmospheric conditions, and vegetation density [31,32]. Soil background effects can influence reflectance [33], while atmospheric conditions [34], such as clouds and dust, may disturb the accuracy of measurements. NDVI values are not always uniform within a scene due to various factors, especially cloud shadows or other objects [35]. Variations in these conditions can lead to differences in reflectance values for each pixel of an object. For example, in bad weather or when an area is covered by thick clouds that reduce sunlight intensity [36], reflectance values from plants become very low, thus lowering the NDVI value. This can cause analysis to misclassify healthy plants as unhealthy, decreasing accuracy [37,38]. In addition, high vegetation density can lead to NDVI value saturation [41], and sensor characteristics, such as spatial and spectral resolution, limit the ability to detect variations in vegetation [42]. By contrast, ML-based analysis can provide better reliability in analyzing more complex spectral data by considering feature patterns, even in the presence of environmental changes [39,40]. Combining various VI features, including NDVI, as inputs to ML models can significantly enhance the performance of plant health analysis and plant classification [39].
However, the inconsistency in NDVI values may lead to intra-class variance [43,44]. This phenomenon occurs when an object/class exhibits a range of NDVI values, potentially creating sub-classes that are unevenly spread within a main class. As a result, analysis using VIs alone does not always provide the expected level of accuracy.
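
For reference, NDVI is conventionally computed per pixel as (NIR - Red) / (NIR + Red), where NIR and Red are the near-infrared and red reflectances. The sketch below is only an illustration of how pixels of a single class can produce a spread of NDVI values; the reflectance pairs are made-up numbers, not measurements from any cited study, and the zero-signal convention is an assumption.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for one pixel.

    nir and red are reflectances in [0, 1]; the result lies in [-1, 1].
    """
    total = nir + red
    if total == 0:
        return 0.0  # convention chosen here for pixels with no signal
    return (nir - red) / total


# Hypothetical (NIR, red) reflectance pairs for pixels of the same healthy crop.
healthy_pixels = [(0.50, 0.08), (0.45, 0.10), (0.30, 0.12)]
values = [ndvi(nir, red) for nir, red in healthy_pixels]
spread = max(values) - min(values)  # intra-class spread of the index
```

A large `spread` within one class is the kind of intra-class variance described above: a fixed NDVI cut-off would split these pixels into different health categories even though they share a label.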

On the other hand, ML algorithms, such as Support Vector Machine (SVM) [45] and Convolutional Neural Networks (CNN) [46], have demonstrated their effectiveness for agricultural applications such as disease/stress detection, crop/weed segmentation and more [45,46,47,48,49,50,51,52]. Another advantage of deploying ML models is their ability to efficiently process and analyze large amounts of data, allowing real-time and continuous monitoring of plant health [53,54].

One of the most common challenges in developing ML and deep learning (DL) models is data imbalance, which occurs when the number of samples in one class is significantly smaller or larger than in other classes of a dataset [55]. This issue is not confined to a specific field but is frequently encountered in many domains such as healthcare, finance, and security [56,57,58]. In the context of PA, data imbalance becomes more complex due to the unique characteristics of agricultural data, such as the irregularity of events (e.g., pest outbreaks or rare diseases) [59], limited access to data from remote or under-researched regions [60], and seasonal variations that cause certain agricultural phenomena to be infrequent or difficult to capture [61]. As a result, ML models trained on imbalanced datasets tend to be biased towards the majority class, making it difficult for them to generalize to under-represented classes. For example, a model primarily trained on healthy plants may fail to detect rare diseases that occur sporadically, thereby reducing the reliability of ML solutions. Therefore, understanding and addressing data imbalance is crucial for improving the accuracy, robustness, and reliability of ML models in agricultural applications.

In PA, the focus tends to be on identifying rare diseases or stress conditions that occur infrequently in crops. In such situations, detecting rare cases is much more critical, as costs arising from misclassification can be highly significant [62]. Data imbalance is one of the factors that exacerbate this issue, as ML models tend to prioritize the majority class so as to minimize overall errors (global loss). As a result, classes with fewer representations, such as crops with rare diseases, are often overlooked or misclassified. The two types of error carry different costs. If a model predicts healthy plants as infected, the costs may include unnecessary pesticide purchases and applications. However, if a model fails to detect a rare disease that should have been identified and treated, the consequences would be more severe. The infection could spread to other areas, increasing the risk to the entire agricultural field [63]. In addition, infected plants may die, resulting in significant losses in yields. In this scenario, the costs encompass not only the loss of crop yields but also broader costs associated with handling disease spread and rehabilitating affected areas [64].

Data imbalance not only affects the training of ML models on intended tasks but also impacts the measurement or evaluation of model performance. This is because standard evaluation metrics alone, such as accuracy, are often biased towards majority classes and do not reflect performance on minority classes well [65]. For example, consider an ML model trained to classify patients as healthy or unhealthy, with a training dataset consisting of 10,097 (94%) healthy patients and only 677 (6%) patients diagnosed with congenital heart disease (CHD) [66]. A classifier that always predicts healthy would achieve 94% accuracy and 100% recall for the healthy class. This may sound perfect, but the model fails to identify any diseased patients, i.e., its recall for the CHD class is 0%. Therefore, relying solely on accuracy in imbalanced datasets can be misleading and insufficient for judging a model’s performance. While the imbalance in this example is extreme, ratios such as 10:100 or 15:100 are also common in agricultural applications. For instance, in [67], healthy potato samples (1430) vastly outnumber diseased ones, e.g., those affected by early blight (203 samples) or late blight (251 samples). While this highlights the limitations of the overall accuracy measure for imbalanced datasets, it is important to recognize accuracy as a common and useful measure. Combined with other metrics, or reported as class-wise accuracy and recall, it can provide useful context for model performance.
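
The arithmetic behind the CHD example can be checked with a short sketch. It uses the class counts quoted above and a trivial classifier that always predicts the majority class; the function name and label strings are illustrative choices, and the helper assumes a binary problem.

```python
def accuracy_and_recalls(y_true, y_pred, positive):
    """Return overall accuracy, recall of `positive`, and recall of the other class.

    Assumes exactly two classes are present in y_true.
    """
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    pos_total = sum(t == positive for t in y_true)
    neg_total = len(y_true) - pos_total
    pos_hit = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    neg_hit = correct - pos_hit
    return correct / len(y_true), pos_hit / pos_total, neg_hit / neg_total


# Class counts taken from the CHD example in the text.
y_true = ["healthy"] * 10097 + ["chd"] * 677
y_pred = ["healthy"] * len(y_true)  # trivial classifier: always predicts healthy

acc, chd_recall, healthy_recall = accuracy_and_recalls(y_true, y_pred, positive="chd")
print(f"accuracy={acc:.3f}, CHD recall={chd_recall:.3f}, healthy recall={healthy_recall:.3f}")
```

The accuracy comes out at roughly 0.94 while the recall for the CHD class is exactly zero, which is why per-class metrics are needed alongside accuracy.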

To address the issue of data imbalance, there exist a number of approaches, which can be broadly categorized according to the level at which they operate. These are algorithm-level, data-level and hybrid-level methods [68]. The algorithm-level methods aim to compensate for data imbalance by modifying the classification algorithm, such as adjusting the threshold value or assigning different weights to each class. The data-level methods involve modifications to the datasets. The easiest data-level approach is resampling, which aims to reduce the degree of imbalance. On the other hand, the hybrid-level approaches combine algorithm-level and data-level methods synergistically. The hybrid-level approach aims to utilize the strengths of each method to produce more robust and accurate ML models [69].

This paper aims to survey various approaches proposed to address the problem of data imbalance, focusing on agricultural applications. Several common techniques are discussed. We provide a detailed analysis of their strategies and highlight their respective limitations. More specifically, we review various resampling methods in the recent literature that are commonly applied to address challenges in agricultural applications, such as plant disease detection, soil management and monitoring, and plant/crop classification. Furthermore, we analyze and review the challenges identified in this survey, and provide suggestions and recommendations for future research.

2. Challenge of Imbalance Classification in Agricultural Applications

In the context of agricultural applications, the challenge of imbalanced data classification becomes increasingly complex due to the necessity of addressing various types of classification problems, including multiclass classification and intra-class classification. Multiclass classification requires specialized techniques to effectively address imbalances between different classes of crops or diseases. Conversely, intra-class classification encounters difficulties in distinguishing subtle variations within a single type of crop that shares similar features.

2.1. Multiclass Classification

Although research in addressing class imbalance problems has been growing recently, most studies focus on binary imbalance problems, where only two classes of data need to be handled. However, in many real-world scenarios, problems are often multiclass, i.e., there are more than two classes in a dataset. Handling multiclass imbalance cases is more complex than binary cases, as it involves two main scenarios: ‘multiminority’ and ‘multimajority’. In the ‘multiminority’ scenario, there are several minority classes with fewer samples than the majority classes, while in the ‘multimajority’ scenario, there are several majority classes that dominate the dataset. Both scenarios can reduce classification performance, particularly for the minority classes [70].

2.2. Intra-Class Classification

Another challenge in class imbalance problems that can affect classification performance is intra-class imbalance. As previously mentioned, intra-class imbalance occurs when a class within a dataset has highly varied attributes or characteristics [71,72]. As with the multiclass case, most solutions for imbalanced classification focus only on inter-class imbalance, often neglecting the existence of intra-class imbalance. In some cases, solutions for inter-class imbalance can be adapted to address intra-class problems if there are enough samples with specific labels [77]. However, in practice, collecting samples with specific labels is often difficult and costly [73]. For example, it is more difficult to obtain leaf samples infected by mosaic virus than by yellow leaf curl virus in tomato [78]. Another example in agriculture is the classification of apple varieties based on different growth conditions or planting methods. For instance, the class “apple tree” might include several varieties such as Gala, Fuji, Granny Smith, and Honeycrisp [74,75]. Although they are all apple trees, each variety has differences in leaf color, leaf shape, and fruit skin texture. Suppose Gala and Fuji varieties are more common and have many samples in the dataset, while Granny Smith and Honeycrisp are less frequently found. In that case, this uneven distribution of attributes within the “apple tree” class may cause the intra-class imbalance problem [76]. This problem is hard to avoid, especially in image classification tasks. Therefore, developing an approach that can effectively augment specific samples within a class would be valuable. Techniques such as Generative Adversarial Networks (GANs) could be employed to synthesize data with more specific label variations.

2.3. Impact of Data Imbalance on ML Pipeline

Data imbalance presents significant obstacles across various stages of ML pipelines, from data collection to evaluation. To address this, challenges are classified into three key stages: data collection, model training, and evaluation metric selection.

Data Collection. Datasets can suffer from unbalanced class distributions when minority classes are scarce or difficult to collect due to various factors. One cause is natural rarity, such as rare diseases in crops or specific pest infestations that only occur in certain environmental conditions. Limited access to certain populations or regions is also a factor, such as remote farmlands that are difficult to reach for surveys or sampling. Finally, limited technology or resources to capture sufficient data in complex or hard-to-reach environments, such as farms in mountainous areas or under extreme weather, can also exacerbate data imbalance issues.
Another issue faced in data collection is the need to obtain varied data. In agriculture, for example, it is important to collect samples of plant species from different positions, as well as from different stages of plant growth or varying degrees of virus severity. This is crucial to ensure that ML models can generalize well and make accurate predictions across a wide range of real-world conditions. Unfortunately, in practice, obtaining data with sufficient variation is often constrained by limited time, costs, and availability of representative samples. This challenge can be mitigated by generating or synthesizing realistic data using data augmentation as a pre-processing strategy. Data augmentation methods, such as random cropping, flipping, or color jittering, can be used to artificially increase the diversity of the dataset without the need for additional data collection [79]. Furthermore, generative models, such as GANs, can be employed to create new, realistic samples that represent underrepresented classes [80]. These data generation techniques have proven effective in addressing the challenges posed by limited or imbalanced data in real-world agricultural scenarios [81,82].
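
The simple augmentation operations named above can be sketched in a few lines. This is a minimal illustration on a toy grayscale image stored as nested lists; the function names are hypothetical, and brightness jitter on one channel stands in for color jittering, which on real RGB or multispectral data would act per channel.

```python
import random


def horizontal_flip(image):
    """Mirror an image (list of pixel rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window at a random position."""
    top = rng.randint(0, len(image) - crop_h)
    left = rng.randint(0, len(image[0]) - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def brightness_jitter(image, max_delta, rng):
    """Shift all pixel intensities by one random offset, clipped to [0, 255]."""
    delta = rng.randint(-max_delta, max_delta)
    return [[min(255, max(0, px + delta)) for px in row] for row in image]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy 4x4 grayscale "leaf image"; real inputs would be RGB or multispectral arrays.
image = [[10 * r + c for c in range(4)] for r in range(4)]
augmented = [
    horizontal_flip(image),
    random_crop(image, 3, 3, rng),
    brightness_jitter(image, 20, rng),
]
```

Applying such transforms only to minority-class images is one way to raise their share of the training set without collecting new samples.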
Model Training. During training, bias towards majority classes can become a significant issue, for instance, in plant disease detection for cassava, where datasets are often imbalanced. In this case, the number of healthy cassava plant images far exceeds those of plants infected with disease, causing the model to predominantly predict the majority class (healthy plants), reducing the model’s accuracy in detecting diseases in cassava plants [83].
Furthermore, overfitting presents another challenge, particularly when models are trained on imbalanced data. Because the model “sees” the majority class far more often during training, it becomes overfitted to that class and tends to overlook the minority class. In practice, such models struggle to generalize to new, unseen data and tend to predict the majority class [84]. In addition to data augmentation, this challenge often requires solutions such as cost-sensitive learning, where greater penalties are applied to errors involving the minority class, or threshold moving, where the decision threshold is tuned to better balance predictions between the majority and minority classes, for example to maximize recall or F1-score for the minority class [85,86].
Evaluation Metric Selection. Class imbalance can make common metrics such as accuracy misleading, as a model may achieve high accuracy by predominantly predicting the majority class while failing to correctly classify the minority class. Similarly, precision and recall, when used independently, may not provide a complete picture of model performance. For example, a model can achieve high precision for a minority class while missing most of its instances, i.e., with low recall [87]. Alternative metrics such as F1-score, Matthews Correlation Coefficient (MCC), or G-Mean are often more informative in the context of class imbalance [88].
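
As an illustration of these metrics, the sketch below computes F1-score, G-mean and MCC from the four entries of a binary confusion matrix, using their standard definitions. The confusion-matrix counts are hypothetical, and reporting undefined ratios as 0.0 is a convention chosen for this example.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """F1-score, G-mean and MCC for the positive (minority) class.

    Undefined ratios (zero denominators) are reported as 0.0 by convention here.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    g_mean = math.sqrt(recall * specificity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"f1": f1, "g_mean": g_mean, "mcc": mcc}


# Hypothetical counts: 50 diseased and 950 healthy plants.
always_healthy = imbalance_metrics(tp=0, fn=50, fp=0, tn=950)  # accuracy 0.95
useful_model = imbalance_metrics(tp=30, fn=20, fp=10, tn=940)  # accuracy 0.97
```

The two models differ by only two percentage points in accuracy, yet all three metrics are zero for the always-healthy classifier and clearly positive for the model that detects some diseased plants.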

3. Methods to Address Imbalanced Data

Many real-world applications of ML models suffer from unbalanced class distributions, where the amount of data in minority classes is significantly less than that of majority classes. Training on imbalanced data can bias the classification model, so that it is more likely to classify data into the majority classes and ignore the minority classes [65,89]. This happens because ML algorithms are generally optimized to minimize the overall errors (rather than the individual errors per class). In an imbalanced dataset, the minority classes have less impact on the total errors compared to the errors due to the majority classes. As a result, the model may focus more on correctly predicting the majority classes, even if this means sacrificing prediction accuracy for the minority classes. This can lead to a disparity in model performance, where the model performs well on the majority classes but not on the minority classes. In addition, minority classes may not be well represented during model training. This means that the model has fewer examples of the minority classes to learn from, making it more difficult to capture their patterns and characteristics [65,89].

Various methods have been developed to overcome the imbalance data problem. In general, they can be categorized into three approaches:

Algorithm-level Approach:
Focuses on modifying classification algorithms to be more sensitive to minority classes.
Data-level Approach:
Focuses on manipulating data to balance class distributions, for example by oversampling or undersampling.
Hybrid-level Approach:
Combines algorithm-level and data-level approaches to obtain more optimized methods.

Table 1 summarizes some common strategies and limitations of each of these approaches.

3.1. Algorithm-Level Approach

The goal of algorithm-level strategies is to adapt ML algorithms to more effectively handle imbalanced training data conditions. Specifically, they increase the influence of minority class samples during training, which helps to minimize the bias of the trained model towards majority classes [65]. Two predominant techniques discussed in the literature are “cost-sensitive learning” and “threshold moving”.

3.1.1. Cost-Sensitive Learning

Cost-sensitive learning can be implemented during both the training and evaluation phases of a model, so that classification focuses more on classes that are costly to misclassify. This is achieved by modifying the model’s cost function, so that classification errors on certain classes contribute more to the overall cost function value.

In this approach, true positives and true negatives are generally not assigned additional costs because they represent correct classifications [96]. Conversely, higher costs are typically assigned to misclassified minority class samples (false negatives, when the minority class is treated as the positive class) to address the class imbalance problem. In this way, the model is encouraged to reduce the number of missed minority class samples, resulting in a more balanced model performance across all classes. Once the cost matrix is defined, it can be applied to a classifier such as SVMs [97] or CNNs [98].
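
One common way to realize this idea is a class-weighted loss. The sketch below is a minimal example under stated assumptions: it uses the inverse-frequency heuristic N / (K * n_c) for the class weights and a weighted binary cross-entropy; neither choice is prescribed by the cited works, and the toy labels and probabilities are invented.

```python
import math


def balanced_class_weights(labels):
    """Weight each class by N / (K * n_c): rarer classes get larger weights."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def weighted_log_loss(y_true, p_positive, weights, eps=1e-12):
    """Class-weighted binary cross-entropy; labels are 0 (majority) or 1 (minority)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_positive):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum


y = [0] * 9 + [1]              # toy data: 9 healthy, 1 diseased
w = balanced_class_weights(y)  # {0: 10/18, 1: 10/2}
# Case 1: the single diseased sample is missed (predicted probability 0.1).
miss_minority = weighted_log_loss(y, [0.1] * 10, w)
# Case 2: one healthy sample is misclassified, the diseased one is found.
miss_majority = weighted_log_loss(y, [0.9] + [0.1] * 8 + [0.9], w)
```

With equal weights the two cases would incur the same loss, since each contains exactly one confident mistake. With the class weights, missing the diseased sample is penalized far more heavily, which is the behaviour cost-sensitive learning aims for.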

3.1.2. Threshold Moving

Threshold moving is a technique designed to address unbalanced classification by adjusting the classifier’s decision threshold. Typically, in balanced datasets, the default threshold probability is 0.5 [99]. However, this threshold might not be suitable for imbalanced datasets, as it often leads to the classifier being biased towards predicting majority classes. Therefore, adjusting the decision threshold may help to achieve better classification results for minority classes.

Some studies suggest using Receiver Operating Characteristic (ROC) curves and Precision–Recall (PR) curves for analyzing and determining the optimal threshold for classification [87,100,101]. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold settings, giving a balanced view of performance across both majority and minority classes. In contrast, the PR curve focuses on the precision (positive predictive value) and recall (sensitivity) of the minority classes, making it particularly useful for imbalanced datasets where the performance of minority classes is critical [87,101].

These algorithm-level approaches are particularly effective for large datasets with extreme imbalance, where data-level methods like oversampling may incur high computational costs or risk overfitting. Cost-sensitive learning adjusts the loss function to penalize misclassification of minority class samples more heavily, making it a robust choice that preserves the original class distribution without requiring preprocessing steps. Threshold moving adjusts the decision threshold to favor minority classes and is effective when the model outputs well-calibrated probabilistic scores. In scenarios where achieving high recall for the minority class is critical (e.g., detecting rare diseases or identifying plant diseases in early stages), moving the threshold to a lower value for minority classes can significantly improve their detection rate.
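
Threshold moving can be sketched as a simple search over candidate thresholds on a validation set. The example below selects the threshold that maximizes the minority-class F1-score; the validation labels, scores, and the grid of candidates are all hypothetical, and other selection criteria (e.g., a target recall) would work the same way.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1-score of the positive (minority) class when predicting score >= threshold."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(y_true, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(y_true, scores))
    fn = sum(y == 1 and s < threshold for y, s in zip(y_true, scores))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(y_true, scores, candidates):
    """Pick the candidate threshold with the highest minority-class F1."""
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Hypothetical validation scores: the model is under-confident on diseased plants.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
scores = [0.05, 0.11, 0.12, 0.16, 0.21, 0.27, 0.47, 0.32, 0.41, 0.72]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

default_f1 = f1_at_threshold(y_val, scores, 0.5)
tuned = best_threshold(y_val, scores, candidates)
tuned_f1 = f1_at_threshold(y_val, scores, tuned)
```

On this toy data the default threshold of 0.5 finds only one of the three diseased samples, whereas the tuned threshold is lower and recovers all three at the price of one false alarm.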

3.2. Data-Level Approach

In the data-level approach, resampling is a critical preprocessing step designed to adjust class distributions within the dataset. This adjustment helps to mitigate the bias that classification algorithms may show towards majority classes. There are two main methods for resampling: undersampling and oversampling.

The difference between undersampling and oversampling techniques is illustrated in Figure 1.

3.2.1. Undersampling

The purpose of undersampling is to balance sample distributions in a dataset by reducing the sample size of majority classes. Although this technique can effectively decrease data processing time due to the reduced amount of data, it risks losing important information from the majority classes, potentially reducing classification performance.
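
A minimal sketch of random undersampling, the simplest technique of this kind, is given below. It keeps a random subset of each class equal in size to the smallest class; the function name, the toy feature vectors and the fixed seed are illustrative assumptions.

```python
import random


def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so every class keeps as many as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        kept = rng.sample(xs, target)  # sampling without replacement
        out_x.extend(kept)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy dataset: 90 "healthy" feature vectors and 10 "diseased" ones.
X = [[i, i % 7] for i in range(100)]
y = ["healthy"] * 90 + ["diseased"] * 10
X_bal, y_bal = random_undersample(X, y)
```

Here 80 of the 90 healthy samples are discarded, which makes the information loss mentioned above concrete: the balanced set is small, and nothing guarantees that the retained majority samples are the most informative ones.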

Several studies have been conducted to address the issue of imbalanced data using undersampling. Some popular undersampling techniques are summarized in Table 2.

3.2.2. Oversampling

The fundamental idea of oversampling is to replicate or generate new samples for minority classes in a dataset. Similar to undersampling, oversampling also helps mitigate bias towards majority classes. This approach can be carried out in several ways, from simply duplicating existing instances to more complex methods to create synthetic samples. The main challenge in oversampling is to augment minority classes effectively without introducing significant bias or overfitting [103]. When this challenge is dealt with, the oversampling method can lead to more robust and accurate classification models by giving them a more representative view of minority classes [104].

Some popular oversampling techniques include Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE) [105], Adaptive Synthetic Sampling (ADASYN) [106], Borderline SMOTE [107], Safe-Level SMOTE (SL-SMOTE) [108], K-Means SMOTE [109], and SVM SMOTE [110]. The strategies and limitations of ROS and these SMOTE-based techniques are presented in Table 3. It is important to note that although these techniques are regarded as traditional approaches, they are still being used and developed today. Recent research often takes inspiration from them to produce techniques that are more robust and adaptive to the characteristics of the dataset. For example, the study in [47] modifies SVM SMOTE to be more robust in handling ambiguous data in both majority and minority classes.
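
The core step shared by the SMOTE family is interpolation between a minority sample and one of its nearest minority-class neighbours. The sketch below illustrates only this basic step in simplified form; it is not the reference implementation of [105], it assumes at least two minority samples, and the 2-D feature values are hypothetical.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical 2-D features of five minority-class (diseased) samples.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]]
new_points = smote(minority, n_new=10)
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stay inside the region already covered by the minority class. Variants such as Borderline SMOTE or ADASYN differ mainly in how they choose which base samples to interpolate from.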

Table 2. List of undersampling techniques.

Method	Strategy	Limitation
Random Undersampling (RUS)	Randomly deletes existing samples of the majority class until the number of majority samples equals that of the minority class.	RUS may lead to a loss of valuable information, and can result in underfitting as it reduces the dataset size.
Edited Nearest Neighbour (ENN) [111]	Deletes majority class samples based on their nearest neighbours. If the neighbours of a majority class sample are dominated by samples from a different class, the sample is deleted.	It may not entirely reduce the degree of imbalance, as there may not be many majority samples meeting the criteria.
Neighbourhood Cleaning Rule (NCL) [112]	An extension of ENN. This method considers the neighbourhood distributions of both majority class samples and minority class samples to remove more of the majority class samples.	Although it might remove more samples than ENN, there is still potential for not fully reducing the imbalance degree, especially in extremely imbalanced data conditions.
Tomek-Links [113]	Identifies “Tomek-Links” pairs and removes the majority class sample of each pair. Two samples from different classes form a Tomek-Links pair if they are each other’s nearest neighbours, i.e., no other sample is closer (in Euclidean distance) to either of them.	Only considers samples in the boundary area, so noise or overlapped samples may still exist.
Cluster-based undersampling [114]	Uses clustering algorithms to select cluster centres, or their nearest samples based on the K-Nearest Neighbour (K-NN) rule, to represent majority class samples.	While this technique helps retain important information from the majority class, it may disrupt the original sample distribution.
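
To make the Tomek-Links entry of Table 2 concrete, the sketch below detects pairs of samples from different classes that are mutual nearest neighbours and removes the majority-class member of each pair. It is a brute-force illustration with hypothetical 1-D data and function names, not an optimized or reference implementation.

```python
import math


def nearest_index(points, i):
    """Index of the nearest other point to points[i] (Euclidean distance)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and belong to different classes."""
    nn = [nearest_index(points, i) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_majority_in_links(points, labels, majority):
    """Drop the majority-class member of every Tomek link."""
    drop = {k for pair in tomek_links(points, labels) for k in pair
            if labels[k] == majority}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Toy 1-D example: the healthy sample at 2.0 sits next to the diseased one at 2.1.
points = [[0.0], [0.5], [1.0], [2.0], [2.1], [3.0]]
labels = ["healthy", "healthy", "healthy", "healthy", "diseased", "diseased"]
clean_points, clean_labels = remove_majority_in_links(points, labels, "healthy")
```

Only the healthy sample at 2.0 is removed, which reflects the limitation listed in the table: the method cleans the class boundary but barely changes the overall class ratio.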

The use of resampling depends heavily on the characteristics of the dataset [115]. For example, data-level methods such as SMOTE work effectively when majority and minority classes are well-separated with a clear decision boundary. However, in datasets with more complex class distributions or where minority class data points lie near or on the decision boundary, ADASYN is preferable as it adapts the sampling distribution to focus on harder-to-classify instances. Similarly, when many minority samples are ambiguous or similar to majority samples, SVM-SMOTE becomes more effective because it generates synthetic data points around the decision boundary. This helps refine the decision boundary and improve the classification of minority class instances. For noisy datasets, undersampling methods like Edited Nearest Neighbours (ENN) or Tomek Links outperform oversampling techniques by removing noisy instances and preventing noise amplification. Moreover, when majority classes exhibit significant intra-class variations, clustering-based undersampling (e.g., k-means) is beneficial as it minimizes the risk of losing critical data.

3.3. Hybrid-Level Approach

The fundamental principle of this approach is to apply one technique to compensate for the weaknesses of another. In other words, a hybrid method is a combination of two complementary techniques. Consequently, it often outperforms a single, stand-alone technique. The most commonly employed hybrid approach in the context of imbalanced classification problems is a combination of oversampling and undersampling [116]. This method is advantageous as it reduces the risk of losing crucial information and is less prone to overfitting.

A closely related strategy is the ensemble method, which combines multiple ML models trained using different methodologies and is one of the most popular techniques for improving classification performance. A considerable body of research has demonstrated that ensemble learning can yield superior performance to a single classifier on imbalanced datasets [117]. In ensemble learning, the predictions of multiple weak classifiers are combined, resulting in more accurate predictions than those made by any single model.

The two principal approaches in ensemble learning are bagging (bootstrap aggregation) and boosting. In the bagging approach, multiple models are trained independently, and the prediction result from each model is used as a reference for the final decision. In contrast, boosting works by training models sequentially, with each successive model attempting to correct the errors of the previous one. In order to overcome the imbalance problem associated with ensemble learning approaches, several strategies are typically combined. For example, feature selection and resampling methods can be combined to form an ensemble model [118]. Similarly, combining modified SMOTE with bootstrap and bagging has been shown to be effective [119]. Wirot et al. [120] proposed a Cost-Sensitive Neural Network Ensemble (CS-NNE), which combines multiple neural network models weighted by different class weights. A combination of SMOTE and feature selection adapted to ensemble learning was proposed in [121].

Hybrid methods combine the strengths of data-level and algorithm-level techniques to address their respective limitations. For example, SMOTE-ENN integrates synthetic oversampling with noise-reducing undersampling [122]. This approach compensates for the potential increase in noise associated with oversampling by analysing and removing noisy samples, resulting in a more robust performance compared to using SMOTE or ENN individually. However, hybrid methods like SMOTE-ENN are computationally intensive, making them better suited for applications where processing time is not critical.
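The two stages of SMOTE-ENN described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the reference implementation of [122]: it assumes a binary problem with at least two minority samples, uses a brute-force neighbour search, and the function names (`smote`, `enn`, `smote_enn`) and the default `k=3` are hypothetical choices made for this example.

```python
import random
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(idx, points, k):
    """Indices of the k nearest neighbours of points[idx], excluding itself."""
    order = sorted((j for j in range(len(points)) if j != idx),
                   key=lambda j: sq_dist(points[idx], points[j]))
    return order[:k]


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a random
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + lam * (b - a)
                          for a, b in zip(minority[i], minority[j])])
    return synthetic


def enn(X, y, k=3):
    """Edited Nearest Neighbours: drop every sample whose label disagrees
    with the majority label of its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority_label, k=3, seed=0):
    """Oversample the minority class to parity (binary case), then clean
    the augmented set with ENN."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(y) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, k, seed)
    X_aug = X + new_points
    y_aug = y + [minority_label] * len(new_points)
    return enn(X_aug, y_aug, k)
```

The brute-force neighbour search is quadratic in the number of samples and is repeated in both stages, which illustrates why the combined method costs more than either step alone.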

4. Leveraging Generative Models for Synthetic Data Generation in Addressing Imbalanced Datasets

DL techniques continue to evolve, leading to a growing emphasis on their adoption in generative models and data synthesis. These models are increasingly used for generating synthetic data that closely resemble real data. Two prominent deep generative models are GANs [123] and Variational Autoencoders (VAEs) [124]. These models are designed to understand and mimic the structure and distribution of real data. Such a capability becomes very useful, especially for generating data when real samples are limited. Additionally, transfer learning plays a key role in improving the performance of DL models by leveraging knowledge from pre-trained models, thus enabling more efficient use of limited data by transferring learned features or models from larger, related datasets to the task at hand.

4.1. Generative Adversarial Networks (GANs)

The potential of generative models, in particular GANs, for generating realistic synthetic data makes them a promising alternative to address the problem of data imbalance [125]. This approach is becoming a new trend in ML, particularly for the generation of high-dimensional data such as images, where conventional oversampling techniques such as SMOTE are less effective [126]. The original GAN model was developed by Goodfellow et al. in 2014 [123]. The architecture of GANs typically comprises two neural networks engaged in a competitive process: a generator and a discriminator. The generator is responsible for generating new (or fake) data, while the discriminator aims to differentiate between fake and real data. Both networks are trained simultaneously in an adversarial manner, with the objective for the generator to produce data that are so realistic that the discriminator is unable to identify it as fake [123]. An illustration of the GAN architecture can be found in Figure 2.
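The adversarial training described above is usually written as a two-player minimax game. In the standard formulation from the original GAN paper [123], D(x) is the discriminator's estimated probability that x is real, G(z) is the generator's output for a noise vector z, p_data is the real data distribution, and p_z is the noise prior:

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

The discriminator maximizes both terms by scoring real data near 1 and generated data near 0, while the generator minimizes the second term by producing samples that the discriminator scores as real.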

Although GAN has the potential to generate data that appear plausible, it is not possible to control the model to generate samples for a specific class. This is because the model will randomly generate samples from any trained class. The conditional GAN (cGAN) was subsequently developed to address this limitation of the standard GAN [127]. The basic concept of cGAN is to train the generator and discriminator in a conditional manner based on a specific class or label. Consequently, it is capable of being controlled to generate images based on the input label on the generator [127]. Figure 3 illustrates the concept of cGAN.

cGANs enhance the ability to generate more specific and easily controlled images or samples through the use of conditional labels. For instance, when synthesizing MNIST or Fashion-MNIST images, a cGAN model can be directed to generate specific digits or objects based on the desired input labels. This capability allows a cGAN model to generate diverse samples across different classes, making it more effective than the standard GAN for creating new samples across various classes. This is particularly useful for addressing multiclass imbalance data problems. Consequently, cGAN has been used to generate synthetic data that support ML models on imbalanced datasets [129].
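To make the role of the conditional label concrete, the sketch below prepares generator inputs for rebalancing a multiclass dataset. It assumes one common conditioning scheme, concatenating a noise vector with a one-hot class label; the survey does not prescribe a specific mechanism, and the function names, the class names, and the `noise_dim` value are hypothetical. No network is trained here: the code only shows how many samples each class would need and what a trained generator would receive as input.

```python
import random
from collections import Counter


def one_hot(label, classes):
    """One-hot encoding of label over an ordered list of classes."""
    return [1.0 if c == label else 0.0 for c in classes]


def synthesis_plan(labels):
    """Synthetic samples needed per class to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {c: target - n for c, n in counts.items()}


def generator_inputs(labels, noise_dim=4, seed=0):
    """Conditioned generator inputs: Gaussian noise z concatenated with the
    one-hot label of the class to be generated."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    inputs = []
    for label, n_needed in sorted(synthesis_plan(labels).items()):
        for _ in range(n_needed):
            z = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
            inputs.append((label, z + one_hot(label, classes)))
    return inputs
```

For a dataset with 6 healthy, 2 rust, and 3 blight samples, the plan requests 4 rust and 3 blight samples and none for the healthy class, which is the kind of per-class control that an unconditional GAN cannot offer.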

The cGAN model still has drawbacks. As a DL model, it must be trained with sufficient and balanced data to work effectively. If the model is trained with unbalanced data, its performance may not be optimal, especially when generating data for the minority class. Additionally, without proper fine-tuning, there is no guarantee that the synthetic data will be diverse enough. Including such data in the training set may fail to provide new and useful information to enhance the classifier’s robustness and generalizability. Another challenge in oversampling based on cGAN is controlling the distribution of generated data. Although generated data are realistic, they still have the potential to overlap with the majority class, which can diminish its effectiveness in addressing data imbalance issues [47].

There has been a significant increase in the use of GANs to solve imbalance classification problems. Some significant early studies included the introduction of Balanced GAN (BAGAN), which was designed to generate high-quality synthetic data with unbalanced training sets [130]. Another study utilized GANs for synthetic data augmentation to improve the classification of liver lesions in an imbalanced medical dataset [131]. Since 2020, various recent studies have further strengthened the role of GANs in handling data imbalance. Pan et al. introduced Improved GAN (I-GAN), which was specifically designed to generate samples from minority classes in imbalanced data situations [132]. Qin et al. introduced enhancements to the Wasserstein GAN (WGAN) to generate synthetic data that aid fault diagnosis in unbalanced data [133]. Sharma et al. [134] introduced SMOTifiedGAN, a combination of SMOTE and GAN, to overcome the imbalance classification problem in a generic dataset.

4.2. Variational Autoencoder (VAE)

Another approach under the generative model is the variational autoencoder (VAE) [124]. VAE is a development of the traditional autoencoder (AE) network primarily used for dimensionality reduction. The basic architecture of an autoencoder comprises two main components: an encoder and a decoder. The encoder’s function is to convert input data into a condensed, lower-dimensional representation. In contrast, the decoder is responsible for reconstructing the data to its original size and shape. Because standard autoencoders compress data into a more compact latent vector representation and then reconstruct it, they often produce reconstructed data with limited variation. Therefore, this mechanism is not ideal for tasks requiring high variability in data generation. The main distinction between a traditional autoencoder and a VAE is that while an autoencoder learns to produce a compressed output in the encoder, represented in a bottleneck or latent vector, a VAE learns the input data distribution. This results in the bottleneck in an autoencoder being divided into two separate layers—mean distribution and standard deviation—before learning the distribution as a latent vector, which is then reconstructed with the decoder to obtain the output (see Figure 4 for illustration). This stochastic process enables VAEs not just to learn a fixed representation but also to capture the distribution of the data in the latent space, allowing them to generate new, similar data points and making them highly effective for generative tasks.
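The sampling step and the training objective described above are commonly written as follows. Here μ and σ are the encoder outputs for an input x, q(z | x) is the encoder's approximate posterior, p(x | z) is the decoder, and the prior p(z) is assumed to be a standard normal distribution, which is the usual choice rather than something specified in this survey:

```latex
z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L}(x) = -\mathbb{E}_{q(z \mid x)} \left[ \log p(x \mid z) \right]
  + D_{\mathrm{KL}} \left( q(z \mid x) \,\|\, p(z) \right), \qquad p(z) = \mathcal{N}(0, I)
```

The first expression is the reparameterization that keeps sampling differentiable. In the loss, the first term rewards accurate reconstruction and the KL term keeps the latent distribution close to the prior, so that new samples can be generated by decoding draws from p(z).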

Recent studies show various approaches to addressing class imbalance in different types of data using VAEs. Several papers apply VAEs to generate synthetic data that enhance the performance of predictive models in regression contexts, such as Imbalanced Regression (IR) [136], and in disease detection, such as COVID-19 detection using chest X-ray images [137]. VAEs are also utilized to create synthetic data to improve prediction accuracy by integrating them into graph attention networks for construction management [138]. A technique called contrastive variational autoencoders is also used for oversampling in [139].

Although these generative approaches are effective in generating synthetic data, each still has its own weaknesses. For instance, while GANs can effectively generate high-quality data, they are still susceptible to the problem of mode collapse, where the variation in the generated data is quite limited [140]. On the other hand, VAEs are only effective in generating low-resolution data, as they are more likely to produce blurry data [141].

4.3. Transfer Learning

Transfer learning (TL) is an ML technique that leverages knowledge from previously learned tasks or domains to enhance performance on new, related tasks or domains [142]. This approach is particularly beneficial when the target task has limited or unbalanced training data, as it enables models to utilize information acquired from other sources [143]. Common methods of TL include feature extraction and fine-tuning [144]. In feature extraction, a pre-trained model is employed to extract features from new data, which are then used by other models for classification. Fine-tuning involves adjusting the weights of a pre-trained model on a target dataset, allowing the model to adapt to specific characteristics of the new data.

One significant advantage of TL in the context of imbalanced data is its ability to reduce the necessity for large training datasets [145]. By utilizing models trained on extensive, balanced datasets, TL enables models to recognize important patterns and features, even when data for minority classes are scarce. This capability can enhance the accuracy and generalization performance of models concerning minority classes.

However, applying TL to imbalanced data presents challenges. If there are substantial differences between the source and target domains, TL may result in negative transfer, where the transferred knowledge is irrelevant or detrimental to model performance [146]. Therefore, ensuring compatibility between source and target domains is crucial. Additionally, incorporating techniques such as class weight adjustment or error-cost-sensitive algorithms may be necessary to effectively address data imbalance [147,148].
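As an illustration of class weight adjustment, the sketch below computes inverse-frequency class weights, a widely used heuristic in which each class receives the weight n_samples / (n_classes × class_count). The function names and the example labels are hypothetical, and the studies cited above may use different weighting schemes.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_error_rate(y_true, y_pred, weights):
    """Misclassification rate in which each sample counts by its class weight."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total


# Illustrative data: 950 healthy and 50 infected samples, all predicted healthy.
y_true = ["healthy"] * 950 + ["infected"] * 50
y_pred = ["healthy"] * 1000
weights = balanced_class_weights(y_true)
error = weighted_error_rate(y_true, y_pred, weights)
```

With these weights each class contributes equally to the total, so a model that ignores the infected class is charged half of the total weight instead of only 5% of the samples.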

In agricultural applications, TL has been applied to address data imbalance in leaf disease detection. For instance, the Lightweight Federated Transfer Learning (LFTL) framework has been developed to detect and classify leaf diseases [149]. Additionally, the Progressive Loss-Aware Fine-Tuning Stepwise Learning (PLAFTSL) model has been proposed for rice disease detection. This model utilizes a fine-tuned ResNet50 architecture with stepwise learning to improve training efficiency [150].

5. Applications in Agriculture

The problem of class imbalance in agricultural datasets has become increasingly challenging, especially in real-world agricultural scenarios where minority conditions frequently occur. This section provides an in-depth examination of various techniques applied to tackle class imbalance. We will review approaches that have demonstrated effectiveness in mitigating data imbalance across diverse applications, including plant disease detection, soil management, and crop type classification.

5.1. Disease Detection

In disease detection, data imbalance often arises, as the number of diseased plant samples is significantly lower than that of abundant healthy plant samples. This imbalance can lead to models that are more likely to classify plants as healthy, thus reducing the accuracy of disease detection. Consequently, misclassifications of disease-affected samples, even a few, can increase costs in economic and time terms, as early detection errors may lead to broader disease spread. To address this challenge, various approaches have been developed and implemented, ranging from data resampling methods to learning techniques specifically designed to handle this imbalance. Table 4 presents a summary of recent approaches that have been successfully employed in addressing data imbalance issues in disease detection within the agricultural industry.

5.2. Soil Management

Soil management is crucial for optimizing and sustaining agricultural land management and improving crop yields. However, the problem of class imbalance in soil datasets often poses a significant challenge, as rare or specific soil types, such as those with unique mineral compositions or found only in certain locations, are often difficult to collect. As a result, ML models may become biased towards more common soil types. Such classification errors can negatively impact land management decisions; for instance, in precision agriculture, inaccuracies in soil classification can affect recommendations for fertilization and irrigation [160]. Additionally, these inaccuracies can influence strategic planning decisions in agriculture, such as selecting site locations for different crops or large-scale agricultural projects, thereby potentially reducing the effectiveness and success of agricultural practices [161]. Table 5 presents a summary of various approaches for soil classification that address data imbalance issues.

5.3. Crop Type Classification

Monitoring plant diversity in agricultural fields is a crucial step in optimizing crop yields and maintaining ecosystem balance. One commonly used method in this process is leveraging camera sensors. These sensors are effective since they can efficiently capture images in various environmental conditions over large areas. For this reason, conventional camera sensors are the primary choice for researchers in monitoring and classifying plants. However, with technological advancements, there has been a shift towards using multispectral and hyperspectral sensors [167]. A key advantage of hyperspectral imaging over conventional imaging is its ability to capture and measure the reflectance of a target in greater detail. By capturing a broader spectrum of light, hyperspectral sensors can more accurately identify the “spectral signature” of a target. This capability allows hyperspectral sensors to better characterize plants, distinguish between different species, assess plant health, and detect the presence of diseases or pests with much higher precision than conventional cameras [168].

The use of ML and spectral signatures for plant species classification has been widely employed in various studies. However, again, a major challenge in spectral data analysis is the issue of data imbalance, especially when dealing with high-dimensional data such as hyperspectral data that contain hundreds of features. The data imbalance problem in spectral features can theoretically be addressed with simple oversampling approaches like SMOTE. As mentioned before, this method has limitations when applied to high-dimensional data, which can lead to the creation of overlapped data. These overlapped data can damage the decision boundaries between different classes, thus reducing the model’s accuracy in distinguishing plant species. To overcome this problem, some studies have implemented dimensional reduction techniques, such as feature selection, which serves to reduce the number of irrelevant features before applying SMOTE oversampling [169,170,171].

Hyperspectral data contain not only spectral features but also spatial features and a combination of spectral–spatial features. Spectral–spatial features offer significant advantages in the classification process, primarily because they combine spectral information with spatial context, allowing for more accurate identification and the detection of more complex patterns. This approach is especially effective for plant classifications where some species share similar spectral signatures in some bands, such as in the case of alfalfa and corn oat samples in the Indian Pines dataset [172]. SMOTE works by mapping the original data into a high-dimensional feature space as data points and then generating new data points among the original dataset. Therefore, this approach is particularly suitable for vector data, such as spectral features of hyper-/multispectral images (H/MSI), but is not effective for creating spatial or spectral–spatial features in a 3D format. Consequently, a more suitable approach is needed to address this issue. One promising approach that has emerged in recent years is the use of GANs, which have the capability to generate realistic imagery data [125].

Over the past five years, there has been a significant increase in the use of GANs to solve classification problems on HSI with limited training data. Zhen et al. were among the first to apply GANs to HSI and proposed HSGAN, a semi-supervised classifier based on Auxiliary Classifier GAN (AC-GAN) [173]. HSGAN was further developed by incorporating a majority vote concept to enhance the classifier’s performance [174].

Additionally, Feng et al. introduced MSGAN, which consists of two generators and a discriminator [175]. In this architecture, the generators separately produce spectral and spatial data, while the discriminator combines both types of data to achieve more accurate classification. Another innovation was introduced by Zhong et al., who integrated a Conditional Random Field (CRF) into the discriminator to apply graphical constraints on the softmax layer, thereby strengthening the classification process [176].

Furthermore, Capsnet-Triple GAN combines Capsnet with TripleGAN—an architecture consisting of three main networks: a generator, a discriminator, and a classifier [177]. Meanwhile, Yin et al. incorporated a dropblock structure into the discriminator network to improve the model’s generalization ability [178]. Finally, Roy et al. developed HyperGAMO [179], which applies the Generative Adversarial Minority Oversampling (GAMO) architecture [180] to hyperspectral data to better handle spectral–spatial data.

5.4. Weed Detection

In weed detection, data imbalance often arises when the number of weed samples is significantly lower than that of healthy crops or background elements. This imbalance can cause models to favor classifying images as healthy crops, leading to reduced accuracy in identifying weeds. Moreover, the small size and scattered distribution of weeds, coupled with their visual similarity to young crop plants, pose additional challenges for accurate detection. Errors in weed detection can result in inefficient pesticide application, increased costs, and potential environmental damage. To address these challenges, several approaches have been explored, including data resampling techniques, adoption of specialized DL architectures like SegNet and U-Net, use of tailored loss functions such as weighted loss or focal loss, and use of GAN to increase the number of training samples. Accurate weed detection is crucial for optimizing pesticide use, minimizing environmental impact, and ensuring better crop yields. Table 6 summarizes recent advancements and methodologies employed to tackle data imbalance issues in weed detection within PA.
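Of the tailored loss functions mentioned above, focal loss is easy to state for a single binary prediction. The sketch below follows the commonly used form FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the probability assigned to the true class. The default values of `gamma` and `alpha` are illustrative assumptions, `alpha` is taken here as the weight of the positive (weed) class, and the prediction `p` is assumed to lie strictly between 0 and 1.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.5):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class, with 0 < p < 1.
    y: true label, 1 for positive (e.g., weed) or 0 for negative.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-dependent weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With `gamma=0` the expression reduces to class-weighted cross-entropy. Larger values of `gamma` shrink the loss of well-classified samples, so the abundant, easy background pixels contribute less to training than the rare, hard weed pixels.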

6. Evaluation Metrics

It is important to note that in the context of classification with imbalanced data, traditional performance measures such as accuracy can be very misleading. To illustrate, consider a test set with 1000 samples, where 950 are healthy samples and 50 are infected samples. If a model predicts all samples in the test set as healthy and fails to correctly predict any of the infected samples, the traditional accuracy metric will show a performance of 95%. At first glance, this number seems very good. However, this high accuracy is misleading because the model lacks the ability to effectively detect minority class samples. Therefore, the model cannot be practically used.
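The example above can be reproduced directly. The short sketch below builds the 1000-sample test set, applies the model that predicts every sample as healthy, and compares overall accuracy with the recall on the infected class; the variable names are chosen for illustration.

```python
# Test set from the example: 950 healthy and 50 infected samples.
y_true = ["healthy"] * 950 + ["infected"] * 50
# A degenerate model that predicts every sample as healthy.
y_pred = ["healthy"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
infected_found = sum(t == p == "infected" for t, p in zip(y_true, y_pred))
recall_infected = infected_found / y_true.count("infected")

print(f"accuracy = {accuracy:.2f}, recall on infected = {recall_infected:.2f}")
```

The accuracy is 0.95 while the recall on the infected class is 0.00, which makes the gap between the headline number and the model's practical value explicit.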

Here, we review several performance metrics that are appropriate for evaluating model performance on imbalanced classification problems, such as F-score, Matthews Correlation Coefficient (MCC), Sensitivity/Recall, Specificity, and Precision. These metrics provide a more comprehensive understanding of the model’s performance, especially its ability to detect and classify minority class samples.

6.1. Confusion Matrix

A confusion matrix is a tool of visualization and analysis that is often used to measure the performance of classification models. This matrix has four main components, as illustrated in Figure 5.

TP = True Positive (# of positive samples correctly classified as the positive class).

FP = False Positive (# of negative samples incorrectly classified as the positive class).

TN = True Negative (# of negative samples correctly classified as the negative class).

FN = False Negative (# of positive samples incorrectly classified as the negative class).
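A minimal sketch of how the four counts can be obtained from true and predicted labels is given below. It assumes a binary problem; the function name `confusion_counts` and the `positive` argument are illustrative. Under the standard convention used here, a false positive is a negative sample predicted as positive, and a false negative is a positive sample predicted as negative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, TN, FN) for a binary classification problem."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1   # positive sample predicted as positive
            else:
                fp += 1   # negative sample predicted as positive
        else:
            if t == positive:
                fn += 1   # positive sample predicted as negative
            else:
                tn += 1   # negative sample predicted as negative
    return tp, fp, tn, fn
```

For example, with true labels `[1, 1, 1, 0, 0, 0, 0, 0]` and predictions `[1, 1, 0, 1, 0, 0, 0, 0]`, the function returns two true positives, one false positive, four true negatives, and one false negative.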

6.2. Matthews Correlation Coefficient (MCC)

The Matthews Correlation Coefficient (MCC) was originally developed by Matthews in 1975 to compare chemical structures [186]. In general, the MCC score is used to summarize a confusion matrix in a single value. In essence, the MCC calculates the correlation between true and predicted values, with the consequence that the higher the correlation, the higher the MCC score. The MCC score is defined as:

\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP) \cdot (TP + FN) \cdot (TN + FP) \cdot (TN + FN)}}

(1)

The main advantage of the MCC over other performance metrics is its ability to provide a balanced view of model performance in conditions of class imbalance. The MCC can be considered one of the most informative and reliable metrics for binary classification, especially when the class distribution is imbalanced. MCC is effective when a comprehensive understanding of model performance is required, particularly in scenarios with extreme class imbalance.
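Equation (1) translates directly into code. The sketch below is a minimal illustration in which the function name and the example counts are hypothetical. When any factor under the square root is zero the score is undefined, and returning 0 in that case is a common convention that is assumed here.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention: treat the undefined case as 0
    return (tp * tn - fp * fn) / denom


# A model that predicts every one of 950 healthy and 50 infected samples as healthy.
all_healthy = mcc(tp=0, fp=0, tn=950, fn=50)
# A perfect model on the same test set.
perfect = mcc(tp=50, fp=0, tn=950, fn=0)
```

The all-healthy model, whose accuracy is 95%, receives an MCC of 0, the same as an uninformative predictor, whereas the perfect model receives an MCC of 1.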

6.3. Precision and Recall

The concepts of precision and recall were first defined by Allen Kent and his colleagues in 1955 [187]. Precision is the proportion of actual positives among all instances labelled by the model as positive, while recall, also known as sensitivity, indicates the proportion of actual positives that have been correctly identified. Although recall and sensitivity are equivalent terms, they are used in different contexts. Sensitivity is typically used in clinical settings, while recall is more common in information retrieval scenarios. In the case of imbalanced classification problems, it is often more crucial to improve the recall rate rather than precision. Precision and recall are derived from the confusion matrix. Precision is calculated using Equation (2), while recall is the same as sensitivity, explained in the next subsection.

\text{Precision} = \frac{TP}{TP + FP}

(2)

6.4. Sensitivity/Recall and Specificity

In classification, the ability of ML models to identify one crucial class is often considered more important than their ability to identify the other classes. For example, in medical diagnosis, where the ability of a model to detect diseases accurately is deemed more important than detecting healthy patients, metrics such as sensitivity and specificity are used. The concepts of sensitivity and specificity were first introduced by Jacob Yerushalmy in 1947 as measurement metrics for diagnostic tests [188]. Sensitivity, often also referred to as the True Positive Rate (TPR), measures the proportion of positives that are correctly identified by the model. In other words, sensitivity describes the model’s ability to accurately identify positive cases out of all actual positive cases. For instance, in the context of disease testing, a high sensitivity indicates that most patients who indeed have the disease are successfully detected by the test. On the other hand, specificity, or True Negative Rate (TNR), measures the proportion of negatives that are correctly identified. The formulas for sensitivity and specificity are given by Equations (3) and (4), respectively:

\text{Sensitivity/Recall} = \frac{TP}{TP + FN}

(3)

\text{Specificity} = \frac{TN}{TN + FP}

(4)
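Equations (2) to (4) can be evaluated together on one set of confusion-matrix counts. In the sketch below the counts are hypothetical and describe a detector applied to 50 infected and 950 healthy samples; returning 0 when a denominator is zero is an assumed convention.

```python
def precision(tp, fp):
    """Share of predicted positives that are actually positive, Equation (2)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def sensitivity(tp, fn):
    """Share of actual positives that are detected (recall), Equation (3)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn, fp):
    """Share of actual negatives that are correctly rejected, Equation (4)."""
    return tn / (tn + fp) if (tn + fp) else 0.0


# Hypothetical counts: 40 of 50 infected samples found, 30 healthy samples flagged.
tp, fp, tn, fn = 40, 30, 920, 10
scores = (precision(tp, fp), sensitivity(tp, fn), specificity(tn, fp))
```

Here sensitivity is 0.80 and specificity is about 0.97, yet precision is only about 0.57, because the 30 false alarms are large relative to the 40 detections even though they are a small fraction of the 950 healthy samples.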

6.5. F-Score

The Fβ-score was initially proposed by Van Rijsbergen for information retrieval [189]. By definition, the Fβ-score is a weighted harmonic mean of precision and recall, with β controlling their relative importance. The mathematical formula can be seen in Equation (5):

F_{\beta}\text{-score} = (1 + \beta^{2}) \frac{\text{Precision} \cdot \text{Recall}}{(\beta^{2} \cdot \text{Precision}) + \text{Recall}}

(5)

where β is commonly set to 1, 2, or 0.5. We use β = 1 if recall and precision are equally important; β = 0.5 gives precision twice the weight of recall and is used if precision is more important than recall. Conversely, if recall is more important than precision, β = 2 is used. In disease diagnosis, a higher recall is generally more important than precision. In those cases, a system that predicts someone has cancer when they do not (FP) is more acceptable than the opposite, i.e., the system predicts someone is healthy when they actually have cancer (FN). In this case, it may be harmful if the system has high precision but low recall. The maximum Fβ-score value is 1. High F-scores indicate that the model has good precision and recall. The F1-score is recommended for use when precision and recall need to be balanced, particularly in situations where the cost of both types of errors is considered equal.
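The effect of β can be checked numerically. The sketch below uses the standard definition of the Fβ-score, with a leading factor of (1 + β²), and evaluates it for a hypothetical model whose recall (0.8) is higher than its precision (0.5); the function name and the example values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


# Hypothetical model with precision 0.5 and recall 0.8.
f_half = f_beta(0.5, 0.8, beta=0.5)  # emphasizes precision
f_one = f_beta(0.5, 0.8, beta=1.0)   # balances precision and recall
f_two = f_beta(0.5, 0.8, beta=2.0)   # emphasizes recall
```

Because this model is stronger on recall than on precision, the score rises from about 0.54 at β = 0.5 to about 0.62 at β = 1 and about 0.71 at β = 2.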

6.6. Geometric Mean (G-Mean)

G-mean, defined in Equation (6), is a metric that gauges the balance of classification performance between majority and minority classes and penalizes inequalities. Because it is the geometric mean of sensitivity and specificity, a low G-mean indicates poor performance on at least one class, even if the other class is accurately classified [190]. G-mean is particularly useful when the goal is to maintain a balance between sensitivity and specificity.

\text{G-Mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}}

(6)
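A short sketch shows how Equation (6) penalizes a model that neglects one class. The function name and the example counts are hypothetical, and a rate with a zero denominator is treated as 0 by assumption.

```python
import math


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity and specificity, Equation (6)."""
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sens * spec)


# A model that labels all 950 healthy and 50 infected samples as healthy.
g_all_healthy = g_mean(tp=0, fp=0, tn=950, fn=50)
# A model that finds 40 of the 50 infected samples at the cost of 30 false alarms.
g_detector = g_mean(tp=40, fp=30, tn=920, fn=10)
```

The first model scores 0 because its sensitivity is 0, regardless of its perfect specificity, while the second scores about 0.88.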

6.7. Balanced Accuracy

Balanced accuracy is the average of sensitivity and specificity, reflecting the overall accuracy achieved for both minority and majority classes. When a classifier performs equally well on both classes, this measure is equivalent to traditional accuracy. However, if the classifier achieves a high traditional accuracy as the result of focusing on majority classes, the balanced accuracy will produce a lower score than the traditional accuracy [190]. The balanced accuracy is defined as:

\text{Balanced Accuracy} = \frac{1}{2} \times (\text{Sensitivity} + \text{Specificity})

(7)

As can be seen in Equation (7), the balanced accuracy accounts for performance on both classes, helping to mitigate the bias introduced by imbalanced datasets.
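As a worked example, consider a test set of 950 healthy and 50 infected samples and a model that predicts every sample as healthy. Taking the infected class as positive gives TP = 0, FN = 50, TN = 950, and FP = 0, so that:

```latex
\text{Sensitivity} = \frac{0}{0 + 50} = 0, \qquad
\text{Specificity} = \frac{950}{950 + 0} = 1

\text{Balanced Accuracy} = \frac{1}{2} \times (0 + 1) = 0.5, \qquad
\text{Accuracy} = \frac{0 + 950}{1000} = 0.95
```

The balanced accuracy of 0.5 is what a classifier with no discriminative ability would obtain, whereas the traditional accuracy of 0.95 hides the complete failure on the minority class.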

6.8. Cost-Sensitive Metrics

Cost-sensitive metrics can be designed to account for the costs associated with classification errors, especially in the context of imbalanced data where errors in the minority class may have a greater impact. Some commonly used cost-sensitive metrics include the weighted F1-score [191], cost matrix (confusion matrix with cost), and cost-sensitive accuracy [192].
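A cost matrix can be combined with the confusion matrix to obtain an average misclassification cost per sample. The sketch below is one simple way to do this and is not the specific formulation of [191] or [192]; the function name, the counts, and the cost values (a missed infection assumed to be ten times as costly as a false alarm) are hypothetical.

```python
def average_cost(counts, costs):
    """Average cost per sample, given confusion counts and a cost per outcome."""
    n = sum(counts.values())
    return sum(counts[k] * costs[k] for k in counts) / n


# Assumed costs: correct decisions are free, a false alarm costs 1,
# and a missed infection costs 10.
costs = {"tp": 0.0, "fp": 1.0, "tn": 0.0, "fn": 10.0}

# Model A predicts everything as healthy (accuracy 0.95).
model_a = {"tp": 0, "fp": 0, "tn": 950, "fn": 50}
# Model B finds 45 of 50 infections but raises 80 false alarms (accuracy 0.915).
model_b = {"tp": 45, "fp": 80, "tn": 870, "fn": 5}

cost_a = average_cost(model_a, costs)
cost_b = average_cost(model_b, costs)
```

Under these assumed costs, Model B has the lower accuracy but an average cost of 0.13 per sample compared with 0.50 for Model A, so a cost-sensitive view reverses the ranking suggested by accuracy.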

7. Challenges and Future Directions

After conducting a comprehensive literature review on various approaches to addressing data imbalance issues, this section explores the existing challenges and future perspectives in tackling these problems, particularly in the agricultural sector. Future research directions are expected to provide more effective and efficient solutions to address these issues and enhance the performance of ML models in agricultural applications.

One of the main challenges is obtaining high-quality data, especially for early disease detection. The differences between healthy and infected plant samples at the early stages of diseases are often very small or ambiguous, complicating accurate data separation and potentially leading to bias in the model. Incomplete data, which can arise from factors such as inaccurate sensors, fluctuating weather conditions, or human errors during data collection, further exacerbates this issue by reducing model accuracy and hindering generalization.

Techniques like SMOTE are used to generate new data around the decision boundaries between classes, but this approach risks creating overlapping data or misclassifying data into other classes, which can reduce overall model performance. Furthermore, accurate and representative data annotation requires significant cost, time, and domain expertise. The variability in plants, growth conditions, and environmental factors adds complexity to this process. Efforts to address data imbalance using resampling techniques from inaccurate reference data can introduce new biases, rendering the model’s results invalid or inconsistent with real-world conditions. Therefore, it is crucial to ensure that the data used in resampling are truly representative and relevant to the problem at hand.

Generating new data through data augmentation techniques, such as GANs, is an advanced approach for addressing data imbalance. However, this approach requires substantial computational resources and time, especially when dealing with hyperspectral data. Unlike conventional images, hyperspectral images have higher dimensions and complexity, requiring significantly greater computational power to process and generate accurate synthetic data. Moreover, synthetic data produced by GANs do not always improve model accuracy. The generated data can be too random or distant from the decision boundary. When this occurs, the synthetic data may not effectively enhance the model’s learning or increase accurate predictions, as it does not provide or add meaningful information to improve the model’s decision-making process.

Lastly, in agricultural applications, ML and DL models often need to be explainable to support field decision making. Farmers and land managers require a clear understanding of how models generate recommendations or decisions, such as those related to plant disease detection or optimal fertilization. However, DL models tend to have complex structures that are often difficult to interpret, leading to low trust in their results. This presents a significant challenge, especially when model predictions need to be adopted by farmers who may have limited understanding of the technology.

The challenges outlined above highlight different aspects that need to be addressed in future research when addressing data imbalance issues. In light of these challenges, several future research directions can be explored to improve the performance and reliability of ML models.

First, future research should focus on optimizing data augmentation techniques, especially those involving GANs. More efficient GAN architectures that can handle complex data with fewer computational resources while still producing accurate and relevant synthetic data would be beneficial. In addition, improving the quality of synthetic data should also be a focus. Methods need to be developed to assess and ensure that the generated synthetic data lie close to the decision boundaries of the model.
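One simple way to make such an assessment concrete is to measure how far each synthetic sample lies from a classifier's decision boundary and keep only those within a chosen distance. The sketch below assumes a linear boundary of the form w·x + b = 0. The function names, the boundary coefficients, the sample coordinates, and the threshold are all hypothetical and chosen only for illustration; the survey does not prescribe this particular check.

```python
import math


def distance_to_boundary(x, w, b):
    """Perpendicular distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def keep_near_boundary(samples, w, b, max_distance):
    """Keep only synthetic samples lying within max_distance of the boundary."""
    return [s for s in samples if distance_to_boundary(s, w, b) <= max_distance]


# Hypothetical linear boundary x1 + x2 - 4 = 0 and a few synthetic samples.
w, b = (1.0, 1.0), -4.0
synthetic = [(2.0, 2.0), (1.5, 2.0), (0.0, 0.0), (5.0, 5.0)]
kept = keep_near_boundary(synthetic, w, b, max_distance=0.5)
print(kept)  # [(2.0, 2.0), (1.5, 2.0)]
```

For a non-linear model, the same filtering idea could use a different proximity measure, such as how close the predicted class probability is to 0.5. That substitution is likewise an assumption rather than a method taken from the reviewed studies.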

Second, future research should place greater emphasis on the interpretability of DL models: while DL models often achieve high performance, their “black-box” nature makes it difficult to understand how they make decisions. This lack of transparency presents several challenges, particularly in high-stakes fields like agriculture. In practical applications, stakeholders (farmers, agronomists, etc.) need to trust the technology. If a model trained on imbalanced data consistently misclassifies rare plant diseases or crop conditions, good interpretability would allow experts to understand the rationale behind these errors. This transparency helps to adjust the model or training process to improve performance, making the model more reliable and applicable in real-world scenarios.

8. Conclusions

In this survey, we have focused on resampling approaches that manipulate datasets to address the issue of data imbalance, rather than concentrating on algorithm-level approaches. These approaches provide a new perspective compared to other surveys that primarily discuss general topics in classification within the agricultural domain. We specifically highlight the class imbalance problem that often occurs in agricultural data classification and delve deeply into the effectiveness of resampling methods—both conventional and deep generative models like GANs that can generate more complex and realistic synthetic data.

However, there are some limitations in this survey that need to be considered.

Scope of the literature review: The review focuses on papers published within the last few years and indexed in Scopus, which means it may not cover many older publications that may still provide valuable insights.
Focus on specific techniques: While the review discusses a range of techniques to address class imbalance in agricultural applications, it does not exhaustively compare all of the methods discussed nor does it explore other potentially effective methods, such as active learning, which has gained traction in other applications.
Data limitations: A significant limitation noted in the reviewed studies is the lack of publicly available datasets. Many of the datasets used in the referenced papers are proprietary or have restricted access, which may limit reproducibility. The lack of standardized public datasets can hinder efforts to compare and validate research findings consistently. Therefore, we recommend that future studies use widely adopted public datasets as benchmarks. Additionally, journals may encourage data sharing by offering free open-access publication to researchers who make their datasets publicly available. Collaborative efforts for dataset standardization would add significant value by promoting consistency and improving the usability of shared data.

Author Contributions

Conceptualization, T.M., H.M.S., B.G. and H.Y.; methodology, T.M.; validation, T.M., H.M.S. and H.Y.; investigation, T.M. and H.M.S.; writing—original draft preparation, T.M., H.M.S. and H.Y.; writing—review and editing, T.M., H.M.S., B.G. and H.Y.; visualization, T.M.; supervision, B.G. and H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sharing is not applicable.

Acknowledgments

Tajul Miftahushudur would like to acknowledge the scholarship provided by the Indonesian Endowment Fund for Education (LPDP). Halil Mertkan Sahin would also like to acknowledge the scholarship provided by the Ministry of National Education of the Republic of Türkiye.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Gebbers, R.; Adamchuk, V.I. Precision agriculture and food security. Science 2010, 327, 828–831. [Google Scholar] [CrossRef]
Reynolds, M.; Pask, A.; Mullan, D. Physiological Breeding I: Interdisciplinary Approaches to Improve Crop Production; CIMMYT: Texcoco, Mexico, 2012. [Google Scholar]
Wolfert, S.; Ge, L.; Verdouw, C.; Bogaardt, M.J. Big data in smart farming—A review. Agric. Syst. 2017, 153, 69–80. [Google Scholar] [CrossRef]
Zhang, C.; Kovacs, J.M. The application of small unmanned aerial systems for precision agriculture: A review. Precis. Agric. 2012, 13, 693–712. [Google Scholar] [CrossRef]
Kashyap, B.; Kumar, R. Sensing Methodologies in Agriculture for Soil Moisture and Nutrient Monitoring. IEEE Access 2021, 9, 14095–14121. [Google Scholar] [CrossRef]
Jones, H.G. Irrigation scheduling: Advantages and pitfalls of plant-based methods. J. Exp. Bot. 2004, 55, 2427–2436. [Google Scholar] [CrossRef] [PubMed]
Nair, N.; Akshaya, A.A.; Joseph, J. An in-situ soil pH sensor with solid electrodes. IEEE Sens. Lett. 2022, 6, 2000104. [Google Scholar] [CrossRef]
Postolache, S.; Sebastião, P.; Viegas, V.; Postolache, O.; Cercas, F. IoT-based systems for soil nutrients assessment in horticulture. Sensors 2023, 23, 403. [Google Scholar] [CrossRef]
Campbell, G.S.; Anderson, R.Y. An Introduction to Environmental Biophysics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
Thenkabail, P.S.; Lyon, J.G.; Huete, A. Hyperspectral Remote Sensing of Vegetation; CRC Press: Boca Raton, FL, USA, 2012. [Google Scholar]
Zhang, M.; Qin, Z.; Liu, X.; Ustin, S.L. Detection of stress in tomatoes induced by late blight disease in California, USA, using hyperspectral remote sensing. Int. J. Appl. Earth Obs. Geoinf. 2003, 4, 295–310. [Google Scholar] [CrossRef]
Ehlert, D.; Horn, H.J.; Adamek, R. Measuring crop biomass density by laser triangulation. Comput. Electron. Agric. 2010, 74, 111–118. [Google Scholar] [CrossRef]
Moratiel, R.; Martínez-Cob, A.; Latorre, B. Variation in the estimation of ETo and crop water use due to meteorological data quality. Agric. Water Manag. 2011, 98, 1442–1451. [Google Scholar]
Kim, Y.; Evans, R.G.; Iversen, W.M. Remote sensing and control of an irrigation system using a distributed wireless sensor network. IEEE Trans. Instrum. Meas. 2008, 57, 1379–1387. [Google Scholar]
Doraiswamy, P.C.; Hatfield, J.L.; Jackson, T.J.; Akhmedov, B.; Prueger, J.; Stern, A. Crop condition and yield simulations using Landsat and MODIS. Remote Sens. Environ. 2004, 92, 548–559. [Google Scholar] [CrossRef]
Hammer, G.L.; Sinclair, T.R.; Boote, K.J.; Wright, G.C.; Meinke, H. A peanut simulation model: I. Model development and testing. Agron. J. 1995, 87, 1085–1093. [Google Scholar] [CrossRef]
Evett, S.R.; Tolk, J.A.; Howell, T.A. A depth control stand for improved accuracy with the neutron probe. Vadose Zone J. 2003, 2, 642–649. [Google Scholar] [CrossRef]
Sadler, E.J.; Evans, R.G.; Stone, K.C.; Camp, C.R. Opportunities for conservation with precision irrigation. J. Soil Water Conserv. 2005, 60, 371–378. [Google Scholar]
Lobell, D.B.; Schlenker, W.; Costa-Roberts, J. Climate trends and global crop production since 1980. Science 2011, 333, 616–620. [Google Scholar] [CrossRef] [PubMed]
White, J.W.; Hoogenboom, G.; Kimball, B.A.; Wall, G.W. Methodologies for simulating impacts of climate change on crop production. Field Crop. Res. 2011, 124, 357–368. [Google Scholar] [CrossRef]
Segarra, J.; Buchaillot, M.L.; Araus, J.L.; Kefauver, S.C. Remote sensing for precision agriculture: Sentinel-2 improved features and applications. Agronomy 2020, 10, 641. [Google Scholar] [CrossRef]
Yang, C. High resolution satellite imaging sensors for precision agriculture. Front. Agric. Sci. Eng. 2018, 5, 393–405. [Google Scholar] [CrossRef]
Messina, G.; Modica, G. Applications of UAV thermal imagery in precision agriculture: State of the art and future research outlook. Remote Sens. 2020, 12, 1491. [Google Scholar] [CrossRef]
Shahi, T.B.; Xu, C.Y.; Neupane, A.; Guo, W. Machine learning methods for precision agriculture with UAV imagery: A review. Electron. Res. Arch. 2022, 30, 4277–4317. [Google Scholar] [CrossRef]
Mishra, P.; Asaari, M.S.M.; Herrero-Langreo, A.; Lohumi, S.; Diezma, B.; Scheunders, P. Close range hyperspectral imaging of plants: A review. Biosyst. Eng. 2017, 153, 41–60. [Google Scholar] [CrossRef]
Mulla, D.J. Twenty-five years of remote sensing in precision agriculture: Key advances and remaining knowledge gaps. Biosyst. Eng. 2013, 114, 358–371. [Google Scholar] [CrossRef]
Mahlein, A.K. Plant disease detection by imaging sensors—Parallels and specific demands for precision agriculture and plant phenotyping. Plant Dis. 2016, 100, 241–251. [Google Scholar] [CrossRef] [PubMed]
Sankaran, S.; Mishra, A.; Ehsani, R.; Davis, C. A review of advanced techniques for detecting plant diseases. Comput. Electron. Agric. 2010, 72, 1–13. [Google Scholar] [CrossRef]
McBratney, A.; Whelan, B.; Ancev, T.; Bouma, J. Future directions of precision agriculture. Precis. Agric. 2005, 6, 7–23. [Google Scholar] [CrossRef]
Thenkabail, P.S.; Lyon, J.G.; Huete, A. Hyperspectral Indices and Image Classifications for Agriculture and Vegetation, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2019. [Google Scholar]
Li, C.; Xue, J.; Su, B. Significant remote sensing vegetation indices: A review of developments and applications. J. Sens. 2017, 2017, 1353691. [Google Scholar] [CrossRef]
Nascimento, S.M.; Amano, K.; Foster, D.H. Spatial distributions of local illumination color in natural scenes. Vis. Res. 2016, 120, 39–44. [Google Scholar] [CrossRef]
Prudnikova, E.; Savin, I.; Vindeker, G.; Grubina, P.; Shishkonakova, E.; Sharychev, D. Influence of soil background on spectral reflectance of winter wheat crop canopy. Remote Sens. 2019, 11, 1932. [Google Scholar] [CrossRef]
Moravec, D.; Komárek, J.; López-Cuervo Medina, S.; Molina, I. Effect of atmospheric corrections on NDVI: Intercomparability of Landsat 8, Sentinel-2, and UAV sensors. Remote Sens. 2021, 13, 3550. [Google Scholar] [CrossRef]
Stamford, J.D.; Vialet-Chabrand, S.; Cameron, I.; Lawson, T. Development of an accurate low cost NDVI imaging system for assessing plant health. Plant Methods 2023, 19, 9. [Google Scholar] [CrossRef]
Yang, X.; Zuo, X.; Xie, W.; Li, Y.; Guo, S.; Zhang, H. A correction method of NDVI topographic shadow effect for rugged terrain. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8456–8472. [Google Scholar] [CrossRef]
Guo, Y.; Wang, C.; Lei, S.; Yang, J.; Zhao, Y. A framework of spatio-temporal fusion algorithm selection for Landsat NDVI time series construction. ISPRS Int. J. Geo-Inf. 2020, 9, 665. [Google Scholar] [CrossRef]
Dougherty, T.R.; Jain, R.K. Invisible walls: Exploration of microclimate effects on building energy consumption in New York City. Sustain. Cities Soc. 2023, 90, 104364. [Google Scholar] [CrossRef]
AlSuwaidi, A.; Veys, C.; Hussey, M.; Grieve, B.; Yin, H. Hyperspectral selection based algorithm for plant classification. In Proceedings of the 2016 IEEE International Conference on Imaging Systems and Techniques (IST), Chania, Greece, 4–6 October 2016; pp. 395–400. [Google Scholar] [CrossRef]
Ramanath, A.; Muthusrinivasan, S.; Xie, Y.; Shekhar, S.; Ramachandra, B. NDVI versus CNN features in deep learning for land cover clasification of aerial images. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 6483–6486. [Google Scholar] [CrossRef]
Zaitunah, A.; Samsuri; Marbun, Y.M.H.; Susilowati, A.; Elfiati, D.; Syahputra, O.K.H.; Arinah, H.; Rangkuti, A.B.; Rambey, R.; Harahap, M.M.; et al. Vegetation density analysis using normalized difference vegetation index in East Jakarta, Indonesia. IOP Conf. Ser. Earth Environ. Sci. 2021, 912, 012053. [Google Scholar] [CrossRef]
Franke, J.; Heinzel, V.; Menz, G. Assessment of NDVI- differences caused by sensor specific relative spectral response functions. In Proceedings of the 2006 IEEE International Symposium on Geoscience and Remote Sensing, Denver, CO, USA, 31 July–4 August 2006; pp. 1138–1141. [Google Scholar] [CrossRef]
Gong, C.; Yin, R.; Long, T.; Jiao, W.; He, G.; Wang, G. Spatial–temporal approach and dataset for enhancing cloud detection in Sentinel-2 imagery: A case study in China. Remote Sens. 2024, 16, 973. [Google Scholar] [CrossRef]
Revel, C.; Deville, Y.; Achard, V.; Briottet, X.; Weber, C. Inertia-constrained pixel-by-pixel nonnegative matrix factorisation: A hyperspectral unmixing method dealing with intra-class variability. Remote Sens. 2018, 10, 1706. [Google Scholar] [CrossRef]
Alsuwaidi, A.; Veys, C.; Hussey, M.; Grieve, B.; Yin, H. Hyperspectral feature selection ensemble for plant classification. In Proceedings of the Hyperspectral Imaging and Applications (HSI 2016), Coventry, UK, 12–13 October 2016. [Google Scholar]
Miftahushudur, T.; Grieve, B.; Yin, H. Ensemble synthetic oversampling with manhattan distance for unbalanced hyperspectral data. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2021, Manchester, UK, 25–27 November 2021; Lecture Notes in Computer Science (LNCS). Volume 13113. [Google Scholar] [CrossRef]
Miftahushudur, T.; Sahin, H.M.; Grieve, B.; Yin, H. Enhanced SVM-SMOTE with cluster consistency for imbalanced data classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2023, Évora, Portugal, 22–24 November 2023; Quaresma, P., Camacho, D., Yin, H., Gonçalves, T., Julian, V., Tallón-Ballesteros, A.J., Eds.; Springer: Cham, Switzerland, 2023; pp. 431–441. [Google Scholar]
Miftahushudur, T.; Grieve, B.; Yin, H. Permuted KPCA and SMOTE to guide GAN-based oversampling for imbalanced HSI classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 489–505. [Google Scholar] [CrossRef]
Peng, Y.; Dallas, M.M.; Ascencio-IbÃ¡Ã±ez, J.T.; Hoyer, J.S.; Legg, J.; Hanley-Bowdoin, L.; Grieve, B.; Yin, H. Early detection of plant virus infection using multispectral imaging and spatialâ€“spectral machine learning. Sci. Rep. 2022, 12, 3113. [Google Scholar] [CrossRef]
Sahin, H.M.; Grieve, B.; Yin, H. Automatic multispectral image classification of plant virus from leaf samples. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2020, Guimaraes, Portugal, 4–6 November 2020; Analide, C., Novais, P., Camacho, D., Yin, H., Eds.; Springer: Cham, Switzerland, 2020; pp. 374–384. [Google Scholar]
Sahin, H.M.; Grieve, B.; Yin, H. Combining of Markov random field and convolutional neural networks for hyper/multispectral image classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2023, Évora, Portugal, 22–24 November 2023; Quaresma, P., Camacho, D., Yin, H., Gonçalves, T., Julian, V., Tallón-Ballesteros, A.J., Eds.; Springer: Cham, Switzerland, 2023; pp. 28–38. [Google Scholar]
Sahin, H.M.; Miftahushudur, T.; Grieve, B.; Yin, H. Segmentation of weeds and crops using multispectral imaging and CRF-enhanced U-Net. Comput. Electron. Agric. 2023, 211, 107956. [Google Scholar] [CrossRef]
Isinkaye, F.O.; Olusanya, M.O.; Singh, P.K. Deep learning and content-based filtering techniques for improving plant disease identification and treatment recommendations: A comprehensive review. Heliyon 2024, 10, e29583. [Google Scholar] [CrossRef] [PubMed]
Kong, Y.L.; Huang, Q.; Wang, C.; Chen, J.; Chen, J.; He, D. Long short-term memory neural networks for online disturbance detection in satellite image time series. Remote Sens. 2018, 10, 452. [Google Scholar] [CrossRef]
Leevy, J.L.; Khoshgoftaar, T.M.; Bauder, R.A.; Seliya, N. A survey on addressing high-class imbalance in big data. J. Big Data 2018, 5, 42. [Google Scholar] [CrossRef]
Ojo, M.O.; Zahid, A. Improving deep learning classifiers performance via preprocessing and class imbalance approaches in a plant disease detection pipeline. Agronomy 2023, 13, 887. [Google Scholar] [CrossRef]
Walsh, R.; Tardy, M. A Comparison of techniques for class imbalance in deep learning classification of breast cancer. Diagnostics 2023, 13, 67. [Google Scholar] [CrossRef] [PubMed]
Cheah, P.C.Y.; Yang, Y.; Lee, B.G. Enhancing financial fraud detection through addressing class imbalance using hybrid SMOTE-GAN techniques. Int. J. Financ. Stud. 2023, 11, 110. [Google Scholar] [CrossRef]
Xiang, Y.; Yao, J.; Yang, Y.; Yao, K.; Wu, C.; Yue, X.; Li, Z.; Ma, M.; Zhang, J.; Gong, G. Real-time detection algorithm for Kiwifruit canker based on a lightweight and efficient generative adversarial network. Plants 2023, 12, 3053. [Google Scholar] [CrossRef]
Pesaresi, S.; Mancini, A.; Casavecchia, S. Recognition and characterization of forest plant communities through remote-sensing NDVI time series. Diversity 2020, 12, 313. [Google Scholar] [CrossRef]
Mahakosee, S.; Jogloy, S.; Vorasoot, N.; Theerakulpisut, P.; Holbrook, C.C.; Kvien, C.K.; Banterng, P. Seasonal variation in canopy size, light penetration and photosynthesis of three cassava genotypes with different canopy Architectures. Agronomy 2020, 10, 1554. [Google Scholar] [CrossRef]
He, J.; Cheng, M.X. Weighting methods for rare event identification from imbalanced datasets. Front. Big Data 2021, 4, 715320. [Google Scholar] [CrossRef]
Singh, V.; Sharma, N.; Singh, S. A review of imaging techniques for plant disease detection. Artif. Intell. Agric. 2020, 4, 229–242. [Google Scholar] [CrossRef]
Brancalion, P.H.; Meli, P.; Tymus, J.R.; Lenti, F.E.; Benini, R.M.; Silva, A.P.M.; Isernhagen, I.; Holl, K.D. What makes ecosystem restoration expensive? A systematic cost assessment of projects in Brazil. Biol. Conserv. 2019, 240, 108274. [Google Scholar] [CrossRef]
Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27. [Google Scholar] [CrossRef]
Kosolwattana, T.; Liu, C.; Hu, R.; Han, S.; Chen, H.; Lin, Y. A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare. BioData Min. 2023, 16, 15. [Google Scholar] [CrossRef] [PubMed]
Hou, C.; Zhuang, J.; Tang, Y.; He, Y.; Miao, A.; Huang, H.; Luo, S. Recognition of early blight and late blight diseases on potato leaves based on graph cut segmentation. J. Agric. Food Res. 2021, 5, 100154. [Google Scholar] [CrossRef]
Yap, B.W.; Rani, K.A.; Abd Rahman, H.A.; Fong, S.; Khairudin, Z.; Abdullah, N.N. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), Kuala Lumpur, Malaysia, 16–18 December 2013. Lecture Notes in Electrical Engineering. [Google Scholar] [CrossRef]
Azevedo, B.F.; Rocha, A.M.A.; Pereira, A.I. Hybrid approaches to optimization and machine learning methods: A systematic literature review. Mach. Learn. 2024, 113, 4055–4097. [Google Scholar] [CrossRef]
Wang, S.; Yao, X. Multiclass imbalance problems: Analysis and potential solutions. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2012, 42, 1119–1130. [Google Scholar] [CrossRef] [PubMed]
Chen, H.; Wei, J.; Huang, H.; Yuan, Y.; Wang, J. Review of imbalanced fault diagnosis technology based on generative adversarial networks. J. Comput. Des. Eng. 2024, 11, 99–124. [Google Scholar] [CrossRef]
Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449. [Google Scholar] [CrossRef]
Roh, Y.; Heo, G.; Whang, S.E. A survey on data collection for machine learning: A big data-AI integration perspective. IEEE Trans. Knowl. Data Eng. 2021, 33, 1328–1347. [Google Scholar] [CrossRef]
Taner, A.; Mengstu, M.T.; Selvi, K.Ç.; Duran, H.; Kabaş, Ö.; Gür, İ.; Karaköse, T.; Gheorghiță, N.-E. Multiclass apple varieties classification using machine learning with histogram of oriented gradient and color moments. Appl. Sci. 2023, 13, 7682. [Google Scholar] [CrossRef]
Taner, A.; Mengstu, M.T.; Selvi, K.Ç.; Duran, H.; Gür, İ.; Ungureanu, N. Apple varieties classification using deep features and machine learning. Agriculture 2024, 14, 252. [Google Scholar] [CrossRef]
Yu, F.; Lu, T.; Xue, C. Deep learning-based intelligent apple variety classification system and model interpretability analysis. Foods 2023, 12, 885. [Google Scholar] [CrossRef]
Hase, N.; Ito, S.; Kaneko, N.; Sumi, K. Data augmentation for intra-class imbalance with generative adversarial network. In Proceedings of the Fourteenth International Conference on Quality Control by Artificial Vision, Mulhouse, France, 15–17 May 2019; Cudel, C., Bazeille, S., Verrier, N., Eds.; International Society for Optics and Photonics (SPIE): San Diego, CA, USA, 2019; Volume 11172, p. 1117206. [Google Scholar] [CrossRef]
Ahmed, S.; Hasan, M.B.; Ahmed, T.; Sony, M.R.K.; Kabir, M.H. Less is more: Lighter and faster deep neural architecture for tomato leaf disease classification. IEEE Access 2022, 10, 68868–68884. [Google Scholar] [CrossRef]
Khare, O.; Mane, S.; Kulkarni, H.; Barve, N. LeafNST: An improved data augmentation method for classification of plant disease using object-based neural style transfer. Discov. Artif. Intell. 2024, 4, 50. [Google Scholar] [CrossRef]
Sauber-Cole, R.; Khoshgoftaar, T.M. The use of generative adversarial networks to alleviate class imbalance in tabular data: A survey. J. Big Data 2022, 9, 98. [Google Scholar] [CrossRef]
Temraz, M.; Kenny, E.M.; Ruelle, E.; Shalloo, L.; Smyth, B.; Keane, M.T. Handling climate change using counterfactuals: Using counterfactuals in data augmentation to predict crop growth in an uncertain climate future. In Case-Based Reasoning Research and Development, Proceedings of the 29th International Conference, ICCBR 2021, Salamanca, Spain, 13–16 September 2021; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12877. [Google Scholar] [CrossRef]
Mirzaei, A.; Bagheri, H.; Khosravi, I. Enhancing crop classification accuracy through synthetic SAR-optical data generation using deep learning. ISPRS Int. J. Geo-Inf. 2023, 12, 450. [Google Scholar] [CrossRef]
Sambasivam, G.; Opiyo, G.D. A predictive machine learning application in agriculture: Cassava disease detection and classification with imbalanced dataset using convolutional neural networks. Egypt. Inform. J. 2021, 22, 27–34. [Google Scholar] [CrossRef]
Vaidya, H.; Prasad, K.; Rajashekhar, C.; Tripathi, D.; S, R.; Shetty, J.; Swamy, K.; Y, S. A class imbalance aware hybrid model for accurate rice variety classification. Int. J. Cogn. Comput. Eng. 2025, 6, 170–182. [Google Scholar] [CrossRef]
Prexawanprasut, T.; Banditwattanawong, T. Improving minority class recall through a novel cluster-based oversampling technique. Informatics 2024, 11, 35. [Google Scholar] [CrossRef]
Provost, F.; Fawcett, T. Robust classification for imprecise environments. Mach. Learn. 2001, 42, 203–231. [Google Scholar] [CrossRef]
Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432. [Google Scholar] [CrossRef]
Williams, C.K.I. The effect of class imbalance on precision-recall curves. Neural Comput. 2021, 33, 853–857. [Google Scholar] [CrossRef] [PubMed]
Zheng, W.; Jin, M. The effects of class imbalance and training data size on classifier learning: An empirical study. SN Comput. Sci. 2020, 1, 71. [Google Scholar] [CrossRef]
Elkan, C. The foundations of cost-sensitive learning. In Proceedings of the IJCAI’01—17th International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 4–10 August 2001; Volume 2, pp. 973–978. [Google Scholar]
Provost, F. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI’2000 Workshop on Imbalanced Data Sets, Austin, TX, USA, 31 July 2000. [Google Scholar]
Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258. [Google Scholar] [CrossRef]
Drummond, C.; Holte, R.C. C4. 5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the ICML’2003, Workshop on Learning from Imbalanced Data Sets II, Washington, DC, USA, 21 August 2003; Volume 11. [Google Scholar]
Estabrooks, A.; Jo, T.; Japkowicz, N. A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 2004, 20, 18–36. [Google Scholar] [CrossRef]
Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232. [Google Scholar] [CrossRef]
Mienye, I.D.; Sun, Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlocked 2021, 25, 100690. [Google Scholar] [CrossRef]
Thai-Nghe, N.; Gantner, Z.; Schmidt-Thieme, L. Cost-sensitive learning methods for imbalanced data. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–8. [Google Scholar] [CrossRef]
Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3573–3587. [Google Scholar] [CrossRef] [PubMed]
Zou, Q.; Xie, S.; Lin, Z.; Wu, M.; Ju, Y. Finding the best classification threshold in imbalanced classification. Big Data Res. 2016, 5, 2–8. [Google Scholar] [CrossRef]
He, H.; Ma, Y. Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley: Hoboken, NJ, USA, 2013. [Google Scholar] [CrossRef]
Giglioni, V.; García-Macías, E.; Venanzi, I.; Ierimonti, L.; Ubertini, F. The use of receiver operating characteristic curves and precision-versus-recall curves as performance metrics in unsupervised structural damage classification under changing environment. Eng. Struct. 2021, 246, 113029. [Google Scholar] [CrossRef]
Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems, ICICS 2020, Irbid, Jordan, 7–9 April 2020. [Google Scholar] [CrossRef]
Alkhawaldeh, I.M.; Albalkhi, I.; Naswhan, A.J. Challenges and limitations of synthetic minority oversampling techniques in machine learning. World J. Methodol. 2023, 13, 373–378. [Google Scholar] [CrossRef] [PubMed]
Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.L.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137. [Google Scholar] [CrossRef]
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357. [Google Scholar] [CrossRef]
He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328. [Google Scholar] [CrossRef]
Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing, Proceedings of the International Conference on Intelligent Computing (ICIC 2005), Hefei, China, 23–26 August 2005; Huang, D.S., Zhang, X.P., Huang, G.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887. [Google Scholar]
Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 13th Pacific-Asia Conference (PAKDD 2009), Bangkok, Thailand, 27–30 April 2009; Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 475–482. [Google Scholar]
Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20. [Google Scholar] [CrossRef]
Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigm. 2011, 3, 4–21. [Google Scholar] [CrossRef]
Wilson, D.L. Asymptotic Properties of Nearest Neighbour Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421. [Google Scholar] [CrossRef]
Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Artificial Intelligence in Medicine, Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe (AIME 2001), Cascais, Portugal, 1–4 July 2001; Quaglini, S., Barahona, P., Andreassen, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 63–66. [Google Scholar]
Two Modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 769–772. [CrossRef]
Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409–410, 17–26. [Google Scholar] [CrossRef]
Kraiem, M.S.; Sánchez-Hernández, F.; Moreno-García, M.N. Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties. An approach based on association models. Appl. Sci. 2021, 11, 8546. [Google Scholar] [CrossRef]
Choirunnisa, S.; Lianto, J. Hybrid method of undersampling and oversampling for handling imbalanced data. In Proceedings of the 2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Yogyakarta, Indonesia, 21–22 November 2018; pp. 276–280. [Google Scholar] [CrossRef]
Hosni, M.; Abnane, I.; Idri, A.; Carrillo de Gea, J.M.; Fernández Alemán, J.L. Reviewing ensemble classification methods in breast cancer. Comput. Methods Programs Biomed. 2019, 177, 89–112.
Sainin, M.S.; Alfred, R.; Ahmad, F. Ensemble meta classifier with sampling and feature selection for data with imbalance multiclass problem. J. Inf. Commun. Technol. 2021, 20, 103–133.
Kim, K. Noise Avoidance SMOTE in Ensemble Learning for Imbalanced Data. IEEE Access 2021, 9, 143250–143265.
Yotsawat, W.; Wattuya, P.; Srivihok, A. A novel method for credit scoring based on cost-sensitive neural network ensemble. IEEE Access 2021, 9, 78521–78537.
Wei, W.; Jiang, F.; Yu, X.; Du, J. An ensemble learning algorithm based on resampling and hybrid feature selection, with an application to software defect prediction. In Proceedings of the 2022 7th International Conference on Information and Network Technologies (ICINT), Okinawa, Japan, 21–23 May 2022; pp. 52–56.
Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 3, pp. 2672–2680.
Kingma, D.P.; Welling, M. Auto-encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
Sampath, V.; Maurtua, I.; Aguilar Martin, J.J.; Gutierrez, A. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J. Big Data 2021, 8, 27.
Blagus, R.; Lusa, L. Evaluation of SMOTE for high-dimensional class-imbalanced microarray data. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications, ICMLA 2012, Boca Raton, FL, USA, 12–15 December 2012.
Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784v1.
Ma, Y.; Liu, K.; Guan, Z.; Xu, X.; Qian, X.; Bao, H. Background augmentation generative adversarial networks (BAGANs): Effective data generation based on GAN-augmented 3D synthesizing. Symmetry 2018, 10, 734.
Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 2018, 91, 464–471.
Mariani, G.; Scheidegger, F.; Istrate, R.; Bekas, C.; Malossi, C. BAGAN: Data augmentation with balancing GAN. arXiv 2018, arXiv:1803.09655.
Frid-Adar, M.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved liver lesion classification. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018.
Pan, T.; Pedrycz, W.; Yang, J.; Wang, J. An improved generative adversarial network to oversample imbalanced datasets. Eng. Appl. Artif. Intell. 2024, 132, 107934.
Qin, Z.; Huang, F.; Pan, J.; Niu, J.; Qin, H. Improved generative adversarial network for bearing fault diagnosis with a small number of data and unbalanced data. Symmetry 2024, 16, 358.
Sharma, A.; Singh, P.K.; Chandra, R. SMOTified-GAN for class imbalanced pattern classification problems. IEEE Access 2022, 10, 30655–30665.
Qian, W.; Gechter, F. Variational information bottleneck model for accurate indoor position recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2020.
Stocksieker, S.; Pommeret, D.; Charpentier, A. Data Augmentation with Variational Autoencoder for Imbalanced Dataset. arXiv 2024, arXiv:2412.07039.
Chatterjee, S.; Maity, S.; Bhattacharjee, M.; Banerjee, S.; Das, A.K.; Ding, W. Variational autoencoder based imbalanced COVID-19 detection using chest X-ray images. New Gener. Comput. 2023, 41, 25–60.
Mostofi, F.; Behzat Tokdemir, O.; Toğan, V. Generating synthetic data with variational autoencoder to address class imbalance of graph attention network prediction model for construction management. Adv. Eng. Inform. 2024, 62, 102606.
Dai, W.; Ng, K.; Severson, K.; Huang, W.; Anderson, F.; Stultz, C. Generative oversampling with a contrastive variational autoencoder. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 101–109.
Kossale, Y.; Airaj, M.; Darouichi, A. Mode collapse in generative adversarial networks: An overview. In Proceedings of the 2022 8th International Conference on Optimization and Applications (ICOA), Beijing, China, 8–11 November 2022; pp. 1–6.
Naderi, H.; Soleimani, B.H.; Matwin, S. Generating high-fidelity images with disentangled adversarial VAEs and structure-aware loss. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
Marques, J.A.L.; Gois, F.N.B.; do Vale Madeiro, J.P.; Li, T.; Fong, S.J. Chapter 4 - Artificial neural network-based approaches for computer-aided disease diagnosis and treatment. In Cognitive and Soft Computing Techniques for the Analysis of Healthcare Data; Bhoi, A.K., de Albuquerque, V.H.C., Srinivasu, P.N., Marques, G., Eds.; Intelligent Data-Centric Systems; Academic Press: Cambridge, MA, USA, 2022; pp. 79–99.
Ali, A.H.; Yaseen, M.G.; Aljanabi, M.; Abed, S.A.; GPT, C. Transfer learning: A new promising techniques. Mesopotamian J. Big Data 2023, 2023, 29–30.
Hadhrami, E.A.; Mufti, M.A.; Taha, B.; Werghi, N. Transfer learning with convolutional neural networks for moving target classification with micro-Doppler radar spectrograms. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data, ICAIBD 2018, Chengdu, China, 26–28 May 2018.
Liu, T.; Alibhai, S.; Wang, J.; Liu, Q.; He, X.; Wu, C. Exploring transfer learning to reduce training overhead of HPC data in machine learning. In Proceedings of the 2019 IEEE International Conference on Networking, Architecture and Storage (NAS), Enshi, China, 15–17 August 2019; pp. 1–7.
Zhang, W.; Deng, L.; Zhang, L.; Wu, D. A survey on negative transfer. IEEE/CAA J. Autom. Sin. 2023, 10, 305–329.
Lakkapragada, A.; Sleiman, E.; Surabhi, S.; Wall, D.P. Mitigating negative transfer in multi-task learning with exponential moving average loss weighting strategies. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, 7–14 February 2023; Volume 37.
Zhang, H.; Liu, W.; Yang, H.; Zhou, Y.; Zhu, C.; Zhang, W. CSAL: Cost sensitive active learning for multi-source drifting stream. Knowl.-Based Syst. 2023, 277, 110771.
Choubey, S.; Divya, D. Lightweight federated transfer learning for plant leaf disease detection and classification across multiclient cross-silo datasets. BIO Web Conf. 2024, 82, 05018.
Upreti, K.; Singh, P.; Jain, D.; Pandey, A.K.; Gupta, A.; Singh, H.R.; Srivastava, S.K.; Prasad, J.S. Progressive loss-aware fine-tuning stepwise learning with GAN augmentation for rice plant disease detection. Multimed. Tools Appl. 2024, 83, 84565–84588.
Ramadan, S.T.Y.; Sakib, T.; Farid, F.A.; Islam, M.S.; Abdullah, J.B.; Bhuiyan, M.R.; Mansor, S.; Karim, H.B.A. Improving wheat leaf disease classification: Evaluating augmentation strategies and CNN-based models With limited dataset. IEEE Access 2024, 12, 69853–69874.
Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
Chen, Z.; Wang, G.; Lv, T.; Zhang, X. Using a hybrid convolutional neural network with a transformer model for tomato leaf disease detection. Agronomy 2024, 14, 673.
Ahmad, M.; Abdullah, M.; Moon, H.; Han, D. Plant disease detection in imbalanced datasets using efficient convolutional neural networks with stepwise transfer learning. IEEE Access 2021, 9, 140565–140580.
Christakakis, P.; Giakoumoglou, N.; Kapetas, D.; Tzovaras, D.; Pechlivani, E.M. Vision transformers in optimization of AI-based early detection of Botrytis cinerea. AI 2024, 5, 1301–1323.
Hashim, I.C.; Shariff, A.R.M.; Bejo, S.K.; Muharam, F.M.; Ahmad, K. Classification of non-infected and infected with basal stem rot disease using thermal images and imbalanced data approach. Agronomy 2021, 11, 2373.
Hashim, I.C.; Shariff, A.R.M.; Bejo, S.K.; Muharam, F.M.; Ahmad, K. Machine-learning approach using SAR data for the classification of oil palm trees that are non-infected and infected with the basal stem rot disease. Agronomy 2021, 11, 532.
Nafi, N.M.; Hsu, W.H. Addressing class imbalance in image-based plant disease detection: Deep generative vs. sampling-based approaches. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 243–248.
Xiao, T.; Liu, H.; Cheng, Y. Corn disease identification based on improved GBDT method. In Proceedings of the 2019 6th International Conference on Information Science and Control Engineering (ICISCE), Shanghai, China, 20–22 December 2019; pp. 215–219.
Lu, Y.; Liu, M.; Li, C.; Liu, X.; Cao, C.; Li, X.; Kan, Z. Precision fertilization and irrigation: Progress and applications. AgriEngineering 2022, 4, 41.
Selim, S.; Koc-San, D.; Selim, C.; San, B.T. Site selection for avocado cultivation using GIS and multi-criteria decision analyses: Case study of Antalya, Turkey. Comput. Electron. Agric. 2018, 154, 450–459.
Sharififar, A.; Sarmadian, F. Coping with imbalanced data problem in digital mapping of soil classes. Eur. J. Soil Sci. 2023, 74, e13368.
Sharififar, A.; Sarmadian, F.; Malone, B.P.; Minasny, B. Addressing the issue of digital mapping of soil classes with imbalanced class observations. Geoderma 2019, 350, 84–92.
Sharififar, A.; Sarmadian, F.; Minasny, B. Mapping imbalanced soil classes using Markov chain random fields models treated with data resampling technique. Comput. Electron. Agric. 2019, 159, 110–118.
Wang, L.; Wang, X.; Kooch, Y.; Song, K.; Wu, D. Improvement of data imbalance for digital soil class mapping in Eastern China. Comput. Electron. Agric. 2023, 214, 108322.
Hu, T.; Li, K.; Ma, C.; Zhou, N.; Chen, Q.; Qi, C. Improved classification of soil As contamination at continental scale: Resolving class imbalances using machine learning approach. Chemosphere 2024, 363, 142697.
Nalepa, J. Recent advances in multi-and hyperspectral image analysis. Sensors 2021, 21, 6002.
Taghizadeh, M.; Gowen, A.A.; O’Donnell, C.P. Comparison of hyperspectral imaging with conventional RGB imaging for quality evaluation of Agaricus bisporus mushrooms. Biosyst. Eng. 2011, 108, 191–194.
Blagus, R.; Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 2013, 14, 106.
Deepa, T.; Punithavalli, M. An E-SMOTE technique for feature selection in high-dimensional imbalanced dataset. In Proceedings of the ICECT 2011 - 2011 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, 8–10 April 2011; Volume 2.
Qazi, N.; Raza, K. Effect of feature selection, Synthetic Minority Over-sampling (SMOTE) and under-sampling on class imbalance classification. In Proceedings of the 2012 14th International Conference on Modelling and Simulation, UKSim 2012, Cambridge, UK, 28–30 March 2012.
A, A.S.; S, A.A. Land-cover classification with hyperspectral remote sensing image using CNN and spectral band selection. Remote Sens. Appl. Soc. Environ. 2023, 31, 100986.
Zhan, Y.; Hu, D.; Wang, Y.; Yu, X. Semisupervised hyperspectral image classification based on generative adversarial networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 212–216.
Zhan, Y.; Wu, K.; Liu, W.; Qin, J.; Yang, Z.; Medjadba, Y.; Wang, G.; Yu, X. Semi-supervised classification of hyperspectral data based on generative adversarial networks and neighbourhood majority voting. In Proceedings of the IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018.
Feng, J.; Yu, H.; Wang, L.; Cao, X.; Zhang, X.; Jiao, L. Classification of hyperspectral images based on multiclass spatial-spectral generative adversarial networks. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5329–5343.
Zhong, Z.; Li, J.; Clausi, D.A.; Wong, A. Generative adversarial networks and conditional random fields for hyperspectral image classification. IEEE Trans. Cybern. 2020, 50, 3318–3329.
Wang, X.; Tan, K.; Du, Q.; Chen, Y.; Du, P. Caps-TripleGAN: GAN-Assisted Capsnet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7232–7245.
Yin, J.; Li, W.; Han, B. Hyperspectral image classification based on generative adversarial network with dropblock. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019.
Roy, S.K.; Haut, J.M.; Paoletti, M.E.; Dubey, S.R.; Plaza, A. Generative adversarial minority oversampling for spectral-spatial hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5500615.
Mullick, S.S.; Datta, S.; Das, S. Generative adversarial minority oversampling. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
Fawakherji, M.; Suriani, V.; Nardi, D.; Bloisi, D.D. Shape and style GAN-based multispectral data augmentation for crop/weed segmentation in precision farming. Crop Prot. 2024, 184, 106848.
Modak, S.; Stein, A. Synthesizing training data for intelligent weed control systems using generative AI. In Architecture of Computing Systems, Proceedings of the 37th International Conference, ARCS 2024, Potsdam, Germany, 14–16 May 2024; Fey, D., Stabernack, B., Lankes, S., Pacher, M., Pionteck, T., Eds.; Springer: Cham, Switzerland, 2024; pp. 112–126.
Ma, X.; Deng, X.; Qi, L.; Jiang, Y.; Li, H.; Wang, Y.; Xing, X. Fully convolutional network for rice seedling and weed image segmentation at the seedling stage in paddy fields. PLoS ONE 2019, 14, e0215676.
Bi, Z.; Li, Y.; Guan, J.; Li, J.; Zhang, P.; Zhang, X.; Han, Y.; Wang, L.; Guo, W. Weed identification in broomcorn millet field using segformer semantic segmentation based on multiple loss functions. Eng. Agric. Environ. Food 2024, 17, 27–36.
Jun, S.; Wenjun, T.; Xiaohong, W.; Jifeng, S.; Bing, L.; Chunxia, D. Real-time recognition of sugar beet and weeds in complex backgrounds using multi-channel depth-wise separable convolution model. Trans. Chin. Soc. Agric. Eng. Trans. CSAE 2019, 35, 184–190.
Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA - Protein Struct. 1975, 405, 442–451.
Kent, A.; Berry, M.M.; Luehrs, F.U.; Perry, J.W. Machine literature searching VIII. Operational criteria for designing information retrieval systems. Am. Doc. 1955, 6, 93–101.
Binney, N.; Hyde, C.; Bossuyt, P.M. On the origin of sensitivity and specificity. Ann. Intern. Med. 2021, 174, 401–407.
Van Rijsbergen, C.J. Information Retrieval; Butterworth-Heinemann: Oxford, UK, 1979.
Akosa, J.S. Predictive accuracy: A misleading performance measure for highly imbalanced data. In Proceedings of the SAS Global Forum 2017, Orlando, FL, USA, 2–5 April 2017.
Hinojosa Lee, M.C.; Braet, J.; Springael, J. Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores. Appl. Sci. 2024, 14, 9863.
Man, X.; Lin, J.; Yang, Y. Stock-UniBERT: A News-based cost-sensitive ensemble BERT model for stock trading. In Proceedings of the 2020 IEEE 18th International Conference on Industrial Informatics (INDIN), Warwick, UK, 20–23 July 2020; Volume 1, pp. 440–445.

Figure 1. Difference between undersampling and oversampling [102]. (Blue and orange colors represent majority and minority classes, respectively).

Figure 2. GAN model, adapted from [48].

Figure 3. cGAN model, adapted from [128].

Figure 4. Variational autoencoder model, a type of generative model that learns to encode input data into a probabilistic latent space (adapted from [135]).

Figure 5. Confusion matrix.
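
As an illustration of how the four cells of a binary confusion matrix (Figure 5) feed the imbalance-aware metrics discussed in this survey (F1-score, G-mean, balanced accuracy, and MCC), the following minimal sketch computes them from raw counts. The function name `imbalance_metrics` and the example counts are hypothetical and chosen only for illustration; they are not taken from any study cited here.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Derive imbalance-aware metrics from binary confusion-matrix counts."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity (TPR)
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # TNR
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    # MCC denominator; zero when any row or column of the matrix is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "balanced_accuracy": (recall + specificity) / 2,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Hypothetical counts: 10 diseased (positive) and 990 healthy (negative) samples.
metrics = imbalance_metrics(tp=2, fn=8, fp=5, tn=985)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, accuracy is 0.987 even though only 2 of the 10 positive samples are detected (recall 0.2), which is the kind of discrepancy that motivates reporting the other metrics alongside accuracy.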

Table 1. Strategies to handle the imbalanced data problem.

Techniques	Strategy	Limitations
Algorithm-level
Cost-sensitive learning [90]	Assigning weights to sample data to compensate for the class imbalance.	Requires prior knowledge to choose appropriate initial weight values.
Threshold moving [91]	Adjusting the classifier's decision (probability) threshold, typically set at 0.5, to a different value.	Needs readjustment for new cases or changing data conditions.
Ensemble learning [92]	Combining the outputs of multiple classifiers.	Demands significant computational resources and increases model complexity.
Data-level
Undersampling [93]	Reducing the number of samples from the majority class until both classes are balanced.	Potential for the loss of important information from the deleted majority samples.
Oversampling [94]	Increasing the number of minority samples until it matches the majority class.	Redundant data can lead to overfitting.
Hybrid-level
Hybrid [95]	Combining data-level and algorithm-level techniques.	May inherit the weaknesses of its constituent techniques.
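
The two simplest algorithm-level strategies in Table 1 can be sketched in a few lines. The sketch below assumes a binary problem with label 1 as the minority class; the inverse-frequency weighting rule, the F1-based threshold search, and all function names and numbers are illustrative assumptions, not the procedures of the cited works.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Cost-sensitive weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}

def predict_with_threshold(probs, threshold=0.5):
    """Threshold moving: label a sample positive when p >= threshold."""
    return [1 if p >= threshold else 0 for p in probs]

def best_f1_threshold(probs, labels, candidates):
    """Pick the candidate threshold with the highest F1 on validation data."""
    def f1(threshold):
        preds = predict_with_threshold(probs, threshold)
        tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
        fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
        fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
        return 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return max(candidates, key=f1)

# Hypothetical validation set: 8 majority (0) and 2 minority (1) samples.
labels = [0] * 8 + [1] * 2
probs = [0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.35, 0.45, 0.4, 0.6]

weights = balanced_class_weights(labels)
threshold = best_f1_threshold(probs, labels, candidates=[0.3, 0.4, 0.5])
print(weights, threshold)
```

The resulting weights could be passed to a loss function or to a learner that accepts per-class weights, while the selected threshold replaces the default 0.5 at prediction time. As Table 1 notes, the threshold would need to be re-selected if the data conditions change.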

Table 3. List of oversampling techniques.

Method	Strategy	Limitation
ROS	Randomly duplicating existing samples of the minority class until the number of minority samples equals that of the majority class.	ROS is prone to overfitting, as the model may become too tailored to the duplicated minority samples, reducing its ability to generalize to new data.
SMOTE [105]	Generating new data between minority samples that lie close to each other in the feature space. Specifically, for each reference minority sample, SMOTE randomly selects from its k nearest minority-class neighbours and creates synthetic instances at random points along the line segment between the reference sample and the selected neighbour.	SMOTE has a potential overlap issue. Since it does not consider the majority class, synthetic samples may fall among majority class samples, reducing majority class performance.
ADASYN [106]	An extension of SMOTE that focuses on generating more synthetic data in minority class areas that are difficult for classification algorithms to learn, thus improving classification performance for unevenly distributed minority classes.	ADASYN may generate synthetic samples around noisy instances, amplifying the effect of noise and potentially degrading model performance.
Borderline SMOTE [107]	Concentrates on minority class samples located in the border area, which are difficult to classify and frequently misclassified. Uses KNN to identify border samples and generates new data based on them.	Sensitive to noise: if the reference samples are noisy, Borderline SMOTE may amplify the noise by generating further synthetic samples around them. Success relies heavily on hyperparameter settings, especially the number of nearest neighbours.
SL-SMOTE [108]	Aims to assess the safety level of each minority class sample to avoid generating unwanted noise. Categorizes reference samples into noisy, borderline, and safe areas; generates new samples specifically in the safe area.	New synthetic data in safe areas, away from decision boundaries, may not significantly impact heavily misclassified border areas. Less effective for complex datasets.
K-Means SMOTE [109]	Combines K-Means clustering with SMOTE; clusters minority class samples into groups using K-Means and applies SMOTE to each cluster.	Sensitive to outliers, as they can affect clustering. The effectiveness of oversampling relies on the number of clusters (K) parameter.
SVM SMOTE [110]	Uses SVM to identify support vector samples near the decision boundary and generates new samples around these support vectors.	Its effectiveness depends on the SVM model and requires optimal hyperparameter tuning.
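
To make the first two rows of Table 3 concrete, the following is a minimal standard-library sketch of ROS and of the SMOTE interpolation step. It assumes samples are tuples of floats and uses squared Euclidean distance; the function names, the neighbour count `k`, and the toy data are illustrative assumptions rather than a reference implementation of [105].

```python
import random

def random_oversample(minority, n_target, rng):
    """ROS: duplicate randomly chosen minority samples up to n_target."""
    extra = [rng.choice(minority) for _ in range(n_target - len(minority))]
    return minority + extra

def smote(minority, n_new, k=3, rng=None):
    """SMOTE-style interpolation between a sample and one of its k nearest
    minority neighbours (requires at least two minority samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        ref = rng.choice(minority)
        # k nearest minority neighbours of ref, excluding ref itself.
        neighbours = sorted(
            (s for s in minority if s is not ref),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(ref, s)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment ref -> nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(ref, nb)))
    return synthetic

# Hypothetical two-feature minority class.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
rng = random.Random(42)
ros_samples = random_oversample(minority, n_target=10, rng=rng)
smote_samples = smote(minority, n_new=6, k=2, rng=rng)
print(len(ros_samples), len(smote_samples))
```

ROS only repeats existing points, whereas the SMOTE-style samples are new points inside the region spanned by the minority class. Neither function looks at the majority class, which is the source of the overlap limitation listed in the table.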

Table 4. Summary of various approaches for plant disease detection.

Title	Year	Dataset	Techniques
Improving Wheat Leaf Disease Classification: Evaluating Augmentation Strategies and CNN-Based Models With Limited Dataset [151]	2024	Wheat leaves	To expand the limited dataset, data augmentation was performed using CycleGAN [152] and ADASYN; classification was then conducted using CNN models.
Using a Hybrid Convolutional Neural Network with a Transformer Model for Tomato Leaf Disease Detection [153]	2024	Plant Village dataset	Using a hybrid CNN-Transformer model, with CycleGAN for data augmentation to overcome the class imbalance problem.
Enhanced SVM-SMOTE with Cluster Consistency for Imbalanced Data Classification [48]	2023	Hyperspectral data of Sugar leaves	Using a modified SVM-SMOTE technique to increase the number of minority samples of early-stage diseased plants, which are often ambiguous and difficult to distinguish from both normal and infected conditions.
Ensemble synthetic oversampling with Manhattan distance for unbalanced hyperspectral data [46]	2021	Hyperspectral data of Arabidopsis leaves	Modifying Safe-Level SMOTE with Manhattan distance and then extending it with ensemble learning techniques.
Plant Disease Detection in Imbalanced Datasets Using Efficient Convolutional Neural Networks With Stepwise Transfer Learning [154]	2021	Plant Village dataset and Pepper dataset	Using SMOTE and GAN to overcome the data imbalance issue.
Vision Transformers in Optimisation of AI-Based Early Detection of Botrytis cinerea [155]	2024	Botrytis cinerea (Gray Mold Disease) on Cucurbitaceae crops	Using DL segmentation model with Vision Transformer (ViT) encoder, combined with Cut-and-Paste method to address dataset imbalance. Multispectral imaging is used to detect disease progression.
Classification of Non-Infected and Infected with Basal Stem Rot Disease Using Thermal Images and Imbalanced Data Approach [156]	2021	Oil palm trees infected with Ganoderma boninense (BSR)	Using thermal images to identify BSR-infected and non-infected oil palm trees, with data imbalance approaches such as RUS, ROS, and SMOTE.
Machine-Learning Approach Using SAR Data for the Classification of Oil Palm Trees That Are Non-Infected and Infected with the Basal Stem Rot Disease [157]	2021	Oil palm trees infected with Ganoderma boninense (BSR)	Used ALOS PALSAR-2 imagery with dual polarization. SMOTE was employed to address the imbalance in data.
Addressing Class Imbalance in Image-Based Plant Disease Detection: Deep Generative vs. Sampling-Based Approaches [158]	2020	Plant Village dataset	Compared a GAN-based approach with traditional sampling methods (undersampling, oversampling, SMOTE) to address data imbalance.
Corn Disease Identification Based on improved GBDT Method [159]	2019	Corn leaves	Used SMOTE to address data imbalance, applied regional interpolation for image resizing, and employed Gradient Boosting Decision Tree (GBDT) for disease identification.

Table 5. Summary of various approaches for soil classification with imbalanced data.

Title	Year	Dataset	Techniques
Coping with imbalanced data problem in digital mapping of soil classes [162]	2023	453 soil profiles from northwest Iran	The authors proposed three ML approaches to address the imbalanced data problem in soil mapping: Ensemble Gradient Boosting (XGB), Cost-Sensitive Decision Tree (CSDT), and One-Class Classification Combined with Multi-Class Classification (OCCM).
Addressing the issue of digital mapping of soil classes with imbalanced class observations [163]	2019	452 soil profile observations collected on a regular grid covering approximately 12,000 hectares in northwest Iran	Over- and under-sampling techniques were employed to address the imbalanced class distribution in the dataset.
Mapping imbalanced soil classes using Markov chain random fields models treated with data resampling technique [164]	2019	452 soil profile observations collected on a regular grid covering approximately 12,000 hectares in northwest Iran	Markov Chain Random Field modelling was used to predict spatial patterns of soil classes, with ROS applied prior to modelling.
Improvement of data imbalance for digital soil class mapping in Eastern China [165]	2023	316 topsoil samples collected in Eastern China	ROS and RUS techniques were applied to address class imbalance, while Random Forest (RF) was used for soil classification.
Improved classification of soil As contamination at continental scale: Resolving class imbalances using machine learning approach [166]	2024	Land Use/Land Cover Area Frame Survey (LUCAS) 2009 dataset	Methods like SMOTE, SMOTE-Tomek, RUS, and the Tomek-Links technique (TLTE) were used to balance the number of samples in the contaminated and non-contaminated classes.
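
Several entries in Table 5 rely on Tomek links, either alone or after SMOTE (SMOTE-Tomek). A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below, which assumes Euclidean distance and removes only the majority member of each link, is a hedged illustration with hypothetical function names and toy data, not the exact procedure of [166].

```python
def nearest_index(points, i):
    """Index of the nearest neighbour of points[i] (squared Euclidean)."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(points[i], points[j])),
    )

def tomek_links(points, labels):
    """Pairs (i, j), i < j, of mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(points, i)
        if i < j and labels[i] != labels[j] and nearest_index(points, j) == i:
            links.append((i, j))
    return links

def remove_majority_in_links(points, labels, majority_label):
    """Undersample by dropping the majority-class member of every Tomek link."""
    drop = {
        idx
        for pair in tomek_links(points, labels)
        for idx in pair
        if labels[idx] == majority_label
    }
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]

# Hypothetical data: label 0 is the majority class, label 1 the minority class.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.1, 1.0), (5.0, 5.0)]
labels = [0, 1, 0, 0, 0]
links = tomek_links(points, labels)
clean_points, clean_labels = remove_majority_in_links(points, labels, majority_label=0)
print(links, clean_labels)
```

In a SMOTE-Tomek style pipeline, this cleaning step would typically run after oversampling, so that synthetic minority samples that end up next to majority samples do not blur the class boundary.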

Table 6. Summary of various approaches for weed detection with imbalanced data.

Title	Year	Dataset	Techniques
Shape and style GAN-based multispectral data augmentation for crop/weed segmentation in precision farming [181]	2024	Multispectral image of sugar beet dataset	This research uses two types of GAN, cGAN and DCGAN, for scene augmentation.
Synthesising Training Data for Intelligent Weed Control Systems Using Generative AI [182]	2024	Multispectral image of sugar beet dataset	This study employs a generative approach to create synthetic images for training object detection systems for weed control. It combines the Segment Anything Model (SAM) for zero-shot transfer to new domains with an AI-based Stable Diffusion Model to generate synthetic images.
Fully convolutional network for rice seedling and weed image segmentation at the seedling stage in paddy fields [183]	2019	RGB images of rice seedlings and weeds in paddy fields	The study applies semantic segmentation using SegNet to detect the positions of rice seedlings and weeds in paddy fields. In addition, class weight coefficients are calculated to handle the class imbalance.
Weed identification in broomcorn millet field using Segformer semantic segmentation based on multiple loss functions [184]	2024	RGB Images of broomcorn millet farms with 67% weed coverage	The study uses Segformer. Furthermore, a combination of dice loss and focal loss is applied to address the imbalance between positive and negative samples and to resolve the issue of small area segmentation due to densely growing weeds.
Real-time recognition of sugar beet and weeds in complex backgrounds using multi-channel depth-wise separable convolution model [185]	2019	Multispectral image of sugar beet dataset	This study introduces a lightweight convolutional neural network with a codec structure for real-time sugar beet and weed recognition. A weighted loss function addresses pixel imbalances among soil, crops, and weeds.
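
Table 6 mentions a combination of dice loss and focal loss for pixel-level imbalance. The following minimal sketch shows one common formulation of each for binary masks, operating on flat lists of predicted foreground probabilities. The hyperparameters (`gamma`, `alpha`, `smooth`), the equal mixing weight, and the function names are assumptions for illustration and are not the settings reported in [184].

```python
import math

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: down-weights easy, well-classified pixels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1 - eps)      # clip to avoid log(0)
        pt = p if y == 1 else 1 - p        # probability of the true class
        at = alpha if y == 1 else 1 - alpha
        total += -at * (1 - pt) ** gamma * math.log(pt)
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft dice loss: 1 - overlap ratio between prediction and mask."""
    intersection = sum(p * y for p, y in zip(probs, targets))
    return 1 - (2 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

def combined_loss(probs, targets, w_dice=0.5):
    """Weighted sum of dice and focal loss (mixing weight is an assumption)."""
    return w_dice * dice_loss(probs, targets) + (1 - w_dice) * focal_loss(probs, targets)

# Hypothetical four-pixel mask with a single foreground (weed) pixel.
targets = [1, 0, 0, 0]
good = [0.9, 0.1, 0.1, 0.05]
bad = [0.1, 0.9, 0.9, 0.95]
print(combined_loss(good, targets), combined_loss(bad, targets))
```

The dice term depends on region overlap rather than per-pixel counts, which helps with small foreground areas, while the focal term reduces the contribution of the many easy background pixels.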


© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

