A Survey of Methods for Addressing Imbalance Data Problems in Agriculture Applications
Abstract
1. Introduction
2. Challenge of Imbalance Classification in Agricultural Applications
2.1. Multiclass Classification
2.2. Intra-Class Classification
2.3. Impact of Data Imbalance on ML Pipeline
- Data Collection. Datasets often have unbalanced class distributions because minority classes are scarce or difficult to collect. One cause is natural rarity, such as rare crop diseases or pest infestations that occur only under certain environmental conditions. Another is limited access to certain populations or regions, such as remote farmlands that are difficult to reach for surveys or sampling. Finally, limited technology or resources for capturing sufficient data in complex or hard-to-reach environments, such as farms in mountainous areas or under extreme weather, can exacerbate the imbalance. A further issue in data collection is the need for varied data. In agriculture, for example, it is important to collect samples of plant species from different positions, at different growth stages, and with varying degrees of virus severity. This variety is crucial for ML models to generalize well and make accurate predictions across a wide range of real-world conditions. In practice, however, obtaining sufficiently varied data is often constrained by limited time, cost, and availability of representative samples. This challenge can be mitigated by synthesizing realistic data as a pre-processing strategy. Data augmentation methods, such as random cropping, flipping, or color jittering, artificially increase the diversity of the dataset without additional data collection [79]. Furthermore, generative models, such as GANs, can create new, realistic samples for underrepresented classes [80]. These data generation techniques have proven effective in addressing limited or imbalanced data in real-world agricultural scenarios [81,82].
- Model Training. During training, bias towards majority classes can become a significant issue. In cassava disease detection, for instance, datasets are often imbalanced: images of healthy cassava plants far outnumber those of infected plants, so the model predominantly predicts the majority class (healthy plants) and detects diseased plants less accurately [83]. Overfitting presents another challenge when models are trained on imbalanced data. Because the model “sees” the majority class far more often during training, it lacks robustness and tends to overlook the minority class. In practice, such models generalize poorly to new, unseen data and tend to predict the majority class while neglecting the minority class [84]. In addition to data augmentation, this challenge often calls for cost-sensitive learning, where greater penalties are applied to errors on the minority class, or for threshold moving, where the decision threshold is tuned to better balance majority and minority predictions, for example to maximize recall or F1-score for the minority class [85,86].
- Evaluation Metric Selection. Class imbalance can make common metrics such as accuracy misleading, because a model may achieve high accuracy by predominantly predicting the majority class while failing to classify the minority class correctly. Similarly, precision and recall, when used independently, may not give a complete picture of model performance. For example, high precision on a minority class can conceal low recall, where the model misses most of the minority class instances [87]. Alternative metrics such as F1-score, Matthews correlation coefficient (MCC), or G-Mean are often more informative under class imbalance [88].
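The effect described for metric selection can be illustrated with a small worked example. The sketch below is only illustrative: the 95/5 split of healthy and diseased samples and both sets of predictions are hypothetical, and the function names are introduced here for convenience. It compares a classifier that always predicts the majority class with one that finds most minority samples at the price of some false alarms.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the minority class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def imbalance_metrics(y_true, y_pred, positive=1):
    """Compute accuracy alongside metrics that are less sensitive to imbalance."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": math.sqrt(recall * specificity),
        "mcc": mcc,
    }


# Hypothetical test set: 95 healthy plants (0) and 5 diseased plants (1).
y_true = [0] * 95 + [1] * 5

# Classifier A always predicts "healthy".
majority_only = imbalance_metrics(y_true, [0] * 100)

# Classifier B raises 10 false alarms but finds 4 of the 5 diseased plants.
detector = imbalance_metrics(y_true, [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1)

for name, result in (("majority-only", majority_only), ("detector", detector)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Under these assumed predictions, the majority-only classifier scores higher on accuracy, while its recall, F1-score, G-Mean, and MCC are all zero; the detector ranks better on every one of those imbalance-aware metrics.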
3. Methods to Address Imbalanced Data
- Algorithm-level Approach: Focuses on modifying classification algorithms to be more sensitive to minority classes.
- Data-level Approach: Focuses on manipulating the data to balance class distributions, for example by oversampling or undersampling.
- Hybrid-level Approach: Combines algorithm-level and data-level approaches to obtain better-optimized methods.
3.1. Algorithm-Level Approach
3.1.1. Cost-Sensitive Learning
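As a minimal sketch of the idea that errors on the minority class should carry greater penalties, the example below weights a binary cross-entropy loss by class. The inverse-frequency weighting rule, the 90/10 class split, and the function names are assumptions made for illustration; they are not a specific method from the reviewed studies.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This inverse-frequency heuristic is one common choice, assumed here.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_log_loss(y_true, p_positive, weights, eps=1e-12):
    """Mean binary cross-entropy where each sample is scaled by its class weight.

    `p_positive` holds the predicted probability of class 1 for each sample.
    """
    total = 0.0
    for label, prob in zip(y_true, p_positive):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        sample_loss = -math.log(prob) if label == 1 else -math.log(1.0 - prob)
        total += weights[label] * sample_loss
    return total / len(y_true)


# Hypothetical training labels: 90 healthy (0) and 10 diseased (1) samples.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)

# A model that outputs a low disease probability for every sample.
probabilities = [0.1] * len(labels)

plain_loss = weighted_log_loss(labels, probabilities, {0: 1.0, 1: 1.0})
cost_sensitive_loss = weighted_log_loss(labels, probabilities, weights)
print(weights, round(plain_loss, 3), round(cost_sensitive_loss, 3))
```

With these weights, a model that ignores the minority class is penalized far more heavily than under the unweighted loss, which is the pressure that cost-sensitive training relies on.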
3.1.2. Threshold Moving
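Threshold moving can be sketched as a search over candidate decision thresholds for the one that maximizes the minority-class F1-score. The scores and labels below are hypothetical, and choosing F1 as the selection criterion is an assumption; recall or another metric could be substituted in the same way.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1-score of the positive (minority) class when predicting score >= threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(y_true, scores):
    """Return the observed score that gives the highest F1 when used as threshold.

    Ties are resolved in favour of the lowest such threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))


# Hypothetical validation set: 8 majority samples (0) and 2 minority samples (1),
# with model scores interpreted as the predicted probability of the minority class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.45, 0.40, 0.60]

default_f1 = f1_at_threshold(y_true, scores, 0.5)
tuned_threshold = best_threshold(y_true, scores)
tuned_f1 = f1_at_threshold(y_true, scores, tuned_threshold)
print(default_f1, tuned_threshold, tuned_f1)
```

In practice the threshold would be selected on a validation split rather than on the test data, so that the reported performance is not optimistically biased.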
3.2. Data-Level Approach
3.2.1. Undersampling
3.2.2. Oversampling
Method | Strategy | Limitation |
---|---|---|
Random Undersampling (RUS) | Randomly deletes existing samples of the majority class until the number of majority samples equals that of the minority class. | RUS may lead to a loss of valuable information, and can result in underfitting as it reduces the dataset size. |
Edited Nearest Neighbour (ENN) [111] | Deletes majority class samples based on their nearest neighbours: a majority class sample is deleted if most of its nearest neighbours belong to a different class. | It may not entirely reduce the degree of imbalance, as there may not be many majority samples meeting the criteria. |
Neighbourhood Cleaning Rule (NCL) [112] | An extension of ENN. This method considers the neighbourhoods of both majority class samples and minority class samples in order to remove more of the majority class samples. | Although it might remove more samples than ENN, there is still potential for not fully reducing the imbalance degree, especially in extremely imbalanced data conditions. |
Tomek-Links [113] | Identifies “Tomek-Links” pairs and removes the majority class sample of each pair. Two samples from different classes form a Tomek-Links pair if each is the other’s nearest neighbour (by Euclidean distance), i.e., no other sample is closer to either of them. | Only considers samples in the boundary area, so noise or overlapped samples may still exist. |
Cluster-based undersampling [114] | Uses a clustering algorithm to represent the majority class by cluster centres or by the samples nearest to them, selected with the K-Nearest Neighbour (K-NN) rule. | While this technique helps retain important information from the majority class, it may disrupt the original sample distribution. |
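Two of the undersampling strategies in the table, RUS and Tomek-Links removal, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the six one-dimensional samples are hypothetical, the function names are introduced here, the problem is binary, and nearest neighbours are found by brute force with Euclidean distance.

```python
import math
import random


def random_undersample(samples, labels, majority, minority, seed=0):
    """RUS: keep a random subset of the majority class equal in size to the minority."""
    rng = random.Random(seed)
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    kept = sorted(rng.sample(majority_idx, len(minority_idx)) + minority_idx)
    return [samples[i] for i in kept], [labels[i] for i in kept]


def nearest_neighbour(i, samples):
    """Index of the sample closest to samples[i] (brute force, Euclidean)."""
    others = (j for j in range(len(samples)) if j != i)
    return min(others, key=lambda j: math.dist(samples[i], samples[j]))


def tomek_links(samples, labels):
    """Pairs (i, j) of mutual nearest neighbours that belong to different classes."""
    nn = [nearest_neighbour(i, samples) for i in range(len(samples))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def remove_tomek_majority(samples, labels, majority):
    """Drop the majority class member of every Tomek-Links pair."""
    drop = {i if labels[i] == majority else j
            for i, j in tomek_links(samples, labels)}
    kept = [k for k in range(len(samples)) if k not in drop]
    return [samples[k] for k in kept], [labels[k] for k in kept]


# Hypothetical data: four majority samples (0) and two minority samples (1).
samples = [(0.0,), (0.3,), (0.5,), (1.0,), (1.1,), (3.0,)]
labels = [0, 0, 0, 0, 1, 1]

links = tomek_links(samples, labels)
cleaned_samples, cleaned_labels = remove_tomek_majority(samples, labels, majority=0)
rus_samples, rus_labels = random_undersample(samples, labels, majority=0, minority=1)
print(links, cleaned_labels, rus_labels)
```

The example reflects the limitations listed in the table: Tomek-Links removal deletes only the boundary sample at 1.0 and leaves the classes imbalanced, whereas RUS balances the classes but discards half of the majority samples regardless of how informative they are.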
3.3. Hybrid-Level Approach
4. Leveraging Generative Models for Synthetic Data Generation in Addressing Imbalanced Datasets
4.1. Generative Adversarial Networks (GANs)
4.2. Variational Autoencoder (VAE)
4.3. Transfer Learning
5. Applications in Agriculture
5.1. Disease Detection
5.2. Soil Management
5.3. Crop Type Classification
5.4. Weed Detection
6. Evaluation Metrics
6.1. Confusion Matrix
6.2. Matthews Correlation Coefficient (MCC)
6.3. Precision and Recall
6.4. Sensitivity/Recall and Specificity
6.5. F-Score
6.6. Geometric Mean (G-Mean)
6.7. Balanced Accuracy
6.8. Cost-Sensitive Metrics
7. Challenge and Future Directions
8. Conclusions
- Scope of the literature review: The review focuses on papers published within the last few years and indexed in Scopus, which means it may not cover many older publications that may still provide valuable insights.
- Focus on specific techniques: While the review discusses a range of techniques to address class imbalance in agricultural applications, it does not exhaustively compare all of the methods discussed nor does it explore other potentially effective methods, such as active learning, which has gained traction in other applications.
- Data limitations: A significant limitation noted in the reviewed studies is the lack of publicly available datasets. Many of the datasets used in the referenced papers are proprietary or have restricted access, which limits reproducibility. The lack of standardized public datasets also hinders efforts to compare and validate research findings consistently. We therefore recommend that future studies use widely adopted public datasets as benchmarks. Additionally, journals could encourage data sharing by offering free open-access publication to researchers who make their datasets publicly available. Collaborative efforts toward dataset standardization would add significant value by promoting consistency and improving the usability of shared data.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gebbers, R.; Adamchuk, V.I. Precision agriculture and food security. Science 2010, 327, 828–831.
- Reynolds, M.; Pask, A.; Mullan, D. Physiological Breeding I: Interdisciplinary Approaches to Improve Crop Production; CIMMYT: Texcoco, Mexico, 2012.
- Wolfert, S.; Ge, L.; Verdouw, C.; Bogaardt, M.J. Big data in smart farming—A review. Agric. Syst. 2017, 153, 69–80.
- Zhang, C.; Kovacs, J.M. The application of small unmanned aerial systems for precision agriculture: A review. Precis. Agric. 2012, 13, 693–712.
- Kashyap, B.; Kumar, R. Sensing Methodologies in Agriculture for Soil Moisture and Nutrient Monitoring. IEEE Access 2021, 9, 14095–14121.
- Jones, H.G. Irrigation scheduling: Advantages and pitfalls of plant-based methods. J. Exp. Bot. 2004, 55, 2427–2436.
- Nair, N.; Akshaya, A.A.; Joseph, J. An in-situ soil pH sensor with solid electrodes. IEEE Sens. Lett. 2022, 6, 2000104.
- Postolache, S.; Sebastião, P.; Viegas, V.; Postolache, O.; Cercas, F. IoT-based systems for soil nutrients assessment in horticulture. Sensors 2023, 23, 403.
- Campbell, G.S.; Anderson, R.Y. An Introduction to Environmental Biophysics; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1998.
- Thenkabail, P.S.; Lyon, J.G.; Huete, A. Hyperspectral Remote Sensing of Vegetation; CRC Press: Boca Raton, FL, USA, 2012.
- Zhang, M.; Qin, Z.; Liu, X.; Ustin, S.L. Detection of stress in tomatoes induced by late blight disease in California, USA, using hyperspectral remote sensing. Int. J. Appl. Earth Obs. Geoinf. 2003, 4, 295–310.
- Ehlert, D.; Horn, H.J.; Adamek, R. Measuring crop biomass density by laser triangulation. Comput. Electron. Agric. 2010, 74, 111–118.
- Moratiel, R.; Martínez-Cob, A.; Latorre, B. Variation in the estimation of ETo and crop water use due to meteorological data quality. Agric. Water Manag. 2011, 98, 1442–1451.
- Kim, Y.; Evans, R.G.; Iversen, W.M. Remote sensing and control of an irrigation system using a distributed wireless sensor network. IEEE Trans. Instrum. Meas. 2008, 57, 1379–1387.
- Doraiswamy, P.C.; Hatfield, J.L.; Jackson, T.J.; Akhmedov, B.; Prueger, J.; Stern, A. Crop condition and yield simulations using Landsat and MODIS. Remote Sens. Environ. 2004, 92, 548–559.
- Hammer, G.L.; Sinclair, T.R.; Boote, K.J.; Wright, G.C.; Meinke, H. A peanut simulation model: I. Model development and testing. Agron. J. 1995, 87, 1085–1093.
- Evett, S.R.; Tolk, J.A.; Howell, T.A. A depth control stand for improved accuracy with the neutron probe. Vadose Zone J. 2003, 2, 642–649.
- Sadler, E.J.; Evans, R.G.; Stone, K.C.; Camp, C.R. Opportunities for conservation with precision irrigation. J. Soil Water Conserv. 2005, 60, 371–378.
- Lobell, D.B.; Schlenker, W.; Costa-Roberts, J. Climate trends and global crop production since 1980. Science 2011, 333, 616–620.
- White, J.W.; Hoogenboom, G.; Kimball, B.A.; Wall, G.W. Methodologies for simulating impacts of climate change on crop production. Field Crop. Res. 2011, 124, 357–368.
- Segarra, J.; Buchaillot, M.L.; Araus, J.L.; Kefauver, S.C. Remote sensing for precision agriculture: Sentinel-2 improved features and applications. Agronomy 2020, 10, 641.
- Yang, C. High resolution satellite imaging sensors for precision agriculture. Front. Agric. Sci. Eng. 2018, 5, 393–405.
- Messina, G.; Modica, G. Applications of UAV thermal imagery in precision agriculture: State of the art and future research outlook. Remote Sens. 2020, 12, 1491.
- Shahi, T.B.; Xu, C.Y.; Neupane, A.; Guo, W. Machine learning methods for precision agriculture with UAV imagery: A review. Electron. Res. Arch. 2022, 30, 4277–4317.
- Mishra, P.; Asaari, M.S.M.; Herrero-Langreo, A.; Lohumi, S.; Diezma, B.; Scheunders, P. Close range hyperspectral imaging of plants: A review. Biosyst. Eng. 2017, 153, 41–60.
- Mulla, D.J. Twenty-five years of remote sensing in precision agriculture: Key advances and remaining knowledge gaps. Biosyst. Eng. 2013, 114, 358–371.
- Mahlein, A.K. Plant disease detection by imaging sensors—Parallels and specific demands for precision agriculture and plant phenotyping. Plant Dis. 2016, 100, 241–251.
- Sankaran, S.; Mishra, A.; Ehsani, R.; Davis, C. A review of advanced techniques for detecting plant diseases. Comput. Electron. Agric. 2010, 72, 1–13.
- McBratney, A.; Whelan, B.; Ancev, T.; Bouma, J. Future directions of precision agriculture. Precis. Agric. 2005, 6, 7–23.
- Thenkabail, P.S.; Lyon, J.G.; Huete, A. Hyperspectral Indices and Image Classifications for Agriculture and Vegetation, 2nd ed.; CRC Press: Boca Raton, FL, USA, 2019.
- Li, C.; Xue, J.; Su, B. Significant remote sensing vegetation indices: A review of developments and applications. J. Sens. 2017, 2017, 1353691.
- Nascimento, S.M.; Amano, K.; Foster, D.H. Spatial distributions of local illumination color in natural scenes. Vis. Res. 2016, 120, 39–44.
- Prudnikova, E.; Savin, I.; Vindeker, G.; Grubina, P.; Shishkonakova, E.; Sharychev, D. Influence of soil background on spectral reflectance of winter wheat crop canopy. Remote Sens. 2019, 11, 1932.
- Moravec, D.; Komárek, J.; López-Cuervo Medina, S.; Molina, I. Effect of atmospheric corrections on NDVI: Intercomparability of Landsat 8, Sentinel-2, and UAV sensors. Remote Sens. 2021, 13, 3550.
- Stamford, J.D.; Vialet-Chabrand, S.; Cameron, I.; Lawson, T. Development of an accurate low cost NDVI imaging system for assessing plant health. Plant Methods 2023, 19, 9.
- Yang, X.; Zuo, X.; Xie, W.; Li, Y.; Guo, S.; Zhang, H. A correction method of NDVI topographic shadow effect for rugged terrain. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8456–8472.
- Guo, Y.; Wang, C.; Lei, S.; Yang, J.; Zhao, Y. A framework of spatio-temporal fusion algorithm selection for Landsat NDVI time series construction. ISPRS Int. J. Geo-Inf. 2020, 9, 665.
- Dougherty, T.R.; Jain, R.K. Invisible walls: Exploration of microclimate effects on building energy consumption in New York City. Sustain. Cities Soc. 2023, 90, 104364.
- AlSuwaidi, A.; Veys, C.; Hussey, M.; Grieve, B.; Yin, H. Hyperspectral selection based algorithm for plant classification. In Proceedings of the 2016 IEEE International Conference on Imaging Systems and Techniques (IST), Chania, Greece, 4–6 October 2016; pp. 395–400.
- Ramanath, A.; Muthusrinivasan, S.; Xie, Y.; Shekhar, S.; Ramachandra, B. NDVI versus CNN features in deep learning for land cover clasification of aerial images. In Proceedings of the IGARSS 2019—2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 6483–6486.
- Zaitunah, A.; Samsuri; Marbun, Y.M.H.; Susilowati, A.; Elfiati, D.; Syahputra, O.K.H.; Arinah, H.; Rangkuti, A.B.; Rambey, R.; Harahap, M.M.; et al. Vegetation density analysis using normalized difference vegetation index in East Jakarta, Indonesia. IOP Conf. Ser. Earth Environ. Sci. 2021, 912, 012053.
- Franke, J.; Heinzel, V.; Menz, G. Assessment of NDVI- differences caused by sensor specific relative spectral response functions. In Proceedings of the 2006 IEEE International Symposium on Geoscience and Remote Sensing, Denver, CO, USA, 31 July–4 August 2006; pp. 1138–1141.
- Gong, C.; Yin, R.; Long, T.; Jiao, W.; He, G.; Wang, G. Spatial–temporal approach and dataset for enhancing cloud detection in Sentinel-2 imagery: A case study in China. Remote Sens. 2024, 16, 973.
- Revel, C.; Deville, Y.; Achard, V.; Briottet, X.; Weber, C. Inertia-constrained pixel-by-pixel nonnegative matrix factorisation: A hyperspectral unmixing method dealing with intra-class variability. Remote Sens. 2018, 10, 1706.
- Alsuwaidi, A.; Veys, C.; Hussey, M.; Grieve, B.; Yin, H. Hyperspectral feature selection ensemble for plant classification. In Proceedings of the Hyperspectral Imaging and Applications (HSI 2016), Coventry, UK, 12–13 October 2016.
- Miftahushudur, T.; Grieve, B.; Yin, H. Ensemble synthetic oversampling with manhattan distance for unbalanced hyperspectral data. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2021, Manchester, UK, 25–27 November 2021; Lecture Notes in Computer Science (LNCS). Volume 13113.
- Miftahushudur, T.; Sahin, H.M.; Grieve, B.; Yin, H. Enhanced SVM-SMOTE with cluster consistency for imbalanced data classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2023, Évora, Portugal, 22–24 November 2023; Quaresma, P., Camacho, D., Yin, H., Gonçalves, T., Julian, V., Tallón-Ballesteros, A.J., Eds.; Springer: Cham, Switzerland, 2023; pp. 431–441.
- Miftahushudur, T.; Grieve, B.; Yin, H. Permuted KPCA and SMOTE to guide GAN-based oversampling for imbalanced HSI classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 489–505.
- Peng, Y.; Dallas, M.M.; Ascencio-Ibáñez, J.T.; Hoyer, J.S.; Legg, J.; Hanley-Bowdoin, L.; Grieve, B.; Yin, H. Early detection of plant virus infection using multispectral imaging and spatial–spectral machine learning. Sci. Rep. 2022, 12, 3113.
- Sahin, H.M.; Grieve, B.; Yin, H. Automatic multispectral image classification of plant virus from leaf samples. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2020, Guimaraes, Portugal, 4–6 November 2020; Analide, C., Novais, P., Camacho, D., Yin, H., Eds.; Springer: Cham, Switzerland, 2020; pp. 374–384.
- Sahin, H.M.; Grieve, B.; Yin, H. Combining of Markov random field and convolutional neural networks for hyper/multispectral image classification. In Proceedings of the Intelligent Data Engineering and Automated Learning—IDEAL 2023, Évora, Portugal, 22–24 November 2023; Quaresma, P., Camacho, D., Yin, H., Gonçalves, T., Julian, V., Tallón-Ballesteros, A.J., Eds.; Springer: Cham, Switzerland, 2023; pp. 28–38.
- Sahin, H.M.; Miftahushudur, T.; Grieve, B.; Yin, H. Segmentation of weeds and crops using multispectral imaging and CRF-enhanced U-Net. Comput. Electron. Agric. 2023, 211, 107956.
- Isinkaye, F.O.; Olusanya, M.O.; Singh, P.K. Deep learning and content-based filtering techniques for improving plant disease identification and treatment recommendations: A comprehensive review. Heliyon 2024, 10, e29583.
- Kong, Y.L.; Huang, Q.; Wang, C.; Chen, J.; Chen, J.; He, D. Long short-term memory neural networks for online disturbance detection in satellite image time series. Remote Sens. 2018, 10, 452.
- Leevy, J.L.; Khoshgoftaar, T.M.; Bauder, R.A.; Seliya, N. A survey on addressing high-class imbalance in big data. J. Big Data 2018, 5, 42.
- Ojo, M.O.; Zahid, A. Improving deep learning classifiers performance via preprocessing and class imbalance approaches in a plant disease detection pipeline. Agronomy 2023, 13, 887.
- Walsh, R.; Tardy, M. A Comparison of techniques for class imbalance in deep learning classification of breast cancer. Diagnostics 2023, 13, 67.
- Cheah, P.C.Y.; Yang, Y.; Lee, B.G. Enhancing financial fraud detection through addressing class imbalance using hybrid SMOTE-GAN techniques. Int. J. Financ. Stud. 2023, 11, 110.
- Xiang, Y.; Yao, J.; Yang, Y.; Yao, K.; Wu, C.; Yue, X.; Li, Z.; Ma, M.; Zhang, J.; Gong, G. Real-time detection algorithm for Kiwifruit canker based on a lightweight and efficient generative adversarial network. Plants 2023, 12, 3053.
- Pesaresi, S.; Mancini, A.; Casavecchia, S. Recognition and characterization of forest plant communities through remote-sensing NDVI time series. Diversity 2020, 12, 313.
- Mahakosee, S.; Jogloy, S.; Vorasoot, N.; Theerakulpisut, P.; Holbrook, C.C.; Kvien, C.K.; Banterng, P. Seasonal variation in canopy size, light penetration and photosynthesis of three cassava genotypes with different canopy Architectures. Agronomy 2020, 10, 1554.
- He, J.; Cheng, M.X. Weighting methods for rare event identification from imbalanced datasets. Front. Big Data 2021, 4, 715320.
- Singh, V.; Sharma, N.; Singh, S. A review of imaging techniques for plant disease detection. Artif. Intell. Agric. 2020, 4, 229–242.
- Brancalion, P.H.; Meli, P.; Tymus, J.R.; Lenti, F.E.; Benini, R.M.; Silva, A.P.M.; Isernhagen, I.; Holl, K.D. What makes ecosystem restoration expensive? A systematic cost assessment of projects in Brazil. Biol. Conserv. 2019, 240, 108274.
- Johnson, J.M.; Khoshgoftaar, T.M. Survey on deep learning with class imbalance. J. Big Data 2019, 6, 27.
- Kosolwattana, T.; Liu, C.; Hu, R.; Han, S.; Chen, H.; Lin, Y. A self-inspected adaptive SMOTE algorithm (SASMOTE) for highly imbalanced data classification in healthcare. BioData Min. 2023, 16, 15.
- Hou, C.; Zhuang, J.; Tang, Y.; He, Y.; Miao, A.; Huang, H.; Luo, S. Recognition of early blight and late blight diseases on potato leaves based on graph cut segmentation. J. Agric. Food Res. 2021, 5, 100154.
- Yap, B.W.; Rani, K.A.; Abd Rahman, H.A.; Fong, S.; Khairudin, Z.; Abdullah, N.N. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets. In Proceedings of the First International Conference on Advanced Data and Information Engineering (DaEng-2013), Kuala Lumpur, Malaysia, 16–18 December 2013. Lecture Notes in Electrical Engineering.
- Azevedo, B.F.; Rocha, A.M.A.; Pereira, A.I. Hybrid approaches to optimization and machine learning methods: A systematic literature review. Mach. Learn. 2024, 113, 4055–4097.
- Wang, S.; Yao, X. Multiclass imbalance problems: Analysis and potential solutions. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2012, 42, 1119–1130.
- Chen, H.; Wei, J.; Huang, H.; Yuan, Y.; Wang, J. Review of imbalanced fault diagnosis technology based on generative adversarial networks. J. Comput. Des. Eng. 2024, 11, 99–124.
- Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. 2002, 6, 429–449.
- Roh, Y.; Heo, G.; Whang, S.E. A survey on data collection for machine learning: A big data-AI integration perspective. IEEE Trans. Knowl. Data Eng. 2021, 33, 1328–1347.
- Taner, A.; Mengstu, M.T.; Selvi, K.Ç.; Duran, H.; Kabaş, Ö.; Gür, İ.; Karaköse, T.; Gheorghiță, N.-E. Multiclass apple varieties classification using machine learning with histogram of oriented gradient and color moments. Appl. Sci. 2023, 13, 7682.
- Taner, A.; Mengstu, M.T.; Selvi, K.Ç.; Duran, H.; Gür, İ.; Ungureanu, N. Apple varieties classification using deep features and machine learning. Agriculture 2024, 14, 252.
- Yu, F.; Lu, T.; Xue, C. Deep learning-based intelligent apple variety classification system and model interpretability analysis. Foods 2023, 12, 885.
- Hase, N.; Ito, S.; Kaneko, N.; Sumi, K. Data augmentation for intra-class imbalance with generative adversarial network. In Proceedings of the Fourteenth International Conference on Quality Control by Artificial Vision, Mulhouse, France, 15–17 May 2019; Cudel, C., Bazeille, S., Verrier, N., Eds.; International Society for Optics and Photonics (SPIE): San Diego, CA, USA, 2019; Volume 11172, p. 1117206.
- Ahmed, S.; Hasan, M.B.; Ahmed, T.; Sony, M.R.K.; Kabir, M.H. Less is more: Lighter and faster deep neural architecture for tomato leaf disease classification. IEEE Access 2022, 10, 68868–68884.
- Khare, O.; Mane, S.; Kulkarni, H.; Barve, N. LeafNST: An improved data augmentation method for classification of plant disease using object-based neural style transfer. Discov. Artif. Intell. 2024, 4, 50.
- Sauber-Cole, R.; Khoshgoftaar, T.M. The use of generative adversarial networks to alleviate class imbalance in tabular data: A survey. J. Big Data 2022, 9, 98.
- Temraz, M.; Kenny, E.M.; Ruelle, E.; Shalloo, L.; Smyth, B.; Keane, M.T. Handling climate change using counterfactuals: Using counterfactuals in data augmentation to predict crop growth in an uncertain climate future. In Case-Based Reasoning Research and Development, Proceedings of the 29th International Conference, ICCBR 2021, Salamanca, Spain, 13–16 September 2021; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2021; Volume 12877.
- Mirzaei, A.; Bagheri, H.; Khosravi, I. Enhancing crop classification accuracy through synthetic SAR-optical data generation using deep learning. ISPRS Int. J. Geo-Inf. 2023, 12, 450.
- Sambasivam, G.; Opiyo, G.D. A predictive machine learning application in agriculture: Cassava disease detection and classification with imbalanced dataset using convolutional neural networks. Egypt. Inform. J. 2021, 22, 27–34.
- Vaidya, H.; Prasad, K.; Rajashekhar, C.; Tripathi, D.; S, R.; Shetty, J.; Swamy, K.; Y, S. A class imbalance aware hybrid model for accurate rice variety classification. Int. J. Cogn. Comput. Eng. 2025, 6, 170–182.
- Prexawanprasut, T.; Banditwattanawong, T. Improving minority class recall through a novel cluster-based oversampling technique. Informatics 2024, 11, 35.
- Provost, F.; Fawcett, T. Robust classification for imprecise environments. Mach. Learn. 2001, 42, 203–231.
- Saito, T.; Rehmsmeier, M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015, 10, e0118432.
- Williams, C.K.I. The effect of class imbalance on precision-recall curves. Neural Comput. 2021, 33, 853–857.
- Zheng, W.; Jin, M. The effects of class imbalance and training data size on classifier learning: An empirical study. SN Comput. Sci. 2020, 1, 71.
- Elkan, C. The foundations of cost-sensitive learning. In Proceedings of the IJCAI’01—17th International Joint Conference on Artificial Intelligence, San Francisco, CA, USA, 4–10 August 2001; Volume 2, pp. 973–978.
- Provost, F. Machine learning from imbalanced data sets 101. In Proceedings of the AAAI’2000 Workshop on Imbalanced Data Sets, Austin, TX, USA, 31 July 2000.
- Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2020, 14, 241–258.
- Drummond, C.; Holte, R.C. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Proceedings of the ICML’2003, Workshop on Learning from Imbalanced Data Sets II, Washington, DC, USA, 21 August 2003; Volume 11.
- Estabrooks, A.; Jo, T.; Japkowicz, N. A multiple resampling method for learning from imbalanced data sets. Comput. Intell. 2004, 20, 18–36.
- Krawczyk, B. Learning from imbalanced data: Open challenges and future directions. Prog. Artif. Intell. 2016, 5, 221–232.
- Mienye, I.D.; Sun, Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Inform. Med. Unlocked 2021, 25, 100690.
- Thai-Nghe, N.; Gantner, Z.; Schmidt-Thieme, L. Cost-sensitive learning methods for imbalanced data. In Proceedings of the 2010 International Joint Conference on Neural Networks (IJCNN), Barcelona, Spain, 18–23 July 2010; pp. 1–8.
- Khan, S.H.; Hayat, M.; Bennamoun, M.; Sohel, F.A.; Togneri, R. Cost-sensitive learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3573–3587.
- Zou, Q.; Xie, S.; Lin, Z.; Wu, M.; Ju, Y. Finding the best classification threshold in imbalanced classification. Big Data Res. 2016, 5, 2–8.
- He, H.; Ma, Y. Imbalanced Learning: Foundations, Algorithms, and Applications; Wiley: Hoboken, NJ, USA, 2013.
- Giglioni, V.; García-Macías, E.; Venanzi, I.; Ierimonti, L.; Ubertini, F. The use of receiver operating characteristic curves and precision-versus-recall curves as performance metrics in unsupervised structural damage classification under changing environment. Eng. Struct. 2021, 246, 113029.
- Mohammed, R.; Rawashdeh, J.; Abdullah, M. Machine learning with oversampling and undersampling techniques: Overview study and experimental results. In Proceedings of the 2020 11th International Conference on Information and Communication Systems, ICICS 2020, Irbid, Jordan, 7–9 April 2020.
- Alkhawaldeh, I.M.; Albalkhi, I.; Naswhan, A.J. Challenges and limitations of synthetic minority oversampling techniques in machine learning. World J. Methodol. 2023, 13, 373–378.
- Chen, W.; Yang, K.; Yu, Z.; Shi, Y.; Chen, C.L.P. A survey on imbalanced learning: Latest research, applications and future directions. Artif. Intell. Rev. 2024, 57, 137.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- He, H.; Bai, Y.; Garcia, E.A.; Li, S. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
- Han, H.; Wang, W.Y.; Mao, B.H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing, Proceedings of the International Conference on Intelligent Computing (ICIC 2005), Hefei, China, 23–26 August 2005; Huang, D.S., Zhang, X.P., Huang, G.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
- Bunkhumpornpat, C.; Sinapiromsaran, K.; Lursinsap, C. Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 13th Pacific-Asia Conference (PAKDD 2009), Bangkok, Thailand, 27–30 April 2009; Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, T.B., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 475–482.
- Douzas, G.; Bacao, F.; Last, F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf. Sci. 2018, 465, 1–20.
- Nguyen, H.M.; Cooper, E.W.; Kamei, K. Borderline over-sampling for imbalanced data classification. Int. J. Knowl. Eng. Soft Data Paradigm. 2011, 3, 4–21.
- Wilson, D.L. Asymptotic Properties of Nearest Neighbour Rules Using Edited Data. IEEE Trans. Syst. Man Cybern. 1972, SMC-2, 408–421.
- Laurikkala, J. Improving identification of difficult small classes by balancing class distribution. In Artificial Intelligence in Medicine, Proceedings of the 8th Conference on Artificial Intelligence in Medicine in Europe (AIME 2001), Cascais, Portugal, 1–4 July 2001; Quaglini, S., Barahona, P., Andreassen, S., Eds.; Springer: Berlin/Heidelberg, Germany, 2001; pp. 63–66.
- Tomek, I. Two Modifications of CNN. IEEE Trans. Syst. Man Cybern. 1976, SMC-6, 769–772.
- Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409–410, 17–26.
- Kraiem, M.S.; Sánchez-Hernández, F.; Moreno-García, M.N. Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties. An approach based on association models. Appl. Sci. 2021, 11, 8546.
- Choirunnisa, S.; Lianto, J. Hybrid method of undersampling and oversampling for handling imbalanced data. In Proceedings of the 2018 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI), Yogyakarta, Indonesia, 21–22 November 2018; pp. 276–280.
- Hosni, M.; Abnane, I.; Idri, A.; Carrillo de Gea, J.M.; Fernández Alemán, J.L. Reviewing ensemble classification methods in breast cancer. Comput. Methods Programs Biomed. 2019, 177, 89–112.
- Sainin, M.S.; Alfred, R.; Ahmad, F. Ensemble meta classifier with sampling and feature selection for data with imbalance multiclass problem. J. Inf. Commun. Technol. 2021, 20, 103–133.
- Kim, K. Noise Avoidance SMOTE in Ensemble Learning for Imbalanced Data. IEEE Access 2021, 9, 143250–143265.
- Yotsawat, W.; Wattuya, P.; Srivihok, A. A novel method for credit scoring based on cost-sensitive neural network ensemble. IEEE Access 2021, 9, 78521–78537.
- Wei, W.; Jiang, F.; Yu, X.; Du, J. An ensemble learning algorithm based on resampling and hybrid feature selection, with an application to software defect prediction. In Proceedings of the 2022 7th International Conference on Information and Network Technologies (ICINT), Okinawa, Japan, 21–23 May 2022; pp. 52–56.
- Batista, G.E.A.P.A.; Prati, R.C.; Monard, M.C. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 3, pp. 2672–2680.
- Kingma, D.P.; Welling, M. Auto-encoding Variational Bayes. arXiv 2013, arXiv:1312.6114.
- Sampath, V.; Maurtua, I.; Aguilar Martin, J.J.; Gutierrez, A. A survey on generative adversarial networks for imbalance problems in computer vision tasks. J. Big Data 2021, 8, 27.
- Blagus, R.; Lusa, L. Evaluation of SMOTE for high-dimensional class-imbalanced microarray data. In Proceedings of the 2012 11th International Conference on Machine Learning and Applications, ICMLA 2012, Boca Raton, FL, USA, 12–15 December 2012.
- Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784v1.
- Ma, Y.; Liu, K.; Guan, Z.; Xu, X.; Qian, X.; Bao, H. Background augmentation generative adversarial networks (BAGANs): Effective data generation based on GAN-augmented 3D synthesizing. Symmetry 2018, 10, 734.
- Douzas, G.; Bacao, F. Effective data generation for imbalanced learning using conditional generative adversarial networks. Expert Syst. Appl. 2018, 91, 464–471.
- Mariani, G.; Scheidegger, F.; Istrate, R.; Bekas, C.; Malossi, C. BAGAN: Data augmentation with balancing GAN. arXiv 2018, arXiv:1803.09655.
- Frid-Adar, M.; Klang, E.; Amitai, M.; Goldberger, J.; Greenspan, H. Synthetic data augmentation using GAN for improved liver lesion classification. In Proceedings of the 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), Washington, DC, USA, 4–7 April 2018.
- Pan, T.; Pedrycz, W.; Yang, J.; Wang, J. An improved generative adversarial network to oversample imbalanced datasets. Eng. Appl. Artif. Intell. 2024, 132, 107934.
- Qin, Z.; Huang, F.; Pan, J.; Niu, J.; Qin, H. Improved generative adversarial network for bearing fault diagnosis with a small number of data and unbalanced data. Symmetry 2024, 16, 358.
- Sharma, A.; Singh, P.K.; Chandra, R. SMOTified-GAN for class imbalanced pattern classification problems. IEEE Access 2022, 10, 30655–30665.
- Qian, W.; Gechter, F. Variational information bottleneck model for accurate indoor position recognition. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2020.
- Stocksieker, S.; Pommeret, D.; Charpentier, A. Data Augmentation with Variational Autoencoder for Imbalanced Dataset. arXiv 2024, arXiv:2412.07039.
- Chatterjee, S.; Maity, S.; Bhattacharjee, M.; Banerjee, S.; Das, A.K.; Ding, W. Variational autoencoder based imbalanced COVID-19 detection using chest X-ray images. New Gener. Comput. 2023, 41, 25–60.
- Mostofi, F.; Behzat Tokdemir, O.; Toğan, V. Generating synthetic data with variational autoencoder to address class imbalance of graph attention network prediction model for construction management. Adv. Eng. Inform. 2024, 62, 102606.
- Dai, W.; Ng, K.; Severson, K.; Huang, W.; Anderson, F.; Stultz, C. Generative oversampling with a contrastive variational autoencoder. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 101–109.
- Kossale, Y.; Airaj, M.; Darouichi, A. Mode collapse in generative adversarial networks: An overview. In Proceedings of the 2022 8th International Conference on Optimization and Applications (ICOA), Beijing, China, 8–11 November 2022; pp. 1–6.
- Naderi, H.; Soleimani, B.H.; Matwin, S. Generating high-fidelity images with disentangled adversarial VAEs and structure-aware loss. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8.
- Marques, J.A.L.; Gois, F.N.B.; do Vale Madeiro, J.P.; Li, T.; Fong, S.J. Chapter 4—Artificial neural network-based approaches for computer-aided disease diagnosis and treatment. In Cognitive and Soft Computing Techniques for the Analysis of Healthcare Data; Bhoi, A.K., de Albuquerque, V.H.C., Srinivasu, P.N., Marques, G., Eds.; Intelligent Data-Centric Systems; Academic Press: Cambridge, MA, USA, 2022; pp. 79–99.
- Ali, A.H.; Yaseen, M.G.; Aljanabi, M.; Abed, S.A.; GPT, C. Transfer learning: A new promising techniques. Mesopotamian J. Big Data 2023, 2023, 29–30.
- Hadhrami, E.A.; Mufti, M.A.; Taha, B.; Werghi, N. Transfer learning with convolutional neural networks for moving target classification with micro-Doppler radar spectrograms. In Proceedings of the 2018 International Conference on Artificial Intelligence and Big Data, ICAIBD 2018, Chengdu, China, 26–28 May 2018.
- Liu, T.; Alibhai, S.; Wang, J.; Liu, Q.; He, X.; Wu, C. Exploring transfer learning to reduce training overhead of HPC data in machine learning. In Proceedings of the 2019 IEEE International Conference on Networking, Architecture and Storage (NAS), Enshi, China, 15–17 August 2019; pp. 1–7.
- Zhang, W.; Deng, L.; Zhang, L.; Wu, D. A survey on negative transfer. IEEE/CAA J. Autom. Sin. 2023, 10, 305–329.
- Lakkapragada, A.; Sleiman, E.; Surabhi, S.; Wall, D.P. Mitigating negative transfer in multi-task learning with exponential moving average loss weighting strategies. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, 7–14 February 2023; Volume 37.
- Zhang, H.; Liu, W.; Yang, H.; Zhou, Y.; Zhu, C.; Zhang, W. CSAL: Cost sensitive active learning for multi-source drifting stream. Knowl.-Based Syst. 2023, 277, 110771.
- Choubey, S.; Divya, D. Lightweight federated transfer learning for plant leaf disease detection and classification across multiclient cross-silo datasets. BIO Web Conf. 2024, 82, 05018.
- Upreti, K.; Singh, P.; Jain, D.; Pandey, A.K.; Gupta, A.; Singh, H.R.; Srivastava, S.K.; Prasad, J.S. Progressive loss-aware fine-tuning stepwise learning with GAN augmentation for rice plant disease detection. Multimed. Tools Appl. 2024, 83, 84565–84588.
- Ramadan, S.T.Y.; Sakib, T.; Farid, F.A.; Islam, M.S.; Abdullah, J.B.; Bhuiyan, M.R.; Mansor, S.; Karim, H.B.A. Improving wheat leaf disease classification: Evaluating augmentation strategies and CNN-based models with limited dataset. IEEE Access 2024, 12, 69853–69874.
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired Image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
- Chen, Z.; Wang, G.; Lv, T.; Zhang, X. Using a hybrid convolutional neural network with a transformer model for tomato leaf disease detection. Agronomy 2024, 14, 673.
- Ahmad, M.; Abdullah, M.; Moon, H.; Han, D. Plant disease detection in imbalanced datasets using efficient convolutional neural networks with stepwise transfer learning. IEEE Access 2021, 9, 140565–140580.
- Christakakis, P.; Giakoumoglou, N.; Kapetas, D.; Tzovaras, D.; Pechlivani, E.M. Vision transformers in optimization of AI-based early detection of Botrytis cinerea. AI 2024, 5, 1301–1323.
- Hashim, I.C.; Shariff, A.R.M.; Bejo, S.K.; Muharam, F.M.; Ahmad, K. Classification of non-infected and infected with basal stem rot disease using thermal images and imbalanced data approach. Agronomy 2021, 11, 2373.
- Hashim, I.C.; Shariff, A.R.M.; Bejo, S.K.; Muharam, F.M.; Ahmad, K. Machine-learning approach using SAR data for the classification of oil palm trees that are non-infected and infected with the basal stem rot disease. Agronomy 2021, 11, 532.
- Nafi, N.M.; Hsu, W.H. Addressing class imbalance in image-based plant disease detection: Deep generative vs. sampling-based approaches. In Proceedings of the 2020 International Conference on Systems, Signals and Image Processing (IWSSIP), Niteroi, Brazil, 1–3 July 2020; pp. 243–248.
- Xiao, T.; Liu, H.; Cheng, Y. Corn disease identification based on improved GBDT method. In Proceedings of the 2019 6th International Conference on Information Science and Control Engineering (ICISCE), Shanghai, China, 20–22 December 2019; pp. 215–219.
- Lu, Y.; Liu, M.; Li, C.; Liu, X.; Cao, C.; Li, X.; Kan, Z. Precision fertilization and irrigation: Progress and applications. AgriEngineering 2022, 4, 41.
- Selim, S.; Koc-San, D.; Selim, C.; San, B.T. Site selection for avocado cultivation using GIS and multi-criteria decision analyses: Case study of Antalya, Turkey. Comput. Electron. Agric. 2018, 154, 450–459.
- Sharififar, A.; Sarmadian, F. Coping with imbalanced data problem in digital mapping of soil classes. Eur. J. Soil Sci. 2023, 74, e13368.
- Sharififar, A.; Sarmadian, F.; Malone, B.P.; Minasny, B. Addressing the issue of digital mapping of soil classes with imbalanced class observations. Geoderma 2019, 350, 84–92.
- Sharififar, A.; Sarmadian, F.; Minasny, B. Mapping imbalanced soil classes using Markov chain random fields models treated with data resampling technique. Comput. Electron. Agric. 2019, 159, 110–118.
- Wang, L.; Wang, X.; Kooch, Y.; Song, K.; Wu, D. Improvement of data imbalance for digital soil class mapping in Eastern China. Comput. Electron. Agric. 2023, 214, 108322.
- Hu, T.; Li, K.; Ma, C.; Zhou, N.; Chen, Q.; Qi, C. Improved classification of soil As contamination at continental scale: Resolving class imbalances using machine learning approach. Chemosphere 2024, 363, 142697.
- Nalepa, J. Recent advances in multi-and hyperspectral image analysis. Sensors 2021, 21, 6002.
- Taghizadeh, M.; Gowen, A.A.; O’Donnell, C.P. Comparison of hyperspectral imaging with conventional RGB imaging for quality evaluation of Agaricus bisporus mushrooms. Biosyst. Eng. 2011, 108, 191–194.
- Blagus, R.; Lusa, L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinform. 2013, 14, 106.
- Deepa, T.; Punithavalli, M. An E-SMOTE technique for feature selection in high-dimensional imbalanced dataset. In Proceedings of the ICECT 2011—2011 3rd International Conference on Electronics Computer Technology, Kanyakumari, India, 8–10 April 2011; Volume 2.
- Qazi, N.; Raza, K. Effect of feature selection, Synthetic Minority Over-sampling (SMOTE) and under-sampling on class imbalance classification. In Proceedings of the 2012 14th International Conference on Modelling and Simulation, UKSim 2012, Cambridge, UK, 28–30 March 2012.
- A, A.S.; S, A.A. Land-cover classification with hyperspectral remote sensing image using CNN and spectral band selection. Remote Sens. Appl. Soc. Environ. 2023, 31, 100986.
- Zhan, Y.; Hu, D.; Wang, Y.; Yu, X. Semisupervised hyperspectral image classification based on generative adversarial networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 212–216.
- Zhan, Y.; Wu, K.; Liu, W.; Qin, J.; Yang, Z.; Medjadba, Y.; Wang, G.; Yu, X. Semi-supervised classification of hyperspectral data based on generative adversarial networks and neighbourhood majority voting. In Proceedings of the IGARSS 2018—2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018.
- Feng, J.; Yu, H.; Wang, L.; Cao, X.; Zhang, X.; Jiao, L. Classification of hyperspectral images based on multiclass spatial-spectral generative adversarial networks. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5329–5343.
- Zhong, Z.; Li, J.; Clausi, D.A.; Wong, A. Generative adversarial networks and conditional random fields for hyperspectral image classification. IEEE Trans. Cybern. 2020, 50, 3318–3329.
- Wang, X.; Tan, K.; Du, Q.; Chen, Y.; Du, P. Caps-TripleGAN: GAN-Assisted Capsnet for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2019, 57, 7232–7245.
- Yin, J.; Li, W.; Han, B. Hyperspectral image classification based on generative adversarial network with dropblock. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019.
- Roy, S.K.; Haut, J.M.; Paoletti, M.E.; Dubey, S.R.; Plaza, A. Generative adversarial minority oversampling for spectral-spatial hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5500615.
- Mullick, S.S.; Datta, S.; Das, S. Generative adversarial minority oversampling. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
- Fawakherji, M.; Suriani, V.; Nardi, D.; Bloisi, D.D. Shape and style GAN-based multispectral data augmentation for crop/weed segmentation in precision farming. Crop Prot. 2024, 184, 106848.
- Modak, S.; Stein, A. Synthesizing training data for intelligent weed control systems using generative AI. In Architecture of Computing Systems, Proceedings of the 37th International Conference, ARCS 2024, Potsdam, Germany, 14–16 May 2024; Fey, D., Stabernack, B., Lankes, S., Pacher, M., Pionteck, T., Eds.; Springer: Cham, Switzerland, 2024; pp. 112–126.
- Ma, X.; Deng, X.; Qi, L.; Jiang, Y.; Li, H.; Wang, Y.; Xing, X. Fully convolutional network for rice seedling and weed image segmentation at the seedling stage in paddy fields. PLoS ONE 2019, 14, e0215676.
- Bi, Z.; Li, Y.; Guan, J.; Li, J.; Zhang, P.; Zhang, X.; Han, Y.; Wang, L.; Guo, W. Weed identification in broomcorn millet field using segformer semantic segmentation based on multiple loss functions. Eng. Agric. Environ. Food 2024, 17, 27–36.
- Jun, S.; Wenjun, T.; Xiaohong, W.; Jifeng, S.; Bing, L.; Chunxia, D. Real-time recognition of sugar beet and weeds in complex backgrounds using multi-channel depth-wise separable convolution model. Trans. Chin. Soc. Agric. Eng. Trans. CSAE 2019, 35, 184–190.
- Matthews, B.W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. BBA—Protein Struct. 1975, 405, 442–451.
- Kent, A.; Berry, M.M.; Luehrs, F.U.; Perry, J.W. Machine literature searching VIII. Operational criteria for designing information retrieval systems. Am. Doc. 1955, 6, 93–101.
- Binney, N.; Hyde, C.; Bossuyt, P.M. On the origin of sensitivity and specificity. Ann. Intern. Med. 2021, 174, 401–407.
- Van Rijsbergen, C.J. Information Retrieval; Butterworth-Heinemann: Oxford, UK, 1979.
- Akosa, J.S. Predictive accuracy: A misleading performance measure for highly imbalanced data. In Proceedings of the SAS Global Forum 2017, Orlando, FL, USA, 2–5 April 2017.
- Hinojosa Lee, M.C.; Braet, J.; Springael, J. Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores. Appl. Sci. 2024, 14, 9863.
- Man, X.; Lin, J.; Yang, Y. Stock-UniBERT: A News-based cost-sensitive ensemble BERT model for stock trading. In Proceedings of the 2020 IEEE 18th International Conference on Industrial Informatics (INDIN), Warwick, UK, 20–23 July 2020; Volume 1, pp. 440–445.

Techniques | Strategy | Limitations |
---|---|---|
Algorithm-level | ||
Cost-sensitive learning [90] | Assigning weights to samples to compensate for the class imbalance. | Requires prior knowledge to choose appropriate weight values before training starts. |
Threshold moving [91] | Adjusting the classifier's decision threshold, typically set at 0.5 by default, to a value better suited to the imbalanced data. | Needs readjustment for new cases or changing data conditions. |
Ensemble learning [92] | Combining the outputs of multiple learning classifiers. | Demands a significant amount of resources and high complexity. |
Data-level | ||
Undersampling [93] | Reducing the number of samples from the majority class until both classes are balanced. | Potential for the loss of important information from the deleted majority samples. |
Oversampling [94] | Increasing the number of minority samples until it matches the majority class. | Redundant data can lead to overfitting. |
Hybrid-level | ||
Hybrid [95] | Combining data-level and algorithm-level techniques. | Potentially inherits weaknesses from one of the techniques. |
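
As a rough illustration of two algorithm-level entries in the table above, the sketch below derives inverse-frequency class weights (one common heuristic for cost-sensitive learning) and applies threshold moving to predicted probabilities. The function names, the 90/10 label split, and the probability values are hypothetical and are not taken from the surveyed works.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so their errors cost more.
    This heuristic is an assumption; other weighting schemes exist.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def predict_with_threshold(probabilities, threshold=0.5):
    """Label a sample positive (1) when its positive-class probability
    reaches the chosen decision threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical imbalanced label set: 90 majority (0), 10 minority (1).
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)

# Hypothetical minority-class probabilities from some classifier.
probs = [0.10, 0.35, 0.48, 0.70]
default_preds = predict_with_threshold(probs)               # threshold 0.5
moved_preds = predict_with_threshold(probs, threshold=0.3)  # moved threshold
```

In this toy case, lowering the threshold from 0.5 to 0.3 turns two additional samples into positive predictions, which typically trades precision for minority-class recall.
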

Method | Strategy | Limitation |
---|---|---|
ROS | Randomly duplicating existing samples of the minority class until the number of minority samples equals that of the majority class. | ROS is prone to overfitting, as the model may become too tailored to the duplicated samples of the minority class, thus reducing its ability to generalize to new data. |
SMOTE [105] | Generating new minority samples by interpolating between minority samples that lie close to each other in the feature space. Specifically, for a reference minority sample, SMOTE randomly selects one of its k nearest minority-class neighbours and creates a synthetic instance at a random point on the line between the reference sample and that neighbour. | SMOTE has a potential overlap issue. Since it does not consider the existence of the majority class, the synthetic samples may be located within majority class samples, reducing majority class performance. |
ADASYN [106] | An extension of SMOTE that focuses on generating more synthetic data in minority class areas that are difficult for classification algorithms to learn, thus improving classification performance for unevenly distributed minority classes. | ADASYN may generate synthetic samples around noisy instances, amplifying the effect of noise and potentially degrading model performance. |
Borderline SMOTE [107] | Concentrates on minority class samples located in the border area, which are difficult to classify and frequently misclassified. Uses KNN to identify border samples and generates new data based on them. | Sensitive to noise: if the reference samples are noisy, Borderline SMOTE may amplify the noise through the synthetic samples generated from them. Success heavily relies on hyperparameter settings, especially the number of nearest neighbours. |
SL-SMOTE [108] | Aims to assess the safety level of each minority class sample to avoid generating unwanted noise. Categorizes reference samples into noisy, borderline, and safe areas; generates new samples specifically in the safe area. | New synthetic data in safe areas, away from decision boundaries, may not significantly impact heavily misclassified border areas. Less effective for complex datasets. |
K-Means SMOTE [109] | Combines K-Means clustering with SMOTE; clusters minority class samples into groups using K-Means and applies SMOTE to each cluster. | Sensitive to outliers, as they can affect clustering. The effectiveness of oversampling relies on the number of clusters (K) parameter. |
SVM SMOTE [110] | Uses SVM to identify support vector samples near the decision boundary and generates new samples around these support vectors. | Its effectiveness depends on the SVM model and requires optimal hyperparameter tuning. |
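
The interpolation step described in the SMOTE row can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the reference implementation from [105]: the function name `smote_sketch`, the toy minority points, and the defaults `k=3` and `seed=0` are all hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        ref = minority[i]
        # Rank the other minority samples by Euclidean distance to ref.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(ref, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the segment from ref to the neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(r + gap * (n - r) for r, n in zip(ref, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
```

Because every synthetic point is a convex combination of two minority samples, it stays inside the region spanned by the minority class. The majority class is never consulted, which is the source of the overlap limitation noted in the table.
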

Title | Year | Dataset | Techniques |
---|---|---|---|
Improving Wheat Leaf Disease Classification: Evaluating Augmentation Strategies and CNN-Based Models With Limited Dataset [151] | 2024 | Wheat leaves | CycleGAN [152] and ADASYN were used for data augmentation to expand the limited dataset. Classification was then performed using CNN models. |
Using a Hybrid Convolutional Neural Network with a Transformer Model for Tomato Leaf Disease Detection [153] | 2024 | Plant Village dataset | Using a hybrid model of CNN with Transformer, utilizing CycleGAN for data augmentation to overcome the class imbalance problem. |
Enhanced SVM-SMOTE with Cluster Consistency for Imbalanced Data Classification [48]. | 2023 | Hyperspectral data of Sugar leaves | Using a modified SVM-SMOTE technique to increase the number of minority samples of early-stage diseased plants, which are often ambiguous and difficult to distinguish from both normal and infected conditions. |
Ensemble synthetic oversampling with Manhattan distance for unbalanced hyperspectral data [46] | 2021 | Hyperspectral data of Arabidopsis leaves | Modifying Safe-Level SMOTE with Manhattan distance and then extending it with ensemble learning techniques. |
Plant Disease Detection in Imbalanced Datasets Using Efficient Convolutional Neural Networks With Stepwise Transfer Learning [154] | 2021 | Plant Village dataset and Pepper dataset | Using SMOTE and GAN to overcome the data imbalance issue. |
Vision Transformers in Optimisation of AI-Based Early Detection of Botrytis cinerea [155] | 2024 | Botrytis cinerea (Gray Mold Disease) on Cucurbitaceae crops | Using DL segmentation model with Vision Transformer (ViT) encoder, combined with Cut-and-Paste method to address dataset imbalance. Multispectral imaging is used to detect disease progression. |
Classification of Non-Infected and Infected with Basal Stem Rot Disease Using Thermal Images and Imbalanced Data Approach [156] | 2021 | Oil palm trees infected with Ganoderma boninense (BSR) | Using thermal images to identify BSR-infected and non-infected oil palm trees, with data imbalance approaches such as RUS, ROS, and SMOTE. |
Machine-Learning Approach Using SAR Data for the Classification of Oil Palm Trees That Are Non-Infected and Infected with the Basal Stem Rot Disease [157] | 2021 | Oil palm trees infected with Ganoderma boninense (BSR) | Used ALOS PALSAR-2 imagery with dual polarization. SMOTE was employed to address the imbalance in data. |
Addressing Class Imbalance in Image-Based Plant Disease Detection: Deep Generative vs. Sampling-Based Approaches [158] | 2020 | Plant Village dataset | Compared a GAN-based approach with traditional sampling methods (undersampling, oversampling, SMOTE) to address data imbalance. |
Corn Disease Identification Based on improved GBDT Method [159] | 2019 | Corn leaves | Used SMOTE to address data imbalance, applied regional interpolation for image resizing, and employed Gradient Boosting Decision Tree (GBDT) for disease identification. |

Title | Year | Dataset | Techniques |
---|---|---|---|
Coping with imbalanced data problem in digital mapping of soil classes [162] | 2023 | 453 soil profiles from northwest Iran | The authors proposed three ML approaches to address the imbalanced data problem in soil mapping: Ensemble Gradient Boosting (XGB), Cost-Sensitive Decision Tree (CSDT), and One-Class Classification Combined with Multi-Class Classification (OCCM). |
Addressing the issue of digital mapping of soil classes with imbalanced class observations [163] | 2019 | 452 soil profile observations collected on a regular grid covering approximately 12,000 hectares in northwest Iran | Over- and under-sampling techniques were employed to address the imbalanced class distribution in the dataset. |
Mapping imbalanced soil classes using Markov chain random fields models treated with data resampling technique [164] | 2019 | 452 soil profile observations collected on a regular grid covering approximately 12,000 hectares in northwest Iran | Markov Chain Random Field modelling was used to predict spatial patterns of soil classes, with ROS applied prior to modelling. |
Improvement of data imbalance for digital soil class mapping in Eastern China [165] | 2023 | 316 topsoil samples collected in Eastern China | ROS and RUS techniques were applied to address class imbalance, while Random Forest (RF) was used for soil classification. |
Improved classification of soil As contamination at continental scale: Resolving class imbalances using machine learning approach [166] | 2024 | Land Use/Land Cover Area Frame Survey (LUCAS) 2009 dataset | Methods like SMOTE, SMOTE-Tomek, RUS, and the Tomek-Links technique (TLTE) were used to balance the number of samples in the contaminated and non-contaminated classes. |
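
Several entries above rely on Tomek links, either alone or combined with SMOTE. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour; removing the majority member of each pair is one common way to clean the class boundary. The sketch below is a brute-force illustration only: the function name `tomek_links` and the toy 2-D points are hypothetical and are not drawn from the cited studies.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest
    neighbours and carry different class labels."""

    def nearest(i):
        # Index of the closest other point (first one wins on ties).
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


# Hypothetical data: class 0 is the majority, class 1 the minority.
points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.0), (3.0, 0.0)]
labels = [0, 0, 0, 1, 1]

links = tomek_links(points, labels)

# Undersampling variant: drop only the majority member of each link.
majority_label = 0
to_drop = [i if labels[i] == majority_label else j for i, j in links]
```

Here the majority point at (1.0, 0.0) and the minority point at (1.1, 0.0) form the only link, so only that majority point would be removed.
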

Title | Year | Dataset | Techniques |
---|---|---|---|
Shape and style GAN-based multispectral data augmentation for crop/weed segmentation in precision farming [181] | 2024 | Multispectral image of sugar beet dataset | This research utilizes two types of GAN: cGAN and DCGAN for scene augmentation. |
Synthesising Training Data for Intelligent Weed Control Systems Using Generative AI [182] | 2024 | Multispectral image of sugar beet dataset | This study employs a generative approach to create synthetic images for training object detection systems for weed control. It combines the Segment Anything Model (SAM) for zero-shot transfer to new domains with an AI-based Stable Diffusion Model to generate synthetic images. |
Fully convolutional network for rice seedling and weed image segmentation at the seedling stage in paddy fields [183] | 2019 | RGB images of rice seedlings and weeds in paddy fields | The study applies semantic segmentation using SegNet to detect the positions of rice seedlings and weeds in paddy fields. In addition, class weight coefficients are calculated to handle the class imbalance. |
Weed identification in broomcorn millet field using Segformer semantic segmentation based on multiple loss functions [184] | 2024 | RGB images of broomcorn millet farms with 67% weed coverage | The study uses Segformer. Furthermore, a combination of dice loss and focal loss is applied to address the imbalance between positive and negative samples and to resolve the issue of small area segmentation due to densely growing weeds. |
Real-time recognition of sugar beet and weeds in complex backgrounds using multi-channel depth-wise separable convolution model [185] | 2019 | Multispectral image of sugar beet dataset | This study introduces a lightweight convolutional neural network with a codec structure for real-time sugar beet and weed recognition. A weighted loss function is then used to address pixel imbalances between soil, crops, and weeds. |
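
The loss-based remedies listed in this table (focal loss and dice loss) can be illustrated with their generic binary forms. The sketch below is not the formulation used in [184] or [185], whose exact definitions are not given here; the function names, the defaults `gamma=2.0`, `alpha=0.25`, and `eps=1e-6`, and the toy inputs are assumptions.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class (0 < p < 1)
    and y is the true label (0 or 1). The factor (1 - p_t) ** gamma
    shrinks the loss of easy, well-classified samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def dice_loss(probs, targets, eps=1e-6):
    """Soft dice loss over a flattened mask: 1 - 2|P.T| / (|P| + |T|).

    eps avoids division by zero when both masks are empty.
    """
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)


# A confident correct prediction is penalized far less than a wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)

# Dice loss is near 0 for a perfect mask and near 1 for no overlap.
perfect = dice_loss([1.0, 0.0, 1.0], [1, 0, 1])
disjoint = dice_loss([0.0, 1.0], [1, 0])
```

Dice loss depends on the overlap ratio rather than on per-pixel counts, which is why it is often paired with focal loss when the foreground occupies only a small share of the image.
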

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Miftahushudur, T.; Sahin, H.M.; Grieve, B.; Yin, H. A Survey of Methods for Addressing Imbalance Data Problems in Agriculture Applications. Remote Sens. 2025, 17, 454. https://doi.org/10.3390/rs17030454