Utilization of Machine Learning and Hyperspectral Imaging Technologies for Classifying Coated Maize Seed Vigor: A Case Study on the Assessment of Seed DNA Repair Capability

Wonggasem, Kris; Wongchaisuwat, Papis; Chakranon, Pongsan; Onwimol, Damrongvudhi

doi:10.3390/agronomy14091991


by Kris Wonggasem ¹, Papis Wongchaisuwat ¹, Pongsan Chakranon ¹ and Damrongvudhi Onwimol ²,*

¹ Department of Industrial Engineering, Faculty of Engineering, Kasetsart University, Bangkok 10900, Thailand

² Department of Agronomy, Faculty of Agriculture, Kasetsart University, Bangkok 10900, Thailand

* Author to whom correspondence should be addressed.

Agronomy 2024, 14(9), 1991; https://doi.org/10.3390/agronomy14091991

Submission received: 8 August 2024 / Revised: 25 August 2024 / Accepted: 30 August 2024 / Published: 2 September 2024

(This article belongs to the Special Issue Computer Vision and Deep Learning Technology in Agriculture: 2nd Edition)


Abstract

The conventional evaluation of maize seed vigor is a time-consuming and labor-intensive process. By contrast, this study introduces an automated, nondestructive framework for classifying maize seed vigor with different seed DNA repair capabilities using hyperspectral images. The selection of coated maize seeds for our case study also aligned well with practical applications. To ensure the accuracy and reliability of the results, rigorous data preprocessing steps were implemented to extract high-quality information from raw spectral data obtained from the hyperspectral images. In particular, commonly used pretreatment methods were explored. Instead of analyzing all the wavelengths of spectral data, a competitive adaptive reweighted sampling method was used to select more informative wavelengths, optimizing analysis efficiency. Furthermore, this study leveraged machine learning models, enriched through oversampling techniques to address data imbalance at the seed level. The results obtained using a support vector machine with enhanced techniques demonstrated promising results with 100% sensitivity, 96.91% specificity, and a 0.9807 Matthews correlation coefficient (MCC). Thus, this study highlighted the effectiveness of hyperspectral imaging and machine learning in modern seed assessment practices. By introducing a seed vigor classification system that can even accommodate coated seeds, this study offers a potential pathway for empowering seed producers in practical, real-world applications.

Keywords:

seed vigor; coated seed; hyperspectral imaging; machine learning; oversampling technique; Zea mays L.

1. Introduction

Maize is one of the most widely cultivated and traded cereal crops, with its seeds used for various purposes in agri-food systems, ranging from food and feed to industrial applications [1]. Maize seed production has increased in recent decades, holding immense significance in the global market. It plays a crucial role in ensuring food security, promoting global economic development, and meeting the diverse needs of a growing population. However, seed producers face numerous challenges related to seed vigor, which directly impacts crop productivity and quality in terms of storability and subsequent seedling establishment. In particular, seed vigor refers to the overall health of a seed and its strength to germinate quickly, produce robust seedlings, or maintain their viability during storage [2]. One significant challenge for producers is maintaining consistent seed vigor levels, as seeds can lose viability over time because of factors such as improper storage conditions, genetic degradation, or environmental stress. Ensuring uniform seed vigor across large seed lots also poses a challenge. Additionally, accurately assessing seed vigor using conventional methods can be labor-intensive and time-consuming [2].

To assess seed vigor, various characteristics are considered, such as germination speed, uniformity of germination, and the ability of seedlings to withstand environmental stresses. The direct evaluation of maize seed vigor, such as destructive field emergence, typically involves a time-consuming and labor-intensive process. Although indirect procedures for assessing corn seed vigor, such as the cold test and varied stress tests, are faster, these procedures still require ~1 week [3]. The radicle emergence (RE) test is a method used to classify seed vigor and assesses the seed’s ability to repair DNA damage [4]. The test has shown significant potential for future application to a wide variety of species, possibly using image analysis [5,6]. Although the RE test is a swift and dependable method for assessing maize seed vigor, it is still considered a destructive and time-consuming process [7]. In particular, the time required for this method may vary from 5 to 10 days, depending on the plant species, because of the waiting time required to evaluate the visible radicle emergences. More importantly, these methods may also lead to inaccurate conclusions owing to the widespread use of coated seeds in the industry, as highlighted by Yan, et al. [8]. Seed coating is a mechanism that improves seed quality by providing plant growth–promoting substances. However, the majority of automated seed analyses are performed on uncoated seeds [9]. Developing a rapid and nondestructive technique for evaluating seed vigor as defined by RE testing can revolutionize seed processing and quality assurance systems.

As promising alternative approaches, nondestructive methods such as imaging techniques are increasingly employed for seed vigor assessment, enabling efficient evaluation. In particular, hyperspectral imaging (HSI) technologies have been widely adopted for seed vigor testing along with artificial intelligence (AI) in recent years [10,11,12,13,14]. Both machine learning (ML) and deep learning (DL) are subsets of AI that enable algorithms to learn based on historical data. Traditional ML learns patterns from data through extensive preprocessing and engineered features, whereas DL, which consists of deep neural networks, automatically learns feature representations directly from raw data. DL generally benefits from large amounts of training data with relatively higher requirements for computational resources. Feng, et al. [11] provided an overview of seed viability, hyperspectral imaging technology, spectral signatures, and various preprocessing and classification techniques. The commonly used preprocessing methods included noise removal, image filtering, Savitzky–Golay (SG), multiplicative scatter correction (MSC), and standard normal variate (SNV), with a combination of ML models, such as support vector machine (SVM), principal component analysis (PCA), K-nearest neighbor, partial least square regression, and linear discriminant analysis (LDA). For DL models, convolutional neural networks (CNNs) with image input were commonly used as classification models [15].

Focusing specifically on maize seeds, HSI techniques have been used to differentiate viable and nonviable corn seeds with ML classification models such as partial least squares discriminant analysis (PLS-DA), LDA, and SVM [16,17,18,19,20]. In addition, several pretreatment analyses and their combinations have been thoroughly explored [21,22]. Recently, more advanced models such as CNNs have been employed to further improve model performance [23,24]. For instance, Pang, et al. [25] introduced a one-dimensional CNN and compared it with traditional ML models, including SVM and an extreme learning machine, to identify and predict corn seed vigor based on a spectral dataset.

Most previous studies have relied on samples deliberately subjected to artificial aging to induce variations in seed vigor [16,26,27,28]. Buijs, et al. [29] highlighted distinctions in deterioration between seeds that underwent artificial aging treatments and those stored under authentic storage conditions. By contrast, Cui, et al. [30] predicted maize seed vigor using regression models based on the extracted features of hyperspectral data retrieved from samples of air-dried seeds stored in real storage facilities with limited applicability. To the best of our knowledge, only a few studies have conducted experiments involving coated maize seeds. In a study by Jia, et al. [31], near-infrared spectroscopy could effectively distinguish between varieties of coated maize seeds with an accuracy rate of 97.5%. However, its capability was limited to handling small quantities, making it impractical for industrial use. Driven by this observation and recognizing the significant potential of HSI, we introduced ML models for the automated classification of maize seed vigor and investigated their real-world applications.

For our experiment, we arranged 120 coated maize seeds in each of the 120 trays and captured a hyperspectral image of each tray. Following image acquisition, we employed a region of interest (ROI) identification process to extract spectral data for each seed. Two common pretreatment analysis techniques, namely SNV and MSC, were further explored to enhance data quality and eliminate systematic variations or noise compared with the original data. Owing to the large number of extracted features, a wavelength selection technique, competitive adaptive reweighted sampling (CARS), was employed to retain the vital information of the entire data with a relatively small number of features. After the preprocessing steps, we examined the performance of traditional ML models using the original spectral data and the oversampled spectral data. The oversampling techniques were applied to address class imbalance, where the minority class had significantly fewer instances than the other class. The ground truth of seed vigor classification was obtained using RE testing alongside curve-fitting and clustering techniques [6].

Our main contribution includes in-depth experimentation on various combinations of pretreatment analysis, feature extraction, data oversampling techniques, and classification models to identify maize seed vigor against ground truth labels. An ROI identification process was used to extract spectral data at the seed level. All the experiments were conducted at the seed level so that the ML models could benefit from a considerably larger number of input samples than is available at the tray level. Oversampling techniques with optimized parameters were used to improve the model’s performance. Finally, we tested our proposed framework on coated seeds, a type of dataset that has rarely been used in previous studies.

2. Materials and Methods

2.1. Sample Preparation and Hyperspectral Image Acquisition

In preparation for the experiment, 120 maize (Zea mays L.) seed lots were sourced from various productions in Thailand spanning 3 years (2021–2023). The samples included a wide range of harvest times (rainy and dry season), seed producers (16 producers), and production areas (18 locations), offering a comprehensive representation of real-world practice. The film-coating technique was applied using a mixture of metalaxyl, together with chitosan and polyethylene glycol as polymers. The technique achieved an application recovery rate of ~90% while also maintaining the seed’s size, shape, and weight without any significant modifications (Figure 1B). After harvest, the seeds were stored in a hermetic bag (SGB Premium-25RZ GrainPro^®, Washington, DC, USA) at 15 °C and 50% relative humidity (RH), with monitoring facilitated using a USB data logger until their usage in the experiment.

Germination tests were evaluated using the between-paper (BP) technique (ISTA, 2023) with 4 replicates of 100 seeds. The seeds were germinated between two layers of paper wrapped in plastic bags and then placed in a cabinet germinator (Seedburo Equipment, Des Plaines, IL, USA). The RH in the germinator was kept at very near saturation and the temperature was set to 25 °C. The seeds were considered to have germinated when normal seedlings were observable, as per the ISTA guidelines [32]. Normal seedlings were counted daily until 7 days after setting to germination. The germination percentage was reported based on the tolerances between the highest and lowest germination percentages of the replicates in a single germination test (a two-way test at the 2.5% significance level). Inferential statistical analysis was based on analysis of variance (ANOVA) with a single-factor (fixed effect model) and was performed on the results at a significance level of p ≤ 0.05. Homogeneity of variance (Levene test) and normality of data were tested under the assumptions for ANOVA. The percentage data from the software was angularly transformed before ANOVA was conducted (transformed by $\arcsin\sqrt{x/100}$).
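As a minimal sketch of the angular (arcsine square-root) transformation mentioned above, the following Python function maps a percentage to its transformed value. Reporting the result in degrees is an assumption made here for readability, and the replicate values are hypothetical rather than taken from the study.

```python
import math


def angular_transform(percent: float) -> float:
    """Return arcsin(sqrt(percent / 100)) in degrees (unit is an assumption)."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percentage must lie in [0, 100]")
    return math.degrees(math.asin(math.sqrt(percent / 100.0)))


# Hypothetical germination percentages for four replicates of one seed lot.
replicates = [92.0, 88.0, 95.0, 90.0]
transformed = [angular_transform(p) for p in replicates]
```

The transformation stretches values near 0% and 100%, which is why it is commonly applied to percentage data before ANOVA.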

For the acquisition of spectral data, the HSI system, a SPECIM FX17 mounted on the LabScanner 40 × 20 from Spectral Imaging Ltd., Oulu, Finland, was utilized (Figure 1A). The system comprised a temperature-stabilized InGaAs camera equipped with an imaging spectrograph, a fore objective lens (OLET17.5 with a 38° field of view and 150 mm focusing), an illumination unit (tungsten halogen lamps, 20 W), a translation stage, and a computer equipped with data acquisition and control software (Lumo Scanner software version 2.2). NIR reflectance spectrometry was conducted in the range of 935.61–1720.23 nm with a resolution of 3.5 nm. The hyperspectral instrument was situated in an open room spanning 15 m². Additionally, the room’s curtains were closed to eliminate all external light.

The camera’s exposure time was 10 ms, and the displacement platform moved at a speed of 10 mm/s. As depicted in Figure 1, maize seed samples were arranged in a grid on a black cardboard sample tray (Figure 2), without specifying the seed orientation. As each sample moved on the electronically controlled platform, the HSI instrument scanned it and transmitted the captured hyperspectral data to the computer for storage. The original image was then corrected with black and white references to obtain the adjusted image.
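The text does not give the correction formula, so the sketch below assumes the widely used dark/white reference calibration, in which reflectance is computed per band as (raw − dark) / (white − dark). The function name and the sample intensities are hypothetical.

```python
def calibrate_reflectance(raw, dark, white, eps=1e-12):
    """Convert raw intensities to relative reflectance, band by band.

    Assumes the common formula (raw - dark) / (white - dark); the paper
    does not state which correction formula was applied.
    """
    if not (len(raw) == len(dark) == len(white)):
        raise ValueError("raw, dark and white must have the same length")
    # eps guards against division by zero when white == dark in a band.
    return [(r - d) / max(w - d, eps) for r, d, w in zip(raw, dark, white)]


# Hypothetical two-band pixel: raw counts, dark reference, white reference.
reflectance = calibrate_reflectance([300.0, 550.0], [100.0, 50.0], [500.0, 1050.0])
```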

2.2. Ground Truth Annotation: Seed Vigor Classifications

The seed samples were randomly drawn and subsequently classified in a systematic manner to avoid bias. RE tests, modified from the ISTA rules [7], were conducted on the entire set of maize seed samples to assess seed DNA repair abilities. The BP technique was used with four replicates of 100 seeds each; the seeds were germinated between layers of paper enclosed in plastic bags and then placed within a cabinet germinator (Seedburo Equipment, Des Plaines, IL, USA). The RH in the germinator was maintained very close to saturation, with a set temperature of 25 °C. REs and physiological germination (2 mm RE) were counted daily for 7 days after setting to germination. Radicle emergence indices, including maximum radicle emergence (REMax), mean radicle emergence time (MGT), radicle emergence speed (T50), uniformity of radicle emergence (U7525), and the area under the fitted cumulative radicle emergence curve (AUC), were computed using the GERMINATOR software [33]. Subsequently, the K-means method was applied for cluster analysis based on these RE indices [6].

2.3. Data Preprocessing

In the raw spectral data, interferences from several sources could be observed. To mitigate these issues, it was crucial to preprocess the initial spectrum by minimizing the impact of noise interference, environmental factors, and variations in light intensity. Subsequently, during the ROI identification step, depicted in Figure 2, some seeds that were poorly captured or had a low signal-to-noise ratio were screened. Consequently, from the initial 14,400 seeds, 10,908 seeds remained for the next stages of the experiment. This step was conducted to locate specific areas within the entire image of the seed tray, facilitating the extraction of spectral information at an individual seed level. A morphological process using raw 2D images was initiated to manipulate the shape and structure of seeds within the image. This process included dilation and erosion, complemented by opening and closing operations. We further extracted spectral information following min–max normalization for each pixel within the seed areas. Each seed was associated with a certain number of pixels, and each pixel corresponded to one spectral value for each wavelength. For example, a seed with 516 pixels will have 516 spectral values for each wavelength. These spectral values were then averaged across the seed area, resulting in one averaged spectral value per wavelength. This process was iterated across all wavelengths for the remaining sample of 10,908 seeds, resulting in a collection of 224 values for each seed.
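The per-seed averaging described above can be sketched as follows. The sketch assumes that min–max normalization is applied across the bands of each pixel's own spectrum, which the text does not state explicitly, and it uses a toy seed of 3 pixels and 4 bands in place of, for example, 516 pixels and 224 bands.

```python
def min_max(spectrum):
    """Scale one pixel spectrum to [0, 1] (assumed to be done per pixel)."""
    lo, hi = min(spectrum), max(spectrum)
    if hi == lo:
        return [0.0] * len(spectrum)
    return [(v - lo) / (hi - lo) for v in spectrum]


def seed_mean_spectrum(pixels):
    """Average the normalized pixel spectra of one seed, band by band."""
    if not pixels:
        raise ValueError("a seed needs at least one pixel")
    normalized = [min_max(p) for p in pixels]
    n_bands = len(normalized[0])
    return [sum(p[b] for p in normalized) / len(normalized) for b in range(n_bands)]


# Toy seed: 3 pixels x 4 bands (hypothetical values).
toy_seed = [
    [1.0, 2.0, 3.0, 5.0],
    [2.0, 4.0, 6.0, 10.0],
    [0.0, 1.0, 1.0, 2.0],
]
mean_spectrum = seed_mean_spectrum(toy_seed)
```

Whatever the number of pixels in a seed, the output has exactly one value per band, which is how each seed ends up with 224 features.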

The remaining spectral data at the seed level underwent preprocessing using two commonly used techniques: SNV and MSC. These pretreatment analyses were primarily used to improve data quality compared with the original spectral data. SNV was chosen to correct baseline shifts, a common issue in spectroscopic data due to various factors. MSC, on the other hand, was selected to address multiplicative effects, which can often distort spectral patterns. We opted for these two pretreatment techniques due to their widespread use and established effectiveness in improving the reliability of models built on spectroscopic measurements. By comparing the performance of the model with and without preprocessing, we were able to demonstrate the significant benefits of using SNV and MSC prior to fitting the model.
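For illustration, the two pretreatments can be written in a few lines of plain Python. This is a sketch of the textbook definitions, not the authors' implementation: SNV centers and scales each spectrum by its own mean and standard deviation, and MSC regresses each spectrum on a reference (here the mean spectrum, an assumption) and removes the fitted offset and slope.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: (x - mean) / sample standard deviation."""
    mu = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    if sd == 0:
        raise ValueError("constant spectrum cannot be SNV-scaled")
    return [(v - mu) / sd for v in spectrum]


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference spectrum.

    Each spectrum s is modelled as s = intercept + slope * reference
    (least squares) and corrected as (s - intercept) / slope.
    """
    n_bands = len(spectra[0])
    if reference is None:
        # Default reference: band-wise mean of all spectra (assumption).
        reference = [sum(s[b] for s in spectra) / len(spectra) for b in range(n_bands)]
    ref_mean = statistics.fmean(reference)
    sxx = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = statistics.fmean(s)
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / sxx
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


# Hypothetical spectra: the second is the first with scatter-like distortion.
base = [1.0, 2.0, 4.0, 3.0]
distorted = [2.0 * v + 0.5 for v in base]
msc_out = msc([base, distorted], reference=base)
snv_out = [snv(base), snv(distorted)]
```

In this toy case both methods map the distorted spectrum back onto the same shape as the base spectrum, which is the effect the pretreatments are meant to achieve on real data.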

Additionally, we implemented the CARS technique to select informative wavelengths, effectively reducing the dimensionality of the feature space. CARS selected important wavelengths using partial least squares regression and discriminant analysis. It improved model accuracy by iteratively choosing the most significant variables for prediction. This feature extraction step was typically crucial for generating meaningful insights for the model, directly influencing the performance of the ML algorithms.
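One ingredient of CARS, as generally described in the chemometrics literature, is an exponentially decreasing function that fixes how many wavelengths survive each sampling run. The sketch below shows only that schedule; the PLS model fitting, the adaptive reweighted sampling, and the cross-validation used to pick the final subset are omitted. The number of runs (50) is a hypothetical setting, whereas 224 is the number of wavelengths per seed reported above.

```python
import math


def cars_retention_ratios(n_variables: int, n_runs: int):
    """Ratio of variables kept at run i = 1..n_runs: r_i = a * exp(-k * i).

    a and k are chosen so that all variables are kept in the first run
    and only two remain in the last run.
    """
    if n_runs < 2 or n_variables < 2:
        raise ValueError("need at least two runs and two variables")
    a = (n_variables / 2) ** (1 / (n_runs - 1))
    k = math.log(n_variables / 2) / (n_runs - 1)
    return [a * math.exp(-k * i) for i in range(1, n_runs + 1)]


N_WAVELENGTHS = 224  # from the text
N_RUNS = 50          # hypothetical number of sampling runs
ratios = cars_retention_ratios(N_WAVELENGTHS, N_RUNS)
kept_per_run = [max(2, round(N_WAVELENGTHS * r)) for r in ratios]
```

The schedule removes many wavelengths in early runs and few in later runs, so the search is coarse at first and progressively finer.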

2.4. ML Model Development

For the data partition, as shown in Table 1, we held out 10% of the whole data for testing, whereas the remainder was split into training and validation sets, accounting for 75% and 15% of the whole data, respectively. The validation set was deliberately designed for fine-tuning hyperparameters and selecting the most appropriate model before evaluating performance on the test set. To prevent overfitting in our high-dimensional spectral data, we deliberately utilized a slightly larger validation set. Because the classes were imbalanced, with the minority class accounting for ~15% of the samples, we used oversampling techniques, including the adaptive synthetic algorithm (ADASYN) [34] and the borderline synthetic minority oversampling algorithm (BorderlineSMOTE) [35]. These oversampling techniques were applied exclusively to the training set to generate synthetic data for the minority class, thereby aiding in model training. Our choice of oversampling was primarily motivated by the need to address the class imbalance in the dataset. This imbalance can hinder a model’s ability to learn the minority class effectively. Oversampling helps mitigate this issue by increasing the representation of the minority class, ultimately improving the model’s detection accuracy for that class.
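To make the idea of synthetic minority samples concrete, the sketch below implements the basic SMOTE-style interpolation step: a new sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified stand-in, not ADASYN or BorderlineSMOTE themselves; those methods differ mainly in how they choose which minority samples to interpolate from, and the study used the imbalanced-learn package rather than code like this. All names and toy values are hypothetical.

```python
import math
import random


def interpolate_minority(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic samples by SMOTE-style interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen base sample.
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: math.dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (hypothetical values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = interpolate_minority(toy_minority, n_new=8, k=2)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies inside the region already spanned by the minority class. Applying the step only to the training set keeps the validation and test sets free of synthetic data.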

An in-depth comparison was conducted on the training data between two traditional ML models: SVM and ensemble learning with linear discriminant analysis (ELDA). SVM finds the optimal hyperplane to separate data points into different classes with the maximum possible margin. This makes SVM particularly well-suited for our high-dimensional spectral data, as it is robust to outliers and computationally efficient, especially when combined with kernel tricks. LDA is another popular technique for analyzing spectral data, known for its ability to project data onto a lower-dimensional space that maximizes class separation. It is a statistical method that seeks a linear combination of features that best discriminates between the classes in the dataset. To further enhance the model’s performance, we opted for an ensemble version of LDA, combining multiple LDA models to improve generalization and reduce overfitting.

Typical evaluation metrics for a binary classification problem were used to assess and compare models with the various techniques implemented in our experiments. A confusion matrix comprising true positives, true negatives, false positives, and false negatives was initially constructed. Subsequently, sensitivity, specificity, overall accuracy, F1-score, the area under the receiver operating characteristic (ROC) curve, and MCC were computed to examine the model’s performance. Classification models were developed in Python 3.9.18, with Scikit-learn 1.4.1 as the main library. The libPLS MATLAB library was used to conduct variable selection through CARS analysis [36]. The imbalanced-learn Python package was employed for implementing the oversampling techniques [37]. All data analyses were conducted on a computer configured with an Intel Core i5-13400F LGA 1700 processor, 16 GB of RAM, and an NVIDIA GeForce RTX 3060 graphics card with 12 GB of GDDR6 memory.

3. Results

3.1. Seed Vigor Classification

We developed an automated and nondestructive framework to classify the vigor of coated maize seeds with different DNA repair abilities by inspecting the speed and uniformity of RE. The germination percentage of seeds that surpassed 50% exhibited a higher value of >80% (Figure 3).

Seeds begin to imbibe water once germination commences. The radicles of some higher-vigor seeds emerge within 24 h, whereas the lower-vigor seeds largely remain unchanged (Figure 4). In other words, seeds that have lower vigor experience a more extended lag period. The curve-fitting technique, which analyzes RE behavior over time, offers more precise and comprehensive data than a single RE count, which considers only one time point. The cumulative RE curves, fitted with a four-parameter Hill function, provided insight into the RE behavior of all 120 seed lots (Figure 5A). The seeds with a high germination percentage showed high REMax and AUC values, whereas their MGT, T50, and U7525 values were lower.
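A commonly used four-parameter Hill function for cumulative emergence curves is y(t) = y0 + a·t^b / (c^b + t^b), where a is the maximum cumulative emergence, b controls the steepness, c is the time at which half of a is reached, and y0 is the intercept. The sketch below only evaluates this form for two hypothetical seed lots; it assumes this parameterization matches the one used in the study and does not perform the curve fitting itself.

```python
def hill(t: float, y0: float, a: float, b: float, c: float) -> float:
    """Four-parameter Hill function: y0 + a * t**b / (c**b + t**b)."""
    if t <= 0:
        return y0
    return y0 + a * t**b / (c**b + t**b)


# Hypothetical parameters (percent emergence, time in hours).
high_vigor = {"y0": 0.0, "a": 98.0, "b": 6.0, "c": 30.0}
low_vigor = {"y0": 0.0, "a": 70.0, "b": 4.0, "c": 60.0}

hours = list(range(0, 169, 24))  # daily counts over 7 days
high_curve = [hill(t, **high_vigor) for t in hours]
low_curve = [hill(t, **low_vigor) for t in hours]
```

With these illustrative parameters, the low-vigor curve rises later and levels off lower than the high-vigor curve, mirroring the longer lag period described above.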

To generate ground truth labels, air-dried seed samples were classified using a combination of RE tests, curve-fitting techniques, and clustering methods. Based on silhouette analysis, two clusters (K = 2) were selected, dividing seed vigor into two classes named high- and low-vigor (Figure 5B). In total, 19 of 120 lots were classified as low-vigor seeds. These annotations were used as ground truth for training ML models in the following step.

3.2. Oversampling Algorithm Performances

Pretreatment analysis, feature selection, and ML models were integrated into a single pipeline. Because the data were imbalanced, we focused on a combination of sensitivity, specificity, and MCC as the primary performance criteria, ensuring that the model effectively predicted both the majority and minority classes. The higher these primary metrics, the better the model’s performance. We also experimented with two oversampling techniques in comparison with the original data. To investigate the effect of oversampling parameters on model performance on the test set, we varied the number of oversampled instances of the minority class in the training data. In particular, the low-vigor class, which is the minority class with 1323 instances in the training data, was oversampled by 50%, 100%, 200%, 300%, and 400% of its original count. These percentages indicate an augmentation relative to the original count, such that 50% corresponds to a 50% increase, yielding a total of 150% of the original count. The computed MCC results from these experiments are shown in Figure 6 for the ADASYN and BorderlineSMOTE oversampling techniques.
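The arithmetic behind the oversampling percentages can be checked with a few lines of Python. The minority count of 1323 comes from the text; rounding fractional counts down is an assumption that only matters for the 50% setting.

```python
ORIGINAL_MINORITY = 1323  # low-vigor instances in the training data (from the text)


def oversampled_count(original: int, increase_percent: int) -> int:
    """Total minority instances after adding increase_percent of the original.

    Fractional synthetic counts are rounded down (assumption).
    """
    return original + original * increase_percent // 100


totals = {pct: oversampled_count(ORIGINAL_MINORITY, pct) for pct in (50, 100, 200, 300, 400)}
```

For a 200% increase this gives 1323 + 2 × 1323 = 3969 minority instances.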

3.3. ML Model Performances

According to Figure 6, oversampling the minority class by 200% of its original size tended to outperform the other oversampling parameters regardless of the pretreatment analysis, oversampling technique, and ML model. Specifically, oversampling the minority class from 1323 to 3969 instances in the training data provided the best performance. With the selected 200% oversampling parameter, we summarized the sensitivity, specificity, and MCC of all the experimental settings in Table 2. In addition, we compared these results with our baseline model, which used the original data without any pretreatment analysis, feature selection method, or oversampling technique. Furthermore, we constructed the corresponding confusion matrix and ROC curve of the best model, as illustrated in Figure 7. This selected model yielded 99.45% overall accuracy, 99.34% precision, and a 99.67% F1-score.

4. Discussion

Seed vigor is a physiological characteristic that cannot be assessed merely via exterior physical appearance. Currently, assessing seed vigor through RE behavior is a reliable method [2,38]. This approach benefits seed producers because it effectively assesses DNA repair ability, which is the primary factor that numerous seed vigor testing methods prioritize. Owing to its ability to detect seed deterioration at an early stage, seed producers can estimate a decrease in germination performance in advance. However, this technique has disadvantages, particularly that it is destructive and time-consuming. By contrast, our nondestructive, rapid methodology provided promising results within a relatively short time frame. Conventional approaches require approximately 5 days, whereas ours required no more than a minute. More importantly, we tested our framework on seeds coated with water-soluble fungicides and colorants. In a real-world application, the farmer consequently gains significant advantages over utilizing traditional uncoated seeds. Seeds can be assessed on the production line without interrupting the entire process or even during their storage period. Unsold seeds can also be examined without additional destructive processing, aiding efficient inventory management. Although working with coated seeds is necessary for industrial seed testing, analysts often face substantial challenges owing to seed coatings. Interestingly, our model was able to detect the vigor beneath the coating without harming the seed or exposing the operator to hazardous radiation [39].

The vigor of maize seeds was classified into two separate classes: high- and low-vigor (Figure 5). Of 120 lots, 19 were classified as low-vigor seeds. The criteria used to classify vigor were indicative of the seeds’ DNA repair ability. Reportedly, DNA damage is repaired during the initial phases of germination, and this repair is critical for the emergence of the radicle [2]. Aged (low-vigor) seeds that have accumulated damage require an extended lag period for repair before radicle emergence, resulting in low REMax and AUC and high MGT, T50, and U7525. By contrast, unaged (high-vigor) seeds experience radicle emergence more rapidly, with high REMax and AUC, but low MGT, T50, and U7525. The study revealed that seeds with low vigor (Figure 3, Figure 4 and Figure 5) showed a reduced germination percentage or maximum RE compared with seeds with high vigor. This finding supports the hypothesis that high-vigor seeds show better RE speed, RE uniformity, and AUC [40].

Our proposed framework used ML models with engineered features based on spectral data obtained via HSI. Owing to imbalances among vigor classes in our dataset, we used oversampling techniques to improve model performance. Importantly, the minority class constituted ~15% of the dataset, potentially resulting in a low recall rate during model training. Higher oversampling rates led to better balance within the adjusted dataset. However, excessively high oversampling rates produced satisfactory results in the training set but potentially failed to show good generalization to the test set. Notably, we only oversampled the training set, maintaining the original class proportion in the test set to reflect its actual distribution. Our observations, as depicted in Figure 6, supported this trend, exhibiting improved model performance with increasing oversampling rates of up to 200%. However, beyond this point, performance either plateaued or experienced a slight decline in certain experimental cases. The oversampling techniques significantly improved model performance, although only marginal differences between ADASYN and BorderlineSMOTE were observed in Table 2.

When considering pretreatment analysis, minor discrepancies were noted among SNV, MSC, and the original data. The findings suggest that the ELDA model performs optimally when these preprocessing techniques are not applied, potentially owing to its sensitivity to the transformations or to noise introduced by them. By contrast, the SVM model was either resilient to this noise or possibly benefited from the scatter correction effects of these preprocessing methods. The possibility of overfitting because of noise cannot be completely ruled out for the SVM model, although the substantial amount of test data used in this study helped mitigate this risk. Nevertheless, the choice of pretreatment analysis did not appear to be the predominant factor influencing performance in our case study. Furthermore, the results suggest that SVM generally outperformed ELDA across most experimental settings, although the difference was small. Overall, SVM with a combination of SNV and features filtered via CARS based on oversampled data provided the best performance. The most important factor contributing to desirable results was incorporating the oversampling technique with a suitable sampling rate. To assess its generalizability, we further observed the performance of the selected model on the training and validation sets. Consistent results across all three partitions of the data, without any signs of overfitting, offered evidence supporting the use of our proposed framework on external datasets. This robust performance is crucial for real-world applications, especially those involving relatively large-scale data.

By combining HSI with an ML model enhanced with oversampling techniques, we successfully assessed the vigor of coated maize seeds. This method could be applied across diverse conditions, including different harvest times, seed producers, and production areas in Thailand. Our study focused on spectral data, which possibly limited its capability. Incorporating textural features, as reviewed by Rogers, Blanc-Talon, Urschler, and Delmas [12], represents a potential direction for future research. Owing to limitations in data availability, this study used commercially prepared, coated seeds obtained from the production process, which prevented a direct comparison between coated and uncoated seeds. Future studies could collect spectral data before and after the coating process. This would allow a more comprehensive evaluation of the coating’s influence on seed vigor detection results and further highlight the specific role of the coating composition. While our seed vigor classification approach was tailored specifically for maize datasets, its adaptability suggests potential applicability to other agronomic crops with minimal modifications.

A fast and nondestructive approach for monitoring the seed vigor of coated maize seeds has been developed using HSI. This method enables efficient evaluation without negatively impacting the quality of the seeds, which offers at least three benefits for a seed production company: (1) seeds can be inspected on the processing line before they are packaged and distributed; (2) seeds can be inspected during their storage period; and (3) deteriorated or returned seeds can be evaluated to optimize seed inventory management. This study used 120 lots, and external factors such as seed variety and growth environment might influence the results when scaling up. Future studies can explore how well this framework generalizes to larger and more diverse datasets and spectral data [24,28].

In addition, our future work will involve experimentation on a wider range of cereal crop seeds, including examining other data perspectives to uncover hidden potential using DL. The automated adjustment of the model settings on a diverse dataset would also improve its practicality in real-world applications. This expansion would improve our methodology’s versatility and scalability in agricultural contexts beyond maize cultivation, offering promising directions for broader agricultural research and practice. Overall, the results suggest an approach for rapidly discriminating and sorting coated corn seeds with different DNA repair abilities without touching the samples on the processing line.
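Because the discussion identifies oversampling with a suitable sampling rate as the most influential factor, a small sketch of the underlying idea may be helpful. The code below is neither ADASYN nor BorderlineSMOTE: it shows only the plain SMOTE-style interpolation between minority-class neighbours that both algorithms build on. The helper names (`interpolate_oversample`, `n_synthetic_needed`) are hypothetical, and the sampling rate is assumed to mean the minority-to-majority ratio after resampling.

```python
import random


def n_synthetic_needed(n_majority, n_minority, sampling_rate):
    """Number of synthetic minority samples needed so that
    minority / majority reaches sampling_rate (assumed definition)."""
    target = round(sampling_rate * n_majority)
    return max(0, target - n_minority)


def interpolate_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples, each placed at a random position on
    the segment between a minority sample and one of its k nearest minority
    neighbours. Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, partner)])
    return synthetic


# Class counts of the training set in Table 1; the rate 0.5 is illustrative.
needed = n_synthetic_needed(n_majority=6858, n_minority=1323, sampling_rate=0.5)

# Toy two-band "spectra" standing in for low-vigor samples (made-up values).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_synthetic = interpolate_oversample(toy_minority, n_new=5, k=2)
```

ADASYN differs by generating more synthetic points around minority samples that are surrounded by majority samples, and BorderlineSMOTE by interpolating only from minority samples near the class boundary. Synthetic samples should be added to the training partition only, so that validation and test results reflect real seeds.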

5. Conclusions

This study proposed a framework that leverages the capability of ML to process large amounts of seed-level data. Our model yielded promising results, achieving a sensitivity of 100% and a specificity of 96.91% with an SVM trained on SNV-pretreated, oversampled data. Oversampling notably improved the model’s ability to predict the minority class. Using a dataset of coated seeds, our study extended the applicability of this rapid, nondestructive method to diverse seed types. Hyperspectral imaging and ML can precisely determine the vigor of maize seeds by analyzing their DNA repair capacity, rather than relying on conventional field emergence testing methods. This would benefit seed producers in the crop industry by enhancing their quality control management practices. In conclusion, ML methods with enhanced techniques based on HSI could classify the vigor of coated maize seeds automatically, nondestructively, and efficiently.

Author Contributions

K.W.: Supervision, Visualization, Funding acquisition, Writing—review & editing. P.W.: Methodology, Supervision, Visualization, Funding acquisition, Writing—original draft, Writing—review & editing. P.C.: Data curation, Formal Analysis, Software. D.O.: Conceptualization, Formal Analysis, Funding acquisition, Project administration, Writing—review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the National Research Council of Thailand (NRCT) and Kasetsart University (KU): N42A650281. This research and innovation activity was funded by the KU Research and Development Institute (KURDI): FF(KU)35.67 and FF(KU)51.67.

Data Availability Statement

The dataset analyzed during the current study is available from the corresponding author upon request.

Acknowledgments

We thank the seed company Charoen Pokphand Produce Co., Ltd. (CPP) for kindly providing maize seed samples. In addition, we thank Amornrit Puttipipatkajorn at the Intelligent Embedded System Laboratory, Computer Engineering, Faculty of Engineering, KU, Kamphaeng Saen Campus; Achitpon Yimpin of Rhino Research; and Siraprapha Nasaree from CPP for constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

ADASYN	Adaptive synthetic algorithm
AI	Artificial intelligence
ANOVA	Analysis of variance
AUC	Area under the curve of the cumulative radicle emergence fitted curve
BorderlineSMOTE	Borderline synthetic minority oversampling algorithm
BP	Between-paper technique
CARS	Competitive adaptive reweighted sampling
CNNs	Convolutional neural networks
DL	Deep learning
DNA	Deoxyribonucleic acid
HSI	Hyperspectral imaging
LDA	Linear discriminant analysis
MCC	Matthews correlation coefficient
MGT	Mean radicle emergence times
ML	Machine learning
MSC	Multiplicative scatter correction
PCA	Principal component analysis
PLS-DA	Partial least squares discriminant analysis
RE	Radicle emergence
REMax	Maximum radicle emergence
ROC	Receiver operating characteristic
ROI	Region of interest
SG	Savitzky–Golay
SNV	Standard normal variate
SVM	Support vector machine
T50	Radicle emergence speed
U7525	Uniformity of radicle emergence

References

1. Erenstein, O.; Jaleta, M.; Sonder, K.; Mottaleb, K.; Prasanna, B.M. Global maize production, consumption and trade: Trends and R&D implications. Food Secur. 2022, 14, 1295–1319.
2. Powell, A.A. Seed vigour in the 21st century. Seed Sci. Technol. 2022, 50, 45–73.
3. Song, P.; Yue, X.; Gu, Y.; Yang, T. Assessment of maize seed vigor under saline-alkali and drought stress based on low field nuclear magnetic resonance. Biosyst. Eng. 2022, 220, 135–145.
4. Powell, A.; Matthews, S. Seed aging/repair hypothesis leads to new testing methods. Seed Technol. 2012, 34, 15–25.
5. Colmer, J.; O’Neill, C.M.; Wells, R.; Bostrom, A.; Reynolds, D.; Websdale, D.; Shiralagi, G.; Lu, W.; Lou, Q.; Le Cornu, T.; et al. SeedGerm: A cost-effective phenotyping platform for automated seed imaging and machine-learning based phenotypic analysis of crop seed germination. New Phytol. 2020, 228, 778–793.
6. Yimpin, A.; Sermwuthisarn, P.; Phimcharoen, W.; Chaisan, T.; Onwimol, D. SVRICE: An automated rice seed vigor classification system via radicle emergence testing using image-processing, curve-fitting, and clustering methods. Appl. Eng. Agric. 2022, 38, 831–843.
7. ISTA. International Rules for Seed Testing; International Rules for Seed Testing: Zürich, Switzerland, 2023.
8. Yan, S.; Zhu, Q.; Huang, M.; Zhao, X.; Liu, Z. UDATNN: A modeling scheme integrating unsupervised domain adversarial learning and tri-training strategy for variety recognition of maize seeds with domain shift. Comput. Electron. Agric. 2023, 213, 108237.
9. Javed, T.; Afzal, I.; Shabbir, R.; Ikram, K.; Saqlain Zaheer, M.; Faheem, M.; Haider Ali, H.; Iqbal, J. Seed coating technology: An innovative and sustainable approach for improving seed quality and crop performance. J. Saudi Soc. Agric. Sci. 2022, 21, 536–545.
10. Rahman, A.; Cho, B.K. Assessment of seed quality using non-destructive measurement techniques: A review. Seed Sci. Res. 2016, 26, 285–305.
11. Feng, L.; Zhu, S.; Liu, F.; He, Y.; Bao, Y.; Zhang, C. Hyperspectral imaging for seed quality and safety inspection: A review. Plant Methods 2019, 15, 91.
12. Rogers, M.; Blanc-Talon, J.; Urschler, M.; Delmas, P. Wavelength and texture feature selection for hyperspectral imaging: A systematic literature review. J. Food Meas. Charact. 2023, 17, 6039–6064.
13. Saha, D.; Manickavasagan, A. Machine learning techniques for analysis of hyperspectral images to determine quality of food products: A review. Curr. Res. Food Sci. 2021, 4, 28–44.
14. Xia, Y.; Xu, Y.; Li, J.; Zhang, C.; Fan, S. Recent advances in emerging techniques for non-destructive detection of seed viability: A review. Artif. Intell. Agric. 2019, 1, 35–47.
15. Ma, T.; Tsuchikawa, S.; Inagaki, T. Rapid and non-destructive seed viability prediction using near-infrared hyperspectral imaging coupled with a deep learning approach. Comput. Electron. Agric. 2020, 177, 105683.
16. Wang, Y.; Peng, Y.; Qiao, X.; Zhuang, Q. Discriminant analysis and comparison of corn seed vigor based on multiband spectrum. Comput. Electron. Agric. 2021, 190, 106444.
17. Wakholi, C.; Kandpal, L.M.; Lee, H.; Bae, H.; Park, E.; Kim, M.S.; Mo, C.; Lee, W.-H.; Cho, B.-K. Rapid assessment of corn seed viability using short wave infrared line-scan hyperspectral imaging and chemometrics. Sens. Actuators B Chem. 2018, 255, 498–507.
18. Ambrose, A.; Kandpal, L.M.; Kim, M.S.; Lee, W.-H.; Cho, B.-K. High speed measurement of corn seed viability using hyperspectral imaging. Infrared Phys. Technol. 2016, 75, 173–179.
19. Qiu, Z.J.; Chen, J.; Zhao, Y.Y.; Zhu, S.S.; He, Y.; Zhang, C. Variety Identification of Single Rice Seed Using Hyperspectral Imaging Combined with Convolutional Neural Network. Appl. Sci. 2018, 8, 212.
20. Liu, S.; Chen, Z.; Jiao, F. Detection of maize seed germination rate based on improved locally linear embedding. Comput. Electron. Agric. 2023, 204, 107514.
21. Pang, T.; Chen, C.; Fu, R.; Wang, X.; Yu, H. An end-to-end seed vigor prediction model for imbalanced samples using hyperspectral image. Front. Plant Sci. 2023, 14, 1322391.
22. Zhang, T.; Lu, L.; Yang, N.; Fisk, I.D.; Wei, W.; Wang, L.; Li, J.; Sun, Q.; Zeng, R. Integration of hyperspectral imaging, non-targeted metabolomics and machine learning for vigour prediction of naturally and accelerated aged sweetcorn seeds. Food Control 2023, 153, 109930.
23. Zhao, X.; Pang, L.; Wang, L.; Men, S.; Yan, L. Deep Convolutional Neural Network for Detection and Prediction of Waxy Corn Seed Viability Using Hyperspectral Reflectance Imaging. Math. Comput. Appl. 2022, 27, 109.
24. Fan, Y.; An, T.; Wang, Q.; Yang, G.; Huang, W.; Wang, Z.; Zhao, C.; Tian, X. Non-destructive detection of single-seed viability in maize using hyperspectral imaging technology and multi-scale 3D convolutional neural network. Front. Plant Sci. 2023, 14, 1248598.
25. Pang, L.; Men, S.; Yan, L.; Xiao, J. Rapid Vitality Estimation and Prediction of Corn Seeds Based on Spectra and Images Using Deep Learning and Hyperspectral Imaging Techniques. IEEE Access 2020, 8, 123026–123036.
26. Feng, L.; Zhu, S.; Zhang, C.; Bao, Y.; Feng, X.; He, Y. Identification of Maize Kernel Vigor under Different Accelerated Aging Times Using Hyperspectral Imaging. Molecules 2018, 23, 3078.
27. Wang, Y.; Peng, Y.; Zhuang, Q.; Zhao, X. Feasibility analysis of NIR for detecting sweet corn seeds vigor. J. Cereal Sci. 2020, 93, 102977.
28. Xu, P.; Zhang, Y.; Tan, Q.; Xu, K.; Sun, W.; Xing, J.; Yang, R. Vigor identification of maize seeds by using hyperspectral imaging combined with multivariate data analysis. Infrared Phys. Technol. 2022, 126, 104361.
29. Buijs, G.; Willems, L.A.J.; Kodde, J.; Groot, S.P.C.; Bentsink, L. Evaluating the EPPO method for seed longevity analyses in Arabidopsis. Plant Sci. 2020, 301, 110644.
30. Cui, H.; Bing, Y.; Zhang, X.; Wang, Z.; Li, L.; Miao, A. Prediction of Maize Seed Vigor Based on First-Order Difference Characteristics of Hyperspectral Data. Agronomy 2022, 12, 1899.
31. Jia, S.; An, D.; Liu, Z.; Gu, J.; Li, S.; Zhang, X.; Zhu, D.; Guo, T.; Yan, Y. Variety identification method of coated maize seeds based on near-infrared spectroscopy and chemometrics. J. Cereal Sci. 2015, 63, 21–26.
32. ISTA. ISTA Handbook on Seedling Evaluation, 4th ed.; Don, R., Ducournau, S., Eds.; International Rules for Seed Testing: Zürich, Switzerland, 2018.
33. Joosen, R.V.; Kodde, J.; Willems, L.A.; Ligterink, W.; Van Der Plas, L.H.; Hilhorst, H.W. GERMINATOR: A software package for high-throughput scoring and curve fitting of Arabidopsis seed germination. Plant J. 2010, 62, 148–159.
34. Haibo, H.; Yang, B.; Garcia, E.A.; Shutao, L. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), Hong Kong, China, 1–8 June 2008; pp. 1322–1328.
35. Han, H.; Wang, W.-Y.; Mao, B.-H. Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning. In Advances in Intelligent Computing; Springer: Berlin/Heidelberg, Germany, 2005; pp. 878–887.
36. Li, H.-D.; Xu, Q.-S.; Liang, Y.-Z. libPLS: An integrated library for partial least squares regression and linear discriminant analysis. Chemom. Intell. Lab. Syst. 2018, 176, 34–43.
37. Lemaître, G.; Nogueira, F.; Aridas, C.K. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. J. Mach. Learn. Res. 2017, 18, 1–5.
38. Matthews, S.; Beltrami, E.; El-Khadem, R.; Khajeh-Hosseini, M.; Nasehzadeh, M.; Urso, G. Evidence that time for repair during early germination leads to vigour differences in maize. Seed Sci. Technol. 2011, 39, 501–509.
39. Hamdy, S.; Charrier, A.; Corre, L.L.; Rasti, P.; Rousseau, D. Toward robust and high-throughput detection of seed defects in X-ray images via deep learning. Plant Methods 2024, 20, 63.
40. Wagner, M.-H.; Powell, A.A.; Dupont, A.; Shinohara, T.; Ducournau, S. Radicle emergence test for cabbage can be assessed using multispectral imaging. Seed Sci. Technol. 2023, 51, 291–296.

Figure 1. Physical map of hyperspectral imaging system. The hyperspectral instrument for image acquisition (A) and maize seed characteristics employed for the acquisition of spectral data (B).

Figure 2. ROI process with spectral extraction.

Figure 3. Seed germination of 120 coated maize seed samples was obtained from different lots produced in Thailand over 3 years. The germinator was set to 25 °C. Error bars denote confidence intervals (n = 4; p < 0.05); missing error bars indicate ranges smaller than the symbols.

Figure 4. Illustrations of the radicle emergence characteristics of seeds with high vigor compared with those with low vigor at each time point.

Figure 5. Cumulative radicle emergence curves fitted with a four-parameter Hill function; blue and red represent high- and low-vigor seed lots, respectively (A). Classification of seed quality for 120 maize seed lots according to K-means clustering using the following metrics, which were employed as the ground truth labels: area under the fitted curve for radicle emergence, uniformity of radicle emergence, mean radicle emergence times, and radicle emergence speed (B).

Figure 6. MCC obtained from ELDA (top) and SVM (bottom) using varied oversampling parameters based on ADASYN (left) and BorderlineSMOTE (right) with original spectral, SNV, and MSC pretreatment analysis.

Figure 7. Confusion matrix (A) and ROC curve (B) of the superior model.

Table 1. Data partition of seed-level data.

Class Label	Number of Data in the Training Set (75%)	Number of Data in the Validation Set (15%)	Number of Data in the Test Set (10%)
High-vigor seed lot	6858	1372	914
Low-vigor seed lot	1323	265	176
Total	8181	1637	1090

Table 2. Comparison of the performance of the machine learning model on the test set.

Pretreatment		OG				SNV				MSC
Wavelength Selection		None	CARS			None	CARS			None	CARS
Oversampling Technique		None	None	ADASYN	Borderline SMOTE	None	None	ADASYN	Borderline SMOTE	None	None	ADASYN	Borderline SMOTE
ELDA	Accuracy	95.05	97.07	98.08	98.49		95.14	97.48	97.02		94.91	97.80	97.75
	Sensitivity	95.50	96.95	99.72	99.89		94.58	98.73	98.56		94.66	98.63	98.84
	Specificity	95.39	97.77	90.66	92.11		99.24	91.42	89.66		96.68	93.65	92.45
	Precision	99.28	99.61	97.96	98.29		99.89	98.24	97.85		99.50	98.73	98.46
	F1-score	97.10	98.26	98.83	99.09		97.16	98.48	98.21		97.02	98.68	98.65
	MCC	0.8133	0.8918	0.9343	0.9483		0.8180	0.9103	0.8945		0.8080	0.9207	0.9199
SVM	Accuracy	97.25	94.73	99.13	99.08		97.66	99.45	99.45		97.02	99.31	99.27
	Sensitivity	96.91	94.05	99.40	99.34		97.27	100	99.89		96.55	99.89	99.89
	Specificity	99.35	100	97.78	97.78		100	96.91	97.31		100	96.53	96.28
	Precision	99.89	100	99.56	99.56		100	99.34	99.45		100	99.28	99.23
	F1-score	98.37	96.93	99.48	99.45		98.62	99.67	99.67		98.24	99.59	99.56
	MCC	0.8990	0.8021	0.9686	0.9669		0.9146	0.9807	0.9805		0.8905	0.9757	0.9742
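
The metrics reported in Table 2 all derive from the four cells of a binary confusion matrix. As a hedged illustration, the sketch below computes them with their standard definitions. The function name `binary_metrics` and the example counts are hypothetical and are not taken from this study, and the choice of which vigor class is treated as positive is left to the caller.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: positive-class samples predicted correctly/incorrectly.
    tn/fp: negative-class samples predicted correctly/incorrectly.
    Values are returned as fractions; Table 2 reports percentages,
    except for MCC. Assumes no denominator is zero.
    """
    sensitivity = tp / (tp + fn)          # recall on the positive class
    specificity = tn / (tn + fp)          # recall on the negative class
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts for an imbalanced test set (not the paper's results).
example = binary_metrics(tp=90, fn=10, fp=5, tn=45)
```

Unlike accuracy, MCC uses all four cells symmetrically, so a classifier that favours the majority class is penalized even when its accuracy remains high. This is why MCC is a useful summary for imbalanced seed-level data such as that in Table 1.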

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
