1. Introduction
Rice is a crucial food source in many countries due to its rich nutritional value and adaptability. It is a major crop grown in various regions and climates [1]. Natural environmental conditions such as altitude, soil type, and precipitation differ significantly among origins, and these differences directly affect the growth cycle, nutrient content, and quality characteristics of local rice. Jilin Province is an important food production base in northeast China, and its geographical location and environmental conditions have largely shaped the quality characteristics of rice from its various origins. The province lies in a typical mid-to-high latitude, continental climate region [2]. The geographic location and environmental characteristics of each origin provide unique conditions for the quality of local rice and play a key role in determining its nutritional value, taste, and adaptability [3]. Against this background, the spectroscopic detection and identification of rice from different origins in Jilin Province is both important and necessary. An in-depth study of the spectral characteristics of rice from different origins makes it possible to screen out the best-quality rice, which can promote the modernisation and intelligent development of rice production in Jilin Province, help to improve rice quality and yield, and support the sustainable development of the local agricultural economy [4].
Spectroscopic techniques offer great potential and advantages in tracing the origin of rice. They can provide extensive chemical information about rice samples, including data on composition and structure. These data can be used to build a spectral database of rice origin characteristics and can be combined with technologies such as geographic information systems (GIS) to achieve traceability management of rice from different origins. Because spectroscopic detection is rapid and non-destructive, samples can be tested without being damaged, which greatly improves testing efficiency and sample utilisation. Spectroscopic technology also has a high detection sensitivity for trace components and can detect trace compounds in rice, which helps to distinguish between rice of different origins and provides a rich database for establishing a complex rice origin traceability model. Combining spectroscopic techniques with chemometrics, statistics, and other methods can improve the accuracy and reliability of traceability.
Spectral data fusion technology combines spectral information from different sources, processes the data using chemometric methods, and constructs detection models at different levels of fusion to reflect the sample information more comprehensively. According to the fusion level, spectral data fusion techniques are categorized into three strategies: data layer fusion, feature layer fusion, and decision layer fusion [5]. Fluorescence spectral data are usually obtained using excitation light sources of different wavelengths, each of which causes the sample to emit fluorescence at specific wavelengths. Direct fusion was therefore selected because it avoids the information loss that may occur during data processing, preserves the completeness and accuracy of the original data, and improves the understanding and analysis of the sample [6]. Fluorescence spectroscopy has a wide range of applications in biomedicine [7,8], environmental monitoring [9,10,11], agriculture [12], and other fields. It helps researchers to better understand the nature and characteristics of samples and thus supports related scientific research and application practice.
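Data layer (direct) fusion can be illustrated with a minimal sketch. It assumes that fusion simply joins the spectra of the same sample from each source end to end; the function name `fuse_data_layer` and the toy values are hypothetical and are not taken from the experiments.

```python
def fuse_data_layer(*blocks):
    """Concatenate per-sample spectra from several sources into one long vector.

    Assumption: each block is a list of spectra, and row i of every block
    belongs to the same sample.
    """
    n_samples = len(blocks[0])
    if any(len(block) != n_samples for block in blocks):
        raise ValueError("every block must contain the same number of samples")
    # Row i of the result is row i of each block joined end to end.
    return [sum((list(block[i]) for block in blocks), []) for i in range(n_samples)]


# Hypothetical toy spectra: 2 samples, 3 wavelengths per rice state.
seed = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
brown = [[1.1, 1.2, 1.3], [1.4, 1.5, 1.6]]
flour = [[2.1, 2.2, 2.3], [2.4, 2.5, 2.6]]
fused = fuse_data_layer(seed, brown, flour)
```

Each fused row is three times as long as a single spectrum, which is why a fused model keeps all the raw information but costs more to compute.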
At present, the application of spectroscopic techniques in rice origin traceability research has made remarkable progress. Studies at home and abroad have shown that spectroscopic techniques combined with chemometric methods can effectively distinguish the characteristics of different rice origins. In recent years, researchers have successfully established spectral fingerprint libraries of rice origin [13,14] by analysing the spectral data of rice leaves [15,16], soil [17], and water [18], and have used different spectral techniques, such as near-infrared spectroscopy (NIRS) [19,20], Raman spectroscopy [21], fluorescence spectroscopy [22], and hyperspectral techniques [23], to carry out traceability research on rice origin.
These studies provide technical support for the geographic traceability of rice origin and new ideas for rice quality and safety control and brand protection. Meanwhile, researchers at home and abroad have combined spectroscopic technology with methods such as machine learning and artificial intelligence to further improve the accuracy and efficiency of rice origin traceability, which provides important technical support for guaranteeing food security and agricultural product quality.
As shown in Figure 1, this paper used fluorescence spectroscopy to detect three different states of rice from twelve origins in Jilin Province. The three states of rice were first analysed individually using three spectral modelling analysis methods; the spectra of the three states were then fused for spectral modelling analysis; and finally, the standard values of rice constituent contents from the different origins were used to predict the constituent contents of rice from each origin. This provides a good basis for the quality and origin identification of rice and represents a new, fast, and accurate analysis method with broad application prospects.
2. Materials and Methods
2.1. Sample Sources
The samples for this experiment were selected in 2023 from Japonica rice 830 of the Rice Research Institute of the Jilin Academy of Agricultural Sciences and originated from twelve locations in Jilin Province: Songyuan (SY), Changyi District of Jilin City (CY), Hunchun (HC), Gongzhuling (GZL), Tao’er River (TEH), Qianguo (QG), Yushu (YS), Da’an (DA), Huinan (HN), Meihekou (MHK), Yanji (YJ), and Zhenlai (ZL). These origins include the major rice-producing areas in the northern, central, and southern parts of Jilin Province. The fluorescence spectra of rice seeds, brown rice, and rice flour were detected and analysed separately. The distribution of the 12 origins is shown in Figure 2, and their geographic location information is given in Table 1.
Before the fluorescence spectroscopy data were collected, the samples underwent manual handling and screening. The samples were given a basic cleaning to ensure that their surfaces were free of dust and impurities that could affect the accuracy of data collection. Manual screening was then used to select full, well-proportioned rice seeds of moderate size that were free of pests, moulds, and mechanical damage. The selected rice seed samples from each origin were sealed in food-grade plastic bags, labelled with their origin, and stored in a refrigerator at 4 °C to prevent spoilage. The samples were removed and left at room temperature for 3 h before their fluorescence spectra were collected.
The brown rice was obtained by hulling the rice seeds with a huller. After hulling, any deteriorated or defective brown rice was sieved out, and only intact, undamaged grains were kept for the subsequent experiments. The rice flour was prepared with a grinder, with the grinding time set to 40 s. This time was chosen so that the brown rice flour reached the finest state the grinder could produce, which avoids errors caused by inconsistent particle sizes, while also avoiding the excessive heat generated by longer grinding. After grinding, the brown rice flour was passed through a standard split sample sieve and a 100-mesh sieve (the 100-mesh sieve conforms to the GB/T 6003.1-2012 standard [24]) and then placed in a mill bottle for preservation. Before the experiments, the brown rice of each origin was sampled and milled more than three times, and the flour from the same origin was kept in the same sample bag. The three-state rice samples are shown in Figure 3.
The main components of rice are starch, protein, etc., which show different vibrational spectral information due to their chemical composition, content, and structure. To study the differences in the content of rice components among origins, chemical reagents were purchased for spectroscopic analysis from Shanghai Aladdin Biochemical Science and Technology Co., Ltd., China: riboflavin (CAS No. 83-88-5), alkaline lignin (CAS No. 8068-05-1), zein (CAS No. 9010-05-1), and amylopectin from maize (CAS No. 9037-22-3).
2.2. Spectral Acquisition Test System
To accurately measure the spectral information of the rice and chemical reagents, the experimental equipment comprised an MTO-Laser light source with an excitation wavelength of 405 nm (power of 50 mW, operating current of 60 mA), an Optosky ATP2400 fibre optic spectrometer purchased from Aopu Tiancheng Photoelectric Co., Ltd., Xiamen, China (detection range of 350–800 nm, a slit of 50 nm, and a resolution of 1.5 nm), and an optical fibre (Shanghai Wenyi Optoelectronics Co., Ltd., China, model UV600-1.0, core diameter of 600 µm, and light transmission range of 200–1100 nm). The instrument connections and the spectral signal acquisition system are shown in Figure 3. As can be seen in this figure, the sample was placed in the sample tank, the light source struck the sample at 45°, and the fibre optic probe collected the signal at the 45° reflection angle; the probe was connected to the spectrometer, which in turn was connected to the computer. Figure 4 is a schematic diagram of the fluorescence spectroscopy experimental system.
In the experiments, the sample tank was filled with rice samples so that the light source directly illuminated the sample surface when detecting rice seeds and brown rice, and so that the sample thickness was consistent when measuring rice flour and chemical reagents. The surface area of the cuvette was about 6 cm², but the number of rice grains detected could not be guaranteed to be the same each time because of differing grain sizes. The surface of the cuvette held about 50 rice seeds; the light spot was approximately elliptical, with an area of about 5 mm², and its irradiation range covered about 5–7 rice grains. After the instrument system was switched on, the light source was warmed up until the spectrometer reached a stable working condition. The number of averages on the spectrometer was set to 1, the integration time was set to 20 s, and the sampling interval was 2 ms. During acquisition, the incident angle and reflection angle were both 45° with the sample surface flat. The data were collected with the software that accompanies the Optosky fibre optic spectrometer. Using the high-speed scanning function of the software, the measurement point was changed during detection by uniformly fine-tuning the displacement platform. After invalid spectra were rejected, 1000 spectra were saved for each group.
2.3. Machine Learning Algorithms
2.3.1. Decision Tree
The decision tree is a powerful machine-learning algorithm commonly used in regression and classification problems. A single tree predicts by recursively splitting the data on feature values; in the ensemble form described here, commonly known as a random forest, multiple decision trees are built and combined into a “forest” [25]. Each decision tree is trained on randomly selected samples and features, and this randomness helps to reduce the risk of overfitting and improves the generalisation ability of the model. When making a prediction, the ensemble combines the results of all the decision trees to obtain the final prediction. One advantage of this approach is that it can handle many input variables without much data preprocessing. Another is that, because each decision tree is trained independently of the others, the trees can be trained and can predict in parallel, which speeds up the computation [26].
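How a single tree node chooses a split can be sketched as follows. This is a minimal illustration that assumes Gini impurity as the split criterion and one numeric feature; the helper names and the toy intensities are hypothetical, and the criterion actually used in the experiments is not stated here.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    best = (None, float("inf"))
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(ordered, ordered[1:]):
        t = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best


# Hypothetical fluorescence intensities at one wavelength for two origins.
intensity = [0.10, 0.15, 0.20, 0.80, 0.85, 0.90]
origin = ["SY", "SY", "SY", "HC", "HC", "HC"]
threshold, impurity = best_threshold(intensity, origin)
```

A full tree repeats this search over every feature and recurses into the two resulting subsets; an ensemble trains many such trees on random subsets of samples and features and lets them vote.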
2.3.2. Support Vector Machines
The support vector machine (SVM) is a supervised learning method based on statistical learning theory. It is a generalized linear classifier that performs binary classification of the data; its decision boundary is the maximum-margin hyperplane solved for the learning samples, and classification is achieved by finding the hyperplane that separates the samples of different classes [27]. Implementing an SVM involves two main steps: the first is to select the kernel function, and the second is to test the kernel function and select the optimal parameters. The kernel function maps the data from the original space to a feature space and can take a variety of forms, which gives SVMs the ability to handle both linear and nonlinear classification. The kernel can be viewed as a mapping of nonlinear data to a high-dimensional feature space that also provides a computational shortcut, allowing linear algorithms to be used in that high-dimensional space.
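The computational shortcut offered by a kernel can be checked numerically. The sketch below assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen only because its feature map is small enough to write out; it is not claimed to be the kernel used in this study, and all names are illustrative.

```python
import math


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D input whose inner product equals poly2_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]


x, y = [1.0, 2.0], [3.0, 0.5]
via_kernel = poly2_kernel(x, y)                           # computed in the original 2-D space
via_mapping = dot(poly2_features(x), poly2_features(y))   # computed in the 3-D feature space
```

Both routes give the same number, but the kernel never builds the higher-dimensional vectors, which is what makes very high-dimensional feature spaces affordable.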
2.3.3. K-Nearest Neighbour
The K-nearest neighbour (KNN) is a classical machine learning algorithm for classification and regression tasks. Its core idea is instance-based learning, i.e., prediction by comparison with nearest neighbours. In a classification problem, when given a new unlabelled data point, the KNN algorithm determines the K-nearest neighbours in the training set based on distance. It then votes on the labels of these neighbours and uses the most frequently occurring category as a prediction [28].
The key steps in the KNN algorithm include determining the number of neighbours K, calculating the distance between the new data point and each data point in the training set (usually using a distance metric such as Euclidean distance or Manhattan distance), finding the K nearest neighbours, and making predictions. Choosing a smaller value of K may result in overfitting the model and choosing a larger value of K may result in underfitting the model. In addition, the KNN algorithm has a higher computational complexity when dealing with large datasets and high-dimensional feature spaces because it needs to compute the distance between the new data point and all the training data points.
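The steps listed above can be sketched in a few lines. This is a minimal illustration using Euclidean distance and a simple majority vote; the function name `knn_predict`, the value of K, and the two-feature toy points are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training sample.
    distances = [(math.dist(x, query), label) for x, label in zip(train_x, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical two-feature samples from two origins.
train_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = ["SY", "SY", "SY", "HC", "HC", "HC"]
near_origin = knn_predict(train_x, train_y, [0.5, 0.5], k=3)
far_corner = knn_predict(train_x, train_y, [5.5, 5.5], k=3)
```

Because every prediction scans the whole training set, the cost grows with both the number of training spectra and the number of wavelengths, which matches the complexity concern noted above.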
2.3.4. Neural Networks
A neural network (NN) is a computational model consisting of multiple neurons that mimics the structure and function of the human nervous system. It automatically discovers patterns and regularities by learning from large amounts of data to carry out tasks such as classification, regression, and clustering. A neural network consists of an input layer, one or more hidden layers, and an output layer, where each neuron is connected to each neuron in the next layer and the strength of the connections is regulated by weights. As data are propagated through the network, each neuron computes a weighted sum of its inputs and applies a nonlinear transformation through an activation function, and the result is passed to the next layer of neurons. The connection weights are adjusted by the back-propagation algorithm to make the network’s predictions as close as possible to the true labels, and a gradient descent optimiser is used to minimise the loss function during training. Neural networks have powerful expressive capabilities and are suitable for processing various types of data, including images, text, and speech [29].
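The forward propagation described above can be written out for a very small network. The sketch assumes one hidden layer and sigmoid activations throughout; the weights are arbitrary illustrative values rather than trained parameters, and training by back-propagation is not shown.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then the activation."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Propagate one input vector through the hidden layer and the output layer."""
    hidden = dense(inputs, hidden_w, hidden_b, sigmoid)
    return dense(hidden, out_w, out_b, sigmoid)


# Hypothetical weights: 2 inputs -> 2 hidden neurons -> 1 output neuron.
output = forward([1.0, 1.0],
                 hidden_w=[[1.0, -1.0], [-1.0, 1.0]], hidden_b=[0.0, 0.0],
                 out_w=[[1.0, -1.0]], out_b=[0.0])
```

With these particular weights every weighted sum is zero, so each sigmoid returns 0.5; training would adjust the weights so that the output moves toward the true label.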
2.4. Principal Component Analysis
Since the fluorescence spectral data of rice from different origins have large dimensions, principal component analysis (PCA) was used to reduce the dimensionality of the spectra. PCA is a commonly used multivariate statistical analysis method that transforms high-dimensional data into low-dimensional data through a linear transformation while retaining the main information in the data. Its core idea is to search for the most important features, or principal components, of the data in order to reduce and simplify the data. By calculating the covariance matrix of the data and its eigenvalues and eigenvectors, PCA determines the principal components, i.e., the directions in which the data have the highest variance. These principal components are linear combinations of the original variables that preserve as much of the information in the original data as possible, thus realizing the dimensionality reduction [30]. PCA can reveal patterns, structures, and correlations in the data, which helps in understanding the intrinsic characteristics of the data; at the same time, reducing the dimensionality simplifies data analysis and facilitates subsequent modelling and visualization. Several independent variables were obtained to replace the original variables while reflecting as much information as possible, and the data were decomposed into a loading matrix and a score matrix. The first three principal components of the score matrix (PC1, PC2, PC3) were projected into a 3D coordinate system, and the pattern points were classified and differentiated according to their distribution in that coordinate system.
Since PCA can reduce the redundant information of data and highlight the main features of the data, it has a wide range of applications in the fields of data mining, pattern recognition, image processing, and so on.
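The covariance and eigenvector steps of PCA can be illustrated with a short sketch. It assumes power iteration as the eigenvector solver and extracts only the first principal component; this is a stand-in chosen to keep the example dependency-free, not the routine used in the study, and the toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate PC1, its explained-variance ratio, and the PC1 scores."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the mean-centred data (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary non-uniform start vector; it must not be orthogonal to PC1.
    v = [float(j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the eigenvalue, i.e. the variance along PC1.
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    ratio = eigenvalue / sum(cov[i][i] for i in range(d))
    # Scores are the projections of the centred samples onto PC1.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centred]
    return v, ratio, scores


# Hypothetical toy data lying exactly on the line y = x.
toy = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
direction, ratio, scores = first_principal_component(toy)
```

Because the toy points lie on one line, a single component explains all of the variance; for real spectra the ratios of successive components are accumulated until the retained share is judged sufficient.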
2.5. Gaussian Process Regression
Regression analysis is a statistical method used to study the relationship between independent variables (explanatory variables) and dependent variables (response variables). In regression analysis, we try to find a mathematical model that describes how the independent variable affects the dependent variable to make predictions or inferences. Gaussian process regression (GPR) is a probability-based nonparametric regression method mainly used for prediction and uncertainty estimation [31]. Unlike traditional regression models, GPR does not assume a specific functional form of the data but rather describes the distribution of functions in the input space through Gaussian processes. To train the GPR model, the covariance matrix is constructed using the training data and the observed data are assumed to be affected by Gaussian noise that is independently and identically distributed. By maximising the log-likelihood function, the model parameters can be optimised. For prediction, the Gaussian distribution of the predicted values, including the predicted mean and uncertainty range, can be obtained by calculating the covariance between the new input points and the training data.
Starch content is one of the important indicators of rice quality, which directly affects the processing quality and market value of rice. With the Gaussian process regression model, we can build a prediction model using the relevant data characteristics of each origin. This model can help to understand the relationship between starch content and various potential influencing factors, thus providing a scientific basis and predictive power for agricultural production decisions.
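The prediction step of Gaussian process regression can be sketched as follows. The example assumes a zero-mean prior, a squared-exponential kernel with fixed hyperparameters, and a single scalar input; hyperparameter optimisation by maximising the log-likelihood is omitted, and the training values are toy numbers rather than measured starch contents.

```python
import math


def rbf(a, b, length_scale=1.0):
    """Squared-exponential covariance between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def gpr_predict(train_x, train_y, query, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at one query point."""
    # Training covariance with i.i.d. Gaussian noise added on the diagonal.
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    k_star = [rbf(query, a) for a in train_x]   # covariance of query vs training inputs
    alpha = solve(K, train_y)                   # K^-1 y
    v = solve(K, k_star)                        # K^-1 k*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    variance = rbf(query, query) - sum(k * w for k, w in zip(k_star, v))
    return mean, variance


# Hypothetical toy data: one spectral feature against a response value.
train_x = [0.0, 1.0, 2.0]
train_y = [1.0, 3.0, 2.0]
mean_at_train, var_at_train = gpr_predict(train_x, train_y, 1.0)
mean_far, var_far = gpr_predict(train_x, train_y, 10.0)
```

Near a training point the predicted mean follows the observation and the variance is close to zero, whereas far from the data the mean falls back to the prior and the variance returns to the prior value, which is the uncertainty estimate referred to above.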
2.6. Data Processing
Common pre-processing methods for spectral information include data smoothing, standard normal variate (SNV) transformation, and multiplicative scatter correction (MSC). These methods reduce the light scattering effects caused by the physical properties of the sample (e.g., hardness and particle size). Derivative methods are also commonly used for spectral data preprocessing; they suppress spectral baseline drift and reveal more detail in the spectral data. In addition, irrelevant information in the image can also be eliminated through image preprocessing, improving the quality of the data and simplifying follow-up work.
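Two of these preprocessing steps can be illustrated briefly. The sketch assumes the standard normal transformation is applied per spectrum (the form usually called SNV) and uses a plain forward difference as the derivative; smoothing and scatter correction are not shown, and the toy spectrum is hypothetical.

```python
import math


def snv(spectrum):
    """Centre one spectrum and scale it to unit (sample) standard deviation."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / std for v in spectrum]


def first_derivative(spectrum, step=1.0):
    """Forward finite difference; a constant baseline offset cancels out."""
    return [(b - a) / step for a, b in zip(spectrum, spectrum[1:])]


raw = [2.0, 4.0, 6.0, 8.0]
shifted = [v + 5.0 for v in raw]   # same shape with a constant baseline offset
snv_raw = snv(raw)
snv_shifted = snv(shifted)
deriv_raw = first_derivative(raw)
deriv_shifted = first_derivative(shifted)
```

Both transformations give the same result for the raw and the offset spectrum, which shows how they remove a constant baseline shift before modelling.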
When using spectral data for the quantitative analysis of sample physicochemical values, it is often necessary to evaluate the performance of the model with the help of statistical parameters [32]. In this paper, three statistical parameters were chosen: the coefficient of determination (R²), the root mean square error of calibration (RMSEC), and the root mean square error of prediction (RMSEP).
The closer the R² of the model is to 1, the higher the degree to which the independent variables used in building the model explain the dependent variable and the better the fit of the model. Smaller RMSEC and RMSEP values are better: a smaller RMSEC indicates that the model regresses the calibration set better and that the overall deviation from the model is smaller, while a smaller RMSEP indicates a smaller error between the actual and predicted values of the samples and therefore a stronger predictive ability and a more stable model.
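These statistics can be computed as in the following sketch. It assumes the common definitions in which both root mean square errors divide by the number of samples; some chemometrics texts use a degrees-of-freedom correction for the calibration error instead. The numbers are toy values.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean square error; RMSEC on calibration data, RMSEP on prediction data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical reference values and two sets of model outputs.
reference = [1.0, 2.0, 3.0, 4.0]
perfect = [1.0, 2.0, 3.0, 4.0]
biased = [2.0, 3.0, 4.0, 5.0]      # every prediction is off by +1
r2_perfect, rmse_perfect = r_squared(reference, perfect), rmse(reference, perfect)
r2_biased, rmse_biased = r_squared(reference, biased), rmse(reference, biased)
```

The same `rmse` function yields RMSEC when it is applied to the calibration set and RMSEP when it is applied to the held-out prediction set.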
4. Conclusions and Prospects
4.1. Conclusions
In this paper, Japonica rice 830 planted in the same year in different areas of Jilin Province was selected as the sample. Fluorescence spectroscopy was used to detect three different states of rice from 12 origins in the province, and the spectral data obtained after a series of pre-processing steps were used as inputs for decision tree, SVM, KNN, and neural network prediction models. It was observed that the accuracy of the models increased as the decision-making degree of the algorithms on the samples increased. Because the accuracies obtained for the three states of rice differed widely, the data of the three states were fused to reduce these differences. Although fusion improved the accuracy, it greatly increased the model computation time, so the dimensionality was reduced using principal component analysis: the first six principal components, which together accounted for 98.9% of the overall information, were selected, and 50 feature points were extracted from them for modelling and analysis, which greatly reduced the computation time while achieving a higher accuracy. Finally, using the standard values of rice starch content from different origins, a Gaussian process regression model was used to predict the starch content of rice in different states and from different origins. The R² of the model was greater than 0.8 and the RMSEC and RMSEP were small, which indicates that the model was stable. The approach represents a new, fast, and accurate analysis method for identifying rice quality and origin and has a wide range of application prospects.
Taking these factors together, the application of spectroscopic technology in the field of rice origin traceability is promising, and it will provide more scientific and accurate technical support for quality control and origin traceability in the rice industry chain.
4.2. Prospects
Models used in this study, such as decision tree, support vector machine (SVM), K-nearest neighbour (KNN), and neural networks, performed well in rice spectral detection and quality prediction but still have some limitations. The diversity and quality of data are crucial for model stability and accuracy. Insufficiently comprehensive training data or a lack of representativeness in the data from certain sources can limit the predictive effectiveness of the model. Therefore, variables such as different growing conditions, soil types, and climatic factors should be considered during the data collection phase to improve the robustness of the model.
In the future, the application of spectral technology to rice quality identification and source tracing is promising. With the development of machine learning and deep learning technologies, and with more advanced algorithms such as convolutional neural networks (CNNs) and ensemble learning methods, the accuracy and adaptability of the model are expected to improve further. Meanwhile, advances in sensor technology make real-time, online spectral detection possible, providing a more accurate and efficient solution for quality control in the rice industry chain.