1. Introduction
Apple (Malus pumila Mill.) is a perennial crop belonging to the Rosaceae family. Because the trees are cultivated in a single location for many years, suitable cultivation sites must be selected carefully based on geographical and environmental conditions. Furthermore, in the cultivation of fruit trees, it is crucial to supply the right amount of nutrients during key stages. During these stages, nitrogen is the most critical factor influencing both vegetative growth and the quality and quantity of fruit. Insufficient nitrogen weakens plant growth, resulting in poor fruit development and a significant decrease in yield and quality [1]. In contrast, an excess of nitrogen causes assimilated nutrients to be consumed primarily for the growth of stems and leaves, causing the plant to grow excessively and leading to fruit disorders such as bitter pit or corky tissue [2]. As the fruit size increases, coloration becomes inadequate and maturation is delayed, resulting in rapid quality deterioration during storage. Furthermore, prolonged vegetative growth decreases the accumulation of stored nutrients and delays plant maturation, increasing susceptibility to frost damage. Therefore, timely fertilization is crucial for effective cultivation management [3,4,5,6].
Remote sensing technology, which uses sensors mounted on platforms such as satellites and aircraft to observe characteristics and phenomena without physical contact with the target, is gaining attention. This technology utilizes reflected or radiated electromagnetic energy to observe desired subjects. Recently, advancements in drone, satellite, and high-resolution sensor technology, coupled with the integration of big data and AI, have extended its use beyond urban and territorial planning to fields where on-site surveys are challenging, such as geology, marine science, defense, and the environment [
7]. In agriculture, various methods, including real-time monitoring of crop nutrients [
8], monitoring of moisture levels [
9,
10], disease and pest diagnosis [
11], crop yield assessment [
12,
13,
14], cultivation area estimation [
15], early prediction of harvest timing, and forecasting of harvest quantity and quality [
16,
17], are actively utilized. Such applications mark a significant stride toward enhancing agricultural productivity, efficient resource management, and sustainability. By enabling predictive modeling and precise agricultural management, agriculture can progress toward more sustainable and efficient farming activities, contributing to income growth and environmental preservation.
A typical RGB sensor covering the visible spectrum represents information for only three wavelengths, whereas multispectral sensors, including near-infrared sensors, provide information for three to ten wavelengths. In contrast, hyperspectral sensors can capture information for several dozen to several hundred wavelengths. However, the increasing size of spectral data leads to higher costs, complex data processing, and challenges such as signal-to-noise ratio (SNR) degradation [18]. Therefore, postacquisition preprocessing and minimization of data loss are necessary for effective data handling.
Regression analysis is a technique that utilizes one or more independent variables (x) to explain the dependent variable of interest using a mathematical function. The types of regression analysis include linear regressions, such as simple and multiple linear regressions, and nonlinear regression analyses, such as tree-based and polynomial regressions. Linear regression has the advantages of simple computations, easy model interpretation, and rapid analysis [19]; however, it is sensitive to outliers and may result in decreased model accuracy when the relationships between variables are not linear. Nonlinear regression addresses this limitation by accounting for nonlinear relationships through higher-order terms, which increases model accuracy and compensates for the disadvantages of linear regression [20]. However, as the number of interaction terms increases, the calculation becomes more complex and requires more time, and a higher variance can lead to overfitting (bias–variance tradeoff). Therefore, it is difficult to state definitively which regression analysis is better, because this depends on the independent variables and the subject of analysis. It is thus important to compare the performance of models built with both linear and nonlinear regression analyses and to select a model with high reproducibility.
Based on the described developments, focused research is underway to apply similar methodologies to orchard cultivation. Studies utilizing hyperspectral imaging have been conducted to predict carbohydrate content, which is associated with fruit quality, achieving a high prediction performance of over 75% [
21]. Additionally, research focused on predicting potassium levels using a combination of various vegetation indices derived from hyperspectral imaging has been carried out. Among the diverse vegetation indices, the DVI (Difference Vegetation Index) derived from the combination of red edge and blue wavelengths exhibited the highest performance, with an R² value of 0.899 [22].
In this study, we developed a model to predict the leaf nitrogen content of apple trees via hyperspectral imaging by (1) conducting regression analysis (partial least-squares regression, support vector regression, and eXtreme gradient boosting regression) using both the full spectrum and selected wavelengths, followed by a comparison of the evaluation performances; and (2) reducing the spectral resolution through spectral binning and performing regression analysis using both the full spectrum and selected wavelengths, followed by a subsequent comparison of the evaluation performance.
2. Materials and Methods
This study was conducted over two years, from 2021 to 2022, at the experimental field of the National Institute of Horticultural & Herbal Science located in Wanju-gun, Jeollabuk-do, Republic of Korea (35°49′42.8″ N, 127°01′52.9″ E). Two-year-old nursery stocks of ‘Hongro/M.9’ were used for the experiment, and they were subsequently grafted onto potted rootstocks. The potting mixture was prepared by mixing horticultural soil, loess soil, and perlite at a ratio of 5:4:1. The plants were spaced at 3 m × 2 m, with each treatment plot accommodating 38 trees. Nitrogen was applied as ammonium nitrate (NH4NO3) at three rates (171, 43, and 0 g/year), one for each plot, with the fertilizer diluted in 2 L of water.
2.1. Hyperspectral Data
The hyperspectral imaging system was composed of a hyperspectral sensor (Fx10, Specim Spectral Imaging Ltd., Oulu, Finland) that operates in the wavelength range of 400–1000 nm, with 224 channels, a field of view of 38°, and a spectral resolution of 5.5 nm based on the 2-binning line scan method. In addition, the system included a rotator mounted at the bottom (RS10, Specim Spectral Imaging Ltd., Oulu, Finland) and a reference board with 99% reflectance (Spectralon, Labsphere, Inc., North Sutton, NH, USA) to correct for variations in sunlight. To prevent image distortion during hyperspectral imaging, the rotation angle of the rotator was set to 30°. Prior to capturing the main image, a dark current image was acquired to eliminate noise caused by the heat generated during sensor operation. The reference board was then placed beside the subject, and images were acquired using dedicated imaging software (Lumo Scanner, Specim Spectral Imaging Ltd., Oulu, Finland). The acquired images were processed using hyperspectral image processing software (ENVI 5.3, Exelis Visual Information Solutions, Boulder, CO, USA). Before image processing, the images were subjected to a preprocessing phase that included dark current correction and radiometric correction. A vegetation index, specifically the NDVI-GNDVI, as described in Equation (1), was subsequently computed from the normalized images to separate the canopy area from the background.
The images were converted into vegetation index images, and density slicing was used to separate the canopy from the background based on a designated threshold. Subsequently, the canopy section was designated as the region of interest, and the reflectance values were extracted (Figure 1).
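The correction and masking steps described above can be illustrated with a minimal Python/NumPy sketch. It is not the ENVI workflow used in this study. Several points are assumptions made for illustration: "NDVI-GNDVI" is read as the difference between the two indices (Equation (1) is not reproduced here), the band centres (550, 670, and 800 nm) and the threshold of 0.1 are placeholders, and the function names are hypothetical.

```python
import numpy as np

def to_reflectance(raw, dark, white, white_reflectance=0.99):
    """Dark-current subtraction followed by white-reference normalization."""
    raw, dark, white = (np.asarray(a, dtype=float) for a in (raw, dark, white))
    denom = np.clip(white - dark, 1e-9, None)  # guard against division by zero
    return white_reflectance * (raw - dark) / denom

def nearest_band(wavelengths, target_nm):
    """Index of the band whose centre is closest to target_nm."""
    return int(np.argmin(np.abs(np.asarray(wavelengths, dtype=float) - target_nm)))

def canopy_mask(cube, wavelengths, threshold=0.1,
                green_nm=550.0, red_nm=670.0, nir_nm=800.0):
    """Boolean canopy mask from NDVI - GNDVI on a (rows, cols, bands) cube.

    Band centres and threshold are illustrative assumptions, not study values.
    """
    cube = np.asarray(cube, dtype=float)
    green = cube[..., nearest_band(wavelengths, green_nm)]
    red = cube[..., nearest_band(wavelengths, red_nm)]
    nir = cube[..., nearest_band(wavelengths, nir_nm)]
    eps = 1e-9  # avoids 0/0 on completely dark pixels
    ndvi = (nir - red) / (nir + red + eps)
    gndvi = (nir - green) / (nir + green + eps)
    return (ndvi - gndvi) > threshold

# Two toy pixels: a leaf-like spectrum and a soil-like spectrum
wavelengths = [550.0, 670.0, 800.0]
cube = np.array([[[0.10, 0.05, 0.50], [0.15, 0.20, 0.30]]])
mask = canopy_mask(cube, wavelengths)
```

In practice the threshold would be chosen from the index histogram of each scene, in the same spirit as the density slicing described above.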
As depicted in Figure 2, noise occurs in shadowed regions, where light absorption and reflection are minimal. When comparing spectral curves between areas with shadows and those without, reflectance values exhibit a difference of approximately 0.1 to 0.5 or more depending on the wavelength. Since such differences can lead to errors in predicting nitrogen content, histograms were generated for each wavelength, and threshold values were set to delineate the shadowed regions and minimize their impact. To reduce the spectral resolution of the extracted hyperspectral data, the original 2-binning images were rebinned to 4-binning, 8-binning, and 16-binning.
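As a rough illustration of spectral binning, the sketch below averages adjacent bands in Python/NumPy. Whether the bands were averaged or summed in this study is not stated, so averaging is an assumption, and `bin_spectra` is a hypothetical helper. Relative to the original 2-binning data (224 channels), merging factors of 2, 4, and 8 would correspond to 4-, 8-, and 16-binning.

```python
import numpy as np

def bin_spectra(spectra, factor):
    """Average every `factor` adjacent bands along the last axis."""
    spectra = np.asarray(spectra, dtype=float)
    n_bands = spectra.shape[-1]
    usable = n_bands - n_bands % factor  # drop leftover bands that do not fill a bin
    grouped = spectra[..., :usable].reshape(
        *spectra.shape[:-1], usable // factor, factor
    )
    return grouped.mean(axis=-1)

# 224 bands over 400-1000 nm, as for the sensor described above
wavelengths = np.linspace(400.0, 1000.0, 224)
spectrum = np.random.default_rng(0).random(224)  # placeholder reflectance values

binned = {name: bin_spectra(spectrum, f)
          for name, f in (("4-binning", 2), ("8-binning", 4), ("16-binning", 8))}
binned_centres = bin_spectra(wavelengths, 2)  # new band centres after merging pairs
```

Applying the same function to the wavelength vector gives the centre of each merged band, which keeps the binned spectra and their wavelength axis aligned.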
2.2. Apple Leaf Nitrogen Content Measurement
At each time point, a total of 21 leaf samples were collected, with seven samples from each treatment group. Nitrogen content data were acquired for the leaves, with a focus on mature leaves, and a total of 10 leaves were collected. The leaf nitrogen content was measured in accordance with the soil and plant analysis methods stipulated by the National Institute of Agricultural Sciences (2000). The collected leaves were dried in a dryer at 60 °C for 5 days (120 h). A 1 g sample of the dried material was digested with 10 mL of a mixture of nitric acid and perchloric acid at a ratio of 85:15. Upon completion of digestion, the solution was allowed to cool to room temperature. The residual liquid in the container was then rinsed with distilled water and filtered into a volumetric flask. The leaf nitrogen content was subsequently measured using a carbon/nitrogen elemental analyzer (NL/Primacs SNC-100, Skalar Analytical B.V., Breda, The Netherlands).
2.3. Hyperspectral Data Transformations
Various issues arise when capturing hyperspectral images in open fields owing to differences in environmental factors. These include changes in atmospheric conditions, uneven lighting sources [
23,
24], and noise caused by the heat of the sensor itself [
25]. Therefore, accurate analysis of hyperspectral images obtained in open fields requires preprocessing, which involves setting and optimizing the hyperspectral sensor and imaging equipment according to the conditions. The first derivative method is a preprocessing technique used to reduce the noise caused by light. This involves calculating the rate of change between data points by differentiating the raw data. This method extracts features through the gradient of the data rather than the raw reflectance values, thereby reducing the noise caused by environmental fluctuations and improving the accuracy of the data. Additionally, the Savitzky–Golay filter is a method for smoothing data at regular intervals [
26]. The output value at each data point is determined by finding, through least-squares fitting, the polynomial of order k that best fits the surrounding points. This method is commonly used, because it reduces noise due to various light conditions and atmospheric states while maintaining spectral characteristics, making it an effective preprocessing technique [
27].
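The two transformations described above can be sketched in Python/NumPy as follows. This is an illustrative implementation of the standard Savitzky–Golay construction (a least-squares polynomial fitted in a moving window), not the code used in this study. The window length, polynomial order, and edge handling are assumptions, since the settings applied here are not stated, and the function names are hypothetical.

```python
import math
import numpy as np

def savgol_weights(window, order, deriv=0, delta=1.0):
    """Filter weights that evaluate the fitted polynomial (or its derivative)
    at the centre of an odd-length window."""
    if window % 2 == 0 or order >= window or deriv > order:
        raise ValueError("need odd window > order >= deriv")
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    design = np.vander(offsets, order + 1, increasing=True)  # columns 1, x, x^2, ...
    coeff_rows = np.linalg.pinv(design)  # row j maps the window to coefficient a_j
    return coeff_rows[deriv] * math.factorial(deriv) / delta ** deriv

def savgol_filter_1d(spectrum, window=11, order=2, deriv=0, delta=1.0):
    """Smooth (deriv=0) or differentiate (deriv=1) a 1-D spectrum.

    Edges are padded by repeating the end values, so the first and last
    window // 2 outputs are only approximate.
    """
    spectrum = np.asarray(spectrum, dtype=float)
    half = window // 2
    padded = np.pad(spectrum, half, mode="edge")
    weights = savgol_weights(window, order, deriv, delta)
    # np.convolve flips its kernel, so reverse the weights to get a correlation
    return np.convolve(padded, weights[::-1], mode="valid")

# Toy spectrum sampled every 3 nm with a constant slope of 0.002 per nm
wl = np.arange(400.0, 460.0, 3.0)
refl = 0.002 * wl + 0.1
smoothed = savgol_filter_1d(refl, window=5, order=2)
first_deriv = savgol_filter_1d(refl, window=5, order=2, deriv=1, delta=3.0)
```

Computing the derivative from the fitted polynomial, rather than from raw differences between neighbouring bands, is one common way to limit the noise amplification that differentiation otherwise causes.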
2.4. Variable Selection Method
Hyperspectral data, which contain abundant continuous spectral information, complicate computational analysis [
28] and can lead to overfitting owing to unnecessary variables [
29], consequently diminishing the performance of regression models [
30]. To address these issues, methods have been developed to eliminate variables with little relevance to the dependent variable among numerous independent variables or to find combinations of predictive variables. These methods involve combining meaningful information to extract new features, thereby removing unnecessary information or noise and extracting important information for analysis. Among the variable selection methods, competitive adaptive reweighted sampling (CARS) uses PLS-based regression coefficients as criteria to evaluate the importance of variables. Subsets are randomly generated via Monte Carlo sampling, and N variables are selected through competition, followed by wavelength selection based on an exponentially decreasing function and adaptive reweighted sampling, with the lowest root-mean-square error (RMSE) chosen through cross-validation [
31]. The successive projections algorithm (SPA) employs a forward selection approach to construct subsets of variables with minimal collinearity: at each step, the remaining variables are projected onto the subspace orthogonal to the variables already selected, and the variable with the largest projection is added. The final subset is the one with the lowest RMSECV in multiple linear regression (MLR) [
32]. The random frog (R-Frog) method, which is based on partial least-squares regression (PLSR), randomly selects variable sets and calculates selection probabilities through repeated iterations using the reversible jump Markov chain Monte Carlo algorithm. Wavelengths with higher selection probabilities are chosen as feature variables.
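The projection step of SPA can be sketched in a few lines of Python/NumPy. This is a simplified illustration under stated assumptions, not the implementation used in this study: it covers only the greedy projection chain from a single starting variable, it mean-centres the columns first, and it omits the later stage in which candidate subsets are compared by RMSECV in MLR. The name `spa_select` is hypothetical.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Greedy successive projections on the columns of X (samples x variables)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_vars = X.shape
    if not 1 <= n_select <= min(n_samples - 1, n_vars):
        raise ValueError("n_select exceeds the rank available after centring")
    residual = X - X.mean(axis=0)  # work on mean-centred columns
    selected = [start]
    for _ in range(n_select - 1):
        v = residual[:, selected[-1]]
        # Remove from every column its component along the last selected column
        residual = residual - np.outer(v, v @ residual) / (v @ v)
        norms = np.linalg.norm(residual, axis=0)
        norms[selected] = -1.0  # never pick the same column twice
        selected.append(int(np.argmax(norms)))
    return selected

# Column 1 is an exact multiple of column 0, so it carries no new information
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([a, 2.0 * a, [1.0, -1.0, 1.0, -1.0, 1.0]])
chosen = spa_select(X, n_select=2, start=0)
```

In the toy matrix the collinear column is skipped in favour of the third one, which is the behaviour the method relies on when neighbouring hyperspectral bands are highly correlated.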
2.5. Regression Analysis Based on Machine Learning Models
PLSR builds a least-squares regression model on latent variables, which are linear combinations of the independent variables constructed to maximize their covariance with the dependent variable. This approach addresses the issue of unreliable regression coefficient estimates owing to the high correlations among the independent variables. Gradient boosting regression, an analysis method utilizing the boosting technique within ensemble models, progressively adds tree models that are fitted to the residuals (the differences between predicted and actual values) of the current ensemble, thereby reducing errors. However, this process can be time-consuming, a drawback that can be mitigated by extreme gradient boosting (XGBoost) regression analysis. Unlike gradient boosting regression, XGBoost supports parallel and distributed processing, allowing it to handle large datasets rapidly. It learns efficiently and concisely through pruning and can use various objective functions to reduce time. A support vector machine (SVM) is an algorithm that maps data to a high-dimensional space and determines a decision boundary by maximizing the margin, which represents the distance between the decision boundary and data points (MATLAB R2023a, MathWorks, Natick, MA, USA). The performances of these regression models were validated using 10-fold cross-validation and evaluated based on the coefficient of determination (R²) and RMSE. R² is a statistical metric in regression analysis that indicates how effectively a model explains the variability of the dependent variable. A value close to one suggests that the model effectively explains this variability, while a value close to zero indicates that the model fails to explain it adequately. The RMSE measures the difference between predicted and actual values: the differences between each predicted value and its corresponding actual value are squared, the squared differences are averaged, and the square root of that mean is taken. A lower RMSE indicates that the model’s predictions are closer to the actual values, signifying higher model performance.
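The two metrics and the 10-fold validation scheme described above can be written compactly in Python/NumPy. The sketch below is illustrative only: ordinary least squares stands in for the PLSR, SVR, and XGB models, the data are synthetic, and the helper names (`kfold_predictions`, `fit_ols`, `predict_ols`) are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between actual and predicted."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def kfold_predictions(X, y, fit, predict, k=10, seed=0):
    """Out-of-fold predictions: each sample is predicted by a model
    trained on the other k - 1 folds."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    idx = np.random.default_rng(seed).permutation(len(y))
    y_hat = np.empty(len(y))
    for test_idx in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, test_idx)
        model = fit(X[train_idx], y[train_idx])
        y_hat[test_idx] = predict(model, X[test_idx])
    return y_hat

# Placeholder model (assumption): ordinary least squares with an intercept
def fit_ols(X, y):
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic, noise-free linear data, so the cross-validated fit is essentially exact
X_demo = np.arange(30.0).reshape(-1, 1)
y_demo = 3.0 * X_demo[:, 0] + 1.0
y_cv = kfold_predictions(X_demo, y_demo, fit_ols, predict_ols, k=10)
cv_rmse, cv_r2 = rmse(y_demo, y_cv), r2_score(y_demo, y_cv)
```

Because both metrics are computed from out-of-fold predictions, they reflect performance on samples the model did not see during fitting, which is what makes them useful for detecting overfitting.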
Figure 3 presents a flowchart summarizing the processes, including image preprocessing and analysis methods, conducted in this study.
4. Discussion
This paper presents the results of a prediction model for apple tree leaf nitrogen content using full-spectrum wavelengths. For the raw data, the R² values for PLSR, SVR, and XGB ranged from 0.633 to 0.643, 0.743 to 0.811, and 0.850 to 0.892, respectively. For the first derivative, the R² values for the PLSR and SVR ranged from 0.623 to 0.688 and 0.667 to 0.704, respectively, and overfitting was observed with XGB. When compared with results from previous research, the raw data showed that PLSR had an R² of 0.773 and the first derivative data had an R² of 0.774 [34]. The improvement in performance can likely be attributed to the higher spectral resolution. Despite maintaining the same wavelength range, the increased spectral resolution introduces a greater number of wavelengths. This, in turn, contributes to a higher count of independent variables in the prediction model, ultimately leading to improved performance.
Hyperspectral data, represented as continuous curves, constitute a complex dataset because of differences in reflectance values, even within adjacent wavelength bands in the same spectral range. These results suggest that nonlinear regression analysis methods, such as SVR and XGB, are more advantageous in terms of prediction performance and interpretability than linear regression analyses, such as PLSR [
35]. Additionally, Savitzky–Golay filtering, a preprocessing method which is used to reduce the noise caused by light, smooths the data by adjusting the polynomial order and window size. However, the first derivative, which represents the rate of change in adjacent wavelengths rather than the inherent value of the reflectance, is sensitive to spectral changes and peak enhancement [
36]. This sensitivity is beneficial but can be problematic when noise is present, as it leads to significant changes in the gradient. Such drawbacks are evident in the results of this experiment, where a lower prediction performance was observed or overfitting occurred in the tree-based boosting method, XGB, owing to the sensitivity of the first derivative data.
A comparison of the variable selection algorithms revealed that the primary selections were made at the blue (470–490 nm), green (550 nm), red edge (680–740 nm), and NIR wavelengths. In the visible light spectrum, wavelengths that were closely associated with chlorophyll were chosen. For chlorophyll a, the highest absorption occurred at the boundary of the red and red edge wavelengths, approximately at 670 nm, whereas chlorophyll b exhibited maximum absorption at 470 nm and reflection at the green wavelength. Furthermore, in the nonvisible spectrum, specifically at the red edge and near-infrared (NIR) wavelengths, differences in reflectance values reflect the nutritional status of leaves, which typically increase in value when the nutritional state is favorable [
37]. The structural characteristics of leaves vary with nitrogen levels: a higher nitrogen content results in an increase in the leaf surface area. Additionally, the leaf epidermis thickens, and cells in the mesophyll tissue increase in size and become more densely arranged, leading to an increase in the chlorophyll content [
38,
39]. Thickening of epidermal tissue facilitates active gas exchange, resulting in enhanced photosynthesis. Based on the spectral characteristics corresponding to the structural changes in the leaves, the analysis results considering the full spectrum revealed that for the PLSR models, R² = 0.619, which was lower than that of CARS, Rfrog, and SPA. In contrast, for the SVR models, CARS had an R² of 0.754, Rfrog had an R² of 0.742, and SPA had an R² of 0.765, indicating a greater performance than those of the models using the full spectrum. In the case of XGB, the performance across various variable selection algorithms ranged from 0.7 to 0.756, showing effectiveness that is similar to the results obtained using full-spectrum analysis. These results indicate differences based on the variable selection algorithm. PLSR, which creates new variables through linear combinations of independent variables, seems to lack an adequate explanation of the selected variables. In contrast, the use of the radial basis function kernel that is based on Gaussian functions in SVR, along with various loss functions (such as the mean square error and mean absolute error) and gradient boosting in tree-based XGB, allows for the interpretation of nonlinear relationships between independent and predicted variables, unlike in PLSR. The improvement in predictive performance through the optimization of prediction models, including hyperparameter tuning for each analysis method, suggested that fewer variables can yield similar or better results in the prediction models. Another method explored in previous research reduces variables that are involved in predicting nitrogen content using various vegetation indices and the red edge wavelength. The results showed that the R² based on the BPNN model was 0.77 [40]. However, since vegetation indices require a combination of multiple wavelengths, lowering the spectral resolution might lead to changes in the values of these indices. Therefore, reducing spectral resolution is considered inadequate as an alternative for variable reduction in this context.
When comparing the wavelengths that were selected based on the 2-binning criterion with those selected through spectral binning at 4-, 8-, and 16-binning, it was observed that CARS selected similar or adjacent wavelengths in the raw data regardless of the spectral resolution. When comparing the XGB prediction models that exhibited the highest performance for each spectral resolution, the lowest value was observed for the 16-binning model, with an R² of 0.743, and the highest was observed for the 4-binning model, with an R² of 0.760, indicating a similar performance with a difference of only 0.017 (1.7%). In the case of Rfrog, unlike CARS, the selected wavelengths varied slightly according to the spectral resolution. However, only a 2% difference in the coefficient of determination was observed across spectral resolutions in the XGB prediction model. For SPA, in the case of 2- and 4-binning, only the red edge and NIR wavelength regions were selected, which differed from the wavelengths chosen by CARS and Rfrog, and the evaluation performance differed by approximately 3% to 5% from that of the other variable selection methods. Additionally, the lowest R² value (0.65) was observed at 8-binning. This decrease in performance seems to be related to whether the 680 nm wavelength, located between the red and red edge wavelengths, was selected, in comparison with CARS and Rfrog under the same binning criteria. When the first derivative spectral data were used for variable selection through spectral binning, there was no similarity in the wavelengths that were selected across spectral resolutions, in contrast to the raw data. Consequently, the regression analysis, particularly for SVR, exhibited substantial deviations, with R² values ranging from 0.577 to 0.712. This is because a decrease in spectral resolution reduces the number of wavelengths, causing a loss of continuous spectral data.
While the raw data retain the inherent spectroscopic characteristics of the canopy, the first derivative data, owing to data loss during spectral binning, respond sensitively to even minor changes. As a result, the prediction performance was unstable and varied with spectral resolution. Furthermore, a high spectral resolution does not necessarily translate into an improved performance in predictive models.
Figure 5 presents the mapping of hyperspectral images using the wavelengths selected based on CARS. The results are divided into red to green colors based on the nitrogen content range, indicating that the leaf nitrogen content ranged from a minimum of 0% to a maximum of 4%. Spectral binning, which combines wavelength bands, can reduce the number of wavelength bands and lead to the loss of continuous spectral data, potentially degrading the performance of the prediction models [
41]. However, this process can also reduce the costs associated with data processing and analysis. Additionally, by combining adjacent wavelength bands, the SNR can be enhanced, and the inclusion of similar spectral data can be minimized. Therefore, appropriate spectral binning may offer advantages such as a shorter data processing time owing to the reduction in high-dimensional spectral data and enhanced predictive performance [42,43,44].
5. Conclusions
In this study, various predictive models were developed and compared via regression analysis with both the full spectrum and selected significant wavelengths to predict the leaf nitrogen content in apple trees via hyperspectral imaging. In addition, spectral binning was used to reduce the spectral resolution to 5, 10, or 20 nm, and regression analysis was conducted using only the wavelengths identified through variable selection. The predictive performance at these reduced spectral resolutions was compared to that at the original spectral resolution to determine the optimal spectral resolution. The study showed that reducing the spectral resolution reduces the number of wavelengths, leading to data loss. However, the intrinsic shape of the spectral curve is maintained, suggesting that performance can be preserved even with a lower spectral resolution. Hyperspectral imaging has a fine spectral resolution, allowing for detailed interpretation of physiological responses in crops across numerous wavelengths, but it is limited by the high cost of equipment and by various constraints during image acquisition. To address these issues, the spectral resolution was decreased, and satisfactory results were still achieved. These results imply that the development of a miniaturized multispectral sensor can be practical and cost-effective, potentially serving as an alternative to hyperspectral sensors. Furthermore, utilizing geographic information systems, including sensors and drones, could enhance the precision of monitoring apples that are cultivated in extensive orchards. Through stable cultivation management, this approach could provide a reliable means of securing both crop yield and quality.