*3.3. Selection of Relevant Features at Different Scales*

One of the main motivations for the application of hyperspectral imaging technology is the potential to find the most relevant wavelengths for a specific task, and to subsequently design a specific sensor. Reference [41] showed that specific wavelengths might be useful to identify certain leaf diseases in sugar beet. In wheat, VIs have been described that are capable of detecting brown rust [18]. This shows that a selection of specific wavelengths can be specific to one disease. We applied the introduced technique to the data sets at the ground-canopy and UAV scales and derived important wavelengths for the detection of disease symptoms as well as for the prediction of disease severity.

#### 3.3.1. Ground Scale

Feature selection on the field scale was performed for the detection of YR. The models were trained on a homogenized sample of training data and validated by five-fold cross-validation. The final accuracy was determined on the hold-out test set. To reduce the computational complexity, the spectral bands were regularly subsampled by a factor of 5. The resulting 33 bands were ranked and an optimal band number was selected (Figure 8).
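The ranking procedure described above can be sketched as a greedy forward selection under cross-validation. This is an illustrative sketch on synthetic data, not the paper's exact pipeline: the data, labels, and the logistic-regression classifier are stand-in assumptions.

```python
# Sketch: greedy forward band selection with five-fold cross-validation.
# All data below is synthetic; the classifier is a simple stand-in model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((200, 165))             # synthetic spectra: 200 pixels x 165 bands
y = rng.integers(0, 2, 200)            # synthetic healthy/YR labels
bands = list(range(0, X.shape[1], 5))  # regular subsampling by factor 5 -> 33 bands

selected = []
for _ in range(10):                    # rank the first 10 bands by inclusion order
    scores = [(cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [b]], y, cv=5).mean(), b)
              for b in bands if b not in selected]
    best_score, best_band = max(scores)
    selected.append(best_band)
# 'selected' now holds the bands ranked by their order of inclusion
```

The inclusion order of `selected` corresponds to the ranking-by-inclusion display in Figure 8; running the loop over all 33 candidates yields the full ranking.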

For YR, an optimal number of 16 features reached 91% accuracy. However, to allow a comparison with the UAV-scale selection, we selected the best 10 features, providing an accuracy of 88%. The waveband at 780 nm in the NIR was the most important for YR detection. The next two bands were also in the NIR, followed by a band in the blue/green spectral region. Less important were the NIR wavebands > 800 nm and the red part of the spectrum. Various works have shown that VIs using wavelengths from these spectral regions can successfully detect rust diseases of wheat [17,18,25], or even necrotrophic diseases of other crop plants such as groundnuts [42].

In the literature, it has been described that pigments and water influence the absorbance and reflectance of light in its interaction with plants [43–45]. The measured reflectance signal is always a mixed signal and the result of complex biochemical interactions [43,46,47]. The visible region is mainly influenced by the light absorption of leaf pigments [48]. Healthy wheat canopies appear dark green because of high amounts of chlorophyll in the leaves [10]. With YR infection of the leaf tissue, chlorophyll is degraded, while the urediniospores of rust fungi are pigmented through the formation of carotenoids [49]. This could explain the importance of certain absorption or reflection bands of pigments for YR detection in the visible range. The effect of chlorophyll degradation and the formation of chlorosis, and the resulting detectability of the disease, has also been described for *Septoria tritici* blotch [28]. The NIR region is strongly influenced by leaf and cell structures, the architecture of the canopy, and water absorption bands [43,50]. High YR incidence leads to an early senescence of leaves in the upper, but particularly in the lower, leaf levels. This changes the appearance of the crop architecture, reduces the vitality of leaves and the water content, and could explain the importance of specific wavebands for YR detection.

**Figure 8.** Results of the feature selection for the relevant wavebands for the classification of YR in the field at the ground (**top**) and UAV (**bottom**) scales. The accuracy reached for the different numbers of features (**left**) and the ranking of the inclusion within the feature subset (**right**) are displayed. RMSE = root mean square error.

#### 3.3.2. UAV Scale

For the feature selection at the UAV scale, the detection and quantification of YR infections was investigated. Using the UAV and the Rikola filter-system hyperspectral camera, the mean spectrum of the central part of each plot was measured on multiple days. The first four dates were used, as a suitable disease estimation was not possible later due to the beginning of senescence.

The optimal number of features was 11, reaching an RMSE of 17.9 (relative to visual assessments on the ground of around 70%) (Figure 8). Here, the most important bands were 830 nm and 510 nm, followed by further NIR bands. The red region (630–700 nm) and the beginning of the NIR (700–800 nm) were without significance. The selection of a spectral border band would be a sign of fitting to noise if the Specim V10E line scanner had been used; here, however, the Rikola camera was used, which shows no increased noise in the spectral border regions.
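For reference, the RMSE used to score the severity regression compares predicted and visually assessed severities. A minimal sketch with made-up severity values (the numbers are purely illustrative):

```python
import numpy as np

def rmse(predicted, observed):
    """Root mean square error between predicted and observed severities (%)."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

# Hypothetical plot severities in percent, just to show the computation
pred = [55.0, 80.0, 60.0, 90.0]
obs = [70.0, 65.0, 75.0, 72.0]
error = rmse(pred, obs)
```

An RMSE of 17.9 severity points on assessments around 70% thus corresponds to a relative error of roughly a quarter of the assessed severity.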

Feature selection results for further traits are shown in Table 3. Important bands were also found in the green and NIR regions, which might have been triggered by the same biochemical reactions as on the ground scale. However, for the fertilizer, fungicide, and combined treatments, the spectral region of 600–750 nm had a higher relevance.


**Table 3.** The six most important bands for selected plot traits at the UAV scale, with the wavebands (in nm) ranked by feature selection importance, beginning with the highest. Selected traits: fertilizer (Fert), fungicide (Fung), fungicide + fertilizer (Fert+Fung), yellow rust (YR) detection, and yellow rust regression.

#### 3.3.3. Cross-Scale Interpretation

The cross-scale interpretation revealed significant inconsistencies but also some parallels. The inconsistencies were related to sensor characteristics, as the same sensor had not always been applied. Furthermore, additional factors (leaf geometry, mixed pixels with background) were included at the higher scales that may have required further bands to be regarded properly by the prediction model.

The number of required features varied between the scales. In a separate experiment (data not shown) with fixed leaves in the laboratory, a perfect differentiation was possible using two bands. Geometry was also not relevant there, as the leaves were fixed in a horizontal position. The highest number required on the field scale was 18 on average, as the complex geometry and complex scattering effects in the canopy affected the recorded signal. At the UAV scale, the geometry was the same, but due to the physical smoothing by blur and the high pixel size, the signal was simplified again. There, an optimum was reached at 11 features, omitting the spectral region of 620–820 nm.

The red region had a low relevance for the classification of YR at the field and UAV scales. This might be due to the fact that urediniospores of *P. striiformis* appear more yellow than red (owing to their carotenoid composition) and do not show strong reflection in the red region. The NIR region had an increased relevance at the UAV scale. Presumably, this was related to simple separability based on pigments at the lower scale, whereas in the field, the leaf geometry distorted this signal and the NIR region was required to compensate for this effect.

The differences and parallels between the feature sets motivated their cross-scale application. It was assumed that information about optimal feature sets could also be an advantage at a different scale. Therefore, the feature sets for the assessment of YR were exchanged between the ground scale, with the Specim V10E, and the UAV scale, with the Rikola hyperspectral camera. To allow a comparison of the different feature sets, the number of included features was fixed to 10, based on the previous feature selection runs (Figure 8). Evaluation at the ground and UAV scales was performed following the same principle as for the feature selection.

Table 4 shows the performance of multiple feature sets. The highest accuracy was reached by the full data set, followed by the 16 VIs. The feature sets with 10 features reached a slightly lower but, in direct comparison, very similar accuracy. The results indicate that the complex situation in the wheat canopy required more than 10 features. The good performance of the equidistant feature set can be explained by its resemblance to the 10 selected features, which were nearly equidistantly distributed over the spectral range; both feature sets applied wavebands from the same spectral regions. Furthermore, the performance of the field-selected feature set points to the heterogeneity of reflectance characteristics even within the same treatment group. The test and training data were extracted from separate image sets. Consequently, this feature set was optimized on the training data, but had no advantage over the equidistant feature set on the test data.
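The uninformed equidistant baseline referred to above can be constructed as follows. The band grid is an assumption for illustration and does not reproduce the exact sensor calibration:

```python
import numpy as np

# Hypothetical band grid roughly matching the Specim V10E range (400-1000 nm)
wavelengths = np.linspace(400.0, 1000.0, 165)

# Uninformed baseline: 10 equidistantly spaced bands over the full range
idx = np.round(np.linspace(0, len(wavelengths) - 1, 10)).astype(int)
equidistant_set = wavelengths[idx]  # candidate set to compare against the selected bands
```

Because the 10 data-driven features happened to be spread nearly equidistantly over the spectral range, such a baseline samples essentially the same spectral regions, which explains its competitive performance.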


**Table 4.** Performance of the different feature sets for the YR detection based on ground observations.

The comparison of the different feature sets demonstrated the potential benefit of feature selection more clearly. The highest accuracy was obtained by the feature set optimized at the UAV scale, whereas the feature set from the ground scale obtained an even lower performance than the equidistant feature set (Table 5). For the UAV data set, a separation of test and training data was not possible due to the much smaller data base. Here, a leave-one-out cross-validation was applied to obtain R² and correlation coefficients. The obtained feature set may have been more adapted to the evaluation procedure at the UAV scale than to that at the ground scale. The UAV evaluation shows that it was possible to slightly increase the accuracy by feature selection compared to the full data set, and also that uninformed subsampling did not lead to optimal results.
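For small data sets such as the UAV plots, leave-one-out cross-validation yields one out-of-sample prediction per plot, from which R² can be computed. A sketch on synthetic data, with ridge regression as a stand-in for the paper's regression model:

```python
# Sketch of leave-one-out cross-validation for R^2 on a small data set.
# Data and the ridge model are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(1)
X = rng.random((24, 11))                            # 24 plots x 11 selected bands
y = X @ rng.random(11) + rng.normal(0.0, 0.05, 24)  # synthetic severity scores

# One held-out prediction per plot
pred = cross_val_predict(Ridge(alpha=0.1), X, y, cv=LeaveOneOut())
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
```

Each of the 24 models is trained on the remaining 23 plots, so no plot contributes to its own prediction, which keeps the R² estimate honest despite the small sample size.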

**Table 5.** Performance of the different feature sets for the YR regression based on UAV observations.


However, the data characteristics at the ground-canopy and UAV scales were so disparate that an advantage of feature set transfer is doubtful. The transferred feature set had a lower performance even than the uninformed equidistant sampling. Multiple factors contributed to the deviating data characteristics, which were expressed in different demands on the feature sets. One of the main points was the use of different sensors with different measurement principles, each adapted to its measurement scale. The noise characteristics of the ground camera showed an increased noise level in the spectral border regions and a noise optimum in the red range. The UAV camera showed a homogeneous measurement quality over the whole range, apart from some artifact bands around 630 nm, where optical refractions seem to occur at a beam splitter. The suitability of a spectral region can be significantly reduced by such sensor characteristics, but if the effect occurs at only one sensor, the optimal feature set changes.

Further points concern the implicit spatial smoothing when a larger area is captured by a single pixel. At the ground scale, the feature set will directly point to the reflectance characteristics of the spores, whereas at the UAV scale, the reduced vitality and even morphological changes have to be taken into account. In addition, the close-range observations at the ground scale were dominated by the leaf geometry, more specifically by the leaf angle and position within the crop stand. Therefore, the analysis model had to integrate these factors to enable predictions as robust as possible against the plant geometry. At the UAV scale, most pixels provided a mixed signal of multiple leaves and, in addition, the analysis was performed on the mean spectra of each plot. Most of the geometric effects averaged out, as the characteristics of hundreds of leaves were averaged.

In general, there is no single waveband for individual diseases, but rather broad regions (blue, green, red, NIR I (700–800 nm), NIR II (800–1000 nm)) with varying relevance for the different diseases. This is tightly coupled with the sensor characteristics. The Rikola camera was not able to measure the blue region and part of the NIR (900–1000 nm), but provided stable noise conditions over its whole measurement region. The Specim V10E camera had a larger measurement region (400–1000 nm), but its spectral border regions had a much higher noise level.

#### 3.3.4. Spatial Resolution as Key Parameter for Disease Detection

The unsampled (full-resolution) data had a GSD of approximately 0.4 mm (for the Specim as well as the Rikola camera). The UAV observations (20 m flight height) had a GSD of approximately 8 mm (Figure 9).
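The relation between the subsampling factor and the resulting GSD can be illustrated with simple block averaging; this is a sketch, and the resampling actually used in the study may differ in detail:

```python
import numpy as np

def block_subsample(img, k):
    """Average non-overlapping k x k pixel blocks (spatial subsampling by factor k)."""
    h, w = (img.shape[0] // k) * k, (img.shape[1] // k) * k
    return img[:h, :w].reshape(h // k, k, w // k, k).mean(axis=(1, 3))

gsd_ground_mm = 0.4
k = 20
gsd_subsampled_mm = gsd_ground_mm * k  # 0.4 mm x 20 = 8.0 mm, matching the UAV GSD
```

Subsampling the 0.4 mm ground data by a factor of 20 thus reproduces the 8 mm GSD of the UAV observations, which is what makes the cross-scale comparison in Figure 9 possible.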

This approach did not consider the adaptation of the model to the new classification scale. By retraining the prediction model, the accuracy might be improved, as the smoothing here also affected the data characteristics. However, even then, the disease-specific information will vanish at a certain level. We omitted this evaluation because the performance measures of the retrained models were no longer comparable, as the number of training samples declined drastically, e.g., to around 100 samples for YR at the higher subsampling scales.

**Figure 9.** Visualization of spatial subsampling effects for the four investigated subsampling levels (images) and the effect of scale on accuracy and F1 score for the two approaches to subsampling the annotation: knn (nearest neighbor) and an aggressive approach.

The investigations allowed the definition of a minimal sampling distance at which the mixed information no longer allowed the prediction of plant diseases. Without retraining the model, the accuracy decreased at subsampling factors of 10 and 20. A low subsampling factor of 2 seems to have had no negative effects; presumably, the included smoothing removed border cases and outliers that are hard to classify correctly. At higher subsampling levels, more and more mixed pixels occurred, and the aggressive label subsampling tended to extend the image regions assigned to a class; subsequently, the effect was more severe there. The accuracy of more than 50% at the final stage was related to the dominant background, which provided a significant majority of the test data at the high subsampling levels. It was not related to the ability to predict the presence of YR. This was demonstrated by the F1 score, a measure to quantify the performance of a multi-class prediction model at the class level.
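The class-level F1 score combines precision and recall per class, which is why it exposes the background dominance that inflates the accuracy. A minimal sketch with made-up labels:

```python
def f1_per_class(y_true, y_pred, cls):
    """F1 score for one class: harmonic mean of precision and recall."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative labels: a dominant background class ('bg') yields high accuracy
# even when half of the YR pixels are missed
y_true = ["bg", "bg", "bg", "bg", "YR", "YR"]
y_pred = ["bg", "bg", "bg", "bg", "bg", "YR"]
```

In this toy example, the accuracy is 5/6, yet the YR F1 score is only 2/3, mirroring the effect described above where the background majority masks the loss of disease-specific information.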

In this performance measure, the quality also decreased at a subsampling factor of 10. Surprisingly, the F1 score increased to nearly optimal values at a subsampling factor of 100. Discussing this fact, it has to be noted that at the highest subsampling factor, only 119 YR samples were included in the data set, all of which were correctly classified. This may have been related to the accuracy at the UAV scale, where the majority of geometric effects were averaged out. This point remains to be evaluated in further investigations, but it seems that subsampling by a factor of 100 removed the leaf structure completely, whereas at lower subsampling factors, the leaf structure was still apparent but more and more mixture effects became present. Without any leaf structure, the classification problem is simplified to the presence of YR within this part, or even within the whole image. At low disease severities, this will cause major problems, but here, the test data were selected to show clear disease symptoms.
