Article

Machine Learning-Based Species Classification Methods Using DART-TOF-MS Data for Five Coniferous Wood Species

1
Department of Life and Nanopharmaceutical Sciences, Graduate School, Kyung Hee University, Seoul 02447, Korea
2
Department of Biomedical and Pharmaceutical Sciences, Graduate School, Kyung Hee University, Seoul 02447, Korea
3
Department of Basic Pharmaceutical Science, College of Pharmacy, Kyung Hee University, Seoul 02447, Korea
4
Department of Forest Bioresources, National Institute of Forest Science, Suwon 16631, Korea
5
Department of Integrated Drug Development and Natural Products, Graduate School, Kyung Hee University, Seoul 02447, Korea
*
Author to whom correspondence should be addressed.
Forests 2022, 13(10), 1688; https://doi.org/10.3390/f13101688
Submission received: 16 August 2022 / Revised: 4 October 2022 / Accepted: 11 October 2022 / Published: 14 October 2022
(This article belongs to the Section Wood Science and Forest Products)

Abstract

Various problems worldwide are caused by illegal production and distribution of timber, such as deception about timber species and origin and illegal logging. Numerous studies on wood tracking are being conducted around the world to demonstrate the legitimacy of timber. Tree species identification is the most basic element of wood tracking research because the quality of wood varies greatly from species to species and is consistent with the botanical origin of commercially distributed wood. Although many recent studies have combined machine learning-based classification methods with various analytical methods to identify tree species, it is unclear which classification model is most effective. The purpose of this work is to examine and compare the performance of three supervised machine learning classification models, support vector machine (SVM), random forest (RF), and artificial neural network (ANN), in identifying five conifer species and propose an optimal model. Using direct analysis in real-time ionization combined with time-of-flight mass spectrometry (DART-TOF-MS), metabolic fingerprints of 250 individual specimens representing five species were collected three times. When the machine learning models were applied to classify the wood species, ANN outperformed SVM and RF. All three models showed 100% prediction accuracy for genus classification. For species classification, the ANN model had the highest prediction accuracy of 98.22%. The RF model had an accuracy of 94.22%, and the SVM had the lowest accuracy of 92.89%. These findings demonstrate the practicality of authenticating wood species by combining DART-TOF-MS with machine learning, and they indicate that ANN is the best model for wood species identification.

1. Introduction

Various problems around the world are caused by illegally produced and distributed timber, such as deception about the species and origin of timber and illegal logging. Illegal manufacture of timber is a major cause of climate change and poverty by means of deforestation, as well as economic damage [1]. As logwood goes through several stages of processing after logging and is distributed as a variety of wood products, tracking the legality of raw wood materials is difficult. Nonetheless, studies on wood tracking are being carried out around the world to prove the legitimacy of timber for sale [2,3,4]. Timber tracking research uses a hierarchical approach appropriate to the production and distribution of timber, such as identifying the species and determining its origin and harvesting site. Tree species identification is the most basic element of timber tracking because the quality of wood varies dramatically depending on the species and corresponds to botanical identification information about commercially distributed wood [5].
The following methods are generally used for species identification: (1) morphological methods, such as anatomy and dendrochronology [6]; (2) genetic methods such as DNA barcoding, DNA fingerprinting, and phylogeography [7]; (3) chemical methods such as stable isotopes, near-infrared (NIR) spectroscopy, and direct analysis in real time–mass spectrometry (DART-MS) [4]. Morphological studies (analyses of macroscopic and microscopic anatomical features) are traditionally and most frequently used for identification; however, only trained experts can perform such analyses, and even they can struggle to identify wood at the species level. Therefore, extensive and automated image analysis methods using machine learning and neural computation are being studied [8]. DNA analysis has been successful in species identification [9], but it is time-consuming and expensive. Furthermore, analyzing DNA sequences from wood samples for genetic differentiation is challenging because isolating DNA from dried and processed wood is difficult [7]. The use of stable isotopes has proved successful for species and regions [2]; however, that technique also has challenges [4]. NIR spectroscopy has proved successful in distinguishing between species and in determining the origin of a few species [10,11,12].
Studies using DART-MS have been conducted to classify white and red oak [13], to distinguish between eucalyptus and Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES)-listed Araucariaceae [14], and to distinguish between wild and cultivated agarwood [15]. Zhang et al. [16,17] discriminated between Pterocarpus tinctorius Welw. and endangered Pterocarpus santalinus L.f. by combining DART with Fourier transform ion cyclotron resonance mass spectrometry and using a multivariate analysis in Soft Independent Modeling by Class Analogy (SIMCA) statistical software. DART-MS can ionize molecules in ambient conditions with minimal sample preparation, offering the advantages of high throughput, fast analysis speed, and non-destructive testing [13,18,19]. Those advantages are critical to time-constrained enforcement policies, so DART-MS is becoming increasingly popular [20]. Moreover, it is less expensive than genetic and stable isotope analyses. Recent studies have combined DART-MS analyses with machine learning [14,16,17,21,22,23,24,25].
Machine learning refers to computer algorithms, used in the development of artificial intelligence, that automatically recognize and learn relevant information from input data to find relationships and patterns in the data and solve problems [26,27,28]. Machine learning methods are generally categorized into two major groups based on the type of learning: supervised and unsupervised. Supervised learning methods use labeled data to train an algorithm to compute output variables [29]. Unsupervised learning methods, on the other hand, use unlabeled data to train an algorithm to identify hidden patterns [30]. The strength of applying machine learning to wood science and engineering is its ability to analyze many types of data, such as images [31], anatomy data [32], infrared spectra [12], MS spectra [33], and U(H)PLC chromatograms [34]. Furthermore, as more data accumulate and can be used to train an algorithm, its accuracy and predictive power improve. The big data-based machine learning industry is growing and finding applications in various parts of the wood industry for genotype discrimination [35,36], species identification [8,12,23,37], and wood moisture content prediction [38,39].
Multivariate analysis and machine learning methods are mainly used to classify species using chemical fingerprint data [40], but no research has assessed which method is best for determining the biological origin of wood from DART-MS data. In this paper, to reveal which classification method is best for DART-MS data, origin determination was tested using principal component analysis (PCA), partial least squares discriminant analysis (PLS-DA), orthogonal partial least squares discriminant analysis (OPLS-DA), support vector machine (SVM), random forest (RF), and artificial neural network (ANN) methods. This study aimed to prevent or solve problems in the wood distribution market caused by the use of wood of the wrong origin by presenting an effective identification method. In addition, the practicality of DART-TOF-MS combined with machine learning for authenticating wood species is demonstrated, and the best model for identification is indicated. These results may serve as a basis for future research in this field.

2. Materials and Methods

2.1. Wood Materials

Wood samples of five coniferous species, Chamaecyparis obtusa (Co) (Siebold & Zucc.) Endl., Larix kaempferi (Lk) (Lamb.) Carr., Pinus densiflora (Pd) Sieb. et Zucc., Pinus koraiensis (Pk) Sieb. et Zucc., and Pinus thunbergii (Pt) Parl., were collected from different areas of South Korea by the National Institute of Forest Science in the Korean Forest Service. Fifty samples of each species were tested (detailed information is contained in Supplementary Materials).
All samples were collected in 2021, and they were either increment cores with a diameter of 5 mm and a maximum length of 15 cm or water-depth direction cores with a diameter of 5 mm and an average length of 7.5 cm. The wood samples were stored at room temperature with silica gel until analysis.

2.2. DART-TOF-MS Conditions

With no further sample preparation, all wood samples were analyzed three times using a DART simplified voltage and pressure ion source (IonSense, Saugus, MA, USA) equipped with a JEOL JMS-T100TD (AccuTOF-TLC) mass spectrometer (JEOL Ltd., Tokyo, Japan) in positive ion mode. Each sample was placed on a lab-made wood sample tray module (Figure 1) in a stream of heated helium gas produced by the DART ion source for four seconds. The mass spectra of the heartwood were acquired using the DART ion source parameters and mass spectrometer with settings of helium gas flow rate, 3 L/min; gas heater temperature, 350 °C; desolvating temperature, 250 °C; orifice 1 temperature, 80 °C; orifice 1 voltage, 10 V; orifice 2 voltage, 5 V; ion guide peak voltage, 600 V; ion guide bias voltage, 25 V; focus voltage, −145 V; condenser lens voltage, 6.0 V; quadrupole lens voltage, 15.0 V; right/left lens voltage, −2.0 V; top/bottom lens voltage, 0.1 V; pusher bias voltage, −0.55 V; reflectron voltage, 950.0 V; ring lens voltage, 5 V; and detector voltage 2000 V. Spectra were obtained over the mass range of m/z 100 to 500 with a sampling interval of 0.5 ns and recording interval of 0.5 s. A polyethylene glycol 600 (PEG 600) standard was measured before and after each sample analysis for mass calibration.

2.3. Data Preprocessing

The data preprocessing flow chart is provided in Figure 1. The raw data were calibrated using the PEG 600 standard and exported as centroid mass spectra by Mass Center Main software (JEOL), yielding 41,938 m/z features for each sample.
In most instances, DART-TOF-MS data are supplied to algorithms in the form of a matrix or data frame, with columns and rows corresponding to variables and samples, respectively. Furthermore, a process called binning is required to group peaks whose m/z values agree within a given mass tolerance into a single variable. In this study, because the total ion chromatogram peaks had positively skewed distributions, each m/z value, originally expressed to five decimal places, was rounded down to one decimal place, and all peak intensities sharing the same rounded m/z value were summed to produce 4000 variables per sample.
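The binning step described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code: the function name `bin_spectrum` and its parameters are hypothetical, and the bin layout (m/z 100 to 500 in steps of 0.1, giving 4000 bins) is inferred from the mass range and variable count stated in the text.

```python
import numpy as np

def bin_spectrum(mz, intensity, mz_min=100.0, mz_max=500.0, width=0.1):
    """Sum centroid peak intensities into fixed-width m/z bins (hypothetical helper)."""
    mz = np.asarray(mz, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    n_bins = int(round((mz_max - mz_min) / width))  # 4000 bins for 100-500 at 0.1
    # Rounding m/z down to one decimal place is equivalent to flooring to a bin index.
    # The small epsilon guards against floating-point error at bin edges.
    idx = np.floor((mz - mz_min) / width + 1e-9).astype(int)
    keep = (idx >= 0) & (idx < n_bins)  # drop peaks outside the mass range
    binned = np.zeros(n_bins)
    np.add.at(binned, idx[keep], intensity[keep])  # accumulate peaks that share a bin
    return binned
```

Stacking one such vector per spectrum would give the samples-by-variables matrix described above.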
Data processing and subsequent classification were performed using Python in the PyCharm integrated development environment. Target data were processed through one-hot encoding, as shown in Figure 1, and seventy percent of the obtained data (n = 525) were randomly set as the training data, with the rest (n = 225) used as test data, to analyze the accuracy of the developed models.
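A 70/30 random split of the 750 spectra can be sketched with scikit-learn as follows. The helper names, the integer label coding, and the seed value are assumptions made for illustration; the text does not state how the labels were coded or which seed was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SPECIES = ["Co", "Lk", "Pd", "Pk", "Pt"]  # integer label i stands for SPECIES[i] (assumed coding)

def one_hot(y, n_classes=len(SPECIES)):
    """Turn integer class labels into one-hot rows."""
    return np.eye(n_classes, dtype=int)[np.asarray(y)]

def make_split(X, y, seed=0):
    """Randomly hold out 30% of the samples as test data (seed is an assumption)."""
    return train_test_split(X, y, test_size=0.30, random_state=seed)
```

With 750 rows, this split yields 525 training and 225 test samples, matching the counts given in the text.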
Before machine learning classification, the data were preprocessed as appropriate for each model. The input data for SVM required a Z-score normalization process [41] because SVM is sensitive to features with a large scale and has difficulty with accurate predictions with a large number of features in the dataset [42]. As RF is a tree-based model [43] and ANN was developed to perform batch normalization after each hidden layer, the input data for RF and ANN were used without normalization.

2.4. Modeling for Classification

In this study, SVM, RF, and ANN were used to classify samples of five coniferous woods at the species level (Figure 2). The SVM and RF models were developed using the Scikit-learn package, and the ANN model was developed using the Keras library.
Cortes and Vapnik [44] introduced SVM to simultaneously minimize classification error and maximize the margin between two classes. A multiclass SVM model uses a Gaussian radial basis function as the kernel function to map the given data to a high-dimensional feature space according to a one-versus-one scheme and thereby refine the decision boundary [44,45]. Two parameters, cost (C) and gamma, have to be set by the user. C determines how many data samples are allowed to be misclassified, and gamma determines the curvature of the decision boundary, i.e., the distance over which individual data samples exert influence [46]. If the values of C and gamma are set too low, more misclassified samples are tolerated and the influence distance of each data point becomes broader, producing a high probability of underfitting. On the other hand, if the values of C and gamma are set too high, few misclassified samples are tolerated and the influence distance of each data point narrows, resulting in a high probability of overfitting. We set C to 1.0 and gamma to auto (1/n_features), which means that gamma decreases as the number of features increases.
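A minimal scikit-learn sketch of this configuration is given below. It combines the Z-score normalization from Section 2.3 with an RBF-kernel SVM using C = 1.0 and gamma set to "auto", which scikit-learn defines as 1/n_features. The function name `build_svm` and the pipeline arrangement are assumptions; the source does not show how the two steps were chained.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def build_svm():
    """Z-score scaling followed by an RBF-kernel SVM (hypothetical arrangement)."""
    return make_pipeline(
        StandardScaler(),  # Z-score normalization, fitted on the training data only
        # SVC handles multiclass problems with a one-versus-one scheme internally.
        SVC(kernel="rbf", C=1.0, gamma="auto"),
    )
```

Placing the scaler inside the pipeline means its mean and standard deviation are estimated from the training data alone when `fit` is called, so no information from the test set reaches the model.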
RF is a combination of tree predictors in which each tree depends on the value of an independently sampled random vector that has the same distribution for all trees in the forest [47]. This algorithm forms multiple decision trees, passes the data through each tree at the same time, votes on the classification results for each tree, and selects the result with the most votes as the final classification result. Although overfitting can result from RF, generating a large number of trees prevents such overfitting from having a significant effect on the overall prediction [47]. The hyper-parameters for RF are as follows:
  • “n_estimators” means the number of trees in the forest;
  • “max_depth” means the maximum depth of the trees;
  • “min_samples_split” means the minimum number of samples required to split an internal node;
  • “min_samples_leaf” means the minimum number of samples required to be at a leaf node;
  • “max_features” means the number of features to consider when looking for the best split;
  • “max_leaf_nodes” means the maximum number of leaf nodes allowed in each tree;
  • “max_samples” means the number of samples to draw from the total data to train each base estimator.
We optimized n_estimators as 1000 and max_depth as 10 and allowed the rest of the parameters to be set to default values by Scikit-learn (min_samples_split, 2; min_samples_leaf, 1; max_features, auto; max_leaf_nodes, none; max_samples, none).
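The RF settings listed above translate into scikit-learn roughly as follows. This is a sketch under assumptions: the function name `build_rf` and the seed are hypothetical, and all parameters other than n_estimators and max_depth are left at the defaults of the installed scikit-learn version, which may differ from the version the authors used.

```python
from sklearn.ensemble import RandomForestClassifier

def build_rf(seed=0):
    """Random forest with 1000 trees of depth at most 10 (seed is an assumption)."""
    return RandomForestClassifier(
        n_estimators=1000,  # number of trees in the forest
        max_depth=10,       # maximum depth of each tree
        random_state=seed,  # fixed seed so repeated runs give the same forest
    )
```

Because tree-based splits depend only on the ordering of feature values, the binned intensities can be passed to this model without normalization, as noted in Section 2.3.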
In an ANN, artificial neurons form a network through synaptic connections and solve problems by learning to change the strength of those connections [48]. In this study, we developed a multilayer ANN model. Instead of directly using the given data to form a classifier, we added structures called hidden layers, which train the model with transformation operations optimized for the specified data [49]. The update rules for the parameters of each neuron are derived by calculating the weights and biases that minimize the cost function defined by the labels contained in the training dataset. However, hidden layers create a problem in that the cost function cannot be defined for them directly because they have no labels. To solve that problem, the backpropagation algorithm was developed to train multilayer ANNs [50]. We therefore developed a multilayer ANN model with three hidden layers, which can be considered a deep neural network [51,52,53], using the Leaky ReLU activation function and the Adam optimization method with a sparse categorical cross-entropy loss function and a learning rate of 0.0001.
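To make the activation and loss named above concrete, the following NumPy sketch shows what Leaky ReLU and sparse categorical cross-entropy compute. It is not the authors' Keras model: the function names are hypothetical, and the negative slope `alpha` is an assumed value because the text does not report one.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Pass positive inputs through; scale negative inputs by alpha (assumed slope)."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    """Convert each row of scores into class probabilities."""
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def sparse_categorical_crossentropy(probs, labels):
    """Mean negative log-probability assigned to the true class, given integer labels."""
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked)))
```

For five classes, a network that assigns equal probability to every species has a loss of ln 5, about 1.61, so values well below that indicate that training is separating the classes.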

3. Results

3.1. Figures, Tables, and Schemes

Representative mass spectra for the heartwood of each coniferous species are shown in Figure 3. Each spectrum was obtained by overlapping 150 spectra per species with equal weighting. Among the major peaks, m/z 205.19 was observed in Co, Lk, and Pd; m/z 207.25 and 303.23 were observed in all species except Co; and m/z 338.34 and 355.37 were detected in all species except Pd. It was not possible to distinguish the species based on their major peaks.

3.2. Multivariate Analysis

PCA is an unsupervised analytical method used to find differences between groups of samples, understand group associations, and validate the relative roles of compounds in group separation [54,55]. As a supervised method, PLS-DA reduces dimensions by considering the variance of Y [56,57]. Therefore, PLS-DA requires extraction of a new variable that maximizes the covariance between the linear combination of the independent variable X and the dependent variable Y. Such a variable is not available in PCA, and it has the advantage of building an efficient model with a small number of extracted variables [56,57]. OPLS-DA can maximize discrimination between groups by adding orthogonal components to PLS using class information [58]. We performed PCA, PLS-DA, and OPLS-DA using SIMCA-P software ver. 14.1 (Umetrics, Umeå, Sweden). In the PCA and PLS-DA results (Figure S1), no significant distinctions were found between species in the PCA scatter plot, and only Co formed a cluster separate from the other species. The OPLS-DA score plot (Figure 4) shows that C. obtusa was distinguished from the other species, and L. kaempferi and P. koraiensis were adequately separated in the OPLS-DA 3D score plot. However, P. densiflora and P. thunbergii were not clearly distinguished in either the OPLS-DA score plot or the OPLS-DA 3D score plot. The OPLS-DA analysis provided the following statistical parameters: R2X = 0.797, R2Y = 0.841, and Q2 = 0.813. To test the validity of the OPLS-DA model and check for overfitting, permutation tests were performed using 100 random permutations and the same number of components. Those results show that the model is valid because the extrapolated intercept value (Q2) of −0.157 indicates that the model is not overfitted [59].

3.3. Classification Model Performance

When classifications were performed with the same random seed, the accuracy increased in the order of SVM, RF, and ANN. In all models, the genus-level classification accuracy was 100% in our sample group, meaning the accuracy of each model in the genus Pinus depends on species-level classification. The SVM model produced identification errors of about 7% in the test set, including classifying P. densiflora as P. thunbergii, P. koraiensis as P. densiflora, and P. thunbergii as L. kaempferi or P. densiflora, with a total prediction accuracy rate of 92.89%. The RF model’s prediction accuracy rate of 94.22% was slightly higher than that of the SVM, and its errors involved predicting P. densiflora as P. thunbergii and P. thunbergii as P. densiflora. The ANN model produced the same types of errors as the RF, but their frequency was significantly lower, with an overall prediction accuracy rate of 98.22%. To validate the results, we added a new test set of the same size, drawn from samples that were not used for developing the classification models. The accuracy results for the training (70%), test (30%), and new test sets for each model are provided in Table 1. Confusion matrices for the three models are presented in Figure 5.
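As a worked check, the reported accuracies correspond to 209, 212, and 221 correct predictions out of 225 test spectra for SVM, RF, and ANN, respectively (counts inferred from the percentages, not stated in the text). The sketch below also shows one possible way to score predictions at the genus level by collapsing species labels. The helper names are hypothetical, and the source does not say whether genus-level models were trained separately.

```python
import numpy as np

# Species abbreviations from Section 2.1 mapped to their genera.
GENUS = {"Co": "Chamaecyparis", "Lk": "Larix", "Pd": "Pinus", "Pk": "Pinus", "Pt": "Pinus"}

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def genus_accuracy(y_true, y_pred):
    """Score species predictions after collapsing them to genus (assumed approach)."""
    return accuracy([GENUS[s] for s in y_true], [GENUS[s] for s in y_pred])

# Accuracy implied by each inferred count of correct test predictions.
for name, correct in [("SVM", 209), ("RF", 212), ("ANN", 221)]:
    print(name, round(100 * correct / 225, 2))
```

Under this scoring, a confusion between P. densiflora and P. thunbergii lowers species-level accuracy but leaves genus-level accuracy unchanged.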

3.4. Evaluation of Model Performance

The purpose of model evaluation is to prevent overfitting and identify the optimal model. Accuracy or error rate is one of the most commonly used metrics in practice, and many researchers use it to evaluate the generalization ability of a classifier [60]. Accuracy measures the overall correctness of a trained classifier, calculated as the proportion of all instances for which the classifier made a correct prediction [61]. The benefits of accuracy as a metric are its ease of computation with low complexity, its suitability for multiclass and multilabel problems, its user-friendly scoring, and its ease of understanding. However, the simplicity of the accuracy metric could lead to suboptimal solutions, especially when dealing with imbalanced class distributions [60,61,62].
Therefore, to evaluate the performance of the classification models, in addition to accuracy, the following metrics were estimated: precision, recall, F1-score, receiver operating characteristic (ROC) curve, and area under the ROC curve (AUC) from the confusion matrix (Table 2) [61,63,64]. The ROC curves for the models are presented in Figure 6. All metrics have a value between 0 and 1, and the closer the result is to 1, the better the classification performance [61].
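Precision, recall, and F1-score can be derived per class from a confusion matrix, as in the sketch below. The function name is hypothetical, and the layout (rows are true classes, columns are predicted classes) is an assumption that follows the scikit-learn convention. The sketch also assumes every class is predicted at least once, so no division by zero occurs.

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision, recall, and F1 from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)               # correctly classified samples per class
    precision = tp / cm.sum(axis=0)  # divide by the number predicted as each class
    recall = tp / cm.sum(axis=1)     # divide by the number truly in each class
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Averaging these per-class values gives macro-averaged scores, which weight all five species equally.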

4. Discussion

Identification of tree species is important when the wood is distributed and used as a raw material because wood quality and price depend on species [65]. Therefore, various studies have been conducted to distinguish the exact origins of similar woods [6,7,8,9,10,11,12,13,14,15,16,17,37].
For the inter-genus classification, all three machine learning models showed 100% prediction accuracy in the sample group collected. In the inter-species classification, the ANN model had the highest classification prediction accuracy, 98.22%. The RF model had 94.22% accuracy, and the SVM had the lowest accuracy, 92.89%. Classification at the species level is difficult because the samples are genetically similar and have similar metabolomic properties. In particular, for our samples, Pd and Pt are genetically closer to each other than to Pk in the phylogenetic tree [66].
In a study in South Africa, Omer et al. [67] tested SVM and ANN models for pixel-based classification of six tree species from mixed native coastal forests. The methods produced similar classification results; the overall accuracy of SVM and ANN was 77% and 75%, respectively, using a randomly selected 70% (583) of the dataset for training and the remaining 30% (244) of the dataset for evaluation. In the same study area, Cho et al. [68] used an SVM algorithm to compare pixel-based and object-based approaches for classifying three major tree species. The pixel-based and object-based methods achieved overall accuracy scores of 85% and 89%, respectively. These papers confirmed that SVM is more suitable for pixel-based classification, indicating the need to evaluate a suitable machine learning algorithm according to data type. Karlson et al. [69] utilized an RF model for object-based classification of five major tree species in Burkina Faso, West Africa, from WorldView-2 imagery. The highest overall accuracy, 83.4%, was achieved with the multiseasonal dataset, followed by 78.4% with the dry season dataset and 68.1% with the wet season dataset.
The advantage of RF is that it incorporates a feature importance property into the model itself, so variable importance can be determined without any special process. Deklerck et al. [22] mitigated the overfitting and bias caused by random discriminant ion selection by performing kernel discriminant analysis on data obtained from RF and using the result to classify afromosia and provide the minimum number of ions needed to achieve the highest accuracy. In our study, the most important variables were m/z values of 271.1, 270.1, and 241.1, which were not among the major m/z values in the spectra (Figure 3).
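Ranking variables by importance and mapping them back to m/z values can be sketched as follows. The input would typically be the `feature_importances_` attribute of a fitted scikit-learn random forest. The function name is hypothetical, and the sketch assumes that column i of the data matrix corresponds to the bin starting at m/z 100 + 0.1 × i, which the text does not state explicitly.

```python
import numpy as np

def top_mz(importances, k=3, mz_min=100.0, width=0.1):
    """Return the m/z bins of the k most important features (assumed bin layout)."""
    importances = np.asarray(importances)
    order = np.argsort(importances)[::-1][:k]  # feature indices, most important first
    return [float(round(mz_min + i * width, 1)) for i in order]
```

For example, under this layout feature index 1711 maps to m/z 271.1.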
In this study, we rounded the m/z values down to one decimal place for convenience of data handling in machine learning. The resulting data resemble low-resolution MS data, which means that low-resolution mass data can be used with machine learning classification methods. We selected an ANN with three hidden layers as a deep learning method [51,52,53] because it is well suited to MS spectrum data in vector form. In supervised learning, as in this study, the multilayer perceptron has been effectively utilized as a feedforward ANN in which hidden layers separate the input and output layers and act as the computational engine of the network [52]. Furthermore, ANN has the potential to be developed as a database tool that can store spectra of different wood species and classify unknown samples accordingly. As related research continues and more information about various wood species accumulates, more information will be available to train the ANN, which will improve its accuracy in wood species prediction. Since DART-MS is an ambient mass spectrometry method and its measurements can be performed in high throughput [19], the metabolomics data of DART-MS spectra can be effectively applied to machine learning using methods such as SVM, RF, and ANN.

5. Conclusions

Identification of marker compounds after multivariate analysis is normally used for metabolite classification analyses [70]. Complex metabolites, such as those in plants, however, often cannot be clearly classified by multivariate analysis. In the field of plant metabolite research, there has been a trend in recent decades to study the overall aspects of metabolites using various data analysis techniques rather than the results of specific marker compounds [71,72,73].
In this study, we examined the performance of SVM, RF, and ANN algorithms and the potential of DART-TOF-MS data to classify wood species. Five conifers representing three genera and five species were analyzed by DART-TOF-MS and classified using three machine learning models. We demonstrated that, when sufficient data are available, SVM, RF, and ANN all show excellent classification results with DART-TOF-MS data that contain many variables. Evaluation of the performance of each model showed a value of 0.9 or higher in all metrics, indicating excellent performance. Among them, ANN showed the highest accuracy. These findings substantiate the practicality of authenticating wood species using DART-TOF-MS coupled with machine learning and indicate that ANN is the best model type for wood species identification. These results provide a foundation for additional research in tree species identification.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/f13101688/s1, Figure S1: PCA and PLS-DA scatter plots for Chamaecyparis obtusa (Co), Larix kaempferi (Lk), Pinus densiflora (Pd), Pinus koraiensis (Pk), and Pinus thunbergii (Pt).

Author Contributions

Conceptualization, Y.-P.J. and G.P.; methodology, G.P. and Y.-G.L.; software, Y.-G.L.; validation, G.P. and Y.-S.Y.; formal analysis, Y.-G.L.; investigation, G.P. and Y.-G.L.; resources, J.-Y.A. and J.-W.L.; data curation, Y.-G.L.; writing—original draft preparation, G.P.; writing—review and editing, G.P. and Y.-P.J.; visualization, G.P.; supervision, Y.-P.J.; project administration, Y.-P.J.; funding acquisition, J.-W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a grant (Project No. FG0601-2019-02) from the National Institute of Forest Science, Korea in 2021. The findings and conclusions of this publication are solely the responsibility of the authors and do not necessarily represent the official views of the funding agency.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this publication.

References

  1. Reboredo, F. Socio-economic, environmental, and governance impacts of illegal logging. Environ. Syst. Decis. 2013, 33, 295–304.
  2. Dormontt, E.E.; Boner, M.; Braun, B.; Breulmann, G.; Degen, B.; Espinoza, E.; Gardner, S.; Guillery, P.; Hermanson, J.C.; Koch, G.; et al. Forensic timber identification: It’s time to integrate disciplines to combat illegal logging. Biol. Conserv. 2015, 191, 790–798.
  3. Schmitz, N.; Boner, M.; Cervera, M.T.; Chavesta, M.; Cronn, R.; Degen, B.; Deklerck, V.; Diaz-Sala, C.; Dormontt, E.; Ekué, M.; et al. General sampling guide for timber tracking: How to collect reference samples for timber identification. Glob. Timber Track. Netw. GTTN Secr. Eur. For. Inst. Thuenen Inst. 2019, 43, 1–43.
  4. Schmitz, N.; Beeckman, H.; Blanc-Jolivet, C.; Boeschoten, L.; Braga, J.; Cabezas, J.A.; Chaix, G.; Crameri, S.; Deklerck, V.; Degen, B.; et al. Overview of current practices in data analysis for wood identification. A guide for the different timber tracking methods. Glob. Timber Track. Netw. GTTN Secr. Eur. For. Inst. Thuenen Inst. 2020.
  5. Jozsa, L.A.; Middleton, G.R. A Discussion of Wood Quality Attributes and Their Practical Implications; Forintek Canada Corporation Vancouver: Vancouver, BC, Canada, 1994.
  6. Schweingruber, F.H. Trees and Wood in Dendrochronology: Morphological, Anatomical, and Tree-Ring Analytical Characteristics of Trees Frequently Used in Dendrochronology; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
  7. Lowe, A.J.; Cross, H.B. The Application of DNA methods to Timber Tracking and Origin Verification. IAWA J. 2011, 32, 251–262.
  8. Wu, F.; Gazo, R.; Haviarova, E.; Benes, B. Wood identification based on longitudinal section images by using deep learning. Wood Sci. Technol. 2021, 55, 553–563.
  9. Höltken, A.M.; Schröder, H.; Wischnewski, N.; Degen, B.; Magel, E.; Fladung, M. Development of DNA-based methods to identify CITES-protected timber species: A case study in the Meliaceae family. Holzforschung 2012, 66, 97–104.
  10. Bächle, H.; Zimmer, B.; Wegener, G. Classification of thermally modified wood by FT-NIR spectroscopy and SIMCA. Wood Sci. Technol. 2012, 46, 1181–1192.
  11. Nisgoski, S.; de Oliveira, A.A.; de Muñiz, G.I.B. Artificial neural network and SIMCA classification in some wood discrimination based on near-infrared spectra. Wood Sci. Technol. 2017, 51, 929–942.
  12. Sohn, S.-I.; Oh, Y.-J.; Pandian, S.; Lee, Y.-H.; Zaukuu, J.-L.Z.; Kang, H.-J.; Ryu, T.-H.; Cho, W.-S.; Cho, Y.-S.; Shin, E.-K. Identification of Amaranthus Species Using Visible-Near-Infrared (Vis-NIR) Spectroscopy and Machine Learning Methods. Remote Sens. 2021, 13, 4149.
  13. Cody, R.B.; Dane, A.J.; Dawson-Andoh, B.; Adedipe, E.O.; Nkansah, K. Rapid classification of White Oak (Quercus alba) and Northern Red Oak (Quercus rubra) by using pyrolysis direct analysis in real time (DART™) and time-of-flight mass spectrometry. J. Anal. Appl. Pyrolysis 2012, 95, 134–137.
  14. Evans, P.D.; Mundo, I.A.; Wiemann, M.C.; Chavarria, G.D.; McClure, P.J.; Voin, D.; Espinoza, E.O. Identification of selected CITES-protected Araucariaceae using DART TOFMS. IAWA J. 2017, 38, 266–S3.
  15. Espinoza, E.O.; Lancaster, C.A.; Kreitals, N.M.; Hata, M.; Cody, R.B.; Blanchette, R.A. Distinguishing wild from cultivated agarwood (Aquilaria spp.) using direct analysis in real time and time-of-flight mass spectrometry. Rapid Commun. Mass Spectrom. 2014, 28, 281–289.
  16. Zhang, M.; Zhao, G.J.; Liu, B.; He, T.; Guo, J.; Jiang, X.; Yin, Y. Wood discrimination analyses of Pterocarpus tinctorius and endangered Pterocarpus santalinus using DART-FTICR-MS coupled with multivariate statistics. IAWA J. 2019, 40, 58–74.
  17. Zhang, M.; Zhao, G.; Guo, J.; Wiedenhoeft, A.C.; Liu, C.C.; Yin, Y. Timber species identification from chemical fingerprints using direct analysis in real time (DART) coupled to Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS): Comparison of wood samples subjected to different treatments. Holzforschung 2019, 73, 975–985.
  18. Pierce, C.Y.; Barr, J.R.; Cody, R.B.; Massung, R.F.; Woolfitt, A.R.; Moura, H.; Thompson, H.A.; Fernandez, F.M. Ambient generation of fatty acid methyl ester ions from bacterial whole cells by direct analysis in real time (DART) mass spectrometry. Chem. Commun. 2006, 8, 807–809.
  19. Kim, H.J.; Seo, Y.T.; Park, S.-I.; Jeong, S.H.; Kim, M.K.; Jang, Y.P. DART–TOF–MS based metabolomics study for the discrimination analysis of geographical origin of Angelica gigas roots collected from Korea and China. Metabolomics 2014, 11, 64–70.
  20. Sisco, E.; Forbes, T.P. Forensic applications of DART-MS: A review of recent literature. Forensic Chem. 2020, 22, 100294.
  21. Arora, M.; Zambrzycki, S.C.; Levy, J.M.; Esper, A.; Frediani, J.K.; Quave, C.L.; Fernández, F.M.; Kamaleswaran, R. Machine Learning Approaches to Identify Discriminative Signatures of Volatile Organic Compounds (VOCs) from Bacteria and Fungi Using SPME-DART-MS. Metabolites 2022, 12, 232.
  22. Deklerck, V.; Finch, K.; Gasson, P.; Bulcke, J.V.D.; Van Acker, J.; Beeckman, H.; Espinoza, E. Comparison of species classification models of mass spectrometry data: Kernel Discriminant Analysis vs Random Forest; A case study of Afrormosia (Pericopsis elata(Harms) Meeuwen). Rapid Commun. Mass Spectrom. 2017, 31, 1582–1588. [Google Scholar] [CrossRef] [Green Version]
  23. Deklerck, V.; Mortier, T.; Goeders, N.; Cody, R.B.; Waegeman, W.; Espinoza, E.; Van Acker, J.; Bulcke, J.V.D.; Beeckman, H. A protocol for automated timber species identification using metabolome profiling. Wood Sci. Technol. 2019, 53, 953–965.
  24. Finch, K.; Espinoza, E.; Jones, F.A.; Cronn, R. Source Identification of Western Oregon Douglas-Fir Wood Cores Using Mass Spectrometry and Random Forest Classification. Appl. Plant Sci. 2017, 5, 1600158.
  25. Pavlovich, M.J.; Dunn, E.E.; Hall, A.B. Chemometric brand differentiation of commercial spices using direct analysis in real time mass spectrometry. Rapid Commun. Mass Spectrom. 2016, 30, 1123–1130.
  26. Samuel, A.L. Some Studies in Machine Learning Using the Game of Checkers. II-Recent Progress. In Computer Games I; Levi, D.N.L., Ed.; Springer: New York, NY, USA, 1988; pp. 366–400.
  27. Salem, H.; Kabeel, A.; El-Said, E.M.; Elzeki, O.M. Predictive modelling for solar power-driven hybrid desalination system using artificial neural network regression with Adam optimization. Desalination 2021, 522, 115411.
  28. Kowsher, M.; Hossen, I.; Tahabilder, A.; Prottasha, N.J.; Habib, K.; Azmi, Z.R.M. Support Directional Shifting Vector: A Direction Based Machine Learning Classifier. Emerg. Sci. J. 2021, 5, 700–713.
  29. Cord, M.; Cunningham, P. Machine Learning Techniques for Multimedia: Case Studies on Organization and Retrieval; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2008.
  30. Barlow, H.B. Unsupervised learning. Neural Comput. 1989, 1, 295–311.
  31. Wongpoo, T.; Sriwan, W.; Titijaroonroj, T.; Jamsri, P. Chertify: Wood Identification-Based Mobile Cross-platform by Deep Learning Technique. In International Conference on Computing and Information Technology; Springer: Cham, Switzerland, 2022; pp. 77–87.
  32. Liu, S.; He, T.; Wang, J.; Chen, J.; Guo, J.; Jiang, X.; Wiedenhoeft, A.C.; Yin, Y. Can quantitative wood anatomy data coupled with machine learning analysis discriminate CITES species from their look-alikes? Wood Sci. Technol. 2022, 56, 1567–1583.
  33. Nag, A.; Gerritsen, A.; Doeppke, C.; Harman-Ware, A. Machine Learning-Based Classification of Lignocellulosic Biomass from Pyrolysis-Molecular Beam Mass Spectrometry Data. Int. J. Mol. Sci. 2021, 22, 4107.
  34. Silvello, G.C.; Bortoletto, A.M.; de Castro, M.C.; Alcarde, A.R. New approach for barrel-aged distillates classification based on maturation level and machine learning: A study of cachaça. LWT 2021, 140, 110836.
  35. He, T.; Jiao, L.; Wiedenhoeft, A.C.; Yin, Y. Machine learning approaches outperform distance- and tree-based methods for DNA barcoding of Pterocarpus wood. Planta 2019, 249, 1617–1625.
  36. He, T.; Jiao, L.; Yu, M.; Guo, J.; Jiang, X.; Yin, Y. DNA barcoding authentication for the wood of eight endangered Dalbergia timber species using machine learning approaches. Holzforschung 2018, 73, 277–285.
  37. Esteban, L.G.; De Palacios, P.; Conde, M.; Fernández, F.G.; Garcia, M.C.; González-Alonso, M. Application of artificial neural networks as a predictive method to differentiate the wood of Pinus sylvestris L. and Pinus nigra Arn subsp. salzmannii (Dunal) Franco. Wood Sci. Technol. 2017, 51, 1249–1258.
  38. Chen, J.; Li, G. Prediction of moisture content of wood using Modified Random Frog and Vis-NIR hyperspectral imaging. Infrared Phys. Technol. 2020, 105, 103225.
  39. Ozsahin, S.; Murat, M. Prediction of equilibrium moisture content and specific gravity of heat treated wood by artificial neural networks. Eur. J. Wood Wood Prod. 2017, 76, 563–572.
  40. Xi, B.; Gu, H.; Baniasadi, H.; Raftery, D. Statistical Analysis and Modeling of Mass Spectrometry-Based Metabolomics Data. In Mass Spectrometry in Metabolomics; Humana Press: New York, NY, USA, 2014; Volume 1198, pp. 333–353.
  41. El Margae, S.; Sanae, B.; Mounir, A.K.; Youssef, F. Traffic sign recognition based on multi-block LBP features using SVM with normalization. In Proceedings of the 2014 9th International Conference on Intelligent Systems: Theories and Applications (SITA-14), Rabat, Morocco, 7–8 May 2014.
  42. Amarappa, S.; Sathyanarayana, S. Data classification using Support Vector Machine (SVM), a simplified approach. Int. J. Electron. Comput. Sci. Eng. 2014, 3, 435–445.
  43. Pal, M. Random forest classifier for remote sensing classification. Int. J. Remote Sens. 2005, 26, 217–222.
  44. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  45. Hsu, C.-W.; Lin, C.-J. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 2002, 13, 415–425.
  46. Karatzoglou, A.; Meyer, D.; Hornik, K. Support vector machines in R. J. Stat. Softw. 2006, 15, 1–28.
  47. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  48. Rumelhart, D.E.; McClelland, J.L. PDP Research Group. In Parallel Distributed Processing; IEEE: New York, NY, USA, 1998; Volume 1.
  49. Svozil, D.; Kvasnicka, V.; Pospichal, J. Introduction to multi-layer feed-forward neural networks. Chemom. Intell. Lab. Syst. 1997, 39, 43–62.
  50. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
  51. Lee, J.-G.; Jun, S.; Cho, Y.-W.; Lee, H.; Kim, G.B.; Seo, J.B.; Kim, N. Deep Learning in Medical Imaging: General Overview. Korean J. Radiol. 2017, 18, 570–584.
  52. Corsaro, C.; Vasi, S.; Neri, F.; Mezzasalma, A.M.; Neri, G.; Fazio, E. NMR in Metabolomics: From Conventional Statistics to Machine Learning and Neural Network Approaches. Appl. Sci. 2022, 12, 2824.
  53. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  54. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
  55. Pearson, K. On lines and planes of closest fit to systems of points in space. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1901, 2, 559–572.
  56. Cha, J. Partial least squares. Adv. Methods Mark. Res. 1994, 407, 52–78.
  57. Lee, L.C.; Liong, C.-Y.; Jemain, A.A. Partial least squares-discriminant analysis (PLS-DA) for classification of high-dimensional (HD) data: A review of contemporary practice strategies and knowledge gaps. Analyst 2018, 143, 3526–3539.
  58. Bylesjö, M.; Rantalainen, M.; Cloarec, O.; Nicholson, J.K.; Holmes, E.; Trygg, J. OPLS discriminant analysis: Combining the strengths of PLS-DA and SIMCA classification. J. Chemom. 2006, 20, 341–351.
  59. Mahadevan, S.; Shah, S.L.; Marrie, T.J.; Slupsky, C.M. Analysis of Metabolomic Data Using Support Vector Machines. Anal. Chem. 2008, 80, 7562–7570.
  60. Huang, J.; Ling, C. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17, 299–310.
  61. Grandini, M.; Bagli, E.; Visani, G. Metrics for multi-class classification: An overview. arXiv 2020, arXiv:2008.05756.
  62. Ranawana, R.; Palade, V. Optimized Precision—A New Measure for Classifier Performance Evaluation. In Proceedings of the 2006 IEEE International Conference on Evolutionary Computation, Vancouver, BC, Canada, 16–21 July 2006; pp. 2254–2261.
  63. Hanley, J.A. Receiver operating characteristic (ROC) methodology: The state of the art. Crit. Rev. Comput. Tomogr. 1989, 29, 307.
  64. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
  65. Zobel, B.J.; Van Buijtenen, J.P. Wood Variation: Its Causes and Control; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
  66. Hong, J.K.; Yang, J.C.; Lee, Y.M.; Kim, J.H. Molecular phylogenetic study of Pinus in Korea based on chloroplast DNA psbA-trnH and atpF-H sequences data. Korean J. Plant Taxon. 2014, 44, 111–118.
  67. Omer, G.; Mutanga, O.; Abdel-Rahman, E.M.; Adam, E. Performance of Support Vector Machines and Artificial Neural Network for Mapping Endangered Tree Species Using WorldView-2 Data in Dukuduku Forest, South Africa. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2015, 8, 4825–4840.
  68. Cho, M.A.; Malahlela, O.; Ramoelo, A. Assessing the utility WorldView-2 imagery for tree species mapping in South African subtropical humid forest and the conservation implications: Dukuduku forest patch as case study. Int. J. Appl. Earth Obs. Geoinformation ITC J. 2015, 38, 349–357.
  69. Karlson, M.; Ostwald, M.; Reese, H.; Bazié, H.R.; Tankoano, B. Assessing the potential of multi-seasonal WorldView-2 imagery for mapping West African agroforestry tree species. Int. J. Appl. Earth Obs. Geoinformation ITC J. 2016, 50, 80–88.
  70. Okada, T.; Afendi, F.M.; Amin, A.U.; Takahashi, H.; Nakamura, K.; Kanaya, S. Metabolomics of medicinal plants: The importance of multivariate analysis of analytical chemistry data. Curr. Comput. Aided-Drug Des. 2010, 6, 179–196.
  71. Antunes, A.C.; Acunha, T.D.; Perin, E.C.; Rombaldi, C.V.; Galli, V.; Chaves, F.C. Untargeted metabolomics of strawberry (Fragaria x ananassa ‘Camarosa’) fruit from plants grown under osmotic stress conditions. J. Sci. Food Agric. 2019, 99, 6973–6980.
  72. Lee, S.; Oh, D.G.; Singh, D.; Lee, H.J.; Kim, G.R.; Lee, S.; Lee, J.S.; Lee, C.H. Untargeted Metabolomics toward systematic characterization of antioxidant compounds in Betulaceae family plant extracts. Metabolites 2019, 9, 186.
  73. Pérez-Cova, M.; Tauler, R.; Jaumot, J. Adverse Effects of Arsenic Uptake in Rice Metabolome and Lipidome Revealed by Untargeted Liquid Chromatography Coupled to Mass Spectrometry (LC-MS) and Regions of Interest Multivariate Curve Resolution. Separations 2022, 9, 79.
Figure 1. (a) A lab-made wood sample tray module for DART-TOF-MS analysis; (b) process flow diagram for mass spectral data preprocessing; (c) one-hot encoding for target data.
Figure 2. Illustrations of the machine learning models: (a) support vector machine; (b) random forest; (c) artificial neural networks.
Figure 3. Representative overlapped mass spectra for each species; (a) Chamaecyparis obtusa (Co) Siebold & Zucc., (b) Larix kaempferi (Lk), (c) Pinus densiflora (Pd) Sieb. et Zucc., (d) Pinus koraiensis (Pk) Sieb. et Zucc., and (e) Pinus thunbergii (Pt) Parl.
Figure 4. OPLS-DA scatter plot and 3D plot for Chamaecyparis obtusa (Co), Larix kaempferi (Lk) (Lamb.) Carr., Pinus densiflora (Pd), Pinus koraiensis (Pk), and Pinus thunbergii (Pt). Statistical parameters of the model: number of samples, 750; R2X, 0.797; R2Y, 0.841; Q2, 0.813.
Figure 5. Confusion matrix for the three machine learning models. The x-axis represents the label predicted by the models, and the y-axis represents the true label. Cells are color-coded by accuracy percentage. The genus-level classification accuracy was 100% in all models.
Figure 6. The ROC curves for the three machine learning models. The x-axis represents the false positive rate, and the y-axis represents the true positive rate.
Table 1. Classification accuracies for training, test, and new test set data for each machine learning model. Per-species test-set cells give correctly classified/total spectra, with the corresponding accuracy in parentheses.
| Model | Train Set (70%) | Test Set (30%): Total | Test Set: Co | Test Set: Lk | Test Set: Pd | Test Set: Pk | Test Set: Pt | New Test Set |
|---|---|---|---|---|---|---|---|---|
| Support Vector Machine (SVM) | 98.48% | 92.89% | 52/52 (100%) | 39/39 (100%) | 33/36 (91.67%) | 46/49 (93.88%) | 39/49 (79.59%) | 90.22% |
| Random Forest (RF) | 99.81% | 94.22% | 52/52 (100%) | 39/39 (100%) | 32/36 (88.89%) | 49/49 (100%) | 40/49 (81.63%) | 93.33% |
| Artificial Neural Network (ANN) | 100% | 98.22% | 52/52 (100%) | 39/39 (100%) | 33/36 (91.67%) | 49/49 (100%) | 48/49 (97.96%) | 98.67% |
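As a consistency check on Table 1, the per-species and total test-set accuracies can be recomputed from the correct/total counts. The sketch below is illustrative only: the dictionary layout and the helper names `accuracy_pct` and `summarize` are assumptions introduced here, not part of the original analysis; only the counts are taken from the table.

```python
# Test-set counts (correct, total) per species, transcribed from Table 1.
# The data layout and function names are illustrative assumptions.
counts = {
    "SVM": {"Co": (52, 52), "Lk": (39, 39), "Pd": (33, 36), "Pk": (46, 49), "Pt": (39, 49)},
    "RF":  {"Co": (52, 52), "Lk": (39, 39), "Pd": (32, 36), "Pk": (49, 49), "Pt": (40, 49)},
    "ANN": {"Co": (52, 52), "Lk": (39, 39), "Pd": (33, 36), "Pk": (49, 49), "Pt": (48, 49)},
}


def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to two decimals as in Table 1."""
    return round(100.0 * correct / total, 2)


def summarize(per_species):
    """Return (per-species accuracies, pooled accuracy over all test spectra)."""
    by_species = {sp: accuracy_pct(c, n) for sp, (c, n) in per_species.items()}
    correct = sum(c for c, _ in per_species.values())
    total = sum(n for _, n in per_species.values())
    return by_species, accuracy_pct(correct, total)


for model, per_species in counts.items():
    by_species, overall = summarize(per_species)
    print(model, overall, by_species)
# SVM -> 92.89, RF -> 94.22, ANN -> 98.22 (pooled over 225 test spectra)
```

The pooled values match the "Total" column, which suggests that the reported test accuracy is the fraction of all 225 test spectra classified correctly rather than a mean of the per-species percentages.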
Table 2. Evaluation metrics of the average values of five species for each machine learning model.
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Support Vector Machine (SVM) | 0.929 | 0.927 | 0.930 | 0.929 | 0.904 |
| Random Forest (RF) | 0.942 | 0.938 | 0.941 | 0.939 | 0.964 |
| Artificial Neural Network (ANN) | 0.982 | 0.982 | 0.979 | 0.981 | 0.984 |
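The Recall column of Table 2 appears consistent with an unweighted (macro) average of the per-species recalls, which can be derived from the test-set counts in Table 1. The sketch below assumes that averaging scheme; the function name `macro_recall` and the data layout are hypothetical. Precision, F1-score, and AUC are not reproduced because they require the off-diagonal confusion-matrix entries and class scores, which the tables do not list.

```python
# Test-set counts (correct, total) per species, transcribed from Table 1.
# Macro averaging is an assumption about how Table 2 was computed.
counts = {
    "SVM": {"Co": (52, 52), "Lk": (39, 39), "Pd": (33, 36), "Pk": (46, 49), "Pt": (39, 49)},
    "RF":  {"Co": (52, 52), "Lk": (39, 39), "Pd": (32, 36), "Pk": (49, 49), "Pt": (40, 49)},
    "ANN": {"Co": (52, 52), "Lk": (39, 39), "Pd": (33, 36), "Pk": (49, 49), "Pt": (48, 49)},
}


def macro_recall(per_species):
    """Unweighted mean of per-class recall (correct / total true samples)."""
    recalls = [correct / total for correct, total in per_species.values()]
    return round(sum(recalls) / len(recalls), 3)


for model, per_species in counts.items():
    print(model, macro_recall(per_species))
# SVM 0.93, RF 0.941, ANN 0.979
```

Because every species contributes equally to a macro average, the weaker Pinus thunbergii (Pt) recall lowers the SVM and RF scores even though Co and Lk are classified perfectly.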
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Park, G.; Lee, Y.-G.; Yoon, Y.-S.; Ahn, J.-Y.; Lee, J.-W.; Jang, Y.-P. Machine Learning-Based Species Classification Methods Using DART-TOF-MS Data for Five Coniferous Wood Species. Forests 2022, 13, 1688. https://doi.org/10.3390/f13101688