*3.5. Data Analysis*

HPLC fingerprints of the 280 rhizome, 280 stem, and 280 leaf samples (840 fingerprints in total) were exported in CSV format and imported into MATLAB R2018b (The MathWorks, Inc., Natick, MA, USA), which was used to align the chromatographic fingerprints by correlation optimized warping (COW). MATLAB code for COW is freely available from www.models.kvl.dk. The preprocessed fingerprints were used in the subsequent analyses [59].

Exploratory data analysis (EDA) is a necessary step before building predictive models [60,61]. It helps to reveal correlations among samples or variables and to summarize the main characteristics of a data set [60]. Principal component analysis (PCA) is a popular primary tool in EDA [61,62]. It is often used to visualize the relatedness between samples and to explain the variance in the data. Hence, PCA, as an unsupervised pattern recognition technique, has been widely used to extract key information from chemical fingerprints for geographical origin or modelling research [61].
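As a minimal illustration of how PCA summarizes sample relatedness, the following Python sketch extracts the first principal component of a small toy matrix by power iteration on the covariance matrix. It is not the SIMCA workflow used in this study; the function names (`center_columns`, `first_principal_component`) and the toy values are hypothetical and chosen only for illustration.

```python
import math
import random

def center_columns(X):
    """Subtract the column mean from every variable (mean-centering)."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in X]

def first_principal_component(X, n_iter=200, seed=0):
    """Return (loading, scores, explained variance ratio) of PC1."""
    Xc = center_columns(X)
    n, p = len(Xc), len(Xc[0])
    # Sample covariance matrix of the centered variables (p x p).
    cov = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    # Power iteration: repeated multiplication converges to the
    # eigenvector with the largest eigenvalue (the PC1 loading).
    rng = random.Random(seed)
    v = [rng.random() for _ in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    total_variance = sum(cov[a][a] for a in range(p))
    # Scores are the projections of the centered samples onto the loading.
    scores = [sum(Xc[i][j] * v[j] for j in range(p)) for i in range(n)]
    return v, scores, eigenvalue / total_variance

# Hypothetical toy data: 6 samples x 3 variables, two groups of three samples.
toy = [
    [1.0, 2.0, 0.5], [1.2, 2.1, 0.4], [0.9, 1.9, 0.6],  # group A
    [3.0, 0.5, 0.5], [3.1, 0.4, 0.6], [2.9, 0.6, 0.4],  # group B
]
loading, scores, explained = first_principal_component(toy)
```

In this toy case the two groups fall on opposite sides of zero along PC1, which is the kind of grouping a PCA score plot is inspected for.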

Unlike PCA, orthogonal partial least squares discriminant analysis (OPLS-DA) is a supervised pattern recognition technique. As an extension of PLS, the OPLS-DA model incorporates a built-in orthogonal signal correction filter [56]. This algorithm effectively divides the variation in the X variables into two parts: one related to class information (Y-predictive) and one orthogonal, or unrelated, to class information (Y-uncorrelated). The interpretability and prediction performance of the model are therefore enhanced [56].
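The split into Y-predictive and Y-uncorrelated variation can be sketched for a single orthogonal component and a two-class response. The Python code below follows the general single-response OPLS filtering steps (predictive weight, orthogonal weight, removal of the orthogonal component from X). It is an assumption-laden sketch, not the SIMCA implementation; the function names and toy values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(values):
    m = sum(values) / len(values)
    return [v - m for v in values]

def center_columns(X):
    cols = [center(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*cols)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def x_times(X, w):
    """Scores: X w."""
    return [dot(row, w) for row in X]

def xt_times(X, t):
    """X^T t, computed column by column."""
    return [dot(col, t) for col in zip(*X)]

def opls_one_orthogonal(X, y):
    """Remove one Y-orthogonal component from centered X (centered y)."""
    w = unit(xt_times(X, y))                      # Y-predictive weight
    t = x_times(X, w)                             # Y-predictive scores
    p = [v / dot(t, t) for v in xt_times(X, t)]   # loading of t
    # Part of the loading not explained by w: direction of X variation
    # that is systematic but unrelated to y.
    w_o = unit([pj - dot(w, p) * wj for pj, wj in zip(p, w)])
    t_o = x_times(X, w_o)                         # Y-orthogonal scores
    p_o = [v / dot(t_o, t_o) for v in xt_times(X, t_o)]
    X_filtered = [[xij - ti * pj for xij, pj in zip(row, p_o)]
                  for row, ti in zip(X, t_o)]
    return w, w_o, t_o, X_filtered

# Hypothetical toy data: variable 1 carries the class signal, while
# variables 2 and 3 vary strongly but almost independently of class.
X_raw = [
    [1.0, 0.2, 3.0], [1.1, 1.9, 1.0], [0.9, 1.0, 2.1],  # class -1
    [2.0, 0.1, 3.1], [2.1, 2.0, 0.9], [1.9, 1.1, 2.0],  # class +1
]
labels = [-1, -1, -1, 1, 1, 1]
Xc = center_columns(X_raw)
yc = center([float(v) for v in labels])
w, w_o, t_o, X_filtered = opls_one_orthogonal(Xc, yc)
```

By construction the orthogonal scores `t_o` are uncorrelated with the class vector, so removing `t_o` times `p_o` from X discards variation that cannot help discrimination.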

Random forest (RF) is another supervised pattern recognition technique used in this study. RF is an ensemble learning method [55]. The RF algorithm grows a large number of trees to improve predictive ability, and the decisions of the individual trees are combined into the final decision. In other words, the more trees built in the random forest classifier, the higher the accuracy that can be achieved. However, many studies have shown that an optimum tree number is of great importance for classification performance [33,46].
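The effect of tree number on an ensemble's vote can be illustrated with a deliberately simplified stand-in for RF: bagged one-split trees (decision stumps), each fitted on a bootstrap sample with one randomly chosen feature, combined by majority vote. This Python sketch is not the randomForest package used in this study, which grows full trees and draws a feature subset at every split. All names, the seed, and the toy data are hypothetical.

```python
import random

rng = random.Random(42)  # fixed seed (arbitrary) so the sketch is reproducible

def make_data(n_per_class):
    """Toy data: features 0 and 1 separate the classes, feature 2 is noise."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            shift = 5.0 * label
            X.append([shift + rng.random(), shift + rng.random(), rng.random()])
            y.append(label)
    return X, y

def fit_stump(X, y, feature):
    """Best single split on one feature: (feature, threshold, label_above)."""
    values = sorted(set(row[feature] for row in X))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for thr in candidates:
        for label_above in (0, 1):
            errors = sum(
                (label_above if row[feature] > thr else 1 - label_above) != t
                for row, t in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, thr, label_above)
    return feature, best[1], best[2]

def predict_stump(stump, row):
    feature, thr, label_above = stump
    return label_above if row[feature] > thr else 1 - label_above

def grow_forest(X, y, n_trees):
    forest = []
    n, p = len(X), len(X[0])
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        feature = rng.randrange(p)                   # one random feature
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def accuracy(forest, X, y):
    """Majority-vote accuracy; odd forest sizes avoid tied votes."""
    correct = 0
    for row, t in zip(X, y):
        votes = sum(predict_stump(s, row) for s in forest)
        pred = 1 if 2 * votes > len(forest) else 0
        correct += pred == t
    return correct / len(y)

X_train, y_train = make_data(20)
X_test, y_test = make_data(20)
forest = grow_forest(X_train, y_train, 101)
tree_counts = [1, 5, 25, 101]
accuracy_by_trees = {k: accuracy(forest[:k], X_test, y_test) for k in tree_counts}
```

Inspecting `accuracy_by_trees` shows how the vote stabilizes as trees are added. In practice the tree number is usually chosen where the error curve (for example, the out-of-bag error) levels off, rather than simply taking the largest value.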

In this work, exploratory data analysis of the HPLC fingerprints of *G. rigescens* grown at four different latitudes was performed with PCA. Two supervised pattern recognition techniques, OPLS-DA and RF, were applied to build classification models for *G. rigescens* producing areas. PCA and OPLS-DA were performed with SIMCA 14.1 software (Umetrics AB, Umeå, Sweden). RF classification models were established in R 3.5.1 with the randomForest package (version 4.6-14) [63].
