2.2. Self-Organizing Maps and Molecular Descriptors Applied in the Chemotaxonomy of Lamiaceae Subfamilies
From the botanical occurrences of the diterpenes obtained from the Lamiaceae family, 108 molecular descriptors were generated for each molecular structure using Dragon 7.0 software [
24]. The botanical occurrences were classified into four subfamilies and the values of the descriptors were used as input data for the SOM Toolbox 2.0 software [
37]. The subfamilies selected for analysis were those that presented the highest number of botanical occurrences making possible the pattern recognition of the distribution of diterpenes in Lamiaceae (
Table 1). Then, the self-organizing map was calculated from the descriptors of each molecule, dividing the samples into groups according to their similarity, and the resulting SOM was compared with the classification proposed by Li et al. [
3].
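The grouping step described above can be illustrated with a minimal self-organizing map written in plain Python. This is only a sketch under stated assumptions, not the SOM Toolbox 2.0 procedure used in the study: the function names, the rectangular grid, the Gaussian neighborhood, the linear decay schedules and the toy two-descriptor data are all hypothetical, and the descriptors are assumed to be scaled to the range 0 to 1.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the neuron whose weights are closest to x."""
    return min(
        weights,
        key=lambda node: sum((w - v) ** 2 for w, v in zip(weights[node], x)),
    )


def train_som(data, rows, cols, epochs=40, lr0=0.5, seed=0):
    """Train a small rectangular SOM (illustrative, not the SOM Toolbox algorithm).

    data: list of equal-length descriptor vectors, assumed scaled to [0, 1].
    Returns a dict mapping (row, col) -> weight vector.
    """
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0  # assumed initial neighborhood radius
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for x in order:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3    # radius shrinks, never zero
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * sigma * sigma))  # Gaussian neighborhood
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return weights


# Toy data: two well-separated groups of two-descriptor "molecules".
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
group_b = [(1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
som = train_som(group_a + group_b, rows=3, cols=3)
```

After training, molecules with similar descriptor vectors map to the same or neighboring neurons, which is what allows regions of the map to be labeled by subfamily.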
In the maps depicted, the chemical occurrences of certain subfamilies occupy regions that are labeled by the following colors:
Clade III (Nepetoideae), red;
Clade IV (Ajugoideae, Lamioideae and Scutellarioideae), lilac;
Ajugoideae, light blue;
Lamioideae, green;
Scutellarioideae, dark blue.
The SOM that was obtained using the occurrences of the diterpenes of clade III (Nep) and clade IV (Aju, Lam and Scu) subfamilies showed a total hit rate of 86.3%, with 6025 occurrences and 5200 hits (
Table 2). The SOM generated using fingerprints to analyze the correspondence of the botanical occurrences of clade III and clade IV subfamilies resulted in a total hit rate of 89.5%. These data corroborate a good separation of the subfamilies because, even though different descriptors were used, the results were similar (
Table 2).
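The total hit rate quoted above follows directly from the reported counts (5200 hits out of 6025 occurrences). The short sketch below reproduces that arithmetic and shows how per-subfamily and total hit rates could be computed from paired labels. It assumes that a hit is an occurrence assigned to its own subfamily; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def hit_rates(true_labels, predicted_labels):
    """Return (per-class hit rates in %, total hit rate in %)."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    per_class = {label: 100.0 * hits[label] / totals[label] for label in totals}
    total = 100.0 * sum(hits.values()) / len(true_labels)
    return per_class, total


# Counts reported in the text: 5200 hits out of 6025 occurrences.
reported_total = round(100.0 * 5200 / 6025, 1)

# Toy example with made-up labels (not data from the study).
toy_true = ["Nep", "Nep", "Nep", "clade IV", "clade IV"]
toy_pred = ["Nep", "Nep", "clade IV", "clade IV", "clade IV"]
toy_per_class, toy_total = hit_rates(toy_true, toy_pred)
```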
The SOM (
Figure 2) shows a clear separation between the botanical occurrences of clade III (red) and clade IV (lilac), reaffirming the phylogenetic analysis performed by Li et al. (
Figure 1) [
3]. The SOM reveals a chemical pattern in which the subfamily Nep (red) occupies many neurons distributed across the map; it has the highest number of occurrences (3644) and the best hit rate, 89.2% (
Table 2). The predictive performance of the SOM for the five training and test sets that were generated from the original set can be visualized in
Table 3. The applicability domain (AD) was reliable for more than 99% of the test set predictions. The average match rate for the five test sets (85.4%) is very close to that of the training sets (86.4%). Clade III (Nep subfamily) shows the highest match rates for the training (88.6%) and test (88.3%) sets, while clade IV (subfamilies Aju, Lam and Scu) showed 82.1% and 81% for the training and test sets, respectively.
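The text does not state how the five training and test sets were generated. One common scheme, assumed here purely for illustration, is a five-fold partition in which each occurrence appears in exactly one test set; the averaged match rate is then the mean over the five folds. The function names below are hypothetical.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Partition sample indices into n_folds (train, test) pairs.

    Assumption: cross-validation-style folds; the original study may have
    generated its five sets differently.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        splits.append((train, test))
    return splits


def mean_rate(rates):
    """Average a list of match rates given in %."""
    return sum(rates) / len(rates)


splits = five_fold_splits(23)
```

Comparing the mean training rate with the mean test rate, as done in the text, is a simple check for overfitting: a large gap would indicate that the map memorizes the training occurrences.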
Chemotaxonomy analysis was also performed using other machine learning algorithms: support vector machine (SVM), which is a supervised machine learning algorithm, and k-nearest neighbors (k-NN), which is an instance-based algorithm. Results are shown in
Table 4 for the same analysis by clade that was performed with the SOM. As with the SOM, the models generated with SVM and k-NN obtained very similar results, with high hit rates.
The applicability domain (AD) was reliable for over 99% of the test set predictions for all algorithms used: SOM with molecular descriptor, SOM with fingerprint, SVM and k-NN.
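To make the term "instance-based" concrete, the following is a minimal k-NN classifier in plain Python. It is a sketch only: the Euclidean distance, the value of k, the function name and the toy descriptor vectors are assumptions and do not reflect the settings used in the study.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to x (ties broken by label).
    dists = sorted(
        (math.dist(row, x), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


# Toy two-descriptor data with made-up values (not data from the study).
toy_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_y = ["clade III"] * 3 + ["clade IV"] * 3

pred_near_origin = knn_predict(toy_X, toy_y, (0.2, 0.3))
pred_far = knn_predict(toy_X, toy_y, (5.5, 5.5))
```

Unlike the SOM or an SVM, k-NN builds no explicit model: every prediction is made by consulting the stored training instances directly.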
The most significant descriptors for clustering the diterpenes of the Ajugoideae, Lamioideae, Scutellarioideae (clade IV) and Nepetoideae (clade III) subfamilies are also shown in
Figure 2. The U-matrix shows the distances between neighboring map units, where high values indicate a cluster border and uniform areas of low values indicate the clusters themselves (
Figure 2a). The subfamily of clade III shows high values for the following descriptors, which are shown in black in
Figure 2a: the atom-centered fragment descriptor O-056, which encodes alcohol groups, and the functional group count nArOH, which encodes the number of aromatic hydroxyls. The diterpenes of the clade IV subfamilies present high values for the ring descriptor NRS, which encodes the number of ring systems (
Figure 2a).
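The idea behind the U-matrix can be sketched as follows: for every map unit, average the distance between its weight vector and those of its grid neighbors. This is a simplified, per-unit variant written for illustration; it assumes a rectangular grid with four-connected neighbors and Euclidean distance, and it is not the exact layout produced by SOM Toolbox 2.0. The function name and toy weights are hypothetical.

```python
import math


def u_matrix(weights, rows, cols):
    """Mean distance from each unit's weights to those of its grid neighbors.

    weights: dict mapping (row, col) -> weight vector.
    High values mark cluster borders; low, uniform values mark cluster interiors.
    """
    u = {}
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if not neighbors:  # single-unit map: no neighbors to compare with
                u[(r, c)] = 0.0
                continue
            u[(r, c)] = sum(
                math.dist(weights[(r, c)], weights[n]) for n in neighbors
            ) / len(neighbors)
    return u


# Toy 1 x 3 map: two similar units followed by a different one.
toy_weights = {(0, 0): [0.0], (0, 1): [0.0], (0, 2): [1.0]}
toy_u = u_matrix(toy_weights, rows=1, cols=3)
```

In the toy map the value rises toward the third unit, which is how a border between two clusters appears on a U-matrix.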
Among the individual descriptors, the highest value of the atom-centered fragment descriptor O-056 (alcohol) was attributed to diterpene
1 (
Figure 3) due to the presence of four alcohol groups. This diterpene is known as isorosthin J [
38,
39] and belongs to the subfamily Nepetoideae (clade III). The diterpene
2 (
Figure 3), known as ajubractin A [
40], belongs to the subfamily Ajugoideae (clade IV) and has a null value for the descriptor O-056. Diterpene
3 (
Figure 3), known as plectranthol A [
41], has the highest value of the nArOH descriptor, with four aromatic hydroxyls, whereas the lowest value (null) for this descriptor was attributed to diterpene
4, lupulin A [
42,
43,
44,
45] (
Figure 3).
Plectranthol A (
3) has been reported to show antioxidant activity [
41] and, according to this chemotaxonomic study, occurs in a species belonging to the subfamily Nepetoideae of clade III (red) (
Figure 2a), whereas lupulin A has potential antibacterial activity [
42] and is commonly found in species of the clade IV subfamilies Ajugoideae and Scutellarioideae [
42,
43,
44,
45] (
Figure 2 and
Figure 3).
By examining the NRS descriptor (
Figure 2a), it was found that diterpene
5 (
Figure 3), which is known as scutalpin L [
46,
47], presented the highest value for this descriptor, with four ring systems in its molecule; it occurs in the subfamily Scutellarioideae of clade IV. Diterpene
6 (crassifol) [
48] of the subfamily Nepetoideae shows a null value for the NRS descriptor because it has an acyclic structure (
Figure 3).
This confirms a chemical profile of diterpenes in which the subfamilies of clade IV present diterpenes with more ring systems, whereas the subfamily Nepetoideae (clade III) has molecules rich in hydroxyl groups attached to aromatic and nonaromatic groups.
The SOM generated to analyze the correspondences of the 2381 diterpene botanical occurrences of the clade IV subfamilies (Aju, Lam and Scu) resulted in a total hit rate of 91.4% (
Table 5). The subfamily Lam presents the best hit rate (94.8%) and the largest number of occurrences and compounds of clade IV; its structural diversity in terms of diterpenes is shown in the SOM (
Figure 4). The subfamily Scu shows a hit rate of 81.3%. Because all the subfamilies present an accuracy greater than 80%, these results reveal a clear separation of the subfamilies.
Using fingerprints, the accuracy rates were close to those obtained using the molecular descriptors; the subfamily Lam had the same hit rate, 94.8%, with fingerprints (
Table 5). This information supports a good SOM classification performance even when two different types of descriptors are used.
Table 6 shows a significant correspondence between the training and test sets of the Aju, Lam and Scu subfamilies. Once more, the AD was reliable for more than 99% of the test set predictions. Lamioideae has the highest match rates: 95.9% and 94.1% for the training and test sets, respectively. Scutellarioideae shows lower match rates, with a mean of 76.2% for the training sets and 68.1% for the test sets. All total training and test match rates are higher than 60%.
Chemotaxonomy analysis was also performed using other machine learning algorithms, i.e., support vector machine (SVM), which is a supervised machine learning algorithm, and k-nearest neighbors (k-NN), which is an instance-based algorithm. The results are shown in
Table 7 for the same analysis by clade IV subfamily that was performed with the SOM. As with the SOM, the models generated with SVM and k-NN obtained very similar results, with high hit rates. The applicability domain (AD) was reliable for over 99% of the test set predictions for all algorithms used: SOM with molecular descriptor, SOM with fingerprint, SVM and k-NN.
The SOM and descriptors for clade IV alone were obtained using the diterpenes of the Ajugoideae, Lamioideae and Scutellarioideae subfamilies that make up this clade (
Figure 4a). The map shows a proximity between Lam (green) and Aju (light blue), as well as between Aju (light blue) and Scu (dark blue). Therefore, the pattern of botanical occurrence of the diterpenes does not corroborate the phylogenetic classification proposed by Li et al. [
3], who report that Lam (green) would be closer to Scu (dark blue) than to Aju (light blue).
As shown in
Figure 4, the self-organizing map obtained with fingerprints separated the diterpenes similarly to the map obtained with the fragment descriptors.
Among the descriptors shown in
Figure 4a, where black indicates higher values, the diterpenes of the Scu subfamily display a high value for the nArCOOR (number of aromatic esters) descriptor, the secondary metabolites of the subfamily Lam show high values for the descriptor nR = Cp (number of terminal primary C(sp2)), and the subfamily Aju has molecular structures with higher values of the descriptor nFuranes (number of furans).
The diterpene
7 (
Figure 5) shows the highest value for the nArCOOR descriptor because its structure has three aromatic esters. It is commonly known as scutebatin B [
49] and is found in the subfamily Scutellarioideae (dark blue) (
Figure 4a); the study describing its isolation verified its inhibitory effects on the production of nitric oxide induced by lipopolysaccharide in macrophages [
49]. In the nArCOOR descriptor map, the white areas correspond to regions of lower values and are related to the diterpenes of Lamioideae (green) and Ajugoideae (light blue) (
Figure 4a). An example is diterpene
8 (
Figure 5), known as cyllenin A [
50,
51], which does not have aromatic ester groups and belongs to the subfamily Lamioideae.
The highest value of the descriptor nR = Cp was attributed to diterpene
9 (
Figure 5), which is known as sclarene [
7]; with three terminal sp2 carbons, this diterpene occurs in the subfamily Lamioideae (green) (
Figure 4a). The lowest value of the descriptor nR = Cp corresponds to diterpene
10 (
Figure 5), which does not have any terminal sp2 carbon and is located in the subfamily Ajugoideae (light blue). Diterpene
10 is known as ajugamarin A1 [
43] and shows a potential neuroprotective effect [
52].
The diterpene
11 (
Figure 5), teubrevin G [
53,
54], presents the highest value for the nFuranes descriptor because it has two furan rings. The black region of this descriptor, which represents higher values, coincides with the region of the map occupied by the diterpenes of Ajugoideae, confirming that this diterpene occurs in the subfamily Ajugoideae. The diterpene
12 (
Figure 5), known as sidendrodiol [
7,
55,
56,
57], occurs in species of the subfamily Lamioideae and does not have furan groups.
The Lamiaceae family includes the genus
Scutellaria, which belongs to the subfamily Scutellarioideae and has a cosmopolitan distribution, with around 360 species found worldwide in different climatic regions. Most of its species that grow in Asia have a long tradition in Chinese folk medicine [
46]. Several studies indicate that diterpenes are commonly found in these species.
Isodon, belonging to the Nepetoideae subfamily, is another genus with a similarly cosmopolitan distribution, concentrated mostly in Asia. Several species of this genus have been described; however, their chemical constituents are quite different from those found in the Scutellarioideae subfamily, as can be verified from the hit rates of the SOMs analyzed for clade III and clade IV [
58].