3.2.3. Correlation Analysis of Gene Expression vs. Lignin Content

In correlation analysis, the strategy was to calculate all possible phenotype vs. gene expression combinations and then identify the most reliable pair. In this case, seven 2-class lignin content groups and seven 2-class gene expression groups were used for a Pearson correlation coefficient calculation. For each sample, there were seven patterns of phenotype and seven patterns of gene expression value. Be aware that these values only contain the discrete variables 1 and 2. Throughout the Pearson calculations for the seven groups of lignin content vs. the seven groups of gene expression, once the coefficient value appeared higher than the threshold given by the user, this gene was marked. The threshold was set as 0.7, 0.8 and 0.9. For each pattern of phenotype, the count of marked genes was summarized in Table 1. The 2-group pattern ((3), (1, 2, 4)) of phenotype had the most genes (Table 1). Moreover, the separation of group (3) and group (1, 2, 4) was consistent with the lignin content distribution for the 48 loquat fruit samples (Figure 3), thus, we used ((3), (1, 2, 4)) for further analysis.

For the 2-group pattern of phenotype ((3), (1, 2, 4)), 53 of the 71 highly correlated genes (correlation coefficient greater than 0.8) were well annotated (Table S1). The count of annotated genes with a correlation coefficient between 0.7 and 0.8 was 307. The correlated genes were more numerous than reported in our previous study [12]. The expression patterns of the 53 highly correlated genes are shown in Figure 4.


**Figure 4.** A heatmap of normalized gene expression values of the 53 highly correlated genes in gene expression vs. lignin content comparison, as calculated by the C-CorA method.

To assess the validity and accuracy of the C-CorA method, we compared the correlation coefficients calculated by C-CorA to those of the Pearson method and the Spearman method; the results are summarized in Figure 5. Genes highly related to the lignin content which were detected by the Pearson method and the Spearman method are all included in the result from C-CorA (Figure 5A). The C-CorA method identified a larger number of candidate genes, and the correlation coefficient values showed better discrimination. Using the Pearson or Spearman methods, 22 genes with correlation coefficients greater than 0.7 were screened (Figure 5B). For these 22 genes, the correlation coefficients calculated using C-CorA were consistent with at least one of the traditional methods, except for two genes (UN68285 and UN35236). The correlation coefficients for these two genes calculated by the three methods are different. Comparison of the RPKM values for these two genes showed that the expression of UN68285 and UN35236 in all 48 samples (Figure 5C) was similar to trends in the lignin content, suggesting that these two genes are related to lignin content.

**Figure 5.** (**A**) A Venn diagram of the count of genes with high correlation coefficients in gene expression vs. lignin content comparison as calculated by the Pearson method, the Spearman method and the C-CorA method. (**B**) The correlation coefficients for 22 genes screened by the Pearson method or the Spearman method (>0.7) and the coefficients from the C-CorA method. (**C**) The base-2 logarithmic values of the RPKM values of genes UN68285 and UN35236.

In our previous study, we constructed co-expression networks between genes based on the Pearson product-moment correlation coefficient to determine lignin-related genes. Then *EjCAD3* (*UN68301*) and other genes were identified as candidate genes [12]. Using the C-CorA analysis, *EjCAD3* (*UN68301*) also showed a high correlation signal (Table S1). Other genes identified by C-CoA included a CBL-interacting protein kinase 08 (*UN00386*), an RNA binding protein (*UN30230*), a gibberellin 3-beta hydroxylase (*UN68191*) and so on, which was also consistent with the former work [12].

Among the newly identified genes were two lignin-related genes, encoding a CAD (cinnamyl alcohol dehydrogenase) (*UN26767*, *EjCAD4*) and a MYB transcription factor (*UN00328*). These two genes were not identified in the previous work [12]. The expression levels of *EjCAD4* and MYB (UN00328) decreased in LTC and HT treated samples compared with control samples (Figure 6), and they were positively correlated to the lignin content (the correlation coefficients with lignin content were 0.86 and 0.87, respectively). CAD is an enzyme that participates in the last step of monolignol biosynthesis. *AtCAD4* (AT3G19450), the homologous gene *EjCAD4* in Arabidopsis is involved in lignin biosynthesis and reported to act as an essential component in pathogen defense [19,20]. Moreover, the homologous gene *EjCAD4* in Chinese White Poplar (*Populus tomentosa* Carr.), JX986606.1, has also been identified as a candidate gene related to lignification by SSR markers [21].

**Figure 6.** The RPKM of UN26767, UN00328, WRKY (UN10110) and BHLH (UN35332) at each time point under LTC, HT and 0 ◦C treatments from the RNA-Seq analysis of loquat fruit.

The expression of other transcription factors was also identified as being correlated with lignin content, including ERF (*UN41017*), BHLH (*UN35332*), MYB (*UN31768*, *UN02037*), WRKY (*UN10110* and *UN23662*), NAC (*UN48890*) and so on (Table S1).
