3.2.4. Correlation Analysis for Newly Identified Candidate Genes

To better understand the regulatory network of genes newly identified by the C-CorA method above, a correlation analysis of gene vs. gene based on their expression values was performed. The C-CorA method was used to detect the genes correlated with *EjCAD4* (*UN26767*), and a total of 65 correlated genes were identified (Table S2). Interestingly, two transcription factors, WRKY (*UN10110*) and BHLH (*UN35332*), showed the opposite expression patterns when compared to *EjCAD4* (Figure 6), which suggested that they might be negative regulators of *EjCAD4*.

Multiple genes are involved in lignin accumulation. We used the *EjCAD4* as an example to perform a correlation analysis of multi-gene vs. lignin content. When searching for genes that act synergistically with *EjCAD4* to influence the fruit lignin content, we applied the *k*-means clustering method to a 2 × 48 matrix formed by the expression values of two genes, one of which was *EjCAD4*. The clustering results were then used as the input vector for the correlation analysis together with the lignin content. Based on the results produced by C-CorA, we obtained 38 annotated genes with a correlation coefficient cutoff value of 0.8. Six calcium-correlated genes were identified, including *EjCAD3* (Table S3). Cold shock elicits an immediate rise in cytosolic free calcium concentration [22] and can activate the expression of lignin-associated genes [23,24]. As we know, Ca2+ treatment could help to maintain fruit firmness via forming an "egg-box" [25]. Whether calciumcorrelated genes interplay with loquat fruit lignification during postharvest and how they synergistically work with lignin-related gene *EjCAD4* requires further research.

#### *3.3. Other Cases of Correlation Analysis Using C-CorA*

The C-CorA method could be applied to any multi-dimensional data. It could detect more candidate genes in a correlation analysis of RNA-Seq. Two more cases are presented here.

#### 3.3.1. Transcriptome Analysis in Kiwifruit Using C-CorA

In a recent study of fruit softening [26], the mechanism of postharvest cell wall metabolism was explored in kiwifruit. Six cell wall metabolism-related genes (*AdGAL1*, *Ad-MAN1*, *AdPL1*, *AdPL5*, *Adβ-Gal5* and *AdPME1*) were identified as candidate genes for pectin degradation by a correlation-based analysis as reported in Zhang et al. [26]. The correlation

coefficients in this study were generated by the Pearson method. We applied the C-CorA method on the data provided by the author, and ten pectin degradation-related structural genes were highly related to the physiological traits. These ten genes included the six candidate genes mentioned above and four more genes (*Achn351951*, *Achn106231*, *Achn381701* and *Achn064441*). Among these four genes, two genes (*Achn351951* and *Achn106231*) were annotated as pectate lyase while the other two (*Achn381701* and *Achn064441*) were annotated as pectinesterase. Both pectate lyase and pectinesterase play an important role in pectin degradation [27]. Figure 7 illustrates the gene expression trends of these four genes in the firmness and the cell wall material pattern of kiwifruits. The pattern of physiological traits and the FPKM of these four genes showed the same or opposite trends during the treatment of kiwi fruit, which indicated that specific genes detected by C-CorA should also be considered as candidate genes in cell wall metabolism.

**Figure 7.** The cell wall material content (mg/g), firmness (N) and FPKM value of four new candidate genes (*Achn351951*, *Achn106231*, *Achn381701* and *Achn064441*) detected by C-CorA in four groups of kiwi fruit.

#### 3.3.2. Transcriptome Analysis in Persimmon Using C-CorA

Acetaldehyde is a compound that could precipitate soluble condensed tannins into insoluble condensed tannins, which is an important process for persimmon fruit flavor [28]. The production of acetaldehyde depends on two key enzymes, pyruvate decarboxylase (PDC) and alcohol dehydrogenase (ADH) [28]. Three PDC genes (*EVM0028451*, *EVM0027273* and *EVM0022732*) and two ADH genes (*EVM0007501* and *EVM0027066*) were detected in the highest correlated module by a weighted gene co-expression network analysis (WGCNA) as reported in Kou et al [29]. When we applied the C-CorA method on the same dataset as Kou et al. [29], two more genes (PDC, *EVM0018709* and ADH, *EVM0007329*) were detected. The gene expressions and ethanol/acetaldehyde production are shown in Figure 8. The expression patterns of the two genes were consistent with ethanol and acetaldehyde content. The new genes found by C-CorA are worth further research.

#### **4. Discussion**

In this work, we used a cluster-based correlation analysis method (C-CorA) to analyze the loquat RNA-Seq and identified some additional lignification-related transcripts which were not detected in the previous study [12]. In the C-CorA workflow, the cluster step and correlation analysis step are treated as two independent modules. In the cluster step, many alternative methods could be used rather than the *k*-means method, depending on the features of the input data. An advantage of using the clustering result for the correlation coefficient calculations is that it can process various types of data in different dimensions. For the correlation analysis of RNA-Seq data, C-CorA can deal with different cases, including gene vs. gene/phenotype, multi-gene vs. gene/phenotype and multiphenotype vs. gene. The correlation coefficient calculated by the C-CorA method can also be used in other correlation-based analyses, such as the WGCNA [3].

The *k*-means method has some limitations. The number of groups must be determined in advance. In this work, the parameter *k* was set to 4. By choosing a relative value of *k* and then combining these groups into 2-group patterns, the potential correct cluster will be included. Also, the clustering result given by the *k*-mean method is sometimes not unique. To overcome this deficiency, multiple rounds of calculations should be performed for each data set.

For RNA-Seq analysis, alignment-free methods are quickly developed. Kallisto [30] and Sailfish [31] use the unique *k*mer from each transcript to calculate the read count for the calculation of abundance. The C-CorA method can also use these *k*mer counts to do the correlation analysis. The *k*mer distribution can describe the difference between transcripts or genome sequences, meaning that the C-CorA method can be extended to other applicable sequence-based analyses. For data from hybrid population analyses, the cluster method matches the phenomena of segregation of character. The C-CorA method has the potential to be applied to these analyses.

According to previous work [32,33], CAD (cinnamyl alcohol dehydrogenase) is an important enzyme that catalyzes the final step in the biosynthesis of lignin precursors, and MYB transcription factors [9,10] have been identified that are involved in loquat fruit chilling lignification by regulating lignin biosynthetic genes. By performing a C-CorA analysis of gene expression vs. lignin content, we identified *EjCAD3* (*UN68301*) and

*EjCAD4* (*UN26767*) as candidate genes. In addition, a MYB transcription factor (*UN00328*) was highly correlated to the lignin content, which suggests that it could be a positive regulator of lignin biosynthesis. Using the same approach on gene vs. gene, we found two transcription factors, BHLH (*UN35332*) and WRKY (*UN10110*), that might work as negative regulators of *EjCAD4*. For the correlation analysis of multi-gene vs. lignin content, we identified six calcium correlated genes that may cooperate with *EjCAD4* in loquat fruit lignification.

Several differentially expressed genes previously identified [12] as candidate genes, such as *EjCAD3* (*UN68301*), also showed a high correlation with the lignin content using the C-CorA method. This indicates that the C-CorA method is an effective method for RNA-Seq data analysis. Furthermore, the correlation analysis of gene vs. gene and multi-gene vs. phenotype performed by C-CorA can help to build a network of the target genes.
