#### *3.1. Method Implementation*

The C-CorA method consists of two modules: a clustering module and a correlation analysis module (Figure 1). The approach is flexible, as it can process various types of data and each of the two modules can be replaced with another suitable algorithm. The output is a set of correlation coefficients, which can be used in downstream analyses.

**Figure 1.** The C-CorA workflow, showing the clustering module and the correlation analysis module. The numbers 1, 2, 3 and 4 represent the different clusters. When *k* is set to 4 in the *k*-means clustering module, input data 1 and input data 2 are each first clustered into four groups. These groups are then randomly combined into two groups for correlation analysis.

The C-CorA algorithm is implemented in two forms: as MATLAB code and as a C# WPF (Windows Presentation Foundation) application. The MATLAB implementation is suited to large-scale data sets, while the C# WPF application is suited to small-scale data sets and to researchers with less experience in bioinformatics. The code and implementation results are available at https://github.com/ili-4/C-corA (accessed: 20 January 2021). The MATLAB code can be run with MATLAB R2018b under Windows or Linux. Clustering is performed with the *k*-means function of MATLAB, and correlation coefficients are calculated with the corr function of MATLAB. The C# WPF application is a graphical Windows program that provides a convenient way to use the C-CorA method (Figure 2A). The program requires the paths of two input data files, and the user can adjust the value of *k* used in the *k*-means clustering module of C-CorA. The output includes the clustering results and the calculated correlation coefficients. An example output file is shown in Figure 2B; it was generated by the C# WPF application of C-CorA using the RNA-seq data of loquat fruit mentioned above.
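As an illustration of the clustering step only, the following is a minimal Python sketch of *k*-means applied to one-dimensional values such as lignin content. It is not the released MATLAB or C# code: the function name `kmeans_1d`, the evenly spaced initial centres and the toy values are assumptions made for this example, and the MATLAB function may initialise its centres differently and therefore number the clusters differently.

```python
def kmeans_1d(values, k, iterations=100):
    """Lloyd's algorithm on scalar data.

    Returns (labels, centres), where labels are cluster ids 1..k,
    one per input value. Hypothetical helper, not the authors' code.
    """
    ordered = sorted(values)
    n = len(ordered)
    # Assumed deterministic start: centres spread evenly over the sorted data.
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centre.
        new_labels = [
            min(range(k), key=lambda j: abs(v - centres[j])) for v in values
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centre moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old centre if a cluster is empty
                centres[j] = sum(members) / len(members)
    return [lab + 1 for lab in labels], centres


# Toy values standing in for a measured trait (not data from the study).
toy_values = [1.0, 1.1, 0.9, 5.0, 5.2, 9.0, 9.1, 15.0]
toy_labels, toy_centres = kmeans_1d(toy_values, k=4)
```

With *k* = 4, every value receives a cluster id between 1 and 4, which is the form of input the correlation analysis module expects after recoding.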


**Figure 2.** (**A**) The appearance of the C-CorA application written using C# WPF. (**B**) An example of an output file.

#### *3.2. Transcriptome Analysis Using C-CorA Method*

#### 3.2.1. Clustering of Loquat Samples Based on Lignin Content

To obtain a more convincing two-class pattern, the *k*-means method was used for the initial clustering. With *k* set to 4, the 48 loquat samples were randomly combined and then clustered into four groups based on lignin content (Figure 3). These clusters were then randomly combined into seven 2-classes: ((1), (2, 3, 4)), ((2), (1, 3, 4)), ((3), (1, 2, 4)), ((4), (1, 2, 3)), ((1, 2), (3, 4)), ((1, 3), (2, 4)) and ((1, 4), (2, 3)). Together, these seven 2-classes cover every way of dividing four clusters into two non-empty groups. For the subsequent correlation coefficient calculations, the lignin content value was replaced by the discrete variable 1 or 2 in each 2-class. For example, in ((1), (2, 3, 4)), the lignin content value was replaced with 1 for samples in (1) and with 2 for samples in (2, 3, 4).
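The construction of the seven 2-classes and the recoding to 1 or 2 can be sketched in Python as follows. This is an illustrative sketch rather than the released implementation; the function names `two_class_partitions` and `recode` and the example cluster ids are hypothetical.

```python
from itertools import combinations


def two_class_partitions(labels):
    """List every split of the cluster labels into two non-empty groups."""
    labels = sorted(labels)
    first, rest = labels[0], labels[1:]
    partitions = []
    # Fixing the first label in the first group avoids listing a split twice.
    # Sizes stop before len(rest) so the second group is never empty.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            group_a = (first,) + extra
            group_b = tuple(lab for lab in labels if lab not in group_a)
            partitions.append((group_a, group_b))
    return partitions


def recode(cluster_ids, partition):
    """Replace each sample's cluster id with 1 (first group) or 2 (second)."""
    group_a, _ = partition
    return [1 if c in group_a else 2 for c in cluster_ids]


partitions = two_class_partitions([1, 2, 3, 4])
# Example cluster ids for five samples (illustrative, not study data).
recoded = recode([1, 2, 3, 4, 1], ((1,), (2, 3, 4)))
```

For four clusters the enumeration yields seven splits, matching the list above up to the order of the two groups within each pair. Which group is coded 1 and which is coded 2 only affects the sign of a correlation coefficient computed from the recoded values, not its magnitude.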

**Figure 3.** The *k*-means clustering result for the 48 loquat fruit samples from MATLAB. The number of groups was set to 4 in the *k*-means function. The data points are clustered based on the value of lignin content. (HT: heat treatment, LTC: low temperature conditioning and T0: 0 °C).

#### 3.2.2. Clustering of Gene Expression Data

The gene expression values were processed using the same clustering method as above, again with *k* set to 4. The expression values were divided into four groups, which were then randomly combined into seven 2-classes for subsequent calculations. In each 2-class, the gene expression value was replaced by the discrete variable 1 or 2, i.e., by 1 in one class and by 2 in the other.
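Once both variables have been recoded as 1 or 2, a correlation coefficient can be calculated for any pair of 2-classes. The Python sketch below assumes Pearson's coefficient, which for two binary-coded variables coincides with the phi coefficient. The function name `pearson` and the toy vectors are hypothetical, and the released code uses the corr function of MATLAB rather than this routine.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Returns NaN when either sequence has zero variance, since the
    coefficient is undefined in that case.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


# Toy recoded vectors for four samples (illustrative, not study data).
lignin_class = [1, 1, 2, 2]
gene_class_same = [1, 1, 2, 2]      # same grouping as the lignin classes
gene_class_flipped = [2, 2, 1, 1]   # same grouping, opposite coding

r_same = pearson(lignin_class, gene_class_same)
r_flipped = pearson(lignin_class, gene_class_flipped)
```

When the two recoded vectors group the samples identically, the coefficient is 1 or -1 depending only on which class was coded 1, so the absolute value is the more informative quantity when comparing 2-classes.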
