*2.2. Simulation Study of the Properties of BFM*

Extensive simulation was conducted to study the statistical validity and power of BFM to detect DMRs. For simplicity, for each individual, we simulated one CpG island (genomic region with CpG sites) consisting of m equally spaced CpG sites, with only one DMR of length r(<m) in the middle of the island. Further, we used equal sample size, N, for each of the K groups, and, we did not include any covariates.

Simulation Setup:

The goal here is to simulate methylation rate at each CpG site for each individual. This is achieved in two steps. In step 1, methylation data in the form of NGS short reads sequences were simulated for each CpG site, with correlated methylation status between CpG sites. We also assumed that methylation status at CpG sites among different sequences were independent, as expected in NGS data. In step 2, the individual methylation rates were calculated by summarizing the methylation status at each CpG site from the short read sequences.

The simulation details are described below:

First, we generated 100 NGS short reads using 100 pairs of random numbers {a, c} where a is the start point and c is the length of each short read sequence.

Then we used vector **Y** = (Ykis,a, Ykis,a+1, ... , Ykis,a+c−1) to define the methylation status for short read sequence s of individual i in group k, and generated **Y** from a multivariate Bernoulli distribution to allow for the correlation among the methylation rates.P(**Y** = **y**) = P ykisa, ykis,a<sup>+</sup>1, ... , ykis,a<sup>+</sup>c−<sup>1</sup> of such a discrete random vector **Y** depends on 2<sup>c</sup> probabilities, p(0, 0, ... , 0), p(0, 0, ... , 1), ... , p(1, 1, ... , 1), specific to the different realizations of **Y**. Considering the fact that if a vector Y1, Y2, ... , Yp follows p-variate Bernoulli distribution, the conditional distribution of(Y1, Y2, ... , Yr) (r <sup>&</sup>lt; p) given Yr+1, Yr+2, ... , Yp is also a multivariate Bernoulli distribution [18]. We can utilize this fact to reduce the dimensionality of the unconditional multivariate Bernoulli distribution.

Because of the correlation of methylation rates between CpG sites, we treated methylation status Ykis,j at each CpG site j on short read sequence s as a branching process, taking advantage of the property of multivariate Bernoulli distribution [19]. We assumed that, for CpG site j, branching probabilities were the same for each short read sequence of all individuals in group k. Thus, we defined the branching probability pkj = P Ykis,j = 1|Ykis,j−<sup>1</sup> = 1 as the probability of methylated sequence read at CpG site j, conditional on the methylated sequence read at CpG site j − 1 on the same short read sequence of the same individual. Similarly, we defined the branching probability qkj = P Ykis,j = 1|Ykis,j−<sup>1</sup> = 0 as the same probability, conditional on unmethylated sequence read at CpG site j − 1.

The methylation status (Ykis,a, Ykis,a+1, ... , Ykis,a+c−1) were generated as follows:

For the first CpG site of the sequence, the methylation status ykisa was generated from Bernoulli distribution Bern(ma), with ma <sup>=</sup> pka <sup>+</sup> qka /2.

The methylation status ykis,j for <sup>j</sup> <sup>=</sup> <sup>a</sup> <sup>+</sup> 1, ... , <sup>a</sup> <sup>+</sup> <sup>c</sup> <sup>−</sup> 1 was generated with ykis,j <sup>∼</sup> Bern pkj if ykis,j−<sup>1</sup> <sup>=</sup> 1 or ykis,j <sup>∼</sup> Bern qkj if ykis,j−<sup>1</sup> <sup>=</sup> 0.

After generating all the sequences at every CpG site for each individual, we calculated the total numbers of methylated and unmethylated short read sequences at CpG site j for individual i in group k, s ykis,j = 1 and s ykis,j = 0 ,. Then the methylation count and the sequencing coverage are given by mkij = s ykis,j = 1 and ckij = s ykis,j = 1 + s ykis,j = 0 , respectively.

We generated one CpG region with 24 CpG sites for each individual, 6 of which (from site 10 to 15) constituting the DMR. We simulated four groups of severity levels, with sample size 50 in each group, and repeated it with sample size 100. The branching probabilities, pkj, were pre-determined. Also, we chose qkj = pkj − 0.2. We also simulated two different scenarios of DMR patterns.

Under Scenario 1, we chose the probabilities, pkj, to be symmetric around the middle of the DMR (CpG sites 12 and 13). The predetermined probabilities pkj and their symmetric pattern under Scenario 1 are presented in Table 1.


**Table 1.** Conditional probabilities pkj at each CpG site for simulation of BFM under Scenario 1.

Under Scenario 2, we randomly chose the CpG sites with the peak values of pkj within the simulated DMR (between sites 10 and 15), varying it for different individuals. Specifically, for each individual in each group, we first generated a random number r (between 10 and 15) for the location of the CpG site with the highest methylation, and then chose the branching probabilities pkj to increase from 10 to r and then decrease from r to 15. The pkj for the non-DMCs remained the same as in Scenario 1. The second scenario is a more realistic depiction of the real world. However, the results and conclusions should be the same under both situations.
