*2.5. Data Analysis*

Figure 1 shows the summary statistics of biomarker results in the spring and summer seasons of 2014. For each biomarker, outliers were identified by falling outside three standard deviations of the means for spring and summer separately. The means and standard deviations were calculated by assigning equal weights to each data entry (an individual could have multiple data entries if they consented to provide blood samples at multiple days of the week). These outliers were hence removed and the resultant data were used for all subsequent analysis. As a result, no data were removed for IL-1β and IL-12; <1% of data were removed for IFN-γ, IL-2, IL-4, IL-6, IL-13, TNF-<sup>α</sup>, and ICAM-1; <2% of data were removed for IL-8, and IL-10; 4% data were removed for VCAM-1; 5% of data were removed for SAA, and 6% of data were removed for CRP.

**Figure 1.** Concentrations of biomarkers during spring and summer seasons, 2014. CRP, SAA, ICAM-1, and VCAM-1 are shown in μg/mL; IFN-γ, IL1-β, IL-2, Il-4, IL-6, IL-8, IL-10, IL-12, IL-13, and TNF-α are given in pg/mL. In the figure, the central line across the box represents the mean, and the upper and lower boundaries correspond to three standard deviations above and below the mean, respectively. The outliers falling outside three standards of the mean are marked with circles. For ease of illustration, four biomarkers are rescaled: IL-4 upscaled by a factor of 10; SAA and IFN-γ downscaled by a factor of 1/20; IL-8 downscaled by a factor of 1/100.

To further investigate variations in biomarker measurements, linear mixed regression models were built, with each type of biomarkers as the response variable and the following factors as potential explanatory variables, such as season, PM2.5, BC, gender, smoking, and the use of masks. The linear mixed model also made it happen to include the dependency structure of biomarkers from the same subjects who were followed in both seasons. Figure S1 (Supplementary Materials) illustrates the overall modelling process.

First, a linear mixed model can be formulated as follows:

$$\mathbf{x}\_{ij} = \beta\_0 + \beta\_1 \mathbf{x}\_1 + \beta\_2 \mathbf{x}\_2 + \dots + \beta\_6 \mathbf{x}\_6 + \eta\_i + \varepsilon\_{ij} \tag{1}$$

$$
\alpha \eta\_i \sim N\left(0, \sigma\_\eta^2\right), \quad \epsilon\_{ij} \sim N\left(0, \text{var}\left(\epsilon\_{ij}\right)\right), \tag{2}
$$

where *yij* is the level of one biomarker from subject *i* at *j*th measurement and *x*1, ... , *x*6 are the explanatory variables, including PM2.5, BC, season (spring or summer), smoking (Y or N), gender (F or M) and mask (Y or N), with corresponding coefficients *β*1, ... , *β*6. The variables x1 to x6 are treated as fixed effects and *ηi* denotes the random effect among subjects. Hence, the dependency between the biomarkers measured from the same subjects is included in the model by sharing the same term *ηi*. The term *ij* is the random error in each measurement and its variance is assumed to be also associated with the explanatory variables, following the power-of-X dispersion function [48] specified below:

$$\text{var}(\varepsilon\_{\overline{i}\}) = \sigma\_{\varepsilon}^{2} \exp(\gamma\_{0} + \gamma\_{1}\mathbf{x}\_{1} + \gamma\_{2}\mathbf{x}\_{2} + \dots + \gamma\_{6}\mathbf{x}\_{6}),\tag{3}$$

where *σε* is an unknown parameter and *γk* (*k* = 0, 1, . . . , 6) are unknown parameters called the dispersion effects parameters.

Second, model diagnostics were performed on the full model to investigate the need for transformation on the biomarker response variables and to assess outliers. Box-Cox transformations were applied on the biomarker variables, if necessary, to correct non-normality. To examine assumptions with the transformed model, scaled residuals, defined through the Cholesky decomposition of the variance-covariance matrix, were obtained in place of raw residuals, as they tended to be uncorrelated with the constant mean zero. To further detect outliers and potentially influential data points, restricted likelihood distance [49] was used as an overall influence measure in addition to scaled residuals. Specifically, a data point with either a restricted likelihood distance or a scaled residual more than three was identified as an outlier and removed from the dataset. Once an outlier was deleted, the model on the updated dataset was refitted and an outlier was removed one at a time until no more outliers were detected in the residual analysis.

Third, on the transformed full model, two stepdown variable selection procedures were employed partially based on *p*-values of the effects. Specifically, using the complete fixed effects under consideration, a stepdown selection on the dispersion effects only was performed, until the lowest Bayesian information criterion (BIC) [50] was achieved. Then keeping these selected dispersion effects, another stepdown selection on the main effects only was performed and BIC was used to select the final model. Note that the last model from the second stepdown selection did not necessarily have the lowest BIC. Thus, sometimes, some insignificant effects were kept in the final model to attain a lower BIC, or some significant effects might be excluded for the same reason.

Finally, the adequacy of each final linear mixed model was checked by a residual analysis. As shown in Figure S1 (Supplementary Materials), outliers were identified and removed in the same manner until the model was free of outliers. The final results of the models for each biomarker could be found in Table 3.



1 The 4th root of the variable was used in the model. 2 The log of the variable was used in the model. 3 The square root of the variable was used in the model.
