1. Introduction
The combination of non-pharmaceutical interventions (NPIs) such as border controls, social distancing, and test-trace-isolate-quarantine systems allowed a number of countries and regions in the Western Pacific to suppress COVID-19 transmission for extended periods [1]. During periods of low but non-zero community transmission, it can be difficult for policy makers to gauge disease transmission potential. Consequently, it remains difficult to ascertain whether contemporaneous control measures, which are economically and socially costly, are excessive, and whether outbreaks represent nascent waves or stochastic flare-ups of disease transmission. These uncertainties are further magnified when secondary community infections arise from imported, active infectors [2].
A commonly used metric of disease transmissibility is the instantaneous or effective reproduction number $R_t$, which is defined as the ratio of the number of new local infections generated at time $t$ to the total infectiousness of infected individuals at that time [3]. As an indicator of disease transmissibility, $R_t$ has a threshold of unity: values above one signal that the epidemic is growing. Given its utility to policy makers for epidemic assessment, $R_t$ has been used to understand the impact of public health interventions for outbreaks caused by pathogens such as smallpox, influenza, and severe acute respiratory syndrome coronaviruses 1 and 2 [3,4]. Over the past decade, several approaches have been proposed to estimate $R_t$ that extend the seminal Wallinga and Teunis method [5], including EpiEstim by Cori et al., EpiFilter by Parag, and EpiInvert by Alvarez et al. [3,6,7]. Although these methods pool information across time to improve the precision and hence the utility of $R_t$ estimates, they do not account for exogenous factors that may substantially affect transmissibility, such as mobility data or characteristics of the NPIs that took place over time. Exogenous factors may provide additional information on disease transmissibility, especially when the number of cases is low and uninformative.
Furthermore, although improvements to EpiEstim and EpiFilter have been proposed to distinguish imported cases from local ones [8,9], the risk of secondary transmission from imported cases may differ substantially from that of community cases when quarantine measures on international travelers are in place. When imported cases constitute a substantial fraction of the total number detected, it is important for estimates of $R_t$ to account for heterogeneous transmissibility between groups.
Therefore, this study outlines a method to combine case time-series, stratified by importation status, and covariates (vaccination rates, mobility levels, and policy implementation) to estimate $R_t$ in a regression-style framework. We applied the method to COVID-19 in four Western Pacific countries and regions that had extended periods of both low and high community incidence. The method we developed, EpiRegress, is completely data-driven, assigning little prior information to the $R_t$s. It allows changes in historical $R_t$ to be explained through relevant, exogenous covariates, and thereby provides reliable nowcasts of the current $R_t$ even when contemporaneous case counts are unavailable. The approach can therefore provide policy makers with real-time estimates of disease transmission potential to guide decisions on containment measures.
2. Methods
In brief, the EpiRegress framework assumes a negative binomial relationship between daily case counts and $R_t$, where the number of cases on a given day has an expected value equal to the sum of local and imported infectiousness. We estimated $R_t$ over time by fitting case counts in a model akin to a generalised linear regression in which the explanatory variables were taken to be mobility, epidemiological, and policy data. The Metropolis–Hastings algorithm was then used to derive the joint posterior distribution of the parameters, and predictions of case counts were made based on the parameter estimates. We exemplified the utility of our approach through application to four Western Pacific countries or regions (henceforth denoted regions), namely New South Wales, Australia; New Zealand; Taiwan, China; and the city state of Singapore, over the period from January to September 2021. These four regions and this epoch were selected as they had periods of low incidence, which make estimation of $R_t$ harder due to uninformative case data. The datasets utilised and methods developed are discussed below.
2.1. COVID-19 Data
Reported COVID-19 case counts in New South Wales, New Zealand, Singapore, and Taiwan were collected from the respective government health websites of these four regions [10,11,12,13]. They were extracted from 12 November 2020 (50 days prior to the start of 2021) to 30 September 2021 for all four regions. Local cases and imported cases were separated for New South Wales, New Zealand, and Taiwan, while for Singapore we excluded cases reported in foreign worker dormitories from both local and imported cases, as they had a distinct disease epidemiology due to localized movement restrictions and denser living conditions compared to the community populace (Figure S2) [14,15]. In the Supplementary Information, we demonstrate that these exclusions do not significantly affect $R_t$ estimates (Figure S9).
The serial interval distribution was derived from the time between notification events of 157 pairs of primary and secondary household cases in Singapore in 2021 [16]. We approximated the empirical serial interval distribution with a discrete, truncated log normal distribution with a mean of 4.1 days and a standard deviation of 3.5 days (Figure S3). We used these data to estimate the infection potential, assuming that the majority of local cases were of the B.1.617.2 variant (henceforth denoted Delta variant) [17] and that similar nonpharmaceutical interventions had the intended effects of facilitating early detection, isolating new cases, and limiting spread [18].
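As an illustration of how a log normal distribution with a given mean and standard deviation can be turned into a discrete, truncated serial interval, the following Python sketch may help. It is not the authors' code (their analyses used R and C++). The function names, the rule of assigning to day s the probability of the interval (s-1, s], and the 30-day cut-off are all assumptions made for this example.

```python
import math

def lognormal_params(mean, sd):
    """Convert a mean and SD on the natural scale to log-scale (mu, sigma)."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)

def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of the log normal distribution."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))

def discretised_serial_interval(mean, sd, max_days):
    """PMF over days 1..max_days.

    Assumed rule: day s receives the log normal probability of (s-1, s],
    and the masses are renormalised after truncation at max_days.
    """
    mu, sigma = lognormal_params(mean, sd)
    raw = [lognormal_cdf(s, mu, sigma) - lognormal_cdf(s - 1, mu, sigma)
           for s in range(1, max_days + 1)]
    total = sum(raw)
    return [r / total for r in raw]

# Mean 4.1 and SD 3.5 days are from the text; max_days=30 is hypothetical.
w = discretised_serial_interval(4.1, 3.5, max_days=30)
```

Other discretisation rules (for example, evaluating the density at integer days and renormalising) would give slightly different weights.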
We also included the time-varying proportion of cases of the Delta variant as a candidate factor to explain changes in $R_t$, using data from GISAID; for New South Wales, the Australian proportion was used as no state-level proportion was available [19]. This inclusion accounted for the variant having a significantly higher basic reproduction number than the original virus [20]. For all regions but Taiwan, the proportion rose gradually from 0 in mid-March to nearly 100%, while for Taiwan it grew rapidly from 0 to 100% in mid-June. The proportion was 100% for all four regions from mid-July to the end of September 2021 (Figure S1). The time window predated the emergence of the Omicron variant of SARS-CoV-2.
2.2. Policy Data
Policies introduced in the four regions were extracted from the Oxford COVID-19 Government Response Tracker (OxCGRT) [21], which has provided information on government responses to COVID-19 around the globe since 1 January 2020. The dataset consists of 16 different indicators grouped into categories of closures and containment (c), economic response (e), and health systems (h), together with four indices calculated as functions of individual indicators. Detailed descriptions of these indicators and indices are listed in the Supplementary Information (Table S1). We extracted data until 30 September 2021 and imputed each missing value (0.3% of values) with the last available value in that column. For Singapore, we introduced six more variables representing discrete intervention phases in place from November 2020 to September 2021: Phase 2, Phase 3, Phase 2 (Heightened Alert), Phase 3 (Heightened Alert), Preparatory Stage, and Stabilization Phase (taken as the reference category) [22]. During these phases, the government adopted distinctive containment strategies on workplaces, social gatherings, dining, and entertainment facilities.
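The imputation rule described above (carrying the last available value forward within a column) can be sketched as follows. This is a minimal illustration only; the function name and the example values are hypothetical, and missing values are represented by `None`.

```python
def forward_fill(column):
    """Replace each None with the most recent non-missing value in the column.

    Leading missing values, which have no earlier value to carry forward,
    are left as None.
    """
    filled, last = [], None
    for value in column:
        if value is not None:
            last = value
        filled.append(last)
    return filled

# Made-up daily values of one ordinal policy indicator.
indicator = [2, None, 3, 3, None, None, 1]
filled_indicator = forward_fill(indicator)
```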
Variations in the stringency, government response, and containment health index values were almost the same within each region, while for the economic support index only New South Wales and Taiwan recorded a sudden change, by the end of March and May respectively. The greatest change in the first three index values took place in mid-August for New Zealand and in mid-May for Taiwan. Changes in these indices tended to be milder for New South Wales and Singapore. Variations in the policy indicators, by comparison, were smaller, and many indicators remained constant throughout the nine-month period. Generally, indicators belonging to the same category had similar changes and the trends were mostly reflected in the corresponding indices (Figure S1).
2.3. Google Mobility Data
Mobility data (Figure S1) for the four regions were obtained from Google’s Community Mobility Reports from 12 November 2020 to 30 September 2021 [23]. The data reflect relative changes, from a pre-COVID-19 outbreak baseline, in the time that visitors spend in six different types of places: residential, workplace, retail and recreation, grocery and pharmacy, parks, and transit stations. Generally, the times people spent in the last five types of places were strongly and positively correlated with each other and negatively correlated with the time spent in residential areas, though for New South Wales and New Zealand the relationship between parks and other types of places was less clear (and in some cases reversed). Variation in time spent in workplaces was the largest for all four regions, with a prominent weekly cycle, while substantial changes in the variables were seen from May to August (Figure S1).
2.4. Vaccination Data
The vaccination doses administered per 100 people in New Zealand, Singapore, and Taiwan were collected from covidvax.live [24], an online platform that provides real-time statistics on vaccine doses registered worldwide. We obtained daily doses administered for New South Wales from COVID LIVE [25], an Australian website whose data sources are media releases and state health departments, and took the population of the region to be 8.2 million. The time range for vaccination data was from 1 January to 30 September 2021. For simplicity, we calculated the ‘vaccination rate’ as half of the average vaccination doses administered per person, i.e., $v_{i,t} = d_{i,t}/2$, where $d_{i,t}$ is the mean number of vaccination doses administered per person by time $t$ in region $i$. Vaccination started earliest in Singapore, rose at an almost constant speed from March to July, and slowed down from August as the vaccinated proportion reached around 80% of the population. For both New South Wales and New Zealand, the rate started to rise from late March and accelerated significantly from mid-July. Taiwan, however, had the lowest vaccination rate among the four regions: few people were vaccinated by mid-June and, by the end of September, only around 40% of the population had been vaccinated (Figure S1).
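The arithmetic behind the ‘vaccination rate’ can be illustrated with a short Python sketch. The function names are hypothetical; the only inputs assumed are doses per 100 people, or a cumulative dose count together with a population size (8.2 million for New South Wales, as stated above).

```python
def vaccination_rate_from_per_100(doses_per_100):
    """Half of the mean doses per person, given doses per 100 people."""
    return [d / 100.0 / 2.0 for d in doses_per_100]

def vaccination_rate_from_totals(cumulative_doses, population):
    """Half of the mean doses per person, given cumulative dose counts."""
    return [d / population / 2.0 for d in cumulative_doses]

# Made-up inputs: 0, 80 and 160 doses per 100 people.
rates = vaccination_rate_from_per_100([0, 80, 160])
# Made-up cumulative dose count for a region of 8.2 million people.
nsw_rates = vaccination_rate_from_totals([4_100_000], population=8_200_000)
```

Under this definition, 160 doses per 100 people corresponds to a rate of 0.8, which matches full two-dose coverage of 80% of the population only if everyone vaccinated has received exactly two doses.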
2.5. Modelling Daily Number of Cases and Time-Varying Reproduction Number
Using the serial interval probability mass at $s$ days, $w_s$, and the numbers of reported local and imported cases before day $t$, $L_{t-s}$ and $M_{t-s}$, the number of local cases on day $t$, $L_t$, is assumed to follow a negative binomial distribution with mean

$$\mathbb{E}[L_t] = R_t \sum_{s=1}^{\tau} w_s \left(L_{t-s} + \epsilon M_{t-s}\right) \qquad (1)$$

and variance

$$\mathrm{Var}[L_t] = k\,\mathbb{E}[L_t], \qquad (2)$$

where $\epsilon$ is assumed to be the constant risk of transmission per imported case into the community, $k$ is the inflation factor for the variance, $\tau$ is the length of the time window in which a primary case is likely to cause a secondary case and, for each day $t$, $R_t$ is the instantaneous reproduction number aforementioned. Allowing for enough time between neighboring generations of infections [26], we truncated the serial interval at $\tau$ days for computational purposes by setting $w_s = 0$ for $s > \tau$, and discretized the serial interval distribution using $f$, the probability density function of the log normal distribution with mean 4.1 days and standard deviation 3.5 days, which is the aforementioned approximation of the empirical serial interval distribution (Figure S3).
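To make the mean and variance structure concrete, the following Python sketch computes the expected number of local cases from past counts and evaluates a negative binomial log-probability whose variance is k times its mean. It is an illustration under stated assumptions, not the authors' implementation: the renewal form (R_t times the serial-interval-weighted sum of past local cases plus a fraction of past imported cases) is inferred from the verbal definitions above, and all function and argument names are hypothetical.

```python
import math

def expected_local_cases(r_t, w, local_hist, imported_hist, eps):
    """Assumed mean: R_t times infectiousness from past local and imported cases.

    w[s-1] is the serial interval mass at s days; local_hist[-s] and
    imported_hist[-s] are the counts s days before the day being modelled.
    Both histories must contain at least len(w) days.
    """
    infectiousness = sum(
        w[s - 1] * (local_hist[-s] + eps * imported_hist[-s])
        for s in range(1, len(w) + 1)
    )
    return r_t * infectiousness

def nb_logpmf(y, mean, k):
    """Log-PMF of a negative binomial with the given mean and variance k * mean.

    Requires mean > 0 and k > 1. In the (size, prob) parameterisation,
    prob = 1 / k and size = mean / (k - 1).
    """
    size = mean / (k - 1.0)
    prob = 1.0 / k
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(prob) + y * math.log(1.0 - prob))

# Toy example with a three-day serial interval and made-up counts.
mu = expected_local_cases(
    r_t=2.0, w=[0.5, 0.3, 0.2],
    local_hist=[1, 2, 3], imported_hist=[10, 0, 0], eps=0.1,
)
log_prob = nb_logpmf(4, mu, k=3.0)
```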
2.6. Augmenting Inference with Exogenous Factors
We assume that $R_t$ can be explained by a series of exogenous factors at time $t$, thus:

$$\log \mathbf{R} = \beta_0 + X\boldsymbol{\beta},$$

where $X$ is a matrix with $p$ exogenous factors measured across time points $t = 1, \ldots, T$, $\boldsymbol{\beta}$ a vector of time-invariant coefficients, $\beta_0$ a constant intercept, and $\mathbf{R}$ a vector of time-varying reproduction numbers. Covariates with constant values were excluded.
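A regression of this kind can be sketched in a few lines of Python. The sketch assumes a log-linear link, so that every reproduction number is positive; the link function is an assumption of this example rather than something the surrounding text confirms, and the function name is hypothetical.

```python
import math

def reproduction_numbers(X, beta, beta0):
    """R_t for each row of X, assuming log(R_t) = beta0 + x_t . beta."""
    return [math.exp(beta0 + sum(b * x for b, x in zip(beta, row)))
            for row in X]

# Two made-up standardised covariates observed on three days.
X = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.0]]
r = reproduction_numbers(X, beta=[0.2, 0.2], beta0=0.0)
```

With standardised covariates, `exp(beta0)` is the reproduction number when every covariate is at its mean, and each coefficient acts multiplicatively on it.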
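Since there are a large number of covariates in $X$ that may potentially affect or be correlated with $R_t$, we make use of the Bayesian Lasso [27] for parameter selection by assigning a Laplace prior distribution with mean 0 and variance $2/\lambda^2$ to each entry of $\boldsymbol{\beta}$, i.e.,

$$\pi(\beta_j) = \frac{\lambda}{2} \exp\left(-\lambda |\beta_j|\right), \quad j = 1, \ldots, p,$$

where $\lambda$ is the penalty in the $\ell_1$-penalized least square error function

$$\sum_{t} \left(y_t - \beta_0 - \mathbf{x}_t^{\top} \boldsymbol{\beta}\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|,$$

with $y_t$ the response and $\mathbf{x}_t$ the covariate vector at time $t$, which the Lasso estimates minimize.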
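The link between a Laplace prior and the Lasso penalty can be seen from the log-density, sketched below in Python. The function name is hypothetical, and the parameterisation (location 0, scale 1/lambda, hence variance 2/lambda^2) is the standard one for a Laplace distribution.

```python
import math

def laplace_log_prior(beta, lam):
    """Sum of log Laplace(location 0, scale 1/lam) densities over coefficients.

    Up to an additive constant this equals -lam * sum(|beta_j|), which is
    the negative of the Lasso penalty.
    """
    return sum(math.log(lam / 2.0) - lam * abs(b) for b in beta)

# Larger coefficients are penalised linearly in their absolute value.
small = laplace_log_prior([0.1, -0.1], lam=2.0)
large = laplace_log_prior([1.0, -1.0], lam=2.0)
```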
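We set $\lambda$ to a fixed value (see Table S3 for more details). We then used an auto-regressive Metropolis–Hastings algorithm with a Gaussian proposal distribution to estimate the joint posterior distribution of the parameters $\theta = (\beta_0, \boldsymbol{\beta}, \epsilon, k)$, and thus those of the $R_t$s. Each new draw of parameters $\theta^{*}$ was accepted with probability

$$\min\left\{1, \frac{\mathcal{L}(\theta^{*})\,\pi(\theta^{*})}{\mathcal{L}(\theta)\,\pi(\theta)}\right\},$$

where

$$\mathcal{L}(\theta) = \prod_{t} f_{\mathrm{NB}}\left(L_t \mid \theta\right)$$

is the conditional likelihood of the local case counts $L_t$ and

$$\pi(\theta) = \left[\prod_{j=1}^{p} f_{\mathrm{Lap}}(\beta_j)\right] f_{\mathrm{N}}(\beta_0)\, \mathbb{1}(\epsilon)\, \mathbb{1}(k)$$

the prior distribution of the parameters. In these, $f_{\mathrm{NB}}$ is the probability mass function for the negative binomial distribution with mean and variance as in (1) and (2), $f_{\mathrm{Lap}}$ is the density function for the Laplace distribution with location parameter 0 and scale parameter $1/\lambda$, $f_{\mathrm{N}}$ is the density function for the non-informative normal prior which we assigned to the intercept $\beta_0$, while $\mathbb{1}(\epsilon)$ and $\mathbb{1}(k)$ are the positive constraints (indicator functions taking the value 1 if and only if the argument is positive) that we set for $\epsilon$ and $k$ respectively.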
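The accept/reject step can be sketched with a generic random-walk Metropolis–Hastings sampler in Python. This is a simplified stand-in rather than the authors' auto-regressive scheme, whose proposal details are not given here: the proposal is a symmetric Gaussian centred on the current value, and the function name, step size, and toy target are all assumptions of the example. Positivity constraints are handled by letting the log-posterior return minus infinity outside the allowed region.

```python
import math
import random

def metropolis_hastings(log_post, start, step_sd, n_iter, seed=1):
    """Random-walk Metropolis-Hastings with independent Gaussian proposals.

    log_post maps a parameter list to the log of likelihood times prior,
    and must be finite at the starting value.
    """
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_post(current)
    draws = []
    for _ in range(n_iter):
        proposal = [c + rng.gauss(0.0, step_sd) for c in current]
        proposal_lp = log_post(proposal)
        # Symmetric proposal: accept with probability min(1, posterior ratio).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        draws.append(list(current))
    return draws

def toy_log_post(theta):
    """Toy target: Exponential(1) density restricted to positive values."""
    x = theta[0]
    return -x if x > 0 else float("-inf")

draws = metropolis_hastings(toy_log_post, start=[1.0], step_sd=1.0, n_iter=20000)
```

In an application of the kind described above, `log_post` would sum the negative binomial log-likelihood over days and add the log-priors for the coefficients, the intercept, and the constrained parameters.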
We standardized the covariate matrix $X$ before performing the regression, and excluded covariates which remained constant throughout the inference window, to allow for the comparison of different entries of $\boldsymbol{\beta}$.
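A minimal Python sketch of this preprocessing step is given below. The function name is hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption of the example.

```python
import statistics

def standardise_columns(X):
    """Drop constant columns, then centre and scale the rest to unit SD.

    X is a list of rows. Returns the standardised rows and the indices of
    the columns that were kept.
    """
    columns = list(zip(*X))
    kept = [j for j, col in enumerate(columns) if max(col) != min(col)]
    scaled = []
    for j in kept:
        col = columns[j]
        mean = statistics.fmean(col)
        sd = statistics.pstdev(col)  # population SD (an assumed choice)
        scaled.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*scaled)], kept

# Made-up covariates: the middle column is constant and is dropped.
Z, kept = standardise_columns([[1, 5, 0], [2, 5, 0], [3, 5, 4]])
```

Putting all covariates on the same scale is what makes the magnitudes of the fitted coefficients comparable with one another.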
To examine the roles of different factors in accounting for changes in the $R_t$s, three model variants with different factors included in the covariate matrix $X$ were considered: (i) a full model that included all available factors (Table S2); (ii) a model excluding policies, which included only mobility and epidemiological factors; and (iii) a hybrid model that included ‘retail and recreation’ and ‘residential’ from the Google mobility data, the vaccination rate, and all indicator covariates in the Oxford policy data except ‘testing policy’ and ‘vaccination policy’. The second model variant was chosen to see whether mobility and epidemiological variables could fully reflect changes in policy-related variables, making the latter redundant in $R_t$ estimation. Variables in the last model variant were selected based on the correlation matrices of the $\beta$s in the full model, i.e., some of the variables with correlation coefficients close to 1 were excluded in the hybrid model.
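To compare the fits of different model variants, we used the Deviance Information Criterion (DIC), which measures the deviance while penalizing model complexity. The formula for calculating the DIC is

$$\mathrm{DIC} = \bar{D} + p_D,$$

where $\bar{D}$ is the posterior mean of the deviance $D(\theta) = -2 \log p(y \mid \theta) + C$, $p_D$ is the effective number of parameters, $y$ denotes the observed case counts, $\theta$ is the collection of the parameters to be estimated and $C$ is some constant.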
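One common way of computing the DIC from MCMC output is sketched below in Python. It assumes the version in which the effective number of parameters is the posterior mean deviance minus the deviance evaluated at the posterior mean of the parameters; whether this is the variant used in the study is not stated here, and the function name is hypothetical.

```python
def dic(deviance_draws, deviance_at_posterior_mean):
    """Return (DIC, p_D) from posterior deviance draws.

    Assumed definition: p_D = mean deviance - deviance at the posterior mean,
    and DIC = mean deviance + p_D.
    """
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return mean_deviance + p_d, p_d

# Made-up deviance values from three posterior draws.
dic_value, p_d = dic([10.0, 12.0, 14.0], deviance_at_posterior_mean=11.0)
```

Lower DIC values indicate a better trade-off between fit and complexity; the additive constant in the deviance cancels when model variants fitted to the same data are compared.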
2.7. Simulation for Validation of the Method
Since the true $R_t$ values are not observable in the case studies, we used simulations to validate the proposed approach. We considered two different covariate matrices over a window of 230 days (henceforth, Scenario 1 and Scenario 2): one taken directly from the mobility, epidemiological, and policy data of New Zealand between 1 January and 28 June 2021, and the other with 20 randomly generated covariates, of which 6 are continuous variables and the remaining 14 are ordinal (range for each variable: 0–4). As for the inference performed in the case studies, we standardized the matrices and excluded the covariates with constant values, after which the first covariate matrix was left with 21 variables.
We randomly generated 4 different sets of $\boldsymbol{\beta}$ coefficients for each scenario, calculated the corresponding $R_t$s, and further simulated local case counts for 230 days from the negative binomial distribution with mean and variance specified in Equations (1) and (2) in Section 2.5. Imported case counts were drawn from a discrete uniform distribution with a lower bound of 0 and an upper bound taken from a prespecified set of values. The constant risk of transmission per imported case into the community, $\epsilon$, was fixed at 0.01 and the inflation factor for the variance, $k$, was taken in the range 3–6. Utilising the simulated incidence curves, we estimated the $R_t$s with EpiRegress over a window of 180 days (i.e., the likelihood function for estimating the $R_t$s was the product of the negative binomial probabilities of the local case counts on those days) and compared them with the ‘real’ values by calculating the mean absolute errors (MAE) and mean absolute percentage errors (MAPE) as follows:

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left|\hat{R}_t - R_t^{\mathrm{sim}}\right|$$

and

$$\mathrm{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \frac{\left|\hat{R}_t - R_t^{\mathrm{sim}}\right|}{R_t^{\mathrm{sim}}},$$

where $T$ is the number of days in the estimation window, $\hat{R}_t$ is the posterior median, and $R_t^{\mathrm{sim}}$ is the simulated $R_t$ for day $t$ in the original dataset. We also calculated the successful coverage rate (SCR), the proportion of the 180 days on which the simulated $R_t$ values, $R_t^{\mathrm{sim}}$, fell within the 95% credible intervals (CrIs) of the estimated $R_t$s.
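The three accuracy measures can be computed as in the following Python sketch. Function names are hypothetical, and the MAPE is returned as a proportion rather than multiplied by 100, which is an assumption about the reporting scale.

```python
def mae(estimates, truth):
    """Mean absolute error between point estimates and simulated values."""
    return sum(abs(e - t) for e, t in zip(estimates, truth)) / len(truth)

def mape(estimates, truth):
    """Mean absolute percentage error, as a proportion (truth must be nonzero)."""
    return sum(abs(e - t) / t for e, t in zip(estimates, truth)) / len(truth)

def coverage_rate(lower, upper, truth):
    """Share of days on which the true value lies inside [lower, upper]."""
    hits = sum(1 for lo, hi, t in zip(lower, upper, truth) if lo <= t <= hi)
    return hits / len(truth)

# Made-up posterior medians, interval bounds, and simulated values for two days.
posterior_median = [1.0, 2.0]
simulated = [0.5, 2.5]
error_abs = mae(posterior_median, simulated)
error_pct = mape(posterior_median, simulated)
scr = coverage_rate([0.0, 2.0], [1.0, 3.0], [0.5, 3.5])
```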
Using the same incidence curves, we additionally performed estimations with EpiEstim, EpiFilter, and EpiInvert for comparison purposes, but for EpiInvert we only calculated MAEs and MAPEs of the point estimates, as the relationship between case counts and $R_t$ which it uses in the renewal equation is the same as that for simulating the case counts.
2.8. Prediction of Case Counts
If the values of the covariates on day $t$, $X_t$, are available, EpiRegress enables us to estimate the number of local cases on day $t$, $L_t$. This is done in two steps. First, we obtain the posterior predictive distribution of $R_t$ by doing MCMC simulations with a shifting window of 90 days, i.e., using data over the time interval $[t - 90 - \tau, t - 1]$ to calculate the likelihoods for the case counts from day $t - 90$ to day $t - 1$, where $\tau$ is the maximum length of the serial interval aforementioned. Second, since $L_t$ follows a negative binomial distribution with mean and variance as in (1) and (2), we obtain the posterior predictive distribution of $L_t$ by performing Monte Carlo sampling. Note that past data, rather than past simulations, are used to generate these samples, so the results presented represent nowcasting accuracy rather than long-term predictions, which would in any case require future covariates to be predicted.
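The Monte Carlo step can be sketched in Python as follows. For each posterior draw of the expected case count and the variance inflation factor, one case count is sampled from a negative binomial distribution, represented here as a gamma-Poisson mixture. The function names are hypothetical, the Poisson sampler is a simple method suited only to small means, and the inputs are made up; this is an illustration rather than the authors' R and C++ implementation.

```python
import math
import random

def sample_poisson(rng, lam):
    """Knuth's multiplication method; suitable for small means only."""
    threshold, count, product = math.exp(-lam), 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_nb(rng, mean, k):
    """Negative binomial draw with variance k * mean (k > 1).

    Uses a gamma-Poisson mixture: the Poisson rate is Gamma distributed
    with shape mean / (k - 1) and scale k - 1.
    """
    if mean <= 0:
        return 0
    rate = rng.gammavariate(mean / (k - 1.0), k - 1.0)
    return sample_poisson(rng, rate)

def posterior_predictive(mean_draws, k_draws, seed=1):
    """One predicted case count per posterior draw of (mean, k)."""
    rng = random.Random(seed)
    return [sample_nb(rng, m, k) for m, k in zip(mean_draws, k_draws)]

# Made-up posterior draws: expected count 5 and inflation factor 3 throughout.
predicted = posterior_predictive([5.0] * 20000, [3.0] * 20000)
```

Quantiles of `predicted` would then give a predictive interval for the day's local case count.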
Analyses were conducted in R [
28] and C++.
4. Discussion
Over the course of the COVID-19 pandemic, the time-varying reproduction number, $R_t$, has received considerable public attention as a metric of the waxing and waning of the epidemic trajectory. Examples include Wuhan, China [4] and diverse European countries [29]. When daily case counts are large, standard methods to estimate $R_t$ are successful, though they may face data challenges with complications such as day-of-the-week effects [30]. This problem is mitigated by EpiInvert with its signal processing approach [7]. However, in places and times where disease transmission is low, small numbers of cases either make the methods fail (for example, EpiInvert might even give negative $R_t$ estimates on some occasions) or make $R_t$ hard to estimate with high precision, making it difficult to assess whether intervention measures in place in the community to mitigate disease spread are unduly strict. Many of the countries and territories with long periods of successful mitigation, particularly in Asia [31,32], faced this issue during the first year and a half of the pandemic, complicating decision making. A fundamental issue is the inverse correlation between the reproduction numbers at successive time points (future cases being explainable if $R_t$ is high and $R_{t+1}$ low, or vice versa, or anywhere in between), while the target of inference is the marginal distribution of these quantities. Other approaches have tackled this by smoothing to share information between nearby time points, such as EpiFilter [6,9], which utilises Bayesian recursive filters to good effect by introducing a Gaussian relationship between neighbouring $R_t$s.
This study proposes an alternative means of pooling information across time, by linking the estimation of the ensemble {$R_t$} to time-varying covariates whose effect may potentially be preserved across prolonged periods of the epidemic, and which have a plausible relationship, at least a correlative one, with transmission rates. This approach performed no worse than three other prominent methods, EpiEstim [3,8], EpiFilter [6,9], and EpiInvert [7], and in some situations compared favourably, particularly when incidence was dominated by imported cases and quarantine measures for returnees had taken effect. Furthermore, the smoothing techniques in the existing methods, though successfully reducing uncertainty in low incidence scenarios, tend to keep the estimates far below unity and make it impossible to respond to sudden changes in transmissibility, which might not be reflected in the number of reported cases, especially if daily case counts are too small. Therefore, the relatively large uncertainty in $R_t$ estimates by EpiRegress, compared to that of EpiEstim or EpiFilter, might not necessarily be a drawback, as was demonstrated in the simulation results. The approach additionally lends itself well to nowcasting the effective reproduction number when future cases due to the current cases are yet to emerge but the time-varying covariates can be measured in near real-time. This is not the case for some important variables we considered, such as mobility data, which were made public only after a lag [33], but other data streams without this restriction may be available to governments with modern surveillance systems, such as the Republic of Korea, which deployed big data capture to good effect from an early stage of the COVID-19 pandemic [32,34]. Such nowcasting of $R_t$ would help policy makers respond rapidly to any upsurge in risks.
Naively, we might hope that the inclusion of covariates that are related to $R_t$ through a regression framework would permit inferences on the key factors associated with growing transmissibility, as Beest et al. did when they estimated the impacts of several influenza-related factors [35]. Such inference is, unfortunately, prevented by the high collinearity between the various mobility, epidemiological, and policy covariates involved in EpiRegress, which frequently move in tandem as multiple policy or behavioural changes occur concurrently. As a result, it is unlikely that the impact of specific policies can be obtained through our approach (though an approach akin to a meta-analysis over many countries might permit such associations to be derived [36]). While this may be seen as a weakness, it also points to the robustness of the methodology to model misspecification, for even if more distal covariates are included instead of those more proximately related to transmission, the inferred $R_t$, the key estimand of interest, is little changed.
Another advantage of our approach is that it explicitly distinguishes imported cases from autochthonous ones. The effect of the COVID-19 pandemic on international, and in some cases intranational, travel has been unprecedented [37,38]. Differences in quarantine policies between polities have led to marked variability in the importance of imported cases to the local epidemic, with countries such as China, New Zealand, and Singapore operating very successful quarantine systems [2,39]. With little local infection and a quarantine system that leads to little leakage, imported cases contribute little to secondary spread, but counts of imported cases are often not differentiated from autochthonous ones in international databases. As a result, estimates of $R_t$ that do not distinguish these case types give a misleading depiction of the effectiveness of control in the country receiving infected international travellers. Future efforts to standardise data reporting should seek to explicitly distinguish these two groups for this reason. In our analysis, we found that assuming the same transmission potential for imported and local cases, or excluding the former altogether, led to noticeable differences in the estimates from existing methods; neither approach was necessary in our framework, however.
Limitations of this study include the assumption that the serial interval distribution is the same for all the regions explored. Differences in reporting times between linked cases in Singaporean households may be shorter than those between linked cases not sharing the same living space, causing an underestimate of $R_t$. The diverse nonpharmaceutical interventions taken by different regions, or by one region at different time periods, may also cause fluctuations in the serial interval distribution [40]. The $R_t$ estimates under EpiRegress displayed a weekly cycle which we attribute to the inclusion of Google mobility data, which is an important factor in the estimates, but these weekend dips may not truly reflect changes in risk [41]. Furthermore, to get better estimates of the $R_t$s for each region, we preferred not to exclude any of the variables listed in either the mobility data or the policy data, despite the existence of collinearities [42]. As aforementioned, this did not deleteriously affect estimates of $R_t$ but did prohibit us from assessing their individual impacts on $R_t$. Lastly, we also assumed a constant under-reporting rate, which may be dependent on testing practices [3], and that the lag between daily case counts and response to interventions was negligible. We explored the latter in the Supplementary Information, which suggests that the estimates of $R_t$ were robust to this in all four regions (Table S9).
Despite these limitations, we believe that extending methods for estimating the effective reproduction number to account for time-varying covariates that are plausibly linked to transmission potential, as in our framework, provides a useful addition to our analytic armamentarium for future outbreaks. It will be particularly valuable for places and times when outbreaks are smaller, in small countries or subnational regions, or when mitigation measures remain effective. Although we applied it to COVID-19, it will be applicable to other infectious diseases causing explosive outbreaks, when data on both cases and exogenous factors are available.