**4. The Bi-Skew Logistic Distribution for Modelling Epidemic Waves**

We start by defining the bi-skew logistic distribution, which will enable us to model more than one wave of infections at a time. We then discuss how we partition the data into single waves, in a way that we can apply the maximum likelihood from the previous section to the data in a consistent manner.

We present the *bi-skew logistic distribution*, which is described by the sum,

$$f(\mathbf{x}; \lambda\_1, \mu\_1, s\_1) + f(\mathbf{x}; \lambda\_2, \mu\_2, s\_2),$$

of two skew logistic distributions. It is given in full as:

$$\frac{\exp\left(-\lambda\_1 \frac{x-\mu\_1}{s\_1}\right)}{s\_1 \left(1 + \exp\left(-\frac{x-\mu\_1}{s\_1}\right)\right)^2} + \frac{\exp\left(-\lambda\_2 \frac{x-\mu\_2}{s\_2}\right)}{s\_2 \left(1 + \exp\left(-\frac{x-\mu\_2}{s\_2}\right)\right)^2},\tag{14}$$

which characterises two distinct phases of logistic growth (c.f. [19,32]). We note that (14) can be readily extended to the general case of the sum of multiple skew logistic distributions; however, for simplicity, we only present the formula for the bi-skew logistic case. Thus, while the (single) skew logistic distribution can only model one wave of infected cases (or deaths, or hospitalisations), the bi-skew logistic distribution can model two waves of infections, and in the general cases, any number of waves.

In the presence of two waves, the maximum likelihood solution to (14), would give us access to the necessary model parameters, and solving the general case in the presence of multiple waves, when the sum in (14) may have two or more skew logistic distributions, is evidently even more challenging. Thus, we simplify the solution for the multiple wave case, and concentrate on an approximation assuming a sequential time series when one wave strictly follows the next. More specifically, we assume that each wave is modelled by a single skewed logistic distribution describing the growth phase until a peak is reached, followed by a decline phase; see [33] who consider epidemic waves in the context of the standard logistic distribution. Thus, a wave is represented by a temporal pattern of growth and decline, and the time series as a whole describes several waves as they evolve.

To provide further clarification of the model, we mention that the skew-bi logistic distribution is *not* a *mixture model* per se, in which case there is a mixture weight for each distribution in the sum, as in, say, a Gaussian mixture [34] (Chapter 9). In the bi-skew logistic distribution case we do not have mixture weights, rather, we have two phases in our context waves, which are sequential in nature, possibly with some overlap, as can be seen in Figure 1 (c.f. [19,32]). Strictly speaking, the bi-skew logistic distribution can be viewed as a mixture model where the mixture weights are each 0.5 and a scaling factor of 2 is applied. Thus, as an approximation, we add a preprocessing step where we segment the time series into distinct waves, resulting in a considerable reduction to the complexity of the maximum likelihood estimation. We do, however, remark that the maximum likelihood estimation for the bi-skew logistic distribution is much simpler than that of a corresponding mixture model, due to the absence of mixture weights. In particular, although we could, in principle, make use of the EM (expectation-maximisation) algorithm [34] (Chapter 9) and [35] to approximate the maximum likelihood estimates of the parameters, this would not be strictly necessary in the bi-skew logistic case, cf. [36]. The only caveat, which holds independently of whether the EM algorithm is deployed or not, is the additional number of parameters present in the equations being solved. We leave this investigation as future work, and focus on our approximation, which does not require the solution to the maximum likelihood of (14); the details of the preprocessing heuristic we apply are given in the following section.

#### **5. Data Analysis of COVID-19 Deaths in the UK**

Here, we provide a full analysis of COVID-19 deaths in the UK from 30 January 2020 to 30 July 2021, employing the E*SJS* goodness-of-fit statistic and comparing it to the *KS*2 statistic. The daily UK COVID-19 data we used was obtained from [37].

As a proof of concept of the modelling capability of the skew logistic distribution, we now provide a detailed analysis of the time series of COVID-19 deaths in the UK from 30 January 2020 to 30 July 2021.

To separate the waves, we first smoothed the raw data using a moving average with a centred sliding window of 7 days. We then applied a simple heuristic, where we identified all the minima in the time series and defined a wave as a consecutive portion of the time, of at least 72 days, with the endpoints of each wave being local minima apart from the first wave, which starts from day 0. The resulting four waves in the time series are shown in Figure 1; see last column of Table 1 for the endpoints of the four waves. It would be worthwhile, as future work, to investigate other heuristics, which may, for example, allow overlap between the waves to obtain more accurate start and end points and to distribute the number of cases between the waves when there is overlap between them.

In Table 1, we show the parameters resulting from maximum likelihood fits of the skew logistic distribution to the four waves. Figure 2 shows histograms of the four COVID-19 waves, each overlaid with the curve of the maximum likelihood fit of the skew logistic distribution to the data. Pearson's moment and median skewness coefficients [38] for the four waves are recorded in Table 2. It can be seen that the correlation between these and 1 − *λ* is close to 1, as we would expect.

**Figure 1.** Reported daily COVID-19 deaths from 30 January 2020 to 30 July 2021 and their minima labelled '\*', resulting in four distinct waves; a moving average with a centred sliding window of 7 days was applied to the raw data.

**Table 1.** Parameters from maximum likelihood fits of the skew logistic distribution to the four waves, and the day of the local minimum (End), which is the end point of the wave.


**Table 2.** Pearson's moment and median skewness coefficients for the four waves, and the correlation between 1 − *λ* and these coefficients.


**Figure 2.** Histograms for the four waves of COVID-19 deaths from 30 January 2020 to 30 July 2021, each overlaid with the curve of the maximum likelihood fit of the skew logistic distribution to the data.

We now turn to the evaluation of goodness-of-fit using the E*SJS* (empirical survival Jensen–Shannon divergence) [21,22], which generalises the Jensen–Shannon divergence [39] to survival functions, and the well-known *KS*2 (Kolmogorov–Smirnov two-sample test statistic) [23] (Section 6.3). We will also employ 95% bootstrap confidence intervals [25] to measure the improvement in the E*SJS* and *KS*2, goodness-of-fit measures, of the skewlogistic over the logistic and normal distributions, respectively. For completeness, we formally define the E*SJS* and *KS*2.

To set the scene, we assume a time series [40], **x** = {*x*1, *x*2, ... , *xn*}, where *xt*, for *t* = 1, 2, ... , *n* is a value indexed by time, *t*, in our case modelling the number of daily COVID-19 deaths. We are, in particular, interested in the marginal distribution of **x**, which we suppose comes from an underlying parametric continuous distribution *D*.

The *empirical survival function* of a value *z* for the time series **x**, denoted by *S*4(**x**)[*z*], is given by:

$$\hat{S}(\mathbf{x})[z] = \frac{1}{n} \sum\_{i=1}^{n} I\_{\{x\_i > z\}}\tag{15}$$

where *I* is the indicator function. In the following, we will let *P*4(*z*) = *S*4(**x**)[*z*] stand for the empirical survival function *S*4(**x**)[*z*], where the time series **x** is assumed to be understood from context. We will generally be interested in the empirical survival function *P*4, which we suppose arises from the survival function *P* of the parametric continuous distribution *D*, mentioned above.

The *empirical survival Jensen–Shannon divergence* (E*SJS*) between two empirical survival functions, *Q*4<sup>1</sup> and *Q*4<sup>2</sup> arising from the survival functions *Q*<sup>1</sup> and *Q*2, is given by:

$$\mathcal{E}S\{\hat{Q}\_1, \hat{Q}\_2\} = \frac{1}{2} \int\_0^\infty \hat{Q}\_1(z) \log\left(\frac{\hat{Q}\_1(z)}{\hat{M}(z)}\right) + \hat{Q}\_2(z) \log\left(\frac{\hat{Q}\_2(z)}{\hat{M}(z)}\right) dz,\tag{16}$$

where:

$$
\hat{M}(z) = \frac{1}{2} \left( \hat{Q}\_1(z) + \hat{Q}\_2(z) \right).
$$

We note that the E*SJS* is bounded and can thus be normalised, so it is natural to assume its values are between 0 and 1; in particular, when *Q*4<sup>1</sup> = *Q*4<sup>2</sup> its value is zero. Moreover, its square root is a metric [41], cf. [21].

The *Kolmogorov–Smirnov* two-sample test statistic between *Q*4<sup>1</sup> and *Q*4<sup>2</sup> as above, is given by:

$$\text{KS2}(\hat{Q}\_1, \hat{Q}\_2) = \max\_z |\hat{Q}\_1(z) - \hat{Q}\_2(z)|\,\tag{17}$$

where *max* is the maximum function, and |*v*| is the absolute value of a number *v*. We note that *KS*2 is bounded between 0 and 1, and is also a metric.

For a parametric continuous distribution *D*, we let *φ* = *φ*(*D*, *P*4) be the parameters that are obtained from fitting *D* to the empirical survival function, *P*4, using maximum likelihood estimation. In addition, we let *P<sup>φ</sup>* = *Sφ*(**x**) be the survival function of **x**, for *D* with parameters *φ*. Thus, the empirical survival Jensen–Shannon divergence and the Kolmogorov–Smirnov two-sample test statistic, between *P*4 and *Pφ*, are given by E*SJS*(*P*4, *Pφ*) and *KS*2(*P*4, *Pφ*), respectively, where *P*4 and *P<sup>φ</sup>* are omitted below as they will be understood from context. These values provide us with two measures of goodness-of-fit for how well *D* with parameters *φ* is fitted to **x** [22].

We are now ready to present the results of the evaluation. In Table 3, we show the E*SJS* values for the four waves and the said improvements, while in Table 4, we show the corresponding *KS*2 values and improvements. In all cases, the skew logistic is a preferred model over both the logistic and normal distributions, justifying the addition of a skewness parameter as can be see in Figure 2. Moreover, in all but one case the logistic distribution was preferred over the normal distribution—wave 3, where the *KS*2 statistic of the normal distribution was smaller than that of the logistic distribution. We observe that, for the second wave, the E*SJS* and *KS*2 values for the skew logistic and logistic distribution were the closest, since, as can be seen from Table 1, the second wave was more or less symmetric, in which case the skew logistic distribution reduces to the logistic distribution.

**Table 3.** E*SJS* values for the skew logistic (SL), logistic (Logit) and normal (Norm) distributions, and the improvement percentage of the skew logistic over the logistic (SL-Logit) and normal (SL-Norm) distributions, respectively.


In Tables 5 and 6, we present the bootstrap 95% confidence intervals of the E*SJS* and *KS*2 improvements, respectively, using the *percentile* method, while in Tables 7 and 8, we provide the 95% confidence intervals of the E*SJS* and *KS*2 improvements, respectively, using the *bias-corrected and accelerated* (BCa) method [25], which adjusts the confidence intervals for bias and skewness in the empirical bootstrap distribution. In all cases, the mean of the bootstrap samples is above zero with a very tight standard deviation. As noted

above, the second wave is more or less symmetric, so we expect that the standard logistic distribution will provide a fit to the data, which is as good as the skew logistic fit. It is thus not surprising that in this case the improvement percentages are, generally, not significant. In addition, the improvements for the third wave are also, generally, not significant, which may be due to the starting point of the third wave, given our heuristic, being close to its peak; see Figure 1. We observe that, for this dataset, it is not clear whether deploying the BCa method yields a significant advantage over simply deploying the percentile method.

**Table 4.** *KS*2 values for the skew logistic (SL), logistic (Logit) and normal (Norm) distributions, and the improvement percentage of the skew logistic over the logistic (SL-Logit) and normal (SL-Norm) distributions, respectively.


In Table 9, we show the mean and standard deviation statistics of the confidence interval widths, of the metrics we used to compare the distributions, implying that, in general, the E*SJS* goodness-of-fit measure is more powerful than the *KS*2 goodness-of-fit measure. This is based on the known result that statistical tests using measures resulting in smaller confidence intervals are normally considered to be more powerful, implying that a smaller sample size may be deployed [42].

**Table 5.** Results from the percentile method for the confidence interval of the difference of the E*SJS* between the logistic (Logit) and skew logistic (SL), and between the normal (Norm) and skew logistic (SL) distributions, respectively; Diff, LB, UB, CI, Mean and STD stand for difference, lower bound, upper bound, confidence interval, mean of samples and standard deviation of samples, respectively.


As mentioned in the introduction, we obtained comparable results to the above when modelling epidemic waves with the epsilon skew normal distribution [7] as opposed to using the skew logistic distribution; see also [43] for a comparison of a skew logistic and skew normal distribution in the context of insurance loss data, showing that the skew logistic performed better than the skew normal distribution for fitting the datasets tested. Further to the note in the introduction that the skew logistic distribution is a more natural one to deploy in this case due to its heavier tails, we observe that in an epidemic scenario, the number of cases counted can only be non-negative, while the epsilon skew normal also supports negative values.


**Table 6.** Results from the percentile method for the confidence interval of the difference of the *KS*2 between the logistic (Logit) and skew logistic (SL), and between the normal (Norm) and skew logistic (SL) distributions, respectively; Diff, LB, UB, CI, Mean and STD stand for difference, lower bound, upper bound, confidence interval, mean of samples and standard deviation of samples, respectively.

**Table 7.** Results from the BCa method for the confidence interval of the difference of the E*SJS* between the logistic (Logit) and skew logistic (SL), and between the normal (Norm) and skew logistic (SL) distributions, respectively; Diff, LB, UB, CI, Mean and STD stand for difference, lower bound, upper bound, confidence interval, mean of samples and standard deviation of samples, respectively.


**Table 8.** Results from the BCa method for the confidence interval of the difference of the *KS*2 between the logistic (Logit) and skew logistic (SL), and between the normal (Norm) and skew logistic (SL) distributions, respectively; Diff, LB, UB, CI, Mean and STD stand for difference, lower bound, upper bound, confidence interval, mean of samples and standard deviation of samples, respectively.



**Table 9.** Mean and standard deviation (STD) statistics for the confidence interval (CI) widths using the percentile (P) and BCa methods.

#### **6. Concluding Remarks**

We have proposed the skew-logistic and bi-logistic distributions as models for single and multiple epidemic waves, respectively. The model is a simple extension of the symmetric logistic distribution, which can readily be deployed in the presence of skewed data that exhibits growth and decay. We provided validation for the proposed model using the E*SJS* as a goodness-of-fit statistic, showing that it is a good fit to COVID-19 data in UK and more powerful than the alternative *KS*2 statistic. As future work, we could use the model to compare the progression of multiple waves across different countries, extending the work of [16].

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analysed in this study. This data can be found here: https://coronavirus.data.gov.uk/details/download, accessed on 20 March 2022.

**Conflicts of Interest:** The author declares no conflicts of interest.
