1. Introduction
The SIR model describes the dynamics of an epidemic. Kermack and McKendrick introduced this in their pioneering work [1]. In this model, we consider a homogeneous population. Thus, no age structures or group behaviours are taken into account. The population is divided into three homogeneous sections: susceptible people $S$, infectious people $I$ and recovered people $R$. Since $S$, $I$ and $R$ change in time $t$, we represent these numbers as functions $S(t)$, $I(t)$ and $R(t)$. Births and deaths are ignored, considering a constant population size $N$ within time $t$, i.e., $S(t) + I(t) + R(t) = N$.
We proceed to set $s = S/N$, $i = I/N$ and $r = R/N$, i.e., the corresponding rates.
The following system of ordinary differential equations describes the SIR model [2,3,4]:

$$ s'(t) = -\beta(t)\, s(t)\, i(t), \quad i'(t) = \beta(t)\, s(t)\, i(t) - \gamma(t)\, i(t), \quad r'(t) = \gamma(t)\, i(t). \qquad (1) $$

The function $\beta(t)$ is the transmission rate with respect to time, while $\gamma(t)$ is the rate of recoveries. Ideally, $\beta$ and $\gamma$ are constants. Notice that $s(t) + i(t) + r(t) = 1$ and, consequently, $s'(t) + i'(t) + r'(t) = 0$. A very important quantity is the reproduction number, given as

$$ R_0 = \frac{\beta}{\gamma}. $$
Ordinary Differential Equations (1) can be solved numerically using Runge–Kutta pairs [5,6,7]. One of the main difficulties arising is the estimation of the parameters $\beta$, $\gamma$ and, consequently, of $R_0$.
There is an ongoing interest in SIR-type models. Bertrand and Pirch [8] presented a least-squares finite element method for the SEIQRD model. Cherniha and Davydovych [9] studied a nonlinear model based on logistic equations, while Keller et al. [10] simulated the spread of an infectious disease across a heterogeneous and continuous landscape. E. Kuhl [11] focused on modeling various stages of the COVID-19 outbreak and Viguerie et al. [12] simulated COVID-19 via an SEIRD model.
From (1) we deduce

$$ \beta(t) = -\frac{s'(t)}{s(t)\, i(t)}, \qquad \gamma(t) = \frac{r'(t)}{i(t)}, $$

and, consequently, we arrive at

$$ R_0 = \frac{\beta}{\gamma} = -\frac{1}{s}\cdot\frac{s'(t)}{r'(t)} = -\frac{1}{s}\cdot\frac{ds}{dr}. \qquad (2) $$

Thus, our concern here is actually a reliable estimation of $ds/dr$.
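As an illustration (a sketch of our own, not part of the original derivation), identity (2) can be checked numerically in MATLAB: we integrate (1) with arbitrary constant rates and compare $-(1/s)\,ds/dr$, evaluated by crude chord (finite difference) slopes along the trajectory, with the known ratio $\beta/\gamma$.
>> beta=0.1; gamma=0.05;                          % arbitrary constant rates, so beta/gamma=2
>> fcn=@(t,x)[-beta*x(1)*x(2);                    % right-hand side of (1): [s'; i'; r']
beta*x(1)*x(2)-gamma*x(2);
gamma*x(2)];
>> opts=odeset('RelTol',1e-8,'AbsTol',1e-10);     % tight tolerances, so only the chord error remains
>> [~,xout]=ode45(fcn,(0:1:100)',[0.9999 0.0001 0]',opts);
>> s=xout(:,1); r=xout(:,3);                      % susceptible and recovered fractions
>> R0est=-(diff(s)./diff(r))./s(2:end);           % formula (2) with chord approximations of ds/dr
>> max(abs(R0est-beta/gamma))                     % small (order 1e-3), limited by the first-order chord slope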
2. Interpolating the Past
We proceed to set $t_0$ as the current time, while $s_0$ and $r_0$ are the currently observed fractions of susceptible and recovered. Consequently, $t_{-1}$ corresponds to the previous time step, usually the previous day, i.e., $t_{-1} = t_0 - 1$, and all $t$ are considered integers. Past values of $s$, $r$ and the population $N$ are usually known, and we seek an approximation of $R_0$ through (2). This requires the estimation of the derivative $ds/dr$ at the right (current) point $r_0$. In the following, it is convenient to consider $s$ as a function of the increasing variable $r$. We have various choices.
2.1. Backward Finite Differences
The celebrated Taylor series [13] states that

$$ s(r_0 - h) = s(r_0) - h\, s'(r_0) + \frac{h^2}{2}\, s''(r_0) - \frac{h^3}{6}\, s'''(r_0) + \cdots. \qquad (3) $$

We are given various values of $s$ at distinct points $r_0 > r_{-1} > r_{-2} > \cdots$, and name them for simplicity $s_0 = s(r_0)$, $s_{-1} = s(r_{-1})$, etc. Backward finite differences answer the question of approximating the derivatives of $s$ at $r_0$. Thus, we may derive

$$ s'(r_0) \approx \frac{s_0 - s_{-1}}{r_0 - r_{-1}}, \qquad (4) $$

which is easily verified from (3) using its implicit Euler variant $s_{-1} \approx s_0 - h\, s'(r_0)$ with $h = r_0 - r_{-1}$. Approximation (4) is said to be of first order of accuracy, since no higher orders of $h$ are involved in using (3), e.g., $\frac{h^2}{2}\, s''(r_0)$, etc. After substituting $s$ in (2) with the mean value $\frac{1}{2}(s_0 + s_{-1})$, we obtain the formula

$$ R_0 \approx -\frac{2}{s_0 + s_{-1}}\cdot\frac{s_0 - s_{-1}}{r_0 - r_{-1}}. \qquad (5) $$
This latter estimation (5) is of limited value, since it is based on data from two days only and fluctuates intensively. Thus, we may proceed to higher-order differences. A second-order backward finite difference approximation of $s'(r_0)$ (the derivative at $r_0$ of the quadratic through $(r_{-2},s_{-2})$, $(r_{-1},s_{-1})$ and $(r_0,s_0)$) produces the formula

$$ R_0 \approx -\frac{3}{s_0 + s_{-1} + s_{-2}} \left( s_{-2}\,\frac{r_0 - r_{-1}}{(r_{-2}-r_{-1})(r_{-2}-r_0)} + s_{-1}\,\frac{r_0 - r_{-2}}{(r_{-1}-r_{-2})(r_{-1}-r_0)} + s_0\,\frac{2r_0 - r_{-1} - r_{-2}}{(r_0-r_{-1})(r_0-r_{-2})} \right). \qquad (6) $$

As we worked in (5), we substituted $s$ in (2) with the mean value of $s_0$, $s_{-1}$ and $s_{-2}$.
We observe that $s$ is clearly a decreasing variable. Then, $ds/dr \le 0$ and $R_0 \ge 0$ must hold. In (5), this property is preserved. However, this is not true for (6). Let us check this using a small example. Take three artificial data pairs

$$ (r_{-2}, s_{-2}),\ (r_{-1}, s_{-1}),\ (r_0, s_0), \qquad r_{-2} < r_{-1} < r_0, \quad s_{-2} > s_{-1} > s_0, \qquad (7) $$

in which the decrease of $s$ over the last sub-interval is much smaller than over the previous one, to verify that, according to (6), we get $R_0 < 0$!
This happens because second- and higher-order backward finite differences do not preserve monotonicity [14]. Dealing with higher-order methods is, therefore, of no meaning.
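The effect is easy to reproduce in MATLAB. The data below are hypothetical (they are not the data (7) of the text), but they share the crucial feature of a monotone decrease that slows down abruptly: the first-order formula (5) keeps the correct sign, while the second-order estimate of $s'(r_0)$ turns positive and drives $R_0$ negative.
>> r=[0.10 0.11 0.12]; s=[0.90 0.85 0.849];   % hypothetical decreasing data (not the data (7))
>> d1=(s(3)-s(2))/(r(3)-r(2));                % first-order backward difference (4)
>> R0_1=-2*d1/(s(2)+s(3))                     % formula (5): positive, as expected
>> d2=s(1)*(r(3)-r(2))/((r(1)-r(2))*(r(1)-r(3))) ...
+s(2)*(r(3)-r(1))/((r(2)-r(1))*(r(2)-r(3))) ...
+s(3)*(2*r(3)-r(1)-r(2))/((r(3)-r(1))*(r(3)-r(2)));   % second-order backward difference
>> R0_2=-3*d2/(s(1)+s(2)+s(3))                % negative: monotonicity is not preserved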
Notice that, using (5), we obtain an estimation of $R_0$ for the last two days (i.e., based on the $r$-interval $[r_{-1}, r_0]$) that differs considerably from the previous day's estimation (the one based on $[r_{-2}, r_{-1}]$). This oscillation is also unacceptable.
2.2. Cubic Splines
Cubic splines [15] are a very interesting tool that can be used in various fields [16,17]. The interval $[r_{-n}, r_0]$ is divided into $n$ subintervals $[r_{-n}, r_{-n+1}]$, $[r_{-n+1}, r_{-n+2}]$, $\ldots$, $[r_{-1}, r_0]$. Using cubic splines, we can obtain $n$ polynomials of third degree, with each one active in its own sub-interval. Concentrating again on the rightmost interval $[r_{-1}, r_0]$, we may obtain the corresponding cubic and its derivative at $r_0$ after using the software found in [18]. The data (7) produce a negative estimate of $R_0$, which is unacceptable. For this dataset, we can obtain two polynomials of the third degree. In the interval $[r_{-1}, r_0]$, the polynomial approximation of $s$ has a positive derivative at the rightmost point. Through this counterexample, we deduce that cubic splines do not preserve monotonicity either.
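A quick MATLAB check (our own sketch, again with hypothetical monotone data rather than the data (7) of the text) exposes this behaviour: the not-a-knot cubic spline through decreasing data can have a positive derivative at the right endpoint.
>> r=[0.09 0.10 0.11 0.12]; s=[0.93 0.90 0.85 0.849];   % hypothetical decreasing data
>> pp=spline(r,s);                                      % not-a-knot cubic spline of s over r
>> [brk,cfs]=unmkpp(pp);
>> dpp=mkpp(brk,cfs(:,1:3)*diag([3 2 1]));              % piecewise polynomial of the spline derivative
>> ppval(dpp,r(end))                                    % positive here, so (2) would yield R0<0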
2.3. Piecewise Cubic Hermite Interpolant
An alternative to cubic splines is Piecewise Cubic Hermite interpolation (PCHIP). In cubic splines, the coincidence of higher derivatives at the nodes is demanded. In PCHIP, this is abandoned in order to achieve monotone cubic polynomials for monotone data. The issue is rather complicated to explain here. More details can be found in [19] or in the MATLAB function pchip.
For the data (7), we can obtain the polynomial (8) active in the interval $[r_{-1}, r_0]$, whose derivative at the rightmost point equals zero. Monotonicity is marginally preserved, but then $R_0 = 0$ in this case. Even if we add more points from the past, we will not change the situation. The method always tries to produce a monotone polynomial of the third degree. Thus, it will make very small changes to (8), regardless of how many points we add from the past.
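The corresponding check in MATLAB (our sketch, with the same hypothetical data as in the spline example above) shows the typical outcome: pchip adjusts the endpoint slope to keep the interpolant shape-preserving, and here it becomes zero, so the estimate of $R_0$ collapses to zero.
>> r=[0.09 0.10 0.11 0.12]; s=[0.93 0.90 0.85 0.849];   % same hypothetical decreasing data
>> pp=pchip(r,s);                                       % shape-preserving piecewise cubic
>> [brk,cfs]=unmkpp(pp);
>> dpp=mkpp(brk,cfs(:,1:3)*diag([3 2 1]));              % derivative of the interpolant
>> ppval(dpp,r(end))                                    % zero here: (2) gives R0=0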
All of the above three types of approximation (Sections 2.1–2.3) may attain higher algebraic orders, i.e., second order or higher. This means that we may obtain better accuracy in the approximation of $ds/dr$. However, they actually fail, since they do not preserve monotonicity, which is an inherent property of the function $s(r)$ and is also present in the corresponding data. Failure to preserve monotonicity is catastrophic: we may sometimes experience $R_0 < 0$ then.
2.4. Linear Least Squares
Finally, we propose estimating $ds/dr$ at $r_0$ by the slope of the linear least squares approximation of the data in $[r_{-n}, r_0]$. Then, we get

$$ \left.\frac{ds}{dr}\right|_{r=r_0} \approx \frac{(n+1)\sum_{j=-n}^{0} r_j s_j - \sum_{j=-n}^{0} r_j \sum_{j=-n}^{0} s_j}{(n+1)\sum_{j=-n}^{0} r_j^2 - \left(\sum_{j=-n}^{0} r_j\right)^2}. \qquad (9) $$

The denominator of (9) is always positive, since $r$ is ascending.
For the special case of using the data in the interval $[r_{-2}, r_0]$, we have

$$ \left.\frac{ds}{dr}\right|_{r=r_0} \approx \frac{3\,(r_{-2}s_{-2} + r_{-1}s_{-1} + r_0 s_0) - (r_{-2}+r_{-1}+r_0)(s_{-2}+s_{-1}+s_0)}{3\,(r_{-2}^2 + r_{-1}^2 + r_0^2) - (r_{-2}+r_{-1}+r_0)^2}. \qquad (10) $$

After using (10) in (2), with the mean value of $s_{-2}$, $s_{-1}$ and $s_0$, for the data (7), we arrive at an acceptable non-negative value of $R_0$.
It seems that (9) furnishes a balanced result in view of the corresponding results found by (5). The reason for the final choice of this method is the preservation of monotonicity, which helps to avoid unpleasant outcomes such as $R_0 < 0$. This is true, since the slope of a least squares line applied to decaying data is necessarily negative. The latter does not always happen in cases where, e.g., a parabola passes through these points.
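In MATLAB, the slope (9) is simply the leading coefficient returned by polyfit with degree one. A minimal sketch (our own, reusing the hypothetical data from the previous subsections, with the mean value of the $s$-data placed in (2)):
>> r=[0.09 0.10 0.11 0.12]; s=[0.93 0.90 0.85 0.849];   % hypothetical decreasing data
>> p=polyfit(r,s,1);                                    % least squares line; p(1) is the slope (9)
>> R0ls=-p(1)/mean(s)                                   % formula (2) with the mean value of s: positive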
The question that arises now is the size of the data used, i.e., the width of the interval $[r_{-n}, r_0]$. A strategy to address this begins with the three latest points (formula (10)) and estimates $R_0$ with a value named, say, $\rho_2$. We continue with four, five, $\ldots$ points and estimate $\rho_3$, $\rho_4$, $\ldots$, respectively. We stop the iteration whenever two consecutive estimations of $R_0$ differ by less than a prescribed tolerance, that is, whenever

$$ \left| \rho_n - \rho_{n-1} \right| < tol, $$

we accept $\rho_n$ as the value on demand. In case this does not happen, we accept the last computed $\rho_n$ as the final approximation of $R_0$. We implement this procedure as a MATLAB [20] program in Appendix A.
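A minimal sketch of this adaptive strategy is given below. It is our own illustration and not the program of Appendix A: the function names, the starting window of three points and the tolerance argument tol are assumptions made here for concreteness.
function ro=rho_adaptive(r,s,tol)
% Sketch of the adaptive window strategy (illustrative, not the Appendix A program).
% r, s : column vectors of past data with the latest values last; tol : stopping tolerance.
m=length(r);
roprev=rho_ls(r(m-2:m),s(m-2:m));            % start with the three latest points, i.e., (10)
for n=3:m-1                                  % add one more point from the past at a time
  ro=rho_ls(r(m-n:m),s(m-n:m));
  if abs(ro-roprev)<tol, return; end         % two consecutive estimates agree
  roprev=ro;
end
ro=roprev;                                   % no convergence within the data: keep the last estimate
end

function ro=rho_ls(r,s)
p=polyfit(r(:),s(:),1);                      % least squares slope, formula (9)
ro=-p(1)/mean(s);                            % formula (2) with the mean value of s
end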
3. Preliminary Tests
We choose the static case $\beta = 0.1$ and $\gamma = 0.05$, with initial conditions $s(0) = 0.9999$, $i(0) = 0.0001$ and $r(0) = 0$. Here, $R_0 = \beta/\gamma = 2$. Then, we obtain the values of $s$, $i$ and $r$ for $t = 0, 1, \ldots, 100$ by the following lines in the command window of MATLAB.
>> beta=0.1;gamma=0.05;
>> fcn=@(t,x)[-beta*x(1)*x(2);
beta*x(1)*x(2)-gamma*x(2);
gamma*x(2)];
>> [tout,xout]=ode45(fcn,(0:1:100)',[0.9999 0.0001 0]');
Now, we may approximate the actual value of $R_0$ for $t = 20, 21, \ldots, 100$. Thus, we type
>> ro=zeros(101,1);
>> for j1=21:101,ro(j1)=r00(xout(j1-20:j1,3),xout(j1-20:j1,1));end;
>> max(abs(ro(21:101)-2))
ans =
3.2964e-005
i.e., we obtain almost five digits of accuracy. This result was obtained by using the data from only three consecutive data pairs (i.e., by using only (10)). We observed this behavior since the parameters are constant. The same result is also attained for other selections of $\beta$ and $\gamma$.
The method also applies in the case of varying parameters. In the next test, we choose constant $\gamma = 0.1$ and $\beta(t) = 0.1 + t/1000$. Then, we type in MATLAB
>> fcn=@(t,x) [-(0.1+t/1000)*x(1)*x(2);
(0.1+t/1000)*x(1)*x(2)-.1*x(2);
0.1*x(2)];
>> [tout,xout]=ode45(fcn,(0:1:100)',[0.999 0.001 0]');
In this paradigm, we have $R_0(t) = \beta(t)/\gamma = 1 + t/100$, and the performance of the method is checked by the following
>> ro=zeros(101,1);
>> for j1=21:101,ro(j1)=r00(xout(j1-20:j1,3),xout(j1-20:j1,1));end;
>> max(abs(ro(21:101)-1-(20:100)'./100))
ans =
0.0149
i.e., the error is in the second decimal place. This is a rather good approximation of a time-varying $R_0(t)$, which takes values in the interval $[1.2,\, 2]$ over the tested range.
A very interesting issue is that using the mean value of the data $s$ in (2) calibrates the result to the correct value of $R_0$. The percentage of susceptibles $s$ varies slowly. Using the mean value instead of $s_0$ alone causes only a small difference, which nevertheless delivers 2–3 more digits of accuracy. Additionally, we mention that this new approach is much easier to compute than our previous method [21]. It also seems to achieve better accuracy.
4. Tests on Real COVID-19 Data
We will test the new approach using real COVID-19 data. The time-series data of confirmed, recovered, and death cases for various countries were retrieved from [22]. The data are presented in the format shown in Table 1.
The two rightmost columns (recovered and deaths) are summed to form the vector $r$, after dividing this sum by the country's population. The vector $s$ is formed by subtracting the confirmed cases from the population and dividing by the population. The population of each country was retrieved from Wikipedia [23]. The data in [22] are not always reliable: data are missing or reported inaccurately for some countries. The method presented here does not apply properly at the outbreak of a disease, when the values in $r$ are rather small. Then, we may apply some scaling, e.g., use the vectors $S$ and $R$ instead. The values in $r$ have to vary somehow in order to obtain reliable results. Thus, no trustworthy results can be derived from the following Table 2.
We do not believe that any serious method can extract something reliable from the above data. By this, we mean that, at the beginning of an outbreak, we may not obtain a clear picture of the situation from a single country's data. Only a global view may raise some interest in these numbers. Thus, the method at hand may apply after the initial development of the pandemic.
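For completeness, a sketch of how the vectors $s$ and $r$ can be assembled for a single country in MATLAB follows. The file name country.csv, its column names, the population value and the tolerance are our own assumptions for illustration; the actual layout of the data in [22] may differ.
>> T=readtable('country.csv');                % hypothetical file with columns: date, confirmed, recovered, deaths
>> N=83.2e6;                                  % population of the country (here roughly Germany's), cf. [23]
>> r=(T.recovered+T.deaths)/N;                % removed fraction: the two rightmost columns summed, over N
>> s=(N-T.confirmed)/N;                       % susceptible fraction
>> ro=rho_adaptive(r(end-13:end),s(end-13:end),1e-2)   % latest 14 data, sketch function of Section 2.4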
We produced these results using at least 14 of the latest data points in order to obtain smoother curves.
We observe that $R_0$ decays for Russia and, after mid-January 2021, stays below 1. The corresponding parameter for Germany stayed below 1 in the first months of 2021, but it seems that this was no longer true from the beginning of March. Finally, $R_0$ stayed below 1 for Italy until the end of February, and climbed above it in March.
The data for the countries mentioned above seem to be reliable. However, there are cases with misreported data. In France, after a year, only about a quarter of a million people were reported as recovered, while there were about 4 million infected! The same problem is apparent in the data for the UK and other countries.
Thus, the new method cannot be applied to corrupted data, as is expected for any other method. Least squares may circumvent an outlier or some misreported data, but consistently faulty data are untreatable.
5. Conclusions
A new formula for directly estimating the reproduction number $R_0$ present in the SIR epidemic model is derived here. Using only the percentage values of susceptible $s$ and recovered $r$ at consecutive days, we form a linear least squares approximation of the derivative $ds/dr$. This approximation is non-positive (i.e., it preserves monotonicity). Then, the new formula stays close enough to the true $R_0$. For use with real COVID-19 data, we implemented an iterative technique that promises convergence to the actual value of $R_0$. Similar research is planned in the future for other models, such as SIS, SIRD, and SEIR.