In this section, we apply the PR, NBR and PIGR models to a real data set containing the number of dengue-fever cases recorded in the city of Campo Grande, state of Mato Grosso do Sul, Brazil, in the period from January 2008 to December 2019.
Due to the favorable climate for the proliferation of the dengue-transmitting mosquito, especially between October and March, the city records a large number of dengue cases every year. One dengue-control strategy implemented by the city government relies on health agents who visit city neighborhoods to provide information on dengue and on how to eliminate the transmitting mosquito. Additionally, the city government runs a neighborhood-cleaning program to eliminate possible breeding sites of the mosquito.
Thus, in order to contribute to the dengue surveillance system in the city of Campo Grande, MS, this article proposes fitting a statistical model to identify the climatic variables that can influence the number of dengue cases. Once the variables are identified, the fitted model allows projections and the simulation of different scenarios for the evolution of the number of cases of the disease. Therefore, it can help in decision-making regarding the implementation of measures to combat and/or control the vector that transmits the disease.
Results
Consider $y_i$ to be the number of dengue-fever cases recorded in month $i$ in the city of Campo Grande, MS state, Brazil, in the period from January 2008 ($i = 1$) to December 2019 ($i = 144$). These measures are freely available on the website http://tabnet.datasus.gov.br/cgi/tabcgi.exe?sinannet/cnv/denguebbr.def (accessed on 10 November 2020) and can also be obtained by emailing the authors of the present article.
Let $\mathbf{X}$ be a matrix of dimension $144 \times 5$ composed of the recorded measures of the five explanatory variables, $x_1$ to $x_5$. The recorded measures of the climatic variables are freely available at https://www.cemtec.ms.gov.br (accessed on 8 December 2020). The complete dataset is a matrix of dimension $144 \times 6$. The first column contains the recorded number of dengue-fever cases in each of the 144 months considered in the study. Columns 2 to 6 contain the recorded values of the explanatory variables $x_1$ to $x_5$.
Figure 1 shows the number of recorded dengue-fever cases from 2007 to 2019. The figure includes the cases recorded in 2007 only to show that every three years the city of Campo Grande presents a larger outbreak of dengue-fever cases. However, the cases recorded in 2007 were not used to fit the models because the website https://www.cemtec.ms.gov.br (accessed on 8 December 2020) does not include values of the climatic explanatory variables for 2007.
Figure 2 shows the evolution of the number of dengue-fever cases by month. Due mainly to the climate of the city, characterized by high heat and heavy rains from October to March, this period concentrates most of the dengue-fever cases recorded in the city. This fact shows the importance of having a model that projects the number of dengue cases from environmental variables, in order to support actions to combat the proliferation of the mosquito and, consequently, to reduce the number of cases.
Table 2 shows the descriptive statistics of the recorded $y$ values in the period from January 2008 to December 2019. The smallest recorded value was 2 cases in August 2008. The highest recorded value was 18,530 cases in January 2013. On average, 1057 cases were recorded per month in the period considered.
Table 3 shows the correlations for each pair of variables. As one can note, a single pair of explanatory variables presents the highest correlation. However, since this correlation is not strong (it does not exceed 0.75), we opt to maintain both variables for the fitting of the models.
In addition, we verify whether there is multicollinearity among the explanatory variables by means of the variance inflation factor (VIF) values for the PR and NBR models [29]. At this point, we remind the reader that multicollinearity occurs when two or more explanatory variables are highly correlated with one another in a regression model. That is, one explanatory variable can be predicted from another explanatory variable. A VIF value equal to 1 means that the predictor is not correlated with the other variables. The higher the value, the greater the correlation of the variable with the other variables. In general, values smaller than 5 indicate weak correlation, values between 5 and 10 indicate moderate correlation, and values equal to or greater than 10 indicate high correlation.
In order to calculate the VIF values, we first fit the PR and NBR models using the R software and the glm function. We then obtain the VIF values by applying the vif function of the car package. Listing 1 shows the R code used. The VIF values are presented in Table 4. As one can see, all values are less than five, which indicates weak multicollinearity. Therefore, all five explanatory variables are used to fit the models.
Listing 1. R code.
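As a rough illustration of the VIF computation described above, the following sketch fits a Poisson regression with glm on simulated data and derives the VIF values from the correlation matrix of the coefficient estimates. To our understanding, this is the quantity that the vif function of the car package reports for terms with one degree of freedom, but the sketch does not call car itself. The data, the column names (temp, humid, rain) and the coefficients are assumptions made for illustration; they are not the Campo Grande measurements.

```r
# Illustrative sketch: simulated data, hypothetical variable names.
set.seed(1)
n <- 144                                   # same number of months as in the study
d <- data.frame(
  temp  = rnorm(n, mean = 25, sd = 3),     # hypothetical temperature
  humid = rnorm(n, mean = 65, sd = 10),    # hypothetical relative humidity
  rain  = rgamma(n, shape = 2, rate = 0.02) # hypothetical rainfall
)
d$y <- rpois(n, lambda = exp(1 + 0.05 * d$temp + 0.01 * d$humid))

# Poisson regression (PR) fitted with base R's glm
fit_pr <- glm(y ~ temp + humid + rain, family = poisson, data = d)

# Covariance matrix of the slope estimates (intercept row/column dropped)
V <- vcov(fit_pr)[-1, -1]

# For single-df terms, the VIF is the diagonal of the inverse of the
# correlation matrix of the coefficient estimates
vif_pr <- diag(solve(cov2cor(V)))
print(round(vif_pr, 3))
```

Because the three simulated predictors are generated independently, the resulting values should stay close to 1, which is the "no correlation" end of the scale discussed above.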
Using the sample average and sample variance presented in Table 2, the overdispersion index given in Expression (3) indicates that the recorded values are overdispersed. Additionally, we apply the overdispersion test of Cameron and Trivedi [24] (CT test), using the overdisp() function of the R software. Figure 3 shows the output of the test in the R software. As one can note, the null hypothesis is rejected at the usual significance levels, meaning that there is evidence of overdispersion.
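The two overdispersion checks described above can be sketched in base R as follows. The sketch assumes that the overdispersion index is the usual variance-to-mean ratio, and it implements one common regression-based form of the Cameron and Trivedi test (an auxiliary regression without intercept); the overdisp() function may differ in its details. The data are simulated, and all object names are hypothetical.

```r
# Illustrative sketch: simulated overdispersed counts, not the dengue data.
set.seed(2)
n  <- 144
x  <- rnorm(n)
mu <- exp(3 + 0.5 * x)
y  <- rnbinom(n, size = 0.8, mu = mu)      # variance = mu + mu^2 / 0.8

# Overdispersion index, assumed here to be variance / mean
disp_index <- var(y) / mean(y)

# Step 1 of the CT test: fit the Poisson regression
fit    <- glm(y ~ x, family = poisson)
mu_hat <- fitted(fit)

# Step 2: auxiliary regression of ((y - mu)^2 - y) / mu on mu, no intercept
z   <- ((y - mu_hat)^2 - y) / mu_hat
aux <- lm(z ~ 0 + mu_hat)
ct  <- summary(aux)$coefficients["mu_hat", ]

# One-sided p-value for H0: equidispersion vs H1: overdispersion
p_one_sided <- unname(pt(ct["t value"], df = aux$df.residual, lower.tail = FALSE))

print(disp_index)
print(ct)
print(p_one_sided)
```

An index well above 1 together with a positive slope in the auxiliary regression points in the same direction as the results reported for the dengue data.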
Both results described above indicate that the PR model is not appropriate for this dataset. Due to this, hereafter we fit the NBR and PIGR models to the dataset and compare these two models according to the AIC and BIC model-selection criteria. The best model is the one that has the smallest AIC and BIC values.
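For reference, the two criteria are usually defined as below, where $\hat{\ell}$ is the maximized log-likelihood of the fitted model, $k$ is the number of estimated parameters (or effective degrees of freedom) and $n$ is the sample size. This notation is introduced here for illustration and is not taken from the earlier sections.

```latex
\mathrm{AIC} = -2\hat{\ell} + 2k,
\qquad
\mathrm{BIC} = -2\hat{\ell} + k \log n
```

Both criteria reward a higher likelihood and penalize model complexity; the BIC penalty grows with $n$, so it tends to favor simpler models than the AIC.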
We fit the NBR and PIGR models using the gamlss() function of the gamlss package of the R software. Since the month variable has cyclical values, we fit both models with a cyclical P-spline term for this variable, using the pbc() function inside the gamlss() call. In addition, we fit both models with smooth terms for the continuous explanatory variables, using the pb() function. We refer to the models fitted with the pb() function as NBR-S and PIGR-S, respectively. Listing 2 shows the R code used for fitting the models.
Listing 2. R code.
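A minimal sketch of how the four fits described above might look with the gamlss package is given next. It is not the original listing: the data are simulated, the column names (y, month, temp) are hypothetical, and a single continuous covariate stands in for the climatic variables. The block runs only if gamlss is installed.

```r
# Illustrative sketch: simulated data with hypothetical column names.
if (requireNamespace("gamlss", quietly = TRUE)) {
  library(gamlss)
  set.seed(3)
  n <- 144
  dengue <- data.frame(month = rep(1:12, times = 12),
                       temp  = rnorm(n, mean = 25, sd = 3))
  mu <- exp(4 + 1.2 * cos(2 * pi * dengue$month / 12) +
              0.05 * (dengue$temp - 25))
  dengue$y <- rnbinom(n, size = 1, mu = mu)

  # NBR and PIGR: cyclic P-spline for month, linear term for temp
  nbr  <- gamlss(y ~ pbc(month) + temp, family = NBI,
                 data = dengue, trace = FALSE)
  pigr <- gamlss(y ~ pbc(month) + temp, family = PIG,
                 data = dengue, trace = FALSE)

  # NBR-S and PIGR-S: additionally a P-spline smooth for the continuous term
  nbr_s  <- gamlss(y ~ pbc(month) + pb(temp), family = NBI,
                   data = dengue, trace = FALSE)
  pigr_s <- gamlss(y ~ pbc(month) + pb(temp), family = PIG,
                   data = dengue, trace = FALSE)

  # Model comparison: k = 2 gives the AIC, k = log(n) gives the BIC
  aic_tab <- GAIC(nbr, nbr_s, pigr, pigr_s, k = 2)
  bic_tab <- GAIC(nbr, nbr_s, pigr, pigr_s, k = log(n))
  print(aic_tab)
  print(bic_tab)
}
```

Here NBI and PIG are the negative binomial and Poisson-inverse Gaussian families of gamlss; summary() applied to any of the fitted objects reports the p-values of the parametric terms.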
At the adopted significance level, none of the variables was significant for the NBR model. For the NBR-S and PIGR models, two of the explanatory variables were not significant, and the PIGR-S model also presented non-significant terms. Due to this, we discard the NBR model and refit the NBR-S, PIGR, and PIGR-S models without the non-significant variables.
Table 5 shows the model-comparison criteria for the three fitted models. The smallest values are highlighted in boldface. Since the AIC and BIC values of the PIGR and PIGR-S models are very similar and their RMSE values are equal, we retain the PIGR model as the best one, because the smooth terms did not lead to a significant improvement in the fit.
With the models fitted, it is important to perform a residual analysis in order to identify discrepancies between the models and the data, and to assess the overall goodness-of-fit. In a normal linear regression scenario, the Pearson and deviance residuals are usually considered. However, these residuals are not suitable for problems in which the response variable is discrete because they are not normally distributed and, according to Feng et al. [30], “have nearly parallel curves according to the distinct discrete response values, imposing great challenges for visual inspection”. To circumvent this issue, Dunn and Smyth [21] propose the use of randomized quantile residuals (RQR). According to the authors, this kind of residual is particularly suitable for visualizing the goodness-of-fit of count regression models.
In order to calculate the RQR, we first need to obtain the cumulative distribution function, $F(y_i; \hat{\theta})$, of the model considered, for $i = 1, \ldots, n$. For the continuous case, the $F(y_i; \hat{\theta})$ values are uniformly distributed on the interval $(0, 1)$, and the RQR is defined as $r_i = \Phi^{-1}(F(y_i; \hat{\theta}))$, where $\Phi$ is the cumulative distribution function of the standard normal distribution. However, since the cumulative distribution function $F$ for the models considered (NBR and PIGR) is not strictly continuous, but a step function, a randomization is introduced to produce continuous normal residuals. Thus, in order to get the RQR, Dunn and Smyth [21] propose the following strategy. For $i = 1, \ldots, n$:
Determine a point $a_i = \lim_{y \uparrow y_i} F(y; \hat{\theta})$, i.e., $a_i$ is the value of $F$ when approaching $y_i$ from the left;
Determine $b_i = F(y_i; \hat{\theta})$, i.e., the value of $F$ at the point $y_i$;
Generate a value $u_i$ from a uniform distribution on the interval $(a_i, b_i]$;
Calculate the RQR $r_i = \Phi^{-1}(u_i)$.
We obtained the RQR values for NBR and PIGR models using the residuals function of the R software.
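The four steps of the Dunn and Smyth strategy can be sketched in base R for a negative binomial model as follows. This is only an illustration: the data are simulated, the true parameters are used in place of fitted values, and the function name rqr_nb is hypothetical. For the PIG model, the same steps would apply with the PIG cumulative distribution function (pPIG in the gamlss.dist package) in place of pnbinom.

```r
# Illustrative sketch: randomized quantile residuals for negative binomial counts.
set.seed(4)
n    <- 2000
mu   <- exp(2 + 0.5 * rnorm(n))            # stand-in for fitted means
size <- 1.5                                # stand-in for the fitted dispersion
y    <- rnbinom(n, size = size, mu = mu)

rqr_nb <- function(y, mu, size) {
  a <- pnbinom(y - 1, size = size, mu = mu)  # F just to the left of y (0 when y = 0)
  b <- pnbinom(y,     size = size, mu = mu)  # F at y
  u <- runif(length(y), min = a, max = b)    # uniform draw between a and b
  qnorm(u)                                   # map to the standard normal scale
}

r <- rqr_nb(y, mu, size)
print(c(mean = mean(r), sd = sd(r)))
```

When the model is correctly specified, as in this simulation, the residuals should behave like a standard normal sample, with mean close to 0 and standard deviation close to 1.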
Figure 4 shows the normal quantile-quantile plot (q-q plot) for the randomized quantile residuals of the NBR-S and PIGR fitted models. The q-q plot is a scatterplot created by plotting the empirical quantiles of the residuals against the theoretical quantiles of the normal distribution. If residuals are normally distributed then they should form an approximately straight line.
Figure 5 shows the worm plot. This graph was proposed by van Buuren and Fredriks [31] to identify regions (intervals) of the explanatory variable within which the model does not fit the data adequately [15]. In this graph, the upward-sloping line of the q-q plot is rotated to the horizontal in order to remove the trend, and the Y axis shows, for each residual, the difference between its location in the theoretical and empirical distributions. If the residuals follow a normal distribution, then the Y values are near the horizontal line and consequently inside the confidence band. The R function wp() provides the worm plot for a gamlss fitted model. As one can note, both figures indicate that the PIGR model performs better than the NBR model. In addition, the graphs of the residuals from the PIGR model give no reason to worry about the adequacy of the fit.
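The quantities behind a worm plot can be computed by hand, which may help in reading the figure. The sketch below detrends a normal q-q plot and builds a pointwise 95% band from the standard error of normal order statistics. It uses simulated standard normal values as a stand-in for the randomized quantile residuals, and the object names are hypothetical; the wp() function may differ in details such as the plotting positions.

```r
# Illustrative sketch: detrended q-q plot (worm plot) computed by hand.
set.seed(5)
r <- rnorm(144)                 # stand-in for randomized quantile residuals
n <- length(r)

p   <- ppoints(n)               # plotting positions
z   <- qnorm(p)                 # theoretical normal quantiles
dev <- sort(r) - z              # empirical minus theoretical quantile

# Pointwise 95% band around zero for the deviations
se   <- sqrt(p * (1 - p) / n) / dnorm(z)
band <- qnorm(0.975) * se

inside <- mean(abs(dev) <= band)  # share of points inside the band
print(inside)
```

Plotting dev against z, with band and -band added as dashed curves, gives the familiar worm shape; points that leave the band flag quantile regions where the fitted distribution and the data disagree.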
Table 6 shows the estimates for the parameters of the PIGR model.
Figure 6 shows the estimated relationships between the response variable and the explanatory variables. As expected, the relationship with the month presents a cyclical behavior, and the relationships with the two remaining explanatory variables are linear. These graphs were constructed using the term.plot function of the R software.
Figure 7 shows the number of registered dengue cases (symbol •) and a confidence band generated from the fitted PIGR model. In order to construct the confidence band we use a parametric bootstrap. That is, from the estimated values $\hat{\mu}$ and $\hat{\sigma}$ of the PIG parameters, we generate values from a PIG distribution using the rPIG function of the R software. Then we set the lower and upper limits of the band as the corresponding lower and upper percentiles of the generated values. As one can note, the fitted model indicates that a peak will occur every year. How high or low the recorded number of dengue cases will be in relation to the expected peak (given by the fitted model) depends on the actions taken to combat the proliferation of the mosquito. If such actions are effective, no peak occurs, as in the years 2008, 2009, 2011, 2012, 2014 and 2017. Otherwise, the peak may be higher than expected, i.e., there may be a larger outbreak, as in the years 2010, 2013, 2016 and 2019. That is, human behavior has a great influence on the number of cases that will be recorded. However, since this behavior is very difficult to quantify and is not present in the proposed model, it limits the predictive performance of the fitted model.
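The parametric-bootstrap band described above can be sketched as follows. Several choices here are assumptions made for illustration: the level (95%), the number of draws (B = 1000), the simulated fitted means, and the use of base R's rnbinom as a stand-in sampler so that the block runs without extra packages. With gamlss.dist installed, the sampler would instead be rPIG(B, mu = m, sigma = sigma_hat), using the estimates from the fitted PIGR model.

```r
# Illustrative sketch: percentile band from a parametric bootstrap.
set.seed(6)
B      <- 1000                                     # assumed number of draws
level  <- 0.95                                     # assumed confidence level
mu_hat <- exp(5 + 1.5 * cos(2 * pi * (1:144) / 12)) # stand-in monthly fitted means
size_hat <- 1.2                                    # stand-in dispersion estimate

# Stand-in sampler; replace with rPIG(B, mu = m, sigma = sigma_hat) for a PIG fit
draw <- function(B, m) rnbinom(B, size = size_hat, mu = m)

alpha <- 1 - level
band <- t(sapply(mu_hat, function(m) {
  quantile(draw(B, m), probs = c(alpha / 2, 1 - alpha / 2))
}))
colnames(band) <- c("lower", "upper")

head(band)
```

Each row of band holds the lower and upper limits for one month, so plotting the two columns against time together with the observed counts reproduces the layout of a figure such as Figure 7.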
For example, in the year following each peak year (2007, 2010, 2013 and 2016), there was a significant reduction in recorded cases due to the implementation of actions to combat the proliferation of the disease vector and to awareness campaigns reminding the population of what had happened the previous year. However, once the expected reduction in the number of recorded dengue cases was achieved, the combat actions and awareness campaigns were not maintained, leading to an increase in the number of cases in the following two years. This has been occurring cyclically over the last 13 years.
Thus, although the proposed model does not present a satisfactory predictive performance, especially due to our inability to quantify and insert into the model the actions taken to combat the transmitting mosquito, it has at least three advantages: (i) it performs better than the usual approaches, which are based on the fitting of PR and NBR models; (ii) it shows that a peak will occur every year and that the only way to avoid this peak is to implement actions to combat the proliferation of the transmitting mosquito; and (iii) it identifies the months of the year in which combat actions must be implemented.