1. Introduction
COVID-19 is a serious acute respiratory syndrome caused by the beta-coronavirus SARS-CoV-2, which was first reported at the end of 2019 [
1]. The rate of transmission has ranked COVID-19 as the worst pandemic of the century in terms of scale and speed [
2]. Transmission can occur through direct, indirect, or close contact with secretions of infected individuals or through direct contact with infected surfaces. SARS-CoV-2 enters the host cells via interaction with its entry receptor, angiotensin-converting enzyme 2 (ACE2), and an activating receptor, a protease such as TMPRSS2 or cathepsin [
3].
The first case of COVID-19 in South America was reported on 25 February 2020 in the city of São Paulo, Brazil, which is an important travel hub for the region [
4]. Since then, important control measures, such as total or partial closing of sea, land, and air borders; travel restrictions; shutdown of schools and colleges; and imposed lockdowns, were implemented in different ways in Brazil and other countries of the region. According to official data, Brazil has reported the highest number of cases of COVID-19 in South America [
5,
6].
Brazil has the highest Gross Domestic Product (GDP) in South America, and the population density varies according to the regional division of the country. The mean population density is 24.69 inhabitants per square kilometer [
7]. According to Organisation for Economic Co-operation and Development (OECD) [
8], Brazil has approximately 1.3 million independent or liberal professionals. Informal employment ranges from 20 to 49% of the workforce of the country. Formal and informal work positions were directly affected by restrictions to contain the spread of the virus, most of them based on social distancing. Data collected from 239 slum communities [
9], where approximately 6.5% of the Brazilian population lives, showcase part of the Brazilian reality for formal and informal workers during the COVID-19 pandemic. Approximately 72% of residents reported that they did not have any savings to fall back on, while 15% had only the equivalent of one minimum wage in savings to survive the next month. Approximately 50% of the residents of these communities are liberal professionals or rely on informal work positions as their main source of income.
National and hallmark holidays usually involve a massive mobility of people seeking stores and malls, parks, and beaches. During the COVID-19 pandemic, the 2020 Brazilian calendar maintained nine days of national holidays; seven of them included extended weekends, from Friday to Sunday and/or Monday, which increases the circulation of individuals. At the beginning of February, crowds were reported across the country [
10]. The government of the city of São Paulo estimates that the local carnival took about 15 million people to the streets in February of 2020 [
11].
To date, the Brazilian scenario has demonstrated that the pandemic has deepened the already-existing political, social, and economic issues in the country [
7]. Although the universal Brazilian public health system has been an example to the world in managing other major outbreaks such as dengue fever, measles, Zika, and chikungunya viruses [
12,
13], the country has been coping with several issues that affect COVID-19 prevention. The main ones are the lack of water supply, limited access to hand sanitizers and masks, and the lack of community engagement [
7]. Social distancing, among some 6068 non-pharmaceutical interventions, is the most effective method adopted by leaders around the world, with a major impact on decreasing the transmission rate [
14]. As it turns out, the dilemma of adopting social distancing in socioeconomically disadvantaged areas such as slums is difficult to assess [
15,
16].
In this context, this study has three main contributions: (1) investigating the impact of Brazilian national holidays on social distancing and the evolution of COVID-19 in the country; (2) assessing the reproduction number using the SEIRD model and applying Principal Component Analysis (PCA) to reduce the dimensions of a dataset containing community mobility reports, using the outcomes produced by these methods as input data for regression; and (3) demonstrating that estimates from data-driven pipelines based on holidays and multi-variate LSTM neural networks are appropriate to predict 14-day COVID-19 daily deaths in Brazil. Our study provides evidence for the impact of holidays, community mobility, and the association of these factors with crowding. Our results indicate that the spread of the virus accelerates after the holiday breaks, which may eventually influence the access to adequate medical attention or ICU beds.
2. Materials and Methods
The open-access dataset of COVID-19 consists of official reports provided by the sanitary authority of Brazilian states [
17]. The dataset contains country-level, daily-updated data retrieved from the Brazilian Ministry of Health and the Brazilian Institute of Geography and Statistics (IBGE). It includes reported cases, daily fatalities, number of cases per epidemiological week, number of deaths per day, total number of deaths, and reports of COVID-19 recovered and vaccinated individuals, among others (
Table 1). Data are presented in absolute numbers or as rates per 100,000 inhabitants. Our analysis covers a time series starting on 25 February 2020, comprising features for all 26 Brazilian states.
SARS-CoV-2 dispersed throughout the country rapidly after the first official report. On 13 April 2021, the 16th week of the year, 82,186 cases of COVID-19 and 3808 daily deaths were reported in Brazil. Considering a 7-day moving average, these numbers represent approximately 71,344 cases and 3068 deaths (
Figure 1a,b). These values are the result of the sum of reports in all Brazilian states. Nevertheless, a certain stability is observed in the death curve in the period ranging from weeks 22 to 34 (
Figure 1b). This could be attributed to several factors, including non-pharmaceutical interventions adopted to reduce the contact among people, which ultimately influence the amount of viral load that an individual is exposed to. Additionally, the relative availability of hospitals that were not operating at their full capacity provided proper access to ICU beds and adequate management of infected patients.
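The 7-day moving average mentioned above can be illustrated with a short sketch. It assumes a trailing window (each day averaged with the six preceding days); the text does not state whether a trailing or centered window was used, and the sample values below are made up for illustration only.

```python
def moving_average(values, window=7):
    """Trailing moving average; early days use the shorter window available."""
    averaged = []
    for idx in range(len(values)):
        # Slice the current day plus up to (window - 1) preceding days.
        chunk = values[max(0, idx - window + 1): idx + 1]
        averaged.append(sum(chunk) / len(chunk))
    return averaged


# Made-up daily counts with a weekend reporting dip (not real data).
daily_reports = [100, 120, 130, 125, 110, 40, 30, 105, 125, 135]
smoothed = moving_average(daily_reports)
```

Smoothing of this kind dampens weekly reporting artifacts, which is why the averaged figures differ from the raw daily counts.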
Geographic distributions for deaths by COVID-19 in all Brazilian states are shown in
Figure 2. The majority of casualties are concentrated in the Southeast region of the country, which includes the states of Espírito Santo (ES), São Paulo (SP), Rio de Janeiro (RJ), and Minas Gerais (MG).
Rather than analyzing data from individual Brazilian states, this work aims to understand the social component across the country, despite the effect of local singularities in each Brazilian city and state. To briefly highlight such singularities, we selected three populous states with the highest mortality rates for COVID-19 in Brazil: São Paulo, Rio de Janeiro, and Minas Gerais. These states have 40% of the estimated Brazilian population.
Population densities are a result of the size of each state. São Paulo, for example, has an estimated population of 46,649,132 inhabitants and a density of 166.25 inhab/km². People are relatively dispersed in the state. However, more than 22 million people are concentrated in the metropolitan area of the city of São Paulo. The same happens in Minas Gerais, with a population of approximately 21.5 million people, a density of 33.41 inhab/km², and more than six million people residing in the metropolitan area of the state capital, Belo Horizonte. Rio de Janeiro, the third smallest state in the country, has a population density of 365.23 inhab/km² and, like the other two states, concentrates approximately 76% of its 17,463,349 inhabitants in the metropolitan area of the state capital, the city of Rio de Janeiro [
18]. Altogether, the three states have a major influence on Brazilian reports of COVID-19, but São Paulo seems to help shape the curve of Brazilian daily cases (
Figure 3).
The Brazilian official calendar includes several federal holidays and religious festivities that can lead to crowding. State holidays and regional festivities were not included in the analysis. There were 12 official national holidays for 2020 (
Table 2). Due to the severity of COVID-19 worldwide, the Brazilian Ministry of Defense followed recommendations of health authorities and canceled military parades and other festivities of September 7th to avoid public events that could increase the spread of new variants of SARS-CoV-2 [
9,
19]. The two rounds of Brazilian elections, on 15 November and 29 November, were also included as potential sources of crowding.
Holidays usually involve increased mobility of people. Thus, we assessed the mobility of the Brazilian community using data retrieved from Google LLC (Alphabet Inc., Mountain View, CA, USA) during the epidemiological weeks evaluated herein. The company started publishing mobility reports [
20] in early April 2020. These reports showcase how COVID-19 and countermeasures influenced the dynamics of mobility over time. The mobility database presents the median of the data collected daily, after noise removal, rather than raw quantities [
21].
Changes and trends observed in mobility data in several contexts, such as retail and recreation, supermarket and pharmacy, parks, transit stations, businesses, and residential areas, and the variation of average daily cases of COVID-19 per week can be seen in
Figure 4. There is a contrast in mobility patterns, daily reports, and deaths from weeks 14 to 20 of 2021 (
Figure 1b and
Figure 4b). The increase in the number of cases was followed by a growth of death reports during the worst chapter of the COVID-19 pandemic in Brazil. As a result, celebrations were canceled and stricter measures to contain mobility were implemented. Thus, the assessment of holidays and mobility patterns can reveal trends to be used as input data in regression systems to study COVID-19.
All figures, code, and tests for the graphics above were written in Python 3.7, using the libraries Pandas 1.3, TensorFlow 2.6.0, NumPy 1.21, Matplotlib 3.4.3, and SciPy 1.21.2.
2.1. Principal Component Analysis
The mobility report dataset provided by Google contains a complex arrangement of information and patterns that are influenced by different community sectors, habits, and external events. Nevertheless, with the transformation of such information into data, it is possible to produce an independent variable that may be able to generalize a temporal event or situation. Principal Component Analysis (PCA) allowed us to perform the analysis of different projections of the dataset without losing significant information and to generalize dimensions with a smaller number of data.
The PCA method reduces the number of dimensions in a scenario with multiple variables by using a linear transformation to turn a large number of correlated original variables into a smaller number of uncorrelated variables [
22]. The number of new dimensions is smaller than or equal to the number of original variables used in the algorithm [
23].
PCA is a statistical multi-variate analysis method that aims to identify the main factors that cause the most variation in a set of data. Our primary goal is therefore to condense the information contained in several original variables into a smaller set with as little information loss as possible.
To set up the sorting of the principal components (PC) of the mobility report data using PCA, the first step was to measure the average value of all dimensions of the database. To prevent any dimension from contributing disproportionately to the process, we standardized the data so that all inputs share a common scale. This mapping is given by:
$$z = \frac{x - \mu}{\sigma},$$
where $z$ is the scaled value, $x$ is the original value, and $\mu$ and $\sigma$ are the mean and standard deviation, respectively.
Next, we computed the covariance of variables and established the covariance matrix $A$, which is given by:
$$A = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(z_i - \bar{z})^{T},$$
where $z_i$ is the vector of standardized values for observation $i$, $\bar{z}$ is the mean of these vectors, and $n$ is the number of observations.
An eigenvector, or characteristic vector, of a linear transformation is a nonzero vector that changes only by a scalar factor when that transformation is applied to it. The factor by which the eigenvector is scaled corresponds to the eigenvalue. It is therefore necessary to compute the eigenvectors and corresponding eigenvalues of the matrix $A$:
$$\det(A - \lambda I) = 0,$$
where $I$ is the identity matrix, and $\lambda$ is the eigenvalue, which is a scalar value. This means that $A$ defines the linear transformation. Finally, the eigenvectors are sorted by decreasing eigenvalues, and the $k$ eigenvectors with the largest eigenvalues are chosen as the PC.
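The four PCA steps described in this subsection (standardization, covariance matrix, eigen-decomposition, selection of the k leading eigenvectors) can be sketched with NumPy as follows. This is a minimal illustration, not the authors' pipeline: the function name `pca`, the choice of `numpy.linalg.eigh`, and the synthetic six-column input standing in for the mobility categories are all assumptions.

```python
import numpy as np


def pca(data, k):
    """Project `data` (rows = days, columns = variables) onto k components."""
    x = np.asarray(data, dtype=float)
    # Step 1: standardize every column to zero mean and unit variance.
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # Step 2: covariance matrix of the standardized columns.
    a = np.cov(z, rowvar=False)
    # Step 3: eigen-decomposition (eigh is meant for symmetric matrices).
    eigenvalues, eigenvectors = np.linalg.eigh(a)
    # Step 4: sort by decreasing eigenvalue and keep the k leading vectors.
    order = np.argsort(eigenvalues)[::-1][:k]
    components = eigenvectors[:, order]
    explained = eigenvalues[order] / eigenvalues.sum()
    return z @ components, components, explained


# Synthetic stand-in for six mobility categories over 100 days (not real data).
rng = np.random.default_rng(0)
trend = rng.normal(size=(100, 1))
mobility = trend @ rng.normal(size=(1, 6)) + 0.1 * rng.normal(size=(100, 6))
scores, components, explained = pca(mobility, k=2)
```

Here `scores` holds the projected series that could serve as regression inputs, and `explained` gives the share of total variance captured by each retained component.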
2.2. Epidemiological SEIRD Model
The
Susceptible–Exposed–Infected–Recovered–Dead model, also known as SEIRD [
24,
25], is an epidemiological analysis based on the
Susceptible–Infected–Removed model, known as SIR [
26,
27]. SEIRD is intended to enhance the SIR model in order to explain the evolution of a population in terms of transmissibility, contact rates, and the expected duration of infection in the course of an outbreak. Some assumptions must be made in SEIRD modeling:
Population size (N) is constant;
Demographic features are not implemented or adopted;
Homogeneity: an infected individual has an equal chance of contacting any susceptible person.
SEIRD categorizes the population in groups to analyze data and design forecasts based on reported cases:
Susceptible (S): Individual who is prone to be infected on day t, and has never been infected and is not immune to infection;
Exposed (E): Individual who has been exposed to the disease but was not able to infect another person nor show symptoms;
Infected (I): Individual who is infected and producing virus that can potentially infect other individuals;
Recovered (R): Individual who was ill and recovered on day t with alleged acquired immunity;
Dead (D): Individual who died because of the infection.
The movement of individuals from one group to the other during the course of the outbreak is described by Ordinary Differential Equations (ODE), which constitute the dynamics of the SEIRD model. The ODE are defined as follows:
$$\frac{dS}{dt} = -\frac{\beta S I}{N} - \frac{\varepsilon S E}{N},$$
$$\frac{dE}{dt} = \frac{\beta S I}{N} + \frac{\varepsilon S E}{N} - \alpha E,$$
$$\frac{dI}{dt} = \alpha E - \gamma I - \mu I,$$
$$\frac{dR}{dt} = \gamma I,$$
$$\frac{dD}{dt} = \mu I,$$
where the infectious rate beta ($\beta$) represents infections per exposure, i.e., a susceptible individual has contact with an infected individual and presents a latent infection after exposure; the infectious rate epsilon ($\varepsilon$) represents the potential rate of infection per exposure, i.e., a susceptible individual that has mutual contact with an exposed/infected individual and may infect another susceptible individual; the transitional rate alpha ($\alpha$) is the rate at which an exposed individual becomes able to infect others, so that $1/\alpha$ is the average latent period. Gamma ($\gamma$) and mu ($\mu$) represent, respectively, the recovery rate and the rate at which infected people become deceased; $1/\gamma$ is the mean infectious period.
Figure 5 synthesizes the model.
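To make the compartment dynamics concrete, the sketch below integrates a SEIRD system with a simple forward-Euler scheme. It assumes that new exposures arise at rate (βI + εE)S/N, that exposed individuals become infectious at rate α, and that infected individuals recover at rate γ or die at rate μ. The parameter values are purely illustrative and are not estimates from this study.

```python
def simulate_seird(beta, epsilon, alpha, gamma, mu, n, e0, i0, days, dt=0.1):
    """Forward-Euler integration; returns one (S, E, I, R, D) tuple per day."""
    s, e, i, r, d = n - e0 - i0, float(e0), float(i0), 0.0, 0.0
    history = [(s, e, i, r, d)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            # Flow S -> E from contact with infected and exposed individuals.
            new_exposed = (beta * i + epsilon * e) * s / n
            ds = -new_exposed
            de = new_exposed - alpha * e          # E -> I at rate alpha
            di = alpha * e - (gamma + mu) * i     # I -> R or I -> D
            dr = gamma * i
            dd = mu * i
            s, e, i = s + dt * ds, e + dt * de, i + dt * di
            r, d = r + dt * dr, d + dt * dd
        history.append((s, e, i, r, d))
    return history


# Illustrative (hypothetical) parameters, not fitted to Brazilian data.
history = simulate_seird(beta=0.3, epsilon=0.05, alpha=1 / 5.2, gamma=0.1,
                         mu=0.002, n=1_000_000, e0=10, i0=1, days=120)
```

Because every outflow from one compartment is an inflow to another, the total S + E + I + R + D stays equal to N, which matches the constant-population assumption listed above.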
SIR was the first epidemiological model to use compartments. In this model, the whole population can be assigned to only three basic compartments, namely susceptible, infected, and removed. The last compartment includes people who recovered from the infection and the ones who died because of the infection. This is one of the limitations of the model. Disregarding exposure and the incubation period is another pitfall of using SIR analysis. An alternative approach to analyze the population in compartments during an outbreak is to employ the
Susceptible–Exposed–Infected–Removed–Susceptible (SEIRS) model [
28]. Unlike with SIR, individuals from the removed compartment can return to the susceptible compartment. However, our database provided death figures but lacked reinfection information. Thus, the SEIRD model proved to be more appropriate for our data set.
Basic Reproduction Number
The estimated number of secondary cases created by a single (typical) infection in a fully susceptible population is known as the basic reproduction number $R_0$, a dimensionless number, and is referred to in many cases as the simple reproductive rate.
A possible way to find $R_0$ is to adopt the next-generation matrix method. Furthermore, the method calculates the current reproduction number $R_t$, which is the secondary infection rate. The model is divided into two groups:
one containing $E$ and $I$ individuals as the infective and infectious group, and another containing $S$, $R$, and $D$ individuals as the susceptible, recovered, and dead group. Assuming $F$ and $V$ are the matrices of new infection parameters and other parameters of the first group, respectively, the next-generation matrix is $FV^{-1}$.
The eigenvalue of $FV^{-1}$ with the highest value is $R_0$. An outbreak is not likely to happen, or it is controlled, when $R_0 < 1$. When $R_0 > 1$, however, the disease spreads exponentially, resulting in an epidemic.
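As a worked illustration of the next-generation approach, the sketch below builds the 2×2 matrices for the infected compartments (E, I) and returns the largest eigenvalue of their product. It assumes new exposures arise at rate (βI + εE)S/N and linearizes at the disease-free state S = N; under that assumption the result reduces to ε/α + β/(γ + μ). The function name and parameter values are hypothetical.

```python
import math


def basic_reproduction_number(beta, epsilon, alpha, gamma, mu):
    """Largest eigenvalue of F * V^-1 for the (E, I) compartments."""
    # F: rates at which E and I individuals create new exposed individuals.
    f = [[epsilon, beta],
         [0.0, 0.0]]
    # V = [[alpha, 0], [-alpha, gamma + mu]]; its inverse in closed form.
    v_inv = [[1.0 / alpha, 0.0],
             [1.0 / (gamma + mu), 1.0 / (gamma + mu)]]
    # Next-generation matrix K = F * V^-1.
    k = [[sum(f[row][m] * v_inv[m][col] for m in range(2))
          for col in range(2)] for row in range(2)]
    # Eigenvalues of a 2x2 matrix from its trace and determinant.
    trace = k[0][0] + k[1][1]
    det = k[0][0] * k[1][1] - k[0][1] * k[1][0]
    disc = math.sqrt(max(trace * trace - 4.0 * det, 0.0))
    return max((trace + disc) / 2.0, (trace - disc) / 2.0)


# Illustrative (hypothetical) parameters, not fitted values.
r0 = basic_reproduction_number(beta=0.3, epsilon=0.05, alpha=0.2,
                               gamma=0.1, mu=0.002)
outbreak_expected = r0 > 1.0
```

The two terms of the closed form can be read as the secondary cases generated during the latent stage and during the infectious stage, respectively.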
2.3. Multi-Variate LSTM Model
Neural networks are useful tools for pattern recognition. Methods such as ARIMA [
29,
30] have been demonstrated to be insufficient in the long run to generalize or maintain accuracy of patterns in regressions over long periods. Some approaches are appropriate for problems that cannot be solved linearly. In such cases, the neural network allows us to identify the degree of relationships among variables, such as the ones in our study, since the causality or correlation of non-linear variables cannot be explained by commonly used methods. Thus, neural networks produce promising results for one-variate, bi-variate, and multi-variate regression problems.
In Long Short-Term Memory (LSTM) [
31], as in other Recurrent Neural Networks (RNN), a memory context persists within the pipeline, allowing the network to handle sequential and temporal problems without being hampered by the vanishing gradient. These neural networks are built on gates that control how information is forwarded or ignored inside their internal structures, which makes it possible to learn complex patterns from sequential sources.
LSTM differs from the traditional approach in that it contains an element called cell state, which determines whether the information is stored or not. The cell state carries pertinent information throughout the processing of the sequence, across the entire chain of interactions, and data can be added to or removed from it through structured gates.
Given this benefit, the goal is to use stacked LSTM units as a layer, composed of 200 blocks attached to a dense layer to predict a bi-weekly series of COVID-19 daily deaths taking into account one- and multi-variate input data, as shown in
Figure 6.
For the one-variate model, we use COVID-19 daily cases as input data to predict the next 14 days of COVID-19 deaths. The learning algorithm is asked to output a function in order to complete this task. For the multi-variate model, we combine several inputs, such as factors and/or principal components from PCA, and search for the combination that best predicts the next 14 days of COVID-19 deaths. The proposed model has the same architecture in every case; only its input data change.
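The sketch below shows one way to frame this forecasting task: sliding input windows are paired with the following 14 days of deaths, and a Keras model with a 200-unit LSTM layer feeds a dense layer with 14 outputs. The window length `n_in`, the optimizer, the loss, and the helper names are assumptions for illustration; the text specifies only the 200 blocks, the dense layer, and the 14-day horizon.

```python
def make_windows(features, deaths, n_in, n_out=14):
    """Pair each n_in-day feature window with the next n_out days of deaths."""
    x, y = [], []
    for start in range(len(deaths) - n_in - n_out + 1):
        x.append(features[start:start + n_in])
        y.append(deaths[start + n_in:start + n_in + n_out])
    return x, y


def build_lstm(n_in, n_features, n_out=14, units=200):
    """Hypothetical model: one LSTM layer followed by a dense output layer."""
    # Imported lazily so the windowing helper works without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(n_in, n_features)),
        keras.layers.LSTM(units),
        keras.layers.Dense(n_out),
    ])
    # Optimizer and loss are assumptions; the text does not specify them.
    model.compile(optimizer="adam", loss="mse")
    return model


# Synthetic example: 60 days, two features per day (e.g., cases and a
# holiday flag). Values are placeholders, not real data.
features = [[float(day), float(day % 7 == 0)] for day in range(60)]
deaths = [float(day) * 0.1 for day in range(60)]
x, y = make_windows(features, deaths, n_in=28)
```

Switching from the one-variate to the multi-variate setting only changes the number of values per day in `features`, which mirrors the statement that the architecture stays the same while the input data change.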
2.4. RMSE and Model Grid-Search
To evaluate the model, we use the RMSE (root mean squared error), a metric that calculates the deviation of errors between observed (ground-truth) and predicted values (hypotheses):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2},$$
where $y_i$ is the true value, the ground truth, $\hat{y}_i$ is the predicted value from the model, and $n$ is the number of predictions. Because the errors are squared before they are averaged, larger errors receive more weight, and the RMSE rises sharply as the error values of individual instances grow. That is, if there is an outlier in the dataset, its weight in the RMSE calculation will be higher, harming the performance by increasing the RMSE.
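A minimal sketch of the metric, together with a small made-up example of its sensitivity to outliers: two error patterns with the same total absolute error yield different RMSE values when the error is concentrated in a single point.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


truth = [0.0, 0.0, 0.0, 0.0]
# Same total absolute error (4) spread evenly versus concentrated in one day.
evenly_spread = rmse(truth, [1.0, 1.0, 1.0, 1.0])
single_outlier = rmse(truth, [0.0, 0.0, 0.0, 4.0])
```

The concentrated error gives the larger RMSE, which is the weighting effect described above.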
Furthermore, we applied a grid search to find the set of input data with which the model forecasts COVID-19 deaths with the lowest RMSE. We repeat the same experiment 50 times for each input dataset, which allows us to extract the expected value and the standard deviation of each configuration in predicting our goal. For this, we split the data into training and testing data. The test set corresponds to the last 28 days (4 weeks) of the time series. Therefore, the RMSE extracted and reported in this work refers to comparisons against the test data.
The aforementioned workflow is described in
Figure 7. The model receives each set of input data, and the model’s output on the test data is subjected to RMSE calculations. After the process has been repeated 50 times, we draw a boxplot to observe which configuration performs better on average, based on the mean and standard deviation of the RMSE. To create a temporal plot comparing the ground truth (real values) and the model’s predictions, we choose the run with the lowest RMSE in forecasting COVID-19 deaths for each input configuration. For example, for all 50 repetitions using cases as input data, we plot the curve with the lowest RMSE.
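The evaluation loop can be sketched as follows: hold out the last 28 days, repeat each input configuration 50 times, and summarize the resulting RMSE values. The `evaluate` callable stands in for training and scoring the LSTM; the `fake_evaluate` function, the configuration names, and the numbers it returns are synthetic placeholders, not results from this study.

```python
import random
import statistics


def split_train_test(series, test_days=28):
    """Hold out the last `test_days` observations as the test set."""
    return series[:-test_days], series[-test_days:]


def grid_search(configs, evaluate, repetitions=50):
    """Run every configuration `repetitions` times and summarize the RMSE."""
    results = {}
    for name in configs:
        scores = [evaluate(name, run) for run in range(repetitions)]
        results[name] = {
            "mean": statistics.mean(scores),
            "std": statistics.stdev(scores),
            "best": min(scores),  # run to use for the temporal plot
        }
    best = min(results, key=lambda cfg: results[cfg]["mean"])
    return best, results


def fake_evaluate(name, run):
    """Placeholder for 'train the model, return test RMSE' (synthetic)."""
    base = {"cases": 300.0, "cases+holidays": 250.0, "mobility_pca": 270.0}
    rng = random.Random(f"{name}-{run}")  # seeded for reproducibility
    return base[name] + rng.gauss(0, 5)


train, test = split_train_test(list(range(400)))
best, results = grid_search(["cases", "cases+holidays", "mobility_pca"],
                            fake_evaluate)
```

Keeping the per-configuration scores summarized by mean, standard deviation, and best run provides what the boxplot and the temporal plot described above require.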
4. Discussion
Our data suggest that holidays and holiday periods influenced
values for COVID-19, as well as mobility of people. Non-pharmaceutical interventions, such as social distancing, use of masks, and hand sanitizers, were the initial measures to contain the spread of SARS-CoV-2. However, the Brazilian political, economic, and social contexts influenced the establishment of effective public health policies. The lack of extensive social distancing had a major impact on the number of cases and deaths of COVID-19 after holiday periods. The same trend also occurred with activities in which crowding was involved. The earlier social distancing measures are deployed, the sooner such policies can be relaxed, mainly due to a decrease in cases and deaths by COVID-19 [
37].
Several factors help reduce compliance with public health policies, such as socioeconomic inequality, conspiracy-theory-driven noncooperation, and behavioral aspects, such as cognitive bias [
38] and free riding [
39]. The association of these factors has also influenced disease patterns. Cognitive biases, for example, play an important role in decision making, as they affect individual reasoning, cloud judgment [
38], and create anecdotal evidence [
40]. A few types of biases affect perception of reality and influence behavior. For example, confirmation bias, i.e., the tendency to favor, search for, interpret, and remember information that confirms one’s beliefs [
41], has negatively influenced prevention strategies, control measures, and research for COVID-19 [
40,
42,
43]. Additionally, free-riders usually avoid cooperation, exploiting others’ compliance with policies [
39]. The lack of compliance usually has a major impact on control measures, as it compromises collective efforts to contain the spread of SARS-CoV-2. Holidays have probably intensified such behavioral factors, causing an increase in the number of cases.
We emphasize that crowding may very well be the main factor for the maintenance of COVID-19 cases and deaths from now on. The circulation of new variants poses a threat to prophylactic measures that are now available against SARS-CoV-2. Although there is evidence that current vaccine strategies are able to elicit an effective immune response against the variants of interest and variants of concern of SARS-CoV-2 described until now, non-pharmaceutical control measures must remain as continuous public policies to avoid the spread of variants that can potentially cause new waves of COVID-19 [
44]. Crowding, particularly the crowding that occurred within the first year of COVID-19 in Brazil, rapidly changed COVID-19 reports. The same pattern can be observed on 1 January 2021, when disease reports started increasing after a holiday that traditionally leads to family reunions and festivities.
Nonetheless, part of the lack of compliance with non-pharmaceutical measures may have been validated by the executive branch of the federal government, which has encouraged the population to keep crowding and to avoid the use of masks, at the same time disregarding the severity of SARS-CoV-2 infection and criticizing effective countermeasures deployed by state and municipal authorities all over the country [
7]. Unfortunately, control measures were not taken even by federal, state, and municipal authorities. The direct result of the non-compliance with countermeasures is a mean mortality rate of 2.345, which is approximately 77% higher than the rest of the world [
45]. Additionally, COVID-19 denial has also been widely reported on traditional networks that support the Federal Executive Branch policies, on unchecked digital media, and on social networks, adding mistrust in science to this scenario. As a result, part of the population internalized political and economic biases that were part of the federal government’s agenda instead of complying with collective control measures, which resulted in one of the most severe examples of COVID-19 problems in the world [
7].
Non-compliance with COVID-19 collective control measures is also related to socioeconomic status in the population in several countries. Unfortunately, informal employment is a worldwide tendency, and Brazilian reality is not different from the rest of the world. In fact, issues that have historically been present in the Brazilian territory were magnified by the sanitary scenario imposed by COVID-19. Increases in several types of violence, poverty, and differential access to health and education are among the hurdles Brazilians have to deal with on a daily basis. Festivities and hallmark holidays are usually a chance for informal workers to increase individual or family income as they increase the demand for consumer goods such as food, clothes, and beverages. Public policies to control the spread of SARS-CoV-2, such as lockdowns and social distancing, had a major impact on families whose income relies on informal jobs and social interactions. Informal employment not only causes significant tax loss and reduced public revenues, leading to fewer available resources for important public services, such as social protection and health care in Brazil, but also contributes to poorer working conditions and unfair competition for legitimate businesses and collective bargaining [
46]. Data from Bahia, a state in the northeast region of Brazil, indicate that the withdrawal of informal workers decreases the productive capacity of the country and leads to pronounced negative impacts on the economy, mainly in the service sectors. However, federal, state, and regional programs for income transfer help to mitigate the negative impacts of COVID-19 by 50% [
47]. Implementation of control measures associated with income transfer policies would not only influence mobility of people but also have a positive impact on COVID-19 reports.
5. Conclusions
As a result of the experiments, we unveil some intriguing concerns. First, we have discovered that the effects of holidays, or holiday periods, cause immediate increases in and trends. These factors, extracted from the SEIRD model, have a significant impact on the COVID-19 death curves and reports.
Furthermore, principal components generated from community mobility report data using the PCA approach indicate that holidays may cause distinct report patterns over time, which can be analyzed to improve COVID-19 regression of death curves. The PCA approach has also proven to be important because it reduces the dimensionality of the feature space, and with fewer input dimensions, the model is easier to fit.
Cases combined with holiday periods, cases in association with , or holiday periods with produced an effective strategy of analysis compared to using the current number of cases to predict future deaths. Furthermore, when access to epidemiological data is limited, the use of community mobility is a promising alternative as it produced better results than cases as input data.
Based on the trained model, it is possible to generate synthetic data, for example, simulating an increase in mobility, and see how this affects the model’s forecast. If the trained model has a high forecast accuracy, we can confidently estimate the impact of each feature on the dynamics of the death curve. This finding may be used to help authorities manage resources in the event of future epidemics.
Although we have produced important results showing that we are in the right research direction, our current data are not final, and further work should be done in order to draw a thorough and final conclusion on the enhancement of the models for prediction. For that purpose, it will be necessary in future work to rank-order factors (in order of relevance), taking into account the literature consensus on factors in COVID-19 infection rate. Some papers list more than 50 potential features for predicting the number of cases, including mobility and climate variables such as temperature, humidity, and air pollution [
48]. To predict deaths, the number of vaccinations, population age, and the number of available ICUs should all be considered.
Finally, using a large number of variables as input data will not necessarily improve the prediction of COVID-19 fatalities in an LSTM model. However, once a large number of data have been collected, it is possible to use exploratory methods to conduct trials that contribute to and improve the accuracy of the model. We also intend to adopt other forms of data visualization and approaches that can improve understanding of the virus’s spread dynamics and help make predictions. A possible approach is the use of complex networks, which help to understand how variables interact, as shown in [
49].
In the near future, we intend to study and add new variables to our prediction model, such as the association and influence of countermeasures defined by the federal, state, and municipal governments and particularities of people from different regions of the country, among others. With this approach, a comparison between the effectiveness of control measures by different states or municipalities can be evaluated to help avoid similar scenarios in the future.