1. Introduction
COVID-19, initially reported in Wuhan, China in December 2019, is highly infectious and has had a large impact on the world [
1]. The World Health Organization (WHO) designated COVID-19 as a global pandemic on 11 March 2020 [
2]. The virus has spread rapidly worldwide and confirmed cases have been found in almost all countries. By November 2021, global cumulative cases had exceeded 256 million and deaths had exceeded 5.1 million, and both numbers keep rising [
3].
Given this background, researchers have conducted a large number and variety of studies (e.g., clinical research, statistical modeling and behavior analysis). Many have focused on trend analysis and time-series prediction [
4,
5,
6,
7,
8], which could effectively estimate both the outbreak point and turning point of the COVID-19 pandemic as well as help to evaluate the effectiveness of measures and whether strategies should be strengthened [
9,
10,
11]. However, most of these studies have not considered the influence of external risk factors, which are also important in epidemic analysis [
12].
Researchers have explored the relationship between socioeconomic as well as infrastructural factors and the spread of the COVID-19 virus [
13,
14,
15,
16,
17]. Some have found that COVID-19 has had a more significant impact on poor areas [
18,
19] while a few have found that the influence of income has not been significant in certain study areas [
20,
21]. Factors such as population density, population mobility and accessibility to hospitals have also been considered in relation to the spread of COVID-19 in many studies [
22,
23,
24,
25,
26]. Additionally, a number of researchers have estimated the association between environmental factors and COVID-19 transmission, specifically the influence of air quality, wind, humidity as well as temperature on COVID-19 cases [
27,
28,
29,
30,
31,
32,
33]. However, these studies underestimated the importance of the spatial relationships among cities or countries. As the spread of infectious diseases is often spatially autocorrelated, traditional nonspatial statistical models are poorly suited because the data violate the independence assumption [
12,
34,
35,
36].
To address the problem of spatial autocorrelation, many spatial models, such as spatial error models (SEM) and spatial lag models (SLM), have been utilized in the analysis of COVID-19 [
20,
37,
38,
39]. The studies above considered spatial autocorrelation from a global point of view. However, one problem is that spatial epidemic data often exhibit high spatial heterogeneity [
34,
40]. Global models are of limited use for examining the effects of risk factors that may vary across space. Some researchers have used the geographically weighted regression (GWR) model and its variations [
41,
42] to consider the spatially varying characteristics of the health risk factors of COVID-19 [
43,
44,
45,
46,
47,
48]. However, there was often multicollinearity bias in the coefficients, not to mention the problem of choosing an appropriate bandwidth in GWR modeling [
49]. Griffith [
50] developed an eigenvector spatial filtering-based spatially varying coefficients (ESF-SVC) method to account for spatial heterogeneity. One benefit of this method is that it decomposes the spatial weights matrix into eigenvectors that contain information on different spatial patterns. Murakami et al. [
51,
52] developed an ESF-SVC modeling approach by considering random effects and found that the ESF-SVC method was better than the GWR approach in terms of model accuracy. Another advantage of the ESF-SVC method is that it can detect whether the coefficients of independent variables need to be spatially varying or constant by minimizing the Bayesian information criterion (BIC). The output of the ESF-SVC model includes the estimated coefficients (whether spatially varying or constant) as well as their statistical significance, which could help to detect spatially varying characteristics [
53]. However, this method has not been used in the study of COVID-19.
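The eigenvector decomposition described above can be illustrated with a minimal sketch, which is not the implementation used in this study. It assumes a binary contiguity weights matrix on a hypothetical 4 × 4 grid, keeps the eigenvectors whose eigenvalue exceeds 0.25 of the largest one (an assumed threshold), and omits the BIC-based selection step. The function names `grid_weights`, `moran_eigenvectors` and `svc_design` are illustrative only.

```python
import numpy as np

def grid_weights(rows, cols):
    """Binary rook-contiguity weights for a hypothetical rows x cols grid."""
    n = rows * cols
    W = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if c + 1 < cols:                      # neighbor to the right
                W[i, i + 1] = W[i + 1, i] = 1.0
            if r + 1 < rows:                      # neighbor below
                W[i, i + cols] = W[i + cols, i] = 1.0
    return W

def moran_eigenvectors(W, threshold=0.25):
    """Eigenvectors of the doubly centered weights matrix M W M.

    Only eigenvectors with eigenvalue > threshold * largest eigenvalue
    are kept (assumed cut-off for positive spatial autocorrelation).
    """
    n = W.shape[0]
    W = (W + W.T) / 2.0                           # symmetrize: real eigenvalues
    M = np.eye(n) - np.ones((n, n)) / n           # centering projector
    vals, vecs = np.linalg.eigh(M @ W @ M)
    order = np.argsort(vals)[::-1]                # sort in descending order
    vals, vecs = vals[order], vecs[:, order]
    keep = vals > threshold * vals[0]
    return vecs[:, keep], vals[keep]

def svc_design(X, E):
    """Design matrix for y = sum_k x_k * (b_k + E @ g_k) + error.

    Columns: the covariates X, then every covariate-by-eigenvector product.
    """
    interactions = [X[:, [k]] * E for k in range(X.shape[1])]
    return np.hstack([X] + interactions)

W = grid_weights(4, 4)
E, lam = moran_eigenvectors(W)
rng = np.random.default_rng(0)
X = rng.normal(size=(16, 2))                      # two hypothetical risk factors
D = svc_design(X, E)
print(E.shape, D.shape)
```

In this sketch, a coefficient is spatially varying when any of its interaction terms is retained and constant when only the plain covariate column remains, which is the choice the BIC-based selection is described as making.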
The main research gap in the aforementioned COVID-19 studies is that many could not model spatial heterogeneity effectively. Therefore, an ESF-SVC model was constructed to reveal the spatially varying impact of certain social and environmental factors on the spread of COVID-19. It decomposed the spatial relationship into eigenvectors and combined them with selected health risk factors, and was therefore able to detect whether the coefficients of health risk factors are spatially varying or constant. The main objectives of this paper were: (1) to explore how selected health risk factors are related to the COVID-19 infection rate within different study extents; (2) to find out whether the influence of selected health risk factors varies across space and time and, if so, how it varies. Considering data availability and rationality, 10 factors, including social and environmental ones, were used as the initial health risk factors according to the literature reviewed above. Social factors included population density, human migration, hospital capacity, GDP and building density, while environmental factors included precipitation, wind speed, temperature, average altitude and air pressure. The ESF-SVC model results were compared with those of the ordinary least squares (OLS), eigenvector spatial filtering (ESF) and GWR models, and the results showed that the proposed ESF-SVC was a promising method in the context of COVID-19 health risk modeling and the discovery of spatially varying characteristics. This study hopes to provide not only a feasible path to solve the problem of spatial autocorrelation and spatial heterogeneity in COVID-19 studies but also an intuitive way to discover spatial and temporal patterns in the influencing factors.
4. Discussion
4.1. Improving Model Accuracy
According to the model assessment results, the ESF-SVC model performed better than the other three models in modeling COVID-19 infection rates.
In Hubei province, the average adjusted R² of the ESF-SVC model was 16.31%, 5.48% and 18.83% higher than those of the GWR, ESF and OLS models, respectively. When the study area expanded to mainland China, the average adjusted R² of the ESF-SVC model was 10.25%, 19.54% and 105.94% higher than those of the GWR, ESF and OLS models, respectively.
The average RMSE of the ESF-SVC model was much smaller than those of the ESF and OLS models. Although its RMSE was slightly larger than that of the GWR model in mainland China, the LOOCV RMSEs of the ESF-SVC model were the smallest. This suggests that the ESF-SVC model can better estimate the relationship between COVID-19 infection rates and health risk factors in different areas and can better control the over-fitting problems that plagued the GWR model, thereby providing a more robust model [
49,
51,
75].
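The contrast between in-sample RMSE and LOOCV error drawn above can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code of this study: it uses a plain OLS fit as a stand-in for any of the four models, synthetic data, and the hypothetical function names `in_sample_rmse` and `loocv_rmse`.

```python
import numpy as np

def _with_intercept(X):
    """Prepend a column of ones to the covariate matrix."""
    return np.column_stack([np.ones(X.shape[0]), X])

def in_sample_rmse(X, y):
    """RMSE of an OLS fit evaluated on the same data it was fitted to."""
    A = _with_intercept(X)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((y - A @ beta) ** 2)))

def loocv_rmse(X, y):
    """Leave-one-out RMSE: each unit is predicted by a model fitted without it."""
    A = _with_intercept(X)
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i                  # drop observation i
        beta, *_ = np.linalg.lstsq(A[mask], y[mask], rcond=None)
        errors[i] = y[i] - A[i] @ beta            # out-of-sample error for unit i
    return float(np.sqrt(np.mean(errors ** 2)))

# Synthetic example (assumed data, not from the study).
rng = np.random.default_rng(42)
X = rng.normal(size=(40, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.3]) + rng.normal(scale=0.4, size=40)
print(in_sample_rmse(X, y), loocv_rmse(X, y))
```

A model whose LOOCV RMSE is much larger than its in-sample RMSE is fitting noise in the training units, which is the over-fitting pattern attributed to GWR in the text.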
The average residual maps showed that the ESF-SVC model generated relatively large modeling errors in northwest China, whereas the GWR model produced large errors in central China. In particular, the ESF-SVC model outperformed the other three models when only the infected areas were modeled. None of the Moran coefficients (MC) computed from the ESF-SVC model residuals were significant, indicating that the ESF-SVC model could better filter out the influence of spatial autocorrelation across large areas.
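The residual autocorrelation check mentioned above can be sketched as below, assuming MC denotes the Moran coefficient. This is a minimal illustration rather than the procedure used in this study: the one-sided permutation test, the number of permutations and the function names are assumptions.

```python
import numpy as np

def moran_coefficient(e, W):
    """Moran coefficient of values e under the spatial weights matrix W."""
    e = np.asarray(e, dtype=float)
    z = e - e.mean()                              # center the residuals
    n = len(e)
    return float(n / W.sum() * (z @ W @ z) / (z @ z))

def moran_permutation_pvalue(e, W, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation.

    Residuals are shuffled across units to build a reference distribution
    (assumed test design; an analytical test could be used instead).
    """
    rng = np.random.default_rng(seed)
    observed = moran_coefficient(e, W)
    sims = np.array([moran_coefficient(rng.permutation(e), W)
                     for _ in range(n_perm)])
    p = (np.sum(sims >= observed) + 1) / (n_perm + 1)
    return observed, float(p)

# Hypothetical chain of four units: 1-2, 2-3, 3-4 are neighbors.
W = np.zeros((4, 4))
for i in range(3):
    W[i, i + 1] = W[i + 1, i] = 1.0
mc, p = moran_permutation_pvalue(np.array([1.0, 2.0, 3.0, 4.0]), W)
print(mc, p)
```

A non-significant MC on the residuals indicates that no detectable spatial pattern is left unexplained by the model, which is how the result for the ESF-SVC residuals is interpreted in the text.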
4.2. Influence of Health Risk Factors
The spatially varying coefficient values of the ESF-SVC model reflect how the corresponding risk factors may have affected COVID-19 infection rates. If a risk factor passed the significance test during modeling, the larger its absolute coefficient, the greater its contribution to the COVID-19 infection rate.
In Hubei province, none of the 10 risk factors showed a significant correlation with the COVID-19 infection rate in the first three weeks. This might be because the nucleic acid detection method was not yet mature and detection resources were insufficient at first in high-risk areas such as Hubei province, so that many patients were not promptly identified as infected. The Chinese Healthcare Commission announced an improved method for detecting confirmed cases on 12 February, which was when week 4 in this study began. In the following weeks, population density (PDEN) and wind speed (WDSP) passed the final significance test and were selected for modeling. WDSP, with a larger average coefficient than that of PDEN, contributed more to the increase in the infection rate after 12 February. Similar results were also found by Şahin [
32]. This might be because the virus could spread more easily under low or moderate wind speeds [
76,
77]. However, as the study area expanded to mainland China, WDSP did not show a significant impact on the increase of COVID-19, indicating that the influence of wind speed is more likely to act on small but high-risk areas. Previous studies also observed no significant association between wind speed and COVID-19 infection rates when the study extent was at the country level [
78,
79].
In mainland China, eight of the ten risk factors showed a significant correlation with the COVID-19 infection rate, and six passed the model significance test (PDEN, MS, BD, TEMP, PRES and DEM) and were selected to develop the final models. The migration score (MS) had the largest influence on the increase of COVID-19 infection rates. It was found that migration out of Wuhan contributed less and was less significant in cities near Wuhan than in cities farther away. This might be because there was intense interaction between Wuhan and nearby cities by private car or other travel methods that were not recorded in the Baidu Qianxi platform. When traveling to cities that are farther away, people are more likely to choose public transportation, so the corresponding migration score is more strongly associated with infection rates. The influence of MS on COVID-19 infection rates reached a peak in week two (around 29 January 2020) and then decreased. This may be explained by the city lockdown policy announced on 23 January 2020, around three weeks after the first reported case of COVID-19 in Wuhan. This policy prevented people from leaving or entering Wuhan by any means of transportation until 8 April 2020, before which there had been no newly confirmed cases for a period of 21 days. Public transportation, businesses and entertainment venues within Wuhan were closed to ensure rigorous home quarantine. After the announcement of the city lockdown policy, other cities, especially cities around Wuhan, adjusted their emergency response levels and suggested that people stay home as much as possible and cancel gatherings and events, which largely weakened the spread of the virus [
28,
33,
80,
81,
82]. Building density (BD) had a relatively greater impact on cities in southern China after week three (around 5 February 2020), but the variation of coefficients between cities was small. This indicates that the clustering of entertainment venues accelerated the spread of COVID-19, but the size of its effect and the differences among cities shrank when social distancing was strong. In terms of temperature, the coefficients were mostly positive in the first two weeks, with an average temperature of 2.42 °C. As time passed, more cities, especially those in southeast and southwest China (e.g., cities within the Guangdong, Guangxi, Yunnan and Fujian provinces), had negative coefficients that passed the significance test, with an average temperature of 8.17 °C. The findings suggest that once temperature reached a certain point, a further increase in temperature might have resulted in a decrease in the COVID-19 infection rate. Similar results about the influence of temperature were also found in some country-specific and worldwide studies [
26,
83,
84]. As for altitude (DEM), cities that passed the significance test and had high model coefficients were mainly found in plateau regions, suggesting that high-altitude areas may be less favorable for the survival of the virus. Another study also indicated that people living in high-altitude areas might have a better tolerance to hypoxia and might be more resistant to the COVID-19 virus [
85]. Unlike in Hubei province, population density showed only a weak influence on COVID-19 infection rates in the first week and its coefficients were constant, indicating that over large study extents, population density could not explain the increase in the COVID-19 infection rate as well as migration outflow and the clustering of buildings [
21]. Also, social distancing and travel restrictions helped to reduce the influence of population density on the spread of COVID-19 [
86].
4.3. Limitations
Although the constructed ESF-SVC model performed well in modeling spatial heterogeneity, with improved model fitness and robustness in the context of the study of COVID-19, this study still has several limitations. First, some risk factors, such as age structure, education level, government policy and community response, were not used. Second, although the time lag effect was considered by using the average of variables in the previous weeks, other temporal characteristics that may influence the COVID-19 infection rate, such as temporal lag effects at the level of individual days, should be considered as well [
63,
64]. Third, the impact of health risk factors may vary across different regions, different study scales and different periods of the COVID-19 wave. The migration score in this study only included the outflow from Wuhan until 23 January 2020 (the day the lockdown was announced) but did not take population outflows from secondary infection sources into account [
87]. In addition, as migration connectivity was one of the key factors in the spread of COVID-19, a conventional topology-based or distance-based spatial weight matrix might be insufficient in spatial modeling. Therefore, a migration connectivity network matrix between study units should be taken into account.
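One way such a migration connectivity matrix could be built is sketched below. This is only an illustration of the idea, not a method from this study: the flow values are hypothetical, and averaging the two flow directions and row-standardizing are assumed design choices.

```python
import numpy as np

def flow_weights(flows):
    """Turn an origin-destination flow matrix into spatial weights.

    Returns (S, W): S is the symmetric connectivity matrix (mean of the two
    flow directions, zero diagonal) and W is its row-standardized version.
    Units with no recorded flows keep an all-zero row in W.
    """
    F = np.asarray(flows, dtype=float).copy()
    np.fill_diagonal(F, 0.0)                      # ignore within-unit movement
    S = (F + F.T) / 2.0                           # symmetric link strength
    row_sums = S.sum(axis=1, keepdims=True)
    W = np.divide(S, row_sums, out=np.zeros_like(S), where=row_sums > 0)
    return S, W

# Hypothetical flows between three study units (rows: origin, columns: destination).
flows = [[5, 10, 0],
         [2,  0, 4],
         [0,  0, 0]]
S, W = flow_weights(flows)
print(S)
print(W)
```

The symmetric matrix S is the form that an eigenvector decomposition would require, while the row-standardized W is the form commonly used in spatial lag terms.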
5. Conclusions
In this paper, an ESF-SVC method was developed to explore how health risk factors influence the COVID-19 infection rate differently across space and time. It could simultaneously consider the influence of spatial autocorrelation and heterogeneity and could better control the multicollinearity and over-fitting problems that plague the GWR model, with a higher average adjusted R². In addition, the ESF-SVC model’s cross-validation RMSEs were considerably lower than those of the other three models, indicating that it can better estimate the relationship between the COVID-19 infection rate and health risk factors within different areas, thereby providing a more robust model.
The spatially varying coefficients of the ESF-SVC model at different periods could reveal spatial and temporal patterns of the influencing factors. It was found that the effect of health risk factors differed as the study extent and study period changed. In Hubei province, WDSP contributed more to the increase in the infection rate than the other health risk factors after 12 February. When the study was extended to mainland China, migration score contributed the most to the COVID-19 infection rate, followed by building density, altitude and temperature, and all of them showed significant spatially varying characteristics. Migration score contributed less and was less significant in cities near Wuhan than in cities farther away, while cities with larger building density coefficients were mainly found in southern China. DEM contributed to the decline of the COVID-19 infection rate and its influence became more significant in high-altitude cities as time passed. Temperature was at first positively correlated with the COVID-19 infection rate, but after 11 February, an increase in TEMP was shown to have a weak but significant impact on the decrease in the COVID-19 infection rate in southeast and southwest cities.
These findings about the impact of health risk factors were also partially consistent with previous studies on COVID-19 and other respiratory infectious diseases, which could help increase public and governmental awareness of the potential health risks and therefore inform COVID-19 control strategies. For example, WDSP showed a significant impact in Hubei province; therefore, wearing a mask when going out in this high-risk area would be the preferred response, which was also recommended by other related studies in different study regions. Meanwhile, after around 29 January, the influence of MS and BD in mainland China decreased, suggesting that the lockdown and social distancing policies worked and could serve as a reference for other areas.
As the proposed method is not limited to particular datasets, it could serve as a reference for other spatial epidemiology studies. In the future, we plan to add temporal variables and expand the ESF-SVC model into an ESF-based spatiotemporally varying coefficient model, thereby exploring the spatiotemporal relationships between COVID-19 and health risk factors (e.g., the time lag effect of factors and the influence of secondary sources of infection). A dynamic migration connectivity weight matrix will be added to enhance the model. We will also consider how to expand the study area to the entire world and explore the spatiotemporal patterns of COVID-19 spread within different countries and continents.