1. Introduction
Tuberculosis (TB) is a chronic infectious disease caused by Mycobacterium tuberculosis. It can affect many organs, and pulmonary tuberculosis is the most common form [1]. Tuberculosis patients often have a history of contact with other tuberculosis cases, and patients who expel the bacterium are an important source of infection [2]. Mycobacterium tuberculosis spreads mainly through the air: it remains suspended in the droplet nuclei discharged when patients cough or sneeze, and it can infect healthy people who inhale them. Tuberculosis is a disease with a long history. In 1882, Dr. Robert Koch first detected the bacterium associated with tuberculosis and named it Mycobacterium tuberculosis [3]. In 1993, the World Health Organization (WHO) recognized tuberculosis as a global public health problem and declared a “global tuberculosis emergency”. Since 1997, the WHO has published a global tuberculosis report every year and made its statistics publicly available. According to the Global Tuberculosis Report 2021 published by the WHO [4], an estimated 9.9 million people worldwide fell ill with tuberculosis in 2020, equivalent to 127 cases per 100,000 people. In 2019, TB was the 13th leading cause of death worldwide and the leading cause of death from a single infectious agent.
There are many factors influencing the incidence of tuberculosis. Studies have shown that meteorological factors [5], social and economic factors [6], and the geographical and ecological environment [7] are closely associated with the incidence of tuberculosis. Changes in meteorological factors affect the pattern and burden of tuberculosis. Several studies have shown that average temperature, air pressure, relative humidity, precipitation and sunshine duration all influence the incidence of tuberculosis [8,9,10]. For example, Bie, S. suggested that Mycobacterium tuberculosis is more likely to survive in an environment of high humidity and precipitation than in one of high temperature and pressure, and that, because UV rays can damage the skin and eyes, prolonged sun exposure can lead to weakened immunity and tuberculosis infection [8]. Li, Z., by contrast, suggested that temperature affects the probability of TB transmission by changing the time that TB-susceptible and TB-infected people spend indoors and outdoors, and that a relatively high wind speed promotes the spread of Mycobacterium tuberculosis in the air, thus increasing the risk of TB [9]. Airborne transmission is the main route of tuberculosis infection: people are infected by pathogens in droplets released into the air when patients cough or sneeze. Exposure to high levels of air pollutants damages the mucous membranes of the upper respiratory tract, which are the first line of defense against Mycobacterium tuberculosis infection. In addition, continued exposure to air pollutants decreases the expression of interferon-γ (IFN-γ) and tumor necrosis factor-α (TNF-α), which play an important role in the defense against Mycobacterium tuberculosis infection [9]. One study found that exposure to ambient air pollution (PM2.5, PM10, O3 and CO) significantly increased the risk of drug-resistant tuberculosis [11], and a recent systematic review and meta-analysis confirmed that long-term exposure to PM10, SO2 and NO2 significantly increased the incidence of tuberculosis [12]. These studies provide evidence-based medicine (EBM) support for the influence of air pollution on the incidence of pulmonary tuberculosis.
The influence of meteorological factors and air pollutants on the incidence of infectious diseases, and the construction of prediction models based on these factors, have become a research hotspot in epidemiology. Building on traditional time series analysis, many studies have begun to consider the influence of regional meteorological factors, especially temperature and precipitation, on the incidence of tuberculosis, and have used them as independent variables in tuberculosis incidence prediction models [13,14,15]. At present, there are two common kinds of infectious disease prediction models: the first is the traditional mathematical prediction model, such as the autoregressive integrated moving average (ARIMA) model, regression prediction models and exponential smoothing models; the second is the machine learning prediction model, such as the support vector machine (SVM), random forest and back-propagation (BP) artificial neural network models [16]. Although traditional mathematical prediction models are relatively mature in the prediction of infectious diseases, different models suit different data characteristics, and each type of model has its own advantages and disadvantages. The biggest shortcoming of traditional mathematical prediction models is that they cannot extract nonlinear relations in a time series [17]. Therefore, machine learning prediction models, which handle nonlinear relations in time series well, have gradually been applied to the modeling of infectious diseases [18,19].
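To make the linearity limitation of the traditional models concrete, the following is a minimal sketch of simple exponential smoothing, one of the traditional methods named above. Each forecast is a fixed linear combination of past observations, so no nonlinear dependence on other variables can be captured. The function name, the smoothing factor `alpha` and the sample series are illustrative assumptions, not values from this study.

```python
def simple_exponential_smoothing(series, alpha):
    """Return one-step-ahead forecasts, where forecasts[t] predicts series[t].

    Assumption: the first forecast is initialised with the first observation.
    """
    forecasts = [series[0]]
    for value in series[:-1]:
        # New forecast = weighted average of the latest observation
        # and the previous forecast (a purely linear update).
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts


# Hypothetical daily notification counts, for illustration only.
daily_counts = [10, 12, 11, 15]
print(simple_exponential_smoothing(daily_counts, alpha=0.5))
```

A larger `alpha` makes the forecast follow recent observations more closely, while a smaller one smooths out day-to-day fluctuations.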
In summary, considering the nonlinear relation between influencing factors and the incidence of tuberculosis, this study analyzed tuberculosis notification data from Changde City, Hunan Province, for the period 2010–2021, together with meteorological and air pollution data. It described the occurrence characteristics and epidemic trend of tuberculosis and explored the correlation between tuberculosis and meteorological factors and air pollutants. Machine learning algorithms (support vector regression, random forest regression and the BP neural network) were used to construct daily tuberculosis incidence prediction models based on meteorological factors and air pollutants. The mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE) were used to evaluate the predictive performance of each model, in order to identify the best prediction model for tuberculosis and to provide a basis for constructing prediction and early warning models for infectious diseases.
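The three evaluation metrics named above can be sketched as follows, using their standard definitions. The function names and sample values are illustrative assumptions. Skipping days with zero observed cases in MAPE is also an assumption made here, since the percentage error is undefined for them; the study does not state how such days were handled.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumption: days with zero observed cases are skipped, because the
    percentage error is undefined when the actual value is zero.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)


# Hypothetical observed and predicted daily notifications.
observed = [4, 5, 2, 8, 10]
forecast = [5, 5, 4, 6, 10]

print(mae(observed, forecast))   # 1.0
print(rmse(observed, forecast))  # about 1.34
print(mape(observed, forecast))  # 30.0
```

For all three metrics a lower value indicates a better fit. RMSE penalises large errors more heavily than MAE, and MAPE expresses the error relative to the observed count.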
4. Discussion
In this study, Spearman rank correlation analysis was used to analyze the correlation between daily TB notifications and meteorological factors and air pollutants in Changde City, Hunan Province. Machine learning algorithms were then used to construct tuberculosis incidence prediction models based on meteorological and air quality data, and MAE, RMSE and MAPE were used to evaluate them. The study aimed to find an accurate prediction model of tuberculosis incidence, predict the epidemic trend of tuberculosis cases, and provide a reference for epidemic prevention and control departments.
From 2010 to 2021, the overall incidence of tuberculosis in Changde City, Hunan Province showed a downward trend. There were more male cases than female cases, the age of onset mainly ranged from 16 to 59 years, and farmers were the occupational group with the largest number of cases. This is consistent with the epidemic characteristics of tuberculosis in other provinces of China. For example, from 2005 to 2018, 70.45% of tuberculosis cases in Henan Province were male, the median age was 48 years, and farmers and herdsmen accounted for 81.5% of cases [30]. Men may be more socially active and smoke more than women, which increases their risk of TB [31,32]. The low immunity of middle-aged and elderly people [33], and the low income of residents and imperfect medical services in rural areas [34], may lead to an increased risk of tuberculosis. This suggests that more attention should be paid to men, middle-aged and elderly people and rural areas, and that targeted prevention and control measures should be adopted to improve the detection and control rate of tuberculosis.
A total of eight meteorological factors and six air pollutants were included in the correlation analysis. Daily TB notifications were positively correlated with daily average temperature, maximum temperature, minimum temperature, sunshine hours, PM2.5, PM10 and O3, and negatively correlated with daily average pressure, precipitation, relative humidity, CO and SO2. There was no significant correlation between daily TB notifications and average wind speed or NO2. Higher temperatures may be associated with longer sunshine hours, and a systematic review suggests that higher temperatures generally produce stronger air currents, which provide a favorable environment for TB transmission, and that pathogens replicate more easily at higher temperatures [35]. Long-term exposure to air pollution may lead to an increase in the number of reported cases of tuberculosis [36,37]. Studies have shown that PM10 is associated with sputum culture-positive tuberculosis, and that the severity of lung lesions increases with the PM10 level [38]. PM2.5 and PM10 are solid particles that can alter key components of the anti-mycobacterial host immune response [39], impair the natural defense barrier of the respiratory tract, cause oxidative stress in lung cells and increase the proinflammatory response [40].
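The sign of a Spearman rank correlation, as reported above, can be illustrated with a minimal sketch: each series is converted to ranks (ties receive the mean of their ranks) and the Pearson correlation of the ranks is computed. The function names and sample values are hypothetical and are not data from this study, and significance testing is omitted.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical daily values, for illustration only.
temperature = [5.0, 12.0, 18.0, 25.0, 30.0]
humidity = [85.0, 80.0, 70.0, 72.0, 60.0]
notifications = [3, 4, 6, 7, 9]

print(spearman(temperature, notifications))  # positive: both rise together
print(spearman(humidity, notifications))     # negative: one rises as the other falls
```

Because only ranks are used, the coefficient captures any monotonic association, not just a linear one, which suits skewed daily count data.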
Using models to predict tuberculosis incidence can help identify trends and provide a basis for disease warning. Based on the results of the correlation analysis, this study used support vector regression (SVR), random forest regression (RFR) and the BP neural network to construct six tuberculosis prediction models and compared their prediction results. The MAE, RMSE and MAPE of the BP neural network model that included average daily temperature, sunshine hours and PM10 were all lower than those of the support vector regression and random forest regression models. This indicates that, in this study, the BP neural network model had better predictive performance for tuberculosis incidence than SVR and RFR. Different prediction models for infectious diseases have distinct advantages and disadvantages, so it is crucial to select an appropriate model according to its applicable conditions and the characteristics of the sample data [16]. SVR is a nonlinear model that uses a kernel function to map data into a high-dimensional space, where a more appropriate regression function can be found [41]. SVR tolerates small deviations in the data, because its ε-insensitive loss function ignores errors smaller than ε when performing linear regression in the high-dimensional feature space, which reduces the complexity of the model. However, when applied to large samples, it consumes a great deal of memory and computing time [42]. Random forest is an ensemble learning method that obtains more accurate results by combining a large number of decision trees; when the number of trees is large, it requires considerable time and memory [43]. Random forest can process large numbers of samples without normalizing the data, and it makes effective use of data sets [44]. As a result, random forest fits the training set closely. The BP network has a strong self-learning ability and a highly nonlinear mapping ability. It can learn the nonlinear functional relationship between data automatically through training, without any prior formula [45]. However, the BP neural network has many modeling parameters, so determining its structure is a major difficulty in the modeling process [46]. The BP neural network model is intuitive and has good predictive ability. It was the best prediction model for tuberculosis in this study, which suggests that it can be used to predict the incidence trend of infectious diseases from external factors such as weather and air pollutants.
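The nonlinear mapping ability of a BP network described above can be illustrated with a minimal sketch: a network with one tanh hidden layer learns y = x² from examples by back-propagating the error, without being given the formula. This is not the configuration used in this study. The single input, the hidden layer size, the learning rate, the number of epochs and the synthetic data are all illustrative assumptions.

```python
import math
import random


def train_bp(xs, ys, hidden=4, lr=0.05, epochs=2000, seed=0):
    """Train a 1-input, 1-output network with one tanh hidden layer using
    full-batch gradient descent. Returns (predict, initial_mse, final_mse)."""
    rng = random.Random(seed)
    w1 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # input -> hidden
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]  # hidden -> output
    b2 = 0.0
    n = len(xs)

    def forward(x):
        h = [math.tanh(w1[j] * x + b1[j]) for j in range(hidden)]
        return h, sum(w2[j] * h[j] for j in range(hidden)) + b2

    def mse():
        return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / n

    initial_mse = mse()
    for _ in range(epochs):
        gw1, gb1, gw2 = [0.0] * hidden, [0.0] * hidden, [0.0] * hidden
        gb2 = 0.0
        for x, y in zip(xs, ys):
            h, out = forward(x)
            d_out = 2.0 * (out - y) / n  # gradient of the MSE w.r.t. the output
            gb2 += d_out
            for j in range(hidden):
                gw2[j] += d_out * h[j]
                # Propagate the error back through the tanh activation.
                d_hidden = d_out * w2[j] * (1.0 - h[j] ** 2)
                gw1[j] += d_hidden * x
                gb1[j] += d_hidden
        for j in range(hidden):
            w1[j] -= lr * gw1[j]
            b1[j] -= lr * gb1[j]
            w2[j] -= lr * gw2[j]
        b2 -= lr * gb2
    return (lambda x: forward(x)[1]), initial_mse, mse()


# Synthetic nonlinear relationship, for illustration only.
xs = [i / 10 - 1 for i in range(21)]
ys = [x * x for x in xs]
predict, initial_mse, final_mse = train_bp(xs, ys)
print(initial_mse, final_mse)
```

The hidden layer size, learning rate and number of epochs all have to be chosen by the modeler, which is the structural difficulty noted above.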
However, this study has some limitations that require more in-depth research. First, the incidence of tuberculosis may be related to the social, economic and cultural conditions and the individual characteristics of the population. Because this study used daily tuberculosis notification data, social, economic and cultural data for the same period could not be obtained, and the individual conditions of the population could not be considered. Second, studies have shown that extended exposure to air pollution has a long-term effect on lung health [47], and that even a brief exposure to a severe pollution event can affect health for nearly two decades [48]. However, this study did not consider the cumulative impact of long-term exposure to pollutants on the incidence of tuberculosis and used only the immediate effects of meteorological factors and air pollutants as predictors, which may have underestimated the impact of pollutants on the incidence of tuberculosis.