Article

Predicting Regional Outbreaks of Hepatitis A Using 3D LSTM and Open Data in Korea

1
National Program of Excellence in Software Centre, Chosun University, Gwangju 61452, Korea
2
Department of Electronic and Electrical Engineering, SungKyunKwan University, Suwon 16419, Korea
*
Author to whom correspondence should be addressed.
Electronics 2021, 10(21), 2668; https://doi.org/10.3390/electronics10212668
Submission received: 7 October 2021 / Revised: 24 October 2021 / Accepted: 29 October 2021 / Published: 31 October 2021
(This article belongs to the Special Issue Machine Learning in Electronic and Biomedical Engineering)

Abstract

In 2020 and 2021, humanity lived in fear due to the COVID-19 pandemic. However, with the development of artificial intelligence technology, researchers are attempting to tackle the challenges posed by currently unpredictable epidemics. Korean society has been exposed to various infectious diseases since the Korean War in 1950, and to overcome them, the six most serious diseases were defined as National Notifiable Infectious Diseases (NNIDs) category I. Although most of these infectious diseases have been brought under control, viral hepatitis A has been on the rise in Korean society since 2010. Therefore, in this paper, outbreaks of viral hepatitis A, which is rapidly spreading in Korean society, were predicted by region using a deep learning technique and publicly available datasets. For this study, we gathered information from five organizations based on the open data policy: Korea Centers for Disease Control and Prevention (KCDC), National Institute of Environmental Research (NIER), Korea Meteorological Agency (KMA), Public Open Data Portal, and Korea Environment Corporation (KECO). Patient information, water environment information, weather information, population information, and air pollution information were acquired, and correlations were identified. Next, epidemic outbreak prediction was performed using data preprocessing and a 3D LSTM. The experimental results were compared with those of various machine learning methods using RMSE. In this paper, we attempted to predict regional epidemic outbreaks of hepatitis A by linking the open data environment with deep learning. We expect the experimental process and results to demonstrate the importance and usefulness of establishing an open data environment.

1. Introduction

As the spread of COVID-19, SARS, and MERS has shown, the number of victims can be significantly reduced if an epidemic can be predicted. Infectious diseases are regarded as an object of fear by living things, including mankind, because we do not know when, where, and how they will occur [1,2,3].
Recently, many researchers have used machine learning, a form of artificial intelligence, to obtain effective results in predicting changes in people's emotions or decision-making from social network data, such as tweets on Twitter, posts on Facebook, and blogs [4,5]. Random Forest, Gradient Boost, Lasso, Ridge, Linear Regression, KNN, MLP, XG Boost, and Cat Boost are commonly used for data prediction in machine learning. Consider the pros and cons of two of these techniques. Linear regression offers advantages such as simple implementation, easy understanding, quick training, and classification based on features. KNN is easy to understand and requires little overhead in parameter adjustment. On the other hand, linear regression is limited to linear applications, is unsuitable for many real-life problems, makes default assumptions about the input error, and assumes independent features, which may not always hold. For KNN, extra care is required in the selection of K, and the cost of computation is high when working with large datasets.
Recently, various disease prediction studies have been published. Santos and Matos studied influenza in 2014, but they only considered Portugal in their proposed work [6]. In 2015, Grover and Aujla processed tweet data for swine flu [7]. In 2017, McGough et al. studied Zika virus, but they predicted only one parameter for forecasting [8]. In 2018, Nair, Shetty, and Shetty studied heart disease; however, they did not do so under the category of epidemics, and their study needed to be linked with a health care service provider in order to work in real time [9]. In 2019, Nduwayezu et al. studied malaria; their study was limited to Nigeria [10]. In 2020, Petropoulos and Makridakis worked on COVID-19, but they did not use machine learning [11].
In Korea, six diseases are designated as category I National Notifiable Infectious Diseases (NNIDs) according to the definition established in 1954, as shown in Table 1. Recently, rates of cholera, typhoid fever, paratyphoid fever, shigellosis, and enterohemorrhagic Escherichia coli have been low in Korea. Typhoid fever, cholera, and shigellosis in particular were highly prevalent in the 1960s. According to an analysis of the nation's hepatitis A antibody retention rate over the 10 years between 2005 and 2014, 7 out of 10 infected people are in their 30s and 40s, so hepatitis A prevention measures for this age group are necessary. Over the past 10 years, Korea has treated the openness of public data as a national indicator and has been opening up various kinds of daily data, such as population data, meteorological observation data, water quality data, and air quality data. For this reason, using stable and high-accuracy deep learning technology, we have been able to verify the relationship between diseases and the public data on daily life collected over many years.
Hence, in this paper, we aim to minimize the costs and damages involved in the prevention of epidemic outbreaks by predicting regional outbreaks of hepatitis A by using publicly available data in Korea and recent machine learning algorithms.

2. Prediction System of Hepatitis A

To predict hepatitis A, we adopted a two-phase approach, as shown in Figure 1.
The first phase is correlated factor selection, which provides the inputs for training the prediction model. In this phase, we separate irrelevant factors from the environmental factors through statistical analysis. The second phase is disease outbreak prediction with LSTMs (long short-term memory networks) [14,15]. In this phase, we preprocess the selected correlated factors and make predictions using LSTMs.

2.1. Correlated Factor Selection

In this correlated factor selection, we conduct data gathering, data preprocessing, and statistical analysis, as shown in Figure 2. First, we perform web crawling to gather the open data for each region in Korea by studying open data sites in Korea.
1. Patient information: KCDC (Korea Centers for Disease Control & Prevention), http://www.cdc.go.kr;
2. Water Environment Information: NIER (National Institute of Environmental Research), http://water.nier.go.kr/publicMain/mainContent.do;
3. Weather Information: KMA (Korea Meteorological Agency), https://data.kma.go.kr;
4. Population Information: Public Open Data Portal, https://www.data.go.kr/;
5. Air Pollution Information: KECO (Korea Environment Corporation) AirKorea, https://www.airkorea.or.kr.
Second, we perform data preprocessing to handle the missing values and to regularize the regional units across Korea. Third, we evaluate the correlation between the disease (hepatitis A) and each environmental factor. In this evaluation, we eliminate the unrelated factors. As a result, we obtain the candidate factors for predicting the outbreak.

2.2. Disease Outbreak Prediction with Hepatitis A by Regression Analysis

In this disease outbreak prediction, we conduct two steps, data preprocessing and prediction with LSTMs, using the selected correlated factors (candidate factors), as shown in Figure 3. In the preprocessing step, we reorganize the data by living area and scale each feature to the range from 0 to 1. In the prediction step, we calculate the RMSE (Root Mean Square Error) [16] for Random Forest [17], Gradient Boosting Regression [18], Lasso [19], Ridge [20], Linear Regression [21], K-Neighbors Regression, MLP (Multi-Layer Perceptron) Regression [22], XGB Regression, and Cat Boost Regression. These RMSE evaluation results are used for hyper-parameter adjustment and optimal algorithm selection.
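As a point of reference for the comparison metric, the following is a minimal sketch of how RMSE can be computed. The function name `rmse` and the sample values are illustrative assumptions and are not taken from the paper's dataset.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only (assumed, not from the paper).
example_rmse = rmse([0.2, 0.4, 0.6], [0.1, 0.4, 0.8])
```

Because the errors are squared before averaging, a single large miss raises RMSE more than several small ones, which is one reason it is a common choice for comparing regressors.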

3. Experimental Results

3.1. Correlated Factor Selection

We gathered the data from the websites listed in Section 2.1 (Correlated Factor Selection) through web crawling, as shown in Table 2.
Measurement data that are missing for various reasons are called missing values. Missing values appear as None, NaN, or blank in the program, and a dataset with many missing values greatly degrades the quality of the statistical prediction. Machine learning models in particular assume that all input values are meaningful, so missing values affect model quality even further. Rubin [23] classified missing data problems into three categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). If the probability of being missing is the same for all cases, the data are MCAR. If the probability of being missing is the same only within groups defined by the observed data, the data are MAR. If neither MCAR nor MAR holds, the data are MNAR. The method for dealing with missing values depends on the type of data: cross-sectional data consist of observations of each item at a single point in time, whereas panel (longitudinal) data consist of observations of multiple objects at multiple points in time, as in time series. Methods commonly used for cross-sectional data include removing missing values, imputation with the mean or median, imputation with the most frequent value, 0, or a specific constant, K-NN imputation, MICE (Multivariate Imputation by Chained Equations), and imputation using deep learning.
In this study, we used deep learning-based imputation, which is currently widely used; it is more accurate than other methods and can work with a feature encoder. When an item has too many missing values, the item is removed. We estimated the missing values using a Random Forest regressor, with the five values preceding and the five values following each missing value used as training data. We set the number of estimators to 50 and the maximum depth to 4 to prevent overfitting, because there was little training data.
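The following is a minimal sketch of one way such neighbor-based Random Forest imputation could be set up with scikit-learn. It is an interpretation rather than the authors' code: the function name `impute_with_neighbors`, the use of the time index as the only predictor, and the sample series are assumptions. Only the estimator count (50), the maximum depth (4), and the window of five values on each side come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def impute_with_neighbors(series, window=5):
    """Fill NaNs using a Random Forest fitted on nearby observed values.

    Assumption: the position in the series is the only predictor, and the
    training set is the observed values within `window` steps on each side.
    """
    values = np.asarray(series, dtype=float)
    filled = values.copy()
    for i in np.flatnonzero(np.isnan(values)):
        lo, hi = max(0, i - window), min(len(values), i + window + 1)
        idx = np.arange(lo, hi)
        observed = idx[~np.isnan(values[idx])]
        if observed.size == 0:
            continue  # nothing nearby to learn from; leave the gap as NaN
        model = RandomForestRegressor(n_estimators=50, max_depth=4, random_state=0)
        model.fit(observed.reshape(-1, 1), values[observed])
        filled[i] = model.predict(np.array([[i]]))[0]
    return filled


# Illustrative monthly series with gaps (assumed values).
toc = [0.37, 0.40, 0.40, 0.40, 0.43, 0.47, np.nan, 0.47, 0.50, np.nan, np.nan, np.nan]
toc_filled = impute_with_neighbors(toc)
```

Since a forest averages the training targets that fall into the same leaves, each imputed value stays within the range of its observed neighbors, which keeps the imputation conservative on small samples.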
We replaced the missing values, as shown in Table 3. In Table 3 (upper), each missing value is marked with '*' to represent the blank information. In Table 3 (lower), the missing values have been replaced with new values according to the missing values policy.
We performed data regulation and region regulation: the monthly data were computed as the mean of the measurements within each month, and regions were integrated into living areas together with the correlated environmental data, as shown in Figure 4. Figure 4a (left) shows the original data and Figure 4a (right) shows the mean of the monthly data. The publicly available data include water quality measurements that do not exist for specific periods due to problems such as the installation of measurement sensors. To solve this problem, we recombined the regions based on living area, as shown in Figure 4b, and divided them into eight areas. The colors were chosen arbitrarily so that the regions can be clearly distinguished.
We then converted the number of epidemic outbreaks to the number of outbreaks per 100,000 population so that different regions can be compared under the same conditions, as shown in Table 4.
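The per-100,000 conversion can be sketched as follows. The function name `rate_per_100k` is an illustrative assumption; the example inputs are the Gwangju entry of Table 4 (2 outbreaks, population 1,472,802).

```python
def rate_per_100k(outbreaks, population):
    """Number of outbreaks per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return outbreaks / population * 100_000


# Gwangju, 2016-01, as listed in Table 4.
gwangju_rate = rate_per_100k(2, 1_472_802)
```

With these inputs the result agrees with the value listed for Gwangju in Table 4 (about 0.1358 per 100,000).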
We used multiple regression analysis to identify reliable factors in the relationship between hepatitis A and environmental factors. We validated the goodness of fit of the model using the R-squared value, as shown in Table 5, and obtained an R-squared value of 0.7054. In Table 5, the positive correlations for the COD (Chemical Oxygen Demand) value, total coliform count, total dissolved nitrogen, daily precipitation, and PM10 (particulate matter) are presented in italic bold blue characters, with a $ indicator after the item name. The negative correlations for the TOC (Total Organic Carbon) value, fecal E. coliform count, monthly precipitation, and SO2 are presented in italic, bold, underlined red characters, with a % indicator after the item name. Figure 5 shows the statistical results of the linear hypothesis between hepatitis A and the environmental factors. As a result of the test, the differences between the two groups were interpreted as statistically significant.
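For readers who want to reproduce this kind of goodness-of-fit check, the following is a minimal sketch of ordinary least squares with an R-squared computation in NumPy. The function name `ols_r_squared` and the synthetic data are assumptions for illustration; they do not reproduce the paper's regression or its R-squared value of 0.7054.

```python
import numpy as np


def ols_r_squared(X, y):
    """Fit y = b0 + X @ b by least squares and return (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot


# Synthetic stand-in for environmental factors (assumed, not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 0.0]) + rng.normal(scale=0.1, size=200)
coef, r2 = ols_r_squared(X, y)
```

The sign of each fitted coefficient corresponds to the positive or negative association reported per factor; significance testing (the t values and p-values of Table 5) would require the coefficient standard errors, which this sketch omits.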
Figure 6 shows the validation of the correlation coefficients between environmental factors using a heatmap. It shows that some environmental predictors used in the regression analysis have low correlations with the other environmental predictors. Therefore, it was verified that the data analysis was not negatively affected. The 29 environmental factors correlated with the hepatitis A patient information are: hydrogen ion concentration, dissolved oxygen, BOD, COD, suspended solids, total nitrogen, total phosphorus, TOC, mercury, electrical conductivity, total coliform bacteria, dissolved total nitrogen, ammonia nitrogen, nitrate nitrogen, dissolved total phosphorus, phosphate phosphorus, chlorophyll, E. coli bacteria, average temperature, maximum temperature, minimum temperature, average relative humidity, monthly precipitation, highest daily precipitation, small total evaporation, average wind speed, average cloud quantity, deep snow, and average ground temperature.

3.2. Outbreak Region Prediction of Hepatitis A

Through the correlated factor selection process, we integrated patient information, water environment information, weather information, population information, and air pollution information, and refined the data per 100,000 population to obtain the results shown in Table 6. We removed data without patient information or relevant local information during this process. The data obtained were divided into 17 areas across the country, with 50 relevant items, and Seoul was recombined into eight areas based on living area. The data comprise 613 national records from 2016 to 2018 and 769 Seoul records from 2011 to 2018.
Table 7 shows the data normalized by scaling. We use min–max normalization, which rescales each feature to the range [0,1]. Equation (1) for min–max normalization to [0,1] is given as:
$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \qquad (1)$$
Table 7 (upper) represents the original data before scaling and Table 7 (lower) represents the data normalized by data scaling. In this process, we produce the same scale data for training and testing.
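The min–max rescaling of Equation (1) can be sketched per feature column as follows. The function name `min_max_scale`, the guard for constant columns, and the sample rows are illustrative assumptions.

```python
import numpy as np


def min_max_scale(X):
    """Rescale each column of X to [0, 1] using (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (X - col_min) / span


# Illustrative rows: two features on very different scales.
raw = np.array([[7.35, 16.2], [7.62, 14.1], [9.49, 16.6]])
scaled = min_max_scale(raw)
```

One design point worth noting: when the minimum and maximum are taken from the training portion only and then reused for the test portion, both sets share the same scale without information from the test period leaking into training. Whether the paper computed them this way is not stated.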
We chose the optimal model to be used with the LSTM network. Nine algorithms were tested: Random Forest, Gradient Boost, Lasso, Ridge, Linear Regression, KNN, MLP, XG Boost, and Cat Boost. Table 8 shows the comparison results for the nine algorithms, which were used to choose the candidates for hyper-parameter tuning. We used the RMSE (Root Mean Square Error) to compare the algorithms. According to the experimental results, Gradient Boost, Cat Boost, and Random Forest were selected for hyper-parameter tuning. After tuning, the best algorithm was Gradient Boost, whose RMSE changed from 0.077935 to 0.0759682. We estimated the optimal parameters for Gradient Boost using Grid Search CV [24], changing the learning rate from the default value of 0.1 to 0.075, n_estimators from the default value of 100 to 200, and max_depth from the default value of 3 to 4. Grid Search CV is a class provided by sklearn that automatically trains a model for every combination of the hyper-parameter values supplied by the user. It then reports the best-performing parameter combination based on the evaluation metric set by the user (in this paper, MSE) [24].
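A grid search of this kind can be sketched with scikit-learn as shown below. The synthetic data, the three-fold cross-validation, and the exact grid are assumptions for illustration; only the hyper-parameter names and the default and tuned values (learning rate 0.1 and 0.075, n_estimators 100 and 200, max_depth 3 and 4) come from the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled features and target (assumed data).
rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = X @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=0.02, size=60)

# Each hyper-parameter lists the default and the tuned value from the text.
param_grid = {
    "learning_rate": [0.1, 0.075],
    "n_estimators": [100, 200],
    "max_depth": [3, 4],
}

search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # MSE, negated so that higher is better
    cv=3,                              # assumption: fold count is not given
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = float(np.sqrt(-search.best_score_))
```

For time-ordered data, scikit-learn's `TimeSeriesSplit` could be passed as `cv` so that each validation fold lies after its training fold; the text does not say which splitting scheme was used.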
The tests were conducted in one area of Seoul. The training data were from 2016 to March 2018, and the validation data were from April to October 2018. To perform the predictions, the tests used data from November and December 2018. The epidemic of hepatitis A is shown in Figure 7, where the blue line is the training data, the orange line is the validation data, and the green line is the test data. Figure 7 thus visually presents the selection of training, validation, and test data within the time series, including the change in the number of hepatitis A patients.
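The chronological split described above can be sketched as follows. The function name `split_by_month` and the `YYYY-MM` string representation are assumptions; the boundary months follow the text, with the training period assumed to start in January 2016.

```python
def split_by_month(months):
    """Split 'YYYY-MM' labels into train, validation, and test periods."""
    # ISO-style 'YYYY-MM' strings sort chronologically, so string
    # comparison is sufficient here.
    train = [m for m in months if m <= "2018-03"]
    val = [m for m in months if "2018-04" <= m <= "2018-10"]
    test = [m for m in months if m >= "2018-11"]
    return train, val, test


months = [f"{year}-{month:02d}" for year in (2016, 2017, 2018) for month in range(1, 13)]
train_months, val_months, test_months = split_by_month(months)
```

Splitting by date rather than at random keeps every validation and test month strictly after the training months, which matches how a forecast would be used in practice.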
We transformed the 2D data into 3D data, as shown in Figure 8. The 2D data comprised a number of features and samples. The 3D data comprised a number of features, samples, and time steps. To predict the value at time point y_{t+1} using the LSTM, a total of six time steps was used, from time point y_t back to y_{t-5}, as shown in Table 9. Our model uses the sequential model with an LSTM layer and a Dense layer. The optimizer is RMSprop (Root Mean Square propagation) and the loss function is MSE (Mean Square Error). RMSprop prevents the effective learning rate from dropping too close to zero by weighting recent gradients more heavily, rather than accumulating all previous gradients uniformly. MSE is the most commonly used regression loss function; it is the mean of the squared differences between the target variable and the predicted values. Because the dataset is small, the batch size was set to 2.
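The 2D-to-3D transformation can be sketched as a sliding window over the time axis. The function name `make_windows`, the output layout (samples, time steps, features), and the toy arrays are assumptions for illustration; the window length of six follows the text.

```python
import numpy as np


def make_windows(features, target, time_steps=6):
    """Turn a (samples, features) array into (windows, time_steps, features).

    Window k holds rows k .. k + time_steps - 1 (i.e. t-5 .. t), and its
    label is the target at the next row (t+1).
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    if len(features) <= time_steps:
        raise ValueError("need more samples than time steps")
    X, y = [], []
    for end in range(time_steps, len(features)):
        X.append(features[end - time_steps:end])
        y.append(target[end])
    return np.stack(X), np.asarray(y)


# Toy data: 10 monthly samples with 3 features each (assumed values).
features = np.arange(30).reshape(10, 3)
target = np.arange(10)
X3d, y_next = make_windows(features, target)
```

Ten monthly samples yield four windows here, because the first six months are consumed as history before the first prediction can be made.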
We trained the model for 15 epochs. Early stopping is used to halt the training of the LSTM at an appropriate point, so that the model neither overfits nor underfits.
Because the amount of data used in this paper was not large, we applied the early stopping algorithm to prevent overfitting. Figure 9 shows the comparison results of the predicted and actual values for one area of Seoul.
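The early stopping rule can be illustrated with a small patience-based sketch. The function name `early_stopping_epoch`, the patience of 3, and the loss values are assumptions; the text does not state the stopping criterion that was actually configured.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs (epochs are 0-indexed).
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Illustrative validation losses (assumed values).
losses = [0.90, 0.60, 0.50, 0.52, 0.51, 0.53, 0.55]
stop_epoch, best_epoch = early_stopping_epoch(losses)
```

In this example the loss bottoms out at epoch 2 and training halts at epoch 5, after three epochs without improvement; restoring the weights from the best epoch is a common companion step.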
Figure 10 shows the prediction results for the epidemic of hepatitis A in Seoul. We used the training data (from January 2016 to July 2018) and the test data (August 2018) for the eight recombined areas of Seoul. The circle symbol is the actual data and the star symbol is the predicted data for each area.
Areas B and D show large differences between forecasts and measurements because the weather and air pollution information used in the forecasts was measured not for each specific area but across Seoul as a whole. Another potential source of error is the forcible mapping of the eight recombined areas onto the administrative districts of Seoul.
Figure 11 and Figure 12 show the national 17-area prediction of the epidemic of hepatitis A in Korea for each local government unit. We used the training data (from January 2016 to November 2018) and the test data (December 2018). The blue circle symbol is the actual data and the red star symbol is the predicted data for each area in Figure 11.

4. Conclusions

In this paper, we propose a prediction model for the epidemic of hepatitis A. We analyzed the correlation between environmental factors and hepatitis A based on data collected from the public data system in Korea. The predictions of the area of occurrence were performed based on 3D LSTM, a machine learning method, using information on the water environment, the weather, the population, air pollution information, and hepatitis A patients.
The prediction of hepatitis A showed high accuracy with an error of about person per 100,000 population. We confirm that the environmental information used in this study can predict the prevalence of hepatitis A. In addition, our study confirmed that, among the environmental information, fecal coliform count and PM10 were factors of high importance in predicting hepatitis A. In future research, we will identify factors that increase reliability and apply them to more infectious diseases.

Author Contributions

Conceptualization, M.L. and I.N.; data curation, K.L., M.L. and I.N.; formal analysis, K.L. and I.N.; funding acquisition, I.N.; methodology, K.L., M.L. and I.N.; project administration, I.N.; resources, K.L. and M.L.; supervision, I.N.; validation, M.L. and I.N.; visualization, M.L.; writing—original draft, M.L. and I.N.; writing—review and editing, K.L. and I.N. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by a research fund from Chosun University, 2020.

Data Availability Statement

We used public data from KCDC to study only information on the number of cases of infectious diseases by region. 1. Patient information: KCDC (Korea Centers for Disease Control & Prevention), http://www.cdc.go.kr; 2. Water Environment Information: National Institute of Environmental Research (NIER), http://water.nier.go.kr/publicMain/mainContent.do; 3. Weather Information: KMA (Korea Meteorological Agency), https://data.kma.go.kr; 4. Population Information: Public Open Data Portal, https://www.data.go.kr/; 5. Air Pollution Information: KECO (Korea Environment Corporation) AirKorea, https://www.airkorea.or.kr.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lee, M.K.; Paik, J.H.; Na, I.S. (Eds.) Outbreak Prediction of Hepatitis A in Korea based on Statistical Analysis and LSTM Network. In Proceedings of the 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Fukuoka, Japan, 19–21 February 2020.
  2. Park, S.; Cho, E. National Infectious Diseases Surveillance data of South Korea. Epidemiol. Health 2014, 36, e2014030.
  3. Alamo, T.; Reina, D.G.; Mammarella, M.; Abella, A. Covid-19: Open-Data Resources for Monitoring, Modeling, and Forecasting the Epidemic. Electronics 2020, 9, 827.
  4. Singh, R.; Singh, R. Applications of sentiment analysis and machine learning techniques in disease outbreak prediction – A review. Mater. Today Proc. 2021.
  5. Hong, T.; Pinson, P.; Fan, S.; Zareipour, H.; Troccoli, A.; Hyndman, R.J. Probabilistic energy forecasting: Global Energy Forecasting Competition 2014 and beyond. Int. J. Forecast. 2016, 32, 896–913.
  6. Santos, J.C.; Matos, S. Analysing Twitter and web queries for flu trend prediction. Theor. Biol. Med Model. 2014, 11, 1–11.
  7. Grover, S.; Aujla, G.S. Prediction model for Influenza epidemic based on Twitter data. Int. J. Adv. Res. Comput. Commun. Eng. 2014, 3, 7541–7545.
  8. McGough, S.F.; Brownstein, J.S.; Hawkins, J.B.; Santillana, M. Forecasting Zika Incidence in the 2016 Latin America Outbreak Combining Traditional Disease Surveillance with Search, Social Media, and News Report Data. PLOS Negl. Trop. Dis. 2017, 11, e0005295.
  9. Nair, L.R.; Shetty, S.D.; Shetty, S.D. Applying spark based machine learning model on streaming big data for health status prediction. Comput. Electr. Eng. 2018, 65, 393–399.
  10. Nduwayezu, M.; Satyabrata, A.; Han, S.Y.; Kim, J.E.; Kim, H.; Park, J.; Hwang, W.J. Malaria Epidemic Prediction Model by Using Twitter Data and Precipitation Volume in Nigeria. J. Korea Multimed. Soc. 2019, 22, 588–600.
  11. Petropoulos, F.; Makridakis, S. Forecasting the novel coronavirus COVID-19. PLoS ONE 2020, 15, e0231236.
  12. Korea Centers for Disease Control and Prevention. 2013 Infectious Diseases Surveillance Yearbook; KCDC: Osong, Korea, 2014; pp. 50–63.
  13. Korea Centers for Disease Control and Prevention. Public Health Weekly Report Disease Surveillance Statistics, 10th ed.; KCDC: Osong, Korea, 2018.
  14. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9.
  15. Cho, W.; Kim, S.; Na, M.; Na, I. Forecasting of Tomato Yields Using Attention-Based LSTM Network and ARMA Model. Electronics 2021, 10, 1576.
  16. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE)? – Arguments against avoiding RMSE in the literature. Geosci. Model Dev. 2014, 7, 1247–1250.
  17. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  18. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232.
  19. Tibshirani, R. Regression shrinkage and selection via the lasso: A retrospective. J. R. Stat. Soc. Ser. B 2011, 73, 273–282.
  20. Hoerl, A.E.; Kennard, R.W. Ridge Regression: Biased Estimation for Nonorthogonal Problems. Technometrics 1970, 12, 55–67.
  21. Schneider, A.; Hommel, G.; Blettner, M. Linear Regression Analysis. Dtsch. Aerzteblatt Online 2010.
  22. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
  23. Rubin, D.B. Inference and Missing Data. Biometrika 1976, 63, 581–590.
  24. sklearn.model_selection.GridSearchCV. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html (accessed on 7 October 2021).
Figure 1. Two-phase approach of prediction system for hepatitis A.
Figure 2. Process of correlated factor selection.
Figure 3. Process of outbreak region prediction for hepatitis A.
Figure 4. Regulation of data (a) left: original data, upper right: mean of monthly data, (b) integrated living area.
Figure 5. Linear model based statistical test on the relationships between hepatitis A and environmental factors (a): theoretical quantiles vs standardized residuals; (b): leverage vs standardized residuals.
Figure 6. Validation of correlation coefficient between environmental factors using heatmap.
Figure 7. Visualization of epidemic of hepatitis A dataset (training, validation, test data).
Figure 8. Transformation of 2D feed-forward data into 3D LSTM data.
Figure 9. Prediction differential of the number of hepatitis A patients in Seoul between predicted and actual values.
Figure 10. Prediction of epidemic of hepatitis A in Seoul.
Figure 11. National prediction of epidemic of hepatitis A per local government unit.
Figure 12. National prediction map of epidemic of hepatitis A. (a): actual values, (b): predicted values.
Table 1. Prevalence of National Notifiable Infectious Category I Diseases in Korea (restructured based on [12,13]).
Year1954–19591960–19691970–19791980–19891990–19992000–20092010–20132014201520162017
Disease
Cholera01972206145196210140045
Typhoid fever539840,79013,018248130122198566251121121128
Paratyphoid fever1934406417216479522337445673
Shigellosis1004270517035343368698678311088113111
Enterohemorrhagic Escherichia coli-----43124611171104139
Viral hepatitis A------75851307180446794429
Table 2. Examples of web crawling data.
Index | Date | Province | County | Cholera | Typhoid | Paratyphoid | Shigellosis | Enterohemorrhagic Escherichia Coli | Hepatitis A
22972 | 2016-09 | Gangwon | Yangyang | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000
22973 | 2016-09 | Gangwon | Yeongwol | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000
22974 | 2016-09 | Gangwon | Wonju-si | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 1.790000000
22975 | 2016-09 | Gangwon | Inje | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000
22976 | 2016-09 | Gangwon | Jeongseon | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000
22977 | 2016-09 | Gangwon | Cheorwon | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000
22978 | 2016-09 | Gangwon | Chuncheon-si | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 0.000000000 | 1.430000000
Table 3. Replace processing for missing value (upper: original data, lower: replaced missing value).
Upper (original data):
Total Nitrogen | Total Phosphorus | TOC | Mercury
7.346666667 | 0.094 | 0.366666667 | 16.16666667
7.615333333 | 0.106333333 | 0.4 | 14.13333333
9.486333333 | 0.090333333 | 0.4 | 16.6
9.226 | 0.091666667 | 0.4 | 15.96666667
9.023666667 | 0.112 | 0.433333333 | 15.86666667
6.408 | 0.087333333 | 0.466666667 | 14.7
7.346 | 1.114666667 | * | 14.4
7.738333333 | 0.127 | 0.466666667 | 14.16666667
4.906666667 | * | 0.5 | *
* | 0.085333333 | * | 17.2
8.348333333 | 0.088333333 | * | 17.66666667
* | 0.00333333 | * | 15.9
Lower (missing values replaced):
Total Nitrogen | Total Phosphorus | TOC | Mercury
7.346666667 | 0.094 | 0.366666667 | 16.16666667
7.615333333 | 0.106333333 | 0.4 | 14.13333333
9.486333333 | 0.090333333 | 0.4 | 16.6
9.226 | 0.091666667 | 0.4 | 15.96666667
9.023666667 | 0.112 | 0.433333333 | 15.86666667
6.408 | 0.087333333 | 0.466666667 | 14.7
7.346 | 0.114666667 | 0.466666667 | 14.4
7.738333333 | 0.127 | 0.466666667 | 14.16666667
4.906666667 | 0.081333333 | 0.5 | 13.66666667
8.156666667 | 0.085333333 | 0.5 | 17.2
8.348333333 | 0.088333333 | 0.5 | 17.66666667
7.079333333 | 0.100333333 | 0.533333333 | 15.9
Italic, underline and bold number: new values according to the missing values policy; *: missing value.
Table 4. Epidemic outbreaks to the number of outbreaks per 100,000 population.
Table 4. Epidemic outbreaks to the number of outbreaks per 100,000 population.
| Date | Area | No. of Outbreaks | Population | No. of Outbreaks per 100,000 Population |
|---|---|---|---|---|
| 2016-01 | Kangwon | 1 | 1,549,193 | 0.059888984 |
| 2016-01 | Gyeonggi | 200 | 12,536,474 | 1.593915342 |
| 2016-01 | Gyeongnam | 81 | 3,364,764 | 2.398269409 |
| 2016-01 | Kyongbuk | 10 | 2,701,160 | 0.374661333 |
| 2016-01 | Gwangju | 2 | 1,472,802 | 0.135795579 |
| 2016-01 | Daegu | 4 | 2,487,823 | 0.148598832 |
| 2016-01 | Daejeon | 2 | 1,518,024 | 0.160783143 |
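The conversion in Table 4 is a rate per 100,000 inhabitants, that is, outbreaks divided by population and multiplied by 100,000. A minimal sketch follows; the function name `rate_per_100k` is hypothetical, and the Gwangju row is used as the check.

```python
def rate_per_100k(outbreaks: int, population: int) -> float:
    """Return the number of outbreaks per 100,000 population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return outbreaks / population * 100_000


# Gwangju, 2016-01 (Table 4): 2 outbreaks in a population of 1,472,802.
gwangju_rate = rate_per_100k(2, 1_472_802)
print(f"{gwangju_rate:.6f}")
```

Expressing counts as rates makes regions with very different population sizes comparable before they are fed to a model.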
Table 5. Multiple regression analysis results of each environmental factor and hepatitis A.
| Items | Estimate | Std. Error | t Value | Pr(>\|t\|) | Signif. |
|---|---|---|---|---|---|
| (Intercept) | 1.084 | 4.783 × 10^−1 | 2.267 | 0.02394 | * |
| pH | −2.461 × 10^−2 | 2.861 × 10^−2 | −0.860 | 0.39017 | |
| Dissolved Oxygen | −2.812 × 10^−3 | 1.007 × 10^−2 | −0.279 | 0.78009 | |
| BOD | −1.008 × 10^−2 | 1.414 × 10^−2 | −0.713 | 0.47645 | |
| COD $ | 4.032 × 10^−2 | 1.664 × 10^−2 | 2.423 | 0.01587 | * |
| Suspended Solid | 1.642 × 10^−3 | 2.860 × 10^−3 | 0.574 | 0.56611 | |
| Total Nitrogen | −1.448 × 10^−1 | 4.887 × 10^−2 | −2.963 | 0.00324 | ** |
| Total Phosphorus | 6.142 × 10^−1 | 2.755 × 10^−1 | 2.229 | 0.02639 | * |
| TOC % | −5.964 × 10^−2 | 1.233 × 10^−2 | −4.835 | 1.94 × 10^−6 | *** |
| Water Temperature | 1.208 × 10^−3 | 8.672 × 10^−3 | 0.139 | 0.88926 | |
| Conductivity | −3.274 × 10^−4 | 1.678 × 10^−4 | −1.951 | 0.05183 | . |
| Total Coliforms $ | 3.369 × 10^−7 | 1.348 × 10^−7 | 2.499 | 0.01287 | * |
| Dissolved total Nitrogen $ | 1.454 × 10^−1 | 5.331 × 10^−2 | 2.727 | 0.00670 | ** |
| Ammonia Nitrogen | 2.935 × 10^−2 | 2.693 × 10^−2 | 1.090 | 0.27636 | |
| Nitrate Nitrogen | 1.170 × 10^−2 | 2.392 × 10^−2 | 0.489 | 0.62494 | |
| Dissolved total Phosphorus | −8.784 × 10^−3 | 2.565 × 10^−2 | −0.342 | 0.73224 | |
| Phosphate Phosphorus | −5.387 × 10^−1 | 3.614 × 10^−1 | −1.490 | 0.13695 | |
| Chlorophyll | 3.387 × 10^−3 | 2.001 × 10^−3 | 1.693 | 0.09131 | . |
| Fecal E. coliform count % | −5.887 × 10^−6 | 1.922 × 10^−6 | −3.063 | 0.00235 | ** |
| Average temperature | −1.371 × 10^−1 | 2.557 × 10^−2 | −5.361 | 1.44 × 10^−7 | *** |
| Highest temperature | 6.476 × 10^−3 | 8.274 × 10^−3 | 0.783 | 0.43432 | |
| Lowest temperature | 2.760 × 10^−2 | 8.681 × 10^−3 | 3.179 | 0.00160 | ** |
| Average relative humidity | −5.186 × 10^−3 | 5.537 × 10^−3 | −0.937 | 0.34957 | |
| Monthly precipitation % | −1.636 × 10^−3 | 3.560 × 10^−4 | −4.596 | 5.86 × 10^−6 | *** |
| Daily maximum precipitation | 6.989 × 10^−3 | 1.218 × 10^−3 | 5.737 | 1.97 × 10^−8 | *** |
| Small total evaporation | −5.467 × 10^−3 | 1.210 × 10^−3 | −4.518 | 8.33 × 10^−6 | *** |
| Average wind speed | 8.834 × 10^−2 | 7.598 × 10^−3 | 1.163 | 0.24572 | |
| Average amount of cloud | 2.856 × 10^−2 | 2.317 × 10^−2 | 1.232 | 0.21856 | |
| The most serious theory | −1.148 × 10^−2 | 6.809 × 10^−3 | −1.685 | 0.09273 | . |
| Average ground temperature | 1.115 × 10^−1 | 2.041 × 10^−2 | 5.463 | 8.48 × 10^−8 | *** |
| so2 % | −1.434 × 10^2 | 2.694 × 10^1 | −5.324 | 1.74 × 10^−7 | *** |
| no2 | −5.512 × 10^−3 | 5.934 | −0.093 | 0.92605 | |
| o3 | −9.996 | 5.941 | −1.683 | 0.09326 | . |
| co | −3.551 × 10^−1 | 3.715 × 10^−1 | −0.956 | 0.33978 | |
| pm10 $ | 1.959 × 10^−2 | 2.547 × 10^−3 | 7.690 | 1.27 × 10^−13 | *** |

Significance codes by p-value: ***: [0, 0.001]; **: (0.001, 0.01]; *: (0.01, 0.05]; .: (0.05, 0.1]; (blank): (0.1, 1]. Residual standard error: 0.866 on 548 degrees of freedom; multiple R-squared: 0.7357; adjusted R-squared: 0.7054; F-statistic: 24.22 on 63 and 548 DF; p-value: < 2.2 × 10^−16. Items marked %: negative correlations (red and underlined in the original); items marked $: positive correlations (blue in the original).
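The columns of Table 5 are related in a simple way: the t value is the estimate divided by its standard error, and the significance code follows from the p-value intervals listed in the table note. The sketch below shows both; `t_value` and `significance_code` are hypothetical helper names, and the p-values are taken from the table rather than recomputed.

```python
def t_value(estimate: float, std_error: float) -> float:
    """t statistic of a regression coefficient: estimate / standard error."""
    if std_error <= 0:
        raise ValueError("standard error must be positive")
    return estimate / std_error


def significance_code(p: float) -> str:
    """Map a p-value to the significance code used in the Table 5 note."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    if p <= 0.1:
        return "."
    return ""  # (0.1, 1]: not flagged


# COD row of Table 5: estimate 4.032e-2, standard error 1.664e-2, p = 0.01587.
cod_t = round(t_value(4.032e-2, 1.664e-2), 3)
cod_code = significance_code(0.01587)
```

Because the tabulated estimates and standard errors are rounded, recomputed t values for other rows may differ from the table in the last digit.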
Table 6. Fifty integrated factors (patient information, region information, environmental factors).
| Items | Value 1 | Value 2 | Value 3 | Value 4 | Value 5 | Value 6 |
|---|---|---|---|---|---|---|
| Patient | 0.059889 | 1.5939153 | 2.3982694 | 0.3746613 | 0.1357956 | 0.1485988 |
| Area | Kangwon | Gyeonggi | Gyeongnam | Kyongbuk | Gwangju | Daegu |
| Population | 1549193 | 12536474 | 3364764 | 2701160 | 1472802 | 2487823 |
| pH | 7.85 | 8.03 | 7.7 | 7.8588235 | 7.3583333 | 7.805814 |
| Dissolved Oxygen | 12.492308 | 12.91 | 10.766667 | 13.772059 | 13.925 | 14.05 |
| BOD | 3.7923077 | 3.04 | 0.5 | 1.6132353 | 2.1083333 | 1.2976744 |
| COD | 6.5038462 | 6.02 | 0.4666667 | 4.4455882 | 4.925 | 3.8709302 |
| Suspended Solid | 5.4961538 | 6.37 | 0.5333333 | 6.1 | 7.1 | 4.4209302 |
| Total Nitrogen | 8.9094231 | 3.6515 | 4.9066667 | 4.0757647 | 4.3966667 | 2.6079186 |
| Total Phosphorus | 0.1218077 | 0.059 | 0.0813333 | 0.0725735 | 0.09 | 0.0396628 |
| TOC | 3.5 | 3.12 | 0.5 | 2.5147059 | 3.3083333 | 2.2813953 |
| Water Temperature | 3.9423077 | 4.07 | 13.666667 | 3.3529412 | 4.4666667 | 4.0860465 |
| Conductivity | 614.61538 | 4778.8 | 157.33333 | 310.82353 | 375.83333 | 339.83721 |
Table 7. Example of data scaling (upper: original data, lower: normalized data).
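| Population | pH | Dissolved Oxygen | BOD | COD | Suspended Solid | Total Nitrogen |
|---|---|---|---|---|---|---|
| 3511974 | 7.867978 | 10.591573 | 3.964045 | 6.862360 | 12.588202 | 4.982775 |
| 2701238 | 7.822059 | 11.044118 | 2.661765 | 6.372059 | 12.775000 | 3.625824 |
| 13090648 | 7.756471 | 13.115882 | 2.160000 | 4.574706 | 5.106471 | 6.055618 |

| Population | pH | Dissolved Oxygen | BOD | COD | Suspended Solid | Total Nitrogen |
|---|---|---|---|---|---|---|
| 3511974 | 0.638519 | 0.446706 | 0.410341 | 0.529616 | 0.207408 | 0.400355 |
| 2701238 | 0.621617 | 0.491382 | 0.272046 | 0.490444 | 0.210587 | 0.248327 |
| 13090648 | 0.597474 | 0.695910 | 0.218761 | 0.346847 | 0.080091 | 0.520553 |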
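The normalized values in Table 7 all fall between 0 and 1, which is what min-max scaling produces. The exact scaler is not stated in this part of the text, so the following is a sketch under that assumption; `min_max_scale` is a hypothetical helper. Applied to only the three pH values shown, it yields 0 and 1 at the extremes, not the tabulated values, which would have been scaled against the minimum and maximum of the full dataset.

```python
from typing import List


def min_max_scale(values: List[float]) -> List[float]:
    """Scale values linearly to [0, 1]: (x - min) / (max - min).

    Assumption: min-max scaling is inferred from the [0, 1] range of the
    normalized values; the scaler actually used is not named here.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# pH column from the upper half of Table 7 (three rows only).
ph = [7.867978, 7.822059, 7.756471]
ph_scaled = min_max_scale(ph)
```

Scaling each factor to a common range keeps large-magnitude inputs such as conductivity from dominating small-magnitude ones such as total phosphorus during training.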
Table 8. RMSE Comparison results for the nine algorithms.
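| | Algorithm | RMSE (Original) | RMSE (Tuning) |
|---|---|---|---|
| 0 | RandomForestRegressor | 0.081008 | - |
| 1 | GradientBoostingRegressor | 0.077935 | 0.075904 |
| 2 | Lasso | 0.147475 | - |
| 3 | Ridge | 0.085332 | - |
| 4 | LinearRegression | 0.086475 | - |
| 5 | KNeighborsRegressor | 0.110483 | - |
| 6 | MLPRegressor | 0.096595 | - |
| 7 | XGBRegressor | 0.078657 | - |
| 8 | CatBoostRegressor | 0.081142 | - |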
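The comparison metric in Table 8 is the root mean squared error, the square root of the mean of the squared differences between observed and predicted values. A minimal standard-library sketch is given below; the function name `rmse` and the sample numbers are illustrative and do not come from the experiments.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative values only (not taken from the paper's experiments).
perfect = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
off = rmse([0.0, 0.0], [3.0, 4.0])
```

Lower values are better. Because the target is scaled, the figures in Table 8 are on the scaled axis and are comparable across algorithms only when each model is evaluated on the same split.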
Table 9. Serial data for LSTM.
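| load | y_t+1 | y_t−5 | y_t−4 | y_t−3 | y_t−2 | y_t−1 | y_t |
|---|---|---|---|---|---|---|---|
| 0.00 | 0.44 | nan | nan | nan | nan | nan | 0.00 |
| 0.44 | 0.84 | nan | nan | nan | nan | 0.00 | 0.44 |
| 0.84 | 0.59 | nan | nan | nan | 0.00 | 0.44 | 0.84 |
| 0.59 | 0.86 | nan | nan | 0.00 | 0.44 | 0.84 | 0.59 |
| 0.86 | 0.45 | nan | 0.00 | 0.44 | 0.84 | 0.59 | 0.86 |
| 0.45 | 0.26 | 0.00 | 0.44 | 0.84 | 0.59 | 0.86 | 0.45 |
| 0.26 | 0.03 | 0.44 | 0.84 | 0.59 | 0.86 | 0.45 | 0.26 |
| 0.03 | 0.26 | 0.84 | 0.59 | 0.86 | 0.45 | 0.26 | 0.03 |
| 0.26 | 0.47 | 0.59 | 0.86 | 0.45 | 0.26 | 0.03 | 0.26 |
| 0.47 | 0.46 | 0.86 | 0.45 | 0.26 | 0.03 | 0.26 | 0.47 |
| 0.46 | 0.65 | 0.45 | 0.26 | 0.03 | 0.26 | 0.47 | 0.46 |

Column groups: y is the y_t+1 column, the time point to be predicted; X is the set of columns from the current time point y_t back to y_t−5.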
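The layout of Table 9 is a sliding window over the series: each row pairs the six most recent values (X) with the next value (y), and positions before the start of the series are `nan`. The sketch below rebuilds those rows from the `load` column. The helper names are hypothetical, and two steps are assumptions not stated in the text: dropping the incomplete windows, and reshaping to the (samples, time steps, features) layout that LSTM layers in common frameworks expect.

```python
import math
from typing import List, Tuple

Row = Tuple[List[float], float]


def make_windows(series: List[float], n_lags: int = 6) -> List[Row]:
    """Pair each window y_{t-5}..y_t (oldest first) with the target y_{t+1}.

    Positions before the start of the series are filled with nan,
    matching the layout of Table 9.
    """
    rows = []
    for t in range(len(series) - 1):
        lags = [series[t - k] if t - k >= 0 else math.nan
                for k in range(n_lags - 1, -1, -1)]
        rows.append((lags, series[t + 1]))
    return rows


def to_lstm_input(rows: List[Row]) -> Tuple[List[List[List[float]]], List[float]]:
    """Drop windows containing nan and reshape X to (samples, steps, 1).

    Assumption: discarding incomplete windows is one common choice; the
    handling actually used for the nan rows is not stated here.
    """
    complete = [(x, y) for x, y in rows if not any(math.isnan(v) for v in x)]
    features = [[[v] for v in x] for x, _ in complete]
    targets = [y for _, y in complete]
    return features, targets


# "load" column of Table 9 followed by the final target value 0.65.
load = [0.00, 0.44, 0.84, 0.59, 0.86, 0.45, 0.26, 0.03, 0.26, 0.47, 0.46, 0.65]
rows = make_windows(load)
X, y = to_lstm_input(rows)
```

Here `rows` has the same 11 entries as Table 9, and `X` holds the 6 complete windows, each with 6 time steps and 1 feature.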
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Lee, K.; Lee, M.; Na, I. Predicting Regional Outbreaks of Hepatitis A Using 3D LSTM and Open Data in Korea. Electronics 2021, 10, 2668. https://doi.org/10.3390/electronics10212668

AMA Style

Lee K, Lee M, Na I. Predicting Regional Outbreaks of Hepatitis A Using 3D LSTM and Open Data in Korea. Electronics. 2021; 10(21):2668. https://doi.org/10.3390/electronics10212668

Chicago/Turabian Style

Lee, Kwangok, Munkyu Lee, and Inseop Na. 2021. "Predicting Regional Outbreaks of Hepatitis A Using 3D LSTM and Open Data in Korea" Electronics 10, no. 21: 2668. https://doi.org/10.3390/electronics10212668
