1. Introduction
Artificial intelligence (AI) refers to the development of computer-based algorithms that can perform tasks normally requiring human intelligence. In some medical research, the terms “artificial intelligence” and “machine learning” are used interchangeably [1,2]. This usage is incorrect, and the two terms should be differentiated. In fact, artificial intelligence covers a broad spectrum of learning techniques and is not limited to machine learning [3,4]. AI includes representation learning, deep learning, and natural language processing (NLP). AI denotes computational programs that imitate and simulate human intelligence in problem-solving and learning [5,6]. In healthcare, artificial intelligence uses computer algorithms to discover information in raw data and to support accurate medical decisions [7,8].
Machine learning (ML) is a subset of artificial intelligence that automatically discovers patterns in data. ML-based models learn from experience and do not need to be explicitly programmed [9,10]. In other words, a learning model learns from samples, whereas explicit programming follows fixed rules or a limited set of hypotheses [11,12]. ML can improve efficiency and reliability and reduce costs in computational processes. Moreover, it can generate models rapidly and accurately through data analysis. Machine learning provides tools that can process amounts of data far beyond what humans can analyze. For example, health data may include demographic data, images, laboratory results, genomic data, medical records, and data obtained from sensors. Various platforms are used to generate or collect these data samples, for example network servers, electronic health records (EHRs), genomic data, personal computers, smartphones, mobile applications, sensors [13,14], and wearable devices [15,16].
Figure 1 represents various data generation resources in healthcare.
Medicine is regarded as one of the most important application areas of artificial intelligence and machine learning [17]. In the mid-20th century, researchers presented many medical decision-making systems. Rule-based methods were very popular in the 1970s [18,19]. They were successfully used to interpret electrocardiograms (ECGs), identify diseases, and select appropriate treatment methods. However, rule-based systems were costly and fragile: their decision-making rules had to be specified accurately and updated continuously. They are known as the first generation of AI-based systems [20,21]. In these systems, medical knowledge must be interpreted accurately by experts to formulate decision-making rules. In contrast, new AI-based models use machine learning (ML) techniques to extract data patterns from complex environments [22,23]. ML has many applications in medicine, including disease identification and classification, the risk ranking of diseases, and the selection of appropriate treatment approaches.
Figure 2 displays some ML applications in healthcare. In recent years, researchers have presented many studies that focus on different aspects of healthcare [24,25]. These studies have used various machine learning methods, such as Naïve Bayes (NB), artificial neural networks (ANNs), evolutionary algorithms (EAs), support vector machines (SVMs), and fuzzy systems (FSs) [26], as well as hybrid methods such as neuro-genetic and neuro-fuzzy systems.
Research on artificial intelligence and machine learning in healthcare is growing rapidly. Given the large advancements in machine learning techniques and their applications in medicine, this research needs to be reviewed regularly. In Table 1, we present some review papers on ML applications in healthcare. These papers have often focused on ML applications in a specific medical field, for example, medical imaging, or on machine learning applications for diagnosing or treating a specific illness. They pay less attention to the structure of the ML-based models used in different methods. AI specialists should be aware of the structure of the learning models used in different approaches and identify their strengths and weaknesses in order to improve these models in healthcare. Because few review papers in healthcare, for example [27], consider the structure of ML-based models, this subject requires more attention. Consequently, in this paper, we review the concepts associated with the structure of ML-based models in healthcare and consider their applications in the healthcare field. This paper provides a comprehensive view for artificial intelligence researchers to answer the question, “How can machine learning techniques be used to improve different healthcare methods?”
Table 2 compares our review paper with other review papers in this area. In this paper, we first present a classification of machine learning-based schemes in healthcare. This classification categorizes machine learning-based schemes in healthcare based on data pre-processing methods (data cleaning methods, data reduction methods), learning methods (unsupervised learning, supervised learning, semi-supervised learning, and reinforcement learning), evaluation methods (simulation-based evaluation and practical implementation-based evaluation in real environment) and applications (diagnosis, treatment).
We believe that this review paper helps AI researchers to familiarize themselves with the latest research on ML-based approaches in healthcare, recognize the challenges and limitations in this area, and become aware of future research directions. In this review paper, we focus on a number of papers related to machine learning in healthcare published in 2017–2021. We reviewed and studied various review papers, book chapters, research papers, and conference papers from different publishers such as Springer, Elsevier, IEEE, Wiley, Taylor & Francis, Nature, ACM, and MDPI. Because the number of papers published in the healthcare field is very high, we cannot study all of them within the limited space of this review paper. Therefore, among papers addressing the same concept, we selected those that were published most recently, provide a more detailed evaluation, and use a larger dataset, and we excluded the others. We used Google Scholar to find these papers and searched for various phrases such as “Machine learning”, “Artificial intelligence in medicine”, “Machine learning applications in medicine”, “Intelligent medicine”, “Supervised learning in healthcare”, “Unsupervised learning in healthcare”, “Semi-supervised learning in healthcare”, “Reinforcement learning in healthcare”, “Deep learning”, and “Future hospitals”.
The rest of the paper is organized as follows: in Section 2, machine learning and its applications in healthcare are described. In Section 3, we present the general framework for designing a learning model in the medical field. In Section 4, our proposed classification is introduced. In Section 5, we study some ML-based methods in healthcare in accordance with the classification provided in this paper. In Section 6, we summarize discussions about the ML-based methods examined in this paper. In Section 7, we briefly describe some challenges and restrictions on the use of machine learning in medicine. Finally, the conclusion of the paper is presented in Section 8.
3. The General Framework for Designing a Learning Model in Medicine
In this section, we introduce the phases of designing a learning model in the healthcare field. The purpose of this section is to help researchers understand how to design a learning model in medicine. We recommend that researchers consult further work in this area to gain a deeper understanding of learning models [18,21]. To design a learning model in the healthcare field, we must consider five main phases: problem definition, dataset, data preprocessing, ML model development, and evaluation. These phases are shown in Figure 3. In the following, each of these phases is described in detail.
Problem Definition. When designing a learning model in the healthcare field, we must first answer the question: “What is the purpose of designing this learning model?” To design a useful model, the first step is to identify problems and challenges in the healthcare field. Researchers should also analyze exactly how machine learning can improve medical services. In addition, they should examine the existing solutions presented in this area so far [31]. A key point in this first phase is to review data availability: researchers should be aware of existing data sources because sufficient data must be available for developing and evaluating the learning model. In the healthcare field, a lack of data can be due to a lack of digital data, patient privacy, commercial issues, or rare diseases.
Dataset. When designing a learning model in the healthcare field, datasets are used for training, validating, and testing. Healthcare datasets may include demographic information, images, laboratory results, genomic data, and data obtained from sensors [54,55]. Various platforms are used to produce or collect these data, for example network servers, e-health records, genome data, personal computers, smartphones, mobile applications, and wearable devices [56,57]. Today, the Internet and cloud-based technology have improved global connectivity [58,59]. As a result, data have become easier to access. Before developing a learning model in the healthcare field, it is necessary to design an appropriate mechanism for evaluating it, because a designer's claim that a learning model performs well is not sufficient without evidence. ML-based models are data-centric. Therefore, they may suffer from overfitting or underfitting [60,61]. An efficient learning model should strike a balance between the two, that is, it must have an appropriate bias and an appropriate variance. Underfitting occurs when the learning model is too simple relative to the complexity of the problem and the size of the dataset. Such a model performs poorly on both the training set and the testing set, which means that it has a high bias. Overfitting, on the other hand, occurs when the learning model is very complex and has a large number of parameters relative to the complexity of the problem and the size of the dataset. In this case, the model performs well on the training set but poorly on the testing set, which means that it has a high variance. In general, a proper learning model should have low bias and low variance.
Figure 4 describes the overfitting and underfitting problems.
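The relationship between training error, testing error, underfitting, and overfitting can be illustrated with a small diagnostic helper. This is only a sketch: the function name `diagnose_fit` and the thresholds `target_error` and `max_gap` are assumptions chosen for illustration, not values taken from the reviewed literature.

```python
def diagnose_fit(train_error, test_error, target_error=0.10, max_gap=0.05):
    """Label a model's fit from its training and testing error rates.

    target_error and max_gap are illustrative thresholds (assumptions):
    a real project would choose them for the clinical task at hand.
    """
    # High error even on the training set: the model is too simple (high bias).
    if train_error > target_error:
        return "underfitting (high bias)"
    # Low training error but a much higher testing error: high variance.
    if test_error - train_error > max_gap:
        return "overfitting (high variance)"
    return "balanced"


print(diagnose_fit(0.30, 0.32))
print(diagnose_fit(0.01, 0.25))
print(diagnose_fit(0.05, 0.07))
```

A model that is flagged as underfitting usually needs more capacity or better features, whereas a model flagged as overfitting usually needs more data, fewer parameters, or regularization.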
To guard against overfitting, a common solution is to divide the dataset into two parts: a training set and a testing set. The “training set” is the part of the dataset used for training the learning model and adjusting its parameters. The “testing set” is the part used for evaluating the performance of the learning model. Usually, the training set is larger than the testing set, for example, in a ratio of 70 to 30. One way to select the training set and the testing set is to divide the dataset randomly into two parts. Sometimes, however, the dataset is small, and it is not possible to set aside a part of it only for testing. In this case, the K-Fold Cross-Validation technique is used [62,63]. In this technique, the dataset is divided into k sections. Then, one section is used for testing and the remaining k − 1 sections are used for training. This process is repeated k times so that, in each step, a different section is used for testing, and the performance of the learning model is evaluated in each step. Finally, the overall performance of the learning model is the average performance over the k steps. K-Fold Cross-Validation is shown in Figure 5.
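The random 70/30 split and the K-Fold Cross-Validation procedure described above can be sketched in a few lines of standard-library Python. The function names and the `evaluate` callback (which would train a model on the training indices and return its score on the testing indices) are hypothetical and serve only to illustrate the procedure.

```python
import random


def train_test_split(n_samples, test_ratio=0.3, seed=0):
    """Randomly split sample indices into a training and a testing set."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k steps."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal sections
    for i in range(k):
        test = folds[i]  # one section for testing
        # The remaining k - 1 sections are used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, k, evaluate):
    """Average the score returned by evaluate(train, test) over the k steps."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / k


train, test = train_test_split(10)
print(len(train), len(test))
for train, test in k_fold_indices(10, 5):
    print(sorted(test))
```

Every sample is used for testing exactly once, which is why the technique is attractive for small medical datasets.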
Data Pre-Processing. When designing a learning model in the healthcare field, one of the most challenging issues is data preprocessing, because a machine learning model requires high-quality data to train well and achieve good accuracy. In general, data pre-processing is the process of detecting and handling noisy data, missing values, duplicate data, and contradictory data. Its purpose is to increase the quality of the dataset before the learning model is created. Therefore, in data pre-processing, we may need to filter outliers or estimate missing values. If the data also have high dimensionality, data reduction methods such as feature selection [64,65] or feature extraction [66] can be used. Feature selection selects the best subset of the original features, whereas feature extraction derives a new, lower-dimensional set of features from the initial dataset.
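As a minimal sketch of such a pre-processing pipeline, the following code removes duplicate records, estimates missing values, and discards uninformative features. The concrete choices are assumptions made for illustration: missing values are encoded as `None` and replaced with the column median, and a simple variance filter stands in for the more elaborate feature selection methods cited above.

```python
from statistics import median, pvariance


def drop_duplicates(rows):
    """Remove exact duplicate rows while keeping the original order."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(list(row))
    return unique


def impute_median(rows):
    """Replace missing entries (None) with the median of their column."""
    n_cols = len(rows[0])
    medians = [median(r[c] for r in rows if r[c] is not None) for c in range(n_cols)]
    return [[medians[c] if r[c] is None else r[c] for c in range(n_cols)] for r in rows]


def select_by_variance(rows, threshold=0.0):
    """Keep only the columns whose variance exceeds the threshold."""
    n_cols = len(rows[0])
    keep = [c for c in range(n_cols) if pvariance([r[c] for r in rows]) > threshold]
    return [[r[c] for c in keep] for r in rows], keep


# Hypothetical records: two informative features and one constant feature.
records = [[1.0, 5.0, 7.0], [1.0, 5.0, 7.0], [2.0, None, 7.0], [4.0, 9.0, 7.0]]
cleaned = impute_median(drop_duplicates(records))
reduced, kept_columns = select_by_variance(cleaned)
print(reduced, kept_columns)
```

The order of the steps matters: duplicates are removed first so that repeated records do not distort the medians used for imputation.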
ML Model Development. When designing a learning model in the healthcare field, we must consider the dataset size, the type of learning scheme, and the model inference time. We determine the complexity of a learning model based on the dataset size to avoid overfitting or underfitting. The training and inference time of a learning model is also very important. Learning models with more parameters can produce more accurate results, but they perform more computational operations and need a longer time for training. As a result, they may not be suitable for real-time applications, for which lightweight architectures are more appropriate. The type of learning scheme is also very important when developing ML models [67,68]. In general, there are four main learning methods: supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning [69,70]. We describe these techniques in more detail in Section 4.
Evaluation. Evaluating a machine learning-based system means executing various operations to detect differences between the current behavior of the system and its expected behavior [71]. After designing a learning model in healthcare, the necessary evaluations should be performed to answer the question, “Does this model meet the conditions for deployment in real environments?” In the evaluation process, designers use various scales to examine the performance of the learning model and determine its strengths and weaknesses. In addition, after deploying the learning model in real environments, we must re-examine its performance to evaluate its behavior when interacting with real users [72,73]. The evaluation of a machine learning system covers three aspects: the data used to build the final learning model, the learning algorithms used to design the final model, and the performance of the final model. In the following, we explain these aspects more precisely:
Evaluating the data used to build the final learning model: The performance of learning models depends highly on data. Any error in the data can negatively affect the final model and weaken its performance. In the data evaluation process, it is necessary to answer different questions. For example, are there enough data to train and test the model? Can the existing data be considered representative of all real data for a specific area? Are the available data balanced? Do the data contain any adversarial or false information?
Evaluating the learning algorithms used to design the final model: At this step, the learning algorithms used for creating the final learning model must be carefully evaluated to detect possible errors in designing or selecting the algorithms. For example, the designer should test different learning algorithms to select the most suitable one for building the final model. If too few tests are performed to select the proper learning algorithm, the error rate of the final learning model may increase. In addition, at this step, we can adjust the parameters of a learning algorithm, for example, SVM parameters; artificial neural network parameters, such as the number of neurons in each layer, the number of hidden layers, and the network weights; or decision tree parameters, such as the number of leaves or the depth of the tree.
Evaluating the performance of the final model: After constructing and training the final model, its performance must be evaluated based on the following factors:
- Correctness: This factor evaluates how close the actual results of the learning system are to the expected results. The evaluation scales used for this purpose are listed in the following. We first define some terms:
  - True positive (TP): The number of positive class members that are correctly predicted by the classifier and labeled as the positive class.
  - True negative (TN): The number of negative class members that are correctly predicted by the classifier and labeled as the negative class.
  - False positive (FP): The number of negative class members that are falsely predicted by the classifier and labeled as the positive class.
  - False negative (FN): The number of positive class members that are falsely predicted by the classifier and labeled as the negative class.
In the following, we introduce some important scales for evaluating a learning model. These scales are based on the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts:
Sensitivity: This scale is the probability that the classifier predicts a positive result when the corresponding ground truth is positive. It is also called the true positive rate (TPR) and is calculated as follows:
$$\mathrm{Sensitivity} = \mathrm{TPR} = \frac{TP}{TP + FN}$$
Specificity: This scale is the probability that the classifier predicts a negative result when the corresponding ground truth is negative. It is also called the true negative rate (TNR) and is calculated as follows:
$$\mathrm{Specificity} = \mathrm{TNR} = \frac{TN}{TN + FP}$$
Positive predicted value (PPV): This scale is the probability that the ground truth is positive when the test result (the output of the classifier) is positive. It is also called precision and is calculated as follows:
$$\mathrm{PPV} = \frac{TP}{TP + FP}$$
Negative predicted value (NPV): This scale is the probability that the ground truth is negative when the test result is negative. It is calculated as follows:
$$\mathrm{NPV} = \frac{TN}{TN + FN}$$
Accuracy: This scale is very important, and classifiers are usually evaluated based on it. It is defined as the proportion of samples that are correctly classified by the classifier. It is calculated as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Matthews correlation coefficient (MCC): This scale is the correlation coefficient between the predicted result and the corresponding ground truth. It takes a value between −1 and +1. If MCC = +1, the classifier predicts every result correctly. If MCC = 0, the classifier predicts no better than random guessing. If MCC = −1, there is a full contradiction between the predicted result and the corresponding ground truth. The MCC scale is calculated as follows:
$$\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
False discovery rate (FDR): This scale is the ratio of samples that are falsely predicted as positive to all samples that are classified as positive. It is calculated as follows:
$$\mathrm{FDR} = \frac{FP}{FP + TP}$$
AU-ROC: This scale is another important criterion for evaluating classifiers. It is the area under the receiver operating characteristic (ROC) curve, which is drawn based on TPR and the false positive rate, FPR = FP / (FP + TN). It is calculated as follows:
$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR})$$
F1-Score: This scale combines precision and sensitivity and is defined as their harmonic mean. Its best value is F1 = 1 and its worst value is F1 = 0. It is calculated as follows:
$$F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}} = \frac{2\,TP}{2\,TP + FP + FN}$$
Receiver operating characteristic (ROC) curve: This curve is a method for visualizing, organizing, and selecting classifiers based on their performance. ROC is a two-dimensional graph. Its vertical axis represents sensitivity (TPR) and its horizontal axis represents 1 − specificity, that is, the false positive rate (FPR). The area under the ROC curve (AUC) is a scale defined on this graph and is used for comparing the performance of classifiers. It takes a value between zero and one. An AUC close to 0.5 means that the classifier performs no better than random guessing, which indicates weak performance.
Note that other evaluation criteria can also be used depending on the application [74,75]. For example, ML techniques can be used to automate tasks such as medical image segmentation. In this case, other scales, such as the Dice coefficient and the Jaccard index, can be used to evaluate machine learning models. For more details, refer to [76].
- Model Relevance: This factor evaluates mismatches between the model and the data, that is, overfitting and underfitting. If the available data are not sufficient, the model does not match the data well. A useful solution to this issue is cross-validation. However, it is not known exactly how much overfitting is acceptable for a learning model. Suitable methods for detecting overfitting have been presented in [77,78].
- Efficiency: This factor represents the prediction speed and the learning speed of a learning model. An efficiency problem occurs when the machine learning-based system performs the learning or prediction process very slowly. As a result, ML designers should consider the runtime of learning algorithms.
- Interpretability: Sometimes, learning models are used to decide on medical treatment. Humans must therefore understand the logic and reasoning behind the decisions taken by these models in order to trust them, so that the final models are socially acceptable. However, interpretability is difficult to define mathematically. For an introduction to the interpretability of ML models, refer to [79]. According to [80], interpretability means the user’s understanding of the decisions taken by ML. Various solutions have also been presented in [81,82,83,84] to evaluate the interpretability of a machine learning-based system.
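The correctness scales introduced in this section can all be derived from the four counts TP, TN, FP, and FN. The following sketch shows one way to compute them for a binary classifier using only the Python standard library. The function names are hypothetical, and two conventions are assumptions made for illustration: a ratio with a zero denominator is reported as 0.0, and the AUC is computed through its equivalent rank-based form, the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification result."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def _ratio(numerator, denominator):
    # Assumption: report 0.0 when a scale is undefined (zero denominator).
    return numerator / denominator if denominator else 0.0


def classification_scales(tp, tn, fp, fn):
    """Compute the correctness scales from the four confusion counts."""
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "ppv": _ratio(tp, tp + fp),
        "npv": _ratio(tn, tn + fn),
        "accuracy": _ratio(tp + tn, tp + tn + fp + fn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
        "fdr": _ratio(fp, fp + tp),
        "f1": _ratio(2 * tp, 2 * tp + fp + fn),
    }


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


counts = confusion_counts([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(counts)
print(classification_scales(*counts))
print(auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

In medical applications, reporting several of these scales together is usually more informative than accuracy alone, because datasets are often imbalanced between healthy and diseased samples.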
5. Investigating Several ML-Based Methods in Healthcare
In this section, we introduce some ML-based methods in medicine based on the framework provided in this paper and express their weaknesses and strengths. We also review the different sections of each method based on our proposed classification, including data pre-processing scheme, learning technique, evaluation method, and application.
5.1. An Integrated Model Based on LOG and RF
Qin et al. [101] suggested an ML-based method for the timely diagnosis of chronic kidney disease (CKD). First, the authors used the KNN imputation technique to estimate the missing values in the dataset. They also used optimal subset regression and random forest (RF) to reduce dimensionality and select the most suitable features in the dataset. Then, the learning model was designed using various classifiers. In the following, this learning model is described in detail. Table 8 and Table 9 present the most important characteristics of this ML-based model and its weaknesses and strengths, respectively.
Problem definition. Chronic kidney disease (CKD) is a serious disease that can threaten general health. ML-based methods can help to diagnose this disease in a timely and accurate manner. In the real world, most medical datasets have many missing values. The authors of [101] argue that existing CKD diagnosis methods either have low accuracy or use a constrained and weak technique to estimate the missing values. Therefore, they provided an ML-based model for CKD diagnosis. The purpose of this learning method is to increase diagnostic accuracy and improve its applicability.
Dataset. In [101], the CKD dataset available in the University of California Irvine (UCI) machine learning repository is used. This dataset contains 400 data points. Each data point has 24 features, including 11 numerical features and 13 nominal features. There are two final labels: CKD (250 data points) and NOTCKD (150 data points). Note that this dataset is relatively small, which limits the generalizability of this method.
Data pre-processing method. In [101], the KNN imputation method is applied to estimate the missing values in the dataset. For a data point with a missing value, this method selects the k data points without missing values that are closest to it, using the Euclidean distance as the similarity scale. There are two cases. If the missing value belongs to a numerical variable, it is estimated as the median of the k data points. If the missing value belongs to a nominal variable, it is obtained by majority voting among the k data points. In addition, this learning model uses a feature selection method based on optimal subset regression and RF to select the most beneficial features.
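The KNN imputation procedure can be sketched as follows. This is not the implementation used in [101]; it is a simplified illustration under stated assumptions: missing values are encoded as `None`, the Euclidean distance is computed only over the numerical features that are observed in the incomplete data point, features are not rescaled before the distance is computed, and the function name `knn_impute` and the example records are hypothetical.

```python
from collections import Counter
from math import sqrt
from statistics import median


def knn_impute(rows, numeric_cols, k=3):
    """Fill None entries using the k nearest data points without missing values."""
    complete = [r for r in rows if None not in r]
    imputed = []
    for row in rows:
        missing = [c for c, v in enumerate(row) if v is None]
        if not missing:
            imputed.append(list(row))
            continue
        # Assumption: distance uses only numerical features observed in this row.
        usable = [c for c in numeric_cols if row[c] is not None]

        def distance(other):
            return sqrt(sum((row[c] - other[c]) ** 2 for c in usable))

        neighbours = sorted(complete, key=distance)[:k]
        filled = list(row)
        for c in missing:
            values = [n[c] for n in neighbours]
            if c in numeric_cols:
                filled[c] = median(values)  # numerical variable: median
            else:
                filled[c] = Counter(values).most_common(1)[0][0]  # nominal: majority vote
        imputed.append(filled)
    return imputed


# Hypothetical records: [age, blood pressure, appetite]
records = [
    [50, 80, "good"],
    [52, 82, "good"],
    [55, 90, "poor"],
    [20, 60, "poor"],
    [51, None, "good"],
    [53, 84, None],
]
for record in knn_impute(records, numeric_cols=[0, 1], k=3):
    print(record)
```

Using the median rather than the mean makes the estimate of a numerical variable less sensitive to an outlier among the k neighbours.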
ML model development. In [101], a supervised learning scheme is used for predicting CKD. In the classification process, various classifiers are examined so that those with the best performance can be selected for designing the final model. These classifiers are: (1) logistic regression (LOG); (2) random forest (RF); (3) support vector machine (SVM); (4) K nearest neighbor (KNN); (5) Naïve Bayes (NB); (6) feed forward neural network (FNN). The authors then evaluate the performance of the different models based on several parameters such as accuracy, number of misjudgments, and runtime, among others. Finally, RF and LOG are selected to build the final integrated model.
Evaluation. This method uses a simulation-based evaluation. The authors used R 3.5.2 to simulate the CKD prediction model. To evaluate the learning model, the 4-fold cross-validation method is used. Finally, the learning model is evaluated according to various criteria such as accuracy, sensitivity, specificity, and F1 Score.
5.2. FCMIM-SVM
Li et al. [102] provided an ML-based system for detecting heart disease. They proposed a feature selection method called FCMIM. In addition, the authors examined different learning techniques, such as artificial neural networks (ANN), support vector machine (SVM), decision tree (DT), Naïve Bayes (NB), K nearest neighbor (KNN), and logistic regression (LR), for developing the final learning model. Finally, they created the final learning system, called FCMIM-SVM. In the following, we describe this ML-based method in detail. Table 8 and Table 9 summarize the most important characteristics of this ML-based method and its weaknesses and strengths, respectively.
Problem definition. Heart disease is a serious disease that threatens the lives of many people in the world. Traditional methods for detecting this disease are time-consuming, expensive, and inefficient. ML-based methods can therefore be very effective because they can detect heart disease in a fast, accurate, and low-cost manner. In addition, the performance of an ML-based scheme can be improved when a balanced dataset and an efficient feature selection scheme are used. To address these issues, the authors of [102] have provided an ML-based method and a feature selection approach to detect heart disease rapidly and accurately.
Dataset. FCMIM-SVM uses the Cleveland heart disease dataset. This dataset includes 303 data points, each of which has 75 features. Six data points have missing values and are removed during pre-processing. There are two classes for the final label: HD or Not-HD.
Data pre-processing method. FCMIM-SVM applies different data pre-processing techniques. For example, it removes data points with missing values from the dataset. It also normalizes the dataset using the Standard Scaler (SS) and the Min–Max Scaler. Furthermore, FCMIM-SVM designs a feature selection method called FCMIM for reducing dimensionality. Additionally, various feature selection algorithms, such as Relief [103], mRMR [104], LASSO [105], and LLBFS [106], are reviewed.
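The two normalization operations can be sketched as follows. The function names and the example values are hypothetical; the formulas are the usual definitions of standardization (zero mean and unit variance) and min–max scaling (rescaling to the interval [0, 1]), and we assume here that the population standard deviation is used.

```python
from statistics import mean, pstdev


def standard_scale(values):
    """Rescale a feature to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    # A constant feature has zero spread; map it to 0.0 to avoid division by zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def min_max_scale(values):
    """Rescale a feature linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]


# Hypothetical blood pressure readings for four patients.
blood_pressure = [120, 140, 160, 200]
print(min_max_scale(blood_pressure))
print(standard_scale(blood_pressure))
```

In practice, the scaling parameters (mean, standard deviation, minimum, and maximum) should be estimated on the training set only and then applied unchanged to the testing set, so that no information from the testing set leaks into training.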
ML model development. In [102], the authors first assessed different classifiers, namely ANN, SVM, DT, NB, KNN, and LR, to select the most appropriate classifier for developing the final learning model. The SVM classifier was selected because it achieved the highest accuracy (i.e., Accuracy = 92.37%). The resulting final learning model is called FCMIM-SVM.
Evaluation. FCMIM-SVM has been evaluated using a simulation-based scheme, which is implemented in Python. This method uses leave-one-subject-out cross-validation (LOSO) as the evaluation technique. In the evaluation process, the performance of FCMIM is compared with several feature selection approaches. According to the experimental results, the authors believe that FCMIM has a good performance. Then, FCMIM-SVM is evaluated based on various scales such as accuracy, specificity, sensitivity, MCC, and processing time.
5.3. CWV-BANN-SVM
Abdar and Makarenkov [107] offered an expert system for detecting breast cancer. This method uses an ensemble learning technique based on support vector machines and artificial neural networks. In this method, the optimal parameters of the SVM are determined through different experiments. The ensemble system includes two SVMs, a multi-layer perceptron (MLP), and a radial basis function (RBF) neural network. The performance of the neural networks is also improved using the boosting technique. In the following, we describe this learning model in detail. In addition, Table 8 and Table 9 express the main characteristics of the CWV-BANN-SVM method and its advantages and disadvantages, respectively.
Problem definition. Breast cancer is the most common cancer in the world, and its treatment is costly. ML-based solutions can reduce these costs and increase the accuracy of diagnosis. In general, learning methods reduce the diagnosis time and increase its accuracy. As a result, in [107], an ensemble learning method has been developed to diagnose breast cancer in a timely and accurate manner.
Dataset. In [107], the authors used the Wisconsin breast cancer dataset (WBCD). WBCD has 699 data points. Each data point has 10 features. There are two output labels: benign and malignant. There are 452 data points belonging to the benign class and 241 data points belonging to the malignant class.
Data pre-processing method. In the dataset, there are 16 data points with missing values that are removed in the data pre-processing process.
ML model development. To develop the learning model, the authors first tested a simple SVM with different parameters to find its most appropriate parameters. These parameters include the regularization parameter (C) and the gamma parameter (γ), among others. The authors believe that this improves the accuracy of the learning model and prevents overfitting. To design the final learning model, the authors performed four main steps. In the first step, they tested six classifiers: simple SVM, polynomial SVM, simple MLP, simple RBF, boosting MLP, and boosting RBF. According to the experimental results, the authors selected two polynomial SVMs, boosting MLP, and boosting RBF to design the final ensemble model. They also applied SVM-CPG to determine the importance of each feature in the dataset for detecting breast cancer. In the second step, data pre-processing is performed to remove data points with missing values. In the third step, the selected classifiers are re-evaluated on the modified dataset. In the final step, the authors created an ensemble classifier using two SVMs, boosting MLP, and boosting RBF. This ensemble system uses the confidence-weighted voting (CWV) technique.
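The idea behind confidence-weighted voting can be illustrated with a short sketch in which each classifier contributes a vote weighted by its confidence, and the label with the largest total weight wins. This is an assumption about the general form of such a scheme; the exact rule used in [107] and in IBM SPSS Modeler may differ. The function name and the confidence values are hypothetical.

```python
from collections import defaultdict


def confidence_weighted_vote(predictions):
    """Combine (label, confidence) pairs produced by several classifiers.

    Returns the winning label and its share of the total confidence.
    Assumes at least one prediction with a positive confidence.
    """
    totals = defaultdict(float)
    for label, confidence in predictions:
        totals[label] += confidence  # each vote counts in proportion to its confidence
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())


# Hypothetical outputs of four base classifiers for one patient.
votes = [("malignant", 0.90), ("benign", 0.55), ("benign", 0.60), ("malignant", 0.95)]
print(confidence_weighted_vote(votes))
```

In this example, simple majority voting would produce a tie (two votes per class), whereas the confidence-weighted scheme favors the class supported by the more confident classifiers.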
Evaluation. The CWV-BANN-SVM method uses a simulation-based evaluation. This scheme is simulated in IBM SPSS Modeler 14.2 software. The dataset is divided in two parts so that 50% is used for training and 50% is applied for testing. In the evaluation process, various criteria such as accuracy, sensitivity, specificity, precision, FPR, FNR, F1 Score, AUC, and Gini Index are considered.
5.4. Nested Ensemble Method (NE)
Abdar et al. [
108] introduced the nested ensemble (NE) method for automatically predicting breast cancer. NE is a two-layer scheme, which includes classifiers and meta-classifiers. In the following, we explain this method based on our proposed classification in this paper.
Table 8 summarizes the most important features of the NE method. Furthermore,
Table 9 describes its advantages and disadvantages.
Problem definition. Breast cancer is the most common cancer among women. There are some schemes such as mammography for detecting breast cancer, but they are not sufficiently accurate. In addition, physicians and specialists such as radiologists, hematologists, and pathologists must cooperate with each other to achieve a precise diagnosis of the disease, which is very time-consuming. Therefore, ML-based models can be very beneficial for detecting this disease accurately and rapidly. In [
108], an ML-based method was presented to automatically diagnose breast cancer. The purposes of this method are to improve accuracy and reduce the time required for detecting malignant tumors.
Database. In [
108], NE uses the breast cancer Wisconsin diagnostic database (WDBC). This database includes 256 data samples. Each data sample has 32 features. There are two output labels, including benign and malignant.
Data pre-processing method. In this scheme, a feature selection method has been used to reduce dimensionality. In this process, 10 useful features are selected for detecting breast cancer. Note that the authors do not state which feature selection method is used in NE, so this step remains ambiguous.
ML model development. To design the NE method, several ensemble learning techniques and some basic algorithms are used. The basic algorithms used in this method include Bayesian network (BN), Naïve Bayes (NB), Stochastic gradient descent (SGD), J48, REP-Tree, and logistic model trees (LMT). In general, NE includes classifiers and meta-classifiers. The meta-classifier includes two or more different classifiers. To develop the final learning model, four nested ensemble learning models are created using stacking and voting techniques (SV). These NEs are:
SV-BayesNet-2MetaClassifier: BN + LMT + SGD + 2-Metaclassifier (SGD + J48)
SV-Naïve Bayes-2MetaClassifier: NB + LMT + SGD + 2-Metaclassifier (SGD + J48)
SV-BayesNet-3MetaClassifier: BN + LMT + SGD + 3-Metaclassifier (SGD + J48 + REPTree)
SV-Naïve Bayes-3MetaClassifier: NB + LMT + SGD + 3-Metaclassifier (SGD + J48 + REPTree)
Then, these NEs are tested based on different experiments. According to the experimental results, the authors selected SV-Naïve Bayes-3MetaClassifier as their final learning model.
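The nesting idea, a meta-classifier that itself acts as one member of a larger ensemble, can be sketched structurally as follows. This is only an illustration under strong assumptions: the `Threshold` class is a hypothetical toy stand-in for the real base algorithms (BN, NB, SGD, J48, REP-Tree, LMT), the stacking layer is reduced to a lookup table learned from the members' outputs, and the data are invented. It does not reproduce the WEKA configuration used in [108].

```python
from collections import Counter


class Threshold:
    """Toy base classifier (hypothetical): predicts 1 when one feature exceeds a cut-off."""

    def __init__(self, index, cut):
        self.index, self.cut = index, cut

    def predict(self, x):
        return 1 if x[self.index] > self.cut else 0


class Vote:
    """Voting ensemble: returns the label predicted by most members."""

    def __init__(self, members):
        self.members = members

    def predict(self, x):
        return Counter(m.predict(x) for m in self.members).most_common(1)[0][0]


class Stack:
    """Simplified stacking: learns which label goes with each pattern of member outputs."""

    def __init__(self, members):
        self.members = members
        self.table = {}

    def _key(self, x):
        return tuple(m.predict(x) for m in self.members)

    def fit(self, X, y):
        counts = {}
        for x, label in zip(X, y):
            counts.setdefault(self._key(x), Counter())[label] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}
        return self

    def predict(self, x):
        key = self._key(x)
        if key in self.table:
            return self.table[key]
        # Unseen output pattern: fall back to a majority vote of the members.
        return Counter(key).most_common(1)[0][0]


# Invented two-feature data; the label is 1 when the two features sum to more than 1.
X = [(0.1, 0.2), (0.9, 0.8), (0.4, 0.7), (0.8, 0.3),
     (0.2, 0.9), (0.7, 0.9), (0.3, 0.1), (0.6, 0.6)]
y = [0, 1, 1, 1, 1, 1, 0, 1]

# A "3-MetaClassifier" built by voting, nested inside a stacked ensemble of three classifiers.
meta = Vote([Threshold(1, 0.25), Threshold(0, 0.25), Threshold(1, 0.75)])
nested = Stack([Threshold(0, 0.5), Threshold(1, 0.5), Threshold(0, 0.75), meta]).fit(X, y)
predictions = [nested.predict(x) for x in X]
print(predictions)
```

The point of the sketch is the composition: the outer ensemble treats the inner voting ensemble exactly like any other classifier.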
Evaluation. In [
108], the authors used a simulation-based evaluation. They implemented the NEs in WEKA 3.9.1. To evaluate these methods, 3-, 5-, and 10-fold cross-validation has been used. The NEs are evaluated based on different criteria, including accuracy, precision, recall, F1 score, ROC, and processing time.
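As a reminder of how k-fold cross-validation partitions a dataset, the following minimal sketch generates the training and testing indices for each fold. It assumes contiguous folds without shuffling or stratification, which may differ from the WEKA defaults; the sample count of 256 is the one stated above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for k in (3, 5, 10):
    sizes = [len(test) for _, test in k_fold_indices(256, k)]
    print(k, sizes)
```

Each sample is used for testing exactly once, so the k accuracy values can be averaged into a single estimate.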
5.5. HMANN
Ma et al. [
109] suggested an improved neural network called HMANN. This scheme is used for detecting, segmenting, and identifying chronic renal failure. HMANN is implemented on the Internet of Medical Things (IoMT) platform. This method combines support vector machine (SVM), multi-layer perceptron (MLP), and backpropagation algorithm (BP). In the following, we explain HMANN in detail. Moreover,
Table 8 provides the most important characteristics of HMANN and
Table 9 expresses its weaknesses and strengths.
Problem definition. When the kidneys do not work well, human life can be threatened. Therefore, it is very important to detect kidney stones in time. Digital images often have low contrast and are highly noisy, so it is very difficult to use them for detecting kidney abnormalities. Artificial neural networks are among the most common tools for solving this problem because they are fault-tolerant, generalize easily, and have a suitable learning ability. Therefore, in [
109], a neural network-based system has been developed.
Database. The authors use images in the UCI chronic kidney disease dataset to train and test HMANN. However, they do not describe this database: neither the number of images nor their type is mentioned.
Data pre-processing method. As mentioned earlier, digital images often have noise and low contrast, which makes them difficult to evaluate. In HMANN, the authors reduce noise by thresholding wavelet coefficients. In general, the images are pre-processed in three steps to overcome low contrast and noise: (1) Rebuilding images using a level set method; (2) Sharpening or smoothing using a Gabor filter; (3) Improving contrast using histogram equalization. In addition, a specialist physician manually segments the normal and abnormal digital images. Then, HMANN applies a feature extraction process based on the gray-level co-occurrence matrix (GLCM) to these segmented regions to extract features related to the disease. These features include adaptive, Haralick, and histogram features. Finally, a feature selection process selects nine features.
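To make the GLCM step concrete, the following minimal sketch computes a normalized co-occurrence matrix for horizontally adjacent pixels and derives one Haralick-style feature (contrast) from it. The 4 × 4 image, the four gray levels, and the offset are illustrative assumptions; the settings used in [109] are not given.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[value / total for value in row] for row in counts]


def contrast(p):
    """Haralick contrast: large when neighboring pixels differ strongly."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))


# Small illustrative image with gray levels 0..3.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(image, levels=4)
print(round(contrast(matrix), 4))
```

Other texture features, such as energy or homogeneity, are computed from the same matrix with different weightings.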
ML model development. In [
109], the final learning model is built based on three main components, including SVM, MLP, and BP. The final learning model is called HMANN. The purpose of HMANN is to classify digital images modified in the previous step, identify kidney stones, and accurately detect their location.
Evaluation. HMANN uses simulation-based evaluation. This method is simulated and evaluated through various experiments to determine its efficiency. However, the authors do not explain the simulation tool, training set, testing set, and other simulation parameters. HMANN is evaluated based on various criteria such as prediction rate, AUC, accuracy, computational time, and ROC.
5.6. SRL-RNN
Wang et al. [
110] proposed an ML-based model called SRL-RNN. This scheme uses reinforcement learning and a recurrent neural network (RNN). The purpose of SRL-RNN is to solve the dynamic treatment regime (DTR) problem. The main idea of this method is to combine two signals simultaneously: the indicator signal and the evaluation signal. In the following, we describe SRL-RNN in detail. The most important features of SRL-RNN are represented in
Table 10. Furthermore,
Table 11 expresses its strengths and weaknesses.
Problem definition. Many researchers have studied drug recommendation systems that help physicians make better decisions. These systems can be designed using supervised or reinforcement learning algorithms. Supervised systems utilize similarities between patients to produce recommendations. However, these methods cannot directly learn the relationship between illness and drugs, and they depend on the ground truth. Moreover, there is no answer to this question:
how is this ground truth created? In this case, they work based on the indicator signal. Reinforcement learning-based systems do not have this problem and work based on the evaluation signal. However, because no supervisor controls them, they may present treatment recommendations that differ strongly from the prescription recommended by the physician, which can increase the treatment risk. Therefore, the authors of [
110] combine supervised learning and reinforcement learning to produce a new model called SRL-RNN. This method can avoid unacceptable risks and infer an optimal dynamic treatment.
Database. The authors utilize a large, publicly available database called MIMIC-3 v1.4 to evaluate SRL-RNN. This database includes information about 43,000 patients in intensive care units (ICUs), collected from 2001 to 2012. It contains information about 6695 specific diseases and 4127 drugs.
Data preprocessing method. In [
110], a data point with missing values in more than 10 features is removed from the database. When a data point has only a small number of missing values, these values are estimated using the k-nearest neighbors (KNN) method.
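The two rules above can be sketched as follows. The threshold of 10 missing features comes from the text; everything else is an assumption made for illustration: missing values are encoded as `None`, neighbors are searched only among complete rows using the Euclidean distance over the observed features, `k = 2`, and the toy rows are invented.

```python
import math

MAX_MISSING = 10  # threshold stated in the text


def clean(rows, k=2):
    """Drop rows with more than MAX_MISSING missing values; KNN-impute the rest.

    Assumes at least one complete row exists to serve as a neighbor.
    """
    kept = [r for r in rows if sum(v is None for v in r) <= MAX_MISSING]
    complete = [r for r in kept if None not in r]
    result = []
    for row in kept:
        if None not in row:
            result.append(list(row))
            continue
        observed = [i for i, v in enumerate(row) if v is not None]
        # Rank complete rows by distance on the features this row actually has.
        neighbours = sorted(
            complete,
            key=lambda c: math.dist([row[i] for i in observed], [c[i] for i in observed]),
        )[:k]
        result.append([
            v if v is not None else sum(n[i] for n in neighbours) / len(neighbours)
            for i, v in enumerate(row)
        ])
    return result


rows = [
    [1.0] * 12,
    [2.0] * 12,
    [10.0] * 12,
    [1.5] * 11 + [None],   # one missing value: imputed
    [None] * 11 + [3.0],   # eleven missing values: removed
]
cleaned = clean(rows)
print(len(cleaned), cleaned[3][-1])
```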
ML model development. In [
110], the authors presented a deep architecture called SRL-RNN for managing a DTR that involves several diseases and different prescriptions. The aim is to learn the prescriptive policy by combining the indicator signal and the evaluation signal. SRL-RNN includes three main networks: (1) Actor network for recommending drugs in a time-variant manner based on the dynamic status of patients. In this process, the doctor’s decisions play the role of an indicator signal. This means that there is a supervisor to ensure safe actions and speed up the learning process; (2) Critic network for assessing the action of the actor network to reward or penalize the recommended treatment; (3) LSTM network for enabling SRL-RNN to handle a partially observable Markov decision process (POMDP). It summarizes the past observations to produce a more complete observation. Note that LSTM is one of the best-known recurrent neural networks (RNNs) and is regarded as a deep neural network.
Evaluation. SRL-RNN uses both evaluation methods, i.e., simulation-based and practical implementation-based evaluation. In the practical implementation, the prescriptions produced by this method are evaluated for two patients in the ICU. Note that the authors do not mention the software used to simulate this method. The dataset is divided into three groups: the training set (80% of the dataset), the validation set (10%), and the testing set (10%). In [
110], the mortality rate is used as an evaluation metric to assess the effect of this method on reducing mortality. The Jaccard coefficient is used to measure the agreement between the prescriptions recommended by SRL-RNN and those produced by the physician.
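The Jaccard coefficient between two prescriptions is the size of their intersection divided by the size of their union. A minimal sketch follows; the drug names are hypothetical examples, and treating two empty prescriptions as identical is a convention chosen here.

```python
def jaccard(prescribed_by_model, prescribed_by_physician):
    """Jaccard coefficient between two sets of drugs (1.0 means identical sets)."""
    a, b = set(prescribed_by_model), set(prescribed_by_physician)
    if not a and not b:
        return 1.0  # convention: two empty prescriptions agree completely
    return len(a & b) / len(a | b)


# Hypothetical prescriptions for one patient at one time step.
model = {"heparin", "insulin", "furosemide"}
physician = {"heparin", "furosemide", "vancomycin"}
score = jaccard(model, physician)
print(score)
```

Here two of the four distinct drugs appear in both prescriptions, so the coefficient is 0.5.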
5.7. A Closed-Loop Healthcare Processing Scheme
Dai et al. [
111] simulated the human body using deep neural networks (DNNs) and utilized deep reinforcement learning (DRL) to find suitable treatment schemes for the simulated body. In this method, the simulated body plays the role of a patient and DRL plays the role of a physician. In the following, we describe this scheme in detail. Furthermore,
Table 10 expresses the main characteristics of this method and
Table 11 presents its advantages and disadvantages.
Problem definition. In healthcare, the human body must be monitored continuously so that the corresponding treatments can be performed in time. However, it is not acceptable to perform unauthorized tests on the human body. Therefore, it is necessary to design a virtual human body. However, the human body is a very complex system, and despite the great progress of modern science, it cannot be imitated completely. A solution is to consider the body as a black box and to interpret its output data in response to input data, that is, to follow a data-driven approach. DNN is a useful tool for modeling the human body because it has a universal approximation capability. Therefore, in [
111], DNN is used to simulate the human body.
Database. In [
111], the authors use a database of 990 tongue images to train a deep neural network (DNN). These images include 9 different structures. Note that the authors do not describe the database in detail.
Data pre-processing method. There is no pre-processing method in this scheme.
ML model development. The learning model presented in [
111] includes two main components: simulated body and treatment part. The simulated body consists of two main parts, including regulating network and decoding network. The regulating network is tasked to show the effect of treatment on the health status. Furthermore, the decoding network is tasked to transform a space with low dimensions (i.e., the health status) into a space with high dimensions. In [
111], LSTM has been used as a deep learning method for simulating the human body. In [
111], the conceptual alignment deep auto-encoder (CADAE) has been used as the decoding network. The second component, i.e., the treatment part, is responsible for receiving observations and producing therapeutic recommendations. This component dynamically interacts with the simulated body. It has two main parts: disease diagnosis and proper therapeutic recommendation. In [
111], the authors used a deep reinforcement learning (DRL) scheme to merge these two parts. In this regard, they used a deep Q-network (DQN) for discrete action spaces and the deep deterministic policy gradient (DDPG) for continuous action spaces.
Evaluation. This method uses a simulation-based evaluation. The scheme is simulated using TensorFlow in Python. The simulated body is trained using CADAE. The method is evaluated in terms of convergence rate and misdiagnosis rate. Note that the authors present the experimental results only in graph form. As a result, we do not report numerical results for this scheme.
5.8. GAN + RAE + DQN
Tseng et al. [
112] provided a deep reinforcement learning scheme for making treatment decisions. This method includes three components: (1) GAN for generating artificial data based on a small dataset; (2) Transition DNN for constructing the virtual radiotherapy environment; (3) DQN for determining the optimal radiation dose for the radiotherapy treatment process. In the following, we describe this method in detail. In addition, we present the most important specifications of this method in
Table 10.
Table 11 describes its strengths and weaknesses.
Problem definition. Usually, doctors believe that surgery is not a suitable option for treating non-small-cell lung cancer (NSCLC) patients and that it is better to treat them using radiotherapy. Although this technology is progressing every day, its treatment results are not yet satisfactory. One option is to increase the radiation dose to enhance the treatment process. However, this can increase radiation-induced inflammation and reduce the patients’ quality of life. This research tries to answer the following question: “
Can machine learning algorithms determine the optimal radiation dose based on patient features to control tumors locally and minimize inflammation?” In recent years, deep reinforcement learning has been successfully used in various areas because it can extract high-level features directly from raw data. Therefore, in [
112], DQN is used to determine the radiation dose in radiotherapy.
Database. This research uses a database of 114 NSCLC patients. Note that each data sample consists of 297 features. For more details, please refer to [
112].
Data pre-processing method. In [
112], the authors use a feature selection scheme to select nine important features for simulating the radiotherapy environment. For this purpose, Bayesian network graph theory is used to hierarchically determine the relationships between the features and the desired output. This scheme tries to find the minimum set of features for controlling the tumor locally and reducing radiation-induced inflammation.
ML model development. In [
112], the authors designed an artificial radiotherapy environment, which is simulated by the transition DNN algorithm. Because the available database is very small, they used GAN along with the transition DNN algorithm: GAN, which is a deep neural network, can produce artificial data very similar to real data. Then, the transition DNN algorithm is trained on both real and artificial data to simulate the radiotherapy environment. Next, DQN interacts with this simulated environment to imitate the doctor’s decision and determine the radiation dose for each patient.
Evaluation. This method uses simulation-based evaluation. MATLAB is used for the feature selection process, with AUC as the evaluation metric and 10-fold cross-validation as the evaluation procedure. The final learning model is implemented in TensorFlow. As mentioned earlier, there are 114 data samples in the database. GAN uses this database to produce 4000 artificial data samples, so the total number of data samples (real and artificial) is 4114. The DNN algorithm is then trained on this new database, with the average accuracy as the evaluation criterion. Finally, the DQN algorithm is executed on 34 patients in the UMCC protocol. In this case, the evaluation metric is the root mean square error (RMSE), which is approximately 0.76.
5.9. HQLA
Khalilpourazari and Hashemi [
113] offered a reinforcement learning-based algorithm called HQLA. This algorithm uses the Quebec database to predict the prevalence of COVID-19. It combines two techniques: reinforcement learning and evolutionary algorithms. In the following, we describe this method in detail.
Table 10 summarizes the most important features of this method. Furthermore,
Table 11 expresses its advantages and disadvantages.
Problem definition. Modeling and predicting the course of the COVID-19 epidemic can help healthcare specialists stop its spread. However, predicting the COVID-19 prevalence is very challenging due to its unclear and complex nature. Metaheuristic algorithms are very flexible and efficient. They can solve many problems in healthcare because they reduce computational costs and time complexity, and they can efficiently search for optimal solutions. In addition, reinforcement learning algorithms can solve many real-world issues, especially in healthcare. Accordingly, in [
113], the authors combine the metaheuristic algorithms and reinforcement learning to predict the coronavirus pandemic.
Database. The dataset comes from Quebec, one of Canada’s provinces. It includes data samples related to COVID-19 cases and the mortality rate recorded from 25 June to 19 July 2020. This database includes 63,713 data samples related to COVID-19 patients and 5770 data samples related to deaths due to COVID-19.
Data pre-processing method. In [
113], there is no data pre-processing process.
ML model development. This method (HQLA) combines reinforcement learning and evolutionary algorithms. This scheme can solve complex optimization problems in a short time. HQLA uses various evolutionary algorithms such as GWO [
114], SCA [
115], MFO [
116], PSO [
117], WCA [
118], and SFS [
119] to update the particle position in the solution space. Q-learning is used to select the best operator (evolutionary algorithm) in the optimization process to obtain the best efficiency. Q-learning starts with several random operations. Then, it evaluates the efficiency of each operator in each step. This helps Q-learning to learn the best operations for reaching the best solution. If an operator improves the quality of the final solution, Q-learning rewards this operator. Otherwise, it penalizes the current operator.
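The reward-and-penalty mechanism described above can be sketched as a small Q-learning loop that chooses among search operators. This is a simplified illustration, not the HQLA implementation: it assumes a single state, an epsilon-greedy choice, a reward of +1 or -1, and three hypothetical toy operators in place of GWO, SCA, MFO, PSO, WCA, and SFS. The objective is a simple sphere function rather than the prediction error used in [113].

```python
import random


def sphere(x):
    """Toy objective to minimize (assumption; not the paper's objective)."""
    return sum(v * v for v in x)


# Hypothetical stand-ins for the evolutionary operators.
def small_step(x, rng):
    return [v + rng.gauss(0, 0.1) for v in x]


def large_step(x, rng):
    return [v + rng.gauss(0, 1.0) for v in x]


def shrink(x, rng):
    return [v * rng.uniform(0.5, 1.0) for v in x]


OPERATORS = [small_step, large_step, shrink]


def optimise(steps=200, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    best = [rng.uniform(-5, 5) for _ in range(3)]
    start_cost = best_cost = sphere(best)
    q = [0.0] * len(OPERATORS)  # one Q-value per operator (single state)
    for _ in range(steps):
        # Epsilon-greedy: mostly pick the operator with the highest Q-value.
        if rng.random() < epsilon:
            action = rng.randrange(len(OPERATORS))
        else:
            action = max(range(len(OPERATORS)), key=lambda a: q[a])
        candidate = OPERATORS[action](best, rng)
        cost = sphere(candidate)
        improved = cost < best_cost
        if improved:
            best, best_cost = candidate, cost
        reward = 1.0 if improved else -1.0  # reward or penalize the operator
        q[action] += alpha * (reward + gamma * max(q) - q[action])
    return start_cost, best_cost, q


start_cost, final_cost, q_values = optimise()
print(start_cost, final_cost, q_values)
```

Operators that frequently improve the solution accumulate higher Q-values and are therefore selected more often in later steps.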
Evaluation. HQLA uses simulation-based evaluation. Note that the authors do not mention the software used to implement this method. In the evaluation process, the mean square error is considered as the objective function. Its optimal value is equal to 6.26
. The authors also presented several graphs, including the convergence rate and a comparison between predicted and actual data. The evolutionary algorithms have been evaluated in terms of various parameters, which is outside the scope of this paper. For more details, please refer to [
113].
5.10. tVAE
Baucum et al. [
120] introduced the transitional variational auto-encoder (tVAE). It learns the disease progression by mapping a patient’s current state to the state at the next time point. In the following, we present this method in detail.
Table 10 lists some features of tVAE.
Table 11 presents its advantages and disadvantages.
Problem definition. Reinforcement learning (RL) is a useful tool for developing a personalized treatment regime. For ethical reasons, RL agents cannot directly interact with real patients. Two solutions to this issue are: (1) Training the model using the existing dataset (Off-policy RL); (2) Learning a virtual environmental model using the available dataset (On-policy RL). In [
120], the authors presented a deep reinforcement learning method called tVAE. This scheme is based on the on-policy technique. tVAE seeks to learn the disease model accurately.
Database. In [
120], the authors used the MIMIC database. It includes information about 2067 patients in the ICU. In this database, patient parameters such as heparin dose and aPTT have been measured every hour. Note that, in this dataset, 42.4% of patients are women. The mean age of the patients is 70.4 years and their average weight is 173 lb.
Data pre-processing method. The MIMIC database includes missing values. In [
120], the sample-and-hold interpolation method is used to determine the missing values related to the heparin dosage. An artificial neural network is used for estimating the missing values corresponding to the aPTT parameter. Note that the authors have normalized all variables in the dataset, but they do not mention the normalization method.
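Sample-and-hold interpolation simply carries the last observed value forward until a new measurement arrives. A minimal sketch, assuming that missing hourly values are encoded as `None` and that values before the first measurement are left missing; the dose values are invented.

```python
def sample_and_hold(series):
    """Replace each missing value with the most recent observed value."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical hourly heparin doses with gaps.
doses = [None, 500, None, None, 750, None]
filled = sample_and_hold(doses)
print(filled)
```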
ML model development. The tVAE method uses the standard VAE structure for simulating transitions between successive patient states. The purpose is to model a virtual patient environment in which the prescriptive policy can be learned. tVAE trains an artificial neural network that receives the continuous latent states as input and produces an output. This method can consider a continuous disease space and introduce randomness into the model. tVAE is suitable for medical time series. After the virtual patient environment has been designed, an on-policy reinforcement learning algorithm called A3C is used to learn the best heparin dose.
Evaluation. In [
120], tVAE uses simulation-based evaluation. This method is simulated in TensorFlow. In the evaluation process, the dataset is divided into two parts: training set (85% of the data samples) and testing set (15% of the data samples). In addition, the evaluation criterion is the mean absolute error (MAE).
5.11. TE-DLSTM
Zhu et al. [
121] presented a semi-supervised learning method called TE-DLSTM to identify body activities using inertial sensors. This method uses a deep long short-term memory network (DLSTM) to extract high-level features. In the following, we explain TE-DLSTM in detail.
Table 12 and
Table 13 represent the most important characteristics of this method and its advantages and disadvantages, respectively.
Problem definition. Human activity recognition (HAR) is a very important issue for informatics applications, especially healthcare. For example, when users use smartphone applications, HAR helps us to understand their behavior. In fact, HAR discovers their health status and presents high-quality health recommendations. However, a challenging issue is that we deal with unlabeled data when designing the HAR system. One effective solution for this issue is semi-supervised learning. Today, many methods use semi-supervised learning techniques to identify body activity. However, they can only extract low-level and simple features and do not have an acceptable performance. Accordingly, in [
121], a DLSTM-based method is presented for designing HAR to extract high-level features.
Database. In [
121], the authors used the UCI database, which includes time-series samples collected from 30 people aged between 19 and 48 years. Each time series is segmented using overlapping windows of 2.56 s. The total number of samples is 10,000. Note that in this database, each data sample has 561 features.
Data pre-processing method. In [
121], the authors perform a simple feature extraction process on the database to extract some simple statistical features such as maximum, minimum, mean, and variance. Then, these low-level features are fed into the neural network to learn high-level features. Note that the final learning model is also a feature extraction method for extracting high-level features from the database.
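The windowing and the simple statistical features can be sketched as follows. The window length of 2.56 s and the four statistics come from the text; the sampling rate of 50 Hz and the 50% overlap are assumptions made here, and the signal is synthetic.

```python
from statistics import mean, pvariance


def sliding_windows(signal, size, step):
    """Split a signal into overlapping windows of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]


def simple_features(window):
    """Low-level statistical features of one window."""
    return {
        "max": max(window),
        "min": min(window),
        "mean": mean(window),
        "variance": pvariance(window),
    }


RATE_HZ = 50                      # assumed sampling rate
size = round(2.56 * RATE_HZ)      # samples per 2.56 s window
step = size // 2                  # assumed 50% overlap

signal = list(range(256))         # synthetic sensor readings
windows = sliding_windows(signal, size, step)
features = [simple_features(w) for w in windows]
print(len(windows), features[0])
```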
ML model development. The database used for designing the learning model includes both labeled and unlabeled data. In the first step, an augmentation technique enlarges the database. This technique acts as a regularizer by introducing randomness. Then, the authors extract simple features from the dataset, and DLSTM is trained on these low-level features. Dropout also acts as a regularizer to enhance the generalization ability of DLSTM. In the next step, the cross-entropy loss is used to measure the supervised learning loss, i.e., the difference between the ground truth and the predicted label. The squared loss is used to measure the unsupervised learning loss by comparing the predicted output with the previous ensemble output. Finally, the total loss is calculated as a combination of the supervised and unsupervised learning losses, and the deep learning parameters are obtained using back-propagation.
Evaluation. TE-DLSTM uses simulation-based evaluation. It is implemented in Python. In the simulation process, the dataset is divided into two groups: a training set (70% of data samples) and a testing set (30% of data samples). In this method, the evaluation criteria are accuracy and runtime.
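The combination of the two losses can be written down in a few lines. This sketch assumes that the supervised loss is averaged over the labeled samples only, that the squared loss is averaged over all samples, and that the two are combined as a weighted sum with a hypothetical weight; the weighting actually used in [121] is not stated here. The probability vectors are invented.

```python
import math


def cross_entropy(probs, label):
    """Supervised loss: negative log-probability of the true class."""
    return -math.log(probs[label])


def squared_loss(probs, ensemble_probs):
    """Unsupervised loss: mean squared difference to the previous ensemble output."""
    return sum((p - q) ** 2 for p, q in zip(probs, ensemble_probs)) / len(probs)


def total_loss(batch, weight):
    """batch: list of (probs, ensemble_probs, label), with label None if unlabeled."""
    supervised = [cross_entropy(p, y) for p, _, y in batch if y is not None]
    unsupervised = [squared_loss(p, z) for p, z, _ in batch]
    supervised_term = sum(supervised) / len(supervised) if supervised else 0.0
    return supervised_term + weight * sum(unsupervised) / len(unsupervised)


batch = [
    ([0.5, 0.5], [0.5, 0.5], 0),      # labeled sample
    ([0.9, 0.1], [0.7, 0.3], None),   # unlabeled sample
]
loss = total_loss(batch, weight=1.0)
print(round(loss, 4))
```

The unlabeled sample contributes only through the consistency term, which is how the unlabeled data influence training.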
5.12. SS-BLSTM
Gupta et al. [
122] presented a recurrent neural network-based method called SS-BLSTM. The purpose of this semi-supervised approach is to extract mentions related to adverse drug reaction (ADR) from Twitter. In the following, we explain this method.
Table 12 and
Table 13 represent the most important features of the SS-BLSTM method and its weaknesses and strengths, respectively.
Problem definition. Due to easy and broad access, social networks are known as a beneficial platform for sharing health information and are an appropriate option for monitoring health status. In [
122], the authors try to discover mentions related to ADR on Twitter. This is very challenging because these texts are informal and brief. Many supervised learning methods have been presented for this purpose. However, their performance is unsatisfactory because not enough labeled data samples are available. Recently, new methods have used deep neural networks, especially LSTM, to solve this issue. However, they need a large database for the training process to avoid overfitting. Accordingly, in [
122], the authors presented a semi-supervised method, which uses both labeled and unlabeled data.
Database. In [
122], the authors used the ADR dataset collected from Twitter for the supervised learning phase. This database was collected from 2007 to 2010. It includes 645 tweets that mention 81 drugs. The unlabeled dataset is produced using Twitter’s Search API and includes 0.1 million tweets.
Data pre-processing method. In [
122], a data normalization process is performed on the dataset to remove some words, symbols, and spaces.
ML model development. SS-BLSTM has two main steps: (1) The unsupervised learning step. The main task is to extract the drug name from tweets using an unsupervised learning scheme. For this, a bi-LSTM is trained. In this step, its weights are updated. Finally, these weights are maintained for the second step; (2) The supervised learning phase. The main task is to extract ADR from tweets using a supervised method. In this phase, the bi-LSTM model, which has been trained in the first step, is trained again to learn the labels mentioned in the tweet text.
Evaluation. SS-BLSTM uses simulation-based evaluation. It is implemented in Python. To evaluate the performance of this method, the labeled database is divided into two sets: training (470 tweets) and testing (170 tweets). In the evaluation process, various criteria, including F1 score, precision, and recall, are used.
5.13. ECG Classification System Based on Semi-Supervised Learning
Zhai et al. [
123] suggested a semi-supervised learning system to classify electrocardiogram (ECG) signals. The purpose of the classification is to detect arrhythmia. This learning problem involves classifying time-series signals with imbalanced classes. It has three classes: normal beats, supraventricular ectopic beats (SVEB), and ventricular ectopic beats (VEB). The purpose of this scheme is to diagnose SVEB and VEB without labeling ECG data. Note that the authors use a two-dimensional convolutional neural network (CNN) in this scheme. In the following, we describe this scheme in detail. Moreover,
Table 12 and
Table 13 present the specifications of this system and its advantages and disadvantages, respectively.
Problem definition. Electrocardiogram (ECG) is a useful tool to detect arrhythmia. However, ECG interpretation is difficult and time-consuming and requires expertise, whereas collecting ECG data is relatively simple. Therefore, it is necessary to design an automatic ECG classification system. Today, there are many techniques for classifying time series, but their performance is not acceptable because not enough labeled data are available. Therefore, the combination of unlabeled and labeled data can improve the performance of an ECG classifier. As a result, the authors of [
123] select semi-supervised learning for designing such a system.
Dataset. In [
123], two datasets are used for modeling this system: (1) The MIT-BIH arrhythmia database, which includes 48 ECG records from 47 people. Each record contains 30 min of ECG data. The label of each record is determined by an expert; (2) An unlabeled database, in which most data samples are normal beats. This allows the classifier to learn the characteristics of normal beats.
Data pre-processing method. In [
123], a data normalization process is performed on the dataset.
ML model development. This learning model has three main steps. In the first step, an unsupervised learning process is used to accurately detect normal beats in the unlabeled ECG data. In the second step, the CNN classifier is trained using the MIT-BIH dataset and the normal beats estimated in the first step. In the third step, a semi-supervised process updates the labels produced by the CNN to improve its performance.
Evaluation. This method uses simulation-based evaluation. It is simulated in MATLAB. In the evaluation process, the MIT-BIH database is divided into two parts: the training set (22 records) and the testing set (22 records). The evaluation criteria are accuracy, sensitivity, specificity, PPR, and F1 score.
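Most of these criteria follow directly from the confusion matrix of one class against the rest. The sketch below uses invented counts and assumes that PPR denotes the positive predictive value (precision); for the three-class problem, the same computation would be repeated once per class (e.g., SVEB versus all other beats).

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from the counts of a binary confusion matrix."""
    sensitivity = tp / (tp + fn)              # recall: detected positives
    specificity = tn / (tn + fp)              # correctly rejected negatives
    precision = tp / (tp + fp)                # positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Invented counts for one class versus the rest.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print({name: round(value, 4) for name, value in metrics.items()})
```

With imbalanced classes, accuracy alone can be misleading, which is why sensitivity and specificity are reported alongside it.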
5.14. A Deep Learning Model for Segmenting Retinal Fundus Images
Bengani et al. [
124] offered a deep learning model for segmenting the optic disk in retinal images. This method uses two learning techniques, including semi-supervised learning and transfer learning. In the following, we explain this method in summary. In addition,
Table 12 and
Table 13 represent the main characteristics of this method and its advantages and disadvantages, respectively.
Problem definition. Ophthalmologists use retinal images to detect eye diseases such as retinopathy. The location connecting the optic nerve to the retina is called the optic disk (OD). Detecting the optic disk in retinal images is very challenging and time-consuming. Therefore, computer diagnostic systems are very useful tools to segment and measure the OD. The purpose of this system is to automatically detect the OD for providing proper and timely treatment services. Today, deep learning models, especially artificial neural networks such as CNN, have been used to do this work. These networks have a very good learning ability. However, they need a large database for training to avoid overfitting. On the other hand, the available retinal image databases are very small. In [
124], the authors attempt to overcome these problems using semi-supervised learning and transfer learning.
Dataset. In [
124], the authors use three databases: (1) Kaggle’s diabetic retinopathy database. The authors employ this labeled dataset for training the auto-encoder network. It includes 88,702 retinal images; (2) DRISHTI GS1 database. The authors use this dataset for the segmentation network. It includes 101 retinal images. The authors divide this dataset into two parts: the training set (50 images) and the testing set (50 images); (3) RIM-ONE database. This database includes 159 retinal images, in which experts have segmented the OD. The segmentation network also utilizes this dataset.
Data pre-processing method. In the first step, the auto-encoder network and the segmentation network perform a two-phase data pre-processing scheme. In the first phase, image size is changed. The purpose of this phase is to normalize images and adjust their size. In the second phase, data augmentation is performed. The purpose of this phase is to increase the number of instances. This work is performed using different transformations on the input image.
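The two phases described above can be illustrated with a minimal sketch. It assumes that an image is stored as a nested list of pixel values, that resizing uses nearest-neighbor sampling, and that the transformations are flips and a 180-degree rotation. The function names and the choice of transformations are illustrative assumptions and are not taken from [124].

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2-D image (list of rows) with nearest-neighbor sampling."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def augment(image):
    """Return the image plus three transformed copies (hypothetical choice)."""
    h_flip = [row[::-1] for row in image]          # mirror left-right
    v_flip = image[::-1]                           # mirror top-bottom
    rot180 = [row[::-1] for row in image[::-1]]    # rotate by 180 degrees
    return [image, h_flip, v_flip, rot180]


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
small = resize_nearest(image, 2, 2)   # phase 1: bring images to a common size
samples = augment(small)              # phase 2: increase the number of instances
```

Each input image yields four training instances in this sketch; a real pipeline would typically add further transformations such as small rotations or brightness changes.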
ML model development. In the first step, a deep neural network called convolutional auto-encoder (CAE) is employed. This network is trained based on the unlabeled database. The aim is to learn the features of images based on input data to rebuild output images. Then, a convolutional layer is added to this trained CAE. In this case, it is converted to the segmentation network. In this step, transfer learning is used. This means that weights are obtained according to the trained CAE model. Then, the segmentation network is again trained using the labeled dataset. Finally, this model can be used to detect OD in retinal images.
Evaluation. This method uses simulation-based evaluation. It is simulated by the TensorFlow tool in Python. The evaluation scales are DSC, Jaccard index, accuracy, sensitivity, and specificity. Note that the times required for training the CAE network and the segmentation network are 10 h and 26 min and one hour and 31 min, respectively. The times required for testing on the DRISHTI and RIM-ONE datasets are 1.19 and 1.4 s, respectively.
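Two of the evaluation scales above, the Dice similarity coefficient (DSC) and the Jaccard index, measure the overlap between a predicted mask and a reference mask. The following minimal sketch computes both for binary masks stored as nested lists; the function name and the toy masks are hypothetical and only serve as an illustration.

```python
def dice_and_jaccard(pred, truth):
    """Compute the DSC and the Jaccard index for two binary masks."""
    p = [v for row in pred for v in row]
    t = [v for row in truth for v in row]
    intersection = sum(1 for a, b in zip(p, t) if a and b)
    size_p = sum(1 for a in p if a)
    size_t = sum(1 for b in t if b)
    union = size_p + size_t - intersection
    if union == 0:            # both masks are empty: treat as perfect agreement
        return 1.0, 1.0
    dice = 2 * intersection / (size_p + size_t)
    jaccard = intersection / union
    return dice, jaccard


pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 0],
         [0, 1, 1]]
dice, jaccard = dice_and_jaccard(pred, truth)
```

The two scales are monotonically related (Jaccard = DSC / (2 - DSC)), so they rank segmentation results in the same order.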
5.15. A Semi-Supervised Learning Method Based on GAN
Yang et al. [
125] proposed a semi-supervised learning scheme that uses generative adversarial networks (GANs). The purpose of this scheme is to improve clinical detection in IoT-based healthcare systems. This method addresses two problems: the unavailability of labeled medical data and imbalanced classes. In the following, we describe this method. In addition, the most important characteristics of this method are listed in
Table 12. We present its weaknesses and strengths in
Table 13.
Problem definition. Today, the Internet of things (IoT) is changing our lifestyle in many areas, including healthcare. IoT technology can produce a large amount of data for medical services. These data samples are used to build a medical support system, whose main task is classification. The performance of a classifier improves as more labeled data become available. However, this faces various challenges, for example: (1) IoT helps us collect a large amount of medical data, but very few of these data samples are labeled; (2) IoT data are often imbalanced because of the high diversity in datasets. One way to solve these problems is to use semi-supervised learning. Therefore, in [
125], a GAN-based semi-supervised learning method is presented.
Dataset. In [
125], the authors utilize 10 UCI balanced datasets and 10 UCI unbalanced datasets. The number of data samples in these datasets is between 80 and 2000, and each data sample has between 3 and 30 features. Additionally, the cerebral stroke database has been used to evaluate the performance of the learning method. This dataset includes 11,039 data samples, each of which has 33 features. It contains both labeled data (100 data samples) and unlabeled data (10,939 data samples).
Data pre-processing method. In [
125], the authors designed a data pre-processing module that corrects datasets with unbalanced classes. This module increases the size of a small labeled dataset using GAN. Then, a feature selection process is performed on the dataset. Note that the authors do not describe this module or the feature selection process in detail.
ML model development. In the first step, GAN receives the labeled dataset as input and produces a number of artificial data samples. The purpose of this step is to enlarge the labeled dataset and correct the class imbalance. Then, the authors train two basic learning algorithms, support vector machine (SVM) and K-nearest neighbors (KNN), using both the labeled dataset and the artificial data samples. The purpose of these algorithms is to predict the labels of the unlabeled data samples. Then, the data samples with predicted labels are added to the labeled dataset. In the next step, GAN is applied again to this enlarged dataset to produce artificial data samples, whose number is equal to the size of the dataset. Finally, the authors train the final classifier (i.e., SVM) using both real and artificial data samples to perform the classification task.
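The pseudo-labeling step in the middle of this pipeline can be sketched as follows. This is a simplified illustration under stated assumptions and is not the implementation of [125]: the GAN-based augmentation is omitted, a plain K-nearest neighbors vote with Euclidean distance stands in for the pair of basic learners, and the function names and toy data are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote among its k nearest neighbors."""
    nearest = sorted(
        zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def pseudo_label(labeled_x, labeled_y, unlabeled_x, k=3):
    """Predict labels for unlabeled samples and merge them into the labeled set."""
    predicted = [knn_predict(labeled_x, labeled_y, x, k) for x in unlabeled_x]
    return labeled_x + unlabeled_x, labeled_y + predicted


# Toy data (assumption): two well-separated classes in a 2-D feature space.
labeled_x = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
labeled_y = [0, 0, 0, 1, 1, 1]
unlabeled_x = [(0.05, 0.1), (0.95, 1.0)]

all_x, all_y = pseudo_label(labeled_x, labeled_y, unlabeled_x)
```

The enlarged set `(all_x, all_y)` would then be passed to the second augmentation step and to the final classifier.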
Evaluation. This scheme uses simulation-based evaluation. It is implemented using MATLAB software. Note that each dataset is divided into two sections, including the training set (70% of data samples) and the testing set (30% of data samples). The evaluation scale for this method is accuracy.
5.16. Hybrid Fuzzy Clustering Scheme
Kanniappan et al. [
126] segmented abnormal areas in brain MRI slides. They used fuzzy clustering to model a semi-automatic system for detecting normal and abnormal areas in each brain MRI slide. In the following, we examine this method in detail. In addition, the main specifications of this method are summarized in
Table 14.
Table 15 presents its strengths and weaknesses.
Problem definition. In healthcare, detecting brain tumors is a very important issue. Obtaining information about abnormal tissues is a critical phase in detecting the disease and starting the treatment process. Segmentation techniques can help radiologists discover these abnormalities in MRI. Today, computer-based methods can efficiently diagnose brain tumors. One solution for this issue is clustering. In particular, the fuzzy clustering technique is a suitable method for segmenting MR images to diagnose brain tumors. Therefore, in [
126], the authors presented a hybrid fuzzy clustering method to solve this issue.
Dataset. In [
126], the authors used two MRI datasets: (1) A real medical dataset. It includes 22 brain slides. Proscans Diagnostics Center has produced these images; (2) BRATS dataset. It includes information about 10 individuals. In this dataset, there are 200 brain slides for each patient.
Data pre-processing method. In [
126], the authors first preprocess these slides to normalize their size, so that each slide is represented as a fixed-size array of
pixels. In addition, all non-brain tissues are removed from the MR images to improve the performance of this scheme.
ML model development. In [
126], the authors used the fuzzy clustering (FC) technique to segment MR images. The purpose of fuzzy clustering is to group
m data samples of the brain slide into
k clusters. After the clustering process, each data sample has a membership degree for each cluster, so that the data sample closest to a cluster center has the highest membership degree in that cluster. Then, each cluster center is calculated as the mean of the data samples weighted by their membership degrees. In the next step, the membership degree of each data sample is updated. This process continues until the total distance between the data samples and the cluster centers is minimized or no further improvement is achieved. This process segments the brain structure. Note that in the clustering process, it is very important to determine the number of clusters. In [
126], this work is done using the silhouette score. In the next step, extracted structures are improved through morphological operations to determine the boundary between clusters. Finally, the authors perform some post-processing techniques to extract the desired area (i.e., tumor) from brain slides.
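The alternating update of membership degrees and cluster centers described above can be sketched with the standard fuzzy C-means algorithm applied to scalar pixel intensities. This is a minimal sketch and not the hybrid scheme of [126]: the silhouette-based choice of the number of clusters, the morphological operations, and the post-processing are omitted, and the fuzzifier exponent, the initial centers, and the toy intensities are assumptions.

```python
def fuzzy_c_means(values, centers, fuzziness=2.0, max_iter=100, tol=1e-6):
    """Cluster 1-D values (e.g., pixel intensities) with fuzzy C-means.

    `centers` holds the initial cluster centers and `fuzziness` (> 1) is the
    usual fuzzifier exponent; both are illustrative choices.
    """
    power = 2.0 / (fuzziness - 1.0)
    memberships = []
    for _ in range(max_iter):
        # Update the membership degree of each sample in each cluster.
        memberships = []
        for x in values:
            dists = [max(abs(x - c), 1e-12) for c in centers]
            memberships.append(
                [1.0 / sum((d_i / d_l) ** power for d_l in dists) for d_i in dists]
            )
        # Recompute each center as a membership-weighted mean of the samples.
        new_centers = []
        for i in range(len(centers)):
            weights = [u[i] ** fuzziness for u in memberships]
            new_centers.append(
                sum(w * x for w, x in zip(weights, values)) / sum(weights)
            )
        shift = max(abs(a - b) for a, b in zip(new_centers, centers))
        centers = new_centers
        if shift < tol:       # stop when the centers no longer move
            break
    return centers, memberships


intensities = [10, 12, 11, 200, 205, 198]          # dark and bright pixels
centers, memberships = fuzzy_c_means(intensities, [0.0, 255.0])
labels = [u.index(max(u)) for u in memberships]    # hard label per pixel
```

Assigning each pixel to the cluster with the highest membership degree, as in the last line, turns the fuzzy partition into a segmentation map.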
Evaluation. This scheme uses both simulation-based evaluation and practical implementation-based evaluation. It is implemented in Python. The evaluation criteria include Peak Signal to Noise Ratio (PSNR), Normalized Cross-Correlation (NCC), Normalized Absolute Error (NAE), and Structural Similarity Index (SSIM). The performance of hybrid fuzzy clustering is also evaluated based on similarity criteria such as Dice and Jaccard. Note that the practical evaluation is performed on the brain MR images of a particular patient.
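As an illustration of two of these criteria, the sketch below computes PSNR and NAE using their commonly used definitions, which we assume here because the exact formulas are not given in this review. Images are nested lists of pixel values, the peak value of 255 assumes 8-bit images, and the function names and toy images are hypothetical.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    ref = [v for row in reference for v in row]
    tst = [v for row in test for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(ref, tst)) / len(ref)
    if mse == 0:
        return math.inf       # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def nae(reference, test):
    """Normalized absolute error: sum |ref - test| / sum |ref|."""
    ref = [v for row in reference for v in row]
    tst = [v for row in test for v in row]
    return sum(abs(a - b) for a, b in zip(ref, tst)) / sum(abs(a) for a in ref)


reference = [[100, 100],
             [100, 100]]
test_image = [[100, 110],
              [100, 90]]
psnr_db = psnr(reference, test_image)
nae_value = nae(reference, test_image)
```

A higher PSNR and a lower NAE both indicate that the processed image is closer to the reference image.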
5.17. A Medical Support System for Detecting Social Anxiety Disorder
Fathi et al. [
127] designed a medical support system for detecting social anxiety disorder (SAD). The authors used the self-organizing map (SOM) to detect noisy data. SAD is detected through an adaptive neuro-fuzzy inference system (ANFIS) technique. In the following, we describe this method in detail.
Table 14 expresses the most important features of this method. Furthermore,
Table 15 presents its advantages and disadvantages.
Problem definition. Social anxiety disorder (SAD) is one of the most common phobias. Psychiatrists face many challenges in detecting this disorder because patients do not have enough knowledge about it. Therefore, it is very useful to design a medical support system for detecting SAD. In [
127], ANFIS is used for modeling such a system. ANFIS is an appropriate learning model that combines the advantages of artificial neural networks and fuzzy logic. This means that the fuzzy system helps ANFIS to handle uncertainties and ambiguities, and the neural network helps ANFIS to manage noisy data.
Dataset. In this method, the authors obtain the raw data through a website. The dataset includes information about 214 patients. Each data sample has 11 features. Note that the dataset has no missing values.
Data pre-processing method. In [
127], the data pre-processing scheme has three steps: (1) Data normalization. The purpose of data normalization is to ensure that different features have the same effect on the final learning model. In [
127], the authors used the Min-Max normalization method; (2) Feature selection. The purpose of this step is to decrease the model complexity, reduce the time required to train the model, lower data dimensionality, and avoid overfitting. The feature selection process is performed using SPSS Modeler V18.0 software and selects seven useful features for detecting SAD; (3) Noise detection. In [
127], the SOM technique is used for noise detection. After the clustering process, clusters that include only a small number of data samples (one or two) are considered noisy data and are removed from the dataset. Then, the behavior of each cluster is evaluated based on two standards, namely the social phobia inventory (SPIN) and Liebowitz social anxiety (LSA). After this evaluation, clusters with abnormal behavior are also recognized as noisy data and removed from the dataset. In total, 63 data samples are removed in this step. As a result, the dataset has 151 data samples.
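The Min-Max normalization used in the first step can be sketched as follows. The target range of 0 to 1 is the common default and is an assumption here, as are the function name and the example feature values.

```python
def min_max_normalize(column, new_min=0.0, new_max=1.0):
    """Rescale one feature column linearly into [new_min, new_max]."""
    lo, hi = min(column), max(column)
    if hi == lo:                       # constant feature: map to new_min
        return [new_min for _ in column]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in column]


ages = [18, 30, 42, 66]                # hypothetical feature values
normalized = min_max_normalize(ages)
```

Applying this function to every feature column places all features on the same scale, so that no single feature dominates the learning model because of its units.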
ML model development. The authors of [
127] used the ANFIS classifier to detect SAD. It is a combination of fuzzy logic and a neural network. This algorithm is trained using the least-squares and back-propagation methods. ANFIS has five layers. The first layer is the input layer, and the final layer produces the output.
Evaluation. This method uses simulation-based evaluation. Note that the authors do not describe the simulator used. The scheme is validated using five-fold cross-validation. The evaluation criteria include accuracy, sensitivity, and specificity.
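The three evaluation criteria can be computed from the counts of true and false positives and negatives, as in the following minimal sketch. It assumes binary labels with both classes present; the function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, sensitivity, and specificity for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),   # share of true patients detected
        "specificity": tn / (tn + fp),   # share of healthy cases cleared
    }


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
```

In a five-fold cross-validation, these values would be computed on each held-out fold and then averaged.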
5.18. AFGC
Huang [
128] suggested an adaptive fast generalized fuzzy C-means clustering (AFGC) algorithm. The purpose of this method is to segment the thyroid nodule images in a noisy environment to accurately detect malignant thyroid tumors. In the following, we describe this method in detail.
Table 14 expresses the specifications of this method in summary. Furthermore,
Table 15 presents its strengths and weaknesses.
Problem definition. The most common malignant thyroid tumor is papillary thyroid carcinoma (PTC), which must be treated in a timely manner to stop or control the disease. Usually, ultrasound images are used for detecting this disease. However, interpreting these images is a very difficult and time-consuming task that requires expertise. Therefore, computer-based systems are very beneficial for analyzing ultrasound images. The existing clustering methods for segmenting ultrasound images perform poorly and are not sufficiently accurate because these images are highly noisy. In [
128], a suitable segmentation model has been proposed based on the AFGC clustering method.
Dataset. In [
128], the authors used the Jinshan Hospital database, which includes thyroid nodule images. These images were obtained through the PACS system from January 2014 to April 2016. In total, there are 610 thyroid nodule images from 543 patients. These images are divided into two classes: benign (403 images) and malignant (207 images). This dataset is used as the training set. In addition, the testing set includes thyroid nodule images from May 2016 to September 2016, namely 50 thyroid nodule images from 45 patients.
Data pre-processing method. In [
128], the authors did not perform any data pre-processing scheme on the database.
ML model development. In [
128], the authors presented an AFGC-based segmentation algorithm to accurately segment the thyroid nodule images. In the first step, the authors determine a balance scale, which is calculated based on the noise probability of non-local pixels. This helps the scheme to determine the structure information in the image accurately. In the second step, the AFGC algorithm and the weighted image are merged, taking the balance scale into account. This operation produces a filtered image. The scheme performs the filtering process dynamically: if the image is highly noisy, the scheme increases the filtering degree; otherwise, it reduces the degree.
Evaluation. This scheme uses simulation-based evaluation. It is simulated using MATLAB software. Two evaluation scales, including segmentation accuracy (SA) and comparison scores (CS), have been used to evaluate this method.
5.19. UDR-RC
Janarthanan et al. [
129] offered the unsupervised deep learning assisted reconstructed coder (UDR-RC). The purpose of this method is to present a data pre-processing scheme to optimize the dataset. In the following, we explain this method in detail. Moreover, we represent the main specifications of the UDR-RC method in
Table 14.
Table 15 expresses its advantages and disadvantages.
Problem definition. Human activity recognition (HAR) has created opportunities for designing e-health methods. It uses wearable sensors to recognize different body activities. These sensors are very important for detecting different diseases and selecting a suitable treatment policy. Their output is a signal, which must be analyzed using deep learning approaches like DCCN. Existing models for analyzing these signals have a high computational time and a high error rate, which means that they are not sufficiently accurate. Therefore, in [
129], the UDR-RC method is presented to solve the stated problems.
Dataset. UDR-RC employs the WISDM database, whose data samples are collected by wearable sensors. These data samples represent six human activities: walking, running, going upstairs, going downstairs, sitting, and standing.
Data pre-processing method. UDR-RC is a data pre-processing method, including feature selection and feature extraction. It reduces computational time and the error rate, and enhances accuracy.
ML model development. UDR-RC is designed to automatically extract high-level features. This process includes several steps. In the first step, data samples are analyzed. The purpose of this step is to represent data samples analytically. It also reduces noise in data samples. The data samples are signals in the time and frequency domains. In [
129], the Fourier transform (FT) is used to analyze these data samples. In this scheme, a long signal is broken into shorter segments. In [
129], these time series are divided using a time window of constant size. In the second step, feature extraction is performed. This step is the core of the UDR-RC method. For this purpose, the coder architecture and the Z-Layer method are merged to create a deep learning framework. The coder architecture is an encoder-decoder architecture, which processes the input signal to extract its features using the Z-Layer method. In the third step, UDR-RC performs a feature selection process to select the most suitable features for HAR. Finally, an artificial neural network (ANN) is used for classifying human activity. It includes an input layer, an output layer, and three hidden layers.
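The first step, in which a long signal is cut into windows of constant size and each window is analyzed with the Fourier transform, can be sketched as follows. This sketch covers only that step; the coder architecture and the Z-Layer method are not reproduced. Non-overlapping windows, the window size, the function names, and the synthetic signal are assumptions.

```python
import cmath
import math


def split_windows(signal, size):
    """Cut a signal into consecutive, non-overlapping windows of fixed size."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def dft_magnitudes(window):
    """Magnitude spectrum of one window via a direct discrete Fourier transform."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(window)))
        for k in range(n // 2 + 1)
    ]


# Synthetic sensor signal: a sinusoid completing 2 cycles in every 8 samples.
signal = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]
windows = split_windows(signal, 8)
spectra = [dft_magnitudes(w) for w in windows]
dominant_bins = [s.index(max(s)) for s in spectra]
```

Each spectrum can then serve as a frequency-domain representation of its window; for long windows, a fast Fourier transform would replace the direct sum used here.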
Evaluation. UDR-RC uses simulation-based evaluation. However, the authors do not mention the software used for implementing this method. In this scheme, evaluation scales include accuracy, MSE, and runtime.
5.20. CLUSTIMP
Shobha and Savarimuthu [
130] presented a clustering-based imputation technique called CLUSTIMP. In the following, we describe this method in detail. Furthermore,
Table 14 expresses the most important characteristics of the CLUSTIMP method.
Table 15 presents its advantages and disadvantages.
Problem definition. Healthcare datasets contain useful information. However, they often include many missing values, unbalanced classes, and other problems. Missing values are known as a serious problem in these datasets. This problem can be solved using two schemes: (1) Marginalization, in which data samples with missing values are removed from the dataset; (2) Imputation, in which the missing values are estimated. The marginalization method can cause the class imbalance problem, whereas the imputation method does not have this problem. Therefore, in [
130], an unsupervised learning algorithm is provided for estimating these missing values.
Dataset. In [
130], the authors used two databases: the mammographic mass dataset and the HCC dataset. The mammographic mass dataset has been obtained from the UCI repository. It includes 961 data samples with six features each, of which 162 data samples have missing values. Furthermore, the HCC database includes information about 165 patients. Each data sample has 50 features. In this dataset, there are missing values (10.22% of data samples).
Data pre-processing method. CLUSTIMP is a data pre-processing scheme for estimating missing values.
ML model development. In [
130], the authors presented a clustering-based imputation algorithm called CLUSTIMP. This imputation model employs ART2 for creating clusters. ART2 is an unsupervised learning algorithm, which is rooted in the ART scheme and works with continuous features. After the clusters are created, each cluster contains two types of data samples: complete data samples and data samples with missing values. In the next step, the members of each cluster are divided into two groups: group 1 (complete data samples) and group 2 (data samples with missing values). Then, missing values are estimated using two methods, Expectation Maximization (EM) and J48 (a decision tree). Note that numerical missing values are imputed using EM and categorical missing values are imputed using J48.
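The idea of imputing each missing value from the complete samples of the same cluster can be sketched as follows. This is a deliberately simplified illustration and not the CLUSTIMP algorithm: cluster assignments are taken as given instead of being produced by ART2, the cluster mean stands in for EM, and the cluster mode stands in for J48. The function name and the toy records are hypothetical.

```python
from collections import Counter
from statistics import mean


def impute_by_cluster(rows, cluster_ids):
    """Fill None entries using only complete rows from the same cluster.

    Numeric columns use the cluster mean and other columns the cluster mode;
    these are simple stand-ins for the EM and J48 estimators.
    """
    imputed = [list(row) for row in rows]
    for cid in set(cluster_ids):
        members = [i for i, c in enumerate(cluster_ids) if c == cid]
        complete = [rows[i] for i in members if None not in rows[i]]  # group 1
        for i in members:                                             # group 2
            for col, value in enumerate(rows[i]):
                if value is None and complete:
                    known = [r[col] for r in complete]
                    if all(isinstance(v, (int, float)) for v in known):
                        imputed[i][col] = mean(known)
                    else:
                        imputed[i][col] = Counter(known).most_common(1)[0][0]
    return imputed


# Toy records (assumption): (age, mass shape), with None marking a missing value.
rows = [(40, "oval"), (44, "oval"), (None, "oval"),
        (70, "irregular"), (74, "irregular"), (72, None)]
cluster_ids = [0, 0, 0, 1, 1, 1]
filled = impute_by_cluster(rows, cluster_ids)
```

Restricting the estimate to samples from the same cluster keeps the imputed value close to similar records rather than to the global average.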
Evaluation. CLUSTIMP uses simulation-based evaluation. It is implemented in Python 2.7. Evaluation criteria include error rate, accuracy, and root mean squared error (RMSE).
6. Discussion
In this section, we provide some points about the ML-based methods in healthcare according to the learning models examined in
Section 5. Note that real-world datasets in the healthcare field often suffer from various problems such as missing values, noisy data, and high data dimensionality (a high number of features), among others. These problems reduce the quality of datasets, which negatively affects the performance of ML-based models. According to the research reviewed in this paper, we deduce that most ML-based methods in medicine include data pre-processing. Missing values are the most common problem in healthcare datasets. Based on the ML-based methods studied in this paper, we find that there are two main strategies for solving this problem: (1) Deleting data with missing values; (2) Estimating missing values. Qin et al. in [
101], Wang et al. in [
110], Baucum et al. in [
120], and Shobha and Savarimuthu in [
130] offered various designs for estimating missing values. Li et al. [
102], Abdar and Makarenkov [
107], and Wang et al. in [
110] removed data with missing values from datasets. This is a simple approach to the problem; however, it can lead to a new problem called imbalanced classes, which has a negative effect on the performance of learning models. Therefore, methods that impute missing values provide a more appropriate solution. However, when designing such a method, it is very important to estimate the missing values accurately. Otherwise, the learning model does not achieve acceptable performance. Wang et al. in [
110] provided a hybrid method for solving this issue: data samples with many missing values are removed from the dataset, whereas data samples with few missing values are imputed. In addition, most ML-based methods include a data normalization process. The purpose of data normalization is to standardize variables with different scales into a certain range, for example
, to have the same effect on the learning model. For example, Li et al. in [
102], Baucum et al. in [
120], Gupta et al. in [
122], Zhai et al. in [
123], Bengani et al. in [
124], Kanniappan et al. in [
126], Fathi et al. in [
127], and Janarthanan et al. in [
129] used the data normalization methods. Noise is another problem in healthcare datasets. It reduces the accuracy of learning models and increases their error. Therefore, it is very important to design approaches to remove noisy data to improve the performance of ML-based models. Data has different types, for example digital images, numerical data, and qualitative data. The noise removal process varies according to the data type in datasets. In this paper, we examined different methods for removing different types of noise in various datasets. For example, Ma et al. in [
109], Fathi et al. in [
127], Huang in [
128], and Janarthanan et al. in [
129] provided various approaches to remove noise from data. We examined these methods in
Section 5. Another important point is that healthcare datasets often have high dimensionality, meaning that data samples have many features. This can increase the model complexity and the learning time and lead to overfitting. An appropriate solution to this problem is to use dimensionality reduction methods such as feature selection and feature extraction, on which some research works have focused. For example, Qin et al. in [
101], Li et al. in [
102], Abdar et al. in [
108], Ma et al. in [
109], Tseng et al. in [
112], Zhu et al. in [
121], Yang et al. in [
125], Fathi et al. in [
127], and Janarthanan et al. in [
129] provided approaches for reducing dimensionality. However, some of the methods studied in this paper do not explain the method used for reducing dimensionality. This is an important weakness because the presented results cannot be validated and the effect of the feature selection method on performance cannot be assessed. For example, Abdar et al. in [
108] and Yang et al. in [
125] did not provide any explanation about the feature selection process.
Table 16 categorizes the ML-based methods based on data pre-processing methods.
Another important point in ML-based models is the type of learning algorithm used for their development. According to our reviews in this paper, it can be found that unsupervised learning-based methods are often used for data pre-processing applications. For example, Fathi et al. in [
127] used the self-organizing map (SOM) for detecting noise. Janarthanan et al. in [
129] presented an unsupervised deep learning method for feature extraction, feature selection, and noise removal to reduce computational time. Shobha and Savarimuthu in [
130] provided an unsupervised neural network for estimating missing values in the dataset. In contrast, supervised learning methods are often used to diagnose and classify diseases, as in the learning approaches provided by Qin et al. [
101], Li et al. [
102], Abdar and Makarenkov [
107], Abdar et al. [
108], Ma et al. [
109]. Today, deep learning methods are also used to design treatment recommendation systems. However, an important problem with these methods is that their performance depends on the labeled database. A supervised learning algorithm performs well when enough labeled data are available for training and testing the model. However, in the healthcare field, we often do not have access to large labeled datasets. This can lead to overfitting, which reduces the generalizability of the learning model and increases its error. Some authors have proposed solutions to this issue. One solution is to use reinforcement learning. For example, Wang et al. in [
110], Dai et al. in [
111], Tseng et al. in [
112], Khalilpourazari and Hashemi in [
113], and Baucum et al. in [
120] employed reinforcement learning for designing the learning models. However, the most important problem when using this technique in healthcare is that a reinforcement learning method must track the patient’s health status continuously to learn the optimal treatment strategy. First, tracking the patient’s health status is very difficult. Second, researchers cannot perform unauthorized tests on the patient’s body. A solution to these problems is to create an artificial environment for reinforcement learning-based models. For example, Dai et al. in [
111], Tseng et al. in [
112], and Baucum et al. in [
120] designed an artificial environment using deep learning techniques to interact with reinforcement learning-based models. Another solution to solve data unavailability is to produce artificial data samples. For example, Tseng et al. in [
112] and Yang et al. [
125] used a deep neural network called GAN to produce artificial data samples and enlarge the initial dataset. Another solution for data unavailability is to use semi-supervised learning methods. These methods use a combination of labeled data and unlabeled data for designing the learning model. Moreover, these methods use both learning techniques, including supervised learning and unsupervised learning. For example, Zhu et al. in [
121], Gupta et al. in [
122], Zhai et al. in [
123], Bengani et al. in [
124], and Yang et al. in [
125] used semi-supervised learning for designing the learning model.
Table 17 categorizes the ML-based methods in the healthcare field in terms of various learning techniques.
When examining the ML-based methods in healthcare, another point is that researchers often evaluate the performance of their learning model using simulation software. Although this evaluation method is very important, we believe that it is not enough, because ML-based methods in healthcare should be analyzed in real environments and evaluated by physicians and specialists in this area to identify their weaknesses. Among the works reviewed in this paper, only Wang et al. in [
110] and Kanniappan et al. in [
126] examined their methods in a real environment, and even these evaluations are highly limited. Note that the practical implementation of learning models in healthcare is very costly and involves hardware complexities. Additionally, it is very difficult to repeat different scenarios. These problems are important obstacles for artificial intelligence researchers because they need to evaluate their own models to update them continuously. In
Table 18, the ML-based methods in healthcare are categorized in terms of evaluation methods.
The final point on the ML-based models in the healthcare field is that most ML-based methods are used to diagnose a disease. The number of papers that apply machine learning techniques to treatment is very limited; examples include Wang et al. in [
110], Dai et al. in [
111], Tseng et al. [
112], and Baucum et al. in [
120]. Therefore, researchers must work in this area to resolve its problems.
Table 19 compares the ML-based methods in healthcare in terms of application.
State-of-the-art surveys have presented comprehensive reviews of the applications of machine learning in medical sciences. From cardiovascular disease [
131] to pandemic research [
132], various methods have been considered and notable methods presented. Machine learning in particular has shown an exponential increase in COVID-19 research, where novel methods have been proposed [
133,
134,
135,
136,
137,
138,
139,
140]. It has been shown that ensemble, deep learning, and hybrid methods are rapidly gaining popularity, as also stated in previous surveys, for example, [
140,
141,
142,
143,
144,
145,
146]. The progress on the applications of evolutionary methods, for example, [
147,
148,
149,
150] in training machine learning methods has not been as rapid as in other fields.