
A Novel Model Based on CNN–ViT Fusion and Ensemble Learning for the Automatic Detection of Pes Planus

1 Radiology Department, Bursa City Hospital, 16110 Bursa, Turkey
2 Department of Electrical and Electronics Engineering, Kahramanmaras Sutcu Imam University, 46200 Kahramanmaraş, Turkey
3 Department of Orthopaedics and Traumatology, Faculty of Medicine, Hacettepe University, 06100 Ankara, Turkey
* Author to whom correspondence should be addressed.
J. Clin. Med. 2024, 13(16), 4800; https://doi.org/10.3390/jcm13164800
Submission received: 12 June 2024 / Revised: 30 July 2024 / Accepted: 8 August 2024 / Published: 15 August 2024

Abstract

Background: Pes planus, commonly known as flatfoot, is a condition in which the medial arch of the foot is abnormally low or absent, leading to the inner part of the foot having less curvature than normal. Symptom recognition and errors in diagnosis are problems encountered in daily practice. Therefore, it is important to improve how a diagnosis is made. With the availability of large datasets, deep neural networks have shown promising capabilities in recognizing foot structures and accurately identifying pes planus. Methods: In this study, we developed a novel fusion model by combining the Vgg16 convolutional neural network (CNN) model with the vision transformer ViT-B/16 to enhance the detection of pes planus. This fusion model leverages the strengths of both the CNN and ViT architectures, resulting in improved performance compared to results reported in the literature. Additionally, ensemble learning techniques were employed to ensure the robustness of the model. Results: Through 10-fold cross-validation, the model demonstrated high sensitivity, specificity, and F1 score values of 97.4%, 96.4%, and 96.8%, respectively. These results highlight the effectiveness of the proposed model in quickly and accurately diagnosing pes planus, making it suitable for deployment in clinics or healthcare centers. Conclusions: By facilitating early diagnosis, the model can contribute to the better management of treatment processes, ultimately leading to an improved quality of life for patients.

1. Introduction

Pes planus, or flatfoot, is a prevalent condition characterized by the inward collapse of the arch along the inner side of the foot [1,2]. It is commonly categorized into two types: pediatric pes planus and adult-acquired pes planus (AAPP). Pediatric pes planus is also categorized into two groups: flexible pes planus and painful pes planus. The flexible type is typically characterized by hypermobility of the subtalar joint, and the development of the medial arch normally occurs naturally before adulthood [3,4]. However, painful pes planus in children and adolescents is mostly caused by talocalcaneal coalition [5]. AAPP is acknowledged as a source of discomfort and impairment in adulthood, primarily resulting from posterior tibial tendon dysfunction (PTTD) [6].
Tarsal coalition is an abnormal fusion of two or more tarsal bones, which can lead to painful pes planus in adolescents. Tarsal coalitions are classified into fibrous, cartilaginous, and osseous coalitions based on the morphology of the abnormal bridge [7]. The exact prevalence of tarsal coalition is uncertain; estimates vary from less than 1% to around 1–2% of the population [8].
The prevalence of AAPP ranges from 5% to 15% in the community. Approximately 1 out of every 100 people in the overall population seeks medical attention for the pain resulting from a flat foot. The prevalence of asymptomatic AAPP, which is also quite prevalent, is not fully known [9,10].
PTTD, the most common reason for AAPP, can be attributed to various factors, including age, obesity, ligamentous laxity, and trauma. There is a watershed area behind the medial malleolus that typically experiences degeneration as a result of a sudden alteration in the orientation of the tibialis posterior tendon. PTTD results in a disruption of force balance [4,6,11]. The peroneus brevis tendon causes forefoot abduction, and the Achilles tendon causes a valgus deformity of the ankle. Subsequently, spring ligament weakness ensues, leading to the collapse of the medial arch [12,13].
Pes planus can be diagnosed by both a clinical examination and radiographic methods. The primary radiological approach for investigating pes planus is a full-weight lateral foot X-ray in the stance phase [14,15]. The assessment of AAPP involves measurements on lateral weight-bearing X-rays. These measurements include the talo-first metatarsal angle (Meary’s angle), calcaneal pitch angle, and distance between the medial cuneiform and the floor. However, using these parameters can be time-consuming and may lead to inconsistent results between different observers [16].
Computer-based, automated systems for disease detection from X-ray images hold great promise. There are many studies in the literature on pes planus detection from X-ray images with machine learning, following two basic approaches. The first is to determine whether pes planus exists by classifying images. The second is to calculate the angles used to detect pes planus from the patient’s foot X-ray images. Image interpretation using computer-aided systems equipped with artificial intelligence can be more objective, faster, and more accurate than human interpretation [17,18,19]. Therefore, the computer-based automatic detection of pes planus is an important research topic in the literature.
Convolutional Neural Networks (CNNs) are widely used deep learning models for image processing and classification tasks [20,21]. They have achieved great success in these fields and are particularly powerful tools for medical image analysis. Vision Transformers (ViTs) adapt the transformer architecture, originally developed for natural language processing (NLP), to image processing tasks [22]. ViTs are a relatively new approach and perform exceptionally well, especially on large datasets.
In this study, a novel fusion model was created using both CNNs and ViTs for pes planus detection. To our knowledge, no previous study has investigated the performance of a Vision Transformer (ViT) for detecting pes planus from X-ray images. The aim of this study was to evaluate the accuracy and reliability of the combined use of these two models in detecting pes planus. Additionally, an ensemble learning method consisting of support vector machine (SVM) [23], K-Nearest Neighbors (KNN) [24], and decision tree (DT) [25] classifiers was used. Predictions from each model were either voted on or averaged. Ensemble learning enhances the performance, accuracy, robustness, and stability of predictive models, making it a valuable technique in machine learning and data science. The results obtained show that the integration of CNNs and ViTs offers an innovative approach to developing automatic and precise diagnostic tools and provides an effective solution that can assist clinicians in the automatic detection of pes planus in clinical applications.

2. Materials and Methods

2.1. Dataset

The dataset consists of weight-bearing X-ray images of patients admitted to the Radiology Department of Elazığ Fethi Sekin City Hospital. These patients were undergoing routine pre-military health screenings or were suspected of having flatfoot conditions. All images were captured using a Philips dual detector digital X-ray machine (65 kV, 6.3 mAs) and stored as JPG files.
A total of 439 patients provided images. However, X-ray images from 18 of these patients were removed due to low quality. A total of 842 X-ray images from 421 patients were labeled according to the calcaneal angle by two expert radiologists. A third specialist radiologist conducted post-labeling by re-evaluating the conflicting cases. The calcaneal angle is shown in Figure 1.
If the calcaneal pitch angle was less than 18 degrees, it indicated that pes planus was present, and if it was equal to or greater than 18 degrees, it indicated that pes planus was not present. The distribution of the dataset is given in Table 1.
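The labeling rule above can be sketched as a simple threshold function (a minimal illustration; the function name is ours, not from the study):

```python
def label_from_calcaneal_pitch(angle_deg: float) -> str:
    """Apply the study's labeling rule: pes planus if the calcaneal
    pitch angle is below 18 degrees, otherwise normal."""
    return "pes planus" if angle_deg < 18.0 else "normal"
```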

2.2. Image Preprocessing

A large dataset is critical for performance during the training of deep learning models [27]; otherwise, problems such as overfitting are encountered [28]. When the dataset is limited, data augmentation techniques can be applied to obtain the desired amount of data. First, all images were resized to 224 × 224 pixels. Random ten-degree rotations and mirroring (on the x and y axes) were applied to the images in the dataset. In this way, the number of images was increased from 842 to 1500. Additionally, to avoid class imbalance, 750 images were set as pes planus and 750 images were set as normal.
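The augmentation steps above can be sketched as follows, assuming Pillow is available; the function name and the fixed set of transforms per image are illustrative (the study applied rotation and mirroring randomly):

```python
import random
from PIL import Image

def augment(img: Image.Image) -> list:
    """Resize to 224 x 224, then produce a rotated copy (up to +/-10
    degrees) and mirrored copies on the x and y axes, as described
    in the preprocessing text."""
    base = img.resize((224, 224))
    return [
        base,
        base.rotate(random.uniform(-10, 10), fillcolor=0),  # small rotation
        base.transpose(Image.FLIP_LEFT_RIGHT),              # mirror on x axis
        base.transpose(Image.FLIP_TOP_BOTTOM),              # mirror on y axis
    ]
```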

2.3. CNN and ViT

CNN models have some limitations in image classification, especially in capturing global structural information. Since CNN models tend to learn better from classes with large amounts of data, they may make some incorrect classifications [29]. In our study, a model combining a ViT and a CNN is presented. This model achieved higher performance than ordinary CNNs in detecting pes planus from X-ray images. VGG16 was used as the CNN model, and ViT-B/16 was used as the ViT. Additionally, classification performance was further improved with ensemble learning consisting of SVM, K-Nearest Neighbor (KNN), and decision tree classifiers. In addition, the proposed model was found to perform better than ordinary CNNs with respect to data imbalance.
VGG16 consists of 13 convolutional layers arranged in five blocks, each comprising multiple convolutional layers with 3 × 3 filters and a stride of 1 (Figure 2). The number of filters starts at 64 and doubles after every max-pooling layer. Following the convolutional layers, VGG16 has 3 fully connected (FC) layers, also known as dense layers, the first two of which have 4096 neurons each.
The ViT applies the transformer architecture, originally designed for natural language processing (NLP), to image classification tasks and has become a popular architecture with high performance in recent years. In a standard ViT architecture, an image is divided into fixed-size patches, which are then linearly embedded and processed by a transformer encoder. However, for tasks that require capturing more complex spatial information or interactions between different parts of the image, multiple transformer blocks or layers can be added to the ViT architecture.
For the ViT model, each image was divided into 16 × 16 patches (a total of 196 patches). Then, each patch, together with its positional embedding and a class token, was applied as input to the encoder part of the transformer network. Finally, the transformer output was passed to a multilayer perceptron (MLP) head. Since the dataset used was not large, a ViT-B/16-based ViT was chosen. A feature vector of length 768 was obtained from each patch. In total, feature vector dimensions of 196 × 768 were obtained from each image. Figure 3 shows the structure of the ViT model. In the training process, the initial learning rate was set to 0.01. We selected binary cross-entropy loss as our loss function for the binary classification. The momentum and the batch size were set to 0.9 and 64, respectively.
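The patch arithmetic above (16 × 16 patches, 196 tokens, 768-dimensional embeddings) can be checked with a small NumPy sketch; the random projection below stands in for the learned linear embedding and is not the trained model:

```python
import numpy as np

def patchify(img: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (224, 224) image into non-overlapping 16 x 16 patches and
    flatten each one, as the ViT-B/16 tokenizer does before the linear
    embedding. Returns an array of shape (196, 256) for grayscale input."""
    h, w = img.shape
    p = img.reshape(h // patch, patch, w // patch, patch)
    p = p.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    return p

rng = np.random.default_rng(0)
tokens = patchify(rng.random((224, 224)))
# A stand-in random projection maps each patch to a 768-d embedding,
# reproducing the 196 x 768 feature map described in the text.
W = rng.standard_normal((256, 768))
embeddings = tokens @ W
print(tokens.shape, embeddings.shape)
```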
Feature selection was performed to eliminate redundant features. Among the 13,696 features in the third layer of VGG16, the 32 features with the highest correlation were selected. Similarly, out of the 151,296 features obtained with the ViT, 486 features with high correlation were selected. The mRMR algorithm, based on the Maximum Relevance Minimum Redundancy criterion, was used for feature selection. It analyzes the relevance between the features and the class, as well as the redundancy among the features themselves, using measures such as Euclidean distance, Pearson correlation, and mutual information. The mathematical expressions of the mRMR algorithm are given in Equations (1) and (2) [30]. The proposed model is shown in the flow diagram in Figure 4.
$$\text{Maximum Relevance: } MR = \max \frac{1}{|F|} \sum_{f_i \in F} D(f_i, C) \quad (1)$$
$$\text{Minimum Redundancy: } mR = \min \frac{1}{|F|^2} \sum_{f_i, f_j \in F} D(f_i, f_j) \quad (2)$$
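A greedy mRMR-style selection can be sketched as follows; this is a minimal illustration using absolute Pearson correlation as the relevance/redundancy measure D, not the exact implementation used in the study:

```python
import numpy as np

def mrmr_select(X: np.ndarray, y: np.ndarray, k: int) -> list:
    """Greedy mRMR sketch: at each step, pick the feature with the
    highest relevance |corr(f, y)| minus its mean redundancy
    |corr(f, selected)| with the features chosen so far."""
    n_feat = X.shape[1]
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    selected = [int(np.argmax(rel))]          # start with most relevant feature
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                           for s in selected])
            score = rel[j] - red              # relevance minus redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```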

2.4. Ensemble Learning

An ensemble learning-based classification method was applied. Ensemble learning offers better classification performance by combining the most advantageous features of different classifiers. For example, decision trees (DTs) are significantly affected by data change. However, in some cases, they show higher performance than SVMs. Therefore, an ensemble learning combination of an SVM, KNN, and DT was used in the present study. We trained the classifiers with the feature vectors of each image. The proposed ensemble learning model is shown in Figure 5.
Both majority voting and weighted average voting were used. In majority ensemble learning, each model in the ensemble makes its own prediction for a given input or example. The final prediction is then determined by selecting the class label that receives the most “votes” or agreement from the individual models [31]. In classification tasks, the class label with the highest prediction frequency across the ensemble is selected as the final prediction. In a weighted average, each model in the ensemble makes a prediction for a given input. Each prediction is assigned a weight that reflects its relative importance or reliability. The final estimate is obtained by calculating the weighted average of the individual estimates, where each prediction is multiplied by the corresponding weight.
Mathematically, the ensemble E contains n classifiers, denoted as C1, C2, …, Cn. For each classifier Ct (where t is the index of the classifier), its decision regarding class cj is represented by dt,j. This can be expressed as follows:
  • $d_{t,j} = 1$ if classifier $C_t$ predicts $c_j$
  • $d_{t,j} = 0$ if classifier $C_t$ does not predict $c_j$
Majority voting and weighted voting are calculated as follows in Equations (3) and (4):
$$\hat{y} = \arg\max_{c_j} \sum_{t=1}^{n} d_{t,j} \quad (3)$$
$$\hat{y} = \arg\max_{c_j} \sum_{t=1}^{n} w_t \, d_{t,j} \quad (4)$$
where $\hat{y}$ represents the final prediction and each classifier can be assigned a weight $w_t$.
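Equations (3) and (4) can be transcribed directly into code (a minimal sketch; the function names are ours):

```python
import numpy as np

def majority_vote(preds: np.ndarray) -> int:
    """preds holds one predicted class label per classifier; the class
    with the most votes wins (Equation (3))."""
    return int(np.bincount(preds).argmax())

def weighted_vote(preds: np.ndarray, weights: np.ndarray) -> int:
    """Each classifier's vote counts with its weight w_t (Equation (4))."""
    n_classes = int(preds.max()) + 1
    scores = np.zeros(n_classes)
    for p, w in zip(preds, weights):
        scores[p] += w
    return int(scores.argmax())
```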

2.5. Performance Indices

Sensitivity, specificity, and accuracy values are used to quantify performance. These indices are commonly used for binary classification problems in the literature [32,33]. In addition to evaluating the proposed model with the F1 score, the Vgg16 and ViT models used individually were also compared. The performance of ensemble learning was examined alongside that of the different classifiers. In classification problems, accuracy refers to the ratio of correctly predicted samples to the total number of samples, as shown in Equation (5). Sensitivity measures how many of the actual positive samples the model correctly identifies (Equation (6)), while specificity measures how many of the actual negative samples it correctly identifies (Equation (7)). The F1 score allows precision and recall to be taken into account together. In particular, with unbalanced class distributions (i.e., when one class has considerably more or fewer samples than another), the F1 score can be a useful performance metric in addition to accuracy. It is calculated as shown in Equation (8). These metrics are calculated using true positives (TPs), true negatives (TNs), false negatives (FNs), and false positives (FPs).
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\% \quad (5)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100\% \quad (6)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \times 100\% \quad (7)$$
$$F1 = \frac{2 \times TP}{2 \times TP + FN + FP} \quad (8)$$
TPs are the samples that were actually pes planus and were classified correctly; FPs are the samples that were not actually pes planus but were misclassified as pes planus by the model; TNs are the samples that were not actually pes planus but were correctly classified by the model; FNs are the samples that were actually pes planus but were misclassified as not pes planus by the model.
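The four indices can be computed from the confusion-matrix counts as a direct transcription of Equations (5)–(8):

```python
def metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, sensitivity, specificity (as percentages),
    and the F1 score from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn) * 100,   # Eq. (5)
        "sensitivity": tp / (tp + fn) * 100,                 # Eq. (6)
        "specificity": tn / (tn + fp) * 100,                 # Eq. (7)
        "f1": 2 * tp / (2 * tp + fn + fp),                   # Eq. (8)
    }
```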

3. Results

In our experiment, a binary classification was used: pes planus and normal. The dataset was divided into 80% for training and 20% for testing. Ten-fold cross-validation was employed to validate the classification performance. Dropout was used to prevent overfitting. For Vgg16, pretrained weights were used for initialization. The Adam optimizer was chosen, and the models were trained for 100 epochs. In Table 2, the performances of the CNN and ViT networks in the fusion model are shown separately.
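The 10-fold protocol can be sketched with a simple index generator (a minimal illustration, not the study's exact splitting code):

```python
def kfold_indices(n: int, k: int = 10):
    """Yield (train, test) index lists for k-fold cross-validation,
    mirroring the 10-fold protocol used in the experiments. The last
    fold absorbs any remainder when n is not divisible by k."""
    fold = n // k
    idx = list(range(n))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold] if i < k - 1 else idx[i * fold:]
        test_set = set(test)
        train = [j for j in idx if j not in test_set]
        yield train, test
```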
The fusion model consisting of Vgg16 and ViT performed better than the Vgg16 and ViT models used alone. The accuracy, sensitivity, and specificity values of the proposed fusion model were 96%, 95.2%, and 96.8%, respectively. The lowest performance was achieved with Vgg16. The ViT network outperformed Vgg16 in classification. The ViT and Vgg16 networks exhibited complementary advantages in feature extraction; the proposed fusion model therefore achieved higher performance by exploiting the advantages of both networks. There was an increase in both the number of correct predictions in positive cases (pes planus) and the number of correct predictions in negative cases (normal). The number of FNs, that is, the samples that were actually pes planus but were mistakenly predicted to be normal, was 19. The number of FPs, that is, the samples that were actually normal but were incorrectly classified as pes planus by the model, was 29.
Different classifiers were trained with the features extracted by the fusion model. The performances of each classifier are shown in Table 3. Ensemble learning was used in the experiment. The decisions of the different classifiers were voted on. Two different votes were possible: majority and weighted. Weighted average voting was obtained through ensemble learning, which was more accurate than the individual decisions of the classifiers.
In our experiment, the SVM showed a higher performance than the KNN and DT. The lowest performance was achieved with the DT. In ensemble learning, very similar performances were obtained with the two different voting schemes. Based on these metrics, EL (weighted) generally outperformed the other classifiers in terms of accuracy, sensitivity, specificity, and the F1 score. However, the choice of classifier ultimately depends on the specific requirements and characteristics of the dataset and the problem at hand. Classified images are shown in Figure 6.
The fusion model developed in this study, consisting of a combination of a CNN and ViT, offers a solution for the automatic detection of pes planus. This model is capable of detecting the presence of pes planus by analyzing features from foot images. Similar studies in the literature generally focus on the detection of pes planus using only one CNN model or more than one CNN model together (Table 4). However, the aim of the present study was to create a more powerful model by combining different architectures such as the CNN and ViT architectures. The results obtained show that the developed fusion model has higher accuracy rates in pes planus detection than the CNN and ViT models used alone. This indicates the model’s ability to better analyze complex structures in the sole of the foot. In addition, due to the generalization ability of the fusion model, reliable results can be obtained in different patient groups and under various conditions.

4. Discussion

The continuing need to address specific clinical problems of the foot with kinematic measurements is pushing research towards the design of new or improved multicompartmental models, as well as expanding the availability of reference data for different pathologies and age groups.
Various methods have been used in the literature for the automatic detection of pes planus. Jian et al. used the Canny edge detection algorithm to determine the arch angle required to detect pes planus. Pes planus was detected according to the principle of obtaining the arch angle by identifying key points. They also used a Gaussian filter to remove noise in images [35]. The performance of such image processing-based applications is significantly affected by image contrast. Kao et al. counted the white pixels in an image to identify the calcaneal and fifth metatarsal bones. They determined the arch angle by identifying key points on the calcaneus and fifth metatarsal through binary images. They stated that 73% of all cases were detected correctly [36].
Machine learning algorithms are also used effectively to detect pes planus. Nitris et al. proposed feature extraction based on ResNet50. They suggested a new convolutional neural network (CNN) model that includes Adam optimization. They divided an image into three regions and detected key points separately in each region. A single CNN model’s ability to extract features from an image is limited, which restricts its performance. Gul et al. proposed a new model for the detection of pes planus by combining different CNN models. Image features were extracted with MobilenetV2, which provided the highest performance. Then, dimension reduction was carried out for feature selection. Classification was performed with a support vector machine (SVM), and a 95.14% accuracy was achieved [3].
There could be multiple potential causes of FN and FP outcomes. Initially, the suggested system lacked sufficient training on two groups: adolescent X-rays with open physis and those with other foot abnormalities. A pes cavus deformity (Figure 6f) and an adolescent foot with an open physis (Figure 6e) were most likely to blame for two of the FP results in our test group. Comprehending the cause of FN X-rays is more complicated. We categorized X-rays based on the calcaneal inclination angle, which is determined by measuring the angle produced by lines running from the bottom of the calcaneocuboid joint to the same horizontal line running along the bottom surface of the calcaneus to the lowest point of the fifth metatarsal head [37]. We discovered that, in our FN outcomes, the other metatarsal bones completely encircled the fifth metatarsal bone, making it impossible for our AI system to precisely detect its borders. This may be a potential cause of the FNs.
The proposed model has significant potential for various applications. Firstly, it can be employed by physicians in patient care settings. Lateral weight-bearing foot X-rays are commonly employed to radiographically assess individuals who present to outpatient clinics with suspected pes planus. Our model allows clinicians to make pre-judgments about patients’ pes planus status before taking standard pes planus angle measurements. This approach can expedite patient care. Our developed model could also be applied in the military recruitment process. Several countries enforce mandatory military service for all male citizens. During this procedure, it is necessary to examine them for any potential abnormalities in their feet. Our developed model can assist by evaluating radiographs for a large number of individuals. Moreover, the model can serve as a valuable tool in research. When applied to large datasets of lateral weight-bearing foot X-rays, our model efficiently identifies pes planus.
This study has limitations. We utilized the calcaneal inclination angle in lateral weight-bearing foot X-rays to detect pes planus. However, it is important to note that the underlying causes of pes planus vary. Similar to other orthopedic X-ray assessments, the condition of pes planus should be examined by utilizing a minimum of two X-rays taken from varying angles. In addition, three-dimensional computed tomography (CT) is valuable for the purpose of surgical planning. Future studies may utilize X-rays taken from various angles and CT imaging to enhance the accuracy of detecting pes planus. Another limitation is the restricted quantity of X-rays available. Future studies could focus on further enhancing the model’s performance by utilizing larger and more diverse datasets to reduce FPs and FNs in the adolescent and open-physis groups, particularly. The last limitation is that the F1 score of our proposed model under data-imbalance conditions was not investigated.
In conclusion, the fusion model developed for automatic pes planus detection yielded promising results and could play a significant role in both academic research and clinical practice, contributing to the early diagnosis and treatment of pes planus.

5. Conclusions

The results obtained in this study indicate that our fusion model, which integrates a CNN and ViT, has promising potential for the automatic detection of pes planus. The success of the model is evident in its high accuracy rates and sensitivity values. Furthermore, it outperformed previous approaches in pes planus detection, offering a new perspective in the field and potentially serving as a valuable tool in clinical applications.

Author Contributions

Conceptualization, K.D., A.Y. and T.S.; methodology, A.Y. and T.S.; software, T.S.; validation, K.D. and A.Y.; investigation, K.D., A.Y. and T.S.; writing—original draft preparation, K.D., A.Y. and T.S.; writing—review and editing, K.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://www.kaggle.com/datasets/suleyman32/pesplanus-two-class-dataset (accessed on 8 August 2024).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Lalevée, M.; Mansur, N.S.B.; Schmidt, E.; Carvalho, K.; Vandelune, C.; Bernasconi, A.; Wilken, J.; Netto, C.d.C. Does tibialis posterior dysfunction correlate with a worse radiographic overall alignment in progressive collapsing foot deformity? A retrospective study. Foot Ankle Surg. 2022, 28, 995–1001. [Google Scholar] [CrossRef]
  2. Mann, R.A.; Thompson, F.J. Rupture of the posterior tibial tendon causing flat foot. Surgical treatment. J. Bone Jt. Surg. 1985, 67, 556–561. [Google Scholar] [CrossRef]
  3. Gül, Y.; Yaman, S.; Avcı, D.; Çilengir, A.H.; Balaban, M.; Güler, H. A Novel Deep Transfer Learning-Based Approach for Automated Pes Planus Diagnosis Using X-ray Image. Diagnostics 2023, 13, 1662. [Google Scholar] [CrossRef] [PubMed]
  4. Ueki, Y.; Sakuma, E.; Wada, I. Pathology and management of flexible flat foot in children. J. Orthop. Sci. 2019, 24, 9–13. [Google Scholar] [CrossRef] [PubMed]
  5. Guduri, V.; Dreyer, M.A. Talocalcaneal coalition. In StatPearls; StatPearls Publishing: Treasure Island, FL, USA, 2019. [Google Scholar]
  6. Smyth, N.A.; Aiyer, A.A.; Kaplan, J.R.; Carmody, C.A.; Kadakia, A.R. Adult-acquired flatfoot deformity. Eur. J. Orthop. Surg. Traumatol. 2017, 27, 433–439. [Google Scholar] [CrossRef]
  7. Catanzano, A.A., Jr.; Akoh, C.C.; Easley, M.E.; Mosca, V.S. Decision-making and management of tarsal coalition in the young adult patient: A critical analysis review. JBJS Rev. 2023, 11, e23. [Google Scholar] [CrossRef]
  8. Bernasconi, A.; Lintz, F.; Sadile, F. The role of arthroereisis of the subtalar joint for flatfoot in children and adults. EFORT Open Rev. 2017, 2, 438–446. [Google Scholar] [CrossRef]
  9. Ferciot, C.F. The etiology of developmental flatfoot. Clin. Orthop. Relat. Res. 1972, 85, 7–10. [Google Scholar] [CrossRef] [PubMed]
  10. Ling, S.K.K.; Lui, T.H. Suppl-4, M12: Posterior Tibial Tendon Dysfunction: An Overview. Open Orthop. J. 2017, 11, 714. [Google Scholar] [CrossRef]
  11. Rhim, H.C.; Dhawan, R.; Gureck, A.E.; Lieberman, D.E.; Nolan, D.C.; Elshafey, R.; Tenforde, A.S. Characteristics and future direction of tibialis posterior tendinopathy research: A scoping review. Medicina 2022, 58, 1858. [Google Scholar] [CrossRef]
  12. Cifuentes-De la Portilla, C.; Larrainzar-Garijo, R.; Bayod, J.J. Biomechanical stress analysis of the main soft tissues associated with the development of adult acquired flatfoot deformity. Clin. Biomech. 2019, 61, 163–171. [Google Scholar] [CrossRef] [PubMed]
  13. Kaye, R.A.; Jahss, M.H. Foot fellows review: Tibialis posterior: A review of anatomy and biomechanics in relation to support of the medial longitudinal arch. Foot Ankle 1991, 11, 244–247. [Google Scholar] [CrossRef]
  14. Fadle, A.A.; Khalifa, A.A.; Bahy, A.; El-Gammal, Y.T.; Abubeih, H.; El-Adly, W.; Osman, A.E. Joint preservation surgery for correcting adolescents’ spasmodic flatfoot deformity: Early results from a specialized North African foot and ankle unit. Int. Orthop. 2023, 48, 1543–1552. [Google Scholar] [CrossRef]
  15. Gould, N.J. Graphing the adult foot and ankle. Foot Ankle 1982, 2, 213–219. [Google Scholar] [CrossRef]
  16. Bock, P.; Pittermann, M.; Chraim, M.; Rois, S. The inter-and intraobserver reliability for the radiological parameters of flatfoot, before and after surgery. Bone Jt. J. 2018, 100, 596–602. [Google Scholar] [CrossRef]
  17. Stotter, C.; Klestil, T.; Röder, C.; Reuter, P.; Chen, K.; Emprechtinger, R.; Hummer, A.; Salzlechner, C.; DiFranco, M.; Nehrer, S. Deep learning for fully automated radiographic measurements of the pelvis and hip. Diagnostics 2023, 13, 497. [Google Scholar] [CrossRef] [PubMed]
  18. Yuan, R.; Janzen, I.; Devnath, L.; Khattra, S.; Myers, R.; Lam, S.; MacAulay, C. MA19. 11 Predicting Future Lung Cancer Risk with Low-Dose Screening CT Using an Artificial Intelligence Model. J. Thorac. Oncol. 2023, 18, S174. [Google Scholar] [CrossRef]
  19. Arzhaeva, Y.; Wang, D.; Devnath, L.; Amirgholipour, S.K.; McBean, R.; Hillhouse, J.; Luo, S.; Meredith, D.; Newbigin, K.; Yates, D. Development of Automated Diagnostic Tools for Pneumoconiosis Detection from Chest X-ray Radiographs; CSIRO: Canberra, Australia, 2019. [Google Scholar]
  20. Tajbakhsh, N.; Shin, J.Y.; Gurudu, S.R.; Hurst, R.T.; Kendall, C.B.; Gotway, M.B.; Liang, J. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. Med. Imaging 2016, 35, 1299–1312. [Google Scholar] [CrossRef]
  21. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  22. Chen, C.-F.R.; Fan, Q.; Panda, R. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021. [Google Scholar]
  23. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  24. Alimjan, G.; Sun, T.; Liang, Y.; Jumahun, H.; Guan, Y. A new technique for remote sensing image classification based on combinatorial algorithm of SVM and KNN. Int. J. Pattern Recognit. Artif. Intell. 2018, 32, 1859012. [Google Scholar] [CrossRef]
  25. Drukker, K.; Sahiner, B.; Hu, T.; Kim, G.H.; Whitney, H.M.; Baughan, N.; Myers, K.J.; Giger, M.L.; McNitt-Gray, M. MIDRC-MetricTree: A decision tree-based tool for recommending performance metrics in artificial intelligence-assisted medical image analysis. J. Med. Imaging 2024, 11, 024504. [Google Scholar] [CrossRef] [PubMed]
  26. Wikipedia Contributors. Calcaneal Pitch. Available online: https://en.wikipedia.org/w/index.php?title=Calcaneal_pitch&oldid=1218557863 (accessed on 23 July 2024).
  27. López, O.A.M.; López, A.M.; Crossa, J. Overfitting, model tuning, and evaluation of prediction performance. In Multivariate Statistical Machine Learning Methods for Genomic Prediction; Springer: Dordrecht, The Netherlands, 2022; pp. 109–139. [Google Scholar]
  28. Bejani, M.M.; Ghatee, M. Regularized deep networks in intelligent transportation systems: A taxonomy and a case study. arXiv 2019, arXiv:1911.03010. [Google Scholar]
  29. Chen, W.; Ayoub, M.; Liao, M.; Shi, R.; Zhang, M.; Su, F.; Huang, Z.; Li, Y.; Wang, Y.; Wong, K.K. A fusion of VGG-16 and ViT models for improving bone tumor classification in computed tomography. J. Bone Oncol. 2023, 43, 100508. [Google Scholar] [CrossRef]
  30. Li, B.-Q.; Hu, L.-L.; Niu, S.; Cai, Y.-D.; Chou, K.-C. Predict and analyze S-nitrosylation modification sites with the mRMR and IFS approaches. J. Proteom. 2012, 75, 1654–1665. [Google Scholar] [CrossRef] [PubMed]
  31. Devnath, L.; Luo, S.; Summons, P.; Wang, D.; Shaukat, K.; Hameed, I.A.; Alrayes, F.S. Deep ensemble learning for the automatic detection of pneumoconiosis in coal worker’s chest X-ray radiography. J. Clin. Med. 2022, 11, 5342. [Google Scholar] [CrossRef]
  32. Xu, D.; Quan, W.; Zhou, H.; Sun, D.; Baker, J.S.; Gu, Y. Explaining the differences of gait patterns between high and low-mileage runners with machine learning. Sci. Rep. 2022, 12, 2981. [Google Scholar] [CrossRef] [PubMed]
  33. Xu, D.; Zhou, H.; Quan, W.; Jiang, X.; Liang, M.; Li, S.; Ugbolue, U.C.; Baker, J.S.; Gusztav, F.; Ma, X.; et al. A new method proposed for realizing human gait pattern recognition: Inspirations for the application of sports and clinical gait analysis. Gait Posture 2024, 107, 293–305. [Google Scholar] [CrossRef]
  34. Alsaidi, F.A.; Moria, K.M. Flatfeet Severity-Level Detection Based on Alignment Measuring. Sensors 2023, 23, 8219. [Google Scholar] [CrossRef]
  35. Jian, M.-S.; Shen, J.H.; Chen, Y.C.; Chang, C.C.; Fang, Y.C.; Chen, C.C.; Chen, W.H. Cloud image processing and analysis based flatfoot classification method. Int. J. Comput. 2014, 8, 90–98. [Google Scholar]
  36. Kao, E.-F.; Lu, C.-Y.; Wang, C.-Y.; Yeh, W.-C.; Hsia, P.-K. Fully automated determination of arch angle on weight-bearing foot radiograph. Comput. Methods Programs Biomed. 2018, 154, 79–88. [Google Scholar] [CrossRef]
  37. Akdogan, I.; Akkaya, S.; Akkaya, N.; Kiter, E. Comparison of the calcaneal pitch angle and modified projection area per length squared method for medial longitudinal arch evaluation of the foot. Balk. Med. J. 2012, 2012, 406–409. [Google Scholar] [CrossRef]
Figure 1. Definition of calcaneal pitch angle [26].
Figure 2. VGG-16-based CNN architecture.
Figure 3. Structure of the ViT model.
Figure 4. The flow diagram illustrating the proposed model.
Figure 5. The proposed ensemble learning model.
Figure 6. Example images detected as TPs (a,b), TNs (c,d), FPs (e,f), and FNs (g,h).
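The TP, TN, FP, and FN examples in Figure 6 are the quantities from which the reported metrics are derived. As a minimal sketch of that computation (the counts below are hypothetical and chosen only to illustrate the formulas, not the study's actual confusion matrix):

```python
# Evaluation metrics from confusion-matrix counts.
def metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)          # recall on the pes planus class
    specificity = tn / (tn + fp)          # recall on the normal class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, f1, accuracy

# Hypothetical counts for a balanced 200-image test fold.
sens, spec, f1, acc = metrics(tp=97, tn=96, fp=4, fn=3)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} f1={f1:.3f} accuracy={acc:.3f}")
```

With these illustrative counts, sensitivity and specificity reduce to 97/100 and 96/100, and F1 equals 2TP/(2TP + FP + FN) = 194/201.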
Table 1. Distribution of the dataset.

| Category | Pes Planus, Patients (n) | Pes Planus, Images (n) | Normal, Patients (n) | Normal, Images (n) |
|---|---|---|---|---|
| Age 14–24 | 94 | 188 | 98 | 196 |
| Age 25–35 | 78 | 156 | 88 | 176 |
| Age 35–47 | 29 | 58 | 34 | 68 |
| Sex: Male | 171 | 342 | 180 | 360 |
| Sex: Female | 30 | 60 | 40 | 80 |
Table 2. Comparison of the fusion model with ViT and Vgg16.

| Model | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1 Score (%) |
|---|---|---|---|---|
| Vgg16 | 90 | 89.8 | 91.2 | 91.5 |
| ViT-B/16 | 94.2 | 93.9 | 94.2 | 94.1 |
| Vgg16–ViT | 96.8 | 97.4 | 96.8 | 96.1 |
Table 3. Performances of different classifiers and ensemble learning.

| Classifier | Accuracy (%) | Sensitivity (%) | Specificity (%) | F1 Score (%) |
|---|---|---|---|---|
| SVM | 94 | 93.3 | 94.7 | 94.1 |
| KNN | 92.3 | 91.8 | 93.4 | 92.9 |
| DT | 90.2 | 89.48 | 90.8 | 90.3 |
| EL (majority) | 95.8 | 95 | 96.4 | 95.9 |
| EL (weighted) | 96.8 | 97.4 | 96.3 | 96.8 |
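The weighted ensemble in Table 3 combines the SVM, KNN, and DT predictions. A minimal sketch of weighted majority voting, assuming weights proportional to each classifier's validation accuracy (the study's exact weighting scheme is not reproduced here):

```python
import numpy as np

def weighted_vote(predictions, weights):
    """predictions: (n_classifiers, n_samples) array of 0/1 labels;
    weights: one non-negative weight per classifier."""
    preds = np.asarray(predictions, dtype=float)
    w = np.asarray(weights, dtype=float)[:, None]
    # Weighted fraction of the vote mass cast for class 1,
    # thresholded at half of the total weight.
    score = (preds * w).sum(axis=0) / w.sum()
    return (score >= 0.5).astype(int)

# Illustrative per-sample labels from the three base classifiers.
svm = [1, 0, 1, 1]
knn = [1, 0, 0, 1]
dt  = [0, 1, 1, 1]
print(weighted_vote([svm, knn, dt], weights=[0.94, 0.923, 0.902]))
```

With near-equal weights this behaves like plain majority voting; the weighting only matters when the classifiers' reliabilities differ enough to break ties in favor of the stronger models.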
Table 4. Similar studies detecting pes planus based on CNNs.

| Study | Method | Accuracy (%) | Sensitivity (%) | F1 Score (%) |
|---|---|---|---|---|
| Alsaidi et al. [34] | CNN/Random Forest | 88.75 | 88.00 | 85.71 |
| Gul et al. [3] | Pyramid MobileNetV2 | 95.14 | 96.86 | 95.22 |
| Proposed | Fusion model (CNN + ViT) | 96.8 | 97.4 | 96.8 |

Share and Cite

Doğan, K.; Selçuk, T.; Yılmaz, A. A Novel Model Based on CNN–ViT Fusion and Ensemble Learning for the Automatic Detection of Pes Planus. J. Clin. Med. 2024, 13, 4800. https://doi.org/10.3390/jcm13164800
