4.2. Experimental Results
This study utilized a binary metaheuristic algorithm to decrease the number of features in the common gene dataset obtained through differential expression analysis, resulting in an optimized dataset that includes only crucial features pertinent to the research. The LASSO regression method was used to analyze data from a maximum of 14 genes.
The method was implemented in R. In LASSO regression, the tuning parameter lambda (λ) controls the amount of coefficient shrinkage and therefore has to be chosen carefully. The optimal λ for the dataset is selected by cross-validation as the value that minimizes the prediction error rate.
Figure 4 shows that the left dashed vertical line corresponds to the logarithmic value of the optimal lambda that minimizes the prediction error, which is approximately
and provides the most accurate results.
In general, regularization aims to balance accuracy and simplicity by finding a model with the highest accuracy and the fewest predictors. The optimal value of lambda is usually chosen by considering two values: lambda.1se and lambda.min. The former produces a simpler model but may be less accurate, while the latter is more accurate but less parsimonious. In this study, the accuracy of LASSO regression was compared with that of the full logistic regression model, as shown in
Table 2. The results showed that
produced the highest accuracy, and the obvious choice of the optimal value was 0.001. Finally, the most significant genes were selected based on this optimal value.
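The choice between the two candidate values of lambda can be illustrated with a minimal sketch. It assumes the convention used by the glmnet package, where lambda.min is the value with the lowest mean cross-validated error and lambda.1se is the largest lambda whose error stays within one standard error of that minimum. The function name `select_lambdas` and all numbers are hypothetical and are not taken from the study.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from a cross-validation error curve."""
    # Index of the lambda with the smallest mean cross-validated error.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # One-standard-error threshold above the minimum error.
    threshold = cv_mean[i_min] + cv_se[i_min]
    # Largest lambda (strongest shrinkage) whose error is within the threshold.
    lambda_1se = max(l for l, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lambda_1se


# Hypothetical error curve, for illustration only.
lambdas = [0.5, 0.2, 0.1, 0.05, 0.02]
cv_mean = [0.40, 0.31, 0.27, 0.25, 0.24]
cv_se = [0.03, 0.03, 0.02, 0.02, 0.02]

lambda_min, lambda_1se = select_lambdas(lambdas, cv_mean, cv_se)
print("lambda.min:", lambda_min, "lambda.1se:", lambda_1se)
```

Because lambda.1se is never smaller than lambda.min, it shrinks coefficients at least as strongly and so tends to keep fewer predictors.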
The LASSO method applies L1 (absolute value) penalties in penalized regression and is particularly effective for variable selection when there are many predictors. The resulting solution is often sparse, with only a few non-zero estimated regression coefficients.
Table 3 presents the list of selected genes obtained using the LASSO method.
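Why an absolute value penalty yields sparse solutions can be seen in a simplified special case. The sketch below assumes an orthonormal design and the objective (1/2)·RSS + λ·Σ|β|, for which each LASSO coefficient is the soft-thresholded least-squares coefficient. The coefficient values and the penalty are hypothetical, and this is not the fitting procedure used in the study.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values with |z| <= lam become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


# Hypothetical unpenalized (least-squares) coefficients and penalty.
ols_coefs = [2.5, -0.3, 0.8, -1.7, 0.05]
lam = 1.0

lasso_coefs = [soft_threshold(z, lam) for z in ols_coefs]
n_selected = sum(1 for c in lasso_coefs if c != 0.0)
print(lasso_coefs, "non-zero coefficients:", n_selected)
```

Coefficients whose magnitude does not exceed the penalty are set exactly to zero, which is the mechanism behind variable selection.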
Next, the best classification algorithm was selected from SVM, RF, and KNN by applying each algorithm to the full dataset and the filtered dataset separately. As shown in
Table 4 and
Figure 5, the RF algorithm achieved the highest average accuracy on the full dataset (73.32%). On the filtered dataset, by contrast, the SVM classifier achieved a high average classification accuracy with low variance. SVM combined with the BRSA algorithm gave the highest accuracy (87.22%) of the three classifiers. Therefore, SVM was selected as the classifier adopted in this study.
To convert the continuous search space into a binary one in BRSA, a sigmoid transfer function was used.
Table 5 shows the statistical results obtained for each evaluation metric under each sigmoid transfer function. The best results among the sigmoid transfer functions are highlighted in bold.
The fourth sigmoid transfer function showed significantly higher averages for classification accuracy, fitness value, precision, and specificity than the other three transfer functions. Specificity is the proportion of actual negatives that are correctly classified, and this function achieved a specificity of 82.78%, indicating that 82.78% of those without the target disease will test negative. The best and worst values of the evaluated metrics were almost equal across the four transfer functions.
Figure 6 provides a clearer representation of the average number of selected features, where
has the fewest significant features (6.05). Furthermore,
Figure 7 shows that
and
had similar average fitness values, but different accuracy and sensitivity values. A higher sensitivity in
indicates that the model correctly identifies most positive results, whereas a low sensitivity means the model misses a significant number of positive results.
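The evaluation metrics discussed above all derive from the four cells of a binary confusion matrix. The following is a minimal sketch of their standard definitions; the counts are hypothetical and do not correspond to the results reported in the tables.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # share of actual positives detected
    specificity = tn / (tn + fp)  # share of actual negatives detected
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Two models can share a similar accuracy while differing in sensitivity, because accuracy pools both classes whereas sensitivity looks only at the positive cases.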
Furthermore, the convergence of the four sigmoid transfer functions is compared in
Figure 8, which illustrates the efficiency of the algorithms.
Figure 8 shows that the
sigmoid transfer function not only attained a superior convergence speed but also acquired the best fitness scores. It typically achieved its optimal solution in around 70 iterations, whereas
began with a low fitness value and converged to a high fitness value after approximately 220 iterations. As a result, the
sigmoid transfer function was deemed the most appropriate for the proposed BRSA.
Next, the proposed BRSA was compared with four alternative algorithms: the binary dragonfly algorithm (BDA), binary particle swarm optimization (BPSO), and two variants of the binary gray wolf optimization algorithm (BGWO1 and BGWO2). We first computed several statistical metrics; the results are presented in
Table 6. The average accuracy, average F-measure, and average sensitivity of BRSA were higher than those of the other algorithms; only in average precision was BRSA not the best.
Additionally, BDA was found to be the most competitive algorithm, with BRSA following closely behind. Based on these findings, we can infer that BRSA outperformed BPSO, BDA, BGWO1, and BGWO2 in selecting the most relevant features from the tested datasets to optimize classification performance, while minimizing the number of selected features.
Furthermore, according to the conclusion by Demšar [
39] and Benavoli et al. [
40] that “the non-parametric tests should be preferred over the parametric ones”, we employed the Friedman test [
41] to validate the obtained results and determined that the differences between the competing methods were significant.
Table 7,
Table 8 and
Table 9 display the final rank of each algorithm as determined by the Friedman test. The test was conducted using IBM SPSS Statistics version 22. Based on the ranks, it is evident that BRSA achieved the first rank in terms of performance measures for both classification accuracy and fitness value, thereby taking first place among all algorithms. However, in terms of the number of selected features, BRSA ranked second, with BDA obtaining first place in the Friedman test.
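How Friedman ranks of this kind are obtained can be sketched in a few lines. The study ran the test in SPSS; the code below is only an illustration of the textbook computation (average ranks per algorithm and the chi-square statistic, without a tie correction). The score matrix is hypothetical, with one row per run and one column per algorithm.

```python
def average_ranks(row, higher_is_better=True):
    """Rank one row of scores (1 = best), averaging ranks over ties."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=higher_is_better)
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a block of tied scores.
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks


def friedman(scores, higher_is_better=True):
    """Return (mean rank per algorithm, Friedman chi-square statistic)."""
    n, k = len(scores), len(scores[0])
    rank_rows = [average_ranks(row, higher_is_better) for row in scores]
    mean_ranks = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4
    )
    return mean_ranks, chi2


# Hypothetical accuracies: 4 runs (rows) x 3 algorithms (columns).
scores = [
    [0.90, 0.85, 0.80],
    [0.88, 0.86, 0.79],
    [0.91, 0.84, 0.83],
    [0.87, 0.89, 0.80],
]
mean_ranks, chi2 = friedman(scores)
print("mean ranks:", mean_ranks, "chi-square:", chi2)
```

For a measure where smaller is better, such as the number of selected features, the same function would be called with `higher_is_better=False`. The statistic is compared against a chi-square distribution with k − 1 degrees of freedom.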
After implementing the proposed BRSA approach on 4055 common DE genes, the top subset of six genes was identified as the optimal subset with 87.22% accuracy for the SVM classifier.
Table 10 presents the selected genes obtained using this approach.
In order to enhance the predictive accuracy of ACE2 in COVID-19 diagnosis, the genes selected by the proposed method were compared with the ACE2 gene and with the genes identified through LASSO regression.
Figure 9 shows a heatmap of the ACE2 gene and the genes selected through LASSO regression. Heatmaps are a popular way to visualize gene expression data. A heatmap can also be combined with clustering techniques, which group genes and/or samples according to the similarity of their expression patterns. This can help identify the biological signatures linked to a specific condition (such as a disease or an environmental condition) or genes that are commonly regulated.
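The notion of similar expression that underlies such clustering can be made concrete with a small sketch. It assumes a correlation-based distance (1 minus the Pearson correlation), which is one common choice and not necessarily the one used to produce the figure. The gene names and expression values are placeholders.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical expression values over five samples; names are placeholders.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "gene_b": [1.1, 2.1, 2.9, 4.2, 4.8],
    "gene_c": [5.0, 4.0, 3.0, 2.0, 1.0],
}

genes = sorted(expression)
# Correlation distance for every unordered pair of genes.
distance = {
    (g, h): 1 - pearson(expression[g], expression[h])
    for g in genes
    for h in genes
    if g < h
}
closest = min(distance, key=distance.get)
print("most similar pair:", closest)
```

A hierarchical clustering would start by merging the closest pair and would place genes with nearly identical profiles next to each other in the heatmap.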
The heatmap displayed in
Figure 9 indicates that the expressions of ACE2, IFIT5, and TRIM14 were almost identical, and the proposed algorithm selected them. This implies that IFIT5 and TRIM14 share the characteristics of ACE2, which is a COVID-19-related gene. ACE2, also known as ACEH, may play opposing roles in health and disease. The COVID-19 virus uses the ACE2 receptor to enter human cells, and this receptor is found in almost all organs of the body [
42,
43]. In addition, BEX2 and SNHG9 show similar patterns of up- and downregulation, but according to the National Library of Medicine website, these genes have no connection with COVID-19 symptoms.