Article
Peer-Review Record

Generating Mathematical Expressions for Estimation of Atomic Coordinates of Carbon Nanotubes Using Genetic Programming Symbolic Regression

Technologies 2023, 11(6), 185; https://doi.org/10.3390/technologies11060185
by Nikola Anđelić *,† and Sandi Baressi Šegota
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 19 October 2023 / Revised: 7 December 2023 / Accepted: 15 December 2023 / Published: 18 December 2023
(This article belongs to the Section Innovations in Materials Processing)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper delves into the use of Genetic Programming Symbolic Regression (GPSR) to accurately estimate atomic coordinates of Carbon Nanotubes (CNTs), offering an alternative to computationally expensive Density Functional Theory (DFT) calculations. It rigorously explores the optimization of GPSR's hyperparameters using Random Hyperparameter Value Search (RHVS) and evaluates the model's performance using multiple metrics. However, the paper requires a minor revision for a few key reasons. First, while it comprehensively details methods and evaluation metrics, the paper fails to deliver complete results, leaving the reader with only 'hints' about performance metrics. Second, the paper discusses the 'bloat phenomenon' and the 'parsimony pressure method,' but the empirical validation for these techniques is unclear. Third, it indicates the dependency of GPSR's performance on dataset quality and well-tuned hyperparameters but does not adequately discuss how this compares with previous methods like neural networks in a quantifiable way.

1.     The paper's discussion section mentions that the results had "slightly lower estimation performance than those reported in other research papers" (line 406) but does not delve into specifics. The study should include comprehensive results, offering a full comparison with existing methods.

2.     The paper introduces a parsimony pressure method for the bloat phenomenon but does not provide sufficient empirical evidence to back its effectiveness. More data and possibly comparative graphs showing the bloat phenomenon with and without parsimony pressure could solidify this section.

3.     Although the paper aims to address the shortcomings of using neural networks for this problem, there needs to be quantifiable data comparing GPSR with neural networks. The paper would benefit from a dedicated section comparing the performance, computational cost, and ease of interpreting GPSR vis-à-vis neural networks.

4.     While the paper mentions that future work will focus on enhancing the GPSR algorithm, this is too vague. It would be more beneficial to the academic community if specific directions, potential methodologies, or collaborations with other evolutionary algorithms were discussed in detail.

5.     The figures could be improved, for example Figure 5 and Figure 6: add a legend for the colors and unify the fonts used.

Comments on the Quality of English Language

Acceptable

Author Response

The authors of this manuscript want to thank the reviewer for the time and effort spent providing comments and suggestions that could greatly improve the manuscript's quality. We hope that our answers to the reviewer's comments, and the modifications made to the manuscript in response to them, have improved the manuscript's quality and that it can be accepted for publication in this form.

 


 

  1. The paper's discussion section mentions that the results had "slightly lower estimation performance than those reported in other research papers" (line 406) but does not delve into specifics. The study should include comprehensive results, offering a full comparison with existing methods.

 

Answer: 

In the discussion section, we conducted a thorough comparison of our research results with a previous paper wherein the authors crafted the dataset utilized in our study. Despite an exhaustive search across available literature, we could not identify any comparable research beyond the scope of the referenced manuscript [14]. The juxtaposition of our findings with those reported in [14] reveals a marginal discrepancy, with our R^2 scores slightly lower and MAE and RMSE slightly higher.

 

The distinctive advantage of our research lies in the symbolic expressions acquired through the Genetic Programming Symbolic Regression (GPSR) approach. These expressions offer seamless integration into applications and user-friendly utilization when contrasted with conventional machine learning architectures such as FFNN, FITNET, CFNN, and GRNN. The inherent challenge with these neural networks is their resistance to transformation into simple symbolic forms due to the intricate interconnections among numerous neurons. Furthermore, these machine learning algorithms, once trained, demand significantly greater resources for storage, not to mention the more substantial computational resources required to reuse trained models for the estimation of new atomic coordinates.

 

In essence, while our estimation accuracy slightly trails that of neural networks, the pivotal strength of our approach resides in the accessibility of the obtained symbolic expressions. These expressions are easily deployable and circumvent the resource-intensive demands associated with neural networks. Consequently, the overarching conclusion drawn from this observation is that, despite a minor disparity in accuracy, the practical advantages offered by our approach, namely the simplicity and efficiency of symbolic expressions, outweigh the computational overhead associated with more complex machine learning architectures.

Since the other reviewer has requested that we describe in detail the comparison of our results with results from other literature we have expanded Table 8 and made a detailed comparison of our results with the results reported in [14]. Citing the detailed description of Table 8 positioned below Table 8: “The outcomes presented in Table \ref{tab:FinalComparison} highlight a noteworthy parallel between the results of this paper and those documented in \cite{aci2016artificial}. In the mentioned study, various neural networks were explored, with the FITNET neural network exhibiting the highest estimation accuracy, as evidenced by superior $R^2$ values and lower $MAE$ and $RMSE$ values. It's essential to underscore that the reported results encompass all three target values, namely $u_c$, $v_c$, and $w_c$, across the four neural networks employed, with the lowest accuracy observed in the GRNN.

In specific scenarios where $nmu$, $nmv$, and $nmw$ were utilized as input variables in the GPSR algorithm to predict $u_c$, $v_c$, and $w_c$, the estimation accuracy aligns closely for $u_c$ and $v_c$, while showing a slightly lower performance for the $w_c$ target variable. A comparative analysis with the findings from \cite{aci2016artificial} reveals that our approach outperforms most Machine Learning (ML) algorithms for $u_c$ and $v_c$ targets, with FITNET being the exception. For the $w_c$ target, our approach outperforms only the GRNN algorithm.

In instances where all input variables, i.e., $nmuvw$, were employed in the GPSR algorithm to predict the aforementioned target variables, the highest estimation accuracy was observed for $u_c$, followed by $v_c$ and $w_c$, respectively. Notably, our results for the $u_c$ target closely align with the findings from \cite{aci2016artificial} with FITNET, surpassing other neural networks such as FFNN, CFNN, and GRNN. The estimation accuracy for $v_c$ and $w_c$ outperforms that of the GRNN algorithm.

A comprehensive comparison between both approaches underscores that the optimal accuracy in calculating $u_c$, $v_c$, and $w_c$ is achieved by employing Mathematical Expressions (MEs) that require $nmu$, $nmv$, and $nmuvw$, respectively. Intriguingly, the accurate calculation of $w_c$ necessitates the inclusion of all input variables.

A nuanced examination of our results vis-à-vis those in \cite{aci2016artificial} reveals a striking similarity. The distinctive advantage of our approach lies in its practical implementation, eliminating the need to store a trained GPSR model. The obtained MEs are user-friendly, requiring minimal computational resources, in stark contrast to the trained neural networks in the referenced literature, which demand more extensive computational capabilities. This pragmatic aspect reinforces the utility and accessibility of our approach in real-world applications.”

 

  2. The paper introduces a parsimony pressure method for the bloat phenomenon but does not provide sufficient empirical evidence to back its effectiveness. More data and possibly comparative graphs showing the bloat phenomenon with and without parsimony pressure could solidify this section.

Answer: 

While the parsimony pressure method stands as a critical hyperparameter in Genetic Programming Symbolic Regression (GPSR), it is essential to clarify that our research did not center around this specific parameter. Our primary objective was to ascertain whether GPSR could yield symbolic expressions for the estimation of Carbon Nanotubes (CNTs) atomic coordinates. The parsimony coefficient, one of several hyperparameters subject to random search within predefined ranges through the Random Hyperparameter Value Search (RHVS) method, was indeed included in our investigation.
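
The random search over predefined ranges mentioned in this answer can be illustrated with a minimal Python sketch. The hyperparameter names, the ranges in `SEARCH_SPACE`, and the `evaluate` callback are hypothetical placeholders chosen for illustration; they are not the values or the code used in the manuscript.

```python
import random

# Hypothetical search ranges (assumptions, not the ranges used in the manuscript).
SEARCH_SPACE = {
    "population_size": (100, 2000),         # integer range
    "max_generations": (50, 300),           # integer range
    "parsimony_coefficient": (1e-5, 1e-3),  # float range
}


def sample_hyperparameters(rng):
    """Draw one random combination of hyperparameter values from SEARCH_SPACE."""
    params = {}
    for name, (low, high) in SEARCH_SPACE.items():
        if isinstance(low, int) and isinstance(high, int):
            params[name] = rng.randint(low, high)   # inclusive integer draw
        else:
            params[name] = rng.uniform(low, high)   # uniform float draw
    return params


def random_search(evaluate, n_trials, seed=0):
    """Return the sampled combination with the lowest score given by `evaluate`.

    `evaluate` is a user-supplied function (hypothetical here) that trains a
    model with the given hyperparameters and returns an error such as MAE.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = sample_hyperparameters(rng)
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In practice `evaluate` would wrap the GPSR training run; here it is left abstract so the sketch stays independent of any particular library.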

 

It is noteworthy that, although only briefly explored within the broader spectrum of hyperparameter values, the parsimony coefficient was acknowledged as a sensitive factor in the GPSR algorithm. Its significance lies in its potential to affect the size of the obtained Mathematical Equations (MEs), and it serves as a crucial tool in averting the bloat phenomenon. This phenomenon, characterized by excessively long MEs accompanied by elevated fitness values (MAE), was not observed at any point in our investigation.

 

Your suggestion regarding a more in-depth exploration of the parsimony coefficient is duly acknowledged and aligns with our intentions for future research, as explicitly outlined in the Future Work section appended to the Conclusions. This signifies our commitment to delving deeper into the nuanced influence of the parsimony coefficient to enhance our understanding and potentially optimize its contribution to the symbolic expressions generated by the GPSR algorithm. Citing the future work regarding the parsimony coefficient at the end of the Conclusions section: “While the parsimony coefficient was not a primary focus in the current study, future research will explore the use of higher parsimony coefficient values. As already stated in this research, higher values of the parsimony coefficient can prevent an increase in symbolic expression length, i.e., the bloat phenomenon. So one of the objectives in future work is to generate smaller MEs in terms of length and depth while maintaining or even enhancing the accuracy achieved in this study.”
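
As a rough illustration of how a parsimony coefficient discourages bloat, the sketch below assumes a simple additive penalty: the raw fitness (MAE) plus the coefficient times the expression length. This is one common formulation and is an assumption here; it is not necessarily the exact rule applied by the GPSR implementation used in the manuscript, and the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def penalized_fitness(y_true, y_pred, expression_length, parsimony_coefficient):
    """Raw MAE plus a penalty proportional to expression length (lower is better).

    Assumed additive form: fitness = MAE + parsimony_coefficient * length.
    """
    raw = mean_absolute_error(y_true, y_pred)
    return raw + parsimony_coefficient * expression_length
```

Under this assumed penalty, a long expression with a marginally lower MAE can still rank below a short one, which is the mechanism by which higher coefficient values limit expression growth.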

 

  3. Although the paper aims to address the shortcomings of using neural networks for this problem, there needs to be quantifiable data comparing GPSR with neural networks. The paper would benefit from a dedicated section comparing the performance, computational cost, and ease of interpreting GPSR vis-à-vis neural networks.

Answer: 

The primary objective of this paper was not to engage in a direct performance comparison between Genetic Programming Symbolic Regression (GPSR) and neural networks. Instead, our focus centered on investigating whether GPSR could attain accuracy levels akin to those reported in the referenced article [14]. The second goal was inherently tied to the nature of GPSR, where the primary aim was to derive mathematical equations.

 

It's crucial to emphasize that, to the best of our knowledge, transforming neural networks into mathematical equations is not a straightforward task, if possible at all. This difficulty is likely attributed to the intricate network of interconnections between neurons within neural networks. The sheer complexity arising from these interconnections poses a significant hurdle in achieving a direct transformation into symbolic forms.

 

Contrastingly, the merit of acquiring mathematical equations through GPSR lies in their operational efficiency. These expressions demand lower computational resources for generating outputs compared to neural networks. In the case of neural networks, the trained model must be stored in memory, consuming considerable space, and requiring substantial computational resources when producing outputs. In stark contrast, the simplicity and efficiency of symbolic expressions facilitate quicker computations, demonstrating a notable advantage in terms of resource utilization. This underscores the practical benefits of symbolic expressions derived through GPSR in scenarios where computational efficiency is paramount.

 

  4. While the paper mentions that future work will focus on enhancing the GPSR algorithm, this is too vague. It would be more beneficial to the academic community if specific directions, potential methodologies, or collaborations with other evolutionary algorithms were discussed in detail.

Answer: The authors agree that the description of future work was too vague, so it has been expanded in the revised version of the manuscript; the other reviewer also requested this expansion. Citing the future work from the revised version of the manuscript: “The future work will concentrate on implementing various evolutionary algorithms (such as Genetic Algorithm, Differential Evolution, Particle Swarm Optimization, ...) to discover the optimal combination of hyperparameters for the GPSR model. The aim is to derive highly accurate MEs for estimating atomic coordinates of CNTs. The utilization of evolutionary algorithms is anticipated to expedite the identification of optimal hyperparameter combinations compared to the current method, which involves randomly searching hyperparameter values.

In future work, emphasis will be placed on determining whether highly accurate MEs can be achieved with smaller values for population size (Size\_pop) and maximum number of generations (max\_gen). This is crucial as these hyperparameters have been identified as contributors to prolonged execution times in the GPSR algorithm.

While the parsimony coefficient was not a primary focus in the current study, future research will explore the use of higher parsimony coefficient values. As already stated in this research, higher values of the parsimony coefficient can prevent an increase in symbolic expression length, i.e., the bloat phenomenon. So one of the objectives in future work is to generate smaller MEs in terms of length and depth while maintaining or even enhancing the accuracy achieved in this study.

Besides GPSR, future work on this dataset will employ various other AI algorithms, such as Bayesian regularization networks \cite{awan2023convective, awan2023intelligent, raja2021integrated, awan2023novel, awan2023bayesian, awan2021intelligent, wahid2023parametric}, ensemble methods \cite{andjelic2022mean}, multi-layer perceptron, and XGBoost \cite{baressi2023use}, among others, and compare their performance with the GPSR estimation performance.”



  5. The figures could be improved, for example Figure 5 and Figure 6: add a legend for the colors and unify the fonts used.

Answer: Thank you for the comment. We have changed both figures. The bars in the R^2 bar plot are red, the bars in the MAE bar plot are blue, and the bars in the RMSE bar plot are lime. The labels on the x-axis are the same as those described in the text of the manuscript. The font size of the x and y labels is the same. We hope that this modification of Figures 5 and 6 solves the problem.

Reviewer 2 Report

Comments and Suggestions for Authors

 

Review Comments:  Generating Mathematical Expressions for Estimation of Atomic Coordinates of Carbon Nanotubes Using Genetic Programming Symbolic Regression.

In this research work, the authors focused on the calculation of CNT atomic coordinates using density functional theory (DFT), which can be a challenging task; in some cases the calculation can last for days. To overcome this problem, the genetic programming symbolic regression (GPSR) method was applied to a publicly available dataset to see whether the obtained mathematical equations (MEs) could estimate, with high accuracy, the atomic coordinates calculated using DFT. Since GPSR has many hyperparameters, the idea was to develop a random hyperparameter values search (RHVS) method to find the optimal combination of GPSR hyperparameter values with which the highest estimation accuracy could be achieved. Two different approaches were considered: the first was to apply GPSR to estimate the calculated coordinates uc/vc/wc using all input variables (initial atomic coordinates u, v, and w and the integers n and m that specify the chiral vector), and the second was to apply GPSR to estimate each calculated atomic coordinate using the integers n and m and the corresponding initial atomic coordinates. With the application of the proposed approach, different dataset variations were created. The GPSR algorithm was trained using a 5-fold cross-validation process, and the MEs obtained after each training were evaluated using the coefficient of determination (R2), mean absolute error (MAE), root mean squared error (RMSE), and the depth and length of the generated MEs. The proposed approach showed that GPSR could be applied to estimate the atomic coordinates of CNTs with high accuracy (R2 ≈ 1.0).

Following are my comments which need to be addressed in revised version of manuscript:

1) What does GPSR stand for? Before using the abbreviation in the Abstract, first write out Genetic Programming Symbolic Regression and then give its abbreviation.

2) The abstract is wordy and not informative. The structure of the abstract needs revision. Revise the abstract to provide (i) the significance of the study, (ii) the aim of the study, (iii) the research methodology, (iv) the major conclusion of the study.

3) The originality of the paper needs to be stated clearly. It is of importance to have sufficient results to justify the novelty of a high-quality journal paper. The Introduction should make a compelling case for why the study is useful along with a clear statement of its novelty or originality by providing relevant information and providing answers to basic questions such as: What is already known in the open literature? What is missing (i.e., research gaps)? What needs to be done, why and how? Clear statements of the novelty of the work should also appear briefly in the Abstract and Conclusions sections.

4) An explanation is needed with each one of the figures. What is the physical meaning of the graphs? What is the conclusion(s) from each figure? The lack of physical argumentation is a concern that should be rectified in a revised version.

5) In the introduction, the authors did not provide a strong motivation for the paper and the obtained results. In addition, they should discuss the main contributions of their work in detail after the motivation part. Then they should briefly summarize the main structure of their paper at the end of the introduction.

6) The introduction section should be made more concise to show previous work in the field. At present, a lot of related research is stated in the introduction; however, no analysis is presented. The authors should ask themselves: what are the problems with the presented research? Why is the recent work needed? Answering these questions with reference to past articles can improve the present work.

7) The literature review section needs a lot of improvement. Only 16 papers are cited and discussed, and the most recent paper is from 2018. Please update the literature review.

8) Results and discussion should be provided in an elaborate manner for each graphical and numerical illustration, and a comparison of the algorithm with respect to complexity should be given in the results and discussion section.

9) Provide proper references for the equations.

10) The use of stochastic numerical computing approaches as a promising alternative should also be mentioned in the conclusion section. The following literature is relevant to the stochastic numerical computing approaches.

·         Convective flow dynamics with suspended carbon nanotubes in the presence of magnetic dipole: Intelligent solution predicted Bayesian regularization networks.

·         Intelligent Bayesian regularization‐based solution predictive procedure for hybrid nanoparticles of AA7072‐AA7075 oxide movement across a porous medium.

·         Integrated intelligent computing application for effectiveness of Au nanoparticles coated over MWCNTs with velocity slip in curved channel peristaltic flow.

·         Novel design of intelligent Bayesian networks to study the impact of magnetic field and Joule heating in hybrid nanomaterial flow with applications in medications for blood circulation.

·         Bayesian regularization knack-based intelligent networks for thermo-physical analysis of 3D MHD nanofluidic flow model over an exponential stretching surface.

·         Intelligent Bayesian regularization networks for bio-convective nanofluid flow model involving gyro-tactic organisms with viscous dissipation, stratification and heat immersion.

·         Parametric estimation scheme for aircraft fuel consumption using machine learning

11) Future applications of the presented technique should be listed in the conclusion section.

12) Authors should compare their results with already published literature.

13) The authors should proofread the entire manuscript in order to eliminate various sentence and typo errors.

 

Comments on the Quality of English Language

The authors should proofread the entire manuscript in order to eliminate various sentence and typo errors.

Author Response

The authors of this manuscript want to thank the reviewer for the time and effort spent providing comments and suggestions that could greatly improve the manuscript's quality. We hope that our answers to the reviewer's comments, and the modifications made to the manuscript in response to them, have improved the manuscript's quality and that it can be accepted for publication in this form.

 

 


 

  1. What does GPSR stand for? Before using the abbreviation in the Abstract, first write out Genetic Programming Symbolic Regression and then give its abbreviation.

Answer: Thank you for noticing. The first time GPSR appears in the abstract, the full name is now given, followed by the abbreviation in parentheses. This was done for all abbreviations in the abstract.

  2. The abstract is wordy and not informative. The structure of the abstract needs revision. Revise the abstract to provide (i) the significance of the study, (ii) the aim of the study, (iii) the research methodology, (iv) the major conclusion of the study.

Answer: Thank you for the suggestion. The abstract is rewritten to provide the significance of the study, the aim of the study, the research methodology and the major conclusions of the study. Citing from the revised manuscript version: “The study addresses the formidable challenge of calculating atomic coordinates for Carbon Nanotubes (CNTs) using density functional theory, a process that can endure for days. To tackle this issue, the research leverages the Genetic Programming Symbolic Regression (GPSR) method on a publicly available dataset. The primary aim is to assess if the resulting Mathematical Equations (MEs) from GPSR can accurately estimate calculated atomic coordinates obtained through Density Functional Theory (DFT). Given the numerous hyperparameters in GPSR, a Random Hyperparameter Values Search (RHVS) method is devised to pinpoint the optimal combination of hyperparameter values, maximizing estimation accuracy.

 

Two distinct approaches are considered: the first involves applying GPSR to estimate calculated coordinates ($u_c$, $v_c$, $w_c$) using all input variables (initial atomic coordinates $u$, $v$, $w$, and integers $n$, $m$ specifying the chiral vector). The second approach applies GPSR to estimate each calculated atomic coordinate using integers $n$ and $m$ alongside the corresponding initial atomic coordinates. This results in the creation of six different dataset variations. The GPSR algorithm undergoes training via a 5-fold cross-validation process. The evaluation metrics include the coefficient of determination ($R^2$), mean absolute error ($MAE$), root mean squared error ($RMSE$), and the depth and length of generated MEs.

 

The findings from this approach demonstrate that GPSR can effectively estimate CNT atomic coordinates with high accuracy, as indicated by an impressive $R^2 \approx 1.0$. This study not only contributes to the advancement of accurate estimation techniques for atomic coordinates but also introduces a systematic approach for optimizing hyperparameters in GPSR, showcasing its potential for broader applications in materials science and computational chemistry.”
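
For readers unfamiliar with the three accuracy metrics named in the abstract, the following minimal Python sketch computes them from their standard definitions. The function name `regression_metrics` and the sample values are illustrative assumptions; the manuscript's own evaluation code is not shown in this record.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute R^2, MAE, and RMSE from their standard definitions.

    Note: R^2 is undefined when all y_true values are identical,
    because the total sum of squares is then zero.
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
        "rmse": math.sqrt(ss_res / n),
    }


# Illustrative values only: every prediction is off by exactly 1.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
```

An $R^2$ close to 1.0 together with MAE and RMSE close to 0 corresponds to the high accuracy claimed in the abstract.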



  3. The originality of the paper needs to be stated clearly. It is of importance to have sufficient results to justify the novelty of a high-quality journal paper. The Introduction should make a compelling case for why the study is useful along with a clear statement of its novelty or originality by providing relevant information and providing answers to basic questions such as: What is already known in the open literature? What is missing (i.e., research gaps)? What needs to be done, why and how? Clear statements of the novelty of the work should also appear briefly in the Abstract and Conclusions sections.

Answer: We agree that the original version of the manuscript did not sufficiently emphasize the originality and novelty of the work, so we have strengthened this emphasis. It is now given in the introduction section, just before the defined hypotheses (the questions in bullet format). Citing from the revised manuscript version: “The novelty and originality of this paper lie in its departure from the prevalent use of complex neural networks in prior research for estimating calculated atomic coordinates. While these networks deliver exceptional estimation accuracy, their drawback is the inability to transform the model into a straightforward mathematical form. This limitation impedes ease of use and demands substantial computational resources, including storage and CPU power, hindering the prediction of new atomic coordinates based on input variables.

 

In contrast, this paper introduces a groundbreaking approach by implementing a Genetic Programming Symbolic Regression (GPSR) algorithm on a dataset sourced from \cite{aci2016artificial}. The key objective is to derive simple yet highly accurate Mathematical Expressions (MEs) capable of estimating atomic coordinates. This departure from complex neural networks is driven by the desire for a more interpretable and computationally efficient model.

 

To address the challenge of tuning numerous hyperparameters in GPSR, the paper introduces a Random Hyperparameter Value Search (RHVS) method. This innovative technique aims to pinpoint optimal GPSR hyperparameter values, facilitating the generation of MEs that achieve remarkable accuracy in estimating calculated atomic coordinates for Carbon Nanotubes (CNTs). The proposed methodology is further strengthened by the application of the 5-fold cross-validation (5FCV) method during the GPSR training process, ensuring robustness and reliability in the model's performance assessment.

 

In summary, this paper pioneers a novel approach by combining the interpretability of symbolic regression with the accuracy of GPSR, providing a solution to the limitations of previous research methods. The integration of RHVS for hyperparameter tuning adds a layer of sophistication, making this study a significant contribution to the field of computational chemistry and materials science.”

We think that in this form the novelty and the originality are sufficient for this research paper. 

 

  4. An explanation is needed with each one of the figures. What is the physical meaning of the graphs? What is the conclusion(s) from each figure? The lack of physical argumentation is a concern that should be rectified in a revised version.

Answer: 

The description following Figure 2 was extended in the revised version of the manuscript. Citing from the revised version of the manuscript: “Figure \ref{fig:CNT-Corr} provides insights into the correlation structure within the dataset. Notably, the highest correlation value to the output variable $u_c$ is observed with the input variable $u$ (correlation coefficient = 1.0). Furthermore, $u_c$ demonstrates a correlation of 0.5 with the variable $v$ and near-zero correlations with the remaining input variables. Similar correlation patterns are observed for the output variable $v_c$, where $v$ exhibits the highest correlation (correlation coefficient = 1.0), and a correlation of 0.5 is observed with $u$. Additionally, a correlation is noted between the initial atomic coordinates ($u$ and $v$) and the calculated atomic coordinates ($u_c$ and $w_c$). A perfect correlation exists between $w$ and $w_c$.”
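
The correlation values quoted above can be reproduced for any pair of dataset columns with a short Python sketch. It assumes the Pearson correlation coefficient, which the record does not state explicitly, and the function name is illustrative.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation coefficient between two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

A value of 1.0, as reported between $u$ and $u_c$, means the two variables are related by an exact increasing linear function, which is consistent with the high accuracy obtained from simple expressions.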

The description before Figure 3 was extended to explain why detecting dataset outliers is important. Citing the paragraph before Figure 3 from the revised version of the manuscript: “The final step involves scrutinizing the dataset for the presence of outliers, which are data points deviating significantly from the majority of the data \cite{vinutha2018detection}. Outliers, whether much higher or lower than typical values in the dataset, can exert a substantial impact on both statistical analyses and machine learning models. In machine learning, the presence of outliers can distort the learning process and compromise the model's generalization ability.

 

Machine learning algorithms often rely on statistical measures and assumptions about the distribution of the data. Outliers, being atypical and aberrant, can distort these assumptions, leading to biased models or inaccurate predictions. Furthermore, outliers can disproportionately influence the determination of model parameters, leading to suboptimal results.

 

Detecting and addressing outliers is crucial for enhancing the robustness, accuracy, and reliability of machine learning models. It aids in producing more resilient models that perform well on unseen data by mitigating the impact of extreme values that might otherwise skew the learning process. Additionally, outlier detection serves as a quality control mechanism, helping to identify and rectify issues such as errors in data collection, measurement inaccuracies, or the presence of rare and impactful events. Overall, in the context of machine learning, outlier detection is an indispensable step in ensuring the integrity and effectiveness of the modeling process.”
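One common way to flag outliers is the interquartile-range (IQR) rule that underlies box plots. The sketch below assumes that rule; the response does not say which detection method the manuscript uses, so it should be read as an example rather than as the authors' procedure. The function name `iqr_outliers` and the sample values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample: seven typical values and one extreme value.
sample = [0.48, 0.50, 0.51, 0.49, 0.52, 0.50, 0.47, 0.95]
flagged = iqr_outliers(sample)
```

With `k=1.5` only the extreme value is flagged. A larger `k` makes the rule more tolerant, and a smaller `k` makes it stricter.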

 

The description of Figure 4 was moved after the figure and extended to describe in detail the training and testing process used in this research. Citing the description of Figure 4 from the revised manuscript version: “As seen from Figure \ref{fig:Training} the dataset underwent an initial division into training and testing datasets, following a standard 70:30 ratio. This division strategy ensures a sufficiently large training set for model development, while reserving a distinct portion for assessing the model's performance on unseen data. The training dataset played a pivotal role in the training of the Genetic Programming Symbolic Regression (GPSR) algorithm, employing the 5-Fold Cross-Validation (5FCV) method.

The training procedure aligns with the methodology employed in \cite{andjelic2023development}, ensuring a consistent and validated approach. Post the 5FCV process, the resulting set of Mathematical Expressions (MEs) is subjected to evaluation, primarily focusing on estimating accuracy on the training dataset. The assessment criteria involve scrutinizing metrics such as the coefficient of determination ($R^2$), mean absolute error ($MAE$), and root mean squared error ($RMSE$). The desired benchmark for successful training is set ambitiously high, aiming for $R^2 > 0.99$, $MAE < 0.1$, and $RMSE < 0.02$.

Should the estimation accuracy not meet these stringent criteria, the process initiates anew with the selection of random hyperparameter values. This iterative approach aims to systematically explore hyperparameter space and fine-tune the model for optimal performance.

Conversely, if the obtained set of MEs satisfies the defined criteria, the model proceeds to the testing phase. The MEs are then evaluated on the reserved 30\% of the dataset designated for testing. The evaluation metrics—$R^2$, $MAE$, and $RMSE$—are computed to gauge the model's generalization performance on unseen data.

Should the evaluation metrics on the test dataset fail to meet the set criteria, indicating potential overfitting, the process reverts to the beginning, invoking the Random Hyperparameter Value Search (RHVS) method for refining the model. Conversely, if the evaluation metrics on the test dataset align with the predefined criteria, signifying robust generalization, the training and testing process is deemed complete. This meticulous procedure ensures the development of a GPSR model that not only excels in training but also demonstrates high performance on new and unseen data.”
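The training and testing loop described above can be sketched as follows. This is a schematic under stated assumptions, not the authors' implementation. The GPSR run is replaced by a hypothetical stand-in (`fit_stand_in`, a least-squares line with a single made-up hyperparameter `damping`), the data are a toy linear target, and the names `rhvs`, `passes`, and `max_trials` are introduced here for illustration. Only the 70:30 split, the five folds, and the acceptance thresholds ($R^2 > 0.99$, $MAE < 0.1$, $RMSE < 0.02$) come from the response.

```python
import math
import random

def r2(y, p):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, p)) / len(y))

def passes(y, p):
    # Acceptance thresholds quoted from the revised manuscript.
    return r2(y, p) > 0.99 and mae(y, p) < 0.1 and rmse(y, p) < 0.02

def fit_stand_in(x, y, damping):
    """Hypothetical stand-in for one GPSR run: a least-squares line whose
    slope is scaled by `damping`, so poor hyperparameter values fit poorly."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = damping * sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return lambda v: my + slope * (v - mx)

def rhvs(x, y, rng, max_trials=500, k=5):
    idx = list(range(len(x)))
    rng.shuffle(idx)
    cut = len(idx) * 7 // 10                 # 70:30 train/test split
    train, test = idx[:cut], idx[cut:]
    folds = [train[i::k] for i in range(k)]  # k disjoint validation folds
    for trial in range(1, max_trials + 1):
        damping = rng.uniform(0.0, 1.0)      # randomly drawn hyperparameter
        models = []
        for fold in folds:                   # one model per fold (5FCV)
            held_out = set(fold)
            rows = [i for i in train if i not in held_out]
            models.append(fit_stand_in([x[i] for i in rows],
                                       [y[i] for i in rows], damping))

        def all_pass(rows):
            ys = [y[i] for i in rows]
            return all(passes(ys, [m(x[i]) for i in rows]) for m in models)

        # Accept only if every model meets the criteria on train, then test.
        if all_pass(train) and all_pass(test):
            return trial, damping, models
    return None                              # search budget exhausted

x = [i / 99 for i in range(100)]
y = [0.98 * v + 0.01 for v in x]             # toy linear target
result = rhvs(x, y, random.Random(0))
```

The `max_trials` cap is an addition made here so that the sketch always terminates. The response describes the loop as restarting until the criteria are met and does not mention a trial budget.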

 

The detailed description of Figure 5 is provided after Figure 5 in the revised version of the manuscript. Citing from the revised version of the manuscript: “Analyzing the results depicted in Figure \ref{fig:estimationPerformance_R1} reveals intriguing insights into the estimation performance across different cases. Notably, the highest mean value of $R^2$ and the lowest values of $MAE$ and $RMSE$ are observed in the case of $nmuvw-u_c$. This signifies a superior level of accuracy and precision in the estimation of the calculated atomic coordinates, reflecting the efficacy of the GPSR algorithm in capturing the underlying patterns in the dataset.

 

In the second case, $nmuvw-v_c$, a slightly lower $R^2$ is observed, leading to marginally higher $MAE$ and $RMSE$ values compared to the $nmuvw-u_c$ case. However, it's crucial to note the presence of larger error bars ($\sigma$ values), indicating a higher variability in the estimation performance. This suggests that while, on average, the estimation performance is slightly lower than in the first case, there is greater variability in individual predictions.

 

The third case, $nmuvw-w_c$, exhibits the lowest $R^2$ and $MAE$ values, with $RMSE$ being the highest among the three cases. However, an interesting observation is the absence of $\sigma$ values in the case of $R^2$ and $RMSE$. This implies a more consistent and less variable performance in terms of coefficient of determination and root mean squared error compared to the $nmuvw-v_c$ case. Despite the lower mean values in accuracy metrics, this case demonstrates a more stable and predictable estimation performance.

 

The nuanced differences observed across these cases highlight the intricacies in estimating different atomic coordinates using the GPSR algorithm. The trade-off between mean performance and variability, as evidenced by the presence of error bars, underscores the need for a nuanced evaluation of the algorithm's effectiveness across distinct output variables. This thorough analysis enhances our understanding of the algorithm's strengths and limitations, providing valuable insights for refining and optimizing future implementations.”
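The mean values and error bars ($\sigma$) discussed above are summary statistics over several scores for the same case. The sketch below shows one way to compute them; it assumes the scores are aggregated per fold or per ME and that $\sigma$ is the population standard deviation, neither of which is stated in the response. The per-fold values are invented for illustration.

```python
import statistics

def summarize(scores):
    """Mean and population standard deviation of a list of scores."""
    return statistics.mean(scores), statistics.pstdev(scores)

# Made-up per-fold R^2 values for two hypothetical cases.
case_a = [0.998, 0.997, 0.998, 0.997, 0.998]   # stable across folds
case_b = [0.999, 0.990, 0.985, 0.998, 0.970]   # more variable across folds

mean_a, sigma_a = summarize(case_a)
mean_b, sigma_b = summarize(case_b)
```

Two cases can have similar means but very different $\sigma$ values, which is the distinction the quoted paragraphs draw between average accuracy and stability.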

The explanation of the results after Figure 6 was extended and the results are described in detail. Citing the description of Figure 6 from the revised manuscript version: “Examining the outcomes presented in Figure \ref{fig:evalmetricsR2} provides an in-depth understanding of the estimation performance across various scenarios. Notably, the highest mean $R^2$ value is achieved in the case of $nmu-u_c$, indicating an exceptional level of goodness of fit for this particular atomic coordinate estimation. Following closely is the case of $nmv-v_c$, which exhibits a slightly lower mean $R^2$ but still attains a commendable level of explanatory power. The case of $nmw-w_c$ follows suit, with the lowest mean $R^2$ among the three scenarios. This hierarchy of mean $R^2$ values sheds light on the algorithm's ability to capture and explain the variance in the dataset, emphasizing its proficiency in certain estimation scenarios.

 

In terms of mean $MAE$ values, the case of $nmw-w_c$ emerges as the most accurate, reflecting the smallest average absolute errors in atomic coordinate estimation. This suggests that, on average, the GPSR algorithm demonstrates superior accuracy in predicting the calculated atomic coordinates for $nmw-w_c$. It is noteworthy that the other cases, $nmu-u_c$ and $nmv-v_c$, while having larger mean $MAE$ values, still maintain reasonably low levels of absolute error.

 

However, the examination of standard deviation ($\sigma$) values unveils interesting nuances. In the cases of $nmv-v_c$ and $nmw-w_c$, despite their higher mean $MAE$ values, larger $\sigma$ values indicate greater variability in the estimation errors. This implies that, while these cases might exhibit higher mean errors on average, there is also a broader range of errors, suggesting a more varied performance across different instances. Contrastingly, in the case of $nmu-u_c$, despite having a larger mean $MAE$, the $\sigma$ value is comparatively smaller, suggesting a more consistent and predictable estimation performance.

 

Considering the mean $RMSE$ values, the case of $nmu-u_c$ stands out with the smallest average root mean squared errors. This signifies a superior overall precision in this particular atomic coordinate estimation scenario. However, it's essential to note that, similar to $MAE$, the $\sigma$ values for $RMSE$ are relatively larger in the cases of $nmu-u_c$ and $nmv-v_c$, indicating a broader range of errors.

 

This detailed exploration of mean and variability metrics across different atomic coordinate estimation scenarios underscores the nuanced performance of the GPSR algorithm. It provides valuable insights into the algorithm's strengths and weaknesses, aiding in the identification of optimal scenarios for its application and guiding future refinements for enhanced accuracy and stability.”

 

  • In the introduction, the authors did not provide a strong motivation for the paper and the obtained results. In addition, they should discuss the main contributions of their work in detail after the motivation part. Then they should summarize the main structure of their paper in brief at the end of the introduction.

Answer: The original version did not provide a strong motivation for this research. We have now added one, placed immediately after the definition of the hypotheses in the Introduction section. Citing from the revised version of the manuscript: “

This research is motivated by the limitations of using complex neural networks to estimate atomic coordinates, which lack interpretability and demand substantial computational resources. In contrast, we propose a novel approach utilizing the GPSR algorithm on a dataset from \cite{aci2016artificial}. The aim is to generate simple yet highly accurate MEs for estimating atomic coordinates, addressing the shortcomings of previous methods. To optimize GPSR hyperparameters, we introduce a RHVS method. This innovative technique identifies optimal hyperparameter values, enhancing the accuracy of MEs in estimating CNTs atomic coordinates. The study's novelty lies in combining the interpretability of symbolic regression with the accuracy of GPSR, offering a practical solution and contributing significantly to computational chemistry and materials science.”

The main structure of the paper is summarized at the end of the Introduction section in the original version of the manuscript. Citing from the original and revised versions of the manuscript: “The rest of the manuscript consists of the following sections i.e. Materials and Methods, Results, Discussion, and finally Conclusions. However, the best mathematical expressions are shown in the Appendix section. The Materials and Methods section contains a description of the research methodology, dataset statistical analysis with outlier detection, a description of GPSR with RHVS method, the training/testing process, and finally used computational resources. The Results section contains the results of a conducted investigation using which a discussion section is provided. Finally, in the Conclusions section, the conclusions are given based on given hypotheses, advantages, and disadvantages of the proposed method with possible directions for future work. The additional Appendix section provides information about modifications to mathematical functions used in GPSR and a description of how to download and use the obtained MEs in this research.”

  • The introduction section should be made more concise to show previous work in the field. At present, a lot of related research is stated in the introduction. However, no analysis is presented. The authors should ask themselves: what are the problems with the presented research? Why is the recent work needed? Hopefully this can improve the present work by building on past articles.

Answer: The answer to the previous questions is summarized in one paragraph placed after the hypotheses and the motivation. Citing from the revised manuscript version: “The presented research addresses problems associated with the limitations of previous methodologies, particularly the use of complex neural networks for estimating atomic coordinates. The drawbacks include high estimation accuracy but a lack of interpretability, impracticality in transforming models into simple mathematical forms, and significant computational resource requirements. The recent work becomes necessary to overcome these issues by introducing a shift from complex neural networks to a GPSR algorithm. This new approach aims to provide both high accuracy and simple, interpretable MEs, addressing the shortcomings of previous models. The incorporation of the RHVS method and the 5FCV process further enhances the accuracy and reliability of the proposed solution, making it a significant advancement in the field of computational chemistry and materials science.”

  • The literature review section needs a lot of improvement. Only 16 papers are cited and discussed. The most recent paper is from 2018. So update the literature review.

Answer: The literature review was updated with the literature you suggested, and some additional literature is now cited in the Introduction section. The reference list currently contains 30 papers, and more than 16 of the references are from the 2018-2023 period.



  • Results and discussion should be provided in an elaborate manner for each graphical and numerical illustration, and a comparison of the algorithm with respect to complexity should be given in the results and discussion section.

Answer: The Results and Discussion sections were rewritten, i.e. the description of the results was extended as much as possible. The entire discussion was also extended, and detailed comparisons between the results obtained in this research and the results from the literature [14] were made.



  • Provide proper references for the equations.

Answer: We do not know which equations you are referring to. If you are referring to the equations given in the manuscript for determining R^2, MAE, and RMSE, then there is one mistake that we noticed: the MAE equation was written twice in the manuscript, first in the GPSR description, since MAE is the fitness function, and again in the evaluation metrics. In the revised version of the manuscript, the equation for MAE was removed from the evaluation metrics subsection.
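For reference, the three metrics named in this answer are usually defined as below. These are the standard textbook forms, given here under the assumption that the manuscript uses them; the symbols $y_i$ (calculated coordinate), $\hat{y}_i$ (value estimated by the ME), $\bar{y}$ (mean of the $y_i$), and $N$ (number of samples) are notation chosen for this note and may differ from the manuscript's.

```latex
\begin{align}
R^2  &= 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}, \\
MAE  &= \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|, \\
RMSE &= \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}.
\end{align}
```

Because $MAE$ serves both as the GPSR fitness function and as an evaluation metric, stating it once and referring back to it, as the authors now do, avoids the duplication described above.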

Regarding the MEs obtained with the presented approach (RHVS + GPSR + 5FCV), a total of 30 MEs were obtained. Because of the large number of MEs and the length of individual MEs, they were placed in the GitHub repository; to use these MEs, follow the procedure given in the Appendix section of the manuscript. The manuscript describes the MEs in terms of their length and depth. Information on the number of input variables required to evaluate the MEs is provided at the end of the Results section. Citing from the original and revised versions of the manuscript: “Unfortunately, the analysis of required input variables in obtained MEs showed that all input variables used in dataset variation are required to calculate the corresponding output.”

We do hope that this answer is the proper answer to the question you have asked. 

  • The use of stochastic numerical computing approaches as a promising alternative should also be mentioned in the conclusion section. The following literature is relevant to the stochastic numerical computing approaches.
  • Convective flow dynamics with suspended carbon nanotubes in the presence of magnetic dipole: Intelligent solution predicted Bayesian regularization networks.
  • Intelligent Bayesian regularization‐based solution predictive procedure for hybrid nanoparticles of AA7072‐AA7075 oxide movement across a porous medium.
  • Integrated intelligent computing application for effectiveness of Au nanoparticles coated over MWCNTs with velocity slip in curved channel peristaltic flow.
  • Novel design of intelligent Bayesian networks to study the impact of magnetic field and Joule heating in hybrid nanomaterial flow with applications in medications for blood circulation.
  • Bayesian regularization knack-based intelligent networks for thermo-physical analysis of 3D MHD nanofluidic flow model over an exponential stretching surface.
  • Intelligent Bayesian regularization networks for bio-convective nanofluid flow model involving gyro-tactic organisms with viscous dissipation, stratification and heat immersion.
  • Parametric estimation scheme for aircraft fuel consumption using machine learning

Answer: The suggested literature is now cited at the end of the Conclusions section, in the discussion of future work. Citing the conclusion from the revised version of the manuscript: “ ”

  • Future application of the presented technique should be listed in the conclusion section.

Answer: The other reviewer asked us to expand the future work to address the drawbacks of the research methodology presented in this paper. Citing the future work from the revised manuscript version: “The future work will concentrate on implementing various evolutionary algorithms (such as Genetic Algorithm, Differential Evolution, Particle Swarm Optimization,...) to discover the optimal combination of hyperparameters for the GPSR model. The aim is to derive highly accurate MEs for estimating atomic coordinates of CNTs. The utilization of evolutionary algorithms is anticipated to expedite the identification of optimal hyperparameter combinations compared to the current method, which involves randomly searching hyperparameter values.

In future work, emphasis will be placed on determining whether highly accurate MEs can be achieved with smaller values for population size (Size\_pop) and maximum number of generations (max\_gen). This is crucial as these hyperparameters have been identified as contributors to prolonged execution times in the GPSR algorithm.

While the parsimony coefficient was not a primary focus in the current study, future research will explore the use of higher parsimony coefficient values. As already stated in this research, higher values of the parsimony coefficient can prevent an increase in symbolic expression length, i.e. the bloat phenomenon. So one of the objectives in future work is to generate smaller MEs in terms of length and depth while maintaining or even enhancing the accuracy achieved in this study.

Besides the GPSR, the future work regarding this dataset will be to employ various other AI algorithms such as Bayesian regularization networks \cite{awan2023convective, awan2023intelligent, raja2021integrated, awan2023novel, awan2023bayesian, awan2021intelligent, wahid2023parametric}, ensemble methods \cite{andjelic2022mean}, multi-layer perceptron, and XGBoost \cite{baressi2023use} among others and compare their performance with the GPSR estimation performance.”
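The parsimony pressure mentioned in the future-work passage above can be illustrated with a small sketch. It assumes the common formulation in which a length penalty, scaled by the parsimony coefficient, is added to the raw fitness (here $MAE$, where lower is better). The response does not give the exact formula used in the manuscript, and the two candidate expressions and their numbers are invented.

```python
def penalized_fitness(raw_mae, length, parsimony_coefficient):
    """Raw error plus a length penalty; lower is better."""
    return raw_mae + parsimony_coefficient * length

def best(candidates, coefficient):
    """Pick the candidate with the lowest penalized fitness."""
    return min(
        candidates,
        key=lambda c: penalized_fitness(c["mae"], c["length"], coefficient),
    )

# Two hypothetical candidates: a long, slightly more accurate expression
# and a short, slightly less accurate one.
long_me = {"mae": 0.010, "length": 400}
short_me = {"mae": 0.012, "length": 40}

winner_without_pressure = best([long_me, short_me], 0.0)
winner_with_pressure = best([long_me, short_me], 1e-4)
```

With a coefficient of zero the longer expression wins on raw error alone. With a coefficient of 1e-4 the penalties are 0.04 and 0.004, so the shorter expression wins, which is the mechanism by which a higher coefficient limits bloat.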

  • Authors should compare their results with already published literature.

Answer: The authors did compare the results of this research to the relevant published literature. To the best of our knowledge and after investigating published literature we have found that the paper referenced [14] is the only paper in which atomic coordinates were estimated using ML methods. One of the ideas was to investigate if GPSR could be used to obtain MEs that have similar accuracy as those reported in the literature [14]. However, the results in Table 8 are expanded and compared in detail. Citing from the revised manuscript version (detailed description after Table 8): “The outcomes presented in Table \ref{tab:FinalComparison} highlight a noteworthy parallel between the results of this paper and those documented in \cite{aci2016artificial}. In the mentioned study, various neural networks were explored, with the FITNET neural network exhibiting the highest estimation accuracy, as evidenced by superior $R^2$ values and lower $MAE$ and $RMSE$ values. It's essential to underscore that the reported results encompass all three target values, namely $u_c$, $v_c$, and $w_c$, across the four neural networks employed, with the lowest accuracy observed in the GRNN.

In specific scenarios where $nmu$, $nmv$, and $nmw$ were utilized as input variables in the GPSR algorithm to predict $u_c$, $v_c$, and $w_c$, the estimation accuracy aligns closely for $u_c$ and $v_c$, while showing a slightly lower performance for the $w_c$ target variable. A comparative analysis with the findings from \cite{aci2016artificial} reveals that our approach outperforms most Machine Learning (ML) algorithms for $u_c$ and $v_c$ targets, with FITNET being the exception. For the $w_c$ target, our approach outperforms only the GRNN algorithm.

In instances where all input variables, i.e., $nmuvw$, were employed in the GPSR algorithm to predict the aforementioned target variables, the highest estimation accuracy was observed for $u_c$, followed by $v_c$ and $w_c$, respectively. Notably, our results for the $u_c$ target closely align with the findings from \cite{aci2016artificial} with FITNET, surpassing other neural networks such as FFNN, CFNN, and GRNN. The estimation accuracy for $v_c$ and $w_c$ outperforms that of the GRNN algorithm.

A comprehensive comparison between both approaches underscores that the optimal accuracy in calculating $u_c$, $v_c$, and $w_c$ is achieved by employing Mathematical Expressions (MEs) that require $nmu$, $nmv$, and $nmuvw$, respectively. Intriguingly, accurate calculation of $w_c$ necessitates the inclusion of all input variables.

A nuanced examination of our results vis-à-vis those in \cite{aci2016artificial} reveals a striking similarity. The distinctive advantage of our approach lies in its practical implementation, eliminating the need to store a trained GPSR model. The obtained MEs are user-friendly, requiring minimal computational resources, in stark contrast to the trained neural networks in the referenced literature, which demand more extensive computational capabilities. This pragmatic aspect reinforces the utility and accessibility of our approach in real-world applications.”
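To make the point about not storing a trained model concrete, the sketch below shows how an ME can be used as an ordinary function. The formula in `example_me` is a placeholder invented here and is not one of the 30 published MEs. The protected division and square root are assumed conventions of the kind often used in genetic programming; the manuscript's Appendix describes its own modifications, which may differ.

```python
import math

def protected_div(a, b, eps=1e-3):
    """Division that returns 1.0 when the denominator is near zero
    (an assumed convention, not taken from the manuscript)."""
    return a / b if abs(b) > eps else 1.0

def protected_sqrt(a):
    """Square root of the absolute value, so negative inputs are allowed."""
    return math.sqrt(abs(a))

def example_me(n, m, u):
    # Placeholder formula, NOT one of the published MEs.
    return u + 0.01 * protected_div(protected_sqrt(u), n + m)

value = example_me(n=6, m=3, u=0.25)
```

Evaluating such an expression needs only basic arithmetic and the standard library, with no trained model file or machine learning framework to load.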

 

  • The authors should proofread the entire manuscript in order to eliminate various sentence and typographical errors.

Answer: The revised version of the manuscript has been proofread, and to the best of our knowledge the sentence and typographical errors have been eliminated.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The authors have incorporated all my review comments. So my recommendation is to accept the paper in its current form.
