4.1. Dataset
The dataset of 74 CatL inhibitors was collected from the literature [18] between 2005 and 2023, and the compounds are listed in Table 10, Table 11, Table 12 and Table 13. Information on the chemical structures and IC50 values of the 74 CatL inhibitors is summarized in the tables, with measurements obtained through consistent methodologies and uniform laboratory settings. The dataset was partitioned into training and testing subsets at a 4:1 ratio using a uniform random sampling technique. The internal training set consisted of 60 compounds, which were used for model training and internal cross-validation. The external test set included 14 compounds, which were used to evaluate the model’s prediction accuracy during external validation. To test the robustness of the model, 5-fold cross-validation (5-fold) [39] and leave-one-out cross-validation (LOO) were used.
To characterize the dataset’s activity profile, the distribution of measured lg(IC50) values was analyzed and visualized (Figure 13). A logarithmic transformation of the IC50 values was applied for normalization. The histogram shows that lg(IC50) values range from approximately −0.5 (leftmost bin) to 2.5 (rightmost bin), spanning about three orders of magnitude. This broad range reflects diverse inhibitory potency, which is critical for QSAR modeling because it enables structure–activity relationships to be captured across varying potency levels. Values are spread across the bins without excessive clustering, confirming sufficient activity diversity. The training set (60 compounds) and the test set (14 compounds) exhibit overlapping distributions, ensuring that the test set is representative for reliable validation.
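For illustration, the transformation and the 4:1 random partition described above can be reproduced with a short Python sketch such as the following; the file and column names are placeholders rather than the authors’ actual data layout.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("catl_inhibitors.csv")        # hypothetical file of 74 compounds
data["lgIC50"] = np.log10(data["IC50"])          # lg(IC50) transformation for normalization

X = data.drop(columns=["IC50", "lgIC50"])        # molecular descriptor columns
y = data["lgIC50"]

# 4:1 split: 60 training compounds, 14 external test compounds
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=14, random_state=42)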
4.2. Computation of Molecular Descriptors
Molecular descriptors are calculated using CODESSA 2.64 (https://revvitysignals.com accessed on 10 January 2025), which provides a wide range of 2D and 3D descriptors. The descriptors used in this study are 2D descriptors and include topology, composition, and electrostatic properties. These descriptors are used to develop QSAR models to predict the IC50 values of CatL inhibitors.
The steps to calculate the molecular descriptors of a compound are shown below.
The molecular structures of all compounds were first drawn in ChemDraw 8.0 software (https://revvitysignals.com accessed on 1 May 2025) and subsequently aromatized where applicable. The structure of each compound was then initially optimized with the MM+ molecular mechanics force field in HyperChem 4.0 software (http://hypercubeusa.com accessed on 17 May 2025) and refined more accurately by the semi-empirical PM3 or AM1 methods [40]. This three-step procedure yielded the conformation with the lowest potential energy and optimal structural stability, thereby contributing to more accurate molecular descriptor calculations. Finally, the files obtained from HyperChem were submitted to the MOPAC 6.9 software [41] (Stewart Computational Chemistry MOPAC Home Page accessed on 11 May 2025) to produce MNO files, and the MNO files were then used as input to the CODESSA 2.64 software to compute five classes of molecular descriptors (constitutional, topological, geometrical, electrostatic, and quantum chemical) from a pool of 604 descriptors [42].
4.3. Statistical Parameters
The coefficient of determination, represented by $R^2$, was utilized as a measure of the model’s goodness-of-fit [43], and the root mean square error ($\mathrm{RMSE}$) was adopted to measure the forecasting accuracy of different models for a particular dataset.
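For completeness, the standard definitions of these two statistics over $n$ compounds with experimental values $y_i$, predicted values $\hat{y}_i$, and mean experimental value $\bar{y}$ are

$R^2 = 1 - \dfrac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}, \qquad \mathrm{RMSE} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}.$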
Validating the QSAR model is essential to guarantee its reliability in predicting the biological activity of unknown samples. Model validation includes internal validation, which tests the reproducibility of the model, and external validation, which evaluates the model’s ability to generalize to an independent dataset and its potential for application to novel or external conditions. The internal prediction ability of the model was verified by 5-fold and LOO cross-validation. In addition, four external validation parameters, including the concordance correlation coefficient (CCC), were also adopted. Specifically, the CCC assesses reproducibility, a fundamental principle that supports the integrity of the scientific method [44]. The other three metrics assess the external prediction performance of the models [44,45,46,47]. Higher values of these metrics indicate better external prediction capability of the model.
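For reference, the concordance correlation coefficient between experimental values $y_i$ and predicted values $\hat{y}_i$ over the $n$ external test compounds is commonly computed as

$\mathrm{CCC} = \dfrac{2\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sum_{i=1}^{n}(y_i - \bar{y})^2 + \sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 + n(\bar{y} - \bar{\hat{y}})^2}.$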
y-Randomization was used to test for chance correlations during model construction. To ensure the randomness of the models, y-randomization was performed for 100 rounds, and the $R^2$ and $Q^2$ values of each round were recorded to calculate their averages. These two parameters were used to indicate whether there were any chance correlations in the models.
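A minimal Python sketch of such a y-randomization check is given below; it is illustrative only, a plain linear model stands in for the actual QSAR models, and X_train and y_train refer to the training data from the split sketch in Section 4.1.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
r2_rand, q2_rand = [], []
for _ in range(100):
    y_shuffled = rng.permutation(y_train)                     # break the X-y relationship
    model = LinearRegression().fit(X_train, y_shuffled)
    r2_rand.append(model.score(X_train, y_shuffled))          # R2 of the randomized model
    q2_rand.append(cross_val_score(LinearRegression(), X_train, y_shuffled,
                                   cv=5, scoring="r2").mean())  # cross-validated Q2
print("mean R2 (randomized):", np.mean(r2_rand))
print("mean Q2 (randomized):", np.mean(q2_rand))

Average statistics close to zero for the randomized models indicate that the original models are unlikely to arise from chance correlations.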
4.5. Feature Importance Calculation by XGBoost
An efficient method is required to optimize nonlinear models and reveal nonlinear interactions among descriptors in complicated datasets. XGBoost is useful for evaluating feature importance by assessing the contribution of each descriptor to model performance.
The coverage method and the split gain method are the two approaches commonly employed in XGBoost to determine feature importance. The split gain approach measures the improvement in prediction accuracy obtained when a decision tree node is split on a given feature. Features with higher split gain are considered more important because they capture more nuanced, nonlinear relationships between descriptors.
The split gain method was applied in this study due to its suitability for uncovering complex relationships within the data. In XGBoost, the importance of molecular descriptors is assessed based on their split gain during the construction of decision trees. Descriptors with higher split gain values, such as hydrophobicity and hydrogen bond counts, show strong associations with inhibitory activity and contribute significantly to predictive performance. The algorithm automatically identifies these key features, while regularization parameters such as lambda and gamma help control model complexity and reduce the risk of overfitting. This approach enhances predictive accuracy and offers valuable insights into the structure–activity relationships of CatL inhibitors.
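A minimal sketch of this gain-based ranking is shown below; the hyperparameter values are illustrative rather than those used in this work, and X_train and y_train are the training data from Section 4.1.

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=300,        # number of boosting rounds (illustrative)
    max_depth=4,
    learning_rate=0.05,
    reg_lambda=1.0,          # L2 regularization (lambda)
    gamma=0.1,               # minimum loss reduction required to make a split
)
model.fit(X_train, y_train)

# rank descriptors by average split gain
gain = model.get_booster().get_score(importance_type="gain")
for name, score in sorted(gain.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print(name, round(score, 3))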
4.6. Nonlinear Model by GEP
Given the inherent complexity and nonlinearity of the factors influencing the inhibitory effects of CatL-targeting compounds, gene expression programming (GEP) was employed to construct a nonlinear model for predicting their IC50 values.
GEP is an adaptive evolutionary algorithm inspired by the structure and function of biological genes. GEP was developed from genetic algorithms (GAs) and genetic programming (GP) [49,50,51,52]; it absorbs the advantages of both while overcoming their shortcomings, and its distinctive feature is that it can solve complex problems with simple coding.
GEP uses a unique chromosome-based encoding method that optimizes both the model structure and its parameters simultaneously. This characteristic enables GEP to effectively capture nonlinear relationships between molecular descriptors and biological activity while maintaining strong predictive performance. As a result, GEP is particularly suitable for modeling the complex relationships observed in the CatL inhibitor dataset.
4.8. Nonlinear Models by SVR
Support vector regression (SVR), an extension of support vector machine (SVM), seeks a hyperplane that optimally fits continuous data points [55]. SVM maps data to a high-dimensional space to find the best separating hyperplane for classification or regression [56]. Maximizing the margin between samples and the hyperplane enhances prediction accuracy.
SVR differs from SVM in that it focuses on minimizing the overall deviation of all data points from the regression hyperplane, rather than maximizing the margin between the hyperplane and the nearest data points. SVR introduces the $\varepsilon$-insensitive loss function to penalize errors beyond a specified threshold $\varepsilon$, which improves the robustness and generalization ability of the model. In addition, the penalty parameter $C$ serves as a regularization term that regulates the penalty applied to training errors, and the slack variables $\xi_i$ and $\xi_i^*$ are introduced to tolerate samples that fall outside the $\varepsilon$-insensitive band. The final optimization problem is given in Equation (3).
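Although Equation (3) itself is not reproduced here, the standard $\varepsilon$-SVR primal problem consistent with this description can be written as

$\min_{w, b, \xi, \xi^*} \ \dfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^*\right)$

subject to $y_i - w^{T}\phi(x_i) - b \le \varepsilon + \xi_i$, $\ w^{T}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^*$, and $\xi_i, \xi_i^* \ge 0$, where $\phi(\cdot)$ denotes the implicit feature mapping.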
To simplify the solution of SVR, Lagrange multipliers $\alpha_i$ and $\alpha_i^*$ are introduced. If the Karush–Kuhn–Tucker conditions are satisfied, the dual problem can be solved by quadratic programming procedures [56]. The final solution is given in Equation (4).
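In its standard form, the regression function referred to in Equation (4) is

$f(x) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^*\right)K(x_i, x) + b.$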
Here, $K(x_i, x)$ is the kernel function, which is introduced in the next section.
4.8.1. SVR Kernel Function
Standard SVR struggles with nonlinear data due to its reliance on linear decision boundaries. To address this, complex data are mapped to a higher-dimensional space at the cost of increased complexity and overfitting risk. Kernel functions solve this by efficiently computing inner products in the transformed space, avoiding explicit high-dimensional mappings and simplifying the handling of complex data. Basic kernel functions include the radial basis function (RBF) [57], the linear kernel function, the polynomial kernel function, and the sigmoid kernel function. The first three kernel functions have been extensively used in establishing SVR models [58].
However, considering the disadvantages of single kernel functions and the complementarity between popular kernel functions in handling complex datasets, it is necessary and reasonable to construct mixed kernel functions to meet these requirements. In fact, any function that satisfies Mercer’s theorem can be used as a kernel function, which provides many ways to construct new kernel functions. If $K_1$ and $K_2$ are kernel functions, then:
- 1. $K_1 + K_2$ is a kernel function.
- 2. $aK_1$ (for any $a > 0$) is a kernel function.
More importantly, integrating several basic kernel functions into mixed kernel functions means they are able to complement each other, thus overcoming their respective shortcomings.
Mercer’s theorem states that if $K(x, y)$ satisfies Mercer’s condition shown in Equation (5), then $K(x, y)$ is a positive semidefinite function [59].
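Mercer’s condition is conventionally written as

$\iint K(x, y)\, g(x)\, g(y)\, dx\, dy \ge 0 \quad \text{for every function } g \text{ with } \int g(x)^2\, dx < \infty.$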
Then $K(x, y)$ can be expressed as a series expansion in terms of its eigenvalues and eigenfunctions, which implies that $K(x, y)$ can be represented as an inner product in some high-dimensional feature space.
4.8.2. Nonlinear Model
RBF is a universal kernel function that can be employed without prior knowledge of the data [60,61,62,63]. RBF stands out for its good learning ability and efficiency among all basic kernel functions [64]. However, models based on the RBF kernel alone can exhibit limited generalization performance. The RBF kernel can be expressed as Equation (6).
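A common form of the RBF kernel, written here with the kernel radius $\sigma$, is

$K(x_i, x_j) = \exp\!\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right).$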
SVR that adopts the RBF kernel (Equation (6)) as its kernel function is called RBF-SVR.
The linear kernel is the most basic of all kernel functions, as it simply calculates the inner product between two feature vectors. It offers strong learning ability for linear relationships and effectively improves generalization when the dataset is linearly separable and low-dimensional. However, since the linear kernel is limited to computing a simple inner product between feature vectors, its modeling capacity is inherently constrained. Consequently, the linear kernel typically achieves lower accuracy than more flexible kernel functions when handling nonlinear problems [65], although it efficiently identifies key feature vectors with minimal computational cost. The linear kernel is defined in Equation (7).
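The linear kernel referred to in Equation (7) is simply the inner product of the two feature vectors:

$K(x_i, x_j) = x_i^{T}x_j.$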
The polynomial kernel function is one of the most widely used nonlinear kernel functions, as it extends the linear kernel by introducing polynomial combinations of feature interactions. Unlike the linear kernel, which only computes a simple dot product, the polynomial kernel maps data into a higher-dimensional space, enabling the learning of complex nonlinear relationships. Moreover, the polynomial kernel function possesses the advantage of adjustable flexibility, governed by its degree parameter, which enables it to model a broad spectrum of decision boundaries spanning approximately linear to highly curved surfaces. This makes it particularly effective when data exhibits polynomial patterns or multiplicative feature dependencies. The polynomial kernel function demonstrates strong generalization performance due to its global approximation properties. However, this comes at the expense of weaker local learning capabilities, as the polynomial kernel tends to smooth over fine-grained variations in the feature space. The polynomial kernel can be expressed as Equation (8).
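A common parameterization of the polynomial kernel, with degree $d$ and a constant offset $c$ (the offset term is an assumption of this sketch), is

$K(x_i, x_j) = \left(x_i^{T}x_j + c\right)^{d}.$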
To overcome the limitations of individual kernel functions while preserving their respective advantages, mixed kernel functions combining different numbers of individual kernels were constructed.
The linear kernel function demonstrates strong learning capability and excellent generalization performance for linearly separable and low-dimensional datasets. However, its effectiveness is fundamentally limited to linear relationships. In comparison, the RBF kernel function shows superior learning ability in handling nonlinear and high-dimensional datasets, with relatively weaker generalization performance. To leverage their complementary strengths, a mixed kernel function combining both the RBF and linear kernels was proposed. The mixed kernel function that linearly combines the RBF kernel function and linear kernel function allows the linear kernel to enhance the overall generalization ability while the RBF kernel improves the learning capability for complex patterns, thereby achieving balanced performance across different data characteristics. The form of this new double-kernel function can be expressed as Equation (9).
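Consistent with the constraint on α stated below, Equation (9) can be sketched as a convex combination of the two kernels:

$K_{\mathrm{LMIX2}}(x_i, x_j) = \alpha K_{\mathrm{RBF}}(x_i, x_j) + (1 - \alpha) K_{\mathrm{Lin}}(x_i, x_j).$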
SVR that adopts the mixed kernel in Equation (9) as its kernel function is called LMIX2-SVR.
The variable α takes values within the interval [0, 1].
To overcome the limitations of the linear kernel function, which exhibits strong generalization performance only for low-dimensional and linearly separable datasets but struggles with high-dimensional and nonlinear datasets, a mixed kernel function is constructed by incorporating the RBF kernel function, linear kernel function, and polynomial kernel function.
The polynomial kernel further enhances nonlinear modeling by incorporating feature interactions through its adjustable degree parameter. Importantly, the polynomial kernel improves generalization in high-dimensional and nonlinear datasets, effectively compensating for the deficiency of the linear kernel in these scenarios.
The proposed mixed kernel function combines three complementary kernels: the linear kernel for low-dimensional, linearly separable data; the RBF kernel for nonlinear pattern recognition; and the polynomial kernel for high-dimensional feature representation. This integration aims to balance learning capacity and generalization ability across diverse data characteristics.
The form of the new triple-kernel function can be expressed as Equation (10).
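Assuming, as a sketch, that α and β weight the RBF and polynomial kernels respectively and the remaining weight is assigned to the linear kernel, Equation (10) takes the form

$K_{\mathrm{LMIX3}}(x_i, x_j) = \alpha K_{\mathrm{RBF}}(x_i, x_j) + \beta K_{\mathrm{Poly}}(x_i, x_j) + (1 - \alpha - \beta) K_{\mathrm{Lin}}(x_i, x_j).$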
SVR that adopts the mixed kernel in Equation (10) as its kernel function is called LMIX3-SVR.
The coefficients α and β are positive and satisfy the constraint α + β ≤ 1.
4.8.3. SVR Model Optimized by Particle Swarm Optimization (PSO)
The performance and generalization capacity of SVR models are highly sensitive to parameter settings. When constructing a model using the triple-kernel SVR, six parameters require optimization: the penalty factor $C$, the insensitive parameter $\varepsilon$, the kernel radius $\sigma$ of the RBF kernel function, the order $d$ of the polynomial kernel function, the coefficient $\alpha$ of the RBF kernel function, and the coefficient $\beta$ of the polynomial kernel function. The search range of each parameter was specified before optimization.
The optimization process becomes more difficult as the number of parameters of SVR increases. To address the limitations of conventional parameter tuning techniques such as grid and random search, particle swarm optimization (PSO) was employed to optimize model parameters during the construction of the three SVR-based models.
Particle swarm optimization (PSO), introduced in 1995, draws inspiration from the social behavior of bird flocks searching for optimal routes through shared information. The algorithm initializes particle positions and velocities with random values in a high-dimensional space and iteratively refines them through both individual learning and collective swarm interactions. In PSO, particles exchange only the best-known positions during iterations. Therefore, PSO has the advantages of fast convergence, few parameters, and a simple, easy-to-implement algorithm.
PSO uses a velocity–position model, in which the position and velocity of particle $i$ in the D-dimensional solution space can be expressed as Equations (11) and (12). The optimal position found so far by particle $i$ is denoted as $p_{i}$, and the global best position is denoted as $p_{g}$. In each iteration, particles track $p_{i}$, $p_{g}$, and their previous states to adjust the position and velocity at the current moment. The iterative Equations (13) and (14) are as follows.
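In standard PSO notation, the position and velocity vectors (Equations (11) and (12)) and the update rules (Equations (13) and (14)) are usually written as

$X_i = (x_{i1}, x_{i2}, \ldots, x_{iD}), \qquad V_i = (v_{i1}, v_{i2}, \ldots, v_{iD}),$

$v_{id}^{t+1} = \omega v_{id}^{t} + c_1 r_1 \left(p_{id} - x_{id}^{t}\right) + c_2 r_2 \left(p_{gd} - x_{id}^{t}\right), \qquad x_{id}^{t+1} = x_{id}^{t} + v_{id}^{t+1},$

where $p_{id}$ and $p_{gd}$ are the $d$-th components of $p_{i}$ and $p_{g}$.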
$v_{id}^{t}$, $x_{id}^{t}$, $v_{id}^{t+1}$, and $x_{id}^{t+1}$ are the velocity and position of particle $i$ at the current and the next time step. $r_1$ and $r_2$ are random numbers in the range [0, 1]. $c_1$ and $c_2$ are learning factors which are usually set to 2. $\omega$ is a weight factor whose value automatically decreases as the algorithm iterates to speed up convergence [66]; it is described by Equation (15).
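A linearly decreasing inertia weight of the following common form matches this description:

$\omega = \omega_{\max} - \left(\omega_{\max} - \omega_{\min}\right)\dfrac{t}{t_{\max}}.$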
$t$ represents the current iteration count and $t_{\max}$ denotes the predefined upper limit on the number of iterations. The parameters $\omega_{\max}$ and $\omega_{\min}$ correspond to the upper and lower bounds of the inertia weight, respectively [67].
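The following minimal Python sketch illustrates the overall PSO–SVR coupling, simplified to a plain RBF-kernel SVR with only three parameters ($C$, $\varepsilon$, $\sigma$); the bounds, swarm size, and iteration count are illustrative and do not reproduce the six-parameter triple-kernel optimization used in this work. X_train and y_train are the training data from Section 4.1.

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def fitness(params):
    C, eps, sigma = params
    model = SVR(kernel="rbf", C=C, epsilon=eps, gamma=1.0 / (2.0 * sigma ** 2))
    # 5-fold cross-validated RMSE (lower is better)
    return -cross_val_score(model, X_train, y_train, cv=5,
                            scoring="neg_root_mean_squared_error").mean()

lo = np.array([0.1, 0.001, 0.1])       # illustrative lower bounds for C, epsilon, sigma
hi = np.array([100.0, 1.0, 10.0])      # illustrative upper bounds
n_particles, n_iter = 20, 50
c1 = c2 = 2.0                          # learning factors
w_max, w_min = 0.9, 0.4                # inertia weight bounds (illustrative)

rng = np.random.default_rng(1)
x = rng.uniform(lo, hi, size=(n_particles, 3))      # particle positions
v = np.zeros_like(x)                                # particle velocities
pbest = x.copy()
pbest_fit = np.array([fitness(p) for p in x])
gbest = pbest[pbest_fit.argmin()].copy()

for t in range(n_iter):
    w = w_max - (w_max - w_min) * t / n_iter        # linearly decreasing inertia weight
    r1 = rng.random((n_particles, 1))
    r2 = rng.random((n_particles, 1))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = np.clip(x + v, lo, hi)                      # keep particles inside the search range
    fit = np.array([fitness(p) for p in x])
    improved = fit < pbest_fit
    pbest[improved], pbest_fit[improved] = x[improved], fit[improved]
    gbest = pbest[pbest_fit.argmin()].copy()

print("best (C, epsilon, sigma):", gbest, "CV RMSE:", pbest_fit.min())

In the actual LMIX3-SVR optimization, the fitness function would instead assemble the triple kernel of Equation (10) and search over all six parameters.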
4.10. Property Prediction and Molecular Docking
Based on carefully chosen molecular descriptors that correlate with compound activity and on the ensuing data analysis, QSAR models play a crucial role in the development of new pharmaceuticals. In drug discovery, achieving potent binding affinity to the target is critical for identifying viable drug candidates. Furthermore, key drug-like properties, such as pharmacokinetics and toxicity profiles, are vital factors in determining whether a compound progresses to clinical development [70].
Molecular docking is a key tool in structural molecular biology and computer-aided drug design [71]. The objective of ligand–protein docking is to predict the primary binding mode between a ligand and a protein of known three-dimensional structure. In this study, Sybyl-X 2.1 software was used to explore the potential interactions between the newly designed CatL inhibitors and the target protein at the binding site. The target protein, cathepsin L (PDB code: 7w33), was retrieved from the Protein Data Bank. The binding site was identified based on the known binding position of the ligand. The GRID for docking was marked according to the ligand’s original binding site, using standard parameters in Sybyl-X. The receptor was kept rigid during docking, while the ligand was allowed to flex within the binding site. To validate the docking protocol, redocking was performed to ensure the ligand could dock successfully to its original position in the target’s binding site. The docked conformations were then evaluated based on their binding affinities and interactions with key amino acid residues.
The docking procedure involved several steps: first, the input chemical structures were optimized using Sybyl-X. The Tripos force field was then applied to minimize the molecules and assign Gasteiger–Hückel charges until convergence was reached (0.05 kcal/mol/Å). Next, the protein structure was processed in Sybyl-X by removing irrelevant ligands and solvent molecules, ensuring proper alignment for docking. Finally, PyMOL software (http://www.pymol.org/pymol accessed on 15 June 2025) was used to visualize the docking results.