## 2.1.3. Dataset Division and Modeling Tools

The entire descriptor pool was filtered with a variance cutoff of 0.0001 and an inter-correlation coefficient cutoff of 0.99 to remove near-constant and highly correlated descriptors and thereby reduce noise among the descriptors. The dataset was then randomly divided in an approximately 3:1 ratio into a training set of 44 FDs and a test set of 15 FDs. The training set was used to develop a PLS-based model with the Partial Least Squares version 1.0 tool [25].
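The pretreatment and division steps can be illustrated with a minimal sketch. It is not the tool cited above, and it rests on several assumptions not stated in the text: the variance is the population variance, the correlation cutoff is applied to the absolute Pearson coefficient, the first descriptor of a correlated pair is the one retained, and the random split is driven by a fixed seed. The function names `filter_descriptors` and `split_indices` are hypothetical.

```python
import math
import random


def variance(col):
    """Population variance of one descriptor column (assumed definition)."""
    m = sum(col) / len(col)
    return sum((v - m) ** 2 for v in col) / len(col)


def pearson(a, b):
    """Pearson correlation coefficient between two non-constant columns."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)


def filter_descriptors(columns, var_cut=0.0001, corr_cut=0.99):
    """Return the names of descriptors that survive both cutoffs.

    columns: dict mapping descriptor name -> list of values (one per compound).
    Assumption: of two highly correlated descriptors, the one seen first is kept.
    """
    kept = []
    for name, col in columns.items():
        if variance(col) < var_cut:
            continue  # near-constant descriptor
        if any(abs(pearson(col, columns[k])) > corr_cut for k in kept):
            continue  # redundant with a descriptor already retained
        kept.append(name)
    return kept


def split_indices(n, n_test, seed=0):
    """Randomly split compound indices 0..n-1 into training and test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return sorted(idx[n_test:]), sorted(idx[:n_test])


# Usage with the set sizes reported in the text (44 training, 15 test FDs).
train_idx, test_idx = split_indices(59, 15)
```

With 59 compounds, an exact 3:1 division is not possible, so the 44/15 split is the nearest integer division to that ratio.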

## 2.1.4. Model Validation and Designing Criteria

The quality of a QSPR model and its ability to predict new compounds are judged largely by statistical metrics. Internal metrics such as R<sup>2</sup> (goodness of fit) and the leave-one-out cross-validated Q<sup>2</sup> LOO are important parameters, while the external validation metric R<sup>2</sup> pred, or Q<sup>2</sup> ext(F1), signifies predictability. In addition to these classical parameters, we employed more stringent metrics to check the quality of the developed model: the rm<sup>2</sup> metrics [26], Q<sup>2</sup> ext(F2) [27], and the criteria of Golbraikh and Tropsha [28]. To comply with principle 3 of the Organization for Economic Co-operation and Development (OECD), we assessed the applicability domain of the model using the Euclidean distance approach [29]. To check the robustness of the model, Y-randomization was performed to generate 100 random models [30]. For a robust model, the average R<sup>2</sup> and Q<sup>2</sup> LOO values of these 100 random models should fall below the stipulated threshold of 0.5.
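The classical metrics and the Y-randomization test can be sketched as follows. This is an illustration under stated assumptions, not the software used in this work: Q<sup>2</sup> ext(F1) is taken relative to the training-set mean and Q<sup>2</sup> ext(F2) relative to the test-set mean, which are their usual definitions, and Q<sup>2</sup> LOO is computed from leave-one-out predictions relative to the mean of the full training set. The helper `ols_1d` is a hypothetical one-descriptor least-squares stand-in for the PLS model, used only so the example is self-contained. The rm<sup>2</sup> metrics, the Golbraikh and Tropsha criteria, and the applicability domain test are not covered.

```python
import random


def mean(values):
    return sum(values) / len(values)


def r2(y_obs, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken about the mean of y_obs."""
    m = mean(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - m) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot


def q2_f1(y_test, y_pred, y_train):
    """External metric referenced to the training-set mean."""
    m = mean(y_train)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_test, y_pred))
    return 1 - ss_res / sum((o - m) ** 2 for o in y_test)


def q2_f2(y_test, y_pred):
    """External metric referenced to the test-set mean."""
    return r2(y_test, y_pred)


def q2_loo(fit_predict, X, y):
    """Leave-one-out cross-validation: refit without compound i, predict it."""
    preds = []
    for i in range(len(y)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        preds.append(fit_predict(X_tr, y_tr, [X[i]])[0])
    return r2(y, preds)


def y_randomization(fit_predict, X, y, n_models=100, seed=0):
    """Average R2 and Q2 LOO over models fitted to shuffled responses."""
    rng = random.Random(seed)
    r2s, q2s = [], []
    for _ in range(n_models):
        y_perm = y[:]
        rng.shuffle(y_perm)
        r2s.append(r2(y_perm, fit_predict(X, y_perm, X)))
        q2s.append(q2_loo(fit_predict, X, y_perm))
    return mean(r2s), mean(q2s)


def ols_1d(X_train, y_train, X_new):
    """Hypothetical stand-in model (NOT PLS): least squares on one descriptor."""
    xs = [row[0] for row in X_train]
    mx, my = mean(xs), mean(y_train)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, y_train)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return [my + slope * (row[0] - mx) for row in X_new]


# Usage on synthetic data: a robust model should pass the check below.
X = [[float(i)] for i in range(10)]
y = [2.0 * i + (0.1 if i % 2 else -0.1) for i in range(10)]
avg_r2, avg_q2 = y_randomization(ols_1d, X, y)
passes = avg_r2 < 0.5 and avg_q2 < 0.5
```

Any model can be substituted for `ols_1d` as long as it follows the same `fit_predict(X_train, y_train, X_new)` calling convention, which is itself an assumption of this sketch.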
