Article

Identification of Investment-Ready SMEs: A Machine Learning Framework to Enhance Equity Access and Economic Growth

by Periklis Gogas *, Theophilos Papadimitriou, Panagiotis Goumenidis, Andreas Kontos and Nikolaos Giannakis
Department of Economics, Democritus University of Thrace, 69100 Komotini, Greece
* Author to whom correspondence should be addressed.
Forecasting 2025, 7(3), 51; https://doi.org/10.3390/forecast7030051
Submission received: 3 July 2025 / Revised: 3 September 2025 / Accepted: 9 September 2025 / Published: 16 September 2025
(This article belongs to the Section Forecasting in Economics and Management)

Abstract

Small and medium-sized enterprises (SMEs) are critical contributors to economic growth, innovation, and employment. However, they often struggle to secure external financing. This financing gap mainly arises from perceived risks and information asymmetries that create barriers between SMEs and potential investors. To address this issue, our study proposes a machine learning (ML) framework for predicting the investment readiness (IR) of SMEs. All the models in this study are trained on data provided by the European Central Bank’s Survey on Access to Finance of Enterprises (SAFE). We train, evaluate, and compare the predictive performance of nine (9) machine learning algorithms and various ensemble methods. The results provide evidence of the ability of ML algorithms to identify investment-ready SMEs in a heavily imbalanced and noisy dataset. In particular, the Gradient Boosting algorithm achieves a balanced accuracy of 75.4% and the highest ROC AUC score at 0.815. Employing an economically relevant cost function further enhances these results. The approach can offer specific inference to policymakers seeking to design targeted interventions and can provide investors with data-driven methods for identifying promising SMEs.

1. Introduction and Literature Review

Small and medium-sized enterprises are defined as enterprises with fewer than 250 employees, an annual turnover of less than 50 million EUR, and/or an annual balance sheet total of less than 43 million EUR (Commission of the European Communities, 2003/361/EC). They play a crucial role in economic growth, employment, and innovation. They account for approximately 90% of all businesses globally, providing over half of total employment and contributing significantly to GDP across various economies [1]. Despite their importance, SMEs frequently struggle to secure sufficient external financing, limiting their potential for growth and innovation. This issue is particularly significant in emerging markets, where about 40% of formal SMEs report unmet financing needs [1]. The financing gap faced by SMEs is often due to information asymmetries and the perception of higher risk, which cause lenders and investors to be cautious when considering investments in smaller firms. As traditional financial institutions generally rely on established credit histories and collateral requirements that many SMEs cannot meet, these enterprises often resort to internal funding or informal finance sources [2,3]. This financing barrier highlights the necessity of improved mechanisms to identify investment-ready SMEs and connect them with suitable investors.
Investment readiness—the ability of an SME to attract and secure external funding—has emerged as an important concept to bridge the financing gap. A firm that is “investment-ready” typically demonstrates a sound and viable business model, credible financial reporting, clear growth potential, and an effective management team capable of communicating these strengths to potential investors [3]. Yet, previous research consistently finds a mismatch between SMEs’ self-assessments of readiness and investors’ evaluations, resulting in many firms being overlooked despite having solid business foundations [4].
Traditional approaches for evaluating SME investment readiness, such as financial ratios, scoring systems, and qualitative assessments, have significant limitations. These methods often overlook complex, non-linear relationships within the data and rely heavily on past financial performance, potentially disregarding other vital aspects like managerial capability or innovation [5,6]. In response, machine learning (ML) techniques have begun to gain prominence due to their ability to analyze large, multidimensional datasets and identify complex, predictive patterns that traditional econometric methods might miss [6].
The key benefit of using machine learning for this problem, as opposed to traditional econometric methods, is that ML methods are well suited to the specific task of prediction in a data-rich and complex environment [5,7]. Traditional models, such as the Logistic Regression model used here as a benchmark, are often designed for statistical analysis and data interpretation [8]. These models assume linearity in the data-generating process [9,10]. Machine learning techniques have been developed for the purpose of improving the accuracy of predictive models outside the training dataset [6,11], and they are highly effective at automatically detecting complex, non-linear patterns and interactions within large and complex datasets without the need for prior knowledge [12]. Considering the large amount of data in the SAFE survey, the potential for noise in the responses, and the complex interactions of the variables that determine investment readiness, an ML approach provides a powerful, data-driven solution for creating a robust predictive screening tool [13,14]. Given these challenges and opportunities, this paper applies advanced ML methods to assess the investment readiness of SMEs, using the European Central Bank’s (ECB) Survey on Access to Finance of Enterprises (SAFE) dataset. This large-scale, cross-national dataset offers extensive qualitative and quantitative information on SMEs’ financial conditions, market positions, innovation activities, and management capabilities—factors critical for assessing investment readiness.

1.1. Investment Readiness and SME Financing

Investment readiness is a multidimensional concept reflecting an enterprise’s capacity to understand and satisfy investor expectations as captured through business planning, financial transparency, growth prospects, and managerial competence. Mason and Harrison (2001) [2] emphasized that simply increasing the availability of venture capital without ensuring SMEs are investment-ready is insufficient to address funding gaps. Many entrepreneurs are often perceived by investors as “not ready” due to weak financial documentation, unrealistic growth expectations, or an aversion to equity dilution. Douglas and Shepherd (2002) [3], similarly, found discrepancies in perceptions of investment readiness between entrepreneurs and investors, highlighting the importance of clear communication and alignment of expectations to improve SMEs’ chances of securing financing.
Investment-readiness programs have been developed globally to bridge these gaps by providing targeted training and support to SMEs, improving their business plans, and enhancing their financial transparency and overall investability [15]. However, despite their importance, such programs typically yield only modest improvements and may not fully resolve systemic issues such as informational asymmetries and entrenched investor biases [16].
Recent studies have shown that machine learning (ML) methods significantly outperform traditional econometric models in forecasting complex financial market behavior. ML models, unlike traditional approaches, can handle vast and multidimensional datasets, uncovering and capturing complex and intricate relationships and interactions between variables [17,18]. For instance, Dumitrescu et al. (2021) [6] demonstrated how hybrid ML–econometric models significantly improve credit scoring by combining predictive accuracy with interpretability.
Despite the promising potential of ML approaches, their application to the specific problem of predicting SME investment readiness remains limited. Most studies focus on broader financial forecasting contexts or specific applications such as credit risk or stock market prediction, leaving a notable research gap regarding the application of ML to predicting SME investment readiness. Alexakis et al. (2025) [19] use machine learning and the SAFE questionnaire, as in this study, to forecast investment-ready SMEs. They use (a) a slightly different definition of investment readiness that does not emphasize “openness” as a crucial factor, (b) a larger number of initial observations but only 23 independent variables, as opposed to the 59 we use in this study, and (c) a more recent dataset that spans eight years instead of six. They test three hypotheses on the determinants of investment readiness (entrepreneurial ecosystem, financial risk/cost, firm structure) and focus on country-level analysis, examining cultural variation in their predictors across 19 EU countries. In this study, we use a richer set of relevant independent variables and a different investment-readiness definition, and we focus on models that specifically address the problem of class imbalance, in order to create a universal EU model able to detect investment readiness irrespective of the individual country.
Additionally, while investment-readiness programs have been implemented globally to support SMEs, their effectiveness in addressing structural financing gaps remains minimal, suggesting the need for more sophisticated, data-driven assessment tools [15,16].
From a venture capital perspective, investment-readiness screening is essential due to the high-risk nature of early-stage financing, yet enhancing screening efficiency to identify promising SMEs remains a challenge. Venture capitalists often reject many SMEs at early screening stages due to insufficient preparedness in terms of management capabilities, realistic financial projections, and business viability [20,21]. While human judgment remains essential, integrating machine learning, which is independent of related human biases and misjudgments, into the venture screening process could significantly streamline identification efforts, reduce biases, enhance deal-flow quality, and expand the search for investment-ready SMEs beyond personal networks; this potential has not yet been thoroughly examined in academic research [20].
Our research directly addresses the challenge of identifying investment-ready SMEs using machine learning by empirically evaluating multiple advanced ML algorithms, i.e., Gradient Boosting, Logistic Regression, Random Forest, and ensemble methods, on a comprehensive and robust dataset provided by the European Central Bank’s Survey on Access to Finance of Enterprises (SAFE). A basic question guiding our work is whether these advanced ML models can outperform a conventional approach in correctly identifying investment-ready firms. Given the highly imbalanced nature of the SAFE survey data, by explicitly linking managerial competencies, innovation, and openness to equity financing to SME investment readiness through ML techniques, this study contributes both methodologically and substantively to the entrepreneurial finance literature. It not only advances predictive accuracy but also offers actionable insights for SMEs, investors, and policymakers, ultimately aiming to enhance resource allocation efficiency and support broader economic growth and innovation. Additionally, we include a cost-sensitive evaluation in our model. In practical terms, missing an investment-ready SME (a false negative) is often more costly in lost opportunities for growth and investment than incorrectly flagging a firm as ready when it is not (a false positive). To ensure a balanced and accurate representation of the situation, we assigned a significantly higher cost to false negatives than to false positives when evaluating the performance of the model. We investigate how this asymmetric cost affects the selection of the “optimal” model and the overall misclassification cost. We found that by focusing on reducing false negatives, we can identify models that not only perform well on standard metrics but also align with the economic goal of maximizing successful funding outcomes.

1.2. Research Question and Hypothesis

According to the above, in this empirical research work, with the data derived from the SAFE questionnaire, the main research question is the following:
  • RQ1. Can machine learning models enhance the ability to correctly identify investment-ready SMEs given the high imbalance in questionnaire data, the significant noise created by the questionnaire respondents, and the complexity and volume of such data in the EU’s SAFE questionnaire?
Accordingly, the following research hypothesis is tested in this empirical work:
  • H1. Machine learning models demonstrate a greater balanced accuracy compared to traditional classifiers in identifying investment-ready SMEs within highly imbalanced datasets.

2. Data and Research Methodology

2.1. Data Collection and Pre-Processing

The dataset was collected from the European Central Bank Data Portal and contains anonymous qualitative responses from micro, small, medium, and large companies across Europe. The dataset provides insights into the financing conditions faced by these companies.
The dataset was thoroughly cleaned by removing duplicates, non-SME entries, sparse features, and variables closely tied to the target, to ensure data relevance and prevent leakage and contamination. After cleaning, the final dataset contained 10,937 SME entries (rows), each described by 51 variables (columns). The dataset is imbalanced, with 12% of entries belonging to class 1 (the investment-ready class) and the remaining 88% to class 0 (the non-investment-ready class). Out of the 51 variables (a detailed description of every variable is given in Appendix A, Table A1), 44 were kept as inputs, 1 was the target variable, and the rest were columns containing information on date, company ID, and country. Categorical variables were converted to numerical format using dummy encoding, creating binary columns to represent each category. In order to avoid the dummy variable trap and collinearity issues, for each categorical variable with k unique values we produced k − 1 binary columns [21].
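As an illustration, the k − 1 encoding can be reproduced with pandas’ get_dummies using drop_first=True; the firm_size column and its categories below are hypothetical examples, not actual SAFE fields.

```python
import pandas as pd

# Hypothetical categorical variable (not a SAFE field) with k = 3 categories.
df = pd.DataFrame({"firm_size": ["micro", "small", "medium", "small"]})

# drop_first=True keeps k - 1 = 2 binary columns, avoiding the dummy
# variable trap: the dropped category becomes the reference level.
dummies = pd.get_dummies(df["firm_size"], prefix="firm_size", drop_first=True)
print(list(dummies.columns))  # ['firm_size_micro', 'firm_size_small']
```

The dropped category ("medium", the first alphabetically) is implicitly represented by all-zero rows in the remaining columns.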
The binary target variable takes the value of 1 when the SME is investment-ready and 0 in the opposite case. We define an SME as investment-ready (IR) when it has the following characteristics:
  • Innovative: These are the firms that have reported new developments, such as introducing a new product or service to the market, implementing a new production process or method, adopting new management practices, or exploring new ways of selling goods or services.
  • Fast-growing: These are the cases where the annual turnover increases by more than 20%.
  • Open to equity financing: These are the firms that reported equity either as a relevant funding source or as one used in the past six months.
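As a sketch of how the label might be constructed, assuming (as the list implies) that all three criteria must hold jointly; the argument names and boolean encoding below are illustrative, not the actual SAFE variable codes:

```python
def is_investment_ready(innovative: bool, turnover_growth: float,
                        open_to_equity: bool) -> int:
    """Return 1 (investment-ready) only when all three criteria hold."""
    fast_growing = turnover_growth > 0.20  # annual turnover growth above 20%
    return int(innovative and fast_growing and open_to_equity)

print(is_investment_ready(True, 0.25, True))   # 1: meets all three criteria
print(is_investment_ready(True, 0.10, True))   # 0: growth below 20%
```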

2.2. Machine Learning Algorithms

Machine learning is the branch of Artificial Intelligence that allows a machine—an algorithm—to learn from data and improve its performance without being explicitly programmed. ML algorithms analyze historical data to identify simple or complex patterns or to make predictions. As Russell and Norvig (2020) state, “the process where the computer observes some data, builds a model based on the data and uses it as a hypothesis about the world and as a software that can solve problems” is called machine learning. Supervised learning is the type of machine learning where the algorithm learns and produces a model (function) from labeled training data. The goal is for the system to generalize from these examples, meaning that it can efficiently produce the correct output for new and unseen inputs [22].

2.2.1. Logistic Regression

Logistic Regression is a statistical method of classification. It is an algorithm that models the probability of a discrete outcome. While it shares similarities with linear regression in terms of using a linear combination of input features, the key distinction is that Logistic Regression applies a non-linear logit (log-odds) transformation to model the probability of the outcome. Logistic Regression is considered relatively interpretable among classification methods because the sign of each coefficient indicates whether a predictor positively or negatively influences the outcome’s log-odds. However, the exact effect of a unit change in an input on the predicted probability is not directly intuitive due to the non-linear (logistic) transformation. In other words, while one can relate coefficient values to odds ratios, translating those into absolute probability changes requires additional calculation. The algorithm can also incorporate regularization techniques such as L1 (Lasso), L2 (Ridge), or ElasticNet regularization norms, which help prevent overfitting and improve generalization by penalizing the excessively large coefficients of the model [23].
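The coefficient-to-probability point can be illustrated numerically; the coefficient value below is hypothetical, not estimated from the SAFE data.

```python
import math

beta = 0.8                       # hypothetical coefficient on a predictor
print(round(math.exp(beta), 2))  # odds ratio: the odds multiply by ~2.23

def sigmoid(z: float) -> float:
    """Logistic function mapping log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# The same coefficient implies different absolute probability changes
# depending on the baseline log-odds, hence the extra calculation.
low = sigmoid(-2.0 + beta) - sigmoid(-2.0)   # change at a low baseline
mid = sigmoid(0.0 + beta) - sigmoid(0.0)     # change at a 50% baseline
print(round(low, 3), round(mid, 3))
```

A one-unit increase always multiplies the odds by the same factor, but shifts the predicted probability by about 0.11 at a low baseline versus about 0.19 at a 50% baseline.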

2.2.2. K-Nearest Neighbors

The K-Nearest Neighbors algorithm is a non-parametric, instance-based machine learning method. Rather than learning an explicit model from the training data, K-NN makes predictions by directly referencing the stored training examples. For a given instance, the algorithm identifies the k closest data points (neighbors) in the feature space, typically using a distance metric such as Euclidean distance. The predicted class for the instance is the majority class among these neighbors. Choosing a small value of k can lead to overfitting, as the model becomes sensitive to noise in the training data, while a large value of k may result in underfitting by oversmoothing class boundaries [24].
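A minimal pure-Python sketch of the voting rule on toy 2-D points (Euclidean distance, majority vote); the data are invented for illustration.

```python
from collections import Counter
import math

def knn_predict(train, labels, x, k=3):
    """Predict the majority class among the k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train, labels, (0.5, 0.5)))  # 0: near the first cluster
print(knn_predict(train, labels, (5.5, 5.5)))  # 1: near the second cluster
```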

2.2.3. Random Forest

The Random Forest classifier is an ensemble learning algorithm that builds multiple decision trees and combines their outputs. Each tree in the forest is trained on a random subset of the data, selected through bootstrap sampling (sampling with replacement), and at each split within a tree, a random subset of features is considered to determine the best split. This combination of data and feature randomness helps reduce overfitting and enhances the model’s generalization ability. For classification tasks, the final prediction is determined by majority voting. Random Forests are well-suited for high-dimensional datasets, are robust to noise, and generally maintain a degree of interpretability through feature importance measures [7].
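A brief scikit-learn sketch on synthetic data (not the SAFE dataset); max_features controls the random feature subset considered at each split, and the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data mimicking a ~12% positive class.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                             class_weight="balanced", random_state=0)
clf.fit(X, y)

# Feature importances offer a degree of interpretability.
print(clf.feature_importances_.round(3))
```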

2.2.4. Support Vector Machines

Support Vector Machines are a powerful class of supervised machine learning algorithms used for classification and regression tasks. The aim of an SVM is to locate the optimal hyperplane that maximizes the distance between the hyperplane and the closest data points of different classes, known as support vectors. When data are not linearly separable, Support Vector Machines utilize the kernel trick, which maps the input features into a higher-dimensional space where a linear separation between the classes is possible. In our experiments, we used the radial basis function kernel and the linear kernel. Regularization in SVMs helps balance the trade-off between fitting the training data well and maintaining generalization to new data [25].
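A sketch comparing the two kernels on synthetic data (not the SAFE dataset); C is the regularization parameter trading off margin width against training error, and its value here is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

for kernel in ("rbf", "linear"):
    # class_weight="balanced" reweights classes, as discussed for imbalance.
    clf = SVC(kernel=kernel, C=1.0, class_weight="balanced").fit(X, y)
    print(kernel, "support vectors:", clf.support_vectors_.shape[0])
```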

2.2.5. Naïve Bayes

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem, used to classify data by estimating the probability that a given instance belongs to a particular class. It is termed “naïve” because it assumes that the features (or attributes) are conditionally independent of each other given the class label. In Naïve Bayes classification, the algorithm estimates the prior probability of each class by calculating the frequency of appearance of each class in the dataset. It then computes, for each feature, the likelihood of observing its value given the class. For a new unknown instance, the classifier computes the posterior probability for each possible class, i.e., the probability that the instance belongs to class C given the features X. The class with the highest posterior probability is selected as the predicted class [26].
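A worked numerical sketch with a single hypothetical feature ("innovative = yes"); the priors match the paper's 12%/88% class shares, but the likelihood values are invented for illustration.

```python
# Priors match the class shares in the dataset; likelihoods are invented.
prior = {"ready": 0.12, "not_ready": 0.88}
likelihood = {"ready": 0.70, "not_ready": 0.30}  # P(innovative = yes | class)

# Posterior is proportional to prior * likelihood, then normalized.
unnorm = {c: prior[c] * likelihood[c] for c in prior}
total = sum(unnorm.values())
posterior = {c: unnorm[c] / total for c in unnorm}

print({c: round(p, 3) for c, p in posterior.items()})
# Even with a strong likelihood, the 88% prior keeps "not_ready" more probable.
```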

2.2.6. AdaBoost

Adaptive Boosting (AdaBoost) is an ensemble machine learning algorithm that combines multiple weak classifiers to create a strong classifier. It works by sequentially training a series of simple models, in our case shallow decision trees, on repeatedly modified versions of the data. In each new model the algorithm focuses on the instances misclassified by the previous ones. AdaBoost assigns higher weights to misclassified samples, compelling subsequent models to focus on these hard-to-classify cases. The final prediction is made by aggregating the outputs of all weak classifiers through a weighted majority vote, where classifiers with better accuracy have greater influence [27].
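A brief scikit-learn sketch on synthetic data (not the SAFE dataset); scikit-learn's default weak learner is a depth-1 decision tree (a stump), matching the shallow trees described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Up to 100 sequential stumps; each round reweights misclassified samples.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(len(clf.estimators_), "weak learners in the weighted majority vote")
```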

2.2.7. Easy Ensemble

The Easy Ensemble classifier was introduced by Liu, Wu et al. (2009) [28] to address imbalanced classification problems by leveraging ensemble learning. It consists of multiple AdaBoost classifiers, each trained on distinct, balanced bootstrap samples. To achieve balance, the method applies random under-sampling in the majority class, ensuring that both classes will have equal representation in the sub-sample. Each iteration of the algorithm involves training an AdaBoost model using a different sub-sample. This process is repeated several times, generating multiple classifiers. The final prediction is produced through a voting mechanism of all models. By repeatedly resampling the majority class and boosting weak learners, Easy Ensemble effectively improves classification performance on imbalanced datasets while maintaining the strengths of boosting techniques.
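The resampling step can be sketched structurally in a few lines; the index counts below are illustrative, and a full implementation is available as imbalanced-learn's EasyEnsembleClassifier.

```python
import random

def balanced_subsamples(majority_idx, minority_idx, n_subsets, seed=0):
    """Each subset: all minority indices plus an equal-sized random
    under-sample of the majority, as in Easy Ensemble."""
    rng = random.Random(seed)
    return [rng.sample(majority_idx, len(minority_idx)) + list(minority_idx)
            for _ in range(n_subsets)]

majority = list(range(88))        # e.g., 88% of firms in class 0
minority = list(range(88, 100))   # 12% in class 1
subsets = balanced_subsamples(majority, minority, n_subsets=10)
print(len(subsets), len(subsets[0]))  # 10 balanced subsets of 24 firms each
# One AdaBoost model would then be trained per subset and their votes combined.
```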

2.2.8. Balanced Bagging Classifier

The Balanced Bagging classifier is an ensemble method designed to address the class imbalance in classification tasks. It is a variation on the traditional Bootstrap Aggregating (Bagging) algorithm, where multiple models are trained in parallel and their predictions are combined to make the final decision. The key difference is that Balanced Bagging classifier also handles the class imbalance. The classifier creates multiple subsets of the original data by randomly sampling data points with replacement. The subsets are balanced before the training of each classifier. The balance of the class labels is achieved by random under-sampling on the majority class to match the size of the minority class. Then, each classifier is trained on the balanced subsets. The final prediction is made by utilizing the majority voting concept on classification tasks [29].

2.2.9. Gradient Boosting Trees

Gradient Boosting classifiers are additive models, where the prediction for a given input is made by combining the outputs of several weak learners, which in our problem are decision trees. The Gradient Boosting model is built iteratively in a greedy fashion, where each new tree is trained to correct the errors (residuals) made by the previous trees. Initially, the model starts with a constant value, the mean of the target values, to minimize the loss. During each iteration, the model computes the residuals—the differences between the observed and predicted values—and fits a decision tree to these residuals. The tree is adjusted to minimize the loss function, and the model is updated by adding the contribution of the new tree. This process is repeated for a set number of trees (weak learners), with each tree improving the model by correcting the errors of previous ones.
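A compact scikit-learn sketch on synthetic data (not the SAFE dataset); learning_rate shrinks each tree's contribution to the additive model, and the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

# 200 boosting stages; each tree fits the residuals of the current model.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.estimators_.shape[0], "boosting stages fitted")
```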
Finally, we have also included the Logistic Regression model, from traditional econometrics, as a benchmark. This allowed us to directly compare the performance of advanced machine learning algorithms to a standard econometric classification technique.

2.2.10. Cross-Validation

Before applying any machine learning algorithm, we first split the dataset into (a) the training set, which is used to locate the best hyperparameters and identify the optimal models, and (b) the validation set, which is never used in the previous process (training) and is used to evaluate the out-of-sample performance of the trained models (Figure 1).
A critical aspect of machine learning model development is managing the trade-off between overfitting and underfitting. Underfitting occurs when a model is unable to capture the underlying structure of the data, often due to excessive simplicity or insufficient training, and is typically reflected in a poor performance across both the training and validation datasets. Overfitting occurs when a model becomes too complex relative to the amount of training data, leading it to memorize noise and idiosyncrasies within the specific training set. As a result, while training accuracy may be high, the performance on any new and unseen data, like the validation dataset, degrades significantly. In either case, the model fails to generalize beyond the training distribution. To mitigate overfitting, cross-validation techniques are employed to evaluate the model’s performance across multiple data partitions, providing an estimate of its predictive accuracy on unseen data.
Thus, to start the training process, we employed a cross-validation technique (Figure 1). The training dataset is partitioned into 5 equal-sized folds, and the training is repeated iteratively; in each iteration, 4 folds are used for training the model, while the remaining fold is used for testing. This process is repeated 5 times, with each fold serving as the test set once. For each unique set of hyperparameters, a new model is trained, and the corresponding test scores are averaged and used to identify the optimal model. Given the class imbalance in the dataset, where the minority class (class 1) comprises only 12% of the observations, we employed a stratified 5-fold cross-validation technique, rather than simple random sampling for the folds, to ensure that all folds were consistent. This ensures that each fold maintains roughly the same class distribution as the original dataset, thereby providing stable and representative validation results.
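The stratification property can be verified directly on synthetic labels with the same 12%/88% split (placeholder features, not the SAFE data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 12 + [0] * 88)   # 12% minority, as in the dataset
X = np.zeros((100, 1))              # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_shares = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_shares)  # each fold of 20 holds 2 or 3 of the 12 minority cases
```

With plain random folds, a fold could by chance contain almost no minority observations, which is exactly what stratification rules out.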

2.2.11. Model Selection

Nine classification algorithms were used in this study. The hyperparameter-tuning process was conducted using the grid search technique, a systematic approach that exhaustively explores a predefined subset of the hyperparameter space. Grid search iteratively evaluates combinations of hyperparameters by fitting models to the training data and assessing their performance using the stratified 5-fold cross-validation scheme. The optimal hyperparameters for each algorithm were chosen based on the highest average performance in cross-validation on the test sets. The classification algorithms applied in the experimental analysis are the following:
  • Logistic Regression;
  • K-Nearest Neighbors;
  • Random Forest;
  • Support Vector Machines with RBF and Linear Kernel;
  • Naïve Bayes Classifier;
  • AdaBoost;
  • Easy Ensemble;
  • Balanced Bagging Classifier;
  • Gradient Boosting Trees.
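The tuning loop can be sketched with scikit-learn's GridSearchCV on synthetic data; the benchmark Logistic Regression and the small grid of C values below are illustrative, not the actual grids searched in the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

# Exhaustive search over a small grid, scored by balanced accuracy under
# stratified 5-fold cross-validation, as in the text.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_)  # hyperparameters with the best average CV score
```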

2.2.12. Data Imbalance

To address the issue of the imbalanced dependent variable, the models were trained using balanced class weights. This approach involves assigning higher weights to the underrepresented class (class 1—investment-ready) and lower weights to the overrepresented class (class 0—not investment-ready) during the training process. By doing this, the model in the training process is penalizing the misclassification of the minority-class observations more, thus encouraging it to pay more attention to the less frequent observations.
There are various empirical techniques for handling class imbalance. Oversampling techniques, like SMOTE, ADASYN, etc., create and add artificial observations to inflate the minority class; under-sampling techniques reduce the number of observations in the majority class; and cost-sensitive learning applies a higher cost to the minority class in the cost function or assigns class weights. In our case, the minority class is approximately 12% of the data, which represents a moderate imbalance. In such settings, balanced class weights are sufficient to achieve a good performance, especially in conjunction with tree-based classifiers (e.g., Random Forest, XGBoost), which are intrinsically less sensitive to class imbalance than linear models. Thus, to handle class imbalance, we used class weights both in the training process and in the construction of the cost function, where the minority class is given 5 times the cost of the majority one. Moreover, some of the algorithms selected for the classification are specifically designed to handle class imbalance: (a) the Easy Ensemble classifier, introduced by Liu, Wu et al. (2009) [28] to address imbalanced classification problems by leveraging ensemble learning; and (b) the Balanced Bagging classifier, a variation of the traditional Bootstrap Aggregating (Bagging) algorithm that can handle class imbalance.
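The 5:1 weighting in the cost function reduces to a one-line computation; the confusion counts below are hypothetical, chosen only to show the asymmetry.

```python
def misclassification_cost(fp: int, fn: int, fn_weight: float = 5.0) -> float:
    """Total cost with false negatives weighted 5x false positives."""
    return fp * 1.0 + fn * fn_weight

# Two hypothetical models with the same 50 total errors:
print(misclassification_cost(fp=40, fn=10))  # 90.0
print(misclassification_cost(fp=10, fn=40))  # 210.0: penalized far harder
```

Under this cost, a model that trades false positives for false negatives is penalized even when its raw error count is unchanged, steering model selection toward catching investment-ready firms.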

2.3. Forecasting Performance Metrics

In a binary classification problem, normally, a confusion matrix (Figure 2) is created to visually summarize the models’ performance. The confusion matrix consists of four values as below:
  • True Positives (TP): The number of instances in which the model correctly predicts the positive class (in our case, the SME to be predicted as investment-ready, class 1).
  • True Negatives (TN): The number of instances in which the model correctly predicts the negative class (in our case, the SME to be predicted as non-investment-ready, class 0).
  • False Positives (FP): The number of instances in which the model incorrectly predicts the positive class when the actual class is negative.
  • False Negatives (FN): The number of instances in which the model incorrectly predicts the negative class when the actual is positive.
From the four values above, various metrics are defined and used to evaluate our models, as described below.

2.3.1. Precision

This metric measures the reliability of the predictions made for each class:
Class 0 (not investment-ready):
It counts the proportion of companies forecasted as “not investment-ready” that are truly not investment-ready. High precision indicates high confidence in negative predictions.
Precision₀ = TN / (TN + FN)
Class 1 (investment-ready):
In this case, the precision measures the proportion of companies predicted as “investment-ready” that are truly investment-ready. High precision reduces false positives (e.g., mislabeling companies not investment-ready as investment-ready).
Precision₁ = TP / (TP + FP)

2.3.2. Recall (Sensitivity/Specificity)

Measures the models’ ability to identify all relevant instances of a class:
Class 0 Recall (Specificity):
Recall of class 0 counts the ability of the model to correctly identify companies not investment-ready.
Recall₀ = TN / (TN + FP)
Class 1 Recall (Sensitivity):
Recall of class 1 counts the ability of the model to correctly identify investment-ready companies.
Recall₁ = TP / (TP + FN)

2.3.3. F1-Score

Balances precision and recall using their harmonic mean. A high F1-score indicates a strong performance in both precision and recall for a class. It is critical for evaluating the investment-ready (minority) class, where both false positives (costly misallocations) and false negatives (missed opportunities) are consequential.
F1 Score = TP / (TP + (FP + FN)/2)
or alternatively,
F1 Score = 2 × Precision × Recall / (Precision + Recall)

2.3.4. Balanced Accuracy

While accuracy is the standard metric for evaluating classification models on balanced datasets, it becomes less informative in the presence of class imbalance, as in our case. In such settings, the balanced accuracy measure is preferred. Defined as the average recall (sensitivity) across all classes, balanced accuracy offers a more reliable assessment of model performance by explicitly accounting for the sensitivity of each class. Balanced accuracy can also be interpreted as plain accuracy in which each observation is weighted by the inverse prevalence of its true class [31].
Balanced Accuracy = (1/2) × (TP / (TP + FN) + TN / (TN + FP))
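As a concrete illustration, the four metrics above can be computed directly from the confusion-matrix counts; the sketch below uses hypothetical counts, not the paper's results.

```python
# Metrics from Sections 2.3.1-2.3.4 computed directly from confusion-matrix
# counts. The counts used below are hypothetical, for illustration only.
def classification_metrics(tp, tn, fp, fn):
    return {
        "precision_0": tn / (tn + fn),             # reliability of class 0 calls
        "precision_1": tp / (tp + fp),             # reliability of class 1 calls
        "recall_0": tn / (tn + fp),                # specificity
        "recall_1": tp / (tp + fn),                # sensitivity
        "f1_class1": tp / (tp + 0.5 * (fp + fn)),  # equals the harmonic-mean form
        "balanced_accuracy": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

metrics = classification_metrics(tp=80, tn=800, fp=120, fn=20)
```

Note that the two F1 formulas coincide: with these counts, precision_1 = 0.4 and recall_1 = 0.8 give the same 0.533 as the count-based form.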

2.3.5. Receiver Operating Characteristic Area Under Curve (ROC AUC)

The ROC AUC (Figure 3) quantifies the model's ability to discriminate between classes. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings, while the area under the curve (AUC) ranges from 0.5 (no discriminative power, equivalent to random guessing) to 1 (perfect discrimination).
All the above metrics matter. High class 0 precision ensures that firms that are not investment-ready are correctly filtered out. High class 1 recall ensures that investment-ready companies are not overlooked, while class 1 precision avoids costly false positives. The F1-score and balanced accuracy each provide a single metric that balances these trade-offs, which is especially vital for an imbalanced dataset.
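For completeness, the ROC AUC can equivalently be computed as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties counted as one half); the sketch below uses hypothetical scores and labels.

```python
# ROC AUC via its rank-statistic interpretation: the probability that a
# random positive outranks a random negative. Inputs are hypothetical.
def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc(scores=[0.9, 0.8, 0.35, 0.2], labels=[1, 0, 1, 0])  # 3 of 4 pairs
```

A perfect ranking yields 1.0; a classifier whose scores carry no information about the label yields 0.5, matching the bounds described above.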

2.3.6. Feature Engineering and Exploratory Analysis

The occurrence of each class of the 44 variables in our dataset can be found in Table A2 in the Appendix A. A potential problem to consider when applying one-hot encoding to a wide range of categorical variables is multicollinearity. High correlation among predictors can lead to biased coefficient estimates and large standard errors in traditional econometric linear models such as Logistic Regression [32]. An additional advantage of machine learning algorithms, however, is that tree-based ensemble models, including Random Forest and Gradient Boosting, can efficiently handle datasets with highly correlated independent variables. As Breiman (2001) [7] explained, from a group of highly correlated independent variables a decision tree will typically select only one at each split, as the others provide redundant information and thus have no effect on the model's predictive power. As a result, tree-based models, by construction, inherently mitigate possible multicollinearity issues. Nonetheless, multicollinearity can affect the stability and interpretation of feature importance measures, as importance may be reduced or arbitrarily allocated across the correlated variables.

2.4. Variable Importance Measure (VIM) and Shapley Additive Explanation (SHAP)

In tree-based models, the VIM is used to assess and rank the contribution of each variable to the model’s predictive performance. During the construction of a decision tree, the algorithm selects variables at each split based on how well they improve the model’s ability to distinguish between classes. This is typically performed by minimizing a metric known as impurity (e.g., Gini impurity or entropy). Variables that consistently result in greater reductions in impurity across the tree are considered more important. The VIM reflects this by aggregating the impurity reductions attributed to each variable over the entire tree (or forest, in ensemble models like Random Forest), helping to identify which features have the most influence on the model’s decisions.
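The quantity the VIM aggregates per variable can be sketched at the level of a single split: the weighted decrease in Gini impurity from parent to children. The node counts below are hypothetical.

```python
# Weighted impurity decrease of one candidate split; the VIM sums such
# decreases per variable over all splits. Counts are hypothetical.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = gini([60, 40])                  # impurity before the split: 0.48
left, right = gini([50, 10]), gini([10, 30])
n, n_left, n_right = 100, 60, 40
decrease = parent - (n_left / n) * left - (n_right / n) * right
```

A variable whose splits repeatedly produce large `decrease` values accumulates a high importance score; entropy can be substituted for Gini without changing the logic.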
We interpret the Gradient Boosting classifier with SHAP (Shapley Additive exPlanations). SHAP decomposes our predictions into additive feature contributions relative to the model expectation (baseline):
f(x) = φ_0 + Σ_{i=1}^{n} φ_i(x)
where φ0 represents the baseline and φ i ( x ) is the SHAP value for feature i on instance x. In simple words, we count how much a feature “pushes” the prediction up or down relative to the baseline. In our case, a positive φ i pushes towards IR, and a negative one in the opposite direction relatively to the baseline.

3. Empirical Results

Given the significant class imbalance in our dataset (88% not investment-ready vs. 12% investment-ready), simple accuracy is not an adequate forecasting metric. A naïve classifier that labels all instances in the dataset as "not investment-ready" would achieve 88% accuracy while failing entirely to identify investment-ready firms. Balanced accuracy addresses this limitation, as described above: by averaging the recall rates of both classes, it gives equal importance to investment-ready and not-investment-ready companies, regardless of their proportions in the dataset. Therefore, we use the balanced accuracy score to rank the models' overall performance and identify the best model.
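The point can be verified numerically; the sketch below reproduces the 88/12 scenario with a naive all-negative classifier.

```python
# With the paper's 88/12 imbalance, predicting "not investment-ready"
# for every firm looks strong on plain accuracy but is useless by
# balanced accuracy (class 1 recall is zero).
labels = [0] * 88 + [1] * 12
preds = [0] * 100                       # naive all-negative classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall_0 = sum(1 for p, y in zip(preds, labels) if y == 0 and p == 0) / 88
recall_1 = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1) / 12
balanced_accuracy = (recall_0 + recall_1) / 2   # 0.88 accuracy, 0.5 balanced
```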
As shown in Figure 4, the Gradient Boosting model outperforms the rest in both class-wide metrics, with a balanced accuracy of 0.754 and a ROC AUC of 0.815. Logistic Regression comes second on both criteria (Table A3 in the Appendix A reports the balanced accuracy and ROC AUC scores of the optimal model for each methodology). Figure 5 and Figure 6 depict the forecasting metrics for the two classes separately; here too, Gradient Boosting is on average the best forecasting model. For the class 1 metrics, it achieved some of the best performances, falling only slightly behind in recall and precision, where Logistic Regression and Random Forest, respectively, excelled. However, Gradient Boosting attains the best class 1 F1-score and thus offers the best trade-off between precision and recall, making it the most well-rounded model for discovering promising SMEs.
Moreover, in Figure 7 below, we present the ROC curve for the Gradient Boosting model. The ROC curves for the remaining models can be seen in Appendix A, Figure A1, Figure A2, Figure A3, Figure A4, Figure A5 and Figure A6.
Furthermore, the Easy Ensemble shows consistent performance across metrics. While it does not lead in any single metric, with a balanced accuracy of 0.746, a precision of 0.396, and a recall of 0.717, its steady performance suggests robustness. With an F1-score of 0.510 and a ROC AUC of 0.804, it performs only slightly worse than the two leading models. Finally, the Balanced Random Forest achieved the third-highest ROC AUC (0.810), only marginally below Logistic Regression, with a competitive recall of 0.711; nonetheless, it has the lowest balanced accuracy (0.738) and precision (0.383) among the top models.
Considering precision, recall, and F1, Gradient Boosting appears to outperform most of the competition for class 0, exhibiting some of the top values across all evaluation metrics. The model achieved the best overall precision of 0.931, with the second-highest overall recall score of 0.794 and the second-highest F1-score of 0.857, only falling behind Random Forest, which had a recall score of 0.830 and an F1-score of 0.872.
Class 0 precision remains consistent across all models, with limited variability (precision variance across models is 0.031). The same does not hold for recall (recall variance across models is 0.128). All models achieve exceptional class 0 precision (>90%), with Gradient Boosting offering the best performance.
Overall, Gradient Boosting is the optimal model for predicting SME investment readiness and is the recommended model for a balanced predictive strategy. There are, however, models that excel at specific metrics, such as Logistic Regression in class 1 recall or Random Forest in class 0 recall. Classifying the majority class proved an easy task for all models, which achieved very high levels of both precision and F1-scores (Table A4). This indicates that the algorithms were effective at filtering out firms that are not investment-ready. Predicting the minority class, on the other hand, proved to be a challenge, as evidenced by the lower F1-scores of all models, which peaked at 0.526, and by the low precision levels. This highlights the difficulty of avoiding false positives.
In Figure 8 and Figure 9, we present the confusion matrix and the normalized confusion matrix, respectively, for the best model, Gradient Boosting. It achieved the following:
  • 267 TP predictions (investment-ready) out of 374, or 71.3%.
  • 1440 TN predictions (not investment-ready) out of 1814, or 79.3%.
  • 374 FP predictions (predicted as investment-ready but are not), or 20.6%.
  • 107 FN predictions (predicted as not investment-ready but actually are), or 28.6%.
The confusion matrices for every trained model are included in Figure A7, Figure A8, Figure A9, Figure A10, Figure A11, Figure A12, Figure A13, Figure A14, Figure A15 and Figure A16 in the Appendix A. The performance metrics values for the optimal models for each methodology can be found in Table A3 (balanced accuracy and ROC AUC score) and Table A4 (precision, recall, and F1-score) in the Appendix A.
When applying the Variable Importance Measure (VIM) technique to our best model, Gradient Boosting, the 10 most important variables and their ranking are presented in Figure 10. The variable importance plot illustrates the relative contribution of each input feature to the model’s classification performance, with higher values indicating a greater influence on the model’s predictions. Out of the ten most important variables ranked according to the VIM, the first six are comparatively the most prominent.
  • The most important predictor appears to be the firm’s “confidence in negotiations with equity investors or venture capital firms”. High levels of negotiation confidence likely reflect firm preparation, understanding of investor expectations, and stronger business fundamentals, all of which are critical traits for attracting external investment.
  • The second most important predictor is “financing growth”, which indicates that the firms that are actively seeking and managing financial growth tend to be investment-ready. This highlights the importance of proactive financial planning and scaling strategies as indicators of a firm’s investment potential.
  • “Factors in the future of financing of the firm” is ranked third in top predictors. This variable points to the importance of future financial planning. Investors may favor firms that not only demonstrate current performance but also show foresight in securing future financing.
  • “External financing factors” including market conditions or access to funding channels is also an essential feature that influences the model’s decision-making process. Firms capable of navigating external factors’ influences may be more successful in attracting external investors.
  • “Autonomous organization type”, relating to the structure of the organization, is also among the most important predictors of an investment-ready SME. Autonomous firms may be more agile and better able to innovate, and thus they draw investors’ interest.
  • Finally, “willingness of investors to invest in the enterprise” captures the investors’ sentiment towards the firm.
These six variables summarize the importance of both (a) internal factors such as confidence, growth, future outlook, and the firm’s autonomy and flexibility, and (b) external factors such as market conditions, availability of financing, and the overall sentiment of the investors towards investing in a firm.

3.1. SHAP Values

We present the SHAP summary figure (Figure 11). SHAP was introduced by Lundberg and Lee (2017) to provide a unified, model-agnostic approach for interpreting individual predictions via Shapley-value-based additive feature attributions that satisfy local accuracy, consistency, and missingness. The rows are ordered according to their global importance, and each dot is an SME. The position of each dot is its SHAP value, and the color shows the feature value (red means high and blue means low). Every point to the left of zero decreases the predicted IR, and every point to the right increases it.
According to the above explanation and SHAP graph, we can observe the following for the four most-important independent variables.
  • Confidence in negotiations: High values of confidence—as expected—indicate an investment-ready SME.
  • Importance of factors in the future financing of the firm: Declining future financing prospects reduce the investment readiness of an SME.
  • Not taking autonomous financial decisions due to being a subsidiary: The results show that high values of this variable significantly decrease investment readiness for an SME.
  • Willingness of external investors to invest in a specific SME: Perceived persistence of the interest of external investors in specific SMEs seems to increase investment readiness.
Both VIM and SHAP identify an overlapping set of high-impact predictors—e.g., negotiation confidence, financing-for-growth posture, future-financing priorities, external financing conditions, organizational autonomy, and perceived investor willingness. VIM uses global split-based importance, whereas SHAP uses the marginal contributions of the observations; agreement across these distinct criteria indicates that the identified drivers are not an artifact of the choice of importance estimator. The minor rank-order differences are expected given the distinct objectives and estimation criteria of VIM and SHAP.

3.2. Cost Function

Another way to assess the forecasting efficiency of the selected algorithms and identify the best model is by calculating the misclassification cost and including it in the evaluation process. This is achieved by using a cost function in the selection of the optimal model. The cost function that we used is defined as
Total Misclassification Cost (TMC) = C_FP × FP + C_FN × FN
where C_FP is the cost of a false positive forecast, C_FN is the cost of a false negative forecast, FP is the total number of false positive cases, and FN is the total number of false negative cases. In our case, we set C_FP = 1 and C_FN = 5, following many studies in the field [33,34,35]. The use of such a weighted cost is part of the cost-sensitive evaluation methodology studied in classification problems with unequal error costs [36]. Table 1 presents, for each of the five optimal models, the FN, the FP, and the resulting TMC.
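The TMC is straightforward to compute; the sketch below uses hypothetical FP/FN counts, not the counts of Table 1.

```python
# The cost function used for model selection, with the paper's asymmetric
# costs as defaults. The FP/FN counts below are hypothetical.
def total_misclassification_cost(fp, fn, c_fp=1, c_fn=5):
    return c_fp * fp + c_fn * fn

tmc = total_misclassification_cost(fp=200, fn=150)  # 200*1 + 150*5 = 950
```

Because each false negative is five times as expensive as a false positive, a model that trades a few extra FPs for fewer FNs can achieve a lower TMC despite a lower precision.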
The Gradient Boosting model exhibits the lowest TMC (908), making it the best model even under the cost-sensitivity criterion. Logistic Regression produces a slightly higher TMC (914), indicating comparable cost performance. These results confirm that accounting for misclassification costs refines our evaluation: Gradient Boosting is not only strong in terms of balanced accuracy but also minimizes the economic cost of errors under our cost assumptions.
Under the cost-optimized threshold (0.452), the confusion matrix for the Gradient Boosting model is as shown in Figure 12 and Figure 13.
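Threshold optimization of this kind can be sketched as a simple scan over candidate thresholds; the scores and labels below are hypothetical, and with C_FN = 5 the cost-minimizing threshold falls below 0.5, in the spirit of the paper's cost-optimized 0.452.

```python
# Scan candidate thresholds and keep the one minimizing the TMC.
# Scores and labels are hypothetical illustrations.
def best_threshold(scores, labels, c_fp=1, c_fn=5):
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = c_fp * fp + c_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

t_star, cost = best_threshold([0.9, 0.7, 0.6, 0.4, 0.3, 0.1],
                              [1, 1, 0, 1, 0, 0])
```

With these toy inputs, the optimum accepts one extra false positive rather than miss the positive scored at 0.4, exactly the behavior an FN-heavy cost induces.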
The empirical evidence provides answers to our research questions as follows. With regard to whether machine learning (ML) enhances identification in situations involving imbalance and noise, the Gradient Boosting approach attained the highest balanced accuracy (0.754) and ROC AUC (0.815), while also identifying 71.3% of investment-ready SMEs (TP = 267/374) with a class 1 F1-score of 0.526 (see Figure 4, Figure 5, Figure 6, Figure 8 and Figure 9). These findings support H1 and confirm RQ1. The feature importance analysis (see Figure 10) identifies a small set of variables with the greatest predictive power: negotiation confidence with equity investors, financing for growth, forward-looking financing plans, external financing conditions, organizational autonomy, and investor willingness. Finally, implementing asymmetric costs (CFN = 5, CFP = 1) redirected the evaluation towards minimizing false negatives. Gradient Boosting minimized the Total Misclassification Cost (TMC = 908) versus Logistic Regression (TMC = 914) and similar models (Table 1), with a cost-optimized threshold (0.452) further improving the trade-off, demonstrating that cost-sensitive model selection criteria lead to better economic choices.
Based on the above empirical results, we can now discuss our research question and whether the related research hypothesis is supported by the empirical results.
Regarding RQ1 and whether machine learning (ML) enhances the identification of investment-ready SMEs in situations involving imbalanced datasets and noise, we performed the following statistical tests:
  • A DeLong test on the ROC AUCs produced by the two best-performing algorithms, XGBoost and the logit. The hypotheses tested are the following:
    H0: The two algorithms produce AUCs that are not statistically different.
    H1: The two algorithms produce AUCs that are statistically different.
    The test is performed on the same validation set for both models. The z-score is −0.4462 with a p-value of 0.6554, providing statistical evidence that the two algorithms do not significantly differ in terms of their ROC AUCs. The relevant confidence intervals are the following:
    XGBoost AUC at 95% probability produces a confidence interval of [0.783, 0.835].
    Logit AUC at 95% probability produces a confidence interval of [0.781, 0.833].
    Thus, with the DeLong test, we did not find evidence that one model produces a statistically better ROC AUC than the other.
  • We performed a McNemar test, a non-parametric test applied to categorical observations to analyze the differences between two related groups, usually arranged in a 2 × 2 matrix of dichotomous variables. On the main diagonal, we have the count of instances where the two models agree, and on the secondary diagonal, we have the count of observations where the two models disagree.
The differences between the two methods’ performances are the following (the secondary diagonal values):
  • XGBoost correct, logit wrong: 110.
  • XGBoost wrong, logit correct: 56.
Thus, among the discordant pairs, roughly twice as many instances are classified correctly by the XGBoost model. The total number of discordant instances (166) is well above 25, the minimum usually required for the asymptotic chi-square approximation to be accurate. The McNemar test statistic is 16.922, with a p-value of 0.000, so the difference is statistically significant. The difference in error rates (errors/all) is XGBoost − Logit = −0.0247, confirming that Gradient Boosting has a slightly lower plain misclassification rate. In addition, bootstrapping this error rate difference of −0.0247 yields a confidence interval of [−0.0361, −0.0133]; since the confidence interval does not include zero, we can confirm that the difference is statistically significant.
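For reference, the continuity-corrected McNemar statistic can be reproduced from the discordant counts reported above (110 and 56); the p-value comes from the survival function of a chi-square distribution with one degree of freedom.

```python
import math

# McNemar test with continuity correction on the discordant counts
# reported in the text (b = 110, c = 56). The statistic is approximately
# chi-square distributed with 1 degree of freedom.
def mcnemar(b, c):
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))   # chi2(1) survival function
    return stat, p_value

stat, p_value = mcnemar(110, 56)   # stat ≈ 16.922, p ≈ 0.000
```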

4. Discussion

The model that we proposed goes beyond a routine ML model comparison in four ways:
  • We created a universal EU model using the large dataset collected by the Survey on Access to Finance of Enterprises (SAFE), instead of using a country-specific one.
  • Usually IR is estimated using capability [2,3], communication [37], governance [38] and preparedness [39]. In this paper, we combine innovation, high growth, and equity openness in a model that is broader and more operational for policy/investment targeting.
  • We addressed the problem of class imbalance using class weights and appropriate methods (e.g., Easy Ensemble, Balanced Bagging).
  • We created and implemented a cost-sensitive metric (false negatives weighted five times the false positives in our model) to find the optimal model, reflecting economic consequences, instead of using just simple accuracy metrics. Consequently, Gradient Boosting is not just the optimal model statistically, but more importantly also economically.
These design choices translate into concrete, practical benefits:
  • Venture capital and private equity early screening often relies on subjective judgments and/or personal networks. Perceived weaknesses may lead to the early rejection of SMEs. The proposed ML model, in contrast, evaluates firms on data-driven criteria, reducing the risk of bias or limited information.
  • The proposed ML-based model (and ML approaches in general) can scan a large dataset like SAFE, identifying IR firms outside the investor’s circle and thus expanding the pool of potential opportunities.
  • The proposed ML-driven classification sorts SMEs by underlining those with higher predicted IR. Investors can then focus their in-depth analysis on a smaller set of IR-identified SMEs—cutting costs and improving efficiency.
  • Finally, IR in this paper is defined through concrete features like innovation, growth, and openness to equity. This means that the model can function not only as a filter to identify IR but also as a diagnostic guide, showing SMEs whose capabilities they can strengthen to improve their funding prospects.

5. Conclusions

The primary goal of this research was to apply advanced machine learning (ML) techniques to predict SME investment readiness, utilizing the large dataset collected by the Survey on Access to Finance of Enterprises (SAFE) conducted semiannually by the European Central Bank. After training and evaluating multiple ML algorithms, which included Gradient Boosting, Logistic Regression, Random Forest, and various ensemble methods, the Gradient Boosting model demonstrated the highest predictive performance, with a balanced accuracy of 75.4% and a ROC AUC score of 0.815. Logistic Regression was notably close in performance, indicating that simpler, interpretable models can also effectively predict investment readiness, providing practical benefits in scenarios requiring rapid deployment or interpretability.
Although explicit feature selection was not implemented in this study, our findings highlight several important factors influencing investment readiness. These factors include managerial confidence, innovation activities, openness to equity financing, and investor engagement. Clearly identifying these factors helps SMEs strategically prioritize improvements, thereby enhancing their likelihood of securing external financing.
Limitations of our analysis include the use of self-reported, cross-sectional SAFE survey data, which may introduce biases affecting the generalizability of the results. The dataset offers a static snapshot, preventing an analysis of changes in investment readiness over time. Model validation was limited due to a lack of independent data. There was a significant class imbalance (12% investment-ready vs. 88% not) that was addressed through appropriate metrics, weights, and imbalance-resistant algorithms. Finally, applying these ML tools in SME financing requires tackling integration, privacy compliance, interpretability, and stakeholder trust issues.
Despite these limitations, which are present in most empirical research, from an academic perspective this research adds valuable insights to the entrepreneurial finance literature by demonstrating the practical effectiveness of machine learning techniques in financial decision-making. However, the close performance between Gradient Boosting and Logistic Regression underscores the continued relevance of traditional econometric models, particularly in situations where interpretability and prompt results are crucial.
The economic implications are considerable. Independent investors, relevant governmental agencies, and financial institutions can leverage these predictive models to better assess SMEs, minimizing costly misinvestments (Type I errors). Consequently, such models have the potential to optimize the allocation of the already scarce private and government financial resources. Policymakers can utilize these insights to tailor more effective support programs for SMEs, addressing their specific financing needs. Ultimately, adopting these predictive approaches promises more efficient capital markets, enabling SMEs to contribute more effectively to economic growth and job creation.
An additional insight from the cost evaluation is that the selection of the best model can shift significantly once the economic implications of misclassifications are accounted for. When both costs (false positive and false negative) are adequately measured and assigned to the optimization of the predictive models, a model with higher balanced accuracy is not necessarily the one with the lowest overall cost. When the two costs are asymmetric, as was assumed in this paper, and the cost of a false negative is greater than the cost of a false positive, the precision as a model selection metric will be inferior to the recall (sensitivity). This is because the latter focuses on maximizing the identification of true positives, which is the same as minimizing false negatives. This analysis confirms that cost-sensitive evaluation complements and adjusts the overall findings, providing more practical and economically sensitive guidance for model selection in real-world scenarios.
With respect to the research question and hypothesis stated in the introduction, we find support that using machine learning models can be effective in identifying investment-ready SMEs despite data imbalance and noise. Through a feature importance analysis, we have identified a subset of variables that hold the most predictive power.
Finally, a valuable next step would be to analyze how the model performance varies across different cross-sectional factors such as the country of the firm, firm size, industry or sector, and EU membership status. These steps could reveal important factors and help to create predictive models specifically designed for certain cross-sections.

Author Contributions

Conceptualization, P.G. (Periklis Gogas) and T.P.; methodology, P.G. (Periklis Gogas) and T.P.; software, N.G. and A.K.; validation, P.G. (Periklis Gogas) and T.P.; formal analysis, P.G. (Periklis Gogas), T.P. and P.G. (Panagiotis Goumenidis); investigation, P.G. (Panagiotis Goumenidis), N.G. and A.K.; resources, P.G. (Panagiotis Goumenidis), N.G. and A.K.; data curation, N.G. and A.K.; writing—original draft preparation, P.G. (Panagiotis Goumenidis), N.G. and A.K.; writing—review and editing, P.G. (Periklis Gogas), T.P. and P.G. (Panagiotis Goumenidis); visualization, N.G. and A.K.; supervision, P.G. (Periklis Gogas), T.P. and P.G. (Panagiotis Goumenidis); project administration, P.G. (Periklis Gogas) and T.P.; funding acquisition, P.G. (Periklis Gogas) and T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research project was implemented under the framework of H.F.R.I. call “Basic Research Financing (Horizontal support of all Sciences)” under the National Recovery and Resilience Plan “Greece 2.0” funded by the European Union—Next-Generation EU (H.F.R.I. Project Number: 16856).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Variable description.
# | Variable Name | Question
1 | Autonomous | How would you characterise your enterprise?
2 | MainActivity | What is the main activity of your enterprise?
3 | Age | In which year was your enterprise first registered?
4 | Ownership | Who owns the largest stake in your enterprise?
5 | profitable | Firms that report, simultaneously, higher turnover and profits, lower or no interest expenses and lower or no debt-to-assets ratio.
6 | MostImportantProblemFacing | Which was the most important problem faced by your enterprise during the {previous quarter and current quarter} OR {current quarter}?
7 | ExternalFinancing_Factors#GeneralEconomicOutlookToObtainExternalFinancing | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? General economic outlook, insofar as it affects the availability of external financing
8 | ExternalFinancing_Factors#AccessToPublicFinancialSupportIncludingGuarantees | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Access to public financial support, including guarantees
9 | ExternalFinancing_Factors#YourFirmSpecificOutlook | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Your enterprise-specific outlook with respect to your sales and profitability or business plan
10 | ExternalFinancing_Factors#YourFirmOwnCapital | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Your enterprise’s own capital
11 | ExternalFinancing_Factors#YourFirmCreditHistory | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Your enterprise’s credit history
12 | ExternalFinancing_Factors#WillingnessOfBanksToProvideCredit | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Willingness of banks to provide credit to your enterprise
13 | ExternalFinancing_Factors#WillingnessOfBusinessPartnersToProvideTradeCredit | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Willingness of business partners to provide trade credit
14 | ExternalFinancing_Factors#WillingnessOfInvestorsToInvestInYourEnterprise | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Willingness of investors to invest in your enterprise
15 | FirmGrowth#InTermsOfEmploymentRegardingTheNumberOfFullTimeOrFullTimeEquivalentEmployees | Over the last three years (<year>-<year>), how much did your firm grow on average per year? (in terms of employment regarding the number of full-time or full-time equivalent employees)
16 | ConfidenceInNegotiations#WithBanks | Do you feel confident talking about financing with banks and that you will obtain the desired results? And how about with equity investors/venture capital enterprises? With banks
17 | ConfidenceInNegotiations#WithEquityInvestorsOrVentureCapitalFirms | Do you feel confident talking about financing with banks and that you will obtain the desired results? And how about with equity investors/venture capital enterprises? With equity investors/venture capital enterprises
18 | FinancingGrowth_Instruments#FinancingGrowth_Instruments | If you need external financing to realise your growth ambitions, what type of external financing would you prefer most?
19 | FinancingGrowth_Amount#FinancingGrowth_Amount | If you need external financing to realise your growth ambitions over the next two to three years, what amount of financing would you aim to obtain?
20 | FinancingGrowth_TheMostImportantLimitation | What do you see as the most important limiting factor to get this financing?
21 | ExternalFinancing_Expectations#InternalFounds | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Retained earnings or sale of assets/internal funds
22 | ExternalFinancing_Expectations#BankLoans | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Bank loans (excluding overdraft and credit lines)
23 | ExternalFinancing_Expectations#EquityInvestments | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Equity capital
24 | ExternalFinancing_Expectations#TradeCredit | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Trade credit
25 | ExternalFinancing_Expectations#DebtSecuritiesIssued | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Debt securities issued
26 | ExternalFinancing_Expectations#Other | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Other
27 | ExternalFinancing_Expectations#CreditLineBankOverdraftOrCreditCardsOverdraft | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Credit line, bank overdraft or credit cards overdraft
28ImportanceOfFactorsInTheFutureFinancingOfTheFirm#GuaranteesForLoansOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Guarantees for loans
29ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MeasuresToFacilitateEquityInvestmentsOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Measures to facilitate investments
30ImportanceOfFactorsInTheFutureFinancingOfTheFirm#ExportCreditsOrGuaranteesOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Export credits or guarantees
31ImportanceOfFactorsInTheFutureFinancingOfTheFirm#TaxIncentivesOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Tax incentives
32ImportanceOfFactorsInTheFutureFinancingOfTheFirm#BusinessSupportServicesOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Business support services
33ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MakingExistingPublicMeasuresEasierToObtainOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Making existing public measures easier to obtain
34FirmIncomeGenerationIndicators#LabourCostHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Labour costs (including social contributions)
35FirmIncomeGenerationIndicators#OtherCostHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Other costs (materials, energy, other)
36FirmIncomeGenerationIndicators#InterestExpensesHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Interest expenses
37FirmIncomeGenerationIndicators#ProfitHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Profit
38FirmIncomeGenerationIndicators#ProfitMarginHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Profit Margin
39FirmIncomeGenerationIndicators#DebtComparedToAssetsHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Debt compared to assets
40FinancingApplied#ApplicationOfExternalFinancing_BankLoansHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Bank loans
41FinancingApplied#ApplicationOfExternalFinancing_TradeCreditHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Trade credit
42FinancingApplied#ApplicationOfExternalFinancing_OtherExternalFinancingHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Other external financing
43FinancingApplied#ApplicationOfExternalFinancing_CreditLineHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Credit line
44vulnerableFirms that report, simultaneously, lower turnover, decreasing profits, higher interest expenses and higher or unchanged debt-to-assets ratio.
45permidId of SME
46waveSurvey wave
47idSurvey unique id
48CountryOfResidenceCountry of SME
49FirmSizeSize of Company
50DateDate of survey
51InvestmentReadyTarget variable
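The "vulnerable" indicator (variable 44 above) is a composite rule over four of the income-generation indicators. The sketch below illustrates that rule in plain Python; it is not the authors' code, and it assumes the survey's categorical labels ("Increased", "Decreased", "Remained unchanged") as inputs, with the function name chosen for illustration only.

```python
def is_vulnerable(turnover: str, profit: str,
                  interest_expenses: str, debt_to_assets: str) -> bool:
    """Flag a firm as vulnerable per variable 44: simultaneously lower
    turnover, decreasing profits, higher interest expenses, and a higher
    or unchanged debt-to-assets ratio (labels are the survey's categories)."""
    return (turnover == "Decreased"
            and profit == "Decreased"
            and interest_expenses == "Increased"
            and debt_to_assets in ("Increased", "Remained unchanged"))

# A firm failing any one condition is not flagged.
print(is_vulnerable("Decreased", "Decreased", "Increased", "Remained unchanged"))
print(is_vulnerable("Increased", "Decreased", "Increased", "Increased"))
```

Because all four conditions must hold at once, the flag is rare, which matches the 3.65% incidence reported for this variable in Table A2.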
Table A2. Variable statistics of training set.
# | Variable Name | Variable Type | Description/Values | Freq/Count | Percentage
1 | Autonomous | Categorical | an autonomous profit-oriented enterprise | 7117 | 81.35
  | | | part of a profit-oriented enterprise (e.g., subsidiary* or branch) not taking autonomous financial decisions | 1444 | 16.6
  | | | a subsidiary of another enterprise | 178 | 2.03
  | | | a branch of another enterprise | 8 | 0.09
  | | | [DK/NA] | 2 | 0.02
2 | MainActivity | Categorical | Industry | 3109 | 35.54
  | | | Services | 2630 | 30.06
  | | | Trade | 2164 | 24.73
  | | | Construction | 846 | 9.67
3 | Age | Categorical | 10 years or more | 6799 | 77.71
  | | | 5 years or more but less than 10 years | 1162 | 13.28
  | | | 2 years or more but less than 5 years | 481 | 5.5
  | | | [DK/NA] | 203 | 2.32
  | | | Less than 2 years | 104 | 1.19
4 | Ownership | Categorical | Family or entrepreneurs | 4387 | 50.14
  | | | A natural person, one owner only | 1809 | 20.68
  | | | Other firms or business associates | 1438 | 16.44
  | | | Public shareholders, as your company is listed on the stock market | 601 | 6.87
  | | | Other | 312 | 3.57
  | | | Venture capital firms or business angels | 160 | 1.82
  | | | DK/NA | 42 | 0.48
5 | profitable | Binary | 0 | 8148 | 93.13
  | | | 1 | 601 | 6.87
6 | MostImportantProblemFacing | Categorical | Finding customers | 1822 | 20.83
  | | | Competition | 1360 | 15.54
  | | | Availability of skilled staff or experienced managers | 1352 | 15.45
  | | | Access to finance | 1237 | 14.14
  | | | Costs of production of labour | 1039 | 11.88
  | | | Regulation and industrial regulations | 793 | 9.06
  | | | Other | 726 | 8.3
  | | | All problems are equally pressing | 300 | 3.43
  | | | DK/NA | 120 | 1.37
7 | ExternalFinancing_Factors#GeneralEconomicOutlookToObtainExternalFinancing | Categorical | Remained unchanged | 3801 | 43.45
  | | | Deteriorated | 2560 | 29.26
  | | | Improved | 1959 | 22.39
  | | | DK | 429 | 4.9
8 | ExternalFinancing_Factors#AccessToPublicFinancialSupportIncludingGuarantees | Categorical | Remained unchanged | 3243 | 37.07
  | | | Not applicable | 2993 | 34.21
  | | | Deteriorated | 1407 | 16.08
  | | | Improved | 610 | 6.97
  | | | DK | 496 | 5.67
9 | ExternalFinancing_Factors#YourFirmSpecificOutlook | Categorical | Remained unchanged | 3923 | 44.84
  | | | Improved | 3169 | 36.22
  | | | Deteriorated | 1263 | 14.44
  | | | DK | 394 | 4.5
10 | ExternalFinancing_Factors#YourFirmOwnCapital | Categorical | Remained unchanged | 4278 | 48.9
  | | | Improved | 3436 | 39.27
  | | | Deteriorated | 932 | 10.65
  | | | DK | 103 | 1.18
11 | ExternalFinancing_Factors#YourFirmCreditHistory | Categorical | Remained unchanged | 4860 | 55.55
  | | | Improved | 2796 | 31.96
  | | | Deteriorated | 713 | 8.15
  | | | DK | 372 | 4.25
  | | | Not applicable | 8 | 0.09
12 | ExternalFinancing_Factors#WillingnessOfBanksToProvideCredit | Categorical | Remained unchanged | 3415 | 39.03
  | | | Improved | 2150 | 24.58
  | | | Deteriorated | 1579 | 18.05
  | | | Not applicable | 1259 | 14.39
  | | | DK | 346 | 3.95
13 | ExternalFinancing_Factors#WillingnessOfBusinessPartnersToProvideTradeCredit | Categorical | Remained unchanged | 3611 | 41.27
  | | | Not applicable | 2615 | 29.89
  | | | Improved | 1178 | 13.46
  | | | Deteriorated | 942 | 10.77
  | | | DK | 403 | 4.61
14 | ExternalFinancing_Factors#WillingnessOfInvestorsToInvestInYourEnterprise | Categorical | Not applicable | 5782 | 66.09
  | | | Remained unchanged | 1802 | 20.6
  | | | Improved | 478 | 5.46
  | | | DK | 377 | 4.31
  | | | Deteriorated | 310 | 3.54
15 | FirmGrowth#InTermsOfEmploymentRegardingTheNumberOfFullTimeOrFullTimeEquivalentEmployees | Categorical | Less than 20% per year | 3620 | 41.37
  | | | No growth | 2201 | 25.16
  | | | Over 20% per year | 1412 | 16.14
  | | | Got smaller | 1400 | 16
  | | | DK/NA | 61 | 0.7
  | | | Not applicable, the firm is too recent | 55 | 0.63
16 | ConfidenceInNegotiations#WithBanks | Categorical | Yes | 6727 | 76.89
  | | | No | 1293 | 14.78
  | | | Not applicable | 618 | 7.06
  | | | DK | 111 | 1.27
17 | ConfidenceInNegotiations#WithEquityInvestorsOrVentureCapitalFirms | Categorical | Not applicable | 4254 | 48.62
  | | | Yes | 2442 | 27.91
  | | | No | 1758 | 20.09
  | | | DK | 295 | 3.38
18 | FinancingGrowth_Instruments#FinancingGrowth_Instruments | Categorical | Bank loan | 5627 | 64.32
  | | | Loan from other sources | 1156 | 13.21
  | | | Equity investment | 780 | 8.92
  | | | Other | 450 | 5.14
  | | | DK/NA | 439 | 5.02
  | | | Subordinated loans, participation loans or similar financing instruments | 297 | 3.39
19 | FinancingGrowth_Amount#FinancingGrowth_Amount | Categorical | Over €1 million | 1632 | 18.65
  | | | DK/NA | 1620 | 18.52
  | | | More than €25,000 and up to €100,000 | 1456 | 16.64
  | | | €100,000–€1 million | 1443 | 16.49
  | | | More than €250,000 and up to €1 million | 1163 | 13.29
  | | | More than €100,000 and up to €250,000 | 867 | 9.91
  | | | Up to €25,000 | 568 | 6.50
20 | FinancingGrowth_TheMostImportantLimitation | Categorical | There are no obstacles | 2646 | 30.23
  | | | Interest rates or price too high | 1474 | 16.85
  | | | Insufficient collateral or guarantee | 1321 | 15.1
  | | | Other | 824 | 9.42
  | | | Financing not available at all | 549 | 6.28
  | | | DK/NA | 1558 | 17.81
  | | | Reduced control over the firm | 305 | 3.49
  | | | Too much paper work | 72 | 0.82
21 | ExternalFinancing_Expectations#InternalFounds | Categorical | Will remain unchanged | 4459 | 50.97
  | | | Will improve | 2405 | 27.49
  | | | Not applicable | 1170 | 13.37
  | | | Will deteriorate | 451 | 5.15
  | | | DK | 264 | 3.02
22 | ExternalFinancing_Expectations#BankLoans | Categorical | Will remain unchanged | 4498 | 51.41
  | | | Will improve | 1793 | 20.49
  | | | Not applicable | 1480 | 16.92
  | | | Will deteriorate | 796 | 9.1
  | | | DK | 182 | 2.08
23 | ExternalFinancing_Expectations#EquityInvestments | Categorical | Not applicable | 4292 | 49.05
  | | | Will remain unchanged | 3027 | 34.6
  | | | Will improve | 979 | 11.19
  | | | Will deteriorate | 180 | 2.06
  | | | DK | 271 | 3.1
24 | ExternalFinancing_Expectations#TradeCredit | Categorical | Will remain unchanged | 4157 | 47.51
  | | | Not applicable | 2670 | 30.52
  | | | Will improve | 1205 | 13.77
  | | | Will deteriorate | 523 | 5.98
  | | | DK | 194 | 2.22
25 | ExternalFinancing_Expectations#DebtSecuritiesIssued | Categorical | Not applicable | 5605 | 64.06
  | | | Will remain unchanged | 1959 | 22.39
  | | | DK | 839 | 9.59
  | | | Will improve | 205 | 2.34
  | | | Will deteriorate | 141 | 1.62
26 | ExternalFinancing_Expectations#Other | Categorical | Not applicable | 3396 | 38.82
  | | | Will remain unchanged | 3168 | 36.3
  | | | DK | 1167 | 13.33
  | | | Will improve | 787 | 9
  | | | Will deteriorate | 231 | 2.64
27 | ExternalFinancing_Expectations#CreditLineBankOverdraftOrCreditCardsOverdraft | Categorical | Will remain unchanged | 4739 | 54.17
  | | | Not applicable | 1700 | 19.43
  | | | Will improve | 1374 | 15.7
  | | | Will deteriorate | 570 | 6.52
  | | | DK | 366 | 4.18
28 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#GuaranteesForLoans | Ordinal | 8 | 1556 | 17.78
  | | | 5 | 1467 | 16.77
  | | | 1 | 1283 | 14.66
  | | | 10 | 1232 | 14.08
  | | | 7 | 1011 | 11.56
  | | | 6 | 574 | 6.56
  | | | 9 | 475 | 5.43
  | | | 3 | 470 | 5.37
  | | | 2 | 371 | 4.24
  | | | 4 | 310 | 3.55
29 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MeasuresToFacilitateEquityInvestments | Ordinal | 1 | 2662 | 30.42
  | | | 5 | 1475 | 16.86
  | | | 8 | 867 | 9.91
  | | | 7 | 715 | 8.17
  | | | 2 | 705 | 8.06
  | | | 3 | 602 | 6.88
  | | | 10 | 551 | 6.3
  | | | 6 | 512 | 5.85
  | | | 4 | 418 | 4.78
  | | | 9 | 242 | 2.77
30 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#ExportCreditsOrGuarantees | Ordinal | 1 | 3216 | 36.76
  | | | 5 | 1175 | 13.43
  | | | 8 | 782 | 8.94
  | | | 2 | 660 | 7.54
  | | | 7 | 644 | 7.36
  | | | 10 | 610 | 6.97
  | | | 3 | 531 | 6.07
  | | | 6 | 476 | 5.44
  | | | 4 | 363 | 4.15
  | | | 9 | 292 | 3.34
31 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#TaxIncentives | Ordinal | 10 | 1886 | 21.56
  | | | 8 | 1491 | 17.04
  | | | 5 | 1310 | 14.97
  | | | 7 | 1023 | 11.69
  | | | 1 | 816 | 9.33
  | | | 9 | 632 | 7.22
  | | | 6 | 615 | 7.03
  | | | 3 | 380 | 4.34
  | | | 4 | 340 | 3.89
  | | | 2 | 256 | 2.93
32 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#BusinessSupportServices | Ordinal | 5 | 1640 | 18.74
  | | | 8 | 1193 | 13.64
  | | | 7 | 1088 | 12.44
  | | | 1 | 983 | 11.24
  | | | 6 | 856 | 9.78
  | | | 10 | 842 | 9.62
  | | | 3 | 649 | 7.42
  | | | 4 | 570 | 6.52
  | | | 2 | 493 | 5.63
  | | | 9 | 435 | 4.97
33 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MakingExistingPublicMeasuresEasierToObtain | Ordinal | 10 | 1670 | 19.09
  | | | 8 | 1575 | 18
  | | | 5 | 1319 | 15.08
  | | | 7 | 1148 | 13.12
  | | | 6 | 772 | 8.82
  | | | 1 | 675 | 7.72
  | | | 9 | 636 | 7.27
  | | | 3 | 349 | 3.99
  | | | 4 | 338 | 3.86
  | | | 2 | 267 | 3.05
34 | FirmIncomeGenerationIndicators#LabourCost | Categorical | Increased | 5006 | 57.21
  | | | Remained unchanged | 3041 | 34.76
  | | | Decreased | 677 | 7.74
  | | | DK/NA | 25 | 0.29
35 | FirmIncomeGenerationIndicators#OtherCost | Categorical | Increased | 5535 | 63.26
  | | | Remained unchanged | 2594 | 29.65
  | | | Decreased | 585 | 6.69
  | | | DK/NA | 35 | 0.4
36 | FirmIncomeGenerationIndicators#InterestExpenses | Categorical | Remained unchanged | 4177 | 47.74
  | | | Increased | 2559 | 29.25
  | | | Decreased | 1358 | 15.52
  | | | DK/NA | 608 | 6.95
  | | | Not applicable, the firm has no debt | 47 | 0.54
37 | FirmIncomeGenerationIndicators#Profit | Categorical | Increased | 3688 | 42.15
  | | | Decreased | 2513 | 28.72
  | | | Remained unchanged | 2411 | 27.56
  | | | DK/NA | 137 | 1.57
38 | FirmIncomeGenerationIndicators#ProfitMargin | Categorical | Remained unchanged | 2945 | 33.66
  | | | Decreased | 2526 | 28.87
  | | | Increased | 1957 | 22.37
  | | | DK/NA | 1321 | 15.1
39 | FirmIncomeGenerationIndicators#DebtComparedToAssets | Categorical | Remained unchanged | 3471 | 39.67
  | | | Decreased | 2754 | 31.48
  | | | Increased | 1704 | 19.48
  | | | Not applicable, the firm has no debt | 743 | 8.49
  | | | DK | 77 | 0.88
40 | FinancingApplied#ApplicationOfExternalFinancing_BankLoans | Categorical | Did not apply because of sufficient internal funds | 4234 | 48.39
  | | | Applied | 2350 | 26.86
  | | | Did not apply for other reasons | 1630 | 18.63
  | | | Did not apply because of possible rejection | 334 | 3.82
  | | | DK/NA | 201 | 2.3
41 | FinancingApplied#ApplicationOfExternalFinancing_TradeCredit | Categorical | Did not apply because of sufficient internal funds | 3910 | 44.69
  | | | Did not apply for other reasons | 2365 | 27.04
  | | | Applied | 1939 | 22.16
  | | | DK/NA | 343 | 3.92
  | | | Did not apply because of possible rejection | 192 | 2.19
42 | FinancingApplied#ApplicationOfExternalFinancing_OtherExternalFinancing | Categorical | Did not apply because of sufficient internal funds | 4272 | 48.83
  | | | Did not apply for other reasons | 2527 | 28.88
  | | | Applied | 1373 | 15.69
  | | | DK/NA | 370 | 4.23
  | | | Did not apply because of possible rejection | 207 | 2.37
43 | FinancingApplied#ApplicationOfExternalFinancing_CreditLine | Categorical | Did not apply because of sufficient internal funds | 4259 | 48.68
  | | | Applied | 2061 | 23.56
  | | | Did not apply for other reasons | 1689 | 19.31
  | | | Did not apply because of possible rejection | 318 | 3.63
  | | | DK/NA | 422 | 4.82
44 | vulnerable | Binary | 0 | 8430 | 96.35
  | | | 1 | 319 | 3.65
Table A3. Balanced accuracy and ROC AUC score for all models.
Model | Balanced Accuracy | ROC AUC
Gradient Boosting | 0.754 | 0.815
Logistic Regression | 0.750 | 0.811
Easy Ensemble Classifier | 0.746 | 0.804
Balanced Random Forest Classifier | 0.738 | 0.810
Random Forest | 0.733 | 0.807
SVC | 0.732 | 0.793
Balanced SVC | 0.729 | 0.800
AdaBoost | 0.718 | 0.794
MultinomialNB | 0.711 | 0.790
Balanced MultinomialNB | 0.709 | 0.789
Balanced KNeighbors | 0.678 | 0.743
KNeighbors | 0.552 | 0.640
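Balanced accuracy, the headline metric of Table A3, is the unweighted mean of the per-class recalls, which is why it is informative under the heavy class imbalance of this dataset. A minimal illustration (not the authors' code) that recovers the Gradient Boosting figure of 0.754 from the per-class recalls reported in Table A4:

```python
def balanced_accuracy(recall_class0: float, recall_class1: float) -> float:
    """Unweighted mean of the two per-class recalls (sensitivity and
    specificity in the binary case); insensitive to class imbalance."""
    return (recall_class0 + recall_class1) / 2

# Gradient Boosting recalls from Table A4: 0.794 (class 0) and 0.714 (class 1)
print(round(balanced_accuracy(0.794, 0.714), 3))  # → 0.754, as in Table A3
```

Note that plain accuracy would be dominated by the majority class (non-investment-ready firms, roughly 93% of observations), so a trivial all-negative classifier would score highly on it while scoring only 0.5 on balanced accuracy.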
Table A4. Precision, recall, and F1-score for all models.
Model | Precision (Class 0) | Recall (Class 0) | F1 (Class 0) | Precision (Class 1) | Recall (Class 1) | F1 (Class 1)
Gradient Boosting | 0.931 | 0.794 | 0.857 | 0.417 | 0.714 | 0.526
Logistic Regression | 0.931 | 0.779 | 0.848 | 0.402 | 0.722 | 0.517
Easy Ensemble Classifier | 0.930 | 0.775 | 0.845 | 0.396 | 0.717 | 0.510
Balanced Random Forest Classifier | 0.928 | 0.764 | 0.838 | 0.383 | 0.711 | 0.498
Random Forest | 0.917 | 0.830 | 0.872 | 0.436 | 0.636 | 0.517
SVC | 0.920 | 0.803 | 0.858 | 0.409 | 0.660 | 0.505
Balanced SVC | 0.925 | 0.754 | 0.831 | 0.371 | 0.703 | 0.486
AdaBoost | 0.918 | 0.768 | 0.836 | 0.373 | 0.668 | 0.478
MultinomialNB | 0.918 | 0.745 | 0.823 | 0.354 | 0.676 | 0.465
Balanced MultinomialNB | 0.917 | 0.746 | 0.823 | 0.353 | 0.671 | 0.463
Balanced KNeighbors | 0.900 | 0.770 | 0.830 | 0.344 | 0.586 | 0.433
KNeighbors | 0.845 | 0.962 | 0.899 | 0.434 | 0.142 | 0.214
Figure A1. Gradient Boosting (left) and Logistic Regression (right) ROC curves.
Figure A2. Easy Ensemble (left) and Balanced Random Forest (right) ROC curves.
Figure A3. Random Forest (left) and SVC (right) ROC curves.
Figure A4. Balanced SVC (left) and AdaBoost (right) ROC curves.
Figure A5. Multinomial NB (left) and balanced multinomial NB (right) ROC curves.
Figure A6. K-Neighbors (left) and Balanced K-Neighbors (right) ROC curves.
Figure A7. Gradient Boosting confusion matrices.
Figure A8. Logistic Regression model confusion matrices.
Figure A9. Easy Ensemble Classifier model confusion matrices.
Figure A10. Balanced Random Forest model confusion matrices.
Figure A11. Random Forest model confusion matrices.
Figure A12. SVC model confusion matrices.
Figure A13. Balanced Bagging SVC model confusion matrices.
Figure A14. AdaBoost model confusion matrices.
Figure A15. MultinomialNB model confusion matrices.
Figure A16. Balanced Bagging multinomial NB model confusion matrices.

References

  1. World Bank. Small and Medium Enterprises (SMEs) Finance—Improving SMEs’ Access to Finance and Finding Innovative Solutions to Unlock Sources of Capital. 2023. Available online: https://www.worldbank.org (accessed on 3 September 2025).
  2. Mason, C.M.; Harrison, R.T. “Investment readiness”: A critique of government proposals to increase the demand for venture capital. Reg. Stud. 2001, 35, 663–668. [Google Scholar] [CrossRef]
  3. Douglas, E.J.; Shepherd, D.A. Exploring investor readiness: Assessments by entrepreneurs and investors in Australia. Ventur. Cap. 2002, 4, 219–236. [Google Scholar] [CrossRef]
  4. Fellnhofer, K. Literature review: Investment readiness level of small and medium-sized companies. Int. J. Manag. Financ. Account. 2015, 7, 268–284. [Google Scholar] [CrossRef]
  5. Mullainathan, S.; Spiess, J. Machine learning: An applied econometric approach. J. Econ. Perspect. 2017, 31, 87–106. [Google Scholar] [CrossRef]
  6. Dumitrescu, E.; Hué, S.; Hurlin, C.; Tokpavi, S. Machine learning or econometrics for credit scoring: Let’s get the best of both worlds. HAL Open Sci. 2021. Available online: https://hal.archives-ouvertes.fr/hal-02507499 (accessed on 3 September 2025).
  7. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  8. Levy, J.; Mussack, D.; Brunner, M.; Keller, U.; Cardoso-Leite, P.; Fischbach, A. Contrasting classical and machine learning approaches in the estimation of value-added scores in large-scale educational data. Front. Psychol. 2020, 11, 2190. [Google Scholar]
  9. Agresti, A. Foundations of Linear and Generalized Linear Models, 1st ed.; Wiley: Hoboken, NJ, USA, 2015; pp. 1–480. [Google Scholar]
  10. Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; Wiley: Hoboken, NJ, USA, 2013; pp. 1–528. [Google Scholar]
  11. Churpek, M.M.; Yuen, T.C.; Winslow, C.; Meltzer, D.O.; Kattan, M.W.; Edelson, D.P. Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards. Crit. Care Med. 2016, 44, 368–374. [Google Scholar] [CrossRef]
  12. Hornung, O.; Smolnik, S. AI invading the workplace: Negative emotions towards the organizational use of personal virtual assistants. Electron. Markets 2022, 32, 123–138. [Google Scholar] [CrossRef]
  13. Groves, R.M.; Fowler, F.J., Jr.; Couper, M.P.; Lepkowski, J.M.; Singer, E.; Tourangeau, R. Survey Methodology, 2nd ed.; Wiley: Hoboken, NJ, USA, 2009. [Google Scholar]
  14. European Central Bank Survey on the Access to Finance of Enterprises (SAFE). 2023. Available online: https://www.ecb.europa.eu/stats/ecb_surveys/safe/html/data.en.html (accessed on 3 September 2025).
  15. Cusolito, A.P.; Dautovic, E.; McKenzie, D. Can government intervention make firms more investment-ready? A randomized experiment in the Western Balkans. Rev. Econ. Stat. 2021, 103, 428–442. [Google Scholar] [CrossRef]
  16. Owen, R.; Botelho, T.; Hussain, J.; Anwar, O. Solving the SME finance puzzle: An examination of demand and supply failure in the UK. Ventur. Cap. 2023, 25, 31–63. [Google Scholar] [CrossRef]
  17. Du, J.; Rada, R. Machine learning and financial investing. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010; pp. 375–398. [Google Scholar] [CrossRef]
  18. Khan, A.H.; Shah, A.; Ali, A.; Shahid, R.; Zahid, Z.U.; Sharif, M.U.; Jan, T.; Zafar, M.H. A performance comparison of machine learning models for stock market prediction with novel investment strategy. PLoS ONE 2023, 18, e0286362. [Google Scholar] [CrossRef] [PubMed]
  19. Alexakis, C.; Gogas, P.; Petrella, G.; Polemis, M.; Salvadè, F. Investigating the Investment Readiness of European SMEs: A Machine Learning Approach. 2025. Available online: https://ssrn.com/abstract=5007133 (accessed on 3 September 2025).
  20. Zana, D.; Barnard, B. Venture capital and entrepreneurship: The cost and resolution of investment readiness. SSRN Electron. J. 2019. [Google Scholar] [CrossRef]
  21. Wooldridge, J.M. Introductory Econometrics: A Modern Approach, 5th ed.; South-Western Cengage Learning: Mason, OH, USA, 2013. [Google Scholar]
  22. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed.; Pearson Series; Pearson Education, Inc.: Upper Saddle River, NJ, USA, 2020. [Google Scholar]
  23. Cox, D.R. The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B (Methodol.) 1958, 20, 215–232. [Google Scholar] [CrossRef]
  24. Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
  25. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar]
  26. Schütze, H.; Manning, C.D.; Raghavan, P. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; pp. 234–265. [Google Scholar]
  27. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar]
  28. Liu, X.; Wu, J.; Zhou, Z. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2009, 39, 539–550. [Google Scholar]
  29. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar]
  30. Gogas, P.; Papadimitriou, T.; Sofianos, E. Money Neutrality, Monetary Aggregates and Machine Learning. Algorithms 2019, 12, 137. [Google Scholar] [CrossRef]
  31. Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The balanced accuracy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition 2010, Istanbul, Turkey, 23–26 August 2010; pp. 3121–3124. [Google Scholar]
  32. Hair, J.; Anderson, R.; Tatham, R.; Black, W. Multivariate Data Analysis, 5th ed.; Prentice Hall: Upper Saddle River, NJ, USA, 1998. [Google Scholar]
  33. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141. [Google Scholar] [CrossRef]
  34. Pes, B.; Lai, G. Cost-sensitive learning strategies for high-dimensional and imbalanced data: A comparative study. PeerJ Comput. Sci. 2021, 7, e832. [Google Scholar] [CrossRef] [PubMed]
  35. Araf, I.; Idri, A.; Chairi, I. Cost-sensitive learning for imbalanced medical data: A review. Artif. Intell. Rev. 2024, 57, 80. [Google Scholar] [CrossRef]
  36. Peykani, P.; Peymany Foroushany, M.; Tanasescu, C.; Sargolzaei, M.; Kamyabfar, H. Evaluation of Cost-Sensitive Learning Models in Forecasting Business Failure of Capital Market Firms. Mathematics 2025, 13, 368. [Google Scholar] [CrossRef]
  37. Mason, C.M.; Kwok, J. Investment Readiness Programmes and Access to Finance: A Critical Review of Design Issues; Working Paper 10-03; Hunter Centre for Entrepreneurship, University of Strathclyde: Scotland, UK, 2010. [Google Scholar]
  38. Brush, C.G.; Edelman, L.F.; Manolova, T.S. Ready for funding? Entrepreneurial ventures and the pursuit of angel financing. Ventur. Cap. 2012, 14, 111–129. [Google Scholar] [CrossRef]
  39. Landström, H. The ivory tower of business angel research. Ventur. Cap. 2019, 21, 97–119. [Google Scholar] [CrossRef]
Figure 1. A graphical representation of a 3-fold cross-validation process. For every set of hyperparameters tested, each fold serves as a test sample, while the remaining folds are used to train the model. The average performance for each set of parameters over the k test folds is used to identify the best model [30].
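The split-and-rotate logic described in Figure 1 can be sketched in a few lines. The snippet below is an illustrative pure-Python index splitter, not the authors' implementation (in practice this is typically delegated to a library routine such as scikit-learn's KFold or GridSearchCV):

```python
def kfold_indices(n_samples: int, k: int):
    """Yield (train_idx, test_idx) pairs: each of the k folds serves as
    the test sample exactly once, and the other k-1 folds form the
    training sample, mirroring the rotation shown in Figure 1."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With 9 observations and k = 3, we get 3 splits of 6 train / 3 test indices.
splits = list(kfold_indices(9, 3))
print(len(splits))  # → 3
```

For hyperparameter selection, the model's average score across the k test folds is computed for every candidate parameter set, and the best-scoring set is retained, exactly as the caption describes.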
Figure 2. Confusion matrix.
Figure 3. ROC AUC.
Figure 4. Balanced accuracy and ROC AUC.
Figure 5. Class 0 metrics.
Figure 6. Class 1 metrics.
Figure 7. Gradient Boosting ROC curve.
Figure 8. Gradient Boosting confusion matrix.
Figure 9. Gradient Boosting confusion matrix—normalized.
Figure 10. Gradient Boosting variable importance plot.
Figure 11. SHAP summary for the Gradient Boosting model.
Figure 12. The Cost-Optimized Gradient Boosting normalized confusion matrix.
Figure 13. The Cost-Optimized Gradient Boosting confusion matrix.
Table 1. Comparison of models based on False Positives (FP), False Negatives (FN), and Total Misclassification Cost (TMC).
Model | FP | FN | TMC
Gradient Boosting | 438 | 94 | 908
Logistic Regression | 379 | 107 | 914
Easy Ensemble Classifier | 409 | 103 | 924
Random Forest | 441 | 99 | 936
AdaBoost | 386 | 111 | 941
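Total Misclassification Cost is a weighted sum of the two error counts, with false negatives (missed investment-ready SMEs) penalized more heavily than false positives. The exact cost weights are not restated in this table, but every row is reproduced by charging 1 unit per false positive and 5 units per false negative; treat that 1:5 ratio as an inference from the table, illustrated in the sketch below (not the authors' code):

```python
def total_misclassification_cost(fp: int, fn: int,
                                 cost_fp: int = 1, cost_fn: int = 5) -> int:
    """TMC = cost_fp * FP + cost_fn * FN. The default 1:5 weighting is
    inferred from Table 1, where it reproduces all five reported values."""
    return cost_fp * fp + cost_fn * fn

print(total_misclassification_cost(438, 94))   # Gradient Boosting → 908
print(total_misclassification_cost(379, 107))  # Logistic Regression → 914
```

Under this cost structure, Gradient Boosting attains the lowest TMC despite having more false positives than Logistic Regression, because its 13 fewer false negatives more than offset them at the 5:1 penalty.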

Share and Cite

MDPI and ACS Style

Gogas, P.; Papadimitriou, T.; Goumenidis, P.; Kontos, A.; Giannakis, N. Identification of Investment-Ready SMEs: A Machine Learning Framework to Enhance Equity Access and Economic Growth. Forecasting 2025, 7, 51. https://doi.org/10.3390/forecast7030051
