Article

Identification of Investment-Ready SMEs: A Machine Learning Framework to Enhance Equity Access and Economic Growth

by Periklis Gogas *, Theophilos Papadimitriou, Panagiotis Goumenidis, Andreas Kontos and Nikolaos Giannakis
Department of Economics, Democritus University of Thrace, 69100 Komotini, Greece
* Author to whom correspondence should be addressed.
Forecasting 2025, 7(3), 51; https://doi.org/10.3390/forecast7030051
Submission received: 3 July 2025 / Revised: 3 September 2025 / Accepted: 9 September 2025 / Published: 16 September 2025
(This article belongs to the Section Forecasting in Economics and Management)

Abstract

Small and medium-sized enterprises (SMEs) are critical contributors to economic growth, innovation, and employment. However, they often struggle to secure external financing. This financing gap mainly arises from perceived risks and information asymmetries that create barriers between SMEs and potential investors. To address this issue, our study proposes a machine learning (ML) framework for predicting the investment readiness (IR) of SMEs. All the models in this study are trained on data provided by the European Central Bank’s Survey on Access to Finance of Enterprises (SAFE). We train, evaluate, and compare the predictive performance of nine (9) machine learning algorithms and various ensemble methods. The results provide evidence of the ability of ML algorithms to identify investment-ready SMEs in a heavily imbalanced and noisy dataset. In particular, the Gradient Boosting algorithm achieves a balanced accuracy of 75.4% and the highest ROC AUC score at 0.815. Employing an economically relevant cost function further enhances these results. The approach can offer specific inference to policymakers seeking to design targeted interventions and can provide investors with data-driven methods for identifying promising SMEs.

1. Introduction and Literature Review

Small and medium-sized enterprises are defined as enterprises with fewer than 250 employees, an annual turnover of less than 50 million EUR, and/or an annual balance sheet total of less than 43 million EUR (Commission of the European Communities, 2003/361/EC). They play a crucial role in economic growth, employment, and innovation. They account for approximately 90% of all businesses globally, providing over half of total employment and contributing significantly to GDP across various economies [1]. Despite their importance, SMEs frequently struggle to secure sufficient external financing, limiting their potential for growth and innovation. This issue is particularly significant in emerging markets, where about 40% of formal SMEs report unmet financing needs [1]. The financing gap faced by SMEs is often due to information asymmetries and the perception of higher risk, which cause lenders and investors to be cautious when considering investments in smaller firms. As traditional financial institutions generally rely on established credit histories and collateral requirements that many SMEs cannot meet, these enterprises often resort to internal funding or informal finance sources [2,3]. This financing barrier highlights the necessity of improved mechanisms to identify investment-ready SMEs and connect them with suitable investors.
Investment readiness—the ability of an SME to attract and secure external funding—has emerged as an important concept to bridge the financing gap. A firm that is “investment-ready” typically demonstrates a sound and viable business model, credible financial reporting, clear growth potential, and an effective management team capable of communicating these strengths to potential investors [3]. Yet, previous research consistently finds a mismatch between SMEs’ self-assessments of readiness and investors’ evaluations, resulting in many firms being overlooked despite having solid business foundations [4].
Traditional approaches for evaluating SME investment readiness, such as financial ratios, scoring systems, and qualitative assessments, have significant limitations. These methods often overlook complex, non-linear relationships within the data and rely heavily on past financial performance, potentially disregarding other vital aspects like managerial capability or innovation [5,6]. In response, machine learning (ML) techniques have begun to gain prominence due to their ability to analyze large, multidimensional datasets and identify complex, predictive patterns that traditional econometric methods might miss [6].
The key benefit of using machine learning for this problem, as opposed to traditional econometric methods, is that ML methods are well suited to the specific task of prediction in a data-rich and complex environment [5,7]. Traditional models, such as the Logistic Regression model used here as a benchmark, are often designed for statistical analysis and data interpretation [8]. These models assume linearity in the data-generating process [9,10]. Machine learning techniques have been developed for the purpose of improving the accuracy of predictive models outside the training dataset [6,11], and they are highly effective at automatically detecting complex, non-linear patterns and interactions within large and complex datasets without the need for prior knowledge [12]. Considering the large amount of data in the SAFE survey, the potential for noise in the responses, and the complex interactions of the variables that determine investment readiness, an ML approach provides a powerful, data-driven solution for creating a robust predictive screening tool [13,14]. Given these challenges and opportunities, this paper applies advanced ML methods to assess the investment readiness of SMEs, using the European Central Bank’s (ECB) Survey on Access to Finance of Enterprises (SAFE) dataset. This large-scale, cross-national dataset offers extensive qualitative and quantitative information on SMEs’ financial conditions, market positions, innovation activities, and management capabilities—factors critical for assessing investment readiness.

1.1. Investment Readiness and SME Financing

Investment readiness is a multidimensional concept reflecting an enterprise’s capacity to understand and satisfy investor expectations as captured through business planning, financial transparency, growth prospects, and managerial competence. Mason and Harrison (2001) [2] emphasized that simply increasing the availability of venture capital without ensuring SMEs are investment-ready is insufficient to address funding gaps. Many entrepreneurs are often perceived by investors as “not ready” due to weak financial documentation, unrealistic growth expectations, or an aversion to equity dilution. Douglas and Shepherd (2002) [3], similarly, found discrepancies in perceptions of investment readiness between entrepreneurs and investors, highlighting the importance of clear communication and alignment of expectations to improve SMEs’ chances of securing financing.
Investment-readiness programs have been developed globally to bridge these gaps by providing targeted training and support to SMEs, improving their business plans, and enhancing their financial transparency and overall investability [15]. However, despite their importance, such programs typically yield only modest improvements and may not fully resolve systemic issues such as informational asymmetries and entrenched investor biases [16].
Recent studies have shown that machine learning (ML) methods significantly outperform traditional econometric models in forecasting complex financial market behavior. ML models, unlike traditional approaches, can handle vast and multidimensional datasets, uncovering and capturing complex and intricate relationships and interactions between variables [17,18]. For instance, Dumitrescu et al. (2021) [6] demonstrated how hybrid ML–econometric models significantly improve credit scoring by combining predictive accuracy with interpretability.
Despite the promising potential of ML approaches, their application to the specific problem of predicting SME investment readiness remains limited. Most studies focus on broader financial forecasting contexts or specific applications such as credit risk or stock market prediction, leaving a notable research gap regarding the application of ML to predicting SME investment readiness. Alexakis et al. (2025) [19] use machine learning and the SAFE questionnaire, as in this study, to forecast investment-ready SMEs. They use (a) a slightly different definition of investment readiness that does not emphasize “openness” as a crucial factor, (b) a larger number of initial observations but only 23 independent variables, as opposed to the 59 we use in this study, and (c) a more recent dataset that spans eight years instead of six. They test three hypotheses on the determinants of investment readiness (entrepreneurial ecosystem, financial risk/cost, firm structure) and focus on country-level analysis, examining cultural variation in their predictors across 19 EU countries. In this study, we use a richer set of relevant independent variables and a different investment-readiness definition, and we focus on models that specifically address the problem of class imbalance, in order to create a universal EU model able to detect investment readiness irrespective of the individual country.
Additionally, while investment-readiness programs have been implemented globally to support SMEs, their effectiveness in addressing structural financing gaps remains minimal, suggesting the need for more sophisticated, data-driven assessment tools [15,16].
From a venture capital perspective, investment-readiness screening is essential due to the high-risk nature of early-stage financing, yet enhancing screening efficiency to identify promising SMEs remains a challenge. Venture capitalists often reject many SMEs at early screening stages due to insufficient preparedness in terms of management capabilities, realistic financial projections, and business viability [20,21]. While human judgment remains essential, integrating machine learning, which is independent of related human biases and misjudgments, into the venture screening process could significantly streamline identification efforts, reduce biases, enhance deal-flow quality, and expand the search for investment-ready SMEs beyond personal networks; this potential has not yet been thoroughly examined in academic research [20].
Our research directly addresses the challenge of identifying investment-ready SMEs using machine learning by empirically evaluating multiple advanced ML algorithms, i.e., Gradient Boosting, Logistic Regression, Random Forest, and ensemble methods, on a comprehensive and robust dataset provided by the European Central Bank’s Survey on Access to Finance of Enterprises (SAFE). A basic question guiding our work is whether these advanced ML models can outperform a conventional approach in correctly identifying investment-ready firms. Given the highly imbalanced nature of the SAFE survey data, by explicitly linking managerial competencies, innovation, and openness to equity financing to SME investment readiness through ML techniques, this study contributes both methodologically and substantively to the entrepreneurial finance literature. It not only advances predictive accuracy but also offers actionable insights for SMEs, investors, and policymakers, ultimately aiming to enhance resource allocation efficiency and support broader economic growth and innovation. Additionally, we include a cost-sensitive evaluation in our model. In practical terms, missing an investment-ready SME (a false negative) is often more costly in lost opportunities for growth and investment than incorrectly flagging a firm as ready when it is not (a false positive). To ensure a balanced and accurate representation of the situation, we assigned a significantly higher cost to false negatives than to false positives when evaluating the performance of the model. We investigate how this asymmetric cost affects the selection of the “optimal” model and the overall misclassification cost. We found that by focusing on reducing false negatives, we can identify models that not only perform well on standard metrics but also align with the economic goal of maximizing successful funding outcomes.

1.2. Research Question and Hypothesis

According to the above, in this empirical research work, with the data derived from the SAFE questionnaire, the main research question is the following:
  • RQ1. Can machine learning models enhance the ability to correctly identify investment-ready SMEs given the high imbalance in questionnaire data, the significant noise created by the questionnaire respondents, and the complexity and volume of such data in the EU’s SAFE questionnaire?
Accordingly, the following research hypothesis is tested in this empirical work:
  • H1. Machine learning models demonstrate a greater balanced accuracy compared to traditional classifiers in identifying investment-ready SMEs within highly imbalanced datasets.

2. Data and Research Methodology

2.1. Data Collection and Pre-Processing

The dataset was collected from the European Central Bank Data Portal and contains anonymous qualitative responses from micro, small, medium, and large companies across Europe. The dataset provides insights into the financing conditions faced by these companies.
The dataset was thoroughly cleaned by removing duplicates, non-SME entries, sparse features, and variables closely tied to the target, to ensure data relevance and prevent leakage and contamination. After cleaning, the final dataset contained 10,937 SME entries (rows), each described by 51 variables (columns). The dataset is imbalanced, with 12% of entries belonging to class 1 (the investment-ready class) and the remaining 88% to class 0 (the non-investment-ready class). Out of the 51 variables (a detailed description of every variable is given in Appendix A, Table A1), 44 were kept as inputs, 1 was the target variable, and the rest were columns containing information on date, company ID, and country. Categorical variables were converted to numerical format using dummy encoding, creating binary columns to represent each category. In order to avoid the dummy variable trap and collinearity issues, for each categorical variable with k unique values we produced k − 1 binary columns [21].
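As an illustration, the k − 1 encoding can be reproduced with pandas’ get_dummies using drop_first=True; the firm_size column and its categories below are hypothetical examples, not actual SAFE fields.

```python
import pandas as pd

# Hypothetical categorical variable (not a SAFE field) with k = 3 categories.
df = pd.DataFrame({"firm_size": ["micro", "small", "medium", "small"]})

# drop_first=True keeps k - 1 = 2 binary columns, avoiding the dummy
# variable trap: the dropped category becomes the reference level.
dummies = pd.get_dummies(df["firm_size"], prefix="firm_size", drop_first=True)
print(list(dummies.columns))  # ['firm_size_micro', 'firm_size_small']
```

The dropped category ("medium", the first alphabetically) is implicitly represented by all-zero rows in the remaining columns.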
The binary target variable takes the value of 1 when the SME is investment-ready and 0 in the opposite case. We define an SME as investment-ready (IR) when it has the following characteristics:
  • Innovative: These are the firms that have reported new developments, such as introducing a new product or service to the market, implementing a new production process or method, adopting new management practices, or exploring new ways of selling goods or services.
  • Fast-growing: These are the cases where the annual turnover increases by more than 20%.
  • Open to equity financing: These are the firms that reported equity either as a relevant funding source or as one used in the past six months.
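As a sketch of how the label might be constructed, assuming (as the list implies) that all three criteria must hold jointly; the argument names and boolean encoding below are illustrative, not the actual SAFE variable codes:

```python
def is_investment_ready(innovative: bool, turnover_growth: float,
                        open_to_equity: bool) -> int:
    """Return 1 (investment-ready) only when all three criteria hold."""
    fast_growing = turnover_growth > 0.20  # annual turnover growth above 20%
    return int(innovative and fast_growing and open_to_equity)

print(is_investment_ready(True, 0.25, True))   # 1: meets all three criteria
print(is_investment_ready(True, 0.10, True))   # 0: growth below 20%
```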

2.2. Machine Learning Algorithms

Machine learning is the branch of Artificial Intelligence that allows a machine—an algorithm—to learn from data and improve its performance without being explicitly programmed. ML algorithms analyze historical data to identify simple or complex patterns or to make predictions. As Russell and Norvig (2020) state, “the process where the computer observes some data, builds a model based on the data and uses it as a hypothesis about the world and as a software that can solve problems” is called machine learning. Supervised learning is the type of machine learning where the algorithm learns and produces a model (function) from labeled training data. The goal is for the system to generalize from these examples, meaning that it can efficiently produce the correct output for new and unseen inputs [22].

2.2.1. Logistic Regression

Logistic Regression is a statistical method of classification. It is an algorithm that models the probability of a discrete outcome. While it shares similarities with linear regression in terms of using a linear combination of input features, the key distinction is that Logistic Regression applies a non-linear logit (log-odds) transformation to model the probability of the outcome. Logistic Regression is considered relatively interpretable among classification methods because the sign of each coefficient indicates whether a predictor positively or negatively influences the outcome’s log-odds. However, the exact effect of a unit change in an input on the predicted probability is not directly intuitive due to the non-linear (logistic) transformation. In other words, while one can relate coefficient values to odds ratios, translating those into absolute probability changes requires additional calculation. The algorithm can also incorporate regularization techniques such as L1 (Lasso), L2 (Ridge), or ElasticNet regularization norms, which help prevent overfitting and improve generalization by penalizing the excessively large coefficients of the model [23].
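The coefficient-to-probability point can be illustrated numerically; the coefficient value below is hypothetical, not estimated from the SAFE data.

```python
import math

beta = 0.8                       # hypothetical coefficient on a predictor
print(round(math.exp(beta), 2))  # odds ratio: the odds multiply by ~2.23

def sigmoid(z: float) -> float:
    """Logistic function mapping log-odds to probability."""
    return 1.0 / (1.0 + math.exp(-z))

# The same coefficient implies different absolute probability changes
# depending on the baseline log-odds, hence the extra calculation.
low = sigmoid(-2.0 + beta) - sigmoid(-2.0)   # change at a low baseline
mid = sigmoid(0.0 + beta) - sigmoid(0.0)     # change at a 50% baseline
print(round(low, 3), round(mid, 3))
```

A one-unit increase always multiplies the odds by the same factor, but shifts the predicted probability by about 0.11 at a low baseline versus about 0.19 at a 50% baseline.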

2.2.2. K-Nearest Neighbors

The K-Nearest Neighbors algorithm is a non-parametric, instance-based machine learning method. Rather than learning an explicit model from the training data, K-NN makes predictions by directly referencing the stored training examples. For a given instance, the algorithm identifies the k closest data points (neighbors) in the feature space, typically using a distance metric such as Euclidean distance. The predicted class for the instance is the majority class among these neighbors. Choosing a small value of k can lead to overfitting, as the model becomes sensitive to noise in the training data, while a large value of k may result in underfitting by oversmoothing class boundaries [24].
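A minimal pure-Python sketch of the voting rule on toy 2-D points (Euclidean distance, majority vote); the data are invented for illustration.

```python
from collections import Counter
import math

def knn_predict(train, labels, x, k=3):
    """Predict the majority class among the k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: math.dist(train[i], x))
    votes = Counter(labels[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

train = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(train, labels, (0.5, 0.5)))  # 0: near the first cluster
print(knn_predict(train, labels, (5.5, 5.5)))  # 1: near the second cluster
```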

2.2.3. Random Forest

The Random Forest classifier is an ensemble learning algorithm that builds multiple decision trees and combines their outputs. Each tree in the forest is trained on a random subset of the data, selected through bootstrap sampling (sampling with replacement), and at each split within a tree, a random subset of features is considered to determine the best split. This combination of data and feature randomness helps reduce overfitting and enhances the model’s generalization ability. For classification tasks, the final prediction is determined by majority voting. Random Forests are well-suited for high-dimensional datasets, are robust to noise, and generally maintain a degree of interpretability through feature importance measures [7].
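A brief scikit-learn sketch on synthetic data (not the SAFE dataset); max_features controls the random feature subset considered at each split, and the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data mimicking a ~12% positive class.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

clf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                             class_weight="balanced", random_state=0)
clf.fit(X, y)

# Feature importances offer a degree of interpretability.
print(clf.feature_importances_.round(3))
```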

2.2.4. Support Vector Machines

Support Vector Machines are a powerful class of supervised machine learning algorithms used for classification and regression tasks. The aim of an SVM is to locate the optimal hyperplane that maximizes the distance between the hyperplane and the closest data points of different classes, known as support vectors. When data are not linearly separable, Support Vector Machines utilize the kernel trick, which maps the input features into a higher-dimensional space where a linear separation between the classes is possible. In our experiments, we used the radial basis function kernel and the linear kernel. Regularization in SVMs helps balance the trade-off between fitting the training data well and maintaining generalization to new data [25].
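A sketch comparing the two kernels on synthetic data (not the SAFE dataset); C is the regularization parameter trading off margin width against training error, and its value here is illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

for kernel in ("rbf", "linear"):
    # class_weight="balanced" reweights classes, as discussed for imbalance.
    clf = SVC(kernel=kernel, C=1.0, class_weight="balanced").fit(X, y)
    print(kernel, "support vectors:", clf.support_vectors_.shape[0])
```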

2.2.5. Naïve Bayes

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem, used to classify data by estimating the probability that a given instance belongs to a particular class. It is termed “naïve” because it assumes that the features (or attributes) are conditionally independent of each other given the class label. In Naïve Bayes classification, the algorithm estimates the prior probability of each class by calculating the frequency of appearance of each class in the dataset. It then computes, for each feature, the likelihood of observing its value given the class. For a new unknown instance, the classifier computes the posterior probability for each possible class, i.e., the probability that the instance belongs to class C given the features X. The class with the highest posterior probability is selected as the predicted class [26].
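A worked numerical sketch with a single hypothetical feature ("innovative = yes"); the priors match the paper's 12%/88% class shares, but the likelihood values are invented for illustration.

```python
# Priors match the class shares in the dataset; likelihoods are invented.
prior = {"ready": 0.12, "not_ready": 0.88}
likelihood = {"ready": 0.70, "not_ready": 0.30}  # P(innovative = yes | class)

# Posterior is proportional to prior * likelihood, then normalized.
unnorm = {c: prior[c] * likelihood[c] for c in prior}
total = sum(unnorm.values())
posterior = {c: unnorm[c] / total for c in unnorm}

print({c: round(p, 3) for c, p in posterior.items()})
# Even with a strong likelihood, the 88% prior keeps "not_ready" more probable.
```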

2.2.6. AdaBoost

Adaptive Boosting (AdaBoost) is an ensemble machine learning algorithm that combines multiple weak classifiers to create a strong classifier. It works by sequentially training a series of simple models, in our case shallow decision trees, on repeatedly modified versions of the data. In each new model the algorithm focuses on the instances misclassified by the previous ones. AdaBoost assigns higher weights to misclassified samples, compelling subsequent models to focus on these hard-to-classify cases. The final prediction is made by aggregating the outputs of all weak classifiers through a weighted majority vote, where classifiers with better accuracy have greater influence [27].
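A brief scikit-learn sketch on synthetic data (not the SAFE dataset); scikit-learn's default weak learner is a depth-1 decision tree (a stump), matching the shallow trees described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Up to 100 sequential stumps; each round reweights misclassified samples.
clf = AdaBoostClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
print(len(clf.estimators_), "weak learners in the weighted majority vote")
```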

2.2.7. Easy Ensemble

The Easy Ensemble classifier was introduced by Liu, Wu et al. (2009) [28] to address imbalanced classification problems by leveraging ensemble learning. It consists of multiple AdaBoost classifiers, each trained on distinct, balanced bootstrap samples. To achieve balance, the method applies random under-sampling in the majority class, ensuring that both classes will have equal representation in the sub-sample. Each iteration of the algorithm involves training an AdaBoost model using a different sub-sample. This process is repeated several times, generating multiple classifiers. The final prediction is produced through a voting mechanism of all models. By repeatedly resampling the majority class and boosting weak learners, Easy Ensemble effectively improves classification performance on imbalanced datasets while maintaining the strengths of boosting techniques.
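The resampling step can be sketched structurally in a few lines; the index counts below are illustrative, and a full implementation is available as imbalanced-learn's EasyEnsembleClassifier.

```python
import random

def balanced_subsamples(majority_idx, minority_idx, n_subsets, seed=0):
    """Each subset: all minority indices plus an equal-sized random
    under-sample of the majority, as in Easy Ensemble."""
    rng = random.Random(seed)
    return [rng.sample(majority_idx, len(minority_idx)) + list(minority_idx)
            for _ in range(n_subsets)]

majority = list(range(88))        # e.g., 88% of firms in class 0
minority = list(range(88, 100))   # 12% in class 1
subsets = balanced_subsamples(majority, minority, n_subsets=10)
print(len(subsets), len(subsets[0]))  # 10 balanced subsets of 24 firms each
# One AdaBoost model would then be trained per subset and their votes combined.
```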

2.2.8. Balanced Bagging Classifier

The Balanced Bagging classifier is an ensemble method designed to address the class imbalance in classification tasks. It is a variation on the traditional Bootstrap Aggregating (Bagging) algorithm, where multiple models are trained in parallel and their predictions are combined to make the final decision. The key difference is that Balanced Bagging classifier also handles the class imbalance. The classifier creates multiple subsets of the original data by randomly sampling data points with replacement. The subsets are balanced before the training of each classifier. The balance of the class labels is achieved by random under-sampling on the majority class to match the size of the minority class. Then, each classifier is trained on the balanced subsets. The final prediction is made by utilizing the majority voting concept on classification tasks [29].

2.2.9. Gradient Boosting Trees

Gradient Boosting classifiers are additive models, where the prediction for a given input is made by combining the outputs of several weak learners, which in our problem are decision trees. The Gradient Boosting model is built iteratively in a greedy fashion, where each new tree is trained to correct the errors (residuals) made by the previous trees. Initially, the model starts with a constant value, the mean of the target values, to minimize the loss. During each iteration, the model computes the residuals—the differences between the observed and predicted values—and fits a decision tree to these residuals. The tree is adjusted to minimize the loss function, and the model is updated by adding the contribution of the new tree. This process is repeated for a set number of trees (weak learners), with each tree improving the model by correcting the errors of previous ones.
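A compact scikit-learn sketch on synthetic data (not the SAFE dataset); learning_rate shrinks each tree's contribution to the additive model, and the hyperparameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

# 200 boosting stages; each tree fits the residuals of the current model.
clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 max_depth=3, random_state=0)
clf.fit(X, y)
print(clf.estimators_.shape[0], "boosting stages fitted")
```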
Finally, we have also included the Logistic Regression model, from traditional econometrics, as a benchmark. This allowed us to directly compare the performance of advanced machine learning algorithms to a standard econometric classification technique.

2.2.10. Cross-Validation

Before applying any machine learning algorithm, we first split the dataset into (a) the training set, which is used to locate the best hyperparameters and identify the optimal models, and (b) the validation set, which is never used in the previous process (training) and is used to evaluate the out-of-sample performance of the trained models (Figure 1).
A critical aspect of machine learning model development is managing the trade-off between overfitting and underfitting. Underfitting occurs when a model is unable to capture the underlying structure of the data, often due to excessive simplicity or insufficient training, and is typically reflected in a poor performance across both the training and validation datasets. Overfitting occurs when a model becomes too complex relative to the amount of training data, leading it to memorize noise and idiosyncrasies within the specific training set. As a result, while training accuracy may be high, the performance on any new and unseen data, like the validation dataset, degrades significantly. In either case, the model fails to generalize beyond the training distribution. To mitigate overfitting, cross-validation techniques are employed to evaluate the model’s performance across multiple data partitions, providing an estimate of its predictive accuracy on unseen data.
Thus, to start the training process, we employed a cross-validation technique (Figure 1). The training dataset is partitioned into 5 equal-sized folds, and the training is repeated iteratively; in each iteration, 4 folds are used for training the model, while the remaining fold is used for testing. This process is repeated 5 times, with each fold serving as the test set once. For each unique set of hyperparameters, a new model is trained, and the corresponding test scores are averaged and used to identify the optimal model. Given the class imbalance in the dataset, where the minority class (class 1) comprises only 12% of the observations, we employed a stratified 5-fold cross-validation technique, rather than simple random sampling for the folds, to ensure that all folds were consistent. This ensures that each fold maintains roughly the same class distribution as the original dataset, thereby providing stable and representative validation results.
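The stratification property can be verified directly on synthetic labels with the same 12%/88% split (placeholder features, not the SAFE data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 12 + [0] * 88)   # 12% minority, as in the dataset
X = np.zeros((100, 1))              # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_shares = [y[test_idx].mean() for _, test_idx in skf.split(X, y)]
print(fold_shares)  # each fold of 20 holds 2 or 3 of the 12 minority cases
```

With plain random folds, a fold could by chance contain almost no minority observations, which is exactly what stratification rules out.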

2.2.11. Model Selection

Nine classification algorithms were used in this study. The hyperparameter-tuning process was conducted using the grid search technique, a systematic approach that exhaustively explores a predefined subset of the hyperparameter space. Grid search iteratively evaluates combinations of hyperparameters by fitting models to the training data and assessing their performance using the stratified 5-fold cross-validation scheme. The optimal hyperparameters for each algorithm were chosen based on the highest average performance in cross-validation on the test sets. The classification algorithms applied in the experimental analysis are the following:
  • Logistic Regression;
  • K-Nearest Neighbors;
  • Random Forest;
  • Support Vector Machines with RBF and Linear Kernel;
  • Naïve Bayes Classifier;
  • AdaBoost;
  • Easy Ensemble;
  • Balanced Bagging Classifier;
  • Gradient Boosting Trees.
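The tuning loop can be sketched with scikit-learn's GridSearchCV on synthetic data; the benchmark Logistic Regression and the small grid of C values below are illustrative, not the actual grids searched in the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.88, 0.12], random_state=0)

# Exhaustive search over a small grid, scored by balanced accuracy under
# stratified 5-fold cross-validation, as in the text.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    scoring="balanced_accuracy",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
grid.fit(X, y)
print(grid.best_params_)  # hyperparameters with the best average CV score
```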

2.2.12. Data Imbalance

To address the issue of the imbalanced dependent variable, the models were trained using balanced class weights. This approach involves assigning higher weights to the underrepresented class (class 1—investment-ready) and lower weights to the overrepresented class (class 0—not investment-ready) during the training process. By doing this, the model in the training process is penalizing the misclassification of the minority-class observations more, thus encouraging it to pay more attention to the less frequent observations.
There are various empirical techniques for handling class imbalance. Oversampling techniques, like SMOTE, ADASYN, etc., create and add artificial observations to inflate the minority class; under-sampling techniques reduce the number of observations in the majority class; and cost-sensitive learning applies a higher cost to the minority class in the cost function or assigns class weights. In our case, the minority class is approximately 12% of the data, which represents a moderate imbalance. In such settings, balanced class weights are sufficient to achieve a good performance, especially in conjunction with tree-based classifiers (e.g., Random Forest, XGBoost), which are intrinsically less sensitive to class imbalance than linear models. Thus, to handle class imbalance, we used class weights both in the training process and in the construction of the cost function, where the minority class is given 5 times the cost of the majority one. Moreover, some of the algorithms selected for the classification are specifically designed to handle class imbalance: (a) the Easy Ensemble classifier, introduced by Liu, Wu et al. (2009) [28] to address imbalanced classification problems by leveraging ensemble learning; and (b) the Balanced Bagging classifier, a variation of the traditional Bootstrap Aggregating (Bagging) algorithm that can handle class imbalance.
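The 5:1 weighting in the cost function reduces to a one-line computation; the confusion counts below are hypothetical, chosen only to show the asymmetry.

```python
def misclassification_cost(fp: int, fn: int, fn_weight: float = 5.0) -> float:
    """Total cost with false negatives weighted 5x false positives."""
    return fp * 1.0 + fn * fn_weight

# Two hypothetical models with the same 50 total errors:
print(misclassification_cost(fp=40, fn=10))  # 90.0
print(misclassification_cost(fp=10, fn=40))  # 210.0: penalized far harder
```

Under this cost, a model that trades false positives for false negatives is penalized even when its raw error count is unchanged, steering model selection toward catching investment-ready firms.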

2.3. Forecasting Performance Metrics

In a binary classification problem, normally, a confusion matrix (Figure 2) is created to visually summarize the models’ performance. The confusion matrix consists of four values as below:
  • True Positives (TP): The number of instances in which the model correctly predicts the positive class (in our case, the SME to be predicted as investment-ready, class 1).
  • True Negatives (TN): The number of instances in which the model correctly predicts the negative class (in our case, the SME to be predicted as non-investment-ready, class 0).
  • False Positives (FP): The number of instances in which the model incorrectly predicts the positive class when the actual class is negative.
  • False Negatives (FN): The number of instances in which the model incorrectly predicts the negative class when the actual is positive.
From the four values above, various metrics are defined and used to evaluate our models, as described below.

2.3.1. Precision

This metric measures the reliability of the predictions made for each class:
Class 0 (not investment-ready):
It counts the proportion of companies forecasted as “not investment-ready” that are truly not investment-ready. High precision indicates high confidence in negative predictions.
Precision₀ = TN / (TN + FN)
Class 1 (investment-ready):
In this case, the precision measures the proportion of companies predicted as “investment-ready” that are truly investment-ready. High precision reduces false positives (e.g., mislabeling companies not investment-ready as investment-ready).
Precision₁ = TP / (TP + FP)

2.3.2. Recall (Sensitivity/Specificity)

Measures the models’ ability to identify all relevant instances of a class:
Class 0 Recall (Specificity):
Recall of class 0 counts the ability of the model to correctly identify companies not investment-ready.
Recall₀ = TN / (TN + FP)
Class 1 Recall (Sensitivity):
Recall of class 1 counts the ability of the model to correctly identify investment-ready companies.
Recall₁ = TP / (TP + FN)

2.3.3. F1-Score

Balances precision and recall using their harmonic mean. A high F1-score indicates a strong performance in both precision and recall for a class. It is critical for evaluating the investment-ready (minority) class, where both false positives (costly misallocations) and false negatives (missed opportunities) are consequential.
F1 Score = TP / (TP + (FP + FN)/2)
or alternatively,
F1 Score = 2 × Precision × Recall / (Precision + Recall)

2.3.4. Balanced Accuracy

While accuracy is the standard metric for evaluating classification models on balanced datasets, it becomes less informative in the presence of class imbalance, as in our case. In such settings, the balanced accuracy measure is preferred. Defined as the average recall (sensitivity) across all classes, balanced accuracy offers a more reliable assessment of model performance by explicitly accounting for the sensitivity of each class. Balanced accuracy can also be interpreted as plain accuracy in which each observation is weighted by the inverse prevalence of its true class [31].
Balanced Accuracy = (1/2) × (TP / (TP + FN) + TN / (TN + FP))
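As a concrete illustration, the four metrics above can be computed directly from the confusion-matrix counts; the sketch below uses hypothetical counts, not the paper's results.

```python
# Metrics from Sections 2.3.1-2.3.4 computed directly from confusion-matrix
# counts. The counts used below are hypothetical, for illustration only.
def classification_metrics(tp, tn, fp, fn):
    return {
        "precision_0": tn / (tn + fn),             # reliability of class 0 calls
        "precision_1": tp / (tp + fp),             # reliability of class 1 calls
        "recall_0": tn / (tn + fp),                # specificity
        "recall_1": tp / (tp + fn),                # sensitivity
        "f1_class1": tp / (tp + 0.5 * (fp + fn)),  # equals the harmonic-mean form
        "balanced_accuracy": 0.5 * (tp / (tp + fn) + tn / (tn + fp)),
    }

metrics = classification_metrics(tp=80, tn=800, fp=120, fn=20)
```

Note that the two F1 formulas coincide: with these counts, precision_1 = 0.4 and recall_1 = 0.8 give the same 0.533 as the count-based form.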

2.3.5. Receiver Operating Characteristic Area Under Curve (ROC AUC)

The ROC AUC (Figure 3) quantifies the model's ability to discriminate between classes. The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings, while the area under the curve (AUC) ranges from 0.5 (no discriminative power, equivalent to random guessing) to 1 (perfect discrimination).
All the above metrics matter. High class 0 precision ensures that firms that are not investment-ready are correctly filtered out. High class 1 recall ensures that investment-ready companies are not overlooked, while class 1 precision avoids costly false positives. The F1-score and balanced accuracy each provide a single metric that balances these trade-offs, which is especially vital for an imbalanced dataset.
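For completeness, the ROC AUC can equivalently be computed as the probability that a randomly chosen positive instance is scored above a randomly chosen negative one (ties counted as one half); the sketch below uses hypothetical scores and labels.

```python
# ROC AUC via its rank-statistic interpretation: the probability that a
# random positive outranks a random negative. Inputs are hypothetical.
def roc_auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

auc = roc_auc(scores=[0.9, 0.8, 0.35, 0.2], labels=[1, 0, 1, 0])  # 3 of 4 pairs
```

A perfect ranking yields 1.0; a classifier whose scores carry no information about the label yields 0.5, matching the bounds described above.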

2.3.6. Feature Engineering and Exploratory Analysis

The occurrence of each class of the 44 variables in our dataset can be found in Table A2 in the Appendix A. A potential problem to consider when applying one-hot encoding to a wide range of categorical variables is multicollinearity. High correlation among predictors can lead to biased coefficient estimates and large standard errors in traditional econometric linear models such as Logistic Regression [32]. An additional advantage of machine learning algorithms, however, is that tree-based ensemble models, including Random Forest and Gradient Boosting, can efficiently handle datasets with highly correlated independent variables. As Breiman (2001) [7] explained, from a group of highly correlated independent variables a decision tree will typically select only one at each split, as the others provide redundant information and thus have no effect on the model's predictive power. As a result, tree-based models, by construction, inherently mitigate possible multicollinearity issues. Nonetheless, multicollinearity can affect the stability and interpretation of feature importance measures, as importance may be reduced or arbitrarily allocated across the correlated variables.

2.4. Variable Importance Measure (VIM) and Shapley Additive Explanation (SHAP)

In tree-based models, the VIM is used to assess and rank the contribution of each variable to the model’s predictive performance. During the construction of a decision tree, the algorithm selects variables at each split based on how well they improve the model’s ability to distinguish between classes. This is typically performed by minimizing a metric known as impurity (e.g., Gini impurity or entropy). Variables that consistently result in greater reductions in impurity across the tree are considered more important. The VIM reflects this by aggregating the impurity reductions attributed to each variable over the entire tree (or forest, in ensemble models like Random Forest), helping to identify which features have the most influence on the model’s decisions.
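The quantity the VIM aggregates per variable can be sketched at the level of a single split: the weighted decrease in Gini impurity from parent to children. The node counts below are hypothetical.

```python
# Weighted impurity decrease of one candidate split; the VIM sums such
# decreases per variable over all splits. Counts are hypothetical.
def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

parent = gini([60, 40])                  # impurity before the split: 0.48
left, right = gini([50, 10]), gini([10, 30])
n, n_left, n_right = 100, 60, 40
decrease = parent - (n_left / n) * left - (n_right / n) * right
```

A variable whose splits repeatedly produce large `decrease` values accumulates a high importance score; entropy can be substituted for Gini without changing the logic.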
We interpret the Gradient Boosting classifier with SHAP (Shapley Additive exPlanations). SHAP decomposes our predictions into additive feature contributions relative to the model expectation (baseline):
f(x) = φ_0 + Σ_{i=1}^{n} φ_i(x)
where φ0 represents the baseline and φ i ( x ) is the SHAP value for feature i on instance x. In simple words, we count how much a feature “pushes” the prediction up or down relative to the baseline. In our case, a positive φ i pushes towards IR, and a negative one in the opposite direction relatively to the baseline.

3. Empirical Results

Given the significant class imbalance in our dataset (88% not investment-ready vs. 12% investment-ready), simple accuracy is not an adequate forecasting metric. A naïve classifier that labels all instances in the dataset as "not investment-ready" would achieve 88% accuracy while failing entirely to identify investment-ready firms. Balanced accuracy addresses this limitation, as described above: by averaging the recall rates of both classes, it gives equal importance to investment-ready and not-investment-ready companies, regardless of their proportions in the dataset. Therefore, we use the balanced accuracy score to rank the models' overall performance and identify the best model.
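The point can be verified numerically; the sketch below reproduces the 88/12 scenario with a naive all-negative classifier.

```python
# With the paper's 88/12 imbalance, predicting "not investment-ready"
# for every firm looks strong on plain accuracy but is useless by
# balanced accuracy (class 1 recall is zero).
labels = [0] * 88 + [1] * 12
preds = [0] * 100                       # naive all-negative classifier

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
recall_0 = sum(1 for p, y in zip(preds, labels) if y == 0 and p == 0) / 88
recall_1 = sum(1 for p, y in zip(preds, labels) if y == 1 and p == 1) / 12
balanced_accuracy = (recall_0 + recall_1) / 2   # 0.88 accuracy, 0.5 balanced
```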
As shown in Figure 4, the Gradient Boosting model outperforms the rest in both class-wide metrics, with a balanced accuracy of 0.754 and a ROC AUC of 0.815. Logistic Regression comes second on both criteria (Table A3 in the Appendix A reports the balanced accuracy and ROC AUC scores of the optimal model for each methodology). Figure 5 and Figure 6 depict the forecasting metrics for the two classes separately; here too, Gradient Boosting is on average the best forecasting model. For the class 1 metrics, it achieved some of the best performances, falling only slightly behind in recall and precision, where Logistic Regression and Random Forest, respectively, excelled. However, Gradient Boosting attains the best class 1 F1-score and thus offers the best trade-off between precision and recall, making it the most well-rounded model for discovering promising SMEs.
Moreover, in Figure 7 below, we present the ROC curve for the Gradient Boosting model. The ROC curves for the remaining models can be seen in Appendix A, Figure A1, Figure A2, Figure A3, Figure A4, Figure A5 and Figure A6.
Furthermore, the Easy Ensemble shows consistent performance across metrics. While it does not lead in any single metric, with a balanced accuracy of 0.746, a precision of 0.396, and a recall of 0.717, its steady performance suggests robustness. With an F1-score of 0.510 and a ROC AUC of 0.804, it performs only slightly worse than the two leading models. Finally, the Balanced Random Forest achieved the third-highest ROC AUC (0.810), only marginally below Logistic Regression, with a competitive recall of 0.711; nonetheless, it has the lowest balanced accuracy (0.738) and precision (0.383) among the top models.
Considering precision, recall, and F1, Gradient Boosting appears to outperform most of the competition for class 0, exhibiting some of the top values across all evaluation metrics. The model achieved the best overall precision of 0.931, with the second-highest overall recall score of 0.794 and the second-highest F1-score of 0.857, only falling behind Random Forest, which had a recall score of 0.830 and an F1-score of 0.872.
Class 0 precision remains consistent across all models, with limited variability (precision variance across models is 0.031). The same does not hold for recall (recall variance across models is 0.128). All models achieve exceptional class 0 precision (>90%), with Gradient Boosting offering the best performance.
Overall, Gradient Boosting is the optimal model for predicting SME investment readiness and is the recommended model for a balanced predictive strategy. There are, however, models that excel at specific metrics, such as Logistic Regression in class 1 recall or Random Forest in class 0 recall. Classifying the majority class proved an easy task for all models, which achieved very high levels of both precision and F1-scores (Table A4). This indicates that the algorithms were effective at filtering out firms that are not investment-ready. Predicting the minority class, on the other hand, proved to be a challenge, as evidenced by the lower F1-scores of all models, which peaked at 0.526, and by the low precision levels. This highlights the difficulty of avoiding false positives.
In Figure 8 and Figure 9, we present the confusion matrix and the normalized confusion matrix, respectively, for the best model, Gradient Boosting. It achieved the following:
  • 267 TP predictions (investment-ready) out of 374, or 71.3%.
  • 1440 TN predictions (not investment-ready) out of 1814, or 79.3%.
  • 374 FP predictions (predicted as investment-ready but are not), or 20.6%.
  • 107 FN predictions (predicted as not investment-ready but actually are), or 28.6%.
The confusion matrices for every trained model are included in Figure A7, Figure A8, Figure A9, Figure A10, Figure A11, Figure A12, Figure A13, Figure A14, Figure A15 and Figure A16 in the Appendix A. The performance metrics values for the optimal models for each methodology can be found in Table A3 (balanced accuracy and ROC AUC score) and Table A4 (precision, recall, and F1-score) in the Appendix A.
When applying the Variable Importance Measure (VIM) technique to our best model, Gradient Boosting, the 10 most important variables and their ranking are presented in Figure 10. The variable importance plot illustrates the relative contribution of each input feature to the model’s classification performance, with higher values indicating a greater influence on the model’s predictions. Out of the ten most important variables ranked according to the VIM, the first six are comparatively the most prominent.
  • The most important predictor appears to be the firm’s “confidence in negotiations with equity investors or venture capital firms”. High levels of negotiation confidence likely reflect firm preparation, understanding of investor expectations, and stronger business fundamentals, all of which are critical traits for attracting external investment.
  • The second most important predictor is “financing growth”, which indicates that the firms that are actively seeking and managing financial growth tend to be investment-ready. This highlights the importance of proactive financial planning and scaling strategies as indicators of a firm’s investment potential.
  • “Factors in the future of financing of the firm” is ranked third in top predictors. This variable points to the importance of future financial planning. Investors may favor firms that not only demonstrate current performance but also show foresight in securing future financing.
  • “External financing factors” including market conditions or access to funding channels is also an essential feature that influences the model’s decision-making process. Firms capable of navigating external factors’ influences may be more successful in attracting external investors.
  • “Autonomous organization type”, relating to the structure of the organization, is also among the most important predictors of an investment-ready SME. Autonomous firms may be more agile and better able to innovate, and thus they draw investors’ interest.
  • Finally, “willingness of investors to invest in the enterprise” captures the investors’ sentiment towards the firm.
These six variables summarize the importance of both (a) internal factors such as confidence, growth, future outlook, and the firm’s autonomy and flexibility, and (b) external factors such as market conditions, availability of financing, and the overall sentiment of the investors towards investing in a firm.

3.1. SHAP Values

We present the SHAP summary figure (Figure 11). SHAP was introduced by Lundberg and Lee (2017) to provide a unified, model-agnostic approach for interpreting individual predictions via Shapley-value-based additive feature attributions that satisfy local accuracy, consistency, and missingness. The rows are ordered according to their global importance, and each dot is an SME. The position of each dot is its SHAP value, and the color shows the feature value (red means high and blue means low). Every point to the left of zero decreases the predicted IR, and every point to the right increases it.
According to the above explanation and SHAP graph, we can observe the following for the four most-important independent variables.
  • Confidence in negotiations: High values of confidence—as expected—indicate an investment-ready SME.
  • Importance of factors in the future financing of the firm: Declining future financing prospects reduce the investment readiness of an SME.
  • Not taking autonomous financial decisions due to being a subsidiary: The results show that high values of this variable significantly decrease investment readiness for an SME.
  • Willingness of external investors to invest in a specific SME: Perceived persistence of the interest of external investors in specific SMEs seems to increase investment readiness.
Both VIM and SHAP identify an overlapping set of high-impact predictors—e.g., negotiation confidence, financing-for-growth posture, future-financing priorities, external financing conditions, organizational autonomy, and perceived investor willingness. VIM uses global split-based importance, whereas SHAP uses the marginal contributions of the observations; agreement across these distinct criteria indicates that the identified drivers are not an artifact of the choice of importance estimator. The minor rank-order differences are expected given the distinct objectives and estimation criteria of VIM and SHAP.

3.2. Cost Function

Another way to assess the forecasting efficiency of the selected algorithms and identify the best model is by calculating the misclassification cost and including it in the evaluation process. This is achieved by using a cost function in the selection of the optimal model. The cost function that we used is defined as
Total Misclassification Cost (TMC) = C_FP × FP + C_FN × FN
where C_FP is the cost of a false positive forecast, C_FN is the cost of a false negative forecast, FP is the total number of false positive cases, and FN is the total number of false negative cases. In our case, we set C_FP = 1 and C_FN = 5, following many studies in the field [33,34,35]. The use of such a weighted cost is part of the cost-sensitive evaluation methodology studied in classification problems with unequal error costs [36]. Table 1 presents, for each of the five optimal models, the FN, the FP, and the resulting TMC.
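The TMC is straightforward to compute; the sketch below uses hypothetical FP/FN counts, not the counts of Table 1.

```python
# The cost function used for model selection, with the paper's asymmetric
# costs as defaults. The FP/FN counts below are hypothetical.
def total_misclassification_cost(fp, fn, c_fp=1, c_fn=5):
    return c_fp * fp + c_fn * fn

tmc = total_misclassification_cost(fp=200, fn=150)  # 200*1 + 150*5 = 950
```

Because each false negative is five times as expensive as a false positive, a model that trades a few extra FPs for fewer FNs can achieve a lower TMC despite a lower precision.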
The Gradient Boosting model exhibits the lowest TMC (908), making it the best model even under the cost-sensitivity criterion. Logistic Regression produces a slightly higher TMC (914), indicating comparable cost performance. These results confirm that accounting for misclassification costs refines our evaluation: Gradient Boosting is not only strong in terms of balanced accuracy but also minimizes the economic cost of errors under our cost assumptions.
Under the cost-optimized threshold (0.452), the confusion matrix for the Gradient Boosting model is as shown in Figure 12 and Figure 13.
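Threshold optimization of this kind can be sketched as a simple scan over candidate thresholds; the scores and labels below are hypothetical, and with C_FN = 5 the cost-minimizing threshold falls below 0.5, in the spirit of the paper's cost-optimized 0.452.

```python
# Scan candidate thresholds and keep the one minimizing the TMC.
# Scores and labels are hypothetical illustrations.
def best_threshold(scores, labels, c_fp=1, c_fn=5):
    best_t, best_cost = None, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        cost = c_fp * fp + c_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

t_star, cost = best_threshold([0.9, 0.7, 0.6, 0.4, 0.3, 0.1],
                              [1, 1, 0, 1, 0, 0])
```

With these toy inputs, the optimum accepts one extra false positive rather than miss the positive scored at 0.4, exactly the behavior an FN-heavy cost induces.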
The empirical evidence provides answers to our research questions as follows. With regard to whether machine learning (ML) enhances identification in situations involving imbalance and noise, the Gradient Boosting approach attained the highest balanced accuracy (0.754) and ROC AUC (0.815), while also identifying 71.3% of investment-ready SMEs (TP = 267/374) with a class 1 F1-score of 0.526 (see Figure 4, Figure 5, Figure 6, Figure 8 and Figure 9). These findings support H1 and confirm RQ1. The feature importance analysis (see Figure 10) identifies a small set of variables with the greatest predictive power: negotiation confidence with equity investors, financing for growth, forward-looking financing plans, external financing conditions, organizational autonomy, and investor willingness. Finally, implementing asymmetric costs (CFN = 5, CFP = 1) redirected the evaluation towards minimizing false negatives. Gradient Boosting minimized the Total Misclassification Cost (TMC = 908) versus Logistic Regression (TMC = 914) and similar models (Table 1), with a cost-optimized threshold (0.452) further improving the trade-off, demonstrating that cost-sensitive model selection criteria lead to better economic choices.
Based on the above empirical results, we can now discuss our research question and whether the related research hypothesis is supported by the empirical results.
Regarding RQ1 and whether machine learning (ML) enhances the identification of investment-ready SMEs in situations involving imbalanced datasets and noise, we performed the following statistical tests:
  • A DeLong test on the ROC AUCs produced by the two best-performing algorithms, XGBoost and the logit. The hypotheses tested are the following:
    H0: The two algorithms produce AUCs that are not statistically different.
    H1: The two algorithms produce AUCs that are statistically different.
    The test is performed on the same validation set for both models. The z-score is −0.4462 with a p-value of 0.6554, providing statistical evidence that the two algorithms do not significantly differ in terms of their ROC AUCs. The relevant confidence intervals are the following:
    XGBoost AUC at 95% probability produces a confidence interval of [0.783, 0.835].
    Logit AUC at 95% probability produces a confidence interval of [0.781, 0.833].
    Thus, with the DeLong test, we did not find evidence that one model produces a statistically better ROC AUC than the other.
  • We performed a McNemar test, a non-parametric test applied to categorical observations to analyze the differences between two related groups, usually arranged in a 2 × 2 matrix of dichotomous variables. On the main diagonal, we have the count of instances where the two models agree, and on the secondary diagonal, we have the count of observations where the two models disagree.
The differences between the two methods’ performances are the following (the secondary diagonal values):
  • XGBoost correct, logit wrong: 110.
  • XGBoost wrong, logit correct: 56.
Thus, among the discordant pairs, roughly twice as many instances are classified correctly by the XGBoost model. The total number of discordant instances (166) is well above 25, the minimum usually required for the asymptotic chi-square approximation to be accurate. The McNemar test statistic is 16.922, with a p-value of 0.000, so the difference is statistically significant. The difference in error rates (errors/all) is XGBoost − Logit = −0.0247, confirming that Gradient Boosting has a slightly lower plain misclassification rate. In addition, bootstrapping this error rate difference of −0.0247 yields a confidence interval of [−0.0361, −0.0133]; since the confidence interval does not include zero, we can confirm that the difference is statistically significant.
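For reference, the continuity-corrected McNemar statistic can be reproduced from the discordant counts reported above (110 and 56); the p-value comes from the survival function of a chi-square distribution with one degree of freedom.

```python
import math

# McNemar test with continuity correction on the discordant counts
# reported in the text (b = 110, c = 56). The statistic is approximately
# chi-square distributed with 1 degree of freedom.
def mcnemar(b, c):
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(stat / 2))   # chi2(1) survival function
    return stat, p_value

stat, p_value = mcnemar(110, 56)   # stat ≈ 16.922, p ≈ 0.000
```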

4. Discussion

The model that we proposed goes beyond a routine ML model comparison in four ways:
  • We created a universal EU model using the large dataset collected by the Survey on Access to Finance of Enterprises (SAFE), instead of using a country-specific one.
  • Usually IR is estimated using capability [2,3], communication [37], governance [38] and preparedness [39]. In this paper, we combine innovation, high growth, and equity openness in a model that is broader and more operational for policy/investment targeting.
  • We addressed the problem of class imbalance using class weights and appropriate methods (e.g., Easy Ensemble, Balanced Bagging).
  • We created and implemented a cost-sensitive metric (false negatives weighted five times the false positives in our model) to find the optimal model, reflecting economic consequences, instead of using just simple accuracy metrics. Consequently, Gradient Boosting is not just the optimal model statistically, but more importantly also economically.
These design choices translate into concrete, practical benefits:
  • Venture capital and private equity early screening often relies on subjective judgments and/or personal networks. Perceived weaknesses may lead to the early rejection of SMEs. The proposed ML model, in contrast, evaluates firms on data-driven criteria, reducing the risk of bias or limited information.
  • The proposed ML-based model (and ML approaches in general) can scan a large dataset like SAFE, identifying IR firms outside the investor’s circle and thus expanding the pool of potential opportunities.
  • The proposed ML-driven classification sorts SMEs by underlining those with higher predicted IR. Investors can then focus their in-depth analysis on a smaller set of IR-identified SMEs—cutting costs and improving efficiency.
  • Finally, IR in this paper is defined through concrete features like innovation, growth, and openness to equity. This means that the model can function not only as a filter to identify IR but also as a diagnostic guide, showing SMEs whose capabilities they can strengthen to improve their funding prospects.

5. Conclusions

The primary goal of this research was to apply advanced machine learning (ML) techniques to predict SME investment readiness, utilizing the large dataset collected by the Survey on Access to Finance of Enterprises (SAFE) conducted semiannually by the European Central Bank. After training and evaluating multiple ML algorithms, which included Gradient Boosting, Logistic Regression, Random Forest, and various ensemble methods, the Gradient Boosting model demonstrated the highest predictive performance, with a balanced accuracy of 75.4% and a ROC AUC score of 0.815. Logistic Regression was notably close in performance, indicating that simpler, interpretable models can also effectively predict investment readiness, providing practical benefits in scenarios requiring rapid deployment or interpretability.
Although explicit feature selection was not implemented in this study, our findings highlight several important factors influencing investment readiness. These factors include managerial confidence, innovation activities, openness to equity financing, and investor engagement. Clearly identifying these factors helps SMEs strategically prioritize improvements, thereby enhancing their likelihood of securing external financing.
Limitations of our analysis include the use of self-reported, cross-sectional SAFE survey data, which may introduce biases affecting the generalizability of the results. The dataset offers a static snapshot, preventing an analysis of changes in investment readiness over time. Model validation was limited due to a lack of independent data. There was a significant class imbalance (12% investment-ready vs. 88% not) that was addressed through appropriate metrics, weights, and imbalance-resistant algorithms. Finally, applying these ML tools in SME financing requires tackling integration, privacy compliance, interpretability, and stakeholder trust issues.
Despite these limitations, which are present in most empirical research, from an academic perspective this research adds valuable insights to the entrepreneurial finance literature by demonstrating the practical effectiveness of machine learning techniques in financial decision-making. However, the close performance between Gradient Boosting and Logistic Regression underscores the continued relevance of traditional econometric models, particularly in situations where interpretability and prompt results are crucial.
The economic implications are considerable. Independent investors, relevant governmental agencies, and financial institutions can leverage these predictive models to better assess SMEs, minimizing costly misinvestments (Type I errors). Consequently, such models have the potential to optimize the allocation of the already scarce private and government financial resources. Policymakers can utilize these insights to tailor more effective support programs for SMEs, addressing their specific financing needs. Ultimately, adopting these predictive approaches promises more efficient capital markets, enabling SMEs to contribute more effectively to economic growth and job creation.
An additional insight from the cost evaluation is that the selection of the best model can shift significantly once the economic implications of misclassifications are accounted for. When both costs (false positive and false negative) are adequately measured and assigned to the optimization of the predictive models, a model with higher balanced accuracy is not necessarily the one with the lowest overall cost. When the two costs are asymmetric, as was assumed in this paper, and the cost of a false negative is greater than the cost of a false positive, the precision as a model selection metric will be inferior to the recall (sensitivity). This is because the latter focuses on maximizing the identification of true positives, which is the same as minimizing false negatives. This analysis confirms that cost-sensitive evaluation complements and adjusts the overall findings, providing more practical and economically sensitive guidance for model selection in real-world scenarios.
With respect to the research question and hypothesis stated in the introduction, we find support that using machine learning models can be effective in identifying investment-ready SMEs despite data imbalance and noise. Through a feature importance analysis, we have identified a subset of variables that hold the most predictive power.
Finally, a valuable next step would be to analyze how the model performance varies across different cross-sectional factors such as the country of the firm, firm size, industry or sector, and EU membership status. These steps could reveal important factors and help to create predictive models specifically designed for certain cross-sections.

Author Contributions

Conceptualization, P.G. (Periklis Gogas) and T.P.; methodology, P.G. (Periklis Gogas) and T.P.; software, N.G. and A.K.; validation, P.G. (Periklis Gogas) and T.P.; formal analysis, P.G. (Periklis Gogas), T.P. and P.G. (Panagiotis Goumenidis); investigation, P.G. (Panagiotis Goumenidis), N.G. and A.K.; resources, P.G. (Panagiotis Goumenidis), N.G. and A.K.; data curation, N.G. and A.K.; writing—original draft preparation, P.G. (Panagiotis Goumenidis), N.G. and A.K.; writing—review and editing, P.G. (Periklis Gogas), T.P. and P.G. (Panagiotis Goumenidis); visualization, N.G. and A.K.; supervision, P.G. (Periklis Gogas), T.P. and P.G. (Panagiotis Goumenidis); project administration, P.G. (Periklis Gogas) and T.P.; funding acquisition, P.G. (Periklis Gogas) and T.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research project was implemented under the framework of H.F.R.I. call “Basic Research Financing (Horizontal support of all Sciences)” under the National Recovery and Resilience Plan “Greece 2.0” funded by the European Union—Next-Generation EU (H.F.R.I. Project Number: 16856).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Variable description.
# | Variable Name | Question
1 | Autonomous | How would you characterise your enterprise?
2 | MainActivity | What is the main activity of your enterprise?
3 | Age | In which year was your enterprise first registered?
4 | Ownership | Who owns the largest stake in your enterprise?
5 | profitable | Firms that report, simultaneously, higher turnover and profits, lower or no interest expenses and lower or no debt-to-assets ratio.
6 | MostImportantProblemFacing | Which was the most important problem faced by your enterprise during the {previous quarter and current quarter} OR {current quarter}?
7 | ExternalFinancing_Factors#GeneralEconomicOutlookToObtainExternalFinancing | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? General economic outlook, insofar as it affects the availability of external financing
8 | ExternalFinancing_Factors#AccessToPublicFinancialSupportIncludingGuarantees | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Access to public financial support, including guarantees
9 | ExternalFinancing_Factors#YourFirmSpecificOutlook | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Your enterprise-specific outlook with respect to your sales and profitability or business plan
10 | ExternalFinancing_Factors#YourFirmOwnCapital | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Your enterprise’s own capital
11 | ExternalFinancing_Factors#YourFirmCreditHistory | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Your enterprise’s credit history
12 | ExternalFinancing_Factors#WillingnessOfBanksToProvideCredit | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Willingness of banks to provide credit to your enterprise
13 | ExternalFinancing_Factors#WillingnessOfBusinessPartnersToProvideTradeCredit | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Willingness of business partners to provide trade credit
14 | ExternalFinancing_Factors#WillingnessOfInvestorsToInvestInYourEnterprise | For each of the following factors, would you say that they have improved, remained unchanged or deteriorated during the {previous quarter and current quarter} OR {current quarter}? Willingness of investors to invest in your enterprise
15 | FirmGrowth#InTermsOfEmploymentRegardingTheNumberOfFullTimeOrFullTimeEquivalentEmployees | Over the last three years (<year>-<year>), how much did your firm grow on average per year? (in terms of employment regarding the number of full-time or full-time equivalent employees)
16 | ConfidenceInNegotiations#WithBanks | Do you feel confident talking about financing with banks and that you will obtain the desired results? And how about with equity investors/venture capital enterprises? With banks
17 | ConfidenceInNegotiations#WithEquityInvestorsOrVentureCapitalFirms | Do you feel confident talking about financing with banks and that you will obtain the desired results? And how about with equity investors/venture capital enterprises? With equity investors/venture capital enterprises
18 | FinancingGrowth_Instruments#FinancingGrowth_Instruments | If you need external financing to realise your growth ambitions, what type of external financing would you prefer most?
19 | FinancingGrowth_Amount#FinancingGrowth_Amount | If you need external financing to realise your growth ambitions over the next two to three years, what amount of financing would you aim to obtain?
20 | FinancingGrowth_TheMostImportantLimitation | What do you see as the most important limiting factor to get this financing?
21 | ExternalFinancing_Expectations#InternalFounds | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Retained earnings or sale of assets/internal funds
22 | ExternalFinancing_Expectations#BankLoans | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Bank loans (excluding overdraft and credit lines)
23 | ExternalFinancing_Expectations#EquityInvestments | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Equity capital
24 | ExternalFinancing_Expectations#TradeCredit | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Trade credit
25 | ExternalFinancing_Expectations#DebtSecuritiesIssued | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Debt securities issued
26 | ExternalFinancing_Expectations#Other | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Other
27 | ExternalFinancing_Expectations#CreditLineBankOverdraftOrCreditCardsOverdraft | Looking ahead, for each of the following types of financing available to your enterprise, please indicate whether you think their availability will improve, deteriorate or remain unchanged over the next {two quarters} OR {quarter}. Credit line, bank overdraft or credit cards overdraft
28ImportanceOfFactorsInTheFutureFinancingOfTheFirm#GuaranteesForLoansOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Guarantees for loans
29ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MeasuresToFacilitateEquityInvestmentsOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Measures to facilitate investments
30ImportanceOfFactorsInTheFutureFinancingOfTheFirm#ExportCreditsOrGuaranteesOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Export credits or guarantees
31ImportanceOfFactorsInTheFutureFinancingOfTheFirm#TaxIncentivesOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Tax incentives
32ImportanceOfFactorsInTheFutureFinancingOfTheFirm#BusinessSupportServicesOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Business support services
33ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MakingExistingPublicMeasuresEasierToObtainOn a scale of 1–10, where 10 means it is extremely important and 1 means it is not at all important, how important are each of the following factors for your enterprise’s financing in the future? Making existing public measures easier to obtain
34FirmIncomeGenerationIndicators#LabourCostHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Labour costs (including social contributions)
35FirmIncomeGenerationIndicators#OtherCostHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Other costs (materials, energy, other)
36FirmIncomeGenerationIndicators#InterestExpensesHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Interest expenses
37FirmIncomeGenerationIndicators#ProfitHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Profit
38FirmIncomeGenerationIndicators#ProfitMarginHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Profit Margin
39FirmIncomeGenerationIndicators#DebtComparedToAssetsHave the following company indicators decreased, remained unchanged or increased during the {previous quarter and current quarter} OR {current quarter}? Debt compared to assets
40FinancingApplied#ApplicationOfExternalFinancing_BankLoansHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Bank loans
41FinancingApplied#ApplicationOfExternalFinancing_TradeCreditHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Trade credit
42FinancingApplied#ApplicationOfExternalFinancing_OtherExternalFinancingHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Other external financing
43FinancingApplied#ApplicationOfExternalFinancing_CreditLineHave you applied for the following types of financing during the {previous quarter and current quarter} OR {current quarter}? Credit line
44vulnerableFirms that report, simultaneously, lower turnover, decreasing profits, higher interest expenses and higher or unchanged debt-to-assets ratio.
45permidId of SME
46waveSurvey wave
47idSurvey unique id
48CountryOfResidenceCountry of SME
49FirmSizeSize of Company
50DateDate of survey
51InvestmentReadyTarget variable
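The "vulnerable" indicator (variable 44 above) is a composite rule over four of the income-generation indicators. The sketch below illustrates that rule in plain Python; it is not the authors' code, and it assumes the survey's categorical labels ("Increased", "Decreased", "Remained unchanged") as inputs, with the function name chosen for illustration only.

```python
def is_vulnerable(turnover: str, profit: str,
                  interest_expenses: str, debt_to_assets: str) -> bool:
    """Flag a firm as vulnerable per variable 44: simultaneously lower
    turnover, decreasing profits, higher interest expenses, and a higher
    or unchanged debt-to-assets ratio (labels are the survey's categories)."""
    return (turnover == "Decreased"
            and profit == "Decreased"
            and interest_expenses == "Increased"
            and debt_to_assets in ("Increased", "Remained unchanged"))

# A firm failing any one condition is not flagged.
print(is_vulnerable("Decreased", "Decreased", "Increased", "Remained unchanged"))
print(is_vulnerable("Increased", "Decreased", "Increased", "Increased"))
```

Because all four conditions must hold at once, the flag is rare, which matches the 3.65% incidence reported for this variable in Table A2.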
Table A2. Variable statistics of training set.
# | Variable Name | Variable Type | Description/Values | Freq/Count | Percentage
1 | Autonomous | Categorical | an autonomous profit-oriented enterprise | 7117 | 81.35
  | | | part of a profit-oriented enterprise (e.g., subsidiary* or branch) not taking autonomous financial decisions | 1444 | 16.6
  | | | a subsidiary of another enterprise | 178 | 2.03
  | | | a branch of another enterprise | 8 | 0.09
  | | | [DK/NA] | 2 | 0.02
2 | MainActivity | Categorical | Industry | 3109 | 35.54
  | | | Services | 2630 | 30.06
  | | | Trade | 2164 | 24.73
  | | | Construction | 846 | 9.67
3 | Age | Categorical | 10 years or more | 6799 | 77.71
  | | | 5 years or more but less than 10 years | 1162 | 13.28
  | | | 2 years or more but less than 5 years | 481 | 5.5
  | | | [DK/NA] | 203 | 2.32
  | | | Less than 2 years | 104 | 1.19
4 | Ownership | Categorical | Family or entrepreneurs | 4387 | 50.14
  | | | A natural person, one owner only | 1809 | 20.68
  | | | Other firms or business associates | 1438 | 16.44
  | | | Public shareholders, as your company is listed on the stock market | 601 | 6.87
  | | | Other | 312 | 3.57
  | | | Venture capital firms or business angels | 160 | 1.82
  | | | DK/NA | 42 | 0.48
5 | profitable | Binary | 0 | 8148 | 93.13
  | | | 1 | 601 | 6.87
6 | MostImportantProblemFacing | Categorical | Finding customers | 1822 | 20.83
  | | | Competition | 1360 | 15.54
  | | | Availability of skilled staff or experienced managers | 1352 | 15.45
  | | | Access to finance | 1237 | 14.14
  | | | Costs of production of labour | 1039 | 11.88
  | | | Regulation and industrial regulations | 793 | 9.06
  | | | Other | 726 | 8.3
  | | | All problems are equally pressing | 300 | 3.43
  | | | DK/NA | 120 | 1.37
7 | ExternalFinancing_Factors#GeneralEconomicOutlookToObtainExternalFinancing | Categorical | Remained unchanged | 3801 | 43.45
  | | | Deteriorated | 2560 | 29.26
  | | | Improved | 1959 | 22.39
  | | | DK | 429 | 4.9
8 | ExternalFinancing_Factors#AccessToPublicFinancialSupportIncludingGuarantees | Categorical | Remained unchanged | 3243 | 37.07
  | | | Not applicable | 2993 | 34.21
  | | | Deteriorated | 1407 | 16.08
  | | | Improved | 610 | 6.97
  | | | DK | 496 | 5.67
9 | ExternalFinancing_Factors#YourFirmSpecificOutlook | Categorical | Remained unchanged | 3923 | 44.84
  | | | Improved | 3169 | 36.22
  | | | Deteriorated | 1263 | 14.44
  | | | DK | 394 | 4.5
10 | ExternalFinancing_Factors#YourFirmOwnCapital | Categorical | Remained unchanged | 4278 | 48.9
  | | | Improved | 3436 | 39.27
  | | | Deteriorated | 932 | 10.65
  | | | DK | 103 | 1.18
11 | ExternalFinancing_Factors#YourFirmCreditHistory | Categorical | Remained unchanged | 4860 | 55.55
  | | | Improved | 2796 | 31.96
  | | | Deteriorated | 713 | 8.15
  | | | DK | 372 | 4.25
  | | | Not applicable | 8 | 0.09
12 | ExternalFinancing_Factors#WillingnessOfBanksToProvideCredit | Categorical | Remained unchanged | 3415 | 39.03
  | | | Improved | 2150 | 24.58
  | | | Deteriorated | 1579 | 18.05
  | | | Not applicable | 1259 | 14.39
  | | | DK | 346 | 3.95
13 | ExternalFinancing_Factors#WillingnessOfBusinessPartnersToProvideTradeCredit | Categorical | Remained unchanged | 3611 | 41.27
  | | | Not applicable | 2615 | 29.89
  | | | Improved | 1178 | 13.46
  | | | Deteriorated | 942 | 10.77
  | | | DK | 403 | 4.61
14 | ExternalFinancing_Factors#WillingnessOfInvestorsToInvestInYourEnterprise | Categorical | Not applicable | 5782 | 66.09
  | | | Remained unchanged | 1802 | 20.6
  | | | Improved | 478 | 5.46
  | | | DK | 377 | 4.31
  | | | Deteriorated | 310 | 3.54
15 | FirmGrowth#InTermsOfEmploymentRegardingTheNumberOfFullTimeOrFullTimeEquivalentEmployees | Categorical | Less than 20% per year | 3620 | 41.37
  | | | No growth | 2201 | 25.16
  | | | Over 20% per year | 1412 | 16.14
  | | | Got smaller | 1400 | 16
  | | | DK/NA | 61 | 0.7
  | | | Not applicable, the firm is too recent | 55 | 0.63
16 | ConfidenceInNegotiations#WithBanks | Categorical | Yes | 6727 | 76.89
  | | | No | 1293 | 14.78
  | | | Not applicable | 618 | 7.06
  | | | DK | 111 | 1.27
17 | ConfidenceInNegotiations#WithEquityInvestorsOrVentureCapitalFirms | Categorical | Not applicable | 4254 | 48.62
  | | | Yes | 2442 | 27.91
  | | | No | 1758 | 20.09
  | | | DK | 295 | 3.38
18 | FinancingGrowth_Instruments#FinancingGrowth_Instruments | Categorical | Bank loan | 5627 | 64.32
  | | | Loan from other sources | 1156 | 13.21
  | | | Equity investment | 780 | 8.92
  | | | Other | 450 | 5.14
  | | | DK/NA | 439 | 5.02
  | | | Subordinated loans, participation loans or similar financing instruments | 297 | 3.39
19 | FinancingGrowth_Amount#FinancingGrowth_Amount | Categorical | Over €1 million | 1632 | 18.65
  | | | DK/NA | 1620 | 18.52
  | | | More than €25,000 and up to €100,000 | 1456 | 16.64
  | | | €100,000–€1 million | 1443 | 16.49
  | | | More than €250,000 and up to €1 million | 1163 | 13.29
  | | | More than €100,000 and up to €250,000 | 867 | 9.91
  | | | Up to €25,000 | 568 | 6.50
20 | FinancingGrowth_TheMostImportantLimitation | Categorical | There are no obstacles | 2646 | 30.23
  | | | Interest rates or price too high | 1474 | 16.85
  | | | Insufficient collateral or guarantee | 1321 | 15.1
  | | | Other | 824 | 9.42
  | | | Financing not available at all | 549 | 6.28
  | | | DK/NA | 1558 | 17.81
  | | | Reduced control over the firm | 305 | 3.49
  | | | Too much paper work | 72 | 0.82
21 | ExternalFinancing_Expectations#InternalFounds | Categorical | Will remain unchanged | 4459 | 50.97
  | | | Will improve | 2405 | 27.49
  | | | Not applicable | 1170 | 13.37
  | | | Will deteriorate | 451 | 5.15
  | | | DK | 264 | 3.02
22 | ExternalFinancing_Expectations#BankLoans | Categorical | Will remain unchanged | 4498 | 51.41
  | | | Will improve | 1793 | 20.49
  | | | Not applicable | 1480 | 16.92
  | | | Will deteriorate | 796 | 9.1
  | | | DK | 182 | 2.08
23 | ExternalFinancing_Expectations#EquityInvestments | Categorical | Not applicable | 4292 | 49.05
  | | | Will remain unchanged | 3027 | 34.6
  | | | Will improve | 979 | 11.19
  | | | Will deteriorate | 180 | 2.06
  | | | DK | 271 | 3.1
24 | ExternalFinancing_Expectations#TradeCredit | Categorical | Will remain unchanged | 4157 | 47.51
  | | | Not applicable | 2670 | 30.52
  | | | Will improve | 1205 | 13.77
  | | | Will deteriorate | 523 | 5.98
  | | | DK | 194 | 2.22
25 | ExternalFinancing_Expectations#DebtSecuritiesIssued | Categorical | Not applicable | 5605 | 64.06
  | | | Will remain unchanged | 1959 | 22.39
  | | | DK | 839 | 9.59
  | | | Will improve | 205 | 2.34
  | | | Will deteriorate | 141 | 1.62
26 | ExternalFinancing_Expectations#Other | Categorical | Not applicable | 3396 | 38.82
  | | | Will remain unchanged | 3168 | 36.3
  | | | DK | 1167 | 13.33
  | | | Will improve | 787 | 9
  | | | Will deteriorate | 231 | 2.64
27 | ExternalFinancing_Expectations#CreditLineBankOverdraftOrCreditCardsOverdraft | Categorical | Will remain unchanged | 4739 | 54.17
  | | | Not applicable | 1700 | 19.43
  | | | Will improve | 1374 | 15.7
  | | | Will deteriorate | 570 | 6.52
  | | | DK | 366 | 4.18
28 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#GuaranteesForLoans | Ordinal | 8 | 1556 | 17.78
  | | | 5 | 1467 | 16.77
  | | | 1 | 1283 | 14.66
  | | | 10 | 1232 | 14.08
  | | | 7 | 1011 | 11.56
  | | | 6 | 574 | 6.56
  | | | 9 | 475 | 5.43
  | | | 3 | 470 | 5.37
  | | | 2 | 371 | 4.24
  | | | 4 | 310 | 3.55
29 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MeasuresToFacilitateEquityInvestments | Ordinal | 1 | 2662 | 30.42
  | | | 5 | 1475 | 16.86
  | | | 8 | 867 | 9.91
  | | | 7 | 715 | 8.17
  | | | 2 | 705 | 8.06
  | | | 3 | 602 | 6.88
  | | | 10 | 551 | 6.3
  | | | 6 | 512 | 5.85
  | | | 4 | 418 | 4.78
  | | | 9 | 242 | 2.77
30 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#ExportCreditsOrGuarantees | Ordinal | 1 | 3216 | 36.76
  | | | 5 | 1175 | 13.43
  | | | 8 | 782 | 8.94
  | | | 2 | 660 | 7.54
  | | | 7 | 644 | 7.36
  | | | 10 | 610 | 6.97
  | | | 3 | 531 | 6.07
  | | | 6 | 476 | 5.44
  | | | 4 | 363 | 4.15
  | | | 9 | 292 | 3.34
31 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#TaxIncentives | Ordinal | 10 | 1886 | 21.56
  | | | 8 | 1491 | 17.04
  | | | 5 | 1310 | 14.97
  | | | 7 | 1023 | 11.69
  | | | 1 | 816 | 9.33
  | | | 9 | 632 | 7.22
  | | | 6 | 615 | 7.03
  | | | 3 | 380 | 4.34
  | | | 4 | 340 | 3.89
  | | | 2 | 256 | 2.93
32 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#BusinessSupportServices | Ordinal | 5 | 1640 | 18.74
  | | | 8 | 1193 | 13.64
  | | | 7 | 1088 | 12.44
  | | | 1 | 983 | 11.24
  | | | 6 | 856 | 9.78
  | | | 10 | 842 | 9.62
  | | | 3 | 649 | 7.42
  | | | 4 | 570 | 6.52
  | | | 2 | 493 | 5.63
  | | | 9 | 435 | 4.97
33 | ImportanceOfFactorsInTheFutureFinancingOfTheFirm#MakingExistingPublicMeasuresEasierToObtain | Ordinal | 10 | 1670 | 19.09
  | | | 8 | 1575 | 18
  | | | 5 | 1319 | 15.08
  | | | 7 | 1148 | 13.12
  | | | 6 | 772 | 8.82
  | | | 1 | 675 | 7.72
  | | | 9 | 636 | 7.27
  | | | 3 | 349 | 3.99
  | | | 4 | 338 | 3.86
  | | | 2 | 267 | 3.05
34 | FirmIncomeGenerationIndicators#LabourCost | Categorical | Increased | 5006 | 57.21
  | | | Remained unchanged | 3041 | 34.76
  | | | Decreased | 677 | 7.74
  | | | DK/NA | 25 | 0.29
35 | FirmIncomeGenerationIndicators#OtherCost | Categorical | Increased | 5535 | 63.26
  | | | Remained unchanged | 2594 | 29.65
  | | | Decreased | 585 | 6.69
  | | | DK/NA | 35 | 0.4
36 | FirmIncomeGenerationIndicators#InterestExpenses | Categorical | Remained unchanged | 4177 | 47.74
  | | | Increased | 2559 | 29.25
  | | | Decreased | 1358 | 15.52
  | | | DK/NA | 608 | 6.95
  | | | Not applicable, the firm has no debt | 47 | 0.54
37 | FirmIncomeGenerationIndicators#Profit | Categorical | Increased | 3688 | 42.15
  | | | Decreased | 2513 | 28.72
  | | | Remained unchanged | 2411 | 27.56
  | | | DK/NA | 137 | 1.57
38 | FirmIncomeGenerationIndicators#ProfitMargin | Categorical | Remained unchanged | 2945 | 33.66
  | | | Decreased | 2526 | 28.87
  | | | Increased | 1957 | 22.37
  | | | DK/NA | 1321 | 15.1
39 | FirmIncomeGenerationIndicators#DebtComparedToAssets | Categorical | Remained unchanged | 3471 | 39.67
  | | | Decreased | 2754 | 31.48
  | | | Increased | 1704 | 19.48
  | | | Not applicable, the firm has no debt | 743 | 8.49
  | | | DK | 77 | 0.88
40 | FinancingApplied#ApplicationOfExternalFinancing_BankLoans | Categorical | Did not apply because of sufficient internal funds | 4234 | 48.39
  | | | Applied | 2350 | 26.86
  | | | Did not apply for other reasons | 1630 | 18.63
  | | | Did not apply because of possible rejection | 334 | 3.82
  | | | DK/NA | 201 | 2.3
41 | FinancingApplied#ApplicationOfExternalFinancing_TradeCredit | Categorical | Did not apply because of sufficient internal funds | 3910 | 44.69
  | | | Did not apply for other reasons | 2365 | 27.04
  | | | Applied | 1939 | 22.16
  | | | DK/NA | 343 | 3.92
  | | | Did not apply because of possible rejection | 192 | 2.19
42 | FinancingApplied#ApplicationOfExternalFinancing_OtherExternalFinancing | Categorical | Did not apply because of sufficient internal funds | 4272 | 48.83
  | | | Did not apply for other reasons | 2527 | 28.88
  | | | Applied | 1373 | 15.69
  | | | DK/NA | 370 | 4.23
  | | | Did not apply because of possible rejection | 207 | 2.37
43 | FinancingApplied#ApplicationOfExternalFinancing_CreditLine | Categorical | Did not apply because of sufficient internal funds | 4259 | 48.68
  | | | Applied | 2061 | 23.56
  | | | Did not apply for other reasons | 1689 | 19.31
  | | | Did not apply because of possible rejection | 318 | 3.63
  | | | DK/NA | 422 | 4.82
44 | vulnerable | Binary | 0 | 8430 | 96.35
  | | | 1 | 319 | 3.65
Table A3. Balanced accuracy and ROC AUC score for all models.
Model | Balanced Accuracy | ROC AUC
Gradient Boosting | 0.754 | 0.815
Logistic Regression | 0.750 | 0.811
Easy Ensemble Classifier | 0.746 | 0.804
Balanced Random Forest Classifier | 0.738 | 0.810
Random Forest | 0.733 | 0.807
SVC | 0.732 | 0.793
Balanced SVC | 0.729 | 0.800
AdaBoost | 0.718 | 0.794
MultinomialNB | 0.711 | 0.790
Balanced MultinomialNB | 0.709 | 0.789
Balanced KNeighbors | 0.678 | 0.743
KNeighbors | 0.552 | 0.640
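Balanced accuracy, the headline metric of Table A3, is the unweighted mean of the per-class recalls, which is why it is informative under the heavy class imbalance of this dataset. A minimal illustration (not the authors' code) that recovers the Gradient Boosting figure of 0.754 from the per-class recalls reported in Table A4:

```python
def balanced_accuracy(recall_class0: float, recall_class1: float) -> float:
    """Unweighted mean of the two per-class recalls (sensitivity and
    specificity in the binary case); insensitive to class imbalance."""
    return (recall_class0 + recall_class1) / 2

# Gradient Boosting recalls from Table A4: 0.794 (class 0) and 0.714 (class 1)
print(round(balanced_accuracy(0.794, 0.714), 3))  # → 0.754, as in Table A3
```

Note that plain accuracy would be dominated by the majority class (non-investment-ready firms, roughly 93% of observations), so a trivial all-negative classifier would score highly on it while scoring only 0.5 on balanced accuracy.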
Table A4. Precision, recall, and F1-score for all models.
Model | Precision (Class 0) | Recall (Class 0) | F1 (Class 0) | Precision (Class 1) | Recall (Class 1) | F1 (Class 1)
Gradient Boosting | 0.931 | 0.794 | 0.857 | 0.417 | 0.714 | 0.526
Logistic Regression | 0.931 | 0.779 | 0.848 | 0.402 | 0.722 | 0.517
Easy Ensemble Classifier | 0.930 | 0.775 | 0.845 | 0.396 | 0.717 | 0.510
Balanced Random Forest Classifier | 0.928 | 0.764 | 0.838 | 0.383 | 0.711 | 0.498
Random Forest | 0.917 | 0.830 | 0.872 | 0.436 | 0.636 | 0.517
SVC | 0.920 | 0.803 | 0.858 | 0.409 | 0.660 | 0.505
Balanced SVC | 0.925 | 0.754 | 0.831 | 0.371 | 0.703 | 0.486
AdaBoost | 0.918 | 0.768 | 0.836 | 0.373 | 0.668 | 0.478
MultinomialNB | 0.918 | 0.745 | 0.823 | 0.354 | 0.676 | 0.465
Balanced MultinomialNB | 0.917 | 0.746 | 0.823 | 0.353 | 0.671 | 0.463
Balanced KNeighbors | 0.900 | 0.770 | 0.830 | 0.344 | 0.586 | 0.433
KNeighbors | 0.845 | 0.962 | 0.899 | 0.434 | 0.142 | 0.214
Figure A1. Gradient Boosting (left) and Logistic Regression (right) ROC curves.
Figure A2. Easy Ensemble (left) and Balanced Random Forest (right) ROC curves.
Figure A3. Random Forest (left) and SVC (right) ROC curves.
Figure A4. Balanced SVC (left) and AdaBoost (right) ROC curves.
Figure A5. Multinomial NB (left) and balanced multinomial NB (right) ROC curves.
Figure A6. K-Neighbors (left) and Balanced K-Neighbors (right) ROC curves.
Figure A7. Gradient Boosting confusion matrices.
Figure A8. Logistic Regression model confusion matrices.
Figure A9. Easy Ensemble Classifier model confusion matrices.
Figure A10. Balanced Random Forest model confusion matrices.
Figure A11. Random Forest model confusion matrices.
Figure A12. SVC model confusion matrices.
Figure A13. Balanced Bagging SVC model confusion matrices.
Figure A14. AdaBoost model confusion matrices.
Figure A15. MultinomialNB model confusion matrices.
Figure A16. Balanced Bagging multinomial NB model confusion matrices.

References

  1. World Bank. Small and Medium Enterprises (SMEs) Finance—Improving SMEs’ Access to Finance and Finding Innovative Solutions to Unlock Sources of Capital. 2023. Available online: https://www.worldbank.org (accessed on 3 September 2025).
  2. Mason, C.M.; Harrison, R.T. “Investment readiness”: A critique of government proposals to increase the demand for venture capital. Reg. Stud. 2001, 35, 663–668. [Google Scholar] [CrossRef]
  3. Douglas, E.J.; Shepherd, D.A. Exploring investor readiness: Assessments by entrepreneurs and investors in Australia. Ventur. Cap. 2002, 4, 219–236. [Google Scholar] [CrossRef]
  4. Fellnhofer, K. Literature review: Investment readiness level of small and medium-sized companies. Int. J. Manag. Financ. Account. 2015, 7, 268–284. [Google Scholar] [CrossRef]
  5. Mullainathan, S.; Spiess, J. Machine learning: An applied econometric approach. J. Econ. Perspect. 2017, 31, 87–106. [Google Scholar] [CrossRef]
  6. Dumitrescu, E.; Hué, S.; Hurlin, C.; Tokpavi, S. Machine learning or econometrics for credit scoring: Let’s get the best of both worlds. HAL Open Sci. 2021. Available online: https://hal.archives-ouvertes.fr/hal-02507499 (accessed on 3 September 2025).
  7. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  8. Levy, J.; Mussack, D.; Brunner, M.; Keller, U.; Cardoso-Leite, P.; Fischbach, A. Contrasting classical and machine learning approaches in the estimation of value-added scores in large-scale educational data. Front. Psychol. 2020, 11, 2190. [Google Scholar]
  9. Agresti, A. Foundations of Linear and Generalized Linear Models, 1st ed.; Wiley: Hoboken, NJ, USA, 2015; pp. 1–480. [Google Scholar]
  10. Hosmer, D.W.; Lemeshow, S.; Sturdivant, R.X. Applied Logistic Regression, 3rd ed.; Wiley: Hoboken, NJ, USA, 2013; pp. 1–528. [Google Scholar]
  11. Churpek, M.M.; Yuen, T.C.; Winslow, C.; Meltzer, D.O.; Kattan, M.W.; Edelson, D.P. Multicenter Comparison of Machine Learning Methods and Conventional Regression for Predicting Clinical Deterioration on the Wards. Crit. Care Med. 2016, 44, 368–374. [Google Scholar] [CrossRef]
  12. Hornung, O.; Smolnik, S. AI invading the workplace: Negative emotions towards the organizational use of personal virtual assistants. Electron. Markets 2022, 32, 123–138. [Google Scholar] [CrossRef]
  13. Groves, R.M.; Fowler, F.J., Jr.; Couper, M.P.; Lepkowski, J.M.; Singer, E.; Tourangeau, R. Survey Methodology, 2nd ed.; Wiley: Hoboken, NJ, USA, 2009. [Google Scholar]
  14. European Central Bank Survey on the Access to Finance of Enterprises (SAFE). 2023. Available online: https://www.ecb.europa.eu/stats/ecb_surveys/safe/html/data.en.html (accessed on 3 September 2025).
  15. Cusolito, A.P.; Dautovic, E.; McKenzie, D. Can government intervention make firms more investment-ready? A randomized experiment in the Western Balkans. Rev. Econ. Stat. 2021, 103, 428–442. [Google Scholar] [CrossRef]
  16. Owen, R.; Botelho, T.; Hussain, J.; Anwar, O. Solving the SME finance puzzle: An examination of demand and supply failure in the UK. Ventur. Cap. 2023, 25, 31–63. [Google Scholar] [CrossRef]
  17. Du, J.; Rada, R. Machine learning and financial investing. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; IGI Global: Hershey, PA, USA, 2010; pp. 375–398. [Google Scholar] [CrossRef]
  18. Khan, A.H.; Shah, A.; Ali, A.; Shahid, R.; Zahid, Z.U.; Sharif, M.U.; Jan, T.; Zafar, M.H. A performance comparison of machine learning models for stock market prediction with novel investment strategy. PLoS ONE 2023, 18, e0286362. [Google Scholar] [CrossRef] [PubMed]
  19. Alexakis, C.; Gogas, P.; Petrella, G.; Polemis, M.; Salvadè, F. Investigating the Investment Readiness of European SMEs: A Machine Learning Approach. 2025. Available online: https://ssrn.com/abstract=5007133 (accessed on 3 September 2025).
  20. Zana, D.; Barnard, B. Venture capital and entrepreneurship: The cost and resolution of investment readiness. SSRN Electron. J. 2019. [Google Scholar] [CrossRef]
  21. Wooldridge, J.M. Introductory Econometrics: A Modern Approach, 5th ed.; South-Western Cengage Learning: Mason, OH, USA, 2013. [Google Scholar]
  22. Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach, 4th ed.; Pearson Series; Pearson Education, Inc.: Upper Saddle River, NJ, USA, 2020. [Google Scholar]
  23. Cox, D.R. The regression analysis of binary sequences. J. R. Stat. Soc. Ser. B (Methodol.) 1958, 20, 215–232. [Google Scholar] [CrossRef]
  24. Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27. [Google Scholar] [CrossRef]
  25. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar]
  26. Schütze, H.; Manning, C.D.; Raghavan, P. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008; pp. 234–265. [Google Scholar]
  27. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar]
  28. Liu, X.; Wu, J.; Zhou, Z. Exploratory Undersampling for Class-Imbalance Learning. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 2009, 39, 539–550. [Google Scholar]
  29. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar]
  30. Gogas, P.; Papadimitriou, T.; Sofianos, E. Money Neutrality, Monetary Aggregates and Machine Learning. Algorithms 2019, 12, 137. [Google Scholar] [CrossRef]
  31. Brodersen, K.H.; Ong, C.S.; Stephan, K.E.; Buhmann, J.M. The balanced accuracy and its posterior distribution. In Proceedings of the 20th International Conference on Pattern Recognition 2010, Istanbul, Turkey, 23–26 August 2010; pp. 3121–3124. [Google Scholar]
  32. Hair, J.; Anderson, R.; Tatham, R.; Black, W. Multivariate Data Analysis, 5th ed.; Prentice Hall: Upper Saddle River, NJ, USA, 1998. [Google Scholar]
  33. López, V.; Fernández, A.; García, S.; Palade, V.; Herrera, F. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Inf. Sci. 2013, 250, 113–141. [Google Scholar] [CrossRef]
  34. Pes, B.; Lai, G. Cost-sensitive learning strategies for high-dimensional and imbalanced data: A comparative study. PeerJ Comput. Sci. 2021, 7, e832. [Google Scholar] [CrossRef] [PubMed]
  35. Araf, I.; Idri, A.; Chairi, I. Cost-sensitive learning for imbalanced medical data: A review. Artif. Intell. Rev. 2024, 57, 80. [Google Scholar] [CrossRef]
  36. Peykani, P.; Peymany Foroushany, M.; Tanasescu, C.; Sargolzaei, M.; Kamyabfar, H. Evaluation of Cost-Sensitive Learning Models in Forecasting Business Failure of Capital Market Firms. Mathematics 2025, 13, 368. [Google Scholar] [CrossRef]
  37. Mason, C.M.; Kwok, J. Investment Readiness Programmes and Access to Finance: A Critical Review of Design Issues; Working Paper 10-03; Hunter Centre for Entrepreneurship, University of Strathclyde: Scotland, UK, 2010. [Google Scholar]
  38. Brush, C.G.; Edelman, L.F.; Manolova, T.S. Ready for funding? Entrepreneurial ventures and the pursuit of angel financing. Ventur. Cap. 2012, 14, 111–129. [Google Scholar] [CrossRef]
  39. Landström, H. The ivory tower of business angel research. Ventur. Cap. 2019, 21, 97–119. [Google Scholar] [CrossRef]
Figure 1. A graphical representation of a 3-fold cross-validation process. For every set of hyperparameters tested, each fold serves as a test sample, while the remaining folds are used to train the model. The average performance for each set of parameters over the k test folds is used to identify the best model [30].
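The split-and-rotate logic described in Figure 1 can be sketched in a few lines. The snippet below is an illustrative pure-Python index splitter, not the authors' implementation (in practice this is typically delegated to a library routine such as scikit-learn's KFold or GridSearchCV):

```python
def kfold_indices(n_samples: int, k: int):
    """Yield (train_idx, test_idx) pairs: each of the k folds serves as
    the test sample exactly once, and the other k-1 folds form the
    training sample, mirroring the rotation shown in Figure 1."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With 9 observations and k = 3, we get 3 splits of 6 train / 3 test indices.
splits = list(kfold_indices(9, 3))
print(len(splits))  # → 3
```

For hyperparameter selection, the model's average score across the k test folds is computed for every candidate parameter set, and the best-scoring set is retained, exactly as the caption describes.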
Figure 2. Confusion matrix.
Figure 3. ROC AUC.
Figure 4. Balanced accuracy and ROC AUC.
Figure 5. Class 0 metrics.
Figure 6. Class 1 metrics.
Figure 7. Gradient Boosting ROC curve.
Figure 8. Gradient Boosting confusion matrix.
Figure 9. Gradient Boosting confusion matrix—normalized.
Figure 10. Gradient Boosting variable importance plot.
Figure 11. SHAP summary for the Gradient Boosting model.
Figure 12. The Cost-Optimized Gradient Boosting normalized confusion matrix.
Figure 13. The Cost-Optimized Gradient Boosting confusion matrix.
Table 1. Comparison of models based on False Positives (FP), False Negatives (FN), and Total Misclassification Cost (TMC).
Model | FP | FN | TMC
Gradient Boosting | 438 | 94 | 908
Logistic Regression | 379 | 107 | 914
Easy Ensemble Classifier | 409 | 103 | 924
Random Forest | 441 | 99 | 936
AdaBoost | 386 | 111 | 941
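Total Misclassification Cost is a weighted sum of the two error counts, with false negatives (missed investment-ready SMEs) penalized more heavily than false positives. The exact cost weights are not restated in this table, but every row is reproduced by charging 1 unit per false positive and 5 units per false negative; treat that 1:5 ratio as an inference from the table, illustrated in the sketch below (not the authors' code):

```python
def total_misclassification_cost(fp: int, fn: int,
                                 cost_fp: int = 1, cost_fn: int = 5) -> int:
    """TMC = cost_fp * FP + cost_fn * FN. The default 1:5 weighting is
    inferred from Table 1, where it reproduces all five reported values."""
    return cost_fp * fp + cost_fn * fn

print(total_misclassification_cost(438, 94))   # Gradient Boosting → 908
print(total_misclassification_cost(379, 107))  # Logistic Regression → 914
```

Under this cost structure, Gradient Boosting attains the lowest TMC despite having more false positives than Logistic Regression, because its 13 fewer false negatives more than offset them at the 5:1 penalty.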

Share and Cite

MDPI and ACS Style

Gogas, P.; Papadimitriou, T.; Goumenidis, P.; Kontos, A.; Giannakis, N. Identification of Investment-Ready SMEs: A Machine Learning Framework to Enhance Equity Access and Economic Growth. Forecasting 2025, 7, 51. https://doi.org/10.3390/forecast7030051
