1. Introduction
Research shows that no company can be sure of its future, even in times of peace and prosperity. The risk of corporate bankruptcy is highly relevant today and is being addressed by many researchers. Interest in the problem has accelerated following the events of the last few years (COVID-19, the war in Ukraine), especially in Europe. Early signals of bankruptcy need to be detected so that business managers can pay increased attention to them and prevent bankruptcy. Various methods of selecting bankruptcy prediction features, as well as various bankruptcy prediction models, serve this purpose. Prior research indicates that domain knowledge plays a significant role in this process and, when combined with a suitable prediction method, can provide significant results. Examples include the studies of
Veganzones and Severin (
2021), who selected features based on their popularity in the prior literature, the study of
Min and Lee (
2008), who used expert opinion, or the study of
Zhou et al. (
2015), who applied a domain knowledge approach. Features frequently used in bankruptcy prediction are
Altman’s (
1968) features. They were used in the study of
Hu (
2009) and that of
De Andrés et al. (
2011).
Barboza et al. (
2017) combined the features of
Altman (
1968) with the features of
Carton and Hofer (
2006), which have a greater impact on financial performance models in the short term. Similarly,
Du Jardin (
2015) applied financial ratios traditionally used in the literature since
Altman (
1968). These ratios were chosen based on the main financial dimensions which govern bankruptcy.
Tseng and Hu (
2010) used features inspired by the research of
Lin (
1999) and
Lin and Piesse (
2004).
Several studies (
Kirkos 2015;
Zvarikova et al. 2017;
Kovacova et al. 2019) have examined the occurrence of individual features in bankruptcy prediction models. We followed up on the results of the study of
Kovacova et al. (
2019), who reviewed the most frequently used bankruptcy prediction features in Visegrad-group countries.
Based on the above, the research question was as follows: Which way of selecting financial features for a Data Envelopment Analysis (DEA) model yields higher model performance: the domain knowledge approach or a data mining technique, namely LASSO regression?
This paper follows previous research aimed at finding the most appropriate method of selecting features for DEA models. Previous studies rarely compare domain knowledge and data mining techniques when selecting features; the two approaches are mostly considered individually. This study aims to fill this gap. The LASSO+DEA approach is applied, and its performance is compared with that of DEA using features selected on the basis of expert opinion (the DK+DEA approach).
In line with the above, the aim of this paper was to select the most suitable financial features for bankruptcy prediction based on the comparison of the performance of DEA prediction models.
The remainder of the paper is structured as follows: The Literature Review Section presents different approaches to defining bankruptcy risk and lists studies dealing with methods and features applied in bankruptcy prediction. The Materials and Methods Section describes the research sample and methods used for feature selection and bankruptcy prediction. The Results Section offers the results of feature selection with the use of the domain knowledge and LASSO methods and uses them to create VRS DEA models. The Discussion Section compares the results of the DK+DEA and LASSO+DEA models and discusses them from the point of view of their performance and applied features. The Conclusion Section presents the contributions, added value, limitations and future direction of this research.
2. Literature Review
Determining corporate bankruptcy risk is one of the main challenges of economic and financial research as well as one of the most important issues for investors and decision-makers (
Korol 2019). Predicting, measuring and assessing the risk of bankruptcy of a company is of particular interest to investors before investing their capital, as the optimization of risk is a prerequisite for the maximum capital profit of the investment, which will ensure payment of dividends. However, value maximization can only occur if capital providers selectively choose a profitable and sustainable business from which they can obtain the maximum share of business income (
Agustia et al. 2020). The risk of bankruptcy is an important topic in many scientific articles, which is primarily reflected in the implications for the stakeholders’ decisions (
Lukason and Camacho-Miñano 2019). Bankruptcy risk (insolvency) can be understood as “the company’s inability to meet maturing obligations resulting either from current operations, whose achievement conditions the continuation of activity, or from compulsory levies” (
Bordeianu et al. 2011, p. 250). According to
Achim et al. (
2012), the risk of business bankruptcy is closely related to economic and financial risk. While financial risk is determined by the level of indebtedness, economic risk is dependent on the ratio of fixed and variable costs. It can be said that, in general, knowledge of these risks makes it possible to quantify the risk of bankruptcy of the company. Bankruptcy risk is the risk of a company no longer being able to meet its debt obligations. This risk is also referred to as the risk of failure or insolvency (
Campbell 2011).
Bankruptcy risk represents a constant threat to businesses, which determines how long they will survive (
Khan et al. 2020). Moreover, if a business goes bankrupt, the probability of bankruptcy in connected businesses increases (
Battiston et al. 2007), which can have a negative effect on the entire economy. Therefore, predicting the risk of bankruptcy is the subject of many research studies that search for the most suitable bankruptcy prediction model as well as the features that best describe bankruptcy.
Research on bankruptcy prediction dates back to
Fitzpatrick (
1932), who was the first to examine the financial conditions of bankrupt and non-bankrupt firms by comparing the values of their financial ratios. He found that there are significant differences between bankrupt and non-bankrupt companies, especially between liquidity, debt and turnover indicators (
Fejér-Király 2015). In the early days of the development of bankruptcy prediction models, discriminant analysis (DA) was very popular.
Beaver (
1966) applied univariate discriminant analysis to investigate the predictive ability of 30 financial ratios. The best discriminating factor was identified as the working capital/debt ratio. The second one was the net income/total assets ratio (
Gameel and El-Geziry 2016). Despite the criticism, this method was a starting point for the development of other models. The most famous bankruptcy-risk-scoring model, known as Z-score, was published by Altman in 1968 (
Voda et al. 2021). This model was developed with the use of multiple discriminant analysis. Since the introduction of Altman’s model, many other authors (
Deakin 1972;
Altman et al. 1977;
Norton and Smith 1979;
Taffler 1983) developed their models based on multiple discriminant analysis. In the 1980s, logistic regression analysis was introduced to bankruptcy prediction, followed by probit analysis. The first logistic regression model intended to predict the financial situation of businesses was developed by
Ohlson (
1980). In the next period, many authors (
Kim and Gu 2006;
Mihalovic 2016;
Barreda et al. 2017;
Khan 2018;
Affes and Hentati-Kaffel 2019) compared the accuracy of the multiple discriminant analysis model and the logistic regression model. These two models were the most used parametric models in bankruptcy prediction (
Fejér-Király 2015). Probit analysis has not been as widely used as logistic regression. The first probit model was developed by
Zmijewski (
1984), followed by
Zavgren (
1985). Since the 1990s, the development of computer science has enabled the use of more computationally demanding methods in bankruptcy prediction. These methods are mainly non-parametric. Within them,
Mousavi et al. (
2023) identify two main groups: machine learning and artificial intelligence, and operations research. The most used methods in the machine learning and artificial intelligence group include artificial neural networks, such as those used by
Messier and Hansen (
1988),
Odom and Sharda (
1990),
Atiya (
2001) and
Abid and Zouari (
2002), decision trees (
Frydman et al. 1985;
Chen et al. 2011;
Stankova and Hampel 2018), the Bayesian models (
Sarkar and Sriram 2001;
Aghaie and Saeedi 2009;
Cao et al. 2022), genetic algorithms (
Kingdom and Feldman 1995;
Alfaro-Cid et al. 2007;
Bateni and Asghari 2020), modeling based on rough sets (
Ahn et al. 2000;
Wang and Wu 2017) and support vector machines (
Huang et al. 2004;
Olson et al. 2012).
The main method within operations research is Data Envelopment Analysis (DEA). It was first used to predict corporate failure by
Simak (
1997). In his master’s thesis, he compared the results of DEA with the results of Altman’s Z-score. In recent years, numerous models based on Data Envelopment Analysis have been developed to predict bankruptcy, and their results have been compared with those achieved using other techniques.
Cielen et al. (
2004) found that DEA outperformed a discriminant analysis model and a rule induction (C5.0) model in terms of their classification accuracy.
Ouenniche and Tone (
2017) proposed the out-of-sample evaluation of decision-making units by applying DEA. The out-of-sample framework was based on an instance of case-based reasoning methodology. They found that “DEA as a classifier is a real contender to Discriminant Analysis, which is one of the most commonly used classifiers by practitioners” (
Ouenniche and Tone 2017, p. 249).
Premachandra et al. (
2009) compared the results of an additive DEA model with the results of a Logit model. They found that DEA outperformed the Logit model in evaluating bankruptcy out of sample.
Condello et al. (
2017, p. 2186) found that DEA has “a greater capacity for bankruptcy prediction, while Logit Regression and Discriminant Analysis perform better in non-bankruptcy and overall prediction in the short term”.
Janova et al. (
2012) achieved similar results. They found that the additive DEA model seems to perform well in correctly identifying bankrupt agricultural businesses. On the other hand, it is less powerful when identifying non-bankrupt agricultural businesses. The performance of DEA models is assessed mainly with the use of sensitivity, specificity, or overall accuracy. In this regard,
Premachandra et al. (
2011) pointed out that the cut-off point of 0.5 traditionally used to classify bankrupt and non-bankrupt businesses may not be appropriate for the DEA model. According to these authors, “depending on the precision with which predictions for bankrupt and non-bankrupt businesses need to be done, the decision maker has to determine an appropriate cut-off point” (
Premachandra et al. 2011, p. 623).
Stefko et al. (
2020) determined the optimal cut-off of the additive DEA model at a point in which the sum of sensitivity and specificity is the highest.
Stankova and Hampel (
2023) selected an optimal threshold by applying the Youden index and distance from the corner. They found that “selecting a suitable threshold improves specificity visibly with only a small reduction in the total accuracy” (
Stankova and Hampel 2023, p. 129).
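The threshold rules cited above (the maximum sum of sensitivity and specificity, and the Youden index) pick the same cut-off, because the Youden index is J = sensitivity + specificity − 1. The following is a minimal sketch of that search, not the implementation of any of the cited studies. It assumes that a business is flagged as bankrupt when its score is at or above the cut-off; the function name `youden_cutoff` and the sample scores are hypothetical.

```python
from typing import List, Tuple


def youden_cutoff(scores: List[float], is_bankrupt: List[bool]) -> Tuple[float, float]:
    """Return (cut-off, J) maximizing J = sensitivity + specificity - 1.

    Assumption: a business is flagged as bankrupt when score >= cut-off.
    """
    positives = sum(is_bankrupt)
    negatives = len(is_bankrupt) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_cutoff, best_j = 0.0, float("-inf")
    # Only observed score values need to be tried as candidate cut-offs.
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, b in zip(scores, is_bankrupt) if b and s >= cutoff)
        tn = sum(1 for s, b in zip(scores, is_bankrupt) if not b and s < cutoff)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:  # on ties, the lowest cut-off is kept
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical scores for illustration only; these are not data from any study.
scores = [0.95, 0.90, 0.70, 0.85, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, False, False, False, False]
cutoff, j = youden_cutoff(scores, labels)
print(cutoff, round(j, 2))
```

With these illustrative scores, the search keeps all three bankrupt businesses above the cut-off while misclassifying one non-bankrupt business, which is the trade-off the quoted passage describes.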
In the development of the above-mentioned models, the variables included in the model are as important as the method applied (
Nurcan and Köksal 2021). In order to select appropriate variables from high-dimensional datasets, various dimensionality reduction methods can be applied. Depending on whether the original features are transformed into new features or not, feature extraction methods and feature selection methods are differentiated (
Wang et al. 2016;
Li et al. 2020). Feature extraction methods transform existing features into a lower-dimensional space (new set of features) while preserving the original relative distance between the features (
Subasi and Gursoy 2010;
Li et al. 2020). Well-known feature extraction methods often used in current research include Principal Component Analysis (
Adisa et al. 2019;
Karas and Reznakova 2020), Multidimensional Scaling (
Tang et al. 2020) and Isometric Mapping (
Gao et al. 2020). Since the new set of features is different from the original ones, it may be difficult to interpret them (
Wang et al. 2016). When using feature selection methods, the original features are sorted according to specific criteria and features with the highest ranking are selected to form a subset (
Li et al. 2020). Among the feature selection methods, we can differentiate between filter, wrapper, embedded and combined methods (
Liu et al. 2018). Filter methods examine each feature independently while ignoring the individual performance of the feature in relation to the group. Within filter methods, researchers frequently use the
t-test (
Chandra et al. 2009;
Xiao et al. 2012), correlation analysis (
Zhou et al. 2012) and stepwise methods (
Lin et al. 2010). Wrapper methods use machine learning algorithms to evaluate the performance of selected feature subsets. Within them, decision trees (
Ratanamahatana and Gunopulos 2003), Naive Bayes (
Chen et al. 2009), artificial neural networks (
Ledesma et al. 2008) and genetic algorithms (
Amini and Hu 2021) are often used. The results of wrapper methods are often superior to the results of filter methods; however, the computational cost of wrapper methods is high. Embedded methods integrate feature selection and learning procedures. Important embedded techniques are regularization approaches, which have recently attracted growing interest, for example, LASSO (
Fonti and Belitser 2017;
Cao et al. 2022;
Paraschiv et al. 2021), and Elastic net (
Jones et al. 2016;
Amini and Hu 2021). Combined methods include different types of feature selection measures, such as filter and wrapper.
Various methodologies have been applied to select features for DEA models.
Cielen et al. (
2004) selected variables according to how effectively they had predicted bankruptcy in prior research. Similarly,
Psillaki et al. (
2010) focused on financial ratios which appeared to be most successful in previous studies.
Premachandra et al. (
2009) approached this issue in the same way. When creating DEA models, they used ratios which were applied in past bankruptcy literature, and some of them were the same as the ratios used by
Altman (
1968) and
Cielen et al. (
2004). The ratios selected by
Premachandra et al. (
2009) were later applied in the study of
Condello et al. (
2017) and other studies.
Min and Lee (
2008) combined expert opinion and factor analysis when selecting features for DEA models. The resulting set of indicators contained the most relevant financial classification dimensions, while taking into account the mathematical relationships among ratios as well.
Sueyoshi and Goto (
2009) applied Principal Component Analysis to reduce the number of financial factors in order to reduce the computational burden of the DEA-DA model.
Stefko et al. (
2021) used Principal Component Analysis and Multidimensional Scaling when selecting inputs and outputs for DEA models.
Huang et al. (
2015) selected variables for DEA models based on gray relational analysis. They proved this method to be an effective technique for obtaining variables for DEA models. Gray relational analysis was later used in this way by
Nurcan and Köksal (
2021) as well.
Lee and Cai (
2020) dealt with the curse of dimensionality in DEA. They proposed the LASSO variable selection technique and combined it with sign-constrained convex nonparametric least squares (SCNLS) to support estimating the production function using DEA for small datasets. They also showed that this approach provides useful guidelines for DEA with small datasets.
Chen et al. (
2021) were inspired by their approach and proposed a simplified two-step LASSO+DEA approach to handle the dimensionality of data entering the DEA models via LASSO. They used standard cross-validation LASSO to select an optimal number of regressors. These regressors were used in the DEA model. As an important advantage of this approach over that of
Lee and Cai (
2020),
Chen et al. (
2021) state that the tuning parameter λ was not chosen manually but was determined by optimizing the classical cross-validation criterion to select the relevant variables.
4. Results
From the results presented in
Table 3, it is clear that the analyzed sample of companies achieved the required liquidity values, which means that most of the companies are able to pay their liabilities. Since liquidity is one of the indicators of companies’ financial risk and low values can put companies in a state of financial distress, these results can be evaluated positively. Similarly acceptable results are indicated by the median of the indicator net working capital to current assets, which is 21%. This value is not optimal, but it can be considered acceptable from the point of view of financial risk. The results of the profitability indicators can also be evaluated positively, as their medians are positive and range from 2% to 9%. The costs ratio (0.98) is consistent with this. The value of this indicator leaves companies room for profit creation.
The total asset turnover ratio reaches a value of (1.46 or 1.43), which can be considered an adequate turnover rate considering the subject of business activity.
Less favorable results are indicated by the indebtedness values. The share of liabilities in total assets is up to 68%, 56% of which are short-term liabilities. Liabilities are 1.69 times higher than the company’s equity and, thus, the liabilities to equity ratio does not reach the required optimal value. It is precisely the indebtedness of businesses that can be considered a weak point of the analyzed sample, which represents a risk of financial distress for them.
Based on the research of
Kovacova et al. (
2019) and the procedure described in
Section 3.2 (Selection of financial features), the selected DK features were as follows: Total revenues to total assets, Current ratio, Net working capital to total assets, Return on assets with EAT, Return on equity, Netto cash flow to liabilities and Liabilities to total assets. These features were selected with the use of the correlation matrix.
The most relevant predictors according to LASSO penalized logistic regression were identified by optimizing the value of λ using 10-fold cross-validation. At the optimal λ value of 0.0071, 7 of the 26 financial ratios exhibit non-zero coefficients (see
Table 4). These indicators are as follows: Liabilities to total assets, Return on costs, Return on equity, Short-term liabilities to total assets, Net working capital to total assets, Netto cash flow to total assets, and Total asset turnover ratio. The coefficients of the remaining indicators were shrunk to zero. A similar approach was used in the study of
Chen et al. (
2021), who used LASSO with the tuning parameter λ selected by optimizing the cross-validation criterion. In this way, the authors selected the relevant variables optimally before deploying DEA on these variables. A simplified LASSO+DEA approach was also used by
Lee and Cai (
2020). However, these authors chose to manually tune parameter λ.
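For orientation, the selection step described above can be written as the standard L1-penalized logistic regression problem. This is the textbook formulation, offered as a sketch; the exact scaling of the loss and of λ depends on the software used, which is not specified here, so the optimal value of 0.0071 is meaningful only under the authors’ convention. The symbols $y_i$ (1 for a bankrupt business, 0 otherwise), $x_i$ (the vector of the 26 financial ratios of business $i$) and $n$ (the number of businesses) are notation assumed for this illustration.

```latex
\hat{\beta}(\lambda) \;=\; \arg\min_{\beta_0,\,\beta}\;
  -\frac{1}{n}\sum_{i=1}^{n}\Big[\, y_i \log p_i + (1-y_i)\log(1-p_i) \Big]
  \;+\; \lambda \sum_{j=1}^{26} \lvert \beta_j \rvert ,
\qquad
p_i \;=\; \frac{1}{1+\exp\!\big(-(\beta_0 + x_i^{\top}\beta)\big)} .
```

The absolute-value penalty is what shrinks some coefficients exactly to zero. A larger λ removes more ratios, and cross-validation chooses the λ with the best out-of-fold fit; the ratios with non-zero $\hat{\beta}_j$ at that λ are the selected features.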
Features selected with the use of the DK approach and LASSO penalized logistic regression were used as inputs and outputs for the VRS DEA models. Two VRS DEA models were formulated—the model with the application of DK features (DK+DEA) and the model with the application of LASSO features (LASSO+DEA). Their results are compared in
Table 5. In the case of the DK+DEA model, 41 businesses lie on the financial distress frontier. The LASSO+DEA model identified 13 fewer businesses lying on the financial distress frontier. In the case of DK+DEA, the most numerous group of enterprises is located in the efficiency interval
; on the contrary, in the case of LASSO+DEA, the largest number of identified enterprises is in the interval
.
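As a reminder of what a VRS DEA score is, a common input-oriented envelopment form of the VRS (BCC) model is sketched below. This is an assumption made for illustration: the orientation and the assignment of ratios to inputs and outputs follow the paper’s methods section, which may specify a different variant. The notation ($x_{ij}$ for input $i$ of business $j$, $y_{rj}$ for output $r$, $\mu_j$ for intensity weights, $o$ for the evaluated business) is likewise introduced here, with $\mu$ used instead of the conventional λ to avoid confusion with the LASSO tuning parameter.

```latex
\begin{aligned}
\theta_o^{*} \;=\; \min_{\theta,\;\mu}\quad & \theta \\
\text{s.t.}\quad
 & \sum_{j=1}^{n} \mu_j x_{ij} \;\le\; \theta\, x_{io}, && i = 1,\dots,m, \\
 & \sum_{j=1}^{n} \mu_j y_{rj} \;\ge\; y_{ro},          && r = 1,\dots,s, \\
 & \sum_{j=1}^{n} \mu_j = 1, \qquad \mu_j \ge 0,        && j = 1,\dots,n .
\end{aligned}
```

The convexity constraint $\sum_j \mu_j = 1$ is what distinguishes variable returns to scale from the constant-returns model. A business with $\theta_o^{*} = 1$ lies on the frontier, which in this application is the financial distress frontier.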
For better comparability of the results, the optimal cut-offs for both models were determined with the use of the Youden index.
The optimal cut-off of the LASSO+DEA model was determined at the level of 0.59. The classification accuracy for bankrupt businesses at this cut-off was 79.10% (see
Table 6). The classification accuracy for non-bankrupt businesses achieved a higher value, 86.66%. The overall classification accuracy of the LASSO+DEA model at a cut-off of 0.59 was 86.29%.
In the case of the DK+DEA model, the optimal cut-off was determined at the level of 0.89. At this cut-off, the DK+DEA model achieved high classification accuracy for bankrupt businesses, 97.01%, and lower classification accuracy for non-bankrupt businesses, 78.72% (see
Table 7). The overall classification accuracy of the DK+DEA model was 79.63%. A slightly higher overall classification accuracy of the DEA model with the application of DK features (85.1%) was achieved by
Cielen et al. (
2004). DEA models using DK features developed by
Premachandra et al. (
2009) achieved an overall classification accuracy of 74–86%. Similar to our results, these models achieved higher classification accuracy for bankrupt businesses.
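The overall accuracies reported above sit much closer to the non-bankrupt accuracy than to the bankrupt accuracy. This follows from the identity accuracy = p · sensitivity + (1 − p) · specificity, where p is the share of bankrupt businesses in the sample. The sketch below solves that identity for p using the rounded percentages from the text. The function name is hypothetical, and because the inputs are rounded, the result is only an approximation of the sample composition described in the methods section.

```python
def implied_bankrupt_share(sensitivity: float, specificity: float, accuracy: float) -> float:
    """Solve accuracy = p * sensitivity + (1 - p) * specificity for p."""
    if sensitivity == specificity:
        raise ValueError("share is not identified when sensitivity equals specificity")
    return (specificity - accuracy) / (specificity - sensitivity)


# Rounded percentages as reported in the text for the two models.
lasso_share = implied_bankrupt_share(0.7910, 0.8666, 0.8629)
dk_share = implied_bankrupt_share(0.9701, 0.7872, 0.7963)
print(round(lasso_share, 3), round(dk_share, 3))
```

Both models imply a bankrupt share of roughly 5%, so overall accuracy is driven almost entirely by how well non-bankrupt businesses are classified. This is why the DK+DEA model can have the higher accuracy for bankrupt businesses and still the lower overall accuracy.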
Based on the results presented in
Table 6 and
Table 7, we can conclude that the DK+DEA model performs better when identifying bankrupt businesses. This suggests that features selected via the DK approach are more suitable for bankruptcy prediction. The selection of DK features, which produced the DEA model with the higher classification accuracy for bankrupt businesses, fulfills the aim set in this paper.
The confirmation of this result can also be seen on the ROC curve (see
Figure 1). The results show that both DEA models achieved excellent classification accuracy; however, the classification accuracy of the DK+DEA model was slightly higher.
5. Discussion
The summary of the research results shows interesting findings. By applying different features, the models achieved different classification accuracies.
Table 8 and
Figure 2 show the comparison of the bankruptcy prediction results achieved using the DEA model when applying features selected via DK and LASSO.
The analysis shows that when applying DK features, the VRS DEA model confirmed the assumption of bankruptcy in 44 businesses, which is 13 more than when applying LASSO features and 23 fewer than the number of businesses assumed to be bankrupt. When applying LASSO features, the model confirmed the assumption in 36 fewer businesses than were assumed to be bankrupt.
On the other hand, in the case of LASSO, only 3 of 31 businesses were incorrectly identified. These results indicate that the DK+DEA model has a better classification accuracy in relation to the assumption of bankruptcy. However, LASSO+DEA shows a smaller deviation in the number of identified businesses on the financial distress frontier.
Based on the above results, the application of feature selection using the LASSO method appears to be more appropriate. However, it is necessary to continue the analysis and apply other procedures and methods.
To analyze the results in more detail, it is necessary to specify the features used in both cases. They are presented in
Table 9.
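The two feature sets named in the Results section can also be compared directly. The sketch below uses the indicator names exactly as they appear in the text; the variable names are hypothetical. Matching is by name only, so closely related ratios with different names are counted as different features.

```python
# Indicator names copied from the Results section.
dk_features = {
    "Total revenues to total assets",
    "Current ratio",
    "Net working capital to total assets",
    "Return on assets with EAT",
    "Return on equity",
    "Netto cash flow to liabilities",
    "Liabilities to total assets",
}
lasso_features = {
    "Liabilities to total assets",
    "Return on costs",
    "Return on equity",
    "Short-term liabilities to total assets",
    "Net working capital to total assets",
    "Netto cash flow to total assets",
    "Total asset turnover ratio",
}

shared = sorted(dk_features & lasso_features)       # selected by both approaches
only_dk = sorted(dk_features - lasso_features)      # selected by DK only
only_lasso = sorted(lasso_features - dk_features)   # selected by LASSO only

print("shared:", shared)
print("DK only:", only_dk)
print("LASSO only:", only_lasso)
```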
The two selections agree on three indicators, highlighted in italics in
Table 9. However, the selection based on the experience of experts seems to be more relevant, as it also includes Current ratio and Insolvency ratio (Netto cash flow to liabilities). Many authors consider these indicators to be important predictors of bankruptcy. This can be confirmed by the definitions of financial health of several authors.
Szilagyi (
2004) defined a financially healthy business, and, within his definition, he pointed out that such a business is not expected to become insolvent and does not show any sign of a threat to its existence, and it is even able to adequately cover the risks related to indebtedness. The importance of ability to pay was also pointed out by
Koh et al. (
2015), who defined financial distress as a situation when a business cannot pay the amount owed on the due date.
Platt et al. (
1995) argue that financial distress occurs when the total value of a company’s assets is lower than the total value of creditors’ claims. In the long term, this situation can lead to forced liquidation or bankruptcy. For this reason, financial distress is often referred to as a harbinger of bankruptcy and is related to the availability of liquid funds and credit (
Hendel 1996).
Gestel et al. (
2006) characterize financial distress and financial failure as the result of chronic losses that cause a disproportionate increase in liabilities accompanied by a loss of assets’ value. It is possible to mention other authors who talk about the ability to repay obligations as an important predictor of bankruptcy (
Campbell 2011;
Achim et al. 2012).
This means that a financially healthy company is able to pay its obligations and has fulfilled the purpose of its existence, which is to be profitable. Therefore, the inclusion of the indicators Current ratio, Netto cash flow to liabilities and Return on assets among the DK features is justified. This selection seems to be much more relevant than the LASSO selection.
On the other hand, it should be pointed out that the Current ratio was used as one of the criteria when establishing the assumption of bankruptcy. This indicator was selected as one of the DK features as well. This fact could affect the results of the DK+DEA model.
Table 10 shows the overall performance of the constructed DEA models. The DK+DEA model achieved slightly better AUC and Somers’ D than the LASSO+DEA model. On this basis, we can conclude that the selection of DK features is more appropriate than the selection of LASSO features when predicting the bankruptcy of businesses. Compared with the literature, the results of previous studies are slightly different.
Zhou et al. (
2015) found that there is no significant difference between the classification performance of models with feature selection guided by data mining techniques and that of those guided by domain knowledge. The findings of
Lin et al. (
2014), who revealed that a model with LASSO-based feature selection achieved a slightly higher performance in terms of accuracy as well as AUC compared to DK, are also slightly different. However, the comparability of these studies depends on several factors, e.g., research sample, used model, etc.
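The two measures reported in Table 10 are closely linked: for a binary outcome, Somers’ D equals 2 · AUC − 1, so they always rank models in the same order. A minimal sketch of both follows, computing AUC as the share of (bankrupt, non-bankrupt) pairs in which the bankrupt business receives the higher score. The function names and the sample scores are hypothetical, and the assumption that higher scores indicate bankruptcy is made for illustration.

```python
from typing import List


def auc_pairwise(scores: List[float], is_bankrupt: List[bool]) -> float:
    """AUC as the share of (bankrupt, non-bankrupt) pairs ranked correctly.

    Assumption: higher scores indicate bankruptcy; tied pairs count as 0.5.
    """
    pos = [s for s, b in zip(scores, is_bankrupt) if b]
    neg = [s for s, b in zip(scores, is_bankrupt) if not b]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def somers_d(auc: float) -> float:
    """Somers' D for a binary outcome, derived from AUC."""
    return 2.0 * auc - 1.0


# Hypothetical scores for illustration only; these are not data from the paper.
scores = [0.95, 0.90, 0.70, 0.85, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, False, False, False, False]
auc = auc_pairwise(scores, labels)
d = somers_d(auc)
print(round(auc, 4), round(d, 4))
```

The pairwise loop is quadratic in the sample size, which is acceptable for a sketch; rank-based formulas give the same value more efficiently on large samples.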
6. Conclusions
In this research, the features for DEA models were selected with the use of the domain knowledge approach and the LASSO approach. According to DK, the following bankruptcy prediction indicators were chosen: Total revenues to total assets, Current ratio, Net working capital to total assets, Return on assets with EAT, Return on equity, Netto cash flow to liabilities, and Liabilities to total assets. LASSO identified the following predictors of bankruptcy: Liabilities to total assets, Return on costs, Return on equity, Short-term liabilities to total assets, Net working capital to total assets, Netto cash flow to total assets, and Total asset turnover ratio. Subsequently, the performance of the DK+DEA and LASSO+DEA models was compared. The two selections performed differently and had different optimal cut-offs. For the selection of features according to LASSO, the optimal cut-off was 0.59, which means that businesses were identified as bankrupt from this value upward. In the case of selecting features based on DK, the optimal cut-off value was at the level of 0.89. Based on this fact, it can be concluded that in the case of DK feature selection, more indicators were identified as predictors of businesses’ bankruptcy. Important predictors of bankruptcy found with the DK application include Current ratio, Insolvency ratio (Netto cash flow to liabilities) and Return on assets, which are missing in features selected via LASSO. These features are significant predictors of bankruptcy that are applied in many bankruptcy prediction studies (
Reznakova and Karas 2014;
Lin et al. 2014;
Pavlicko and Mazanec 2022).
The contribution of the paper is the application of DK and LASSO features and VRS DEA model in the evaluation of the financial failure of businesses. The results revealed that the DK+DEA model achieved higher classification and prediction accuracy compared to the LASSO+DEA model. On the other hand, there is a smaller deviation in the number of identified businesses on the financial distress frontier in the LASSO+DEA model.
The added value of this research lies in pointing out the importance of the indicators Current ratio and Return on assets, which were the criteria used to establish the assumption of bankruptcy. Since these indicators entered the DK+DEA model as well, this model achieved higher classification accuracy compared to LASSO+DEA. Therefore, it is necessary to pay more attention to the selection of criteria for determining the assumption of bankruptcy and subsequently to the selection of features based on DK. The managerial implications of this research enable companies and managers from the construction industry to focus on those features that are decisive for the area of evaluating the financial health of companies.
A limitation of the given research was missing and insufficient data. Another limitation was the occurrence of a relatively large number of outliers. Future research will be focused on confirming the significance of selected indicators for predicting the financial failure of companies, and especially on the Current ratio and its use in identifying prosperous and non-prosperous businesses.