1. Introduction
A topic which has attracted considerable interest from the machine learning and statistics communities is the importance that individual variables have when building models to predict or explain a response variable. This question arises in a wide variety of fields, such as the measurement of the technical efficiency of a set of homogeneous entities (companies, public organizations, etc.), a problem rooted in microeconomics and operations research. Some of the early contributions in this area of study can be traced back to the work by Cobb and Douglas [1], who empirically estimated a production function. Later, Koopmans proposed a formal definition of technical efficiency [2], and Debreu and Farrell introduced ways to measure it, following an input-oriented or an output-oriented radial direction, in [3,4], respectively. A link between measures of efficiency and production technologies was introduced by Shephard in [5]. Building on these foundations, a variety of approaches have been proposed, usually split in the literature into parametric and nonparametric methodologies. Two of the most well-known techniques, each representative of one family, are Data Envelopment Analysis (DEA) on the nonparametric side [6,7] and Stochastic Frontier Analysis (SFA) on the parametric side [8,9].
In this article, we focus on the nonparametric approach due to some of its characteristics, such as its flexibility and its natural multi-input, multi-output treatment. Whereas the parametric approach assumes some functional form for the production frontier, the nonparametric approach takes, as a foundation, only some properties of the underlying production frontier. In particular, the DEA methodology proposes a linear optimization program which can be used to estimate the technical efficiency of a unit with respect to a production technology which satisfies the postulates of envelopment, free disposability of inputs and outputs, and convexity. It does so by applying, at the last stage, the postulate of minimal extrapolation, which selects the smallest among all possible sets satisfying the above postulates [7].
The postulate of minimal extrapolation is the cause of one of the criticisms that have been leveled against DEA, namely that it is a data-driven technique which is descriptive in nature and thus may not generalize well [10]. In particular, it does not allow statistical inference tasks to be performed unless the sample size is large enough. Various authors have attempted to overcome this limitation. Among them, Simar and Wilson have adapted bootstrapping procedures to estimate bias and variance and to construct confidence intervals [11,12]. Other properties studied include the consistency and speed of convergence of the DEA estimators [13], which is deeply related to the curse of dimensionality: too many units are deemed efficient when the number of dimensions is large relative to the number of units available. Some recent contributions to the DEA literature are [14,15].
In this paper, we turn our attention to a field related to operations research and optimization: machine learning and data analytics, an area of knowledge which builds estimators from available data. These estimators can be broadly classified into two families: supervised and unsupervised learning. In supervised learning, some variables are used in order to estimate one or multiple target variables. Depending on the nature of the predicted variable(s), supervised learning tasks comprise regression, when the target variable is continuous, and the classification of elements into various classes, when it is discrete. In unsupervised learning, on the other hand, all variables are used in order to obtain information about the process which generated the data, including tasks such as clustering, anomaly detection, or the estimation of probability densities and their supports.
Until relatively recently, there had been little contact between machine learning and the measurement of technical efficiency, but some examples of their proximity can be seen in the work by Kuosmanen and Johnson, who used piecewise linear estimators of a production function via Corrected Concave Nonparametric Least Squares [16]. Another contribution, by Parmeter, introduced nonparametric kernel estimators to the frontier problem [17], while Daouia et al. proposed a procedure using constrained polynomial splines to obtain smooth frontiers [18]. Other authors have adapted decision tree-based techniques, such as Classification and Regression Trees (CART) in [10], or probabilistic regression trees (with panel data) in [19]. Furthermore, Valero-Carreras et al. adapted Support Vector Regression to this context [20], while Olesen and Ruggiero proposed a representation of production frontiers using hinging hyperplanes [21]. Finally, Guerrero et al. combined DEA with machine learning techniques through the Structural Risk Minimization principle [22].
One feature that the works mentioned above have in common (except DEA itself) is that they use some of the variables available in the data (inputs) in order to predict the values of one or more output variables. This is usually associated with the supervised learning paradigm in machine learning. In the production context, this is justified by the natural split of the variables into inputs and outputs, where the inputs are used to predict or explain the values of the outputs. However, these methods have drawbacks, such as the requirement to partition the variables a priori into two subsets, and they present difficulties when being extended to multi-output contexts [23].
In contrast, unsupervised learning makes no such distinction between variables. Instead, it treats all variables homogeneously and attempts to obtain information about the underlying Data Generating Process (DGP) that yielded the data. In this context, the estimation objective of DEA becomes the production technology itself, which can be seen as the support of the underlying Data Generating Process [24]; from this point of view, DEA more closely resembles an unsupervised learning technique. These observations enable the use of methodologies for the estimation of the support of a distribution in order to estimate production technologies.
Among the methods for estimating the support of a probability distribution, a relevant family is that of kernel methods, such as Kernel Density Estimation [25,26]. Kernels are transformations of the data which endow estimators with flexibility. They can also provide smoothing to estimators of the probability densities of random variables and of their corresponding supports. At the intersection of machine learning methodology and statistical learning theory lies a family of kernel-based estimators called Support Vector Machines (SVM) [27,28]. First introduced by Vapnik, SVMs bring the flexibility of kernel methods to varied supervised learning tasks such as classification and regression. Furthermore, SVMs can be adapted to unsupervised machine learning tasks such as the estimation of the support of high-dimensional distributions via, for example, the OneClass Support Vector Machines (OCSVM) estimator of [29]. We choose this method as the basis for our estimator of the production technology.
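As a brief illustration of this unsupervised use of SVMs (a generic sketch with scikit-learn's OneClassSVM and an RBF kernel, not the estimator proposed later in this paper), the support of a two-dimensional distribution can be estimated as follows:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # sample from an (unknown) DGP

# nu bounds the fraction of sample points left outside the estimated support
ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)

inside = ocsvm.predict(X) == 1        # +1: inside the estimated support
frac_outside = float(np.mean(~inside))

# a point far from the data mass falls outside the estimated support
far_point = np.array([[8.0, 8.0]])
print(frac_outside, ocsvm.predict(far_point)[0])
```

The region where the decision function is non-negative plays the role that the estimated production technology will play in the adaptation below.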
An important topic in data-driven methodologies is the decrease in the quality of a model as the dimensionality of the data increases relative to the number of data points available [30]. This problem is related to the rate of convergence, which depends on the number of units as well as on the dimensionality of the problem. In nonparametric frontier analysis, this phenomenon is often called the curse of dimensionality and manifests itself in an increase in the number of Decision Making Units (DMUs) considered efficient, and a subsequent lack of discrimination between them. It is also related to the question of model specification, and an important task in this context is measuring the importance of each variable in the production process, which allows the variables to be ranked according to this importance.
Approaches to the evaluation of the importance of variables and to their selection in the DEA literature include proposals based on regressions between input variables and efficiency scores [31] or on partial correlations among variables [32], as well as methods evaluating the contribution of each variable to the estimated efficiency scores [33]. Statistical hypothesis tests have been proposed to evaluate the significance of input variables [34], as well as comparisons between the numbers of efficient units estimated by various models [35]. We refer the reader to [36] for a more thorough discussion and comparison of these and other techniques. Other contributions propose methods which evaluate the importance of subsets of variables, such as those that enrich the optimization programs with binary variables modeling the inclusion or exclusion of variables [37,38,39]. Along these lines, criteria such as Akaike's Information Criterion [40] or game-theoretic measures such as the Shapley value [41] have been used to choose among models. In addition to selection among the original variables, other authors have proposed methods for the aggregation of variables into new ones, such as those based on Principal Component Analysis [42,43] or on bootstrapping [12]. Other approaches that attempt to increase the discriminating power of DEA involve super-efficiency models [44], which omit the unit whose efficiency is being evaluated from the reference set of the technology, or the use of the distance to anti-efficient frontiers [45], among others.
More recent contributions to the selection of variables include [46], which provides an up-to-date overview of methods as well as a methodology using contribution loads; methodologies based on statistical tests, such as [47,48]; and methods which enrich other estimators, such as SCNLS [16], with LASSO-based regularization terms [49,50,51].
From the machine learning perspective (without contact with the technical efficiency measurement field), a wide variety of approaches have been proposed for ranking the importance of variables and for their selection; see, for example, [52,53,54]. In this paper, we will focus on two of the most well-known methods. One involves the random shuffling of features, first introduced with Random Forests in [55]; the other is an SVM-specific method, introduced in [56], which measures the importance of variables via their effect on the objective value of the dual formulation of the estimator.
In summary, in this paper, we propose an adaptation of the OneClass Support Vector Machine algorithm to the estimation of production technologies which generalize those of DEA, aiming to overcome the deterministic and overfitting nature of DEA. We furthermore endow it with methods for the measurement of the importance of each variable in the production process, as well as obtaining a ranking of these variables. We propose a feature shuffling method and an approach based on the objective function of the dual formulation of the model. This paper presents a new link between the ranking of the importance of variables in efficiency measurement and machine learning. Furthermore, this paper proposes, for the first time, the use of unsupervised machine learning methodologies to rank the importance of variables in production processes.
The rest of this article is structured as follows. Section 2 introduces the main concepts of Data Envelopment Analysis and OneClass Support Vector Machines, and describes the feature importance methods which we will adapt. Section 3 adapts the OneClass SVM estimator to the nonparametric frontier estimation context and equips it with the proposed approaches for ranking the importance of features. Section 4 describes and presents the results of a computational experiment performed to compare these methods. Finally, Section 5 presents the conclusions of this article, as well as an outline of potential future research lines.
3. New Methods for Ranking Variables in Production Processes Using OneClass Support Vector Machines for Efficiency Measurement
In this Section, we adapt the OCSVM algorithm introduced in Section 2 with the piecewise linear kernel (5) to the task of estimating production technologies, via appropriate modifications to satisfy convexity and other relevant microeconomic properties. We furthermore propose two approaches for ranking the importance of the variables involved in the production process.
As the basis of the proposal, we follow the approach of [24], in which a technology arises from an underlying Data Generating Process (DGP) that is assumed to have certain statistical properties: the observed DMUs are random samples of independently and identically distributed random variables with an underlying probability density function satisfying some regularity conditions. In this context, the problem of estimating a production technology has a natural interpretation as the task of estimating the support of a probability distribution, which enables the use of tools from that literature, such as OCSVM.
We begin by adapting the OneClass Support Vector Machine estimator (3) to the task of estimating a production technology as follows. The resulting model, which we call OneClass for Efficiency Measurement (OCEM), is the following:
Model (8) is a quadratic program whose objective function and restrictions (8a) and (8b) are identical to those of the OCSVM model (3). We use the PWL transformation function from (5), with the stated coefficient conditions holding for all hyperplanes. The algorithm involves two hyperparameters, ν and the hyperplane offset, which we will now characterize; they are fine-tuned via a train-test split in order to obtain the best values for each dataset. Restriction (8c) ensures convexity of the estimated technology, whose efficient frontier is defined as the boundary of the polyhedral technology. Restriction (8d) guarantees that the efficient frontier passes through the origin. We calculate the technical inefficiency of a DMU with respect to this technology by adapting the DDF formulation (1) with a suitable choice of directional vector:
With this setup, it can be proved that the estimated technology satisfies the usual microeconomic axioms of production technologies. Convexity follows as in ([60], Section 5), given that the defined transformation is concave. Free disposability of inputs and outputs is satisfied as in CNLS (see ([16], Section 2.2)) when imposing the corresponding sign condition for all hyperplanes, so we will determine hyperplanes satisfying this property. As a consequence of the properties of the OCSVM algorithm ([29], Proposition 3), we obtain a bound on the fraction of outliers in terms of the hyperparameter ν, given by:
Here, the bound involves the number of Support Vectors, that is, those DMUs which are important for determining the frontier. Therefore, we observe that, when ν is small enough, the technology estimated by DEA is a subset of the estimated technology, as a consequence of the principle of minimal extrapolation in DEA.
The property that ν is a lower bound for the fraction of Support Vectors allowed, and an upper bound for the fraction of outliers, guides our choice of the range of ν: its lower end results in a minimum of zero outliers, and its upper end caps the maximum fraction of DMUs that can be outliers.
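This is the ν-property of OCSVM ([29], Proposition 3): ν upper-bounds the fraction of outliers and lower-bounds the fraction of Support Vectors. A quick numerical check on simulated data (using the generic RBF-kernel OneClassSVM of scikit-learn as a stand-in for OCEM) might read:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(1)
X = rng.uniform(1.0, 2.0, size=(200, 3))  # 200 simulated DMUs

nu = 0.2
model = OneClassSVM(kernel="rbf", gamma=1.0, nu=nu).fit(X)

# outliers: points strictly outside the estimated support
frac_outliers = float(np.mean(model.decision_function(X) < 0))
frac_sv = len(model.support_) / len(X)   # fraction of Support Vectors
print(frac_outliers, nu, frac_sv)
```

In practice, `frac_outliers` stays at or below ν while `frac_sv` stays at or above it, which is the behavior exploited when choosing the range of ν.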
We now describe the role of the hyperplanes involved in the transformation function (5), which is closely related to the appropriate range of values of the offset hyperparameter. The hyperplane coefficients involved in the model parameterize a set of hyperplanes which determine the turning points of the polyhedral frontier, that is, where the edges of the faces of the polyhedral technology will be located. Since the goal is to estimate an efficient frontier which is close to the data without overfitting, and thus close to the theoretical frontier, we are interested in hyperplanes which lie in the region enveloping the data from above. A known set of hyperplanes in this region is given by the faces of the convex closure of the data, which, we remark, can be estimated by the DEA methodology. Hence, we obtain a set of hyperplanes by solving a linear DEA problem with the directional function corresponding to the Chebyshev norm (see [61]). Using the netput notation, we solve this linear program for each DMU.
By solving these n linear programs, we obtain, for each DMU, a corresponding hyperplane located on the DEA-estimated efficient frontier, at a distance from that DMU equal to the optimal objective value, along the chosen direction; these are the hyperplanes used in the transformation function. Regarding the offset hyperparameter, it shifts the defined hyperplanes simultaneously, so we take the negative of the largest distance from a DMU to the efficient frontier as a lower bound for its potential values, which yields a bounded range of candidates. We remark that, by the first constraint of (10), the objective value of (10) is non-negative for each DMU. We assume that at least one DMU is strictly inefficient with respect to model (10), which implies that there is some DMU for which the objective value of program (10) is strictly positive. Under this hypothesis, the resulting range of offset values is nondegenerate.
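For readers unfamiliar with these per-DMU programs, the following sketch solves a standard VRS directional distance DEA model with direction g = (1, 1) for each DMU using scipy. The data and the exact formulation are illustrative assumptions; the paper's program (10) uses its own netput notation and may differ in detail.

```python
import numpy as np
from scipy.optimize import linprog

# illustrative single-input, single-output data (not from the paper)
X = np.array([2.0, 4.0, 6.0, 8.0, 5.0])   # inputs
Y = np.array([1.0, 4.0, 5.0, 5.5, 2.0])   # outputs
n = len(X)

def ddf_vrs(o, gx=1.0, gy=1.0):
    """Directional distance of DMU o under VRS with direction g = (gx, gy)."""
    # decision variables: (beta, lambda_1, ..., lambda_n); maximize beta
    c = np.concatenate(([-1.0], np.zeros(n)))
    A_ub = np.vstack([
        np.concatenate(([gx], X)),    # sum_j lambda_j x_j + beta*gx <= x_o
        np.concatenate(([gy], -Y)),   # -sum_j lambda_j y_j + beta*gy <= -y_o
    ])
    b_ub = np.array([X[o], -Y[o]])
    A_eq = np.concatenate(([0.0], np.ones(n))).reshape(1, -1)  # sum lambda = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1), method="highs")
    return -res.fun

betas = [ddf_vrs(o) for o in range(n)]  # distance 0 for frontier DMUs
```

The optimal value β is non-negative for every DMU, and equals zero exactly for the DMUs on the DEA frontier, in line with the remarks above.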
At this stage, we have all the information required to set up the quadratic program (8), which is solved in order to obtain the estimated technology. It remains to tune the two hyperparameters over their respective intervals. For this task, we choose five equally spaced values for each hyperparameter from these intervals. We then split the data into train and test sets, using part of the data as a training set and the remainder as a test set in order to evaluate the fit of each trained model. Then, for each pair of hyperparameter values, we fit model (8) on the train set in order to obtain a candidate estimate of the technology. In order to choose among these candidate technologies, we evaluate them using their Mean Squared Error (MSE) on the test set as follows. We evaluate model (9) using Farrell's output distance in order to obtain the efficiency of each DMU and its projection onto the estimated efficient frontier. We then calculate the MSE between the observed and the estimated output values of each DMU in the test set. We choose the hyperparameters which lead to the smallest MSE on the test-set predictions. With these hyperparameters fixed, we again fit model (8) using the whole dataset to obtain the final estimate of the technology.
The algorithm described above involves the tuning of appropriate values for the hyperparameters. However, for computational reasons, we sometimes already have appropriate valid hyperparameters from a previous estimate of the technology, which we then keep fixed for comparison.
Since Program (8) is quadratic, we can use the standard tools of quadratic programming to obtain the following dual formulation, with the hyperparameters fixed. We remark that the dual of Program (8), which is a minimization problem, is a maximization problem; however, the objective function of that maximization problem is the negative of the one presented, so it is equivalent to the following minimization problem (this is standard in the SVM literature; see, e.g., ([29], eq. (3.11))):
We remark that, in terms of the variables of the primal program (8), an equality holds which shows that the objective value is a measure of the error of the model.
We now adapt the previously described methods for ranking the relative importance of variables to the context of ranking the variables involved in productive processes using the OCEM algorithm. We first describe the Shuffling-OCEM methodology, involving the random shuffling of variable values, before moving on to the Dual-OCEM proposal, which is based on the variation of the objective function of the dual program (11).
3.1. Shuffling-OCEM
We proceed to adapt the methodology involving the shuffling of the values of a feature to the OCEM context as follows. We first solve the full OCEM model with the original dataset, including the tuning of hyperparameters. This yields a fitted estimator, from which we obtain its MSE, the baseline error. We then iterate over each variable l being tested for inclusion: we randomly shuffle its values to obtain a shuffled dataset, and solve the OCEM model on it with the previously established hyperparameters. We calculate the error of this model and then the importance measure of variable l using (6).
We remark here that we keep the hyperparameters obtained in the first model fixed for the shuffled models, since we are interested in evaluating the effect of the change in each candidate variable on the estimator. Furthermore, in some preliminary testing we observed that the version with further hyperparameter tuning takes around five times longer to compute without yielding better results. Thus, we consider the version where the hyperparameter tuning procedure is performed only on the full model at each step of the process, and we solve the shuffled models with the hyperparameters obtained from the full model.
Once the measure of importance has been calculated for every variable, the least important variable is the one with the smallest value. We add this variable to the current ranking and iterate the method without it, in order to continue ranking the rest of the variables. At each iteration of the method, the hyperparameters are recomputed. The final variable, that is, the variable which is never considered the least important among those remaining, is then considered the most important variable in the production process. Algorithm 1 shows the steps followed by Shuffling-OCEM.
Algorithm 1 Shuffling-OCEM algorithm implementation
procedure calculate_ranking_shuffling_OCEM(data, variables_to_rank)
    remaining_variables ← variables_to_rank
    ranking, ΔE ← [ ]
    while |remaining_variables| > 1 do
        n_var ← |remaining_variables|
        for j ← 1 to n do
            solve_program_(10)            ▷ obtain the DEA hyperplanes
        end for
        tune hyperparameters and fit the full OCEM model (8); compute its MSE
        for l ← 1 to n_var do
            shuffle the values of variable l
            solve model (8) on the shuffled data with fixed hyperparameters
            compute ΔE_l using (6)
        end for
        move the variable with the smallest ΔE_l from remaining_variables to ranking
    end while
    append the last remaining variable to ranking
    return ranking
end procedure
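Outside of the OCEM specifics, the mechanism of a single pass of Algorithm 1 — fit once, then shuffle each variable, refit with fixed settings, and compare errors as in (6) — can be sketched with a simple stand-in estimator (an ordinary least-squares model on synthetic data; this is an assumption of the example, not the paper's model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n = 400
X = rng.uniform(1.0, 2.0, size=(n, 3))
# variable 0 is highly relevant, variable 1 weakly relevant, variable 2 irrelevant
y = 3.0 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=n)

def fit_mse(Xd):
    # refit on the (possibly shuffled) data with the model settings fixed
    model = LinearRegression().fit(Xd, y)
    return mean_squared_error(y, model.predict(Xd))

e_full = fit_mse(X)                        # baseline error of the full model
delta_e = []
for l in range(X.shape[1]):
    Xs = X.copy()
    Xs[:, l] = rng.permutation(Xs[:, l])   # shuffle only the values of variable l
    delta_e.append(fit_mse(Xs) - e_full)   # importance measure for variable l

order = np.argsort(delta_e)                # least to most important
```

In the full procedure, the variable with the smallest error increase would then be appended to the ranking and the loop repeated on the remaining variables.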
3.2. Dual-OCEM
This method evaluates the effect that the omission of a variable has on the model via the variation in the objective value J of the dual problem (11) of the OCEM model. The idea is that the removal of a variable has a larger effect on J the more important the removed variable is to the model. In order to obtain values for the parameters involved in the dual program, we begin by setting up and solving the full OCEM primal problem (8), obtaining optimal hyperparameters for this problem, as well as a set of h hyperplane parameters involved in the feature mapping. We then fix these hyperparameters throughout this iteration and turn our attention to program (11), the dual model of the OCEM method.
We solve the dual program (11) and then, in order to make the computations feasible, we follow [56] and keep the solutions of the dual constant. In order to evaluate the change in the objective function J when removing a variable l, since we keep the hyperparameters and the dual variables constant, the only changes in J come from the effect of removing feature l in the transformation function. We calculate the transformation with variable l removed by eliminating all contributions of variable l, i.e., by setting the corresponding coefficients to zero wherever they appear. We remark that this has no effect on the remaining components of the transformed vector. The change in the objective function J of the dual problem (11) when removing variable l is then:
We further remark that, in Equation (13), the terms involving the DMUs whose dual variables are zero vanish, so that we only need to take into account the DMUs with strictly positive dual variables (i.e., the Support Vectors) in order to calculate the change in J. Therefore, we only need to consider a subset of the data, which may be smaller than the original dataset. We then calculate this change as l runs over every variable that remains to be ranked, and the measure of importance of variable l used in this method is given by (7):
At each step, we calculate this measure for each remaining variable l. The variable considered the least important is the one attaining the minimum value, so we add it to the ranking as the next least important variable. We then iterate the method without this variable in order to continue ranking the rest of the variables. At each iteration, we recalculate the hyperparameters, hyperplane parameters, and dual variables. Finally, the last variable remaining is considered the most important variable for the production process. The steps of Dual-OCEM are shown in Algorithm 2.
Algorithm 2 Dual-OCEM algorithm implementation
procedure calculate_ranking_dual_OCEM(data, variables_to_rank)
    remaining_variables ← variables_to_rank
    ranking, ΔJ ← [ ]
    while |remaining_variables| > 1 do
        for j ← 1 to n do
            solve_program_(10)            ▷ obtain the DEA hyperplanes
        end for
        tune hyperparameters, fit the primal OCEM model (8), and solve its dual (11)
        for l ← 1 to n_var do
            remove the contributions of variable l from the transformation
            compute ΔJ_l using (13) and the importance measure (7)
        end for
        move the variable with the smallest importance from remaining_variables to ranking
    end while
    append the last remaining variable to ranking
    return ranking
end procedure
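The dual-objective criterion of [56] can be illustrated outside of OCEM with a linear support vector classifier, an assumed stand-in for the OCEM dual: with the dual variables held fixed, the change in J from removing feature l is recomputed through the kernel alone, and only the Support Vectors contribute, as remarked above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)
n = 200
X = rng.normal(size=(n, 3))
# the class label depends mainly on feature 0; feature 2 is pure noise
ylab = (X[:, 0] + 0.3 * X[:, 1] + 0.2 * rng.normal(size=n) > 0).astype(int)

clf = SVC(kernel="linear", C=1.0).fit(X, ylab)
a = clf.dual_coef_[0]        # alpha_i * y_i, for the Support Vectors only
S = clf.support_vectors_     # non-SV terms vanish from the change in J

def delta_j(l):
    # |J - J^(-l)| with the dual variables fixed; for a linear kernel the
    # part of the Gram matrix removed with feature l is an outer product
    K_l = np.outer(S[:, l], S[:, l])
    return 0.5 * abs(a @ K_l @ a)

importances = [delta_j(l) for l in range(X.shape[1])]
```

With a linear kernel this reduces to half the squared weight of each feature; with the PWL kernel of OCEM, the same computation runs through the modified transformation function instead.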
4. Computational Experience
In order to evaluate and compare the performance of the proposed algorithms, we use scenarios inspired by [62], with Cobb-Douglas production functions under Variable Returns to Scale, with multiple inputs and a single output. In these production functions, the magnitude of the exponent of each input represents its level of theoretical marginal importance, with higher values indicating a more important variable; an irrelevant variable can be considered to have exponent 0. Furthermore, the sum of the exponents of the inputs is associated with the returns to scale of the production process: when the sum is less than one, the production process exhibits non-increasing returns to scale; a sum greater than one is associated with non-decreasing returns to scale; and a sum of exactly one corresponds to constant returns to scale. In this computational experience, we consider functions whose exponents add up to less than one, thus exhibiting non-increasing returns to scale. Other returns to scale could be considered, but this extension is beyond the scope of this paper. In particular, we consider scenarios with 1, 3, or 5 relevant inputs, one additional irrelevant input a, and one output y, in order to test how each methodology ranks the variables by their importance. We investigate the effect on the results of the specification of the following factors: (1) sample size, (2) inefficiency distribution, (3) median inefficiency level, and (4) production function.
To calculate the output level of DMU j given its inputs, we use a production function and a multiplicative inefficiency term. The production functions which we simulate are Cobb-Douglas functions, with a variety of parameters and different exponents for each of the relevant variables, and are presented below. Each function represents the maximum producible output given the input profile of relevant variables in a scenario. The true inefficiency term is defined, as in [62,63], through a non-negative random variable following one of several probability distributions. We furthermore include an additional input variable a which is not involved in the production function; thus, a is irrelevant to the production process. The values of all the inputs of DMU j, relevant and irrelevant, are generated independently from the same distribution. The production functions used are:
In the case of one relevant input, the production functions are given by (14), where the exponent of the relevant variable takes values over an equally spaced grid. The technologies with three relevant inputs are given by (15), and those with five relevant inputs by (16).
With each production function, we considered sample sizes of 50 and 100 DMUs, each of them with two different probability distributions for the inefficiency terms (half-normal and exponential), and, in each case, with both a low and a high level of median inefficiency, yielding a total of 136 different scenarios, each of which was replicated 100 times, for a total of 13,600 datasets. Regarding the inefficiency term, we consider the following configurations: for each of the half-normal and exponential distributions, one parameter value is chosen for the low inefficiency setting and a larger one for the high inefficiency setting, so that the resulting median inefficiency levels are approximately matched across the two distributions.
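As an illustration of this data-generating process, the following sketch simulates one scenario with three relevant inputs, a half-normal inefficiency term, and an irrelevant input; the uniform input range, the exponents (0.4, 0.2, 0.1), and the scale σ = 0.3 are placeholder assumptions, not the constants of the experiment:

```python
import numpy as np

rng = np.random.default_rng(5)

def generate_scenario(n, exponents=(0.4, 0.2, 0.1), sigma=0.3):
    """Simulate n DMUs: relevant inputs x, an irrelevant input a, output y.
    All numeric constants here are illustrative placeholders."""
    p = len(exponents)
    x = rng.uniform(1.0, 2.0, size=(n, p))         # relevant inputs
    a = rng.uniform(1.0, 2.0, size=n)              # irrelevant input
    f = np.prod(x ** np.array(exponents), axis=1)  # Cobb-Douglas frontier
    u = np.abs(rng.normal(0.0, sigma, size=n))     # half-normal inefficiency
    y = f * np.exp(-u)                             # observed (inefficient) output
    return x, a, y, f

x, a, y, f = generate_scenario(100)
```

Since the assumed exponents sum to less than one, the simulated technology exhibits non-increasing returns to scale, matching the setting of the experiment, and every observed output lies on or below the frontier.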
With these scenarios, we execute the Shuffling-OCEM and Dual-OCEM algorithms in order to investigate whether they are able to adequately detect the relative importance of each variable, obtaining a full ranking of the input variables, which should be ordered by the values of the exponents, with higher values indicating higher importance. The irrelevant variable should be ranked last, as it can be considered to have exponent 0 in the Cobb-Douglas functions above.
We first consider the aggregated results of the overall experiments in Table 1. In all subsequent tables, we present the results for the Shuffling-OCEM methodology on the left-hand side and those for the Dual-OCEM on the right. Furthermore, we split them according to the number of input variables due to the different nature of each ranking table. Overall, we observe that Shuffling-OCEM outperforms the Dual-OCEM methodology at this task.
When we compare the results according to the various factors, we observe the following patterns:
(1) We begin with the effects of sample size. In Table 2, we can observe that both methods improve as the number of DMUs increases from 50 to 100, and that the Shuffling-OCEM methodology with 50 DMUs already outperforms the Dual-OCEM methodology with 100 DMUs.
(2) We now study the effects of changing the inefficiency distribution. Table 3 reports the aggregated percentages for the scenarios with exponential and half-normal distributions separately. We can observe a slightly better performance for both methods with a half-normal distribution than with an exponential one, but the difference is very small. In fact, these two tables are almost exactly the same as the overall results in Table 1. Thus, we can conclude that both methods are robust to the inefficiency distribution.
(3) Next, we compare the performance in the scenarios with low and high average inefficiency levels. We can observe in Table 4 that both methods perform worse when the average inefficiency level is high. Furthermore, we observe that Shuffling-OCEM with a high average inefficiency still performs better than Dual-OCEM with a low average inefficiency.
Furthermore, we consider the separate effects of low and high average inefficiency levels under each of the two inefficiency distributions and, as in the inefficiency distribution comparison (2), we observe very similar values when changing the distribution. Hence, we can conclude again that the type of inefficiency distribution does not affect the performance in either the low or the high inefficiency setting. We do not report these tables, as their values are very similar to those already presented; they are available upon request.
(4) Finally, we consider the results obtained with each of the production functions separately. The results for scenarios (16a)–(16d), i.e., the production functions with five relevant inputs, can be observed in Table 5, while those corresponding to scenarios (15a) and (15d), that is, with three relevant inputs, are presented in Table 6. The results from the scenarios with one relevant input, grouped according to the value of its exponent, that is, with production functions (14), are summarized in Figure 2. We can observe that the values of the exponents strongly affect the quality of the rankings: the larger the difference between two consecutive exponents, the smaller the confusion between them. As before, Shuffling-OCEM clearly outperforms Dual-OCEM throughout. We now focus on the behavior of Shuffling-OCEM according to the values of the exponents.
In the scenarios with production functions (16a)–(16d), we observe that when variables are misclassified, they are almost always placed in the relative position of other variables with small differences between the values of the respective exponents. For example, less confusion arises between consecutive variables whose exponents differ by a larger amount, as in Scenarios (16b) and (16d), whereas some confusion arises in the rankings between variables whose consecutive exponent differences are smaller.
In the scenarios with production functions (15a) and (15d), we observe that there is still some confusion among variables when the difference between their exponents is small, while larger differences result in a clear separation between their positions in the ranking. For example, in Scenario (15a), some confusion arises between pairs of variables whose exponents differ only slightly, whereas the variable whose exponent differs from the next one by a relatively large amount is almost always classified in either 3rd or 4th place, and hardly ever as more important. Similar trends can be observed throughout.
Finally, in the scenarios with production functions (14), both methodologies improve their performance as the value of the exponent of the relevant variable increases, as shown in Figure 2. In particular, Shuffling-OCEM accurately identified the relevant variable as the most important in the vast majority of simulations whenever this exponent was sufficiently large, outperforming the Dual-OCEM method. As the exponent becomes smaller, the importance of the relevant variable decreases, and the proportion of successful rankings decreases with it; even for small exponents, Shuffling-OCEM was still able to identify the correct order in a sizable proportion of replications, while Dual-OCEM basically guessed at random.
We remark that part of the higher confusion between variables in the scenarios with a larger number of input variables can be attributed to the smaller exponents, which lead to smaller differences between them while retaining the non-increasing returns to scale of the Cobb-Douglas functions.
Regarding the position of the irrelevant variable in the production processes under study, while it is sometimes classified higher in the rankings than some relevant variables, it is almost always placed in the relative position of variables with small exponents. This could indicate that exponents of such magnitudes may not be very important to the production processes being considered.
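The shuffling-based ranking underlying Shuffling-OCEM follows the classic permutation-importance idea: perturb one variable at a time and measure how much the estimated efficiency scores move. Since the OCEM estimator itself is not reproduced here, the sketch below uses a simple FDH-style (free disposal hull) input-oriented efficiency score as a hypothetical stand-in; the estimator, sample, and exponents are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fdh_efficiency(X, y):
    """Input-oriented FDH-style efficiency: for each unit, the smallest
    uniform input contraction such that some observed unit dominates it
    (produces at least as much output with at most the contracted inputs)."""
    n = len(y)
    eff = np.empty(n)
    for i in range(n):
        dominating = y >= y[i]                      # units producing at least y_i
        eff[i] = np.min(np.max(X[dominating] / X[i], axis=1))
    return eff

def shuffling_importance(X, y, estimator, n_repeats=20, seed=0):
    """Permutation importance: average absolute shift in efficiency scores
    when one input column is shuffled; a larger shift means the variable
    matters more for the estimated technology."""
    rng = np.random.default_rng(seed)
    base = estimator(X, y)
    scores = np.zeros(X.shape[1])
    for k in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, k] = rng.permutation(Xp[:, k])
            scores[k] += np.mean(np.abs(estimator(Xp, y) - base))
    return scores / n_repeats

# Illustrative Cobb-Douglas data: x1 (exponent 0.6) should tend to rank above x2 (0.1)
rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(100, 2))
y = X[:, 0] ** 0.6 * X[:, 1] ** 0.1 * np.exp(-rng.exponential(0.1, 100))
scores = shuffling_importance(X, y, fdh_efficiency)
```

Sorting the variables by these scores yields the kind of importance ranking whose accuracy is evaluated throughout this section.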
Finally, regarding the computational time taken by both methods, we mention that the programming language used was Python and that CPLEX v12.8 was utilized for solving the optimization programs. We observe that Shuffling-OCEM takes longer than Dual-OCEM by a factor that depends on the number of inputs (m) and the number of DMUs (n). Average computation times for each combination of number of inputs and number of DMUs are reported in Table 7. As the number of inputs increases, the computational time of both methods increases, with Dual-OCEM scaling better than Shuffling-OCEM. Both methods also take longer with larger sample sizes, but the time taken by Dual-OCEM grows faster with the number of DMUs than that of Shuffling-OCEM, resulting in slightly smaller ratios in the cases with 100 DMUs than in those with 50 DMUs. The simulations were executed on a PC with a 1.8 GHz dual-core Intel Core i7 processor and 8 GB of RAM, running Microsoft Windows 10 Enterprise.
These results show that the Shuffling-OCEM method is significantly better than the Dual-OCEM method at establishing a correct ranking of variables, while taking, on average, 1.25 times the computational time to execute, so it can be considered the preferable method.
5. Conclusions and Future Work
In this paper, we have presented some methods for measuring the importance of variables in production processes, based on an adaptation of the One-Class Support Vector Machine estimator to Efficiency Measurement (OCEM), and evaluated their ability to rank the variables by their relative importance to efficiency measurement in a production process. This adapted estimator applies data-centric optimization to the estimation of the production technology, which can be viewed as the region of the space where the observed data lie, while attempting to improve the generalization capability of standard DEA. Based on the estimated production technologies, we evaluate the importance of the variables in determining this technology, and thus for the production process.
This is an important topic in the efficiency estimation literature due to the effect that including additional variables has on the estimations, particularly when they are not very relevant. It makes it possible to consider removing some of the less important variables from a productive process in order to obtain better model specifications.
In particular, we adapt two classic methodologies from the machine learning literature, based on shuffling the values of a variable (Shuffling-OCEM) and on the dual formulation of the OCEM estimator (Dual-OCEM). We compare them using Cobb-Douglas functions in the single-output setting, with Variable Returns to Scale and with independently generated inputs.
In these simulated scenarios, we observe that the Shuffling-OCEM methodology outperforms Dual-OCEM, at a slightly higher computational cost. Both methods improve their performance as the number of DMUs increases. They are relatively robust to the types of inefficiency distribution compared, obtaining similar results with both a half-normal and an exponential distribution. Both methods lose accuracy as the average level of inefficiency increases, since the effect of a high average inefficiency can surpass the small relative differences in importance between variables. We observe that the performance of both methods depends on the exponents of the variables, with larger differences between these exponents yielding more correct rankings. In particular, as the exponents decrease when more variables are included, the rankings worsen as the number of variables increases. Comparing the two methods, Shuffling-OCEM clearly outperforms Dual-OCEM overall and in each of the comparisons. In fact, Shuffling-OCEM performs better when the average inefficiency is high or when the sample size is small than Dual-OCEM does in the corresponding low-inefficiency or large-sample scenarios. In particular, we observe that Shuffling-OCEM correctly ranks the variables in 94% of the scenarios with one relevant and one irrelevant input, while Dual-OCEM achieves a considerably lower overall success rate. When the exponent of the relevant variable is low, the performance of the Dual-OCEM methodology deteriorates more than that of Shuffling-OCEM. Regarding Scenario (15a), Shuffling-OCEM is capable of ranking each variable correctly in at least 65% of replications, with varying results according to the relative differences between consecutive values of the exponents, while the respective Dual-OCEM percentages are considerably lower.
Therefore, we conclude that the Shuffling-OCEM methodology should be used over the Dual-OCEM methodology, at least in situations similar to those considered here. This is further supported by the computational cost of Shuffling-OCEM being only slightly higher than that of Dual-OCEM. Further studies could evaluate whether these conclusions hold in more general cases, such as those with correlations among variables, or with different production functions which may not satisfy convexity or other properties.
Finally, we mention some possible avenues for further research. The two approaches studied here are just some of the methods available in the literature for the ranking of variables, and other methods could be adapted to the OCEM estimator in order to compare their performance. The proposed OCEM estimator uses a piecewise linear (PWL) transformation function, but this is not the only possible choice: a variety of kernels and transformation functions could be used, such as polynomial, Gaussian, or sigmoid kernels, among others. Moreover, the proposed methodology treats inputs and outputs homogeneously, so these methods could be considered for the ranking of outputs, or even for the ranking of both inputs and outputs simultaneously, while still taking into account the characteristics of production processes; this could be an area worth exploring. In practice, the methods proposed in this paper could be used to measure the importance of variables in real-life datasets, and as a basis for obtaining models for the measurement of efficiency involving only those variables considered most important by the methods. Furthermore, in addition to ranking variables, these methods could be enriched with stopping rules or thresholds to determine whether a variable is relevant, turning them into methods for the selection of variables in production processes. An interesting line of future work in this direction could be the use of the proposed methodologies with real-life datasets, to evaluate whether some variables can be considered irrelevant. Another potential line of future work is to evaluate the performance of the proposed methods in a wider variety of scenarios, with broader characteristics such as different assumptions about returns to scale, a variety of production functions, or relationships between the variables, among others.
Finally, another interesting line of future research would be to compare the new methods with other methodologies available in the literature and evaluate their relative performance.