2. Overview of Existing Methods for Detecting Model Training Issues
1. Expert method: learning curve—a graph showing the change in a particular metric as the machine learning model is being trained. There are two main types of learning curves. Optimization curves are learning curves calculated on the basis of the metric by which model parameters are optimized, such as various types of error functions. Performance curves are learning curves calculated on the basis of the metric by which the model will be evaluated and selected later, such as accuracy, precision, recall, or a combined F1-score.
The most popular example of a learning curve is the curve of a model’s loss over time. Loss measures a model’s error; consequently, a lower loss indicates higher performance of the model. In the long run, loss should decrease over time, indicating that the model is learning.
Another example of a learning curve is the accuracy graph, which reflects the performance of the model, whereby a higher accuracy indicates a more efficiently trained model. A model’s accuracy graph rising over time indicates that the model is improving as it accumulates experience. Over time, the graph of the performance metric reaches a plateau, which means that the model is no longer learning, and the performance limit for a given configuration has been reached.
One of the most widely used combinations of metrics is training and validation loss over time. Training loss shows how well the model matches the training data, and validation loss shows how well the model matches new data that were not available during training.
As mentioned above, the main issues that experts seek to identify during model training are model underfitting and overfitting. To detect overfitting and underfitting, machine learning experts apply a set of rules [16,17,18] based on an understanding of the mathematical process of model training.
Basic case—absence of problems during model training. The training process stops when the trend of the loss function on validation changes from downward to upward.
In the absence of issues in the training phase, the values of the model error function on training data will almost always be lower than those on test data. This means that we should expect some gap between the training and validation loss curves; this gap is known as the generalization gap. The optimal case is considered to be one in which the training loss curve decreases to a point of stability, the validation loss curve decreases to a point of stability, and the generalization gap between them is minimal (almost zero in the ideal case).
If the loss values are high and do not decrease with the number of iterations for both the training and validation curves, this indicates that the model is insufficiently complex relative to the data, which leads to model underfitting.
If the learning curve is linear with a derivative close to zero or shows noisy values around some constant high loss value, this indicates that the model failed to reveal patterns during training, or in other words, the fact of learning is almost (or completely) absent.
If the values of the training and validation loss functions are still decreasing in the last epochs of training, this indicates premature cessation of training, i.e., the model is still capable of learning further.
If the values of the training loss function decrease over time, reaching acceptably low values, but the values of the validation loss function decrease only up to a certain point after which they begin to increase, then it can be concluded that the model reached an overfit.
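The rules above can be sketched as a simple heuristic classifier over recorded loss curves. This is an illustrative toy, not the method proposed later in the paper; the function name and tolerance thresholds are arbitrary assumptions.

```python
# Illustrative heuristic labeling of a training run from its loss curves.
# The thresholds gap_tol and flat_tol are arbitrary assumptions.

def diagnose(train_loss, val_loss, gap_tol=0.05, flat_tol=0.05):
    """Label a run as 'good', 'underfit', or 'overfit' from its loss curves."""
    # Underfit: the training loss barely moves from its initial value.
    if train_loss[0] - min(train_loss) < flat_tol:
        return "underfit"
    # Overfit: validation loss turned upward while training loss kept falling.
    if (val_loss[-1] > min(val_loss) + gap_tol
            and train_loss[-1] <= min(train_loss) + flat_tol):
        return "overfit"
    # Otherwise both curves decreased to stability with a small gap.
    return "good"

print(diagnose([1.0, 0.5, 0.2, 0.1], [1.0, 0.6, 0.7, 0.9]))  # overfit
```

A real detector would also smooth the curves before applying such rules, since raw per-epoch losses are noisy.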
It is worth noting that the learning curves can identify not only issues associated with the model, but also issues of unrepresentativeness of the data used to train or test the model. A representative dataset proportionally reflects the statistical characteristics of another dataset from the same subject area.
The issue of an unrepresentative training dataset occurs when the data available for training are insufficient for effective training of the model relative to the test dataset. This situation is also possible when the training dataset does not reflect the statistical parameters that are inherent to the data in a given subject area.
In such cases, the training and validation loss values decrease, but there is a large gap between the curves present, which means that the datasets used for training and validation belong to different probability distributions.
The issue of an unrepresentative validation dataset occurs if the dataset used to test the quality of model learning does not provide enough statistical information to assess the generalizability of the model. In such cases, the learning loss curve matches the basic case, while the validation loss curve shows highly noisy values in a region close to the learning loss curve.
It is possible that a validation loss may be much lower than training loss, reflecting the fact that the validation dataset is easier to predict than the training dataset. In this case, the validation dataset is too small and may be widely represented in the training dataset (i.e., there is an overlap between the training and validation datasets).
The expert method of neural network training assessment implies direct control of the network training process by an expert. In turn, this task imposes additional responsibilities and risks on the experts: the method depends on the competence and experience of the decision maker and increases the development time of a neural network. Additionally, a serious disadvantage of the method is a complete or partial lack of automation.
2. Keras API callbacks. Keras is a commonly used API for building and deploying deep learning models. A Keras API callback is a function that can be executed at different stages of the model’s training pipeline. Such functions can be used for various tasks such as controlling the model’s learning rate, which can affect the behavior of the training algorithm.
The main drawback of the described method is that the majority of parameters used to detect instances of a model’s faulty behavior during training are constants specified in advance by the researchers or a set of rules formulated by researchers in advance. It is worth noting that the various events which can be detected by this method are not classified. An event detected by a Keras callback, be it overfitting or underfitting, is assumed to be hypothetical (with the final decision being made by an expert overseeing the training process) and only valid for that specific stage of model training. Another disadvantage of these callbacks is the lack of premade automatic tools for detection of a supposed model underfit [19].
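As a sketch of the callback pattern, the pure-Python class below mirrors the shape of a Keras `on_epoch_end` hook without depending on Keras itself; the class name, attributes, and thresholds are illustrative assumptions, not Keras APIs.

```python
# Minimal pure-Python sketch of an early-stopping callback in the style of
# Keras's on_epoch_end hook; names and defaults are illustrative assumptions.

class EarlyStoppingCallback:
    """Stop when the monitored validation loss fails to improve for `patience` epochs."""
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0
        self.stop_training = False

    def on_epoch_end(self, epoch, logs):
        val_loss = logs["val_loss"]
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # improvement: reset the patience counter
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.stop_training = True

cb = EarlyStoppingCallback(patience=2)
for epoch, v in enumerate([1.0, 0.8, 0.9, 0.95, 0.97]):
    cb.on_epoch_end(epoch, {"val_loss": v})
    if cb.stop_training:
        print(f"stopped at epoch {epoch}")
        break
```

Note how the stopping decision is driven entirely by the preset `patience` and `min_delta` constants, which is exactly the limitation described above.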
3. In [20], the authors proposed an approach to accelerate the search for optimal neural network hyperparameters through early stopping of training, with the stopping criterion based on extrapolated learning curves.
In general, the process of extrapolation of a learning curve from an arbitrary number of initial values to final values is as follows: during optimization of the neural network’s parameters using the stochastic gradient descent algorithm, regular measurements of the model’s performance function are taken. Let y1:n be the observed values of the performance function for the first n iterations of stochastic gradient descent. Then, while observing y1:n values of the performance function, it is necessary to predict the performance ym at step m, where m >> n. In this approach, such a problem is solved using a probabilistic model.
The basic approach is to model the partially observed learning curve y1:n using a set of parametric functions {f1, …, fK}. In turn, each of these parametric functions fk is described by a set of parameters θk. Assuming Gaussian noise ε ~ N(0, σ2), we can use each function fk to model the network’s performance at timestep t as yt = fk(t|θk) + ε. The probability of a single observation yt is thus defined as p(yt|θk, σ2) = N(yt; fk(t|θk), σ2).
The authors of [20] also presented a set of parametric curve models whose shapes coincide with the general ideas about the shape of performance curves. Usually, such curves are increasing, saturating functions. In total, K = 11 different parametric functions are presented in the paper. It is worth noting that each of the presented models covers only certain aspects of learning curves, but none of them can fully describe all of the possible curves. Therefore, there is a need to combine these models.
The stronger, combined model is a weighted linear combination of the simple models, fcomb(t|ξ) = w1·f1(t|θ1) + … + wK·fK(t|θK), where the new combined parameter vector ξ = (w1, …, wK, θ1, …, θK, σ2) includes the weight wk for each model, the individual model parameters θk, and the noise variance σ2; then, yt = fcomb(t|ξ) + ε.
Given such a model, it is necessary to account for uncertainty; hence, a Bayesian perspective is adopted, and the ym values are predicted using Markov chain Monte Carlo (MCMC) sampling.
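As an illustration of the extrapolation idea only (not the authors’ Bayesian MCMC procedure), the sketch below fits a single saturating parametric form f(t) = a − b/t to the first observations of an accuracy curve by brute-force least squares and extrapolates it to a distant step m. The functional form and grid ranges are our own assumptions.

```python
# Toy learning-curve extrapolation: fit f(t) = a - b/t to the first n
# observations by grid-search least squares, then predict performance at
# step m >> n. The form and grid are illustrative assumptions.

def fit_and_extrapolate(y, m):
    n = len(y)
    best = None
    for a100 in range(0, 101):          # a in [0, 1] in steps of 0.01
        for b10 in range(0, 101):       # b in [0, 10] in steps of 0.1
            a, b = a100 / 100.0, b10 / 10.0
            sse = sum((y[t - 1] - (a - b / t)) ** 2 for t in range(1, n + 1))
            if best is None or sse < best[0]:
                best = (sse, a, b)
    _, a, b = best
    return a - b / m                    # extrapolated performance at step m

# Accuracy observations generated from 0.9 - 0.5/t for t = 1..5:
y = [0.9 - 0.5 / t for t in range(1, 6)]
print(round(fit_and_extrapolate(y, 100), 3))
```

The probabilistic approach in [20] additionally yields uncertainty estimates over the predicted ym, which a point fit like this cannot provide.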
The early stopping method using extrapolation of learning curves makes it possible to estimate the value of an accuracy function ym at step m; however, the decision about the model being subject to overfitting or the number of training epochs m being insufficient (i.e., underfitting) has to be made by an external mechanism or an expert [21,22]. Consequently, with this approach, the automation of the learning assessment process is only partial and requires additional supporting methods.
4. The authors of [23] described three criteria for early stopping of the training process. The first criterion suggests stopping training at the point when the generalization loss exceeds a certain threshold. Let E be the target (error) function of the training algorithm. Etr(t) is the training error, averaged over the training set and measured after epoch t; Eva(t) is the error on the validation set; Ete(t) is the error on the test set. In practice, the generalization error is usually unknown, and only the validation error Eva(t) can be used to estimate it.
The value Eopt(t) is defined as the smallest validation error obtained before epoch t: Eopt(t) = min over t′ ≤ t of Eva(t′). The generalization loss at epoch t is then defined as the relative increase in validation error compared to the minimum error up to that point in time (as a percentage): GL(t) = 100·(Eva(t)/Eopt(t) − 1). A high generalization loss is one obvious reason for stopping the training, as it directly indicates overfitting. The criterion itself (GLα) can be described as stopping after the first epoch t at which GL(t) > α.
The second early stopping criterion differs from the first in that it also accounts for training progress: the training error is averaged over the k previous epochs and compared to the minimum training error over those k epochs, Pk(t) = 1000·((Etr(t − k + 1) + … + Etr(t)) / (k·min(Etr(t − k + 1), …, Etr(t))) − 1). It is assumed that, with large changes in the error function over a small interval (k of about five epochs), there is a greater chance of subsequently obtaining a smaller generalization error. The second criterion (PQα) is formulated as stopping after the first epoch t at which GL(t)/Pk(t) > α.
The third criterion of early stopping is where training is halted when the generalization error increases over s consecutive sequences of k epochs. The idea behind this definition is that, according to the authors’ assumption, when the validation error increases not only once, but over s consecutive sequences, such an increase indicates the beginning of final overfitting, no matter how large the actual increase is.
Choosing a particular stopping criterion, in essence, involves a tradeoff between training time and generalization error.
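The three criteria can be sketched in code as follows, using the standard GL/PQ/UP formulations; the function names, variable names, and default thresholds are our own illustrative choices.

```python
# Sketch of the three early-stopping criteria (GL / PQ / UP); names and
# default thresholds are illustrative assumptions.

def gl(val_errors, t):
    """Generalization loss GL(t) = 100 * (E_va(t) / E_opt(t) - 1)."""
    e_opt = min(val_errors[: t + 1])
    return 100.0 * (val_errors[t] / e_opt - 1.0)

def progress(train_errors, t, k=5):
    """Training progress P_k(t) over the last k epochs (scaled by 1000)."""
    strip = train_errors[max(0, t - k + 1): t + 1]
    return 1000.0 * (sum(strip) / (len(strip) * min(strip)) - 1.0)

def stop_gl(val_errors, t, alpha=5.0):
    """Criterion 1: stop once the generalization loss exceeds alpha percent."""
    return gl(val_errors, t) > alpha

def stop_pq(train_errors, val_errors, t, alpha=0.5, k=5):
    """Criterion 2: stop once the quotient GL(t) / P_k(t) exceeds alpha."""
    return gl(val_errors, t) / progress(train_errors, t, k) > alpha

def stop_up(val_errors, t, k=5, s=3):
    """Criterion 3: stop when the validation error rose at the end of
    s successive strips of k epochs each."""
    if t - s * k < 0:
        return False
    return all(val_errors[t - i * k] > val_errors[t - (i + 1) * k]
               for i in range(s))

val = [1.00, 0.80, 0.70, 0.75, 0.90]
print(stop_gl(val, 4))  # GL(4) is about 28.6% > 5% -> True
```

Lower thresholds stop training earlier (shorter training, possibly higher generalization error); higher thresholds trade extra epochs for a chance of a better minimum, which is the tradeoff noted above.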
5. Feature Descriptions of Learning Curves
It is useful to divide descriptions of the learning curves into two types: those calculated from the learning curves for both cost and accuracy functions, and those obtained from accuracy functions only.
This distinction is based on the fact that it makes sense to compute some of the proposed attributes using only normalized data. The time series obtained using learning curves of an accuracy function with values in the range [0, 1] are considered normalized. On the other hand, time series obtained from cost function curves are not normalized. We propose to introduce the common features described below for all considered learning curves, regardless of the type of function (cost or accuracy).
1. Standard deviation of the difference between the functions of a given metric on the training and validation sets:
where Ft is the metric function for training, Fv is the metric function for validation, and N is the number of training epochs of the neural network model. Using this feature, we can infer how much the values of a given metric differ between training and validation, and which function has a larger average. If the values of the training function nearly match the values of the validation function, f1 will be close to zero.
2. Standard deviation of the training metric function:
This feature can be used to understand how much the metric in question changes during training. A feature value close to zero can indicate that the neural network is not learning.
3. Standard deviation of the validation metric function:
This feature can be used to understand how much the metric in question changes during validation. For cost functions, if the initial value is unsatisfactory with respect to the intended task, the lack of change during validation may indicate that the neural network model has not learned, i.e., is not able to detect patterns
4. Average value of a metric’s derivative function at extreme training epochs:
where k is the percentage of extreme training epochs at which the feature is calculated. The parameter k was empirically set to 10%. The sign of the feature serves as an indicator of a function’s tendency toward increasing or decreasing. The value of the derivative can be used to see how much a given metric changes by the end of training, which, together with the sign, gives us an idea of the model’s trend toward overfitting or underfitting due to an insufficient number of training epochs.
5. Average value of a metric’s derivative function at extreme validation epochs:
6. Standard deviation of a metric’s function at extreme training epochs:
where n is the percentage of extreme training epochs at which the feature is calculated. The parameter n was empirically set to 20%. Using this feature, it is possible to infer how much the metric in question changes during the last training epochs. Values close to zero can indicate stabilization of the metric’s function at the end of training.
7. Standard deviation of a metric’s function at extreme validation epochs:
This feature can be used to infer how much the metric in question changes during the most recent validation epochs. A value close to zero may indicate that there is no overfitting.
8. A number of discrete basis functions from active perception theory [27,28,29] calculated for both metrics on training and validation:
These features provide a relationship between the time series for which they are calculated. For example, features f8 and f11, which are calculated during the training and validation phases, respectively, show the tendency of learning curves to increase or decrease.
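A few of the common features above (f1, f2, and f4) can be sketched with plain-Python statistics. The exact normalization in the paper’s formulas may differ, so treat this as an illustrative approximation.

```python
# Illustrative computation of features f1, f2, and f4 from recorded
# metric curves; normalization details are assumptions.
from statistics import pstdev

def f1_std_of_gap(ft, fv):
    """f1: standard deviation of the train/validation difference."""
    return pstdev(t - v for t, v in zip(ft, fv))

def f2_std_train(ft):
    """f2: standard deviation of the training metric function."""
    return pstdev(ft)

def f4_tail_derivative(ft, k=0.10):
    """f4: mean first difference over the last ~k (10%) of training epochs."""
    n = max(2, int(len(ft) * k) + 1)
    tail = ft[-n:]
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    return sum(diffs) / len(diffs)

ft = [1.0, 0.7, 0.5, 0.4, 0.35, 0.33]
fv = [1.1, 0.8, 0.6, 0.55, 0.52, 0.51]
print(round(f4_tail_derivative(ft), 3))  # negative sign: loss still falling
```

A near-zero f2 would flag the “not learning” case, while the sign of f4 distinguishes a still-improving run from a plateaued or degrading one, as described above.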
We also propose to introduce the following features for learning curves of the accuracy function specifically.
1. Difference between the initial and final values of the accuracy function during training:
In addition to the standard deviation over the whole period of training, the value of this feature can be used to indicate how much the accuracy function has changed. A value close to zero can be interpreted as a lack of learning. More importantly, a positive sign for this feature’s value shows a trend toward improvement in the accuracy metric, and a negative sign shows a trend toward its deterioration.
2. Difference between the initial and final values of the accuracy function during validation: interpreted similarly to the previous feature.
3. Difference between final values of the accuracy function for training and validation:
This feature can be used to determine how different the accuracy metrics are for training and validation at the end of training. With the value of the training accuracy function close to one, a value of around zero is an indication that the training was performed successfully.
4. Maximum value of the training accuracy function:
5. Maximum value of the validation accuracy function:
6. Difference between the initial and final values of the accuracy function at extreme training epochs:
The parameter n was empirically set to 20%. A near-zero value of this feature, together with a near-zero standard deviation for the last training epochs, may indicate a stabilization of the accuracy function’s values.
7. Difference between the initial and final values of the accuracy function at extreme validation epochs:
A near-zero value for this feature, together with a near-zero standard deviation for the last validation epochs, can indicate stabilization of the accuracy function for validation. If the value of the accuracy function is close to one, it can be used as an indicator that the learning process was performed successfully. If the value of the accuracy function is close to zero for validation, we can say that the model has not trained enough or that it is now in a stabilized overfit.
8. The area under the learning curve of the accuracy function on training (f21). A near-zero value of this feature may indicate underfitting of the neural network.
9. The area under the learning curve of the accuracy function on validation (f22). A near-zero value can indicate underfitting of the neural network, provided a maintained upward trend of the accuracy function, as well as overfitting in situations where the derivative of the accuracy function is negative for the last stages of validation.
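Two of the accuracy-specific features can be sketched similarly; the trapezoidal approximation of the area under the curve is our assumption, and the function names are illustrative.

```python
# Illustrative computation of the start-to-end accuracy change (f14) and
# the area under an accuracy curve (f21/f22), via the trapezoidal rule.

def f14_total_change(acc):
    """Difference between the final and initial training accuracy."""
    return acc[-1] - acc[0]

def auc(acc):
    """Trapezoidal area under an accuracy curve, normalized by epoch count."""
    area = sum((a + b) / 2.0 for a, b in zip(acc, acc[1:]))
    return area / (len(acc) - 1)

train_acc = [0.5, 0.7, 0.85, 0.9]
print(f14_total_change(train_acc))  # positive value: improvement trend
print(round(auc(train_acc), 3))
```

With accuracy values in [0, 1], the normalized area also lies in [0, 1], so a near-zero value directly signals the underfitting case described above.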
Thus, the feature descriptions of learning curves consist of 13 features for the loss functions and 22 features for the accuracy function.
7. Results
In this selection experiment, after the feature extraction process and the SMOTE [30,31,32,33] data augmentation algorithm, we obtained the following distribution of training samples: cost function learning curve data sample size—387 objects (129 objects for each class); accuracy function learning curve data sample size—225 objects (75 objects for each class). The experiment consisted of performing cross-validation (Table 1).
From the obtained results, it is clear that, with respect to the selected performance metrics, the most effective classifiers for determining the state of the training process were the random forest models.
To conduct a computational experiment comparing the proposed model with its counterparts, it was decided to create and train 20 artificial neural networks to solve a classification problem.
The criterion by which the models were compared with each other was the average difference in the number of training epochs after which the training process was stopped: Q = (1/M)·Σ(nm − na), where M is the number of neural network models, nm is the index of the last epoch computed by the proposed model, and na is the index of the last epoch computed by the counterpart. In the case of the proposed model, the epoch at which training was halted was the epoch at which the model first output “overfit”.
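Under this definition, the comparison criterion can be sketched as follows, with hypothetical stopping epochs rather than the paper’s data:

```python
# Sketch of the comparison criterion Q: the average difference between the
# stopping epoch of the proposed model (n_m) and of the counterpart (n_a).
# The example epochs below are hypothetical.

def q_metric(stop_model, stop_counterpart):
    """Q = (1/M) * sum(n_m - n_a) over the M trained networks."""
    assert len(stop_model) == len(stop_counterpart)
    m = len(stop_model)
    return sum(nm - na for nm, na in zip(stop_model, stop_counterpart)) / m

# Hypothetical stopping epochs for 5 networks:
print(q_metric([40, 55, 30, 62, 48], [40, 58, 33, 62, 48]))
```

A negative Q means the proposed model stops earlier on average than the counterpart; a value near zero means both methods stop at roughly the same epoch.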
For each neural network model, a fixed maximum of 100 training epochs was specified. The system’s performance was compared to the “early stopping” method provided by the Keras API, without any manual configuration on our part.
The obtained results showed a value of Q = −0.15 for the Keras API callbacks. The negative, near-zero value of Q indicates that, in most cases, the proposed model detected overfitting simultaneously with the Keras API “early stopping” method and, occasionally, 3–5 epochs ahead of it.
However, it is worth noting that the Keras API callback method, unlike the proposed model, does not provide information for interpreting the training results.
Figure 4, Figure 5 and Figure 6 show examples of the results of the computational experiment. The vertical line indicates the epoch at which the training process was stopped by the Keras API “early stopping” method.