3.1. Patient Profiling—Clustering
The segmentation of patient profiles was established on the premise that healthy patients undergoing only follow-up tests typically do not repeat those tests within the same year. The research findings do not allow for predicting future orders in this group. However, there exists a sizable subset of patients (approximately 10%) who undergo tests more than 10 times, indicating progress in their therapy. Some patients repeat tests until reaching a specific expected level of the parameter, signifying recovery or another condition, such as pregnancy in the case of the bHCG test. In these cases, it is possible to deduce from the test results whether they will return for additional tests and within what time frame.
The initial step involved demonstrating the statistical differences between patient groups to facilitate further conclusions. Self-organizing maps (SOMs) were utilized for this purpose [
6]. SOMs are Kohonen neural networks designed for unsupervised learning, particularly useful for clustering sequential data like time sequences or time series. They can cluster and analyze patterns within these data. The process of clustering sequence data using SOMs involves training a SOM map on the data to represent the data’s structure and relationships [
7]. SOMs can identify similar patterns in temporal sequences and group them based on their similarity characteristics. For instance, when analyzing temporal sequences in patient data, SOMs can assist in identifying similar study patterns.
The implementation of the models was carried out in Python using the MiniSom library and various tools for data analysis and machine learning, such as numpy, pandas, keras, tensorflow, etc. SOM was used for clustering the sequence data of patient examinations and then presenting the results in the form of a cluster map. The cluster assignments and sequence data were then used to train LSTM networks for the future classification of patients into specific profiles. The data underwent preprocessing, including normalization and padding/extending to a constant length. In the next stage, appropriate parameters for the self-organizing map were selected, and then the map was trained on the data. The implementation in the MiniSom library does not require the predefinition of a multilayer architecture; a SOM is a single layer of neurons arranged in a grid, each of which is connected to all inputs, thus adapting to the input vectors and their length.
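A minimal sketch of this clustering step with MiniSom is given below; the map size, sigma, learning rate, and number of iterations are placeholder values rather than the settings used in the study, and the random input array stands in for the normalized, padded test-result sequences.

```python
# Sketch of SOM clustering of padded, normalized test-result sequences (assumed inputs).
import numpy as np
from minisom import MiniSom

sequences = np.random.rand(500, 12)          # placeholder for padded bHCG sequences

som = MiniSom(x=5, y=5, input_len=12, sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(sequences)
som.train_random(sequences, num_iteration=5000)

# Each sequence is assigned to the map node (cluster) whose weight vector matches it best.
clusters = [som.winner(seq) for seq in sequences]
```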
The outcome of this process was a mapping of temporal sequences based on their similarities, enabling the visualization and identification of clusters or patterns, which proved valuable for analyzing and understanding the structure of the sequence data (see
Figure 2) [
8].
The research results presented in
Figure 2, characteristic of the four selected clusters, illustrate how typical variations in the outcomes of a given study may manifest in individual patients. The normalized value of the bHCG test is shown on the Y axis, and the position of the test within the series on the X axis. In individuals who regularly monitor the level of a certain parameter (in this case, betaHCG), this level may undergo characteristic fluctuations, which may indicate a disorder (or, in the case of bHCG, often a desired pregnancy), the effectiveness or ineffectiveness of therapy, etc. Depending on changes in the parameter value, further steps in therapy can be predicted and, consequently, so can specific patient behavior regarding further tests (whether a repeat test should be expected or not).
Clustering was performed using self-organizing maps, and the number of clusters was selected based on the Silhouette index. The algorithm takes a list of numerical sequences, map size, learning rate, sigma, and network size as its input and produces a list of clusters with sequences assigned to them as its output. Silhouette Index Analysis is a method used to interpret and validate consistency across data clusters. The indicator value measures the similarity of an object to its own cluster compared to other clusters. It can be used to study the separation distances between the resulting clusters. For each sample, the mean distance to the other members of its own cluster (a) and the mean distance to the members of the nearest neighboring cluster (b) are computed; the silhouette value is (b − a)/max(a, b), and the index is obtained by averaging these values over all samples.
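A hedged sketch of how the Silhouette index can be used to choose the map size is shown below; the candidate grid sizes and SOM parameters are illustrative assumptions, and cluster labels are derived from each sequence's winning node.

```python
# Sketch: select the SOM grid size by maximizing the Silhouette score of the resulting clusters.
import numpy as np
from minisom import MiniSom
from sklearn.metrics import silhouette_score

def som_labels(data, grid):
    som = MiniSom(grid, grid, data.shape[1], sigma=1.0, learning_rate=0.5, random_seed=42)
    som.train_random(data, 5000)
    # Flatten (row, col) winner coordinates into a single integer cluster label per sequence.
    return np.array([r * grid + c for r, c in (som.winner(v) for v in data)])

data = np.random.rand(500, 12)               # placeholder for padded sequences
for grid in (3, 4, 5, 6):
    labels = som_labels(data, grid)
    if len(set(labels)) > 1:                 # silhouette needs at least two clusters
        print(grid, silhouette_score(data, labels))
```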
Additionally, the PrefixSpan sequence pattern discovery method was employed [
9]. This method aims to identify the most frequently occurring sequences of events based on a list of sequences and a minimum support value that determines their frequency. The algorithm also identifies the longest sequences, assuming a minimum number of repetitions. The results are sorted based on the specified criteria. The method presented produces a set of sequence patterns. Each tuple in the set contains two values: the support of the pattern and the recognized sequence, which is a list of cyclically repeated values. The data were discretized (qualitized) using the KBinsDiscretizer method [
10] from the sklearn.preprocessing library to ensure they could be used with the PrefixSpan method. Discretization is the process of transforming continuous data into discrete values for numerical processing. The method used grouped variable values into countable containers and assigned each a unique integer while maintaining the ordinal relationship between them.
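The sketch below illustrates this combination of KBinsDiscretizer with PrefixSpan mining (here via the prefixspan Python package); the number of bins, the minimum support, and the placeholder sequences are illustrative assumptions rather than the study's settings.

```python
# Sketch: discretize continuous results into ordinal bins, then mine frequent patterns.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer
from prefixspan import PrefixSpan

results = [np.random.rand(np.random.randint(5, 15), 1) for _ in range(200)]  # placeholder series

disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")
disc.fit(np.vstack(results))
db = [disc.transform(seq).ravel().astype(int).tolist() for seq in results]

ps = PrefixSpan(db)
patterns = ps.frequent(20)        # list of (support, pattern) tuples with support >= 20
patterns.sort(reverse=True)       # most frequent patterns first
print(patterns[:5])
```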
This research study proposes a third algorithm for classifying sequences into individual categories using LSTM recurrent neural networks [
11]. The recurrent network is trained using the sequence and the SOM clustering result from the previous method. Test data are fed to the trained network to determine the correct operation of the algorithm. The resulting classifier can be used to classify patient data. A recurrent neural network is a type of artificial neural network that creates a directed or undirected graph along a time sequence [
12]. Its use demonstrates the dynamic behavior of data over time. Unlike feedforward neural networks, recurrent neural networks use their internal state (memory) to analyze sequences of variable-length input data. LSTM (Long Short-Term Memory) is a type of recurrent neural network that is designed to work with sequential data such as text, time series, audio, or action sequences. It has the ability to remember long-term dependencies in data [
13].
LSTM is distinguished by its capacity to retain information for longer periods than standard RNNs, thereby mitigating the vanishing gradient problem that occurs in classic RNNs. This is accomplished through the use of special structures known as ‘gates’ that determine which information should be retained and which should be discarded during sequence processing [
14].
The data are divided into training and testing sets (80:20) to train the model and evaluate its performance, as well as its accuracy on data it has not seen before. The utilized recurrent neural network model consists of the following layers:
Hidden layers:
LSTM layer: 100 neurons—a recurrent layer that uses special memory cells in addition to standard units. Thanks to these cells, the layer can retain information for a long time, allowing it to learn long-term dependencies.
Dropout layer: 50% dropout rate. It contains no neurons of its own; it is a regularization layer that randomly sets input units to 0 at a given rate at each step during training, which helps prevent overfitting.
Dense layer (hidden layer): 100 neurons with ReLU activation function.
Output layer: Dense layer: The number of outputs, which is equal to the number of clusters (25 in this case) with softmax activation function.
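A sketch of this architecture in Keras is given below; the sequence length, optimizer, batch size, and number of epochs are illustrative assumptions, while the layer sizes, dropout rate, number of outputs (25 clusters), the 80:20 split, and the categorical cross-entropy loss follow the description in this section.

```python
# Sketch of the LSTM sequence classifier: LSTM(100) -> Dropout(0.5) -> Dense(100, ReLU) -> Dense(25, softmax).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense
from tensorflow.keras.utils import to_categorical

seq_len, n_clusters = 12, 25
X = np.random.rand(1000, seq_len, 1)                  # placeholder padded sequences
y = to_categorical(np.random.randint(n_clusters, size=1000), n_clusters)  # SOM cluster labels

model = Sequential([
    LSTM(100, input_shape=(seq_len, 1)),
    Dropout(0.5),
    Dense(100, activation="relu"),
    Dense(n_clusters, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, validation_split=0.2, epochs=50, batch_size=32)
```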
The training process involves the use of the categorical_crossentropy loss function, which is typically used in multi-cluster classification tasks where a sequence can belong to one of several categories and the model must determine the appropriate category. The recurrent network was trained multiple times, and the average results are presented in
Figure 3. A graph of loss and accuracy across the training epochs is also provided for the best result. The accuracy achieved when training and testing the recurrent neural network for recognizing sequences belonging to particular categories was 91.79%. The graphs presented in
Figure 3 illustrate the training process by showing the accuracy and loss values in individual epochs.
Accuracy and loss graphs are common tools used in training recurrent neural networks (RNNs) to monitor the performance and progress of the model during training. Accuracy refers to the proportion of correctly classified instances among the total instances. In the context of RNNs, an accuracy graph shows how well the model is performing in terms of making correct predictions. Accuracy is plotted against the number of training epochs (iterations through the dataset). At the beginning of training, accuracy is low as the model is still learning patterns in the data. As training progresses, accuracy increases as the model learns to make better predictions. Accuracy graphs provide insights into whether the model is learning and improving over time. A consistently increasing accuracy curve indicates that the model is learning effectively, while fluctuations or a plateau may suggest issues such as overfitting or underfitting.
Loss, also known as error, measures how far the predicted values are from the actual values. A loss graph shows the value of the loss function (categorical cross-entropy) over the training epochs. The goal during training is to minimize the loss, that is, to make the predicted values as close as possible to the actual values. Similar to accuracy, loss is plotted against the number of training epochs. At the beginning of training, the loss is high as the model makes random predictions. As training progresses, the loss decreases as the model adjusts its parameters to make better predictions. Loss graphs help monitor the convergence of the model during training. A decreasing loss curve indicates that the model is learning and adjusting its parameters effectively.
Monitoring both graphs during training helps to assess the performance and convergence of the recurrent neural network.
3.2. Prediction of the Number of Orders—Exponential Smoothing
As part of the research, a universal model was selected to suit diverse time series of various types of research and provide precise data on the number of orders aggregated monthly. The chosen forecasting model is based on exponential smoothing, which uses a weighted moving average to reduce variance and predict future values of the series. The exponential methods employed include Brown’s Model, Holt’s Linear Model, and Winters’ Model, all time series forecasting techniques.
The Winters’ Exponential Smoothing Model is an extension of the classic exponential smoothing model and takes into account seasonality and trends in the data [
15].
Level Component: Represents the baseline level of the data over a given period of time. In Winters’ Exponential Smoothing Model, it is updated according to the last observation and takes into account current level changes.
Seasonal Component: Represents cyclical changes in the data, such as months of the year, days of the week, etc. It is accounted for in Winters’ model by including a seasonal factor that modifies the forecasts.
Trend Component: Represents a long-term upward or downward trend in the data. It is included to predict future trend changes.
Exponential smoothing is a forecasting technique that assigns exponentially decreasing weights to historical observations. This means that the forecast depends not only on the last observed value but on the entire series of values, with the influence of older values being smaller than that of newer ones.
Exponential smoothing models are characterized by four parameters (α, β, γ, ϕ), and various initialization methods are considered. The β parameter controls the trend, while the ϕ parameter controls the strength of damping the trend. The γ value is responsible for seasonality in the model. The main challenge in the algorithm is selecting the appropriate parameters to achieve the most accurate forecasts. Furthermore, the model allows for an independent determination of the nature of each component—trend, seasonality, and residuals—specifying whether each component is additive or multiplicative. It is also possible to assume that a component does not appear in the series, particularly for trend and seasonality. To shorten the notation, we use the standard ETS designation for exponential smoothing models. The model components are represented by the letters E for error, T for trend, and S for seasonality. The appropriate symbols for each component type are then inserted: A for additive, M for multiplicative, and N for none (only for trend and seasonality). In the case of a damped trend, the letter ‘d’ is added. For instance, ETS(A, Md, N) refers to a model with additive errors, a multiplicatively damped trend, and no seasonality.
Subsequent modifications and extensions of this algorithm gave it the form currently popularly used: AAA ETS, i.e., AAA = Additive Error, Additive Trend, and Additive Seasonality; ETS = Exponential Triple Smoothing algorithm [
16].
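A minimal sketch of an AAA ETS (Holt-Winters) forecast using the statsmodels library is shown below; the synthetic monthly series, the seasonal period of 12, the damping option, and the six-month horizon are illustrative assumptions, not the study's configuration.

```python
# Sketch: additive-trend, additive-seasonality exponential smoothing of monthly order counts.
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range("2019-01-01", periods=36, freq="MS")
orders = pd.Series(1000 + 10 * np.arange(36)
                   + 50 * np.sin(np.arange(36) * 2 * np.pi / 12), index=idx)

model = ExponentialSmoothing(orders, trend="add", seasonal="add",
                             seasonal_periods=12, damped_trend=True)
fit = model.fit()                      # alpha, beta, gamma, phi estimated from the data
forecast = fit.forecast(6)             # six-month-ahead forecast of monthly orders
print(forecast)
```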
The research scenario stated that the ten parameters with the highest reagent consumption (i.e., the largest number of tests) would be selected for testing. A list of the ten most frequently performed tests was compiled, taking into account that some tests do not require reagents (such as blood count tests or urinalysis). Although bHCG was only ranked 43rd, it was included in the list due to the repeated tests on this hormone and its prediction being used as a reference value.
Three separate forecast sub-models were developed for each of the tests mentioned in
Table 1, one for each patient profile. Furthermore, it was noted that order values on weekends seldom exceed 5%. As a result, separate models were created for weekend forecasts. In total, 33 time series simulations were conducted, each spanning 21 months (with a three-month delay for the initial forecast). The total number of regular forecasts was 693, with an additional 33 weekend forecasts.
Figure 4 provides a few examples of individual forecasts.
A prediction plot with confidence intervals overlaid on plotted historical data from the same period is shown in
Figure 4. The forecasts were calculated based on data from two three-month periods from each of the three previous years; therefore, the timeline (x-axis) cannot be regarded as a continuous range but rather as individual measurement points. Each of the plots depicts the trend and prediction for a different examined parameter. It can be observed that at some points, the prediction lags one time step behind the historical data; however, the confidence intervals cover the variability quite well. It is also worth noting that the period during which the models were examined coincided with the pandemic, which disrupted the blood testing market and affected its dynamics.
Table 2 presents the results of forecasts for 11 factors, including the 10 most commonly performed plus bHCG, over a six-month period. The forecasts were calculated based on data from two three-month periods from each of the three previous years during the same periods of the year. The table compares the averages and standard deviations of these samples.
The Mean Absolute Percentage Error (MAPE) is the percentage error calculated separately for each parameter within a given set of forecasts and then averaged. The MAPE value of 2.16% indicates the precision of the models.
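For reference, a minimal sketch of the MAPE calculation is given below; the observed and forecast values are placeholder numbers, not results from the study.

```python
# Sketch of the MAPE calculation for one parameter's monthly forecasts.
import numpy as np

actual = np.array([1200, 1150, 1300, 1250, 1180, 1220])    # placeholder observed order counts
forecast = np.array([1185, 1170, 1275, 1260, 1205, 1210])  # placeholder forecasts

mape = np.mean(np.abs((actual - forecast) / actual)) * 100
print(f"MAPE = {mape:.2f}%")
```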
Figure 5 shows a scatter plot of the model predictions compared to the observed order quantities.
Figure 5 shows a graphical representation where, on the Y axis, we have the predictions made by a model and, on the X axis, we have the actual observed quantities of orders. Lines representing confidence intervals in a scatter plot indicate the range within which the true values of the data points are likely to fall with a 90% level of confidence. In other words, they provide an estimation of the uncertainty associated with the predicted values. Typically, the narrower the confidence interval, the more precise the predictions are considered to be.
3.3. Classification of Analyzer Overload—Decision Trees and Logistic Regression
The objective was to develop models that characterize the utilization patterns of analyzers in different time intervals and create efficient models based on them that describe the temporal variability. This enables the representation and prediction of their successive values in the future. The developed mechanisms optimize the workload of analyzers. The objective of the task was to eliminate periods of excessive analyzer workload, which cause bottlenecks, accelerated wear, and delays in issuing test results. Additionally, it is essential to reduce the time during which a sample awaits the completion of the requested analysis. The workload of the analyzers should be evenly distributed, ensuring scheduled downtime for maintenance and other service activities related to the equipment.
The initial phase of data processing in the acquisition, parsing, and analysis process involved creating techniques to identify workload limits for analyzers. These mechanisms were then tested and optimized. Based on the results, fundamental structures and interfaces were developed to signal alarm states for analyzer workload.
An analysis was conducted on the cyclic nature of workload curves for analyzers in different cycles and various devices (
Figure 6). The factors and periodic indicators for each curve were evaluated to facilitate the selection of periods for training models when approximating workload curve models for analyzers.
Each of the graphs in
Figure 6 presents the daily cycles of variability in analyzer load for different parameters. Each of the three graphs displays data from multiple daily cycles. The essence of approximating the daily curve is to find statistical characteristics that best capture the cyclical variability in load and allow for the most efficient prediction of device overload moments, as this is when process bottlenecks are expected. Raising an alert a few minutes in advance, before the overload occurs, allows for the relocation of some samples to another analyzer and avoids queues.
The use of approximation models facilitated a statistical study to detect the maximum performance point in the analyzer model. This analysis enables the identification of the alerting point before the peak performance point in the prepared program. The aim was to find characteristics that accurately represent the variability of the load while also presenting it in a way that ensured it was smooth and resistant to momentary disturbances and deviations. Regression tools, such as linear regression, exponential regression, polynomial regression, spline regression, and DWLS (distance-weighted least squares smoothing) [
17], were used to analyze the variability of individual curves. The LOWESS (locally weighted scatterplot smoothing) method produced the most accurate curve fitting results. This method determines the individual points of the curve using polynomial regression models, resulting in a well-fitted approximation of the entire pattern in strongly nonlinear and irregular models. This captures the specific nature of the time-indexed load dependency. The LOWESS method involves fitting a regression curve to a subset of the data by selecting a window for each data point that includes only the nearest neighboring points. The window size is a hyperparameter of the model. Within this window, regression is performed using a low-degree polynomial to fit the local curve to the data. Weighted least squares are used to assign greater weight to nearby points and less weight to distant points, allowing for the consideration of local trends. The key parameter of the LOWESS method is the width of the window, which determines the number of points included in the local curve fitting. Narrower windows result in more detailed smoothing, while wider windows lead to more general trends. The LOWESS algorithm repeats this process for all data points, resulting in a smoothed curve that reflects local trends in the data. The LOWESS method is flexible and can handle nonlinear dependencies between variables. However, the algorithm may be susceptible to the influence of outliers or be unstable in the case of low-density data [
18].
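A short sketch of LOWESS smoothing of a daily load curve with statsmodels is shown below; the synthetic load data and the window fraction (frac) are illustrative assumptions.

```python
# Sketch: LOWESS smoothing of one daily analyzer-load cycle.
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

minutes = np.arange(0, 24 * 60, 5)                       # 5-minute time index of one daily cycle
load = np.clip(np.sin(minutes / (24 * 60) * np.pi)
               + np.random.normal(0, 0.1, minutes.size), 0, None)  # placeholder noisy load curve

# Returns the smoothed curve as (x, smoothed y) pairs; frac controls the local window width.
smoothed = lowess(load, minutes, frac=0.1)
```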
The median was used as the most representative indicator. A rolling median was calculated based on historical data from the past 10 runs, followed by another calculation based on the median historical run with a window of 5. A regression coefficient was then computed to allow for slope assessment, assuming a zero intercept (intercept = 0) to maintain the intersection point on the same level for each line in the resulting dataset. The training data were labeled using the ‘signal’ parameter to indicate a binary label for when an alert should be triggered. This signifies the saturation of the analyzer and, specifically, an assessment of saturation risk, requiring the initiation of procedures related to redirecting further investigations to another device.
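The sketch below shows one possible reading of this feature construction in pandas; the placeholder daily runs, the slope window, and the alert threshold are illustrative assumptions rather than the values used in the study.

```python
# Sketch: median historical run, rolling-median smoothing, zero-intercept slope, and a binary 'signal' label.
import numpy as np
import pandas as pd

history = pd.DataFrame(np.random.rand(10, 288))     # placeholder: 10 historical daily runs

median_run = history.median(axis=0)                 # median historical run (per time point)
smoothed = median_run.rolling(window=5, min_periods=1).median()

def zero_intercept_slope(y, window=12):
    """Least-squares slope through the origin over the most recent points."""
    x = np.arange(1, window + 1, dtype=float)
    y = np.asarray(y[-window:], dtype=float)
    return float(x @ y) / float(x @ x)

slope = zero_intercept_slope(smoothed.values)
signal = int(slope > 0.002)                         # illustrative threshold for the 'signal' alert label
```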
Logistic regression is a popular statistical model used for analyzing categorical data, where the dependent variable takes two possible binary values. It is a classification technique that predicts the probability that an observation belongs to one of two classes based on the values of independent variables—here, features in the form of the rolling median and time. The main idea behind logistic regression is to transform a linear regression model into log-odds space, which allows the probability of belonging to a specific class to be modeled. Unlike linear regression, which predicts continuous values, logistic regression estimates the probability of membership in one of two classes. The basic form of logistic regression is a binary model where the dependent variable takes the value 0 or 1. This model is based on the logistic function, also known as the sigmoid function, which transforms the result of the linear regression into the [0, 1] range [
19].
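A minimal sketch of such a binary classifier with scikit-learn is given below; the feature matrix (rolling median and time of day) and the placeholder 'signal' labels are illustrative assumptions.

```python
# Sketch: logistic regression on [rolling median, time-of-day] features with a binary alert label.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(5000, 2)                  # placeholder [rolling median, time-of-day] features
y = (X[:, 0] > 0.8).astype(int)              # placeholder 'signal' alert labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
# Predicted probability of the alert class, i.e., the sigmoid of the linear term.
print(clf.predict_proba(X_test[:5])[:, 1])
```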
The logistic regression model achieved a satisfactory 96% correct classification rate. Other devices achieved even higher accuracy, with the largest analyzers reaching up to 99.5%. However, some analyzers had an error rate as high as 15.1%, indicating weaknesses in the model that require careful monitoring of curve variability in the lower range. The errors occurred due to transient, minor load peaks in the daily cycle. To address this issue, it is necessary to narrow the variability range in daily cycles, particularly during hours when loading the analyzer to its bottleneck level is not possible. However, calibrating each analyzer model separately is not optimal or feasible when the set of analyzers subjected to such prediction is not closed and constant.
To enhance result accuracy, we decided to use a different classifier, specifically a decision tree, along with the slope metric of the curve represented by calculated regression coefficients and a moving median for load profiles. CART (Classification and Regression Trees) models are a widely used method for classification and regression in data analysis. Decision trees are diagrams that show hierarchical decision structures. They allow decisions to be made based on a set of conditions or independent variables. CART constructs a decision tree by dividing the dataset into subsets based on the values of independent variables to minimize heterogeneity (impurity) within each subset [
20].
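A hedged sketch of a CART classifier on the slope and rolling-median features is shown below; the placeholder data, labels, and tree depth are illustrative assumptions, not the study's settings.

```python
# Sketch: CART decision tree on [slope, rolling median] features for the overload alert.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(5000, 2)                             # placeholder [slope, rolling median] features
y = ((X[:, 0] > 0.6) & (X[:, 1] > 0.7)).astype(int)     # placeholder alert labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, criterion="gini").fit(X_train, y_train)
print("accuracy:", tree.score(X_test, y_test))
```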
The aim of the optimization was to decrease the maximum load analyzer time by 5% by redirecting a portion of the samples to other devices when maximum efficiency was achieved, with a minimum accuracy of 80%. To prevent systematic errors associated with seasonal loads, a sample of 30 daily profiles was randomly selected from five random analyzers in the most burdened group. The data for one day of work of the five analyzers were collected over a period of at least three months, considering the years 2020 and 2021. As a control group, 30 random daily profiles were selected from the two previous years during the same period of the year. This corresponds to three months and five random analyzers.
Table 3 presents the results of the conducted tests. The value referred to as the “average daily load” essentially represents the percentage of time when the device operates at maximum capacity. This is evident from the characteristic plateau observed in the graphs in
Figure 6, indicating that the device has a backlog of orders and the samples are “waiting” for analysis. In this context, the standard deviation shows by how many percentage points these average values typically differ across different daily cycles. The reduction in the load, induced by transferring samples to other devices after the forecasted overload alert is observed, is measured by the “average daily load (Δtransfer)” variable. In this case, the standard deviation does not significantly differ from the results without transfer.
The accuracy of alert predictions, measured as the percentage agreement with load predictions, was calculated for ten of the most heavily loaded analyzers, as well as for 30 daily runs randomly selected for the years 2021 and 2022. The accuracy level exceeded 80% in each analyzer’s case, with an overall accuracy of 81.72%. The tests showed that each analyzer reduced working time by more than 10%, with an average reduction of 11.98% under maximum load.
3.4. Detection of the Required Calibration Moment—Neural Networks
The initial step is to identify weaknesses in the control processes, such as quality control parameters that do not accurately reflect the true variability of the measurements of the analyzers due to small initial study samples and the excessive precision of measurements on control materials. Further efforts should focus on identifying mechanisms to achieve quality control by determining appropriate methods for analyzing data from real patient measurements [
21]. Quality control (QC) based on measurements of proprietary control materials covers only the analytical stage of the result generation process. Patient-based quality control techniques have been described for over fifty years and have been widely used in hematology for forty years [
22]. However, due to practical issues, they are not widely applied in clinical chemistry laboratories. Nevertheless, recently, due to the availability of intermediary software and a greater appreciation of the benefits of these processes, there has been an interest in exploring their use as quality control tools. One method of such analysis is the assessment of “averages of normals” (AONs) proposed by Cembrowski.
The purpose of implementing these assumptions was to create models for estimating the compatibility of statistical methods with the actual results obtained on analyzers during quality control studies and after device recalibration. One of the elements involves identifying the parameters of the AoN method [
23].
Excluding from the set those measurement values that fall outside the reference ranges [
24].
Determining the number of results necessary for calculating averages (e.g., using the Cembrowski procedure).
Establishing control ranges.
Developing a model to assess the impact of recalibrating analyzers on the statistical characteristics of the results obtained on the devices [
25].
MA QC, also known as Patient-Based Real-Time Quality Control (PBRTQC), is a mathematical procedure that averages patient test results in real time and uses the obtained mean values for quality control purposes [
26]. Patient-based QC generally uses the mean, but other algorithms, including the median, exponentially weighted moving average, and others, have also been developed and evaluated. The effectiveness of methods based on patient results depends heavily on the selected cutoff levels. Reference ranges have been studied for decades, yet there is still no effective and universal method for determining cutoff points, as noted by Cembrowski [
27]. The degree of interpersonal variability in a measured analyte, or the variance at the interindividual level, plays a significant role.
In laboratory practice, quality control (QC) tests are routinely conducted both daily in the morning and when there is a change in Lot or Batch, referring to the reagent used for analyses on the device. Recalibration is performed when QC checks reveal any discrepancies, as well as according to a predetermined schedule. The aim of the study in the reported task was to identify which quality controls, especially recalibrations, were unnecessary. The need for corrective action resulting from inaccurate quality control outcomes can be verified or refuted based on patient test results (see
Figure 7).
Figure 7 shows a variability plot indicating that the MA results (rather than the AoN) exhibit increasing variability over time, leading to a quality control followed by calibration. These events are marked by pairs of vertical lines, with the first indicating the quality control and the second (marked with a dot at the top of the line) marking the resulting calibration of the device. The remaining blue dots at the 100 level represent quality control marks that did not result in the need for calibration; hence, from the perspective of data quality, they were not essential. The aim is to capture the change that necessitates calibration, thus demonstrating the absence of the need for quality control at the remaining points.
Figure 7 presents the actual values of the tested material from patients in the form of data points. The diversity of results is natural, reflecting the variety of disorders and individual characteristics. However, the assumptions of methods based on patient results suggest that despite this diversity, there should be constant characteristics describing the distribution of this diversity. On the graph, we can observe quality control points, marked as blue dots, and device recalibration, marked as red vertical lines.
Statistical algorithms used for quality control on patient data include the following:
AoN (Averages of Normals).
Moving average of test results in a block.
Moving average of natural logarithms of test results in a block.
Moving average of square roots of test results in a block.
Moving median of test results in a block.
Moving median of natural logarithms of test results in a block.
Moving median of square roots of test results in a block.
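A sketch of how these block statistics, together with an AoN-style average restricted to a reference range, can be computed with pandas is given below; the placeholder patient results, block size, and reference limits are illustrative assumptions.

```python
# Sketch: patient-based QC characteristics computed over rolling blocks of results.
import numpy as np
import pandas as pd

results = pd.Series(np.random.lognormal(mean=4.5, sigma=0.2, size=2000))  # placeholder patient results
block = 50                                                                # illustrative block size

qc = pd.DataFrame({
    "MA":       results.rolling(block).mean(),
    "MA_ln":    np.log(results).rolling(block).mean(),
    "MA_sqrt":  np.sqrt(results).rolling(block).mean(),
    "MMe":      results.rolling(block).median(),
    "MMe_ln":   np.log(results).rolling(block).median(),
    "MMe_sqrt": np.sqrt(results).rolling(block).median(),
})

# AoN-style characteristic: mean of results inside an assumed reference range.
lo, hi = 60, 140                                       # illustrative reference limits
aon = results[results.between(lo, hi)].rolling(block).mean()
```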
Figure 8 presents the actual values of the examined material from patients as points, while various statistical characteristics are represented by dashed lines. The purpose of this chart is to magnify a short segment of data to observe the change in the curves resulting from device calibration. This change suggests that calibration affects the characteristics of the distribution of test result variability. To investigate how calibration affects parameters, individual points were labeled with ‘DIFF’ labels. These labels take the value of 1 when a difference is observed (
Figure 8) and 0 when calibration does not alter the indicators.
This labeling allows us to distinguish between the observation blocks from before and after calibration (
Figure 9). As observed, the results after calibration (first two box plots) do not differ significantly—their boxes overlap. However, the values in the subsequent two plots show significant variation. Therefore, the conclusion is that observations in blocks preceding calibrations marked as DIFF = 1 (i.e., necessary) are significantly different from those preceding calibrations marked as DIFF = 0 (i.e., not introducing a significant change).
The analysis visualized in
Figure 9 aimed to examine whether the selected characteristic, depicted here as the Moving Average of Square Roots of test results in a block, exhibits significant differences between values before and after calibration. Calibrations labeled as DIFF = 1, meaning those deemed necessary from a quality control standpoint, show a significant difference in the presented characteristic.
The goal of the further models was to capture the dependencies of indicator impact on the calibration designation as necessary (DIFF = 1) or unnecessary (DIFF = 0), specifically identifying quality control points that could potentially be omitted since they are not essential. To achieve this, three models were developed: the first utilized decision trees, and the other two employed neural network models.
The developed decision tree model indicates that, based on the two most important indicators, AoN and MA, an effective classifier can be created to distinguish between necessary and unnecessary calibrations without introducing significant changes. However, it was also demonstrated that the model is not entirely accurate. While many leaves exhibit high accuracy at 87% and 95%, there is a node that serves as a partition with high variability, associated with an error rate of 43% (for high values of moving averages and AoN when the device approaches a state requiring calibration). Hence, the employment of alternative techniques was deemed a necessity.
The next algorithm used in the classification of the control points was a neural network model, which yielded a significantly better fit than the decision trees. We utilized the Automatic Neural Network (ANN) search in Statistica 13.1 (Statsoft, Kraków, Poland) to explore the space of possible architectures, varying the number of layers and the number of neurons in the hidden layers, comparing MLP and RBF architectures, and testing different activation functions; the optimal architecture was determined to be MLP 9-13-2. The MLP 9-13-2 configuration specifies the architecture of the network: 9 input neurons (determined by the input variables), a hidden layer with 13 neurons, and 2 output neurons. The notation BFGS 92 refers to the Broyden–Fletcher–Goldfarb–Shanno training algorithm with 92 iteration cycles. This optimization algorithm is used to minimize the objective function during training; it is a quasi-Newton method that approximates the second derivative (Hessian matrix) of the objective function and updates the parameters iteratively to find the minimum of the function. SOS (“Sum of Squares”) and entropy error functions are used in neural networks for error calculation during training. Since the network architecture was generated automatically and the best configurations were tested, we can observe that neither the error metrics nor the types of activation functions had a significant impact on the network’s results. As evident from these comparisons, activation functions perform similarly regardless of whether they are hyperbolic tangents (tanh), exponential functions, or softmax functions. Numerous models were created, and the best exemplary architectures with good results are presented in
Table 4. The presented network architectures differ noticeably in accuracy on the validation set. Interestingly, the networks labeled ID 2 and ID 5 exhibited higher effectiveness despite weaker performance on the test set, suggesting overfitting in network ID 4.
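For orientation, the sketch below shows a rough scikit-learn analogue of an MLP 9-13-2 network trained with a quasi-Newton (L-BFGS) solver; it does not reproduce the Statistica implementation, and the placeholder data stand in for the nine QC indicators and the DIFF labels.

```python
# Sketch: MLP with one 13-neuron hidden layer and a quasi-Newton solver, as a loose analogue of MLP 9-13-2.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(2000, 9)                   # placeholder for the nine input indicators (AoN, MA, MMe, ...)
y = (X[:, 0] + X[:, 1] > 1.1).astype(int)     # placeholder DIFF labels (1 = calibration necessary)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(13,), activation="tanh",
                    solver="lbfgs", max_iter=500).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))
```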
Our observations reveal that a smaller number of input neurons, representing only two variables—AoN and MA—resulted in a decrease in accuracy. Hence, it can be inferred that utilizing additional indicators allows for better prediction. A sensitivity analysis of the network enables the assessment of the impact of individual indicators on the network output. The results can be compared by carrying out an evaluation using CART. As depicted in
Table 5, neural networks exhibit a preference for MA over AoN, although ln(MMe) also proves to be a significant factor. A detailed sensitivity analysis can be seen in
Table 5. It includes assessments of the impact of individual factors (significance) on the results obtained by the networks. The analysis showed that ln(MMe)—the moving median of the natural logarithms of the results—ranks, on average, third in the influence of the individual parameters.
Table 5 indicates yet another aspect of comparing the presented architectures. Each of these selected networks has a different proportion of influence on individual characteristics. Network ID 2 strongly prefers the moving average, with its importance being two orders of magnitude greater than the next characteristic. On the other hand, the last network, ID 5, prioritizes the natural logarithm of the moving median, followed by the natural logarithm of the moving average, but the differences in importance are not as pronounced. Only network ID 4 assigns significant weight to AoN.
The error that was particularly crucial in this situation, namely the False Negative, indicates that the model predicted that calibration was unnecessary when, in fact, it was essential. This error stands at 1.5% in the best model, while the overall accuracy of the model is 98.8%.