The main objective of this paper is the efficient estimation of TVC on beef samples based only on data acquired by an MSI system. No additional information, such as the specific time-steps at which beef samples were analyzed during storage, or the classification groups (fresh, semifresh or spoiled) of the acquired samples, was utilized. The rationale of using only spectral information is justified by the fact that such additional information may not always be instantly available. The main concept of the proposed framework, shown in
Figure 3, is TVC prediction not via a single regression model, but rather through an ensemble scheme, in which the predictions of its individual FOS-based feature components are combined, with the aid of a metalearning algorithm, to produce the final outcome. Two regression models, based on the proposed clustering-based neurofuzzy regression model (CAGFINN), have been trained and evaluated separately for the “mean” and “sd” cases, respectively. However, as we are initially interested in assessing the performance of the CAGFINN scheme acting as a TVC prediction model, the obtained results are compared with those of MLP neural networks, SVM and PLSR schemes, and wavelet neural networks (WNN). Although MLP and PLSR schemes have already been applied in this domain [
37], the comparison is also extended with the addition of SVM as well as an advanced WNN model [
38]. The main reason for using this specific WNN model is that its structure includes an additional linear-coefficients-weights layer that resembles the TSK defuzzification scheme found in CAGFINN. The dataset, consisting of 84 meat samples, includes the selected input variables (i.e., wavelengths) chosen by the feature selection fusion scheme, and the associated TVC values obtained by microbiological analysis. In this research study, two distinct procedures have been considered for the training/testing stages. In the first procedure, as the number of observations/samples is small, separating the dataset into training and testing subsets (hold-out method) was considered likely to further reduce the amount of data and to result in insufficient training of the network. Therefore, in order to improve the robustness of the identification process, the leave-one-out cross validation (LOOCV) technique was employed to evaluate the performance of all developed models for both the “mean” and “sd” feature cases. In order to investigate further the capabilities of the implemented models for TVC prediction, a second experiment was also carried out, where the initial dataset was divided into a training subset with approx. 90% of the data and a testing subset with the remaining 10% (i.e., eight samples). The performances of the developed regression models for the prediction of TVC, for both the “mean” and “sd” cases, are evaluated through a number of well-established metrics. Initially, the bias factor (Bf) and accuracy factor (Af) performance indicators, which have been extensively used in food microbiology, were applied [39]. The bias factor measures the bias of the data, while the accuracy factor measures the scatter of the data. These indicators are shown in the following equations:
B_f = 10^{\frac{1}{n}\sum_{i=1}^{n}\log_{10}(\hat{y}_i/y_i)}

and

A_f = 10^{\frac{1}{n}\sum_{i=1}^{n}\left|\log_{10}(\hat{y}_i/y_i)\right|}

where \hat{y}_i, y_i represent the predicted and desired responses, while n is the number of observations. The usual assessment of goodness-of-fit for model comparison in food microbiology is performed, in addition to Bf and Af, with the squared correlation coefficient (R2), also known as the coefficient of determination. In our case, it can be interpreted as the proportion of the variance in the predicted TVC response explained by the actual/objective TVC response. The higher the value (0 ≤ R2 ≤ 1), the better the prediction by the model. The coefficient of determination and its equivalent correlation coefficient (R) are also well-known metrics for determining the effect size. Effect size is a statistical concept that measures the strength of the relationship between two variables on a numeric scale and is used mainly in health-related domains, such as psychology. Alternatively, the effect size, as a way to compare two variables, can be calculated as the difference between the two variables’ means. Two similar metrics of this type are also considered in this paper for the evaluation of the implemented regression models [
40]. The first is the standardized mean difference (theta) scheme, which is shown in the following equation:

\theta = \frac{\mu_y - \mu_{\hat{y}}}{\sigma_y}

where \mu_y, \mu_{\hat{y}} represent the means of the desired and predicted responses, respectively, while \sigma_y is the standard deviation of the desired response. A slightly different approach is Cohen’s d effect size metric, which is defined as the difference between the two means divided by the pooled standard deviation of the data:

d = \frac{\mu_y - \mu_{\hat{y}}}{s_{pooled}}, \quad s_{pooled} = \sqrt{\frac{\sigma_y^2 + \sigma_{\hat{y}}^2}{2}}
Cohen’s suggestion that effect sizes can be categorized as small, medium or large is associated with the amount of overlap of the two involved variables.
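As an illustration, the four agreement/effect-size metrics discussed above (Bf, Af, the standardized mean difference and Cohen's d) can be sketched in a few lines of Python. This is a hedged sketch, not code from this study: the function names are illustrative, and the base-10 logarithm convention for Bf/Af follows their common definitions in predictive microbiology.

```python
import numpy as np

def bias_factor(pred, obs):
    # Bf = 10^(mean of log10(pred/obs)); values > 1 indicate overestimation
    pred, obs = np.asarray(pred), np.asarray(obs)
    return 10 ** np.mean(np.log10(pred / obs))

def accuracy_factor(pred, obs):
    # Af = 10^(mean of |log10(pred/obs)|); Af = 1 means perfect agreement
    pred, obs = np.asarray(pred), np.asarray(obs)
    return 10 ** np.mean(np.abs(np.log10(pred / obs)))

def standardized_mean_difference(obs, pred):
    # theta = (mean of desired - mean of predicted) / sd of desired
    return (np.mean(obs) - np.mean(pred)) / np.std(obs, ddof=1)

def cohens_d(obs, pred):
    # difference of the two means divided by the pooled standard deviation
    s_pooled = np.sqrt((np.var(obs, ddof=1) + np.var(pred, ddof=1)) / 2.0)
    return (np.mean(obs) - np.mean(pred)) / s_pooled
```

With identical predicted and observed vectors, all four functions return their "perfect agreement" values (1, 1, 0 and 0, respectively), matching the interpretations given in the text.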
Finally, three additional chemometric indicators, which are used extensively in spectroscopic applications, namely the residual prediction deviation (RPD), the range error ratio (RER) and the ratio of performance to interquartile distance (RPIQ), were utilized in this paper. The RPD is calculated as the ratio of the standard deviation of the desired variable to the RMSE. Higher RPD values suggest increasingly accurate models, and a model with an RPD value above three is generally considered excellent in terms of reliability [42]. Although the RPD is considered an important criterion, it assumes a normal distribution of the desired/observed values, since it includes the standard deviation. An alternative metric is the ratio of performance to interquartile distance (RPIQ), defined as the interquartile range of the observed values divided by the RMSE. RPIQ is based on quartiles, which better represent the spread of the dataset. The quartiles are milestones in the dataset range: Q1 is the value below which we find 25% of the samples, while Q3 is the value below which we find 75% of the samples. The RPIQ formula is shown as follows:

\mathrm{RPIQ} = \frac{Q_3 - Q_1}{\mathrm{RMSE}}
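A minimal Python sketch of these chemometric ratios follows, assuming the definitions stated above (RPD as the standard deviation of the observed values over the RMSE, RER as the observed range over the RMSE, and RPIQ as the interquartile range over the RMSE); the function names are illustrative, not from this study.

```python
import numpy as np

def rmse(obs, pred):
    # root mean squared error between observed and predicted values
    obs, pred = np.asarray(obs), np.asarray(pred)
    return np.sqrt(np.mean((obs - pred) ** 2))

def rpd(obs, pred):
    # residual prediction deviation: sd of observed values over RMSE
    return np.std(obs, ddof=1) / rmse(obs, pred)

def rer(obs, pred):
    # range error ratio: range of observed values over RMSE
    return (np.max(obs) - np.min(obs)) / rmse(obs, pred)

def rpiq(obs, pred):
    # ratio of performance to interquartile distance: (Q3 - Q1) over RMSE
    q1, q3 = np.percentile(obs, [25, 75])
    return (q3 - q1) / rmse(obs, pred)
```

Because RPIQ uses quartiles rather than the standard deviation, it remains informative when the observed values are not normally distributed, which is the motivation given above.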
5.1. “Mean” Feature Case Study
All regression models, including CAGFINN, have been implemented utilizing the same input vector, which includes eight input nodes (i.e., the eight “mean” features selected by the proposed fusion scheme). In the proposed neurofuzzy (NF) scheme, 12 clusters were created by the clustering preprocessing stage for the LOOCV case, while 10 clusters were created for the second experiment with the reduced training dataset.
The number of fuzzy rules is the same as the number of MIs and is independent of the number of input variables, thus creating a novel “multidimensional inspired” rule layer, in contrast to the traditional ANFIS architecture [
44]. The hybrid learning algorithm resulted in a fast training process, which concluded in fewer than 500 epochs, much faster than the equivalent time needed to train the MLP neural network. The scatter plot of predicted (via CAGFINN) versus observed total viable counts for the LOOCV case is illustrated in
Figure 10, and shows a very good distribution around the line of equality (y = x), with the vast majority of the data included within the ±0.5 log area. A few samples, however, are located at the borders of the designated ±0.5 log area, while five samples, namely “49A1”, “51A1”, “55A1”, “50A3” and “19A7”, are clearly located outside that zone. It is interesting to investigate the characteristics of these five “outlier” samples. The “49A1”, “51A1” and “55A1” samples were stored at 0 °C and collected after 239 h, 263 h and 383 h of storage, respectively. The “50A3” sample was stored at 4 °C and collected after 263 h of storage, and finally “19A7” corresponds to a beef sample stored at 12 °C and collected after 54 h of storage. Although this set of “outliers” includes samples from different storage temperatures, the performance of the NF model in this plot reveals a rather “nonlinear” trend, as many samples are distributed above and below the line of equality. Practically, this plot illustrates the nonlinear character of the TVC prediction for the “mean” case.
The performance of the NF model, using the chosen wavelengths as input variables, is revealed in more detail through the application of the evaluation statistical metrics presented in
Table 5. The bias and accuracy factors revealed an almost perfect fit, with values of 1.003 and 1.035, respectively. Bf is used to determine whether the model over- or under-predicts the level of bacterial growth. A Bf greater than 1.0 indicates that the model overestimates; equally, a Bf less than 1.0 generally indicates that the model underestimates (i.e., observed responses are larger than predicted values) [44]. Perfect agreement between predictions and observations would lead to a Bf of 1. Regarding the appropriate values of Af, a value of one indicates perfect agreement between all the predicted and measured values. All remaining evaluation results reveal an excellent regression performance for the proposed NF model in the LOOCV case. The R-squared value has been calculated as 0.982, which is considered an acceptable value for the goodness-of-fit of this particular model. Cohen’s d is almost zero, which reveals a strong overlap between predicted and observed responses. The obtained high RPD shows that this model can definitely be used for such prediction tasks. In parallel, an additional CAGFINN model has been implemented using, however, the PCs from the PCA scheme as input variables. The aim was to compare the concept of the proposed fusion FS scheme against the traditional PCA approach. In this PCA/NF version, 8 clusters were created by the clustering preprocessing stage, for both the LOOCV case and the second experiment with the reduced training dataset. Although this PCA/NF-based model contains fewer input variables than the equivalent FS-based model, its results were slightly inferior. This comparison supported the hypothesis that utilizing selected wavelengths as input variables provides a greater level of understanding of the model.
In addition to the CAGFINN model, WNN, MLP, SVM and PLSR models have also been developed to predict TVC. The concept of using wavelets in neural networks is well known and is summarized by the replacement of the Gaussian function, normally found in RBF networks, with orthonormal scaling functions; however, in the proposed advanced WNN-LCW architecture, the connection weights between the hidden layer neurons and the output neurons are replaced by a local linear model, similar to the output TSK layer appearing in the ANFIS neurofuzzy system [
38]. From this point of view, there is a level of similarity between the CAGFINN and WNN-LCW models, and the results also reveal the importance of utilizing such a “defuzzification” layer in learning-based models. In the proposed WNN-LCW, 20 Morlet wavelet functions were used, and the evaluation performance of the developed WNN was comparable with that of the PCA/NF model. For the SVM case, the criterion for building the optimal model was defined by the evaluation of the SEP index. The penalty coefficient C was searched within [1, 1000] and the kernel parameter γ within [0.05, 1] for training the SVM. The epsilon tolerance value was set to 0.001. The penalty coefficient C was finally set at 120, while the γ parameter was set at 0.25. An MLP network has also been implemented, utilizing two hidden layers (with 16 and 8 nodes, respectively). The performances of both the MLP and SVM models, although inferior to those of the previous models, nevertheless reveal robust behaviour, taking into consideration that such models have already been used in the area of food microbiology. Finally, a partial least squares regression (PLSR) scheme was applied to the same dataset. Although the SVM and MLP performances could be characterized as comparable, the obtained PLSR results are worse than those obtained by the other machine learning schemes. It is well known that in the modelling of real processes, linear PLSR faces difficulties in practical applications, since most real problems are inherently nonlinear and dynamic [
45]. In fact, a close inspection of the results in Table 5 reveals that the RPD index is less than 3, indicating that such a model cannot be considered acceptable.
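As an illustration only, the SVM configuration described above could be reproduced along the following lines, assuming scikit-learn's RBF-kernel support vector regressor; the parameter names C, gamma and epsilon are scikit-learn's, mapped here onto the reported search ranges and final settings, and this is not the authors' actual code.

```python
from sklearn.svm import SVR

# Reported settings: the penalty coefficient was searched within [1, 1000]
# and the kernel parameter within [0.05, 1]; the final model used a penalty
# of 120, a kernel parameter of 0.25 and an epsilon tolerance of 0.001.
svr = SVR(kernel="rbf", C=120, gamma=0.25, epsilon=0.001)
```

In practice the search over C and gamma would be driven by the SEP index on held-out predictions, as described above.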
Similarly, all developed models have also been utilized in a second simulation study, in order to assess their ability to be trained with a dataset with a reduced number of samples. In this scenario, 76 samples were used for training purposes, while the remaining 8 samples were reserved for evaluation. The plot of predicted (via CAGFINN) versus observed total viable counts for this second case is illustrated in
Figure 10, while the performance of all the developed models to predict TVCs in beef samples in terms of statistical indices is presented in
Table 6. Results from this table again indicate the superiority of the implemented regression models over PLSR, even though they are inferior to those illustrated in
Table 5. In this case, all Bf indices are less than 1, indicating the underestimating nature of all models. However, a closer comparison of the performances in these two tables reveals the problem of the limited number of training samples. All evaluation metrics are much worse in this second case, and this reflects an open problem in learning-based systems, i.e., the need for training datasets that are as large as possible.
5.2. “SD” Feature Case Study
For this case, the input vector for all models consisted of the six “sd” features chosen by the proposed fusion scheme. In the proposed NF model, 14 clusters were created by the clustering preprocessing stage for the LOOCV case, while 12 clusters were created for the second experiment. The increased number of clusters/rules indirectly indicates the different and rather more complex characteristics of the “sd” feature compared to the “mean” case. In this case, the structure of the WNN-LCW consisted of 16 Morlet wavelet functions, while the MLP retained the same internal structure as in the previous “mean” case.
The plots of predicted (via CAGFINN) versus observed total viable counts for both simulation studies are illustrated in
Figure 11, and show a good distribution around the line of equality (y = x), with the vast majority of the data included within the ±0.5 log area. A few samples, however, are located at the borders of the designated ±0.5 log area, while four samples, namely “36A3”, “4A9”, “54A1” and “26A9”, are clearly located outside that zone. The “54A1” sample was stored at 0 °C and collected after 359 h of storage, while “36A3” corresponds to a sample stored at 4 °C and collected after 120 h of storage. Finally, the “4A9” and “26A9” samples were stored at 16 °C and collected after 12 h and 73 h of storage, respectively. This plot, however, in contrast to the previous “mean” case, reveals a rather “linear” trend, as many samples are located on the line of equality. Such a difference between the trends shown in
Figure 10 and
Figure 11 illustrates the different characteristics embedded in the “mean” and “sd” cases. The performances of all the developed models to predict TVCs in terms of statistical indices for the LOOCV case are presented in
Table 7.
Looking at these results, shown in
Table 7, a difference in modelling the TVC using the “sd” feature is more than evident. Although the NF model achieved a superior performance in all metrics, it is interesting to explore the performance of the WNN model. Both the NF and WNN models share a common defuzzification component, which is responsible for their superior performance in single-output nonlinear regression problems. The R-squared in this case is 0.981 for the NF scheme, compared to 0.654 for the PLSR, revealing the deficiency of the latter model in modelling nonlinear problems.
Similarly, all developed models have also been utilized in a second simulation study, in order to assess their ability to be trained with a dataset with a reduced number of samples. The plot of predicted (via CAGFINN) versus observed total viable counts for this second case is illustrated in
Figure 11, while the performance of all the developed models to predict TVCs in beef samples in terms of statistical indices is presented in
Table 8.
Results in this case again verify the superiority of the proposed models over the PLSR scheme. The obtained RPD and R-squared results for the PLSR model show how inferior such a model is, even though it is currently used extensively in the area of food microbiology. On the other hand, the proposed CAGFINN clearly outperforms both the MLP and SVM schemes, which are very popular in regression analysis, although their overall performance can be considered acceptable for modelling a nonlinear process. In fact, the RPD values for MLP and SVM in this second experiment were above 3, while the equivalent R-squared metric is also acceptable for both models.
5.3. Ensemble System
In general, ensembling is a technique of combining two or more algorithms/models of similar or dissimilar types, called base learners. The main idea is to build a more robust system that incorporates the predictions from all the base learners [
46]. A single algorithm/model may not be able to provide the perfect prediction for a given dataset. Each machine learning algorithm has its own limitations, and thus producing a model with high accuracy is challenging. However, a simple way to implement such a combination of models is by aggregating their outputs. Such aggregation can be realized using different techniques, sometimes referred to as meta-algorithms.
Although the “mean” feature had been considered as the main information provider for the case of multispectral imaging analysis [
21], in our specific research case study, both “mean” and “sd” individual FOS features have been analyzed separately. For each case, an advanced NF model has been employed to predict the required TVC in beef samples.
Even though a satisfactory TVC prediction was achieved for each case, the question is how these individual TVC predictions can be combined to obtain an even better result, if possible. The first “combination” approach is the well-known averaging scheme, defined by taking the average of the models’ predictions in the case of a regression problem. This approach was applied to CAGFINN’s predictions for both the “mean” and “sd” cases, and the results are summarized in
Table 9 and
Table 10. Obviously, an improvement in the overall TVC prediction has been achieved, compared to the individual “mean” TVC prediction. For example, the RPD, SEP and R-squared values reveal a clear robustness of the final averaging model for TVC prediction, while Bf is almost perfect.
However, the main aim in this section is the application of a stacking ensemble scheme to this specific problem. Stacking is another ensemble scheme which involves training of a new learning algorithm/model to combine the predictions of several base learners. First, the base learners are trained using the available training data and then a combiner or meta-algorithm/model, called the super learner, is trained to make a final prediction based on the predictions of the base learners [
47]. In contrast to bagging and boosting ensemble schemes, stacking is designed to “combine” a diverse group of strong learners. In our case, the two CAGFINN models used for TVC prediction (the “mean” and “sd” cases) served as base learners. The metamodel has been trained using the TVC (training) predictions made by the base models, rather than the data used to train the base models. These predictions, along with the desired TVC outputs, provide the input and output pairs of the training dataset used to train the metamodel.
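The stacking step described above can be sketched as follows. This is a simplified stand-in, not the study's implementation: an ordinary least-squares metamodel (rather than the PLSR/NIPALS metamodels used here) combines two hypothetical base-model prediction vectors, pred_mean and pred_sd, against the desired TVC outputs.

```python
import numpy as np

def fit_linear_metamodel(pred_mean, pred_sd, y_true):
    # Stack the two base models' predictions column-wise (with an intercept)
    # and fit the metamodel by least squares on the desired outputs.
    X = np.column_stack([np.ones_like(pred_mean), pred_mean, pred_sd])
    coef, *_ = np.linalg.lstsq(X, y_true, rcond=None)
    return coef

def predict_metamodel(coef, pred_mean, pred_sd):
    # Final TVC prediction from the base models' predictions.
    X = np.column_stack([np.ones_like(pred_mean), pred_mean, pred_sd])
    return X @ coef
```

Note that the simple averaging scheme of the previous paragraphs is the special case with a zero intercept and both weights fixed at 0.5; stacking instead lets the data determine how much each base learner contributes.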
In this paper, two algorithms were used as potential “metamodels”: the PLSR and a nonlinear regression using the nonlinear iterative partial least squares (NIPALS) algorithm. All ensemble schemes were developed using the XLSTAT v.2019.2 software. For the case of the nonlinear “metamodel”, a fourth-order model was constructed.
Related results are shown in
Table 9 and
Table 10. The scatter plot for all ensemble schemes in the LOOCV case is illustrated in
Figure 12, and shows a very good distribution around the line of equality (y = x), with the vast majority of the data included within the ±0.5 log area. The performance of the proposed nonlinear metamodel is superior to that obtained via the PLSR metamodel. Evaluation metrics for both the PLSR and nonlinear regression metamodels, shown in
Table 9 and
Table 10, reveal a clear advantage over the standard averaging combination. Looking at
Figure 12, the majority of cases in the nonlinear scheme are closer to the line of equality, while a few cases from the averaging and PLSR schemes are close to the border lines or even outside. Similarly, for the reduced dataset simulation case, the equivalent scatter plot is shown in
Figure 13. The results in this case also verify the validity of the hypothesis of using a metamodel scheme to predict TVC in beef samples.
Although only two base models have been used in this specific case study, and thus the associated complexity is not too high, it was interesting to investigate the performance of a metamodel that develops a nonlinear mapping from the predictions of the bottom-layer models to the final outcome.