1. Introduction
Maritime transportation is responsible for approximately
of global freight movement, and has a significant impact on the environment and the world economy. Several technologies and strategies have been developed to maintain ship machinery and marine engines, with the support of first-principle digital twins [1]. Digital twins are expected to be crucial for real-time monitoring, predictive maintenance, and optimal marine machinery performance, enhancing safety and, more importantly, assessing the health of ship machinery systems. In-cylinder pressure is a key parameter that conveys information for characterising marine engine operation; hence, it has been extensively employed for fault diagnosis. Currently, digital twins are based on thermodynamic and fluid dynamic principles and the pertinent conservation laws, and they have been effective in predicting engine performance and emission parameters. Digital twins are also common in several industrial sectors, including the automotive, power, and energy sectors. However, they are computationally expensive, which renders their implementation on ships challenging [2]. As a result, the applications of traditional first-principle-based digital twins in the shipping sector are limited.
Future digital twins will be required to predict several engine performance and emission parameters (in-cylinder pressure amongst them); however, the available computational power (especially onboard ships) is expected to be insufficient to facilitate the use of physical models [3]. Data-driven models are developed using datasets to capture the mathematical relations between the input and output parameters. They require less computational power than physical (first-principle) digital twins and can therefore be used for edge computing in ship applications. Advances in machine learning and artificial intelligence have yielded several regression and classification techniques that are effective for developing data-driven models.
Regression techniques model the relationship between one or more independent variables (predictors) and a dependent variable (target) in order to predict or estimate the value of the target from the values of the predictors. Commonly used regression techniques include linear regression, polynomial regression, support vector regression (SVR), decision trees, elastic regression, and artificial neural networks (ANNs). Several applications of these techniques are reported in the pertinent literature.
The performance and emission parameters of several engine types were predicted using linear regression [4] and SVR [5]. A review of data-driven models for ship performance was conducted by Alexiou et al. [6]. Random forest regression was used to predict the combustion profile parameters in [7], whereas decision trees proved effective for energy demand modelling [8]. Applications of ANNs in internal combustion engines were reported in [9,10,11]. ANNs and non-linear autoregressive networks with exogenous inputs (NARX-ANN) proved effective for the prediction of marine diesel engines’ performance parameters in [12,13]. However, previous publications mostly focused on other engine performance parameters (the time variation of their cycle-mean values) and not on in-cycle variations. Hence, the capability of these techniques for predicting high-resolution instantaneous signals for crucial parameters, such as in-cylinder pressure, needs to be investigated.
Johnsson [14] studied several networks based on complex radial basis functions (RBF) for estimating the in-cylinder pressure profiles of a six-cylinder ethanol-fuelled engine. Saraswati and Chand [15] tested recurrent neural networks (RNNs) to estimate the in-cylinder pressure for one engine operating point. Solmaz et al. [16] demonstrated that the ANN approach is more effective than fuzzy logic for predicting the in-cylinder pressure and the indicated mean effective pressure. However, these techniques are complex, cover limited operating points, and require additional datasets, which renders them less practical for implementation in marine engines. Although typical regression techniques (mentioned in the preceding paragraphs) exhibit the potential to estimate the in-cylinder pressure following appropriate customisation, challenges pertinent to their structure, scalability, and complexity must be addressed.
This study aims to comparatively assess data-driven models based on typical machine learning regression techniques, specifically linear, elastic, and polynomial regression, support vector regression (SVR), decision trees (DT), and artificial neural networks (ANN), for predicting the in-cylinder pressure of a marine four-stroke engine. The complete operating envelope and healthy conditions of this engine are considered. Initially, a feature engineering analysis is carried out to determine the data-driven models’ input and output requirements using two approaches, of which one is selected. Subsequently, the data-driven models based on the six regression techniques are developed, trained, and tested. The root mean square errors of the predicted pressure signals for nine cylinders, along with the mean effective pressure and maximum pressure for each cylinder, are compared to identify the most effective regression technique. Finally, a sensitivity study is performed to recommend values for the testing datasets’ ratio and the harmonics number. The required datasets are generated using a thermodynamic digital twin of the investigated marine four-stroke engine, which was validated against parameters measured in shop trials and experimentally acquired in-cylinder pressure at five operating points. Ensemble techniques (including AdaBoost and random forest) are not considered herein, as this study focuses only on regression techniques.
The novelty of this study stems from addressing the preceding challenges using explainable data-driven models. The most effective techniques are identified, and recommendations for training are provided. Insights for the development and use of the most accurate and least computationally expensive data-driven models for in-cylinder pressure prediction are also generated.
The contributions of this study are (a) the comparative assessment of two approaches (prediction of the in-cylinder pressure by regression; prediction of the harmonics coefficients by regression followed by in-cylinder pressure reconstruction using a Fourier series function); (b) the comparative analysis of the data-driven models based on six regression techniques, considering their accuracy, characterised by the root mean square error (RMSE), in estimating the in-cylinder pressure and other performance parameters; (c) the benchmarking of the data-driven models based on ANN regression (to predict the harmonics coefficients) and in-cylinder pressure reconstruction, considering the test-to-train datasets’ ratio and the harmonics number.
3. Results and Discussion
Figure 3 presents the predicted and reference in-cylinder pressure diagrams for each engine cylinder for one engine cycle (crank angle from −360 °CA to 360 °CA) using the two approaches described in Phase 2 (Section 2.2). The dotted lines represent the reference cylinder pressure (predictions of the thermodynamic digital twin), whereas the green and red lines represent the in-cylinder pressure predicted by the first and second approach, respectively. It is inferred that the second approach provided in-cylinder pressure closer to the reference one (exhibiting a percentage error in the in-cylinder pressure prediction within
).
In contrast, the first approach completely fails to predict the in-cylinder pressure during the open cycle. It provides adequate predictions for the closed cycle, albeit with higher errors than the second approach. It is inferred that the first approach involves considerable errors and does not satisfy the requirements of tools for marine engine health assessment. Therefore, the second approach, which involves the prediction of 101 Fourier coefficients per cylinder and the subsequent reconstruction of the in-cylinder pressure using Equation (4), is selected for comparatively assessing the six regression techniques.
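The reconstruction step of the second approach can be sketched as follows. This is a minimal illustration, assuming that Equation (4) has the standard real Fourier series form over one 720 °CA cycle, so that 101 coefficients per cylinder correspond to one mean term plus 50 cosine and 50 sine terms. The function name `reconstruct_pressure`, the coefficient ordering, and the numerical values are hypothetical and are not taken from the study.

```python
import math

def reconstruct_pressure(coeffs, crank_angles_deg, period_deg=720.0):
    """Rebuild a pressure trace from real Fourier coefficients.

    coeffs is assumed to be ordered [a0, a1..aN, b1..bN] (hypothetical layout),
    so 101 coefficients correspond to N = 50 harmonics.
    """
    n_harm = (len(coeffs) - 1) // 2
    a0 = coeffs[0]
    a = coeffs[1:1 + n_harm]                    # cosine coefficients
    b = coeffs[1 + n_harm:1 + 2 * n_harm]       # sine coefficients
    pressure = []
    for theta in crank_angles_deg:
        phase = 2.0 * math.pi * theta / period_deg
        p = a0
        for k in range(1, n_harm + 1):
            p += a[k - 1] * math.cos(k * phase) + b[k - 1] * math.sin(k * phase)
        pressure.append(p)
    return pressure

# Usage: one cycle from -360 to 360 °CA in 1° steps, with made-up coefficients.
angles = [float(t) for t in range(-360, 361)]
coeffs = [20.0] + [0.0] * 100   # mean term only (illustrative value, in bar)
coeffs[1] = 5.0                 # first cosine harmonic (illustrative value)
trace = reconstruct_pressure(coeffs, angles)
```

In this layout a regression model would output the 101-element vector `coeffs` for each cylinder, and the pressure trace would be obtained afterwards at any desired crank angle resolution.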
The data-driven models corresponding to the second approach and the six regression techniques are evaluated by using the test datasets (
of the total generated datasets), which were separated as reported in Section 2.1.2. The number of test datasets in each area of the engine operating envelope is illustrated in Figure 2b.
The derived RMSEs (in bar, calculated according to Equation (13) over the 750 test samples) for the data-driven models corresponding to the six employed regression techniques, which were trained using training datasets derived with three different γ values (0.9, 0.95, and 0.995), are presented in Figure 4. The number of training samples for each γ value is listed in Table 5. It is deduced that the RMSE (for all regression techniques) increases with higher γ values, which correspond to a smaller number of training datasets. Therefore, a higher number of training datasets increases the data-driven model accuracy (lower RMSE). However, the RMSE changes for LR, PR, and ANN are not pronounced, and the RMSE values for γ equal to 0.9 and 0.95 are almost the same. This implies that the Fourier series coefficients (second approach) can be effectively mapped using these regression techniques with a relatively low number of training datasets. In contrast, the SVR and DT regression techniques are sensitive to γ and require higher dataset numbers. DT regression trained with the lowest number of datasets exhibited low accuracy (high RMSE) on the test datasets. From the preceding discussion and presented results, it is inferred that ANN regression is the most effective technique, as it resulted in an RMSE of less than 0.6 bar when trained using 20 datasets.
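The relation between γ, the training-set size, and the error metric can be illustrated with a short sketch. It assumes that γ is the fraction held out from the pool that remains after setting aside the 750 test samples from the 4750 generated datasets; this is an inference that reproduces the 20-sample case for γ = 0.995, and the other sizes it yields should be checked against Table 5. It also assumes that Equation (13) is the usual root mean square error. The function names are hypothetical.

```python
import math

def training_set_size(pool_size, gamma):
    """Number of training samples kept when a fraction gamma of the pool is held out."""
    return round(pool_size * (1.0 - gamma))

def rmse(predicted, reference):
    """Root mean square error between two equally long signals (same unit as the inputs)."""
    if len(predicted) != len(reference):
        raise ValueError("signals must have the same length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Assumed training pool: generated datasets minus the test datasets.
pool = 4750 - 750
sizes = {g: training_set_size(pool, g) for g in (0.9, 0.95, 0.995)}

# Illustrative RMSE between a predicted and a reference trace (made-up values, in bar).
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```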
Figure 5 presents the average error of the predicted in-cylinder pressure (considering all the engine cylinders) by the data-driven models for the test data (750 datasets); these data-driven models correspond to the six regression techniques trained using three different training datasets (one for each γ value). The range of the error distributions (whiskers) increases with γ, aligning with the exhibited RMSE trends (Figure 4). The ANN regression exhibits an average error within
bar even when trained with the lowest dataset number. The DT regression performs adequately when trained using high numbers of training datasets. The ANN and PR regression techniques also exhibit the fewest outlier error points when trained with only 20 samples. Therefore, it is deduced that the ANN and PR regression techniques exhibit the highest potential among the considered regression techniques.
The predicted in-cylinder pressure diagrams (using the six regression techniques) are also used to derive other engine performance parameters, including the mean effective pressure (MEP), characterising the engine power output, and the in-cylinder maximum pressure, characterising the engine thermo-mechanical limits. The accurate estimation of these parameters is important for marine engine health assessment.
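Deriving these two parameters from a pressure trace can be sketched as below. The sketch assumes that the MEP is the indicated value, obtained as the net cycle work (the integral of p dV over the full four-stroke cycle) divided by the displaced volume, with the cylinder volume given by slider-crank kinematics. The geometry values and the stepwise pressure trace are hypothetical and serve only to make the example checkable; they do not describe the investigated engine.

```python
import math

def cylinder_volume(theta_deg, bore, stroke, conrod, compression_ratio):
    """Cylinder volume (m^3) from slider-crank kinematics; 0 °CA is firing TDC."""
    area = math.pi * bore ** 2 / 4.0
    displaced = area * stroke
    clearance = displaced / (compression_ratio - 1.0)
    r = stroke / 2.0
    th = math.radians(theta_deg)
    # Piston displacement measured from TDC
    x = r * (1.0 - math.cos(th)) + conrod - math.sqrt(conrod ** 2 - (r * math.sin(th)) ** 2)
    return clearance + area * x

def mep_and_pmax(pressure_bar, angles_deg, bore, stroke, conrod, compression_ratio):
    """Indicated MEP (bar) by trapezoidal integration of p dV, plus the peak pressure (bar)."""
    vol = [cylinder_volume(t, bore, stroke, conrod, compression_ratio) for t in angles_deg]
    work = 0.0  # bar * m^3
    for i in range(len(vol) - 1):
        work += 0.5 * (pressure_bar[i] + pressure_bar[i + 1]) * (vol[i + 1] - vol[i])
    displaced = math.pi * bore ** 2 / 4.0 * stroke
    return work / displaced, max(pressure_bar)

# Usage with an idealised trace: 10 bar during the expansion stroke, 1 bar elsewhere.
angles = list(range(-360, 361))
pressure = [10.0 if 0 < t < 180 else 1.0 for t in angles]
mep, p_max = mep_and_pmax(pressure, angles, bore=0.32, stroke=0.40,
                          conrod=0.80, compression_ratio=15.0)
```

For this idealised trace the net work is close to nine times the displaced volume, so the computed MEP is approximately 9 bar; with a predicted pressure trace the same two functions would return the quantities whose errors are compared for the regression techniques.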
Figure 6 presents the average MEP error (from all the engine cylinders) considering the six regression techniques and the three γ values. The average MEP error remains almost the same for the linear regression techniques (LR, ER, and PR) despite using different training datasets, which aligns with the respective RMSE trends of the in-cylinder pressure diagram prediction (discussed in the preceding paragraphs). The SVR and DT techniques resulted in the highest MEP error variations. The ANN resulted in the lowest MEP errors, which gradually increased with the use of smaller training datasets. Nevertheless, the ANN’s performance (considering the error tolerance and outliers) was found to be the best among all of the employed regression techniques. Positive offsets (mean of MEP errors) are observed for all the regression techniques, implying MEP overestimation, which requires correction measures (e.g., the use of a negative bias).
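One possible form of such a correction is a constant offset estimated on held-out data and subtracted from subsequent predictions. The sketch below is only an illustration of this idea; the function names and the numerical values are hypothetical, and the study does not specify how the bias would be determined.

```python
def fit_bias(predicted, reference):
    """Mean signed error (predicted minus reference) on a held-out set."""
    return sum(p - r for p, r in zip(predicted, reference)) / len(predicted)

def apply_bias(predicted, bias):
    """Subtract the estimated offset, i.e. apply a negative bias when bias > 0."""
    return [p - bias for p in predicted]

# Made-up MEP values in bar, for illustration only.
mep_pred = [20.3, 18.1, 15.2]
mep_ref = [20.0, 18.0, 15.0]
bias = fit_bias(mep_pred, mep_ref)          # positive value indicates overestimation
corrected = apply_bias(mep_pred, bias)
residual = fit_bias(corrected, mep_ref)     # mean error after the correction
```

A constant offset removes the mean error only; it leaves the spread of the error distribution unchanged.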
Figure 7 presents the maximum in-cylinder pressure error (from all the engine cylinders) considering the six regression techniques and the three γ values. When using the highest training dataset number (γ = 0.9), the ANN and DT regression techniques provided the lowest error distributions, whereas the respective mean errors exhibited slightly negative offsets. However, when using the lowest training dataset number (γ = 0.995), the DT regression technique led to considerable errors, whereas the PR and ANN regression techniques resulted in errors ranging
bar. The PR regression led to mean errors with a negative offset, the magnitude of which increased with γ.
Figure 8 presents the absolute error of the in-cylinder pressure predicted over the whole engine operating envelope by the data-driven models developed with the six regression techniques (and the second approach) and trained using 20 datasets (γ = 0.995). These results are employed to characterise the ability of the employed regression techniques to predict the in-cylinder pressure throughout the investigated marine engine operating envelope. The data-driven models based on the LR, ER, PR, and ANN regression techniques exhibited an absolute error in predicting the in-cylinder pressure of less than 1 bar over the whole engine operating envelope. Data-driven models using the SVR and DT regression techniques exhibited higher absolute errors, reaching 4.8 and 2.4 bar, respectively (Figure 8c,e). The ANN regression exhibited the smallest absolute error among the techniques over the whole operating envelope.
From the preceding discussion, it was confirmed that the ANN regression exhibited the lowest errors in all the considered metrics over the whole engine operating envelope, whilst requiring the smallest number of training datasets. Hence, the use of ANN regression is recommended for developing data-driven models to predict the marine engine in-cylinder pressure.
Figure 9 presents the results of the performed sensitivity study considering the data-driven model of the second approach and the ANN regression; the
on the predicted in-cylinder pressure is plotted as a function of the harmonics number (corresponding to
Fourier coefficients for each cylinder) and the test-to-train ratio (γ, corresponding to the number of training datasets used). It is deduced that the harmonics number greatly affects
, which characterises the data-driven model error. For more than 45 harmonic orders, high
values are observed, indicating sufficient accuracy. The test-to-train ratio only slightly affected
. It is deduced that the proposed data-driven model using ANN regression, trained with only a randomly selected 25% of the 4750 datasets (each corresponding to an engine operating point), can achieve sufficient accuracy. This reduces the computational effort required for generating datasets with the computationally expensive physical digital twin. The developed data-driven model can provide sufficient accuracy in the in-cylinder pressure prediction for healthy engine conditions, whilst substantially reducing the required computational effort.
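The influence of the harmonics number on the reconstruction error can be illustrated independently of any regression model by truncating the Fourier series of a known signal. The sketch below assumes the standard real Fourier series, in which N harmonics yield 2N + 1 coefficients (101 for N = 50), and uses a synthetic, pressure-like trace with a narrow peak near top dead centre. The signal shape, the function names, and the selected harmonics numbers are hypothetical; the harmonics number at which the error becomes negligible depends on the actual pressure trace.

```python
import math

def fourier_coefficients(signal, n_harm):
    """Real Fourier coefficients (a0, a[1..N], b[1..N]) of one uniformly sampled period."""
    m = len(signal)
    a0 = sum(signal) / m
    a, b = [], []
    for k in range(1, n_harm + 1):
        a.append(2.0 / m * sum(s * math.cos(2.0 * math.pi * k * i / m)
                               for i, s in enumerate(signal)))
        b.append(2.0 / m * sum(s * math.sin(2.0 * math.pi * k * i / m)
                               for i, s in enumerate(signal)))
    return a0, a, b

def truncation_rmse(signal, n_harm):
    """RMSE between a signal and its reconstruction from the first n_harm harmonics."""
    m = len(signal)
    a0, a, b = fourier_coefficients(signal, n_harm)
    err2 = 0.0
    for i, s in enumerate(signal):
        rec = a0 + sum(a[k - 1] * math.cos(2.0 * math.pi * k * i / m)
                       + b[k - 1] * math.sin(2.0 * math.pi * k * i / m)
                       for k in range(1, n_harm + 1))
        err2 += (rec - s) ** 2
    return math.sqrt(err2 / m)

# Synthetic trace (illustrative only): 1 bar baseline plus a narrow peak near TDC,
# sampled at 1 °CA over one 720 °CA cycle.
signal = [1.0 + 150.0 * math.exp(-((i - 360) / 20.0) ** 2) for i in range(720)]
errors = {n: truncation_rmse(signal, n) for n in (5, 15, 45)}
```

For this synthetic trace the truncation error drops by several orders of magnitude between 5 and 45 harmonics, which is consistent in trend with the observed sensitivity to the harmonics number; it sets a lower bound on the error that any regression model predicting the coefficients can achieve.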