Article

Use of Synthetic Data in Maritime Applications for the Problem of Steam Turbine Exergy Analysis

1 Faculty of Engineering, University of Rijeka, Vukovarska 58, 51000 Rijeka, Croatia
2 Department of Maritime Sciences, University of Zadar, Mihovila Pavlinovića 1, 23000 Zadar, Croatia
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
J. Mar. Sci. Eng. 2023, 11(8), 1595; https://doi.org/10.3390/jmse11081595
Submission received: 30 June 2023 / Revised: 6 August 2023 / Accepted: 14 August 2023 / Published: 15 August 2023
(This article belongs to the Special Issue Advances in Marine Propulsion II)

Abstract: Machine learning (ML) applications have demonstrated the potential to generate precise models in a wide variety of fields, including marine applications. Still, the main issue with ML-based methods is the need for large amounts of data, which may be impractical to come by. To assure the quality of the models and their robustness to different inputs, synthetic data may be generated using other ML-based methods, such as the Triplet Encoded Variable Autoencoder (TVAE), copulas, or a Conditional Tabular Generative Adversarial Network (CTGAN). With this approach, models can be trained on the synthetic dataset using ML methods such as the Multilayer Perceptron (MLP) or Extreme Gradient Boosting (XGB) to improve general performance. The methods are applied to a dataset containing mass flow, temperature, and pressure measurements at seven points of a marine steam turbine as inputs, along with the exergy efficiency ($\eta$) and exergy destruction ($Ex$) of the whole turbine (WT), low-pressure cylinder (LPC), and high-pressure cylinder (HPC) as outputs. The achieved results show that models trained on synthetic data achieve slightly worse results than the models trained on original data in previous research, but allow for the use of as little as two-thirds of the dataset to achieve these results. Using $R^2$ as the main evaluation metric, the best results achieved are 0.99 for $\eta_{WT}$ using 100 data points and MLP, 0.93 for $\eta_{LPC}$ using 100 data points and an MLP-based model, 0.91 for $\eta_{HPC}$ with the same method, and 0.97 for $Ex_{WT}$, 0.96 for $Ex_{LPC}$, and 0.98 for $Ex_{HPC}$ using an XGB-trained model with 100 data points.

1. Introduction

The number of applications of machine learning (ML) and other artificial intelligence (AI) methods in the maritime domain has been growing in recent years. Many authors, including Aylak (2022) [1], comment on the wide-reaching impacts of such models applied in logistics, but such models are present in maritime engineering as well. Karatug and Arslanoglu (2022) comment on the application of machine learning for condition-based maintenance and fault diagnosis in ship engine systems, demonstrating that high-precision models can be developed for these tasks. Gonca et al. (2022) [2] use artificial neural networks to model the maximum performance characteristics of a seven-process cycle engine. Other applications include the prediction of fuel efficiency for a cargo vessel, where Fam et al. (2022) [3] achieve high precision using kernel probability density functions and artificial neural networks; ship speed prediction by Bassam et al. (2022) [4] using five different ML-based regression algorithms; and ship fatigue damage prediction by Wang (2022) [5], where the authors utilize artificial neural networks. The existing research shows a preference by authors for artificial neural networks as the regression algorithm of choice. While high-precision models generated with such methods can be extremely beneficial to many operations in maritime environments, the underlying issue of data availability is a common problem for researchers and data engineers who wish to apply these types of methods. ML methods require large amounts of data, with the data points necessary for a quality model numbering at least in the hundreds, if not thousands [6]. Collecting this amount of data can present an issue.
Many ship systems may not be equipped for automated data collection by default [7], requiring either a time investment by operators or a financial investment by ship owners, neither of which is looked upon kindly without explicit proof of the performance improvements that the developed models can provide. This commonly results in relatively small “proof-of-concept” datasets containing fewer data points than desired by the data engineers. This issue is common across many environments. A common tactic in image-based ML is the use of deterministic or stochastic image augmentation processes to generate artificial data points for training [8]. Still, large amounts of data collected in maritime and engineering applications do not come in the form of images, but as numerical data tables. Recently, synthetic data generation has become a common tactic for artificially enlarging datasets. This refers to the generation of synthetic data points which exhibit the same statistical parameters as the original data used to generate them, but offer new data. As most ML-based methods are designed to determine models based on exactly these statistical intricacies of the data, high-performance models have been achieved in multiple applications by training on such data. Most examples of synthetic data in maritime applications focus on so-called simulation synthetic data, such as Kastner et al. (2022) [9], who use simulations for container flow data generation, mainly targeting maritime container terminals, and present a conceptual generation model. Bruns et al. (2023) [10] use enhanced CEP rule learning to develop maritime data streams and data generators. Their research primarily focuses on ship activity patterns and is validated on real maritime data streams. Synthetic data has also been applied to ship wake detection, for data fusion from multiple sensors, by Higgins et al.
(2022) [11], who show an improvement in results when real data is bolstered by synthetically generated data. While the wider application of synthetic data in maritime settings focuses on images [12,13], there is a clear lack of application of synthetic data generators which create additional data points based on statistical methods. The current state of research into applying ML for exergy/energy analysis shows that the field is active. Taghavifar and Perera (2023) [14] demonstrate the use of supervised ANNs for data-driven modeling of exergy and energy in marine engines. The authors use fuel type and injection angle to classify different operation modes and achieve a high-quality model, with $R^2$ regression scores above 0.95. Strušnik (2022) [15] shows the application of ML for the tuning of a steam turbine condenser vacuum, demonstrating that process efficiency can increase by over 2% when an ML model is used for system control. Kartal and Ozveren (2022) [16] demonstrate the chemical exergy calculation for torrefied biomass using ML, namely feed-forward ANNs, achieving an $R^2$ score of over 0.79 and a $MAPE$ below 4% on the training set. Arslan et al. (2023) [17] demonstrate the application of a tree-based regressor and Pace for obtaining exergy and efficiency models, specifically mathematical equations, with model errors ranging from below 2% to slightly above 8%. While the existing research shows that the application of ML in exergy analysis is a common research topic, most authors note that the data collection process is complex and that it is hard to obtain the large amount of data necessary for model training, especially in engineering-focused papers. This is a research gap that could be addressed with the use of synthetic data.
If this process is applicable, it could allow a relatively small amount of collected data to be used in part for data generation, which is, in turn, used for model training, and in part for validation, allowing a full validation process to be performed on “unseen” data while still having enough data for training [18]. The main goal of the presented research is to test the applicability of such data generation methods on a dataset describing the efficiency and exergy of the marine steam turbine described below. The dataset was previously used for ML modeling on the original data, allowing for a direct comparison of performance between the synthetic and original data. The authors test dataset generation with three methods: TVAE, copula, and CTGAN. Then, two regression techniques are used for performance testing: MLP and XGB. MLP was selected because it was used in the previous research, allowing for a direct performance comparison. XGB was used because it has shown significant performance in similar tasks, with the benefit of generating comparatively simple models. The research aims to address the following research questions:
  • RQ1—Can synthetic data be used for modeling the steam turbine using XGB and MLP?
  • RQ2—What is the performance of these models compared to a model trained on purely original, collected data?
  • RQ3—How much data needs to be used to generate synthetic data that yields satisfactory results?
  • RQ4—What is the performance impact of using less data for synthetic dataset generation?
To address these questions, the authors provide information on the data used and the methods applied, followed by the presentation of the results. The following sections describe the general methodology of the research, along with the process of data generation and regression modeling. As mentioned, the main idea of the paper is to investigate the possibility of using synthetic data to generate more detailed datasets in marine applications. For this reason, the originally collected dataset is used to develop synthetic data, based on different data point amounts. This data is then used for model training with two methods: XGB and MLP. An overview of the methodological approach used is given in Figure 1.

2. Dataset

The dataset was measured on a main marine steam turbine. The turbine in question is in operation aboard an LNG carrier with a gross tonnage of 100,450. The maximum power of the turbine is 29,420 kW [19]. A schematic overview of the operating points used in the measurement is given in Figure 2.
As shown, the main marine steam turbine consists of two cylinders: the high-pressure cylinder (HPC) and the low-pressure cylinder (LPC). The HPC consists of one Curtis and seven Rateau stages, while the LPC consists of eight Rateau stages. The propulsion system contains two operating steam generators, which deliver the majority of the cumulatively produced steam mass flow rate to the HPC inlet [20]. The HPC has one steam extraction for steam delivery to auxiliary processes. Both steam generators in this plant use a Heavy Fuel Oil (HFO) Heater and a Boil Off Gas (BOG) Heater. After the extraction, the remainder of the steam mass flow rate expands through the HPC. Before the LPC, one additional extraction exists for steam delivery to a high-pressure feed water heating system, consisting of one high-pressure feed water heater and a deaerator [21]. In some operating regimes, part of the steam extracted here (operating point 4) is delivered to the air heaters used for heating at the steam generator’s entrance. The remaining steam mass flow rate (operating point 5) expands through the LPC, which has one steam extraction delivering steam to a low-pressure condensate heating system, consisting of one low-pressure condensate heater and an evaporator. The remaining steam mass flow rate is delivered to the main marine steam condenser for condensation [22]. The analyzed main turbine is designed without steam reheating, meaning it does not have the additional Intermediate Pressure Cylinder (IPC) and steam reheating common on newer variants of marine propulsion steam turbines [20,23]. The three steam extractions are not necessarily all open during main turbine operation: regulation valves close and open them to regulate the extracted steam mass flow rate in each, according to the predefined regulation procedure. Both cylinders of the observed turbine are connected to the main marine gearbox, which drives one or two propellers (P1 and P2).
It should be noted that some additional losses which occur in the plant are neglected, either for simplicity or due to the impossibility of measurement/calculation: steam mass flow leakage through the gland seals of each cylinder [24], heat losses in the pipelines and through the cylinder housing, mechanical losses [25], and similar. These losses have a minor impact on the exergy analysis.
Figure 3 shows the histograms of the data, which allows us to discuss the sparsity and the distribution of the data. Some of the variables, such as P2, P3, P4, P5, M1, M5, and M7, are well distributed across their entire range, which means that they do not present a concern. For other variables, there are areas of the range that are lacking. Examples are M3 and P7, which have a good amount of data across most of the range, with some missing values. Other variables, such as the temperature data, P1, M2, M4, and M6, have histograms showing that there is no data in certain, larger parts of their ranges. This is caused by the operating regimes of the turbine, as the measurements were performed on a working turbine. As these gaps cannot easily be filled through additional measurements, it has to be noted that model performance may be influenced for the variables shown to be scarcely distributed in the dataset.

3. Physical Model of the Exergy Destruction and Exergy Efficiency

The dataset itself consists of the values of mass flow rate, temperature, and pressure at the seven points marked in Figure 2. These values allow the calculation of exergy efficiency and exergy destruction for the LPC, HPC, and the whole turbine (WT), according to the overall steady-state exergy balance equation [26]:
$$\dot{Q}_{EX} + P_{INLET} + \dot{Ex}_{INLET} = P_{OUTLET} + \dot{Ex}_{OUTLET} + \dot{Ex}_{DES}.$$
In the above, $P$ represents the mechanical power (used/produced) and $\dot{Ex}_{DES}$ represents the exergy loss.
The values of mass flow rate ($\dot{m}$), pressure ($p$), and temperature ($t$) allow for the calculation of the exergy $\dot{Ex}$ and enthalpy $h$ at each of the points. These are intermediate values and are not included in the dataset. The values of the exergy loss and exergy efficiency can then be determined [27]. These equations are given below.

3.1. HPC

Developed mechanical power can be defined with:
$$P_{HPC} = \dot{m}_1 \cdot (h_1 - h_2) + (\dot{m}_1 - \dot{m}_2) \cdot (h_2 - h_3),$$
exergy destruction can then be defined via:
$$\dot{Ex}_{DES,HPC} = \dot{Ex}_1 - \dot{Ex}_2 - \dot{Ex}_3 - P_{HPC},$$
with the exergy efficiency being calculated as:
$$\eta_{EX,HPC} = P_{HPC} / (\dot{Ex}_1 - \dot{Ex}_2 - \dot{Ex}_3).$$

3.2. LPC

For the developed mechanical power, expressed as:
$$P_{LPC} = \dot{m}_5 \cdot (h_5 - h_6) + (\dot{m}_5 - \dot{m}_6) \cdot (h_6 - h_7),$$
exergy destruction can be calculated with the expression:
$$\dot{Ex}_{DES,LPC} = \dot{Ex}_5 - \dot{Ex}_6 - \dot{Ex}_7 - P_{LPC},$$
and the efficiency with:
$$\eta_{EX,LPC} = P_{LPC} / (\dot{Ex}_5 - \dot{Ex}_6 - \dot{Ex}_7).$$

3.3. WT

The whole turbine mechanical power can be calculated as a sum of LPC and HPC powers:
$$P_{WT} = P_{HPC} + P_{LPC}.$$
Exergy destruction of the WT is then expressed as:
$$\dot{Ex}_{DES,WT} = \dot{Ex}_1 - \dot{Ex}_2 - \dot{Ex}_4 - \dot{Ex}_6 - \dot{Ex}_7 - P_{WT},$$
and the exergy efficiency as:
$$\eta_{EX,WT} = P_{WT} / (\dot{Ex}_1 - \dot{Ex}_2 - \dot{Ex}_4 - \dot{Ex}_6 - \dot{Ex}_7).$$
The above values (exergy destruction and exergy efficiency) are used as the outputs of the dataset for further regression modeling. As there are six outputs and the methods used can only regress a single value at a time, six separate models are created and evaluated, one for each output.

4. Modelling the Turbine Using ML-Based Algorithms

4.1. Synthetic Data Generation

Three different methods are utilized on the described dataset to generate statistical synthetic data. As the total dataset has 150 data points, four subset sizes were used for generator training: 100, 50, 25, and 10 randomly selected data points. Then, the three methods are each used to generate a total of 1000 data points, with each resulting dataset being evaluated separately. The best-performing datasets are then selected for modeling.
To leverage machine learning, a generative adversarial network (GAN) is trained to perform the data generation process for each method described below. A GAN consists of two networks: a generator and a discriminator. The discriminator is trained on real data and randomly generated data, and its goal is to determine whether a given data point is real or generated [28]. This is performed by using a data point $X_i$ as an input and processing it through a network consisting of multiple convolutional layers. These layers perform convolution between the output of the previous layer and the random filter values $F$, returning a predicted value $\hat{Y}_{X_i}$, which is the predicted class (1 for real data, and 0 for random, “fake” data). The error, called the loss, is then calculated as [29]:
$$L_F(X_i) = |\hat{Y}_{X_i} - Y_{X_i}|.$$
Based on this error, the values contained in $F$ are adjusted according to the gradient, $F = F - \frac{\partial L_F(X_i)}{\partial F}$. This means that a larger error results in a larger adjustment of the parameters. This training is repeated on multiple data points until the error of the discriminator is lowered. When the discriminator is trained to a point where it can differentiate data reasonably well, generator training starts. The generator network is trained to take random noise as an input and provide data in the shape of real data as an output. The output of this process is then fed into the discriminator, which classifies it as “real” or “fake” data. This result is then used to adjust the internal parameters $F$ of the generator until it generates data that the discriminator judges to be real.
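The gradient-based parameter update described above can be illustrated with a minimal, hypothetical sketch. This is not the paper's GAN: a single scalar parameter F stands in for the filter values, the model F*x stands in for the network, and the learning rate of 0.1 is an assumed illustration.

```python
# Toy illustration of the update F <- F - lr * dL/dF for an absolute-error
# loss L(F) = |F*x - y|. A larger error produces a larger adjustment.
def loss(F, x, y):
    return abs(F * x - y)

def grad(F, x, y):
    # d|F*x - y|/dF = x * sign(F*x - y)
    e = F * x - y
    return x * (1 if e > 0 else -1 if e < 0 else 0)

def train(F, x, y, lr=0.1, steps=50):
    # Repeat the update until the loss is lowered.
    for _ in range(steps):
        F = F - lr * grad(F, x, y)
    return F
```

Starting from F = 0 with x = 2 and y = 4, the updates drive F toward the value that minimizes the loss.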
The type of data input into this network varies, with three different approaches used in this research: copula, TVAE, and CTGAN. These methods were selected as the synthetic dataset generation methods for multiple reasons. First, they are all available as part of the Synthetic Data Vault library [30]. This allowed the authors to use a set of verified methods and avoid possible errors from implementing the methods manually. The methods were also selected because previous research has shown them to be high-performing on numeric datasets of various types [31,32,33,34]. All of the methods are trained in such a way that they generate data within the ranges of the original dataset and that they generate non-rounded (decimal) values. All models are trained for 500, 1000, and 4000 iterations [35].

4.1.1. Copula

Copulas are statistical tools that allow for the modeling of the dependence structure between variables. For a data vector $X = [X_1, X_2, X_3, \ldots, X_n]$ with the appropriate cumulative distribution functions $F_i$, $i \in [1, n]$, a copula $C$ can be defined. This copula is in essence a multivariate distribution function defined on a hypercube $H$, with the following properties [36]:
  • $H$ is of dimensions $[0, 1]^n$,
  • each of the distribution functions is uniform, $F_i(X_i) = U_i$, with $U_i$ being the corresponding unit uniform random variable.
Then, the copula can be defined as a function that defines the joint cumulative distribution function according to [37]:
$$C(F_1(X_1), F_2(X_2), \ldots, F_n(X_n)) = F(X_1, X_2, \ldots, X_n).$$
In other words, a copula defines the transition from a dataset’s cumulative distribution functions into a uniform variable distribution defined in $n$ dimensions. This function can then be used to generate data fitting the original cumulative distribution functions, by generating random data that satisfies the condition of having a uniform distribution in the hypercube space $H$. Various functions are tested as copula candidates and the best-performing one is selected, with the function parameters based on the original data. In the approach used, the copula function is defined with a GAN which performs the transformation process. The benefit of this is that neural networks can easily be inverted, with the output and input replaced, meaning that no complex evaluation of the inverse copula has to be performed.
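The copula idea can be illustrated with a simplified, two-column Gaussian copula sketch. This is not the GAN-based implementation from the Synthetic Data Vault: the correlation coefficient rho and the empirical inverse CDF (a rank lookup in the sorted column) are assumptions made purely for the sketch.

```python
import math
import random

def norm_cdf(z):
    # Standard normal CDF, mapping a normal draw to a uniform value in (0, 1).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def emp_quantile(sorted_col, u):
    # Empirical inverse CDF: pick the original value at rank u*(n-1),
    # so generated values always stay within the original column's range.
    i = min(int(u * (len(sorted_col) - 1)), len(sorted_col) - 1)
    return sorted_col[i]

def gaussian_copula_sample(col_a, col_b, rho, n, seed=0):
    rng = random.Random(seed)
    sa, sb = sorted(col_a), sorted(col_b)
    out = []
    for _ in range(n):
        # Draw correlated standard normals with correlation rho.
        z1, w = rng.gauss(0, 1), rng.gauss(0, 1)
        z2 = rho * z1 + math.sqrt(1 - rho * rho) * w
        # Map normals to uniforms, then through the empirical inverse CDFs.
        u1, u2 = norm_cdf(z1), norm_cdf(z2)
        out.append((emp_quantile(sa, u1), emp_quantile(sb, u2)))
    return out
```

The dependence between the columns is imposed in the uniform hypercube space, while the marginal distributions are recovered from the original data.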

4.1.2. TVAE

Triplet-based encoding is a process in which the data is modeled in an $n$-dimensional space, where $n$ is the number of variables in the tested dataset. Each data point is defined with an anchor (a real data point), a similar value called a positive instance, and a dissimilar value called a negative instance. The model is then trained with the goal of achieving an encoding that minimizes the distance between anchors and the positive instance points. In TVAE, this model is developed using a GAN, in the same manner as the aforementioned copula. If we define the anchor data point as $A$, the positive instance as $P$, and the negative instance as $N$, then the loss function of the trained GAN can be defined as [38]:
$$L = \max(d(A, P) - d(A, N), 0).$$
The distance function $d$ is the Euclidean distance between two data points. The trained encoder can then also be inverted, as in the case of the copula, and used to generate data, by generating points with a small value of $L$ and then transforming them back into the real data space [39].
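The triplet loss above can be computed directly; a minimal sketch (the specific points used in the examples are illustrative only):

```python
import math

def euclid(a, b):
    # Euclidean distance d between two data points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative):
    # L = max(d(A, P) - d(A, N), 0): the loss is zero whenever the
    # positive instance is already closer to the anchor than the negative.
    return max(euclid(anchor, positive) - euclid(anchor, negative), 0.0)
```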

4.1.3. CTGAN

The final method used is CTGAN, which is a GAN whose architecture has been tuned to work on tabular data. In other words, this GAN uses one-dimensional data points as input/output vectors, training on the two-dimensional dataset [40]. Unlike the copula or TVAE, no data transformation is used and the data from the dataset is fed into the network as-is (assuming the data is given in tabular form, as is the dataset used in this research).

4.1.4. Synthetic Dataset Evaluation

The synthetic data developed needs to have its quality evaluated, which is performed by comparing correlations between two datasets [41].
Correlation evaluation is performed using Pearson’s coefficient of correlation, defined for data columns X i and X j as:
$$P(X_i, X_j) = \frac{\sum_{k=1}^{n} (X_i^{(k)} - \bar{X}_i)(X_j^{(k)} - \bar{X}_j)}{\sqrt{\sum_{k=1}^{n} (X_i^{(k)} - \bar{X}_i)^2} \cdot \sqrt{\sum_{k=1}^{n} (X_j^{(k)} - \bar{X}_j)^2}}.$$
Based on that, a score can be calculated between a pair of real data points ( X i R , X j R ) and synthetic data points ( X i S , X j S ) as [42]:
$$S = 1 - \frac{|P(X_i^S, X_j^S) - P(X_i^R, X_j^R)|}{2}.$$
This value is calculated for each of the column pairs in the real and synthetic datasets, with the final output being the average of this score across all column pairs. The score ranges between 0 and 1, with values closer to 1 indicating a higher-quality synthetic dataset.
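A sketch of this pairwise correlation score, assuming plain Python lists for the columns:

```python
import math

def pearson(x, y):
    # Pearson's coefficient of correlation between two columns.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def pair_score(real_i, real_j, syn_i, syn_j):
    # S = 1 - |P(syn pair) - P(real pair)| / 2; identical correlation
    # structure yields a score of 1.
    return 1 - abs(pearson(syn_i, syn_j) - pearson(real_i, real_j)) / 2
```

In the full evaluation, this score would be averaged over all column pairs of the real and synthetic datasets.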

4.2. Regression

Regression is performed based on the input values collected in the dataset to determine the defined output values, $\dot{Ex}_{DES}$ and $\eta$, for both cylinders and the whole turbine. This is conducted to test whether the synthetically generated data may be used for training similar models in the future. Two methods are used: MLP and XGB. MLP was selected due to its high performance and its previous application in research performed on the same dataset [27]. XGB was selected because it is a so-called explainable model: the models it generates are given in the shape of decision trees and can be further analyzed if necessary, which is not the case for models originating from MLP. It has also shown successful application in previous research in a similar domain [43].
Both methods have so-called hyperparameters, which dictate the shape of the generated models and have a high impact on performance. These values can be tuned via a grid search (GS). This method takes an array of different values for a multitude of tuned hyperparameters and performs training for each combination of the hyperparameters, recording the scores achieved. All hyperparameter values used are given in the subsections below for each of the methods. Finally, both methods are trained on the synthetically generated dataset, with 70% of the data used for training and 30% used for testing. This means that 70% of the data is used to adjust the internal parameters of the models, with 30% of the data used to evaluate performance during training. The models are finally evaluated on real, collected data points.
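The grid-search and 70/30 split procedure can be sketched as follows. The dummy scoring function passed in the usage example below stands in for actual model training and is purely illustrative.

```python
import itertools
import random

def split_70_30(data, seed=0):
    # Shuffle, then use 70% of the data for training and 30% for testing.
    d = data[:]
    random.Random(seed).shuffle(d)
    cut = int(0.7 * len(d))
    return d[:cut], d[cut:]

def grid_search(param_grid, train_and_score):
    # Train once per hyperparameter combination and keep the best score.
    best_params, best_score = None, float("-inf")
    keys = sorted(param_grid)
    for combo in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, combo))
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

In the actual experiments, `train_and_score` would fit an MLP or XGB model with the given hyperparameters and return its validation score.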

4.2.1. Multilayer Perceptron

The multilayer perceptron (MLP) is a feed-forward artificial neural network. It consists of three main parts: an output layer consisting of a single neuron, an input layer consisting of $n$ neurons, where $n$ is equal to the number of dataset variables, and a number of hidden layers. As mentioned, these layers consist of neurons, each of which works as a summator of weighted inputs [44]. Each neuron is densely connected, which means that its inputs consist of each neuron in the previous layer, passed through the activation function to obtain the output $o$ as [45]:
$$o = F\left(\sum_{k=0}^{m} w_k \cdot o_k\right).$$
In the above, $F$ is the activation function, $w_k$ is the weight of the connection between the previous layer’s neuron $k$ and the current neuron, and $o_k$ is the output of neuron $k$. All neuron outputs are calculated according to this equation, except for the input neurons, whose output value is the value of the corresponding dataset variable for the data point currently being used. The MLP is trained in the same manner as the previously mentioned GANs, with the weights $w$ being equivalent to the filter values $F$ and adjusted in the training process based on the loss value. The hyperparameters used for training the MLP within GS are given in Table 1. The upper limit of six hidden layers was selected for two reasons. The first is gradient loss, in which deep neural networks can experience vanishing error gradients due to the gradient being repeatedly propagated through each of the layers [46]. The second reason is computational complexity, as deeper neural networks require longer training times. The latter reason was also the reasoning for not using more than 500 neurons per layer. The hyperparameters of the MLP were selected based on the previous research the authors have performed on the non-synthetic dataset [27].
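A single densely connected neuron of this kind can be sketched as follows (tanh is used as an assumed example activation; the weights and inputs are illustrative):

```python
import math

def neuron_output(weights, inputs, activation=math.tanh):
    # o = F(sum_k w_k * o_k): the weighted sum of the previous layer's
    # outputs, passed through the activation function F.
    return activation(sum(w * o for w, o in zip(weights, inputs)))
```

With the identity activation, `neuron_output([2.0, 3.0], [1.0, 1.0], activation=lambda s: s)` is simply the weighted sum of the inputs.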

4.2.2. XGBoost

XGB is an ensemble method based on creating decision trees. Decision trees form a structure that consists of nodes, and each node contains a decision choosing the direction (left/right) based on the value of a certain variable. For example, one such node could be “if the temperature of the output stream at point 7 is higher than 60”. This decision then leads to other nodes. These trees are constructed by selecting a variable and its value randomly and then testing the loss function of the decision. The constructed decision which causes the largest decrease in the regression error is selected. The process is then repeated until no decision causes a significant change in the prediction value. The final model consists of multiple decision trees, the outputs of which are averaged to determine the output value. This algorithm is commonly referred to as random forest, as it consists of multiple randomly generated decision trees [47]. XGB does not generate new trees randomly, but instead generates new trees with the particular goal of lowering the current loss. Instead of generating a new tree without any information, XGB calculates loss gradients (as was performed in the case of the GAN and MLP) to determine the particular error of the current ensemble model [48]. Then, it generates trees which lower this particular error. This serves to obtain trees that may have poor performance by themselves but create well-performing models when placed together, with each tree addressing different particularities of the dataset. XGB is trained within the GS scheme using the hyperparameter values presented in Table 2. The hyperparameter values in the GS procedure were selected based on previous research, in which similar ranges of hyperparameters provided good results [43].
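The residual-fitting idea behind gradient boosting can be illustrated with a toy, single-feature version using decision stumps. This is a pedagogical sketch, not the XGBoost library: for squared loss, the negative gradient of the current ensemble is simply the residual, so each new tree is fitted to the residuals of the predictions so far. It assumes at least two distinct feature values.

```python
def best_stump(xs, residuals):
    # One-feature stump: find the threshold minimizing the squared error of
    # predicting the left/right means of the residuals.
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = (sum((r - lm) ** 2 for r in left) +
               sum((r - rm) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    return best[1:]  # (threshold, left_value, right_value)

def boost(xs, ys, rounds=10, lr=0.5):
    # Each round fits a stump to the residuals (the negative gradient of
    # squared loss) and adds its scaled prediction to the ensemble.
    pred = [0.0] * len(xs)
    trees = []
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, pred)]
        t, lm, rm = best_stump(xs, resid)
        trees.append((t, lm, rm))
        pred = [p + lr * (lm if x <= t else rm)
                for x, p in zip(xs, pred)]
    return trees, pred
```

Each stump may be a weak model on its own, but the ensemble error shrinks with every round, mirroring the behavior described above.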

4.2.3. Computational Resources

The models are trained using the Z4 HPC Cluster [49], using CPU nodes for MLP and XGB training and a node with GPUs to generate the synthetic data. Each individual model (a single variation of the synthetic data, or a single-target GS for MLP or XGB) is trained on a separate node of the aforementioned type. The real execution times, as recorded by the OpenPBS batching system used on the cluster, are given in Table 3, with the deviation across the various variations. It can be seen that MLP has a slightly higher average training time compared to XGB. The training times for the synthetic data creation are significantly shorter than the regression model training times.

4.2.4. Regression Evaluation

Six different metrics are used to evaluate the performance of the models: the coefficient of determination ($R^2$), Explained Variance Score ($EVS$), Root Mean Square Error ($RMSE$), Mean Absolute Error ($MAE$), Mean Absolute Percentage Error ($MAPE$), and Maximum Percentage Error ($MPE$). If we define the predicted numerical values as $\hat{y} = \hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n$ and the actual values as $y = y_1, y_2, \ldots, y_n$, we can define these metrics as per below.
$R^2$ is a statistical measure representing the proportion of the variance in the dependent variable explained by the independent variable in the regression model, calculated as [50]:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$
$EVS$ defines the proportion of the variance in the set $y$ explained by the predicted values $\hat{y}$, and it is calculated as [51]:
$$EVS = 1 - \frac{\mathrm{Var}(y - \hat{y})}{\mathrm{Var}(y)}.$$
Both of these values are expressed in the range $[0, 1]$, with higher values indicating a better-performing model. The other metrics used are various error metrics. $MAE$ was selected because it is commonly used in data science as a loss function for prediction models, and it has been used in this research as well, allowing for a direct evaluation of performance. It is calculated as [52]:
$$MAE = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{n}.$$
The $RMSE$ error has been selected due to its common use in the field as an evaluation metric [52]:
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}.$$
Finally, the Maximum Percentage Error is a common metric that shows the poorest performance of the model. This value is expressed in this research as a percentage of the total range, to allow for easier evaluation of performance [53]:
$$MPE = \max_i \frac{|y_i - \hat{y}_i|}{y_i} \times 100\%.$$
This is the same reason why $MAPE$ is used: it allows the error to be evaluated more easily, as it is expressed as a percentage of the entire variable range [54]:
$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|.$$
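Under the definitions above, the six metrics can be computed as in the following plain-Python sketch (the percentage metrics here divide by the actual values $y_i$, following the formulas):

```python
import math

def r2(y, yhat):
    # Coefficient of determination.
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def evs(y, yhat):
    # Explained variance score: 1 - Var(y - yhat) / Var(y).
    def var(v):
        m = sum(v) / len(v)
        return sum((x - m) ** 2 for x in v) / len(v)
    return 1 - var([a - b for a, b in zip(y, yhat)]) / var(y)

def mae(y, yhat):
    # Mean absolute error.
    return sum(abs(a - b) for a, b in zip(y, yhat)) / len(y)

def rmse(y, yhat):
    # Root mean square error.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y))

def mape(y, yhat):
    # Mean absolute percentage error, in percent.
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, yhat)) / len(y)

def mpe(y, yhat):
    # Maximum percentage error, in percent.
    return 100.0 * max(abs((a - b) / a) for a, b in zip(y, yhat))
```

A perfect prediction yields $R^2 = EVS = 1$ and zero for all four error metrics.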

5. Results

5.1. Synthetic Data Generator Results

The scores of the different generated datasets, where each dataset is denoted by the generation method (one of the three used) and the number of original data points used for training, are given in Table 4.
The best scores are generally achieved using TVAE, except when 50 original data points are used, in which case the copula method slightly outperforms it. CTGAN shows the poorest performance across all data point amounts compared to the other methods. In addition, the scores show a relatively large decrease in quality relative to the original data when fewer data points are used. Because of these results, and to speed up and simplify further training, only the TVAE datasets have been used, except for the 50-original-data-point case, in which a copula-generated dataset has been used.

5.2. Regression Results

The scores achieved using the XGB modeling algorithm are presented in Table 5. The scores are given for each of the dataset targets and each amount of original data points used in creating the training dataset. As mentioned, validation is performed on the 50 reserved data points, on which the presented results are calculated.
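The evaluation protocol described above (train on synthetic rows, score on the 50 reserved original points) can be sketched as follows. The data here are toy stand-ins, and scikit-learn's GradientBoostingRegressor is used as a stand-in for the XGB library; all names and values are illustrative, not the study's.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

# Toy stand-in data: in the study, X would hold the 21 measured features
# (T, p, m at the 7 points) and y one exergy target.
rng = np.random.default_rng(42)
w = np.array([3.0, -2.0, 1.0, 0.5, 0.0])
X_synth = rng.uniform(size=(600, 5))                  # synthetic training rows
y_synth = X_synth @ w + rng.normal(0, 0.05, 600)
X_val = rng.uniform(size=(50, 5))                     # 50 reserved original points
y_val = X_val @ w + rng.normal(0, 0.05, 50)

# Gradient-boosted trees with a slow learning rate, many estimators, and
# deeper trees, echoing the hyperparameter ranges of Table 2.
model = GradientBoostingRegressor(learning_rate=0.05, n_estimators=300, max_depth=5)
model.fit(X_synth, y_synth)
print(round(r2_score(y_val, model.predict(X_val)), 3))
```

The key point of the protocol is that the validation rows never pass through the synthetic data generator, so the score reflects how well a synthetically trained model transfers back to real measurements.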
The results demonstrate that XGB achieves the best scores on the dataset created from the largest number of original data points (100), with R² ranging from 0.91 in the case of η_HPC to 0.99 for η_WT. Performance decreases sharply when training datasets generated from fewer original data points are used, and this holds across all six regression targets.
In the same manner, the results for the MLP algorithm are given in Table 6. The MLP algorithm shows behavior similar to XGB: a sharp decrease in performance for datasets developed using fewer original data points. Some of the performance decreases are even larger, such as in the case of η_WT, where R² drops from 0.98 to 0.74 between the 100 and 50 data point datasets, with the mean absolute percentage error (MAPE) almost doubling in the same case. The results for the models trained on the 100 original data point dataset are overall satisfactory, with R² ranging from 0.82 to 0.98. Still, some of these models show poorer performance than the models generated on the same datasets using XGB as the regression algorithm. For easier comparison of individual results, the achieved scores are illustrated in Figures 4–9. Each figure shows all of the scores used, with R², EVS, MPE, and MAPE given on the left y-scale and MAE and MSE on the right scale, due to the difference in magnitudes. It should be noted that MPE and MAPE are expressed in the range [0, 1] instead of as percentages in the range [0, 100]%, for easier visibility. Each figure contains two subfigures, with the left showing the best scores achieved with the MLP algorithm and the right showing the scores achieved with XGB.
Figure 4 shows that similarly high results are achieved on the 100 original data point dataset. The results are close across all of the used metrics for both methods, with no metric showing a significant difference between the best-performing models of the two methods.
Observing the results achieved for η_LPC in Figure 5, both models achieve poorer results in comparison to those obtained on original data. Still, XGB clearly generates a model that outperforms MLP by a significant margin. This is especially obvious when observing the maximum error, expressed as a percentage, which rises from 7.21% for the XGB model to over 30% for the MLP model.
The same behavior, with XGB outperforming MLP across all of the metrics used for model evaluation, is shown in Figure 6, which presents the performance of the η_HPC models. The gap here is even more significant than previously: the coefficient of determination decreases from 0.91 (XGB) to 0.82 (MLP), and similar performance decreases are observable in the other metrics, with MAPE increasing by almost a full percentage point, from 2.48% to 3.46%.
The results in Figure 7 show that MLP-based models achieve slightly higher scores than those developed using XGB for Ex_WT as a target. Still, the performance difference is not as large as it was in the case of η_LPC and η_HPC.
The models regressing Ex_LPC, shown in Figure 8, perform similarly under the two regression algorithms. For example, on the dataset built from the most original data points, EVS decreases from 0.96 for MLP to 0.94 for XGB, a difference confirmed by the other metrics. While this difference in performance is not as significant as in the case of η_LPC and η_HPC, it is present.
Finally, the results for Ex_HPC are provided in Figure 9, and they follow the trend set by the previous two exergy destruction targets, with MLP-based models showing slightly better performance than XGB-based models. Still, these models are the closest in performance: the difference in MAE is only 6.15 (81.23 for XGB versus 75.08 for MLP-based models). The slightness of this difference is best demonstrated by MAPE, which decreases from 1.26% (XGB) to 1.17% (MLP).
The hyperparameters of the best-performing model for each of the targets are given in Table 7.
The hyperparameters of the XGB models are similar to one another: all use a slow learning rate, a large number of estimators, and a large maximum depth. The MLP models are likewise similar, as all use the same architecture of six hidden layers with five hundred neurons per layer, the ReLU activation function, the LBFGS solver, and a slow learning rate. These large values indicate a complex problem for which it was difficult to generate a model. Compared with the hyperparameter values that yielded high-performing models on real data in previous research [27], this indicates that regression on synthetic data is harder to perform than regression on the original data.
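For reference, the reported best MLP configuration (six hidden layers of five hundred neurons, ReLU activation, LBFGS solver) maps onto scikit-learn's MLPRegressor roughly as below. The regularization value is an illustrative placeholder, as the exact per-target values are those listed in Table 7.

```python
from sklearn.neural_network import MLPRegressor

# Sketch of the reported best MLP architecture; alpha (L2 regularization)
# is illustrative, not a value taken from Table 7.
mlp = MLPRegressor(
    hidden_layer_sizes=(500,) * 6,  # six hidden layers of 500 neurons
    activation="relu",
    solver="lbfgs",
    alpha=0.01,                     # placeholder L2 strength
    max_iter=500,
)
```

A network this wide has over a million trainable weights, which is consistent with the long MLP grid-search times reported in Table 3.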

6. Discussion

The results presented in the previous section point to several findings relevant to the research questions posed at the start of this study. The first important point is the comparison with the results achieved on the original dataset in previous research. According to the previously published research, which modeled the targets using the original, non-synthetic data points [27], the models can achieve R² scores of 0.99, except for Ex_LPC, which achieves at most 0.97. Compared with the results achieved in this research, the results on synthetic data show either equal performance (η_WT using XGB) or a decrease varying from slight (ΔR² = 0.06 for η_LPC using XGB, ΔR² = 0.02 for Ex_WT using MLP, ΔR² = 0.01 for Ex_LPC using MLP, and ΔR² = 0.01 for Ex_HPC using MLP) to more significant (ΔR² = 0.08 for η_HPC using XGB). This indicates that synthetic data may be used in similar research targeting marine power plants, but the expected model performance is lower than with direct training on original data. Still, in cases where a large amount of original data is not available, such an approach can be valid for generating initial research results, especially in the proof-of-concept stage. Comparing the two regression algorithms used, there is no clear winner across all targets: exergy efficiency regression performs better with the XGB algorithm, while exergy destruction regression performs better with MLP-based models. For this reason, multiple algorithms should be applied when such data modeling is attempted. On the topic of synthetic data generation, two things can be noted.
First, it is shown that the initial correlation evaluation of the synthetic datasets against the original datasets is a good indicator of the performance of subsequently trained models, with the lower correlation scores of datasets created from fewer original data points translating into poorer results. This indicates that in similar future research, time can be saved by using such a metric to evaluate generated datasets directly, before training. This matters because ML training is highly demanding in computational resources and time, while the data generation techniques require far fewer resources. The difference in performance between the synthetic data techniques is also visible, with TVAE providing the overall best results of the three techniques used, indicating that future research in this domain should focus on it as a data generation technique, at least initially. Finally, the number of original data points used has a large impact on the synthetic datasets and the models trained on them. Any amount lower than 100 data points has been shown to yield very poor results. This indicates that the amount of data needed cannot be lowered dramatically, but it must be noted that 100 points is still only two-thirds of the original dataset. Depending on the application and the specific case in which the data are being collected, lowering the necessary data by one-third of the total amount can simplify and speed up the process, indicating that this approach should be considered in cases where data collection is impractical.
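One simple way to implement such a pre-training check is to compare the pairwise Pearson correlation matrices of the original and synthetic tables. The sketch below is an illustrative NumPy version of this idea, not the exact score reported in Table 4.

```python
import numpy as np

def correlation_similarity(original, synthetic):
    """Illustrative pre-training check: compare the pairwise Pearson
    correlation matrices of the original and synthetic datasets and map
    the mean absolute difference into a [0, 1] similarity score.
    (Not the exact quality score reported in Table 4.)"""
    c_orig = np.corrcoef(np.asarray(original, float), rowvar=False)
    c_syn = np.corrcoef(np.asarray(synthetic, float), rowvar=False)
    iu = np.triu_indices_from(c_orig, k=1)      # unique off-diagonal pairs
    # Each correlation difference spans [0, 2], so halve before inverting.
    return 1.0 - np.mean(np.abs(c_orig[iu] - c_syn[iu])) / 2.0
```

A score of 1.0 means the synthetic table reproduces every pairwise correlation exactly; values well below 1.0 flag a dataset that is unlikely to train a good model, so it can be discarded before any expensive training run.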

7. Conclusions

In this research, the authors used a previously collected dataset containing information on the operation of a marine steam turbine, namely the measured temperature, mass flow rate, and pressure at seven measurement points. The exergy destruction and exergy efficiency were then calculated for three elements: the LPC, the HPC, and the turbine as a whole. This dataset was split into multiple smaller parts: 50 data points were reserved for validation, and the rest was used to construct training subsets consisting of 100, 50, 25, and 10 data points. Using these subsets, three methods were used to generate statistical synthetic data points: TVAE, CTGAN, and copula. The best-performing synthetic datasets were then used to generate ML-based models with two methods, MLP and XGB. After evaluation on the 50 reserved original data points, the performance of the models was concluded to be satisfactory, with models achieving R² higher than 0.9. From the obtained results, the originally posed research questions can be addressed:
  • RQ1—synthetic data can be used to generate data for steam turbine modeling.
  • RQ2—in comparison to models trained on original data, the performance shows a slight decrease, depending on the method used.
  • RQ3—synthetic datasets based on less than two-thirds of the original data points show poor performance when evaluated in comparison to the original and in modeling.
  • RQ4—the performance impact of using less original data for data generation is significant, with large decreases in performance visible in the obtained results.
The main conclusion of the article can be summarized as follows: it should be possible to use synthetically generated data for modeling in maritime applications, especially in cases where extreme precision is not necessary and where the collection of larger datasets is expensive or highly impractical. The limitations of this research lie in the use of a single dataset for analysis. A clear limitation of the presented work is the lack of external validation on a dataset collected from a different steam turbine; in the future, further model validation should be performed using a separate validation dataset. Another limitation is that no hyperparameter tuning based on sensitivity analysis was performed. While the achieved results were satisfactory within the context of the study (testing the performance of models on synthetic data), such an analysis could further improve the models and should be performed, especially for real-world applications. Finally, the distribution of the collected data is not uniform, as the measurements were taken on a working turbine. This may cause poorer performance of the models in cases where the data given for prediction come from a sparsely populated area of the dataset. Wider-reaching conclusions should rely on multiple datasets from similar maritime applications to observe the data more clearly; this is planned to be addressed in future research.

Author Contributions

Conceptualization, I.P. and V.M.; methodology, S.B.Š. and N.A.; software, S.B.Š.; validation, N.A., I.P. and Z.C.; formal analysis, V.M.; investigation, N.A.; resources, I.P.; data curation, I.P.; writing—original draft preparation, S.B.Š., V.M. and N.A.; writing—review and editing, I.P. and Z.C.; visualization, S.B.Š.; supervision, V.M. and Z.C.; project administration, Z.C.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available, as approval for data sharing must be obtained from the data provider.

Acknowledgments

This research was (partly) supported by the CEEPUS network CIII-HR-0108, the European Regional Development Fund under Grant KK.01.1.1.01.0009 (DATACROSS), the Erasmus+ project WICT under Grant 2021-1-HR01-KA220-HED-000031177, Croatian Science Foundation under the project IP-2018-01-3739, the University of Rijeka Scientific Grants uniri-mladi-technic-22-61, uniri-tehnic-18-18-1146, uniri-tehnic-18-14 and uniri-tehnic-18-275-1447.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

  • The following abbreviations are used in this manuscript:
    CTGAN: Conditional Tabular Generative Adversarial Network
    EVS: Explained Variance Score
    GAN: Generative Adversarial Network
    HFO: Heavy Fuel Oil Heater
    IPC: Intermediate Pressure Cylinder
    BOG: Boil Off Gas Heater
    HPC: High Pressure Cylinder
    LPC: Low Pressure Cylinder
    MAE: Mean Absolute Error
    MAPE: Mean Absolute Percentage Error
    MPE: Maximum Percentage Error
    MLP: Multilayer Perceptron
    MSE: Mean Squared Error
    P1: Propulsion Propeller 1
    P2: Propulsion Propeller 2
    P: Pearson Correlation Coefficient
    R²: Coefficient of Determination
    TVAE: Triplet-Encoded Variable Autoencoder
    WT: Whole Turbine
    XGB: Extreme Gradient Boosting

References

  1. Aylak, B.L. The impacts of the applications of artificial intelligence in maritime logistics. Avrupa Bilim Teknol. Derg. 2022, 34, 217–225.
  2. Gonca, G.; Sahin, B.; Genc, I. Investigation of maximum performance characteristics of seven-process cycle engine. Int. J. Exergy 2022, 37, 302–312.
  3. Fam, M.L.; Tay, Z.Y.; Konovessis, D. An Artificial Neural Network for fuel efficiency analysis for cargo vessel operation. Ocean. Eng. 2022, 264, 112437.
  4. Bassam, A.M.; Phillips, A.B.; Turnock, S.R.; Wilson, P.A. Ship speed prediction based on machine learning for efficient shipping operation. Ocean. Eng. 2022, 245, 110449.
  5. Wang, Q.; Yu, P.; Chang, X.; Fan, G.; Li, G. A Novel Ship Fatigue Damage's Prediction Model Based on the Artificial Neural Network Approach. In Proceedings of the 32nd International Ocean and Polar Engineering Conference, Shanghai, China, 6–10 June 2022; OnePetro: Richardson, TX, USA, 2022.
  6. Hastie, T.; Tibshirani, R.; Friedman, J.H.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2.
  7. Plaza-Hernández, M.; Gil-González, A.B.; Rodríguez-González, S.; Prieto-Tejedor, J.; Corchado-Rodríguez, J.M. Integration of IoT technologies in the maritime industry. In Proceedings of the Distributed Computing and Artificial Intelligence, Special Sessions, 17th International Conference, L'Aquila, Italy, 17–19 June 2020; Springer: Berlin/Heidelberg, Germany, 2021; pp. 107–115.
  8. Chen, Z.; Luo, X.; Sun, Y. Synthetic data augmentation rules for maritime object detection. Int. J. Comput. Sci. Eng. 2020, 23, 169–176.
  9. Kastner, M.; Grasse, O.; Jahn, C. Container Flow Generation for Maritime Container Terminals. In Proceedings of the Dynamics in Logistics: Proceedings of the 8th International Conference LDIC, Bremen, Germany, 23–25 February 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 133–143.
  10. Bruns, R.; Dunkel, J.; Seremet, S. Learning Ship Activity Patterns in Maritime Data Streams: Enhancing CEP Rule Learning by Temporal and Spatial Relations and Domain-Specific Functions. IEEE Trans. Intell. Transp. Syst. 2023.
  11. Higgins, E.; Sobien, D.; Freeman, L.; Pitt, J.S. Ship wake detection using data fusion in multi-sensor remote sensing applications. In Proceedings of the AIAA SCITECH 2022 Forum, San Diego, CA, USA, 3–7 January 2022; p. 0997.
  12. He, X.; Ji, W. Single maritime image dehazing using unpaired adversarial learning. Signal Image Video Process. 2023, 17, 593–600.
  13. Ribeiro, M.; Damas, B.; Bernardino, A. Real-Time Ship Segmentation in Maritime Surveillance Videos Using Automatically Annotated Synthetic Datasets. Sensors 2022, 22, 8090.
  14. Taghavifar, H.; Perera, L.P. Data-driven modeling of energy-exergy in marine engines by supervised ANNs based on fuel type and injection angle classification. Process. Saf. Environ. Prot. 2023, 172, 546–561.
  15. Strušnik, D. Integration of machine learning to increase steam turbine condenser vacuum and efficiency through gasket resealing and higher heat extraction into the atmosphere. Int. J. Energy Res. 2022, 46, 3189–3212.
  16. Kartal, F.; Özveren, U. Investigation of the chemical exergy of torrefied biomass from raw biomass by means of machine learning. Biomass Bioenergy 2022, 159, 106383.
  17. Arslan, E.; Das, M.; Akpinar, E. Obtaining mathematical equations for energy, exergy and electrical efficiency: A machine learning approach. Energy Sources Part Recover. Util. Environ. Eff. 2023, 45, 4370–4385.
  18. El Emam, K. Seven ways to evaluate the utility of synthetic data. IEEE Secur. Priv. 2020, 18, 56–59.
  19. Hyundai-Mitsubishi. Marine Steam Turbine MS40-2—Instruction Book for Marine Turbine Unit; Hyundai Heavy Industries, Co., Ltd.: Ulsan, Republic of Korea, 2004.
  20. Çiçek, A. Exergy Analysis of a Crude Oil Carrier Steam Plant. Ph.D. Thesis, Istanbul Technical University, Istanbul, Turkey, 2009.
  21. Mrzljak, V.; Poljak, I.; Medica-Viola, V. Thermodynamical analysis of high-pressure feed water heater in steam propulsion system during exploitation. Brodogr. Teor. Praksa Brodogr. Pomor. Teh. 2017, 68, 45–61.
  22. Škopac, L.; Medica-Viola, V.; Mrzljak, V. Selection Maps of Explicit Colebrook Approximations according to Calculation Time and Precision. Heat Transf. Eng. 2021, 42, 839–853.
  23. Koroglu, T.; Sogut, O.S. Conventional and advanced exergy analyses of a marine steam power plant. Energy 2018, 163, 392–403.
  24. Kocijel, L.; Poljak, I.; Mrzljak, V.; Car, Z. Energy loss analysis at the gland seals of a marine turbo-generator steam turbine. Teh. Glas. 2020, 14, 19–26.
  25. Moran, M.J.; Shapiro, H.N.; Boettner, D.D.; Bailey, M.B. Fundamentals of Engineering Thermodynamics; John Wiley & Sons: Hoboken, NJ, USA, 2010.
  26. Mrzljak, V.; Blecich, P.; Anđelić, N.; Lorencin, I. Energy and exergy analyses of forced draft fan for marine steam propulsion system during load change. J. Mar. Sci. Eng. 2019, 7, 381.
  27. Baressi Šegota, S.; Lorencin, I.; Anđelić, N.; Mrzljak, V.; Car, Z. Improvement of marine steam turbine conventional exergy analysis by neural network application. J. Mar. Sci. Eng. 2020, 8, 884.
  28. Uddin, M.S.; Pamie-George, R.; Wilkins, D.; Sousa-Poza, A.; Canan, M.; Kovacic, S.; Li, J. Ship Deck Segmentation In Engineering Document Using Generative Adversarial Networks. In Proceedings of the 2022 IEEE World AI IoT Congress (AIIoT), Seattle, WA, USA, 6–9 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 207–212.
  29. Fan, G.; He, Z.; Li, J. Structural dynamic response reconstruction using self-attention enhanced generative adversarial networks. Eng. Struct. 2023, 276, 115334.
  30. Patki, N.; Wedge, R.; Veeramachaneni, K. The synthetic data vault. In Proceedings of the 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), Montreal, QC, Canada, 17–19 October 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 399–410.
  31. Kiran, A.; Kumar, S.S. A Comparative Analysis of GAN and VAE based Synthetic Data Generators for High Dimensional, Imbalanced Tabular data. In Proceedings of the 2023 2nd International Conference for Innovation in Technology (INOCON), Bangalore, India, 3–5 March 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6.
  32. Sei, Y.; Onesimu, J.A.; Ohsuga, A. Machine learning model generation with copula-based synthetic dataset for local differentially private numerical data. IEEE Access 2022, 10, 101656–101671.
  33. Šegota, S.B.; Anđelić, N.; Štifanić, D.; Štifanić, J.; Car, Z. On differentiating synthetic and real data in medical applications. In Proceedings of the Second Serbian International Conference on Applied Artificial Intelligence (SICAAI), University of Kragujevac, Kragujevac, Serbia, 19–20 May 2023; pp. 57–61.
  34. Dina, A.S.; Siddique, A.; Manivannan, D. Effect of balancing data using synthetic data on the performance of machine learning classifiers for intrusion detection in computer networks. IEEE Access 2022, 10, 96731–96747.
  35. Zhang, K.; Patki, N.; Veeramachaneni, K. Sequential Models in the Synthetic Data Vault. arXiv 2022, arXiv:2207.14406.
  36. Sepúlveda-García, J.J.; Alvarez, D.A. On the use of copulas in geotechnical engineering: A tutorial and state-of-the-art-review. Arch. Comput. Methods Eng. 2022, 29, 4683–4733.
  37. Shen, Z.; Zang, C.; Chen, X.; Hu, S.; Liu, X.e. Uncertainty quantification for correlated variables combining p-box with copula upon limited observed data. Eng. Comput. 2022, 39, 2144–2161.
  38. Chalé, M.; Bastian, N.D. Generating realistic cyber data for training and evaluating machine learning classifiers for network intrusion detection systems. Expert Syst. Appl. 2022, 207, 117936.
  39. Lee, T.; Park, C.S.; Nam, K.; Kim, S.S. Query Transformation for Approximate Query Processing Using Synthetic Data from Deep Generative Models. In Proceedings of the 2022 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Yeosu, Republic of Korea, 26–28 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–4.
  40. Fang, M.L.; Dhami, D.S.; Kersting, K. DP-CTGAN: Differentially private medical data generation using CTGANs. In Proceedings of the Artificial Intelligence in Medicine: 20th International Conference on Artificial Intelligence in Medicine, AIME 2022, Halifax, NS, Canada, 14–17 June 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 178–188.
  41. Hernandez, M.; Epelde, G.; Alberdi, A.; Cilla, R.; Rankin, D. Synthetic data generation for tabular health records: A systematic review. Neurocomputing 2022, 493, 28–45.
  42. Saadatmorad, M.; Talookolaei, R.A.J.; Pashaei, M.H.; Khatir, S.; Wahab, M.A. Pearson correlation and discrete wavelet transform for crack identification in steel beams. Mathematics 2022, 10, 2689.
  43. Majnarić, D.; Šegota, S.B.; Lorencin, I.; Car, Z. Prediction of main particulars of container ships using artificial intelligence algorithms. Ocean. Eng. 2022, 265, 112571.
  44. Pakkiraiah, C.; Satyanarayana, R. FPGA realization of low power multi-layer perceptron full adder to minimize EDP of modular multiplier. Int. J. Electron. Eng. Appl. 2022, 10, 1–12.
  45. Nguyen, D.D.; Roussis, P.C.; Pham, B.T.; Ferentinou, M.; Mamou, A.; Vu, D.Q.; Bui, Q.A.T.; Trong, D.K.; Asteris, P.G. Bagging and Multilayer Perceptron Hybrid Intelligence Models Predicting the Swelling Potential of Soil. Transp. Geotech. 2022, 36, 100797.
  46. Wang, S.; Teng, Y.; Perdikaris, P. Understanding and mitigating gradient pathologies in physics-informed neural networks. arXiv 2020, arXiv:2001.04536.
  47. Khan, M.A.; Shah, M.I.; Javed, M.F.; Khan, M.I.; Rasheed, S.; El-Shorbagy, M.; El-Zahar, E.R.; Malik, M. Application of random forest for modelling of surface water salinity. Ain Shams Eng. J. 2022, 13, 101635.
  48. Pan, S.; Zheng, Z.; Guo, Z.; Luo, H. An optimized XGBoost method for predicting reservoir porosity using petrophysical logs. J. Pet. Sci. Eng. 2022, 208, 109520.
  49. Baressi Šegota, S.; Anđelić, N.; Lorencin, I.; Štifanić, D.; Musulin, J.; Car, Z. Z4 HPC Cluster. In Proceedings of the RI-STEM-2021, Rijeka, Croatia, 5 June 2021; pp. 51–56.
  50. Ozer, D.J. Correlation and the coefficient of determination. Psychol. Bull. 1985, 97, 307.
  51. Good, R.; Fletcher, H.J. Reporting explained variance. J. Res. Sci. Teach. 1981, 18, 1–7.
  52. Chai, T.; Draxler, R.R. Root mean square error (RMSE) or mean absolute error (MAE). Geosci. Model Dev. Discuss. 2014, 7, 1525–1534.
  53. De Myttenaere, A.; Golden, B.; Le Grand, B.; Rossi, F. Using the Mean Absolute Percentage Error for Regression Models. In Proceedings of the ESANN, Bruges, Belgium, 22–24 April 2015.
  54. McKenzie, J. Mean absolute percentage error and bias in economic forecasting. Econ. Lett. 2011, 113, 259–262.
Figure 1. The overview of the methodological approach. (TVAE—Triplet Encoded Variable Autoencoder, CTGAN—Conditional Tabular Generative Adversarial Network, MLP—Multilayer Perceptron, XGB—Extreme Gradient Boosting).
Figure 2. Schematic of the main marine steam turbine, with operating points used in exergy analysis noted and numbered. HPC—High Pressure Cylinder; LPC—Low Pressure Cylinder; P1—Propulsion Propeller 1; P2—Propulsion Propeller 2.
Figure 3. Histograms of the dataset (T—temperature at the point, P—pressure at the point, M—mass flow rate at the point, 1–7—measurement points, as given in the previous figure).
Figure 4. Best achieved regression scores for η_WT (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).
Figure 5. Best achieved regression scores for η_LPC (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).
Figure 6. Best achieved regression scores for η_HPC (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).
Figure 7. Best achieved regression scores for Ex_WT (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).
Figure 8. Best achieved regression scores for Ex_LPC (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).
Figure 9. Best achieved regression scores for Ex_HPC (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).
Table 1. Hyperparameters used in GS for MLP.

Hyperparameter | Values | Count
Number of hidden layers | 1, 2, 3, 4, 5, 6 | 6
Number of neurons per hidden layer | 5, 10, 25, 50, 100, 200, 500 | 4
Activation | Identity, Logistic, TanH, ReLU | 1
Solver | LBFGS, Adam | 2
Learning rate type | Constant, Inverse scaling, Adaptive | 3
Initial learning rate | 0.001, 0.01, 0.1, 1 | 4
L2 | 0.0001, 0.001, 0.01, 0.1, 1 | 5
Total grid size | | 2880
Table 2. Hyperparameters used in the GS for XGB.

Hyperparameter | Values | Count
Learning rate | 0.01, 0.05, 0.1, 0.2, 0.3 | 5
Number of estimators | 100, 200, 300, 400, 500 | 5
Maximum depth | 3, 4, 5, 6, 7, 8, 9, 10 | 8
Minimum child weight | 1, 2, 3, 4, 5, 6, 7, 8, 9 | 9
Gamma | 0, 0.1, 0.2, 0.3, 0.4 | 5
Subsample | 0.5, 0.6, 0.7, 0.8, 0.9 | 5
Column sample by tree | 0.5, 0.6, 0.7, 0.8, 0.9 | 5
L2 regularization | 0, 0.1, 0.2, 0.3, 0.4 | 5
Total grid size | | 1,125,000
Table 3. Computational resources and average training times (T̄—average real compute time, σ—average deviation).

Resource | MLP / XGB | TVAE / CTGAN / Copula
CPU | AMD Epyc Rome 7532 | 2× Intel Xeon Gold 6240R
RAM | 128 GB DDR4 ECC | 768 GB DDR4 ECC
Storage | Micron 5300 240 GB | Intel D3-S4510 240 GB
GPU | - | 5× NVIDIA Quadro RTX 6000
MBO | H12SST-PS | SuperMicro X11DPG-OT-CPU
Platform | Supermicro 2014TP-HTR | SuperMicro 6049GP-TRTKPL
T̄ [hours] | 123.53 (MLP) / 106.50 (XGB) | 0.354 (TVAE) / 0.342 (CTGAN) / 0.265 (Copula)
σ [hours] | 1.08 (MLP) / 1.32 (XGB) | 0.102 (TVAE) / 0.094 (CTGAN) / 0.054 (Copula)
Table 4. Achieved scores on datasets generated with different methods, on a different amount of original data points.

Score | 100 | 50 | 25 | 10
S_TVAE | 0.85 | 0.79 | 0.71 | 0.65
S_COPULA | 0.83 | 0.80 | 0.69 | 0.64
S_CTGAN | 0.79 | 0.70 | 0.62 | 0.55
Table 5. The results for each of the targets with models created using the XGB algorithm on the synthetic datasets created with different original data point amounts (R², EVS—higher is better; MAE, MAPE, MPE, MSE—lower is better).

Target | Original Data Points | R² | MAE | MSE | EVS | MPE | MAPE
η_WT | 100 | 0.99 | 1.54 | 4.55 | 0.98 | 6.15 | 1.81
η_WT | 50 | 0.84 | 1.64 | 5.79 | 0.84 | 8.45 | 1.93
η_WT | 25 | 0.75 | 2.18 | 8.06 | 0.75 | 8.80 | 2.56
η_WT | 10 | 0.66 | 3.43 | 10.90 | 0.66 | 12.60 | 4.03
η_LPC | 100 | 0.93 | 1.92 | 6.82 | 0.93 | 7.21 | 2.22
η_LPC | 50 | 0.81 | 2.27 | 8.83 | 0.81 | 7.53 | 2.63
η_LPC | 25 | 0.80 | 2.87 | 13.99 | 0.80 | 9.04 | 3.33
η_LPC | 10 | 0.76 | 4.00 | 21.31 | 0.77 | 9.45 | 4.63
η_HPC | 100 | 0.91 | 2.02 | 7.11 | 0.90 | 8.38 | 2.48
η_HPC | 50 | 0.88 | 2.68 | 9.11 | 0.88 | 13.27 | 3.29
η_HPC | 25 | 0.85 | 4.17 | 10.49 | 0.85 | 15.48 | 5.12
η_HPC | 10 | 0.79 | 6.25 | 11.46 | 0.79 | 24.67 | 7.66
Ex_WT | 100 | 0.95 | 231.81 | 95,813.87 | 0.96 | 36.56 | 8.38
Ex_WT | 50 | 0.90 | 370.49 | 118,394.75 | 0.89 | 51.30 | 13.39
Ex_WT | 25 | 0.87 | 499.92 | 185,554.10 | 0.87 | 78.62 | 18.07
Ex_WT | 10 | 0.79 | 739.61 | 242,285.90 | 0.79 | 103.37 | 26.73
Ex_LPC | 100 | 0.94 | 201.22 | 54,211.42 | 0.94 | 19.07 | 5.24
Ex_LPC | 50 | 0.81 | 215.28 | 64,406.55 | 0.81 | 26.63 | 5.61
Ex_LPC | 25 | 0.70 | 276.92 | 101,246.77 | 0.70 | 33.29 | 7.21
Ex_LPC | 10 | 0.66 | 431.40 | 113,144.51 | 0.66 | 49.88 | 11.24
Ex_HPC | 100 | 0.97 | 81.23 | 9002.36 | 0.92 | 3.44 | 1.26
Ex_HPC | 50 | 0.82 | 117.48 | 13,007.82 | 0.82 | 3.98 | 1.83
Ex_HPC | 25 | 0.78 | 121.29 | 19,683.38 | 0.78 | 6.12 | 1.88
Ex_HPC | 10 | 0.74 | 156.93 | 25,075.91 | 0.74 | 7.97 | 2.44
Table 6. The results for each of the targets with models created using MLP algorithm on the synthetic datasets created with different original data point amounts ( R 2 , E V S —higher is better, M A E , M A P E , M P E , M S E —lower is better).
| Target | Original data points | R² | EVS | MAE | MPE | MAPE | MSE |
| η_WT | 100 | 0.98 | 0.98 | 1.68 | 9.05 | 1.98 | 4.76 |
| | 50 | 0.74 | 0.73 | 2.93 | 10.72 | 3.45 | 16.23 |
| | 25 | 0.64 | 0.64 | 3.11 | 13.46 | 3.66 | 30.11 |
| | 10 | 0.55 | 0.55 | 4.52 | 15.59 | 5.32 | 34.11 |
| η_LPC | 100 | 0.88 | 0.89 | 2.62 | 30.87 | 3.04 | 28.12 |
| | 50 | 0.64 | 0.64 | 3.45 | 40.01 | 4.00 | 40.25 |
| | 25 | 0.67 | 0.67 | 4.18 | 48.54 | 4.84 | 50.23 |
| | 10 | 0.59 | 0.60 | 6.23 | 70.02 | 7.22 | 70.14 |
| η_HPC | 100 | 0.82 | 0.82 | 2.82 | 32.44 | 3.46 | 36.94 |
| | 50 | 0.70 | 0.71 | 4.15 | 35.35 | 5.09 | 48.12 |
| | 25 | 0.65 | 0.65 | 5.23 | 41.08 | 6.41 | 61.40 |
| | 10 | 0.61 | 0.61 | 6.41 | 49.17 | 7.86 | 69.88 |
| Ex_WT | 100 | 0.97 | 0.99 | 249.13 | 34.49 | 9.01 | 94,965.88 |
| | 50 | 0.91 | 0.91 | 356.53 | 44.72 | 12.89 | 122,124.19 |
| | 25 | 0.83 | 0.83 | 563.00 | 62.79 | 20.35 | 127,349.20 |
| | 10 | 0.78 | 0.76 | 801.67 | 93.29 | 28.98 | 153,143.73 |
| Ex_LPC | 100 | 0.96 | 0.96 | 184.09 | 17.10 | 4.80 | 51,408.59 |
| | 50 | 0.90 | 0.91 | 190.60 | 27.39 | 4.96 | 51,713.12 |
| | 25 | 0.84 | 0.84 | 212.90 | 31.32 | 5.55 | 60,343.57 |
| | 10 | 0.76 | 0.76 | 312.71 | 35.54 | 8.15 | 100,387.11 |
| Ex_HPC | 100 | 0.98 | 0.98 | 75.08 | 3.47 | 1.17 | 8,695.26 |
| | 50 | 0.88 | 0.89 | 97.92 | 5.58 | 1.52 | 10,898.70 |
| | 25 | 0.74 | 0.74 | 158.12 | 5.82 | 2.46 | 12,681.59 |
| | 10 | 0.66 | 0.66 | 206.16 | 7.02 | 3.20 | 16,297.85 |
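For reference, the six scores reported in Tables 5 and 6 follow their standard definitions; the plain-Python sketch below computes all of them for a pair of target/prediction vectors (scikit-learn's `metrics` module provides equivalent functions). MPE and MAPE are expressed in percent, as in the tables.

```python
def regression_metrics(y_true, y_pred):
    """Compute the six scores used in Tables 5 and 6.

    R2 and EVS: higher is better; MAE, MSE, MPE, MAPE: lower is better.
    MPE and MAPE are returned in percent; assumes no true value is zero.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n                         # mean absolute error
    mse = sum(e * e for e in errors) / n                          # mean squared error
    mpe = 100.0 * sum(e / t for e, t in zip(errors, y_true)) / n  # mean (signed) % error
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # Coefficient of determination: 1 - SS_res / SS_tot.
    r2 = 1.0 - sum(e * e for e in errors) / ss_tot
    # Explained variance: 1 - Var(error) / Var(y); equals R2 when the
    # errors are unbiased (zero mean).
    mean_err = sum(errors) / n
    evs = 1.0 - sum((e - mean_err) ** 2 for e in errors) / ss_tot

    return {"R2": r2, "MAE": mae, "MSE": mse,
            "EVS": evs, "MPE": mpe, "MAPE": mape}
```

Note that EVS can exceed R² when the model has a systematic bias, which is why the two columns occasionally differ (e.g., Ex_WT at 100 data points).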
Table 7. The best achieved hyperparameters. (LR—Learning Rate, HL—Hidden Layers, CW—Minimum Child Weight).
| Target | Method | LR | Estimators | Max. depth | CW | Gamma | Subsample | Col. sample | L2 |
| η_WT | XGB | 0.05 | 500 | 9 | 3 | 0.1 | 0.7 | 0.6 | 0.01 |
| η_LPC | XGB | 0.1 | 400 | 9 | 7 | 0.2 | 0.9 | 0.8 | 0.1 |
| η_HPC | XGB | 0.1 | 500 | 10 | 4 | 0.2 | 0.8 | 0.8 | 0.1 |

| Target | Method | HL | Neurons | Activation | Solver | LR | Init. LR | L2 |
| Ex_WT | MLP | 6 | 500 | relu | LBFGS | constant | 0.1 | 0.001 |
| Ex_LPC | MLP | 6 | 500 | relu | LBFGS | adaptive | 0.001 | 0.0001 |
| Ex_HPC | MLP | 6 | 500 | relu | LBFGS | inverse scaling | 0.001 | 0.0001 |
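For reproducibility, the best configurations from Table 7 can be written as keyword dictionaries ready to unpack into the usual model constructors. The parameter names below follow the common `xgboost.XGBRegressor` and scikit-learn `MLPRegressor` conventions and are an assumption on our part, since the table reports values rather than API argument names (e.g., scikit-learn spells "inverse scaling" as `"invscaling"`).

```python
# Best hyperparameters from Table 7 as constructor kwargs.
# Argument names are an assumption mapping the table's columns onto the
# xgboost / scikit-learn APIs; the values themselves are from the table.

XGB_BEST = {
    "eta_WT":  dict(learning_rate=0.05, n_estimators=500, max_depth=9,
                    min_child_weight=3, gamma=0.1, subsample=0.7,
                    colsample_bytree=0.6, reg_lambda=0.01),
    "eta_LPC": dict(learning_rate=0.1, n_estimators=400, max_depth=9,
                    min_child_weight=7, gamma=0.2, subsample=0.9,
                    colsample_bytree=0.8, reg_lambda=0.1),
    "eta_HPC": dict(learning_rate=0.1, n_estimators=500, max_depth=10,
                    min_child_weight=4, gamma=0.2, subsample=0.8,
                    colsample_bytree=0.8, reg_lambda=0.1),
}

MLP_BEST = {
    # Six hidden layers of 500 neurons each.
    "Ex_WT":  dict(hidden_layer_sizes=(500,) * 6, activation="relu",
                   solver="lbfgs", learning_rate="constant",
                   learning_rate_init=0.1, alpha=0.001),
    "Ex_LPC": dict(hidden_layer_sizes=(500,) * 6, activation="relu",
                   solver="lbfgs", learning_rate="adaptive",
                   learning_rate_init=0.001, alpha=0.0001),
    "Ex_HPC": dict(hidden_layer_sizes=(500,) * 6, activation="relu",
                   solver="lbfgs", learning_rate="invscaling",
                   learning_rate_init=0.001, alpha=0.0001),
}

# Usage sketch (assumes the libraries are installed):
#   model = xgboost.XGBRegressor(**XGB_BEST["eta_WT"])
#   model = sklearn.neural_network.MLPRegressor(**MLP_BEST["Ex_WT"])
```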
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Baressi Šegota, S.; Mrzljak, V.; Anđelić, N.; Poljak, I.; Car, Z. Use of Synthetic Data in Maritime Applications for the Problem of Steam Turbine Exergy Analysis. J. Mar. Sci. Eng. 2023, 11, 1595. https://doi.org/10.3390/jmse11081595

