Article

Ship Engine Model Selection by Applying Machine Learning Classification Techniques Using Imputation and Dimensionality Reduction

by Kyriakos Skarlatos 1, Grigorios Papageorgiou 2, Panagiotis Biris 2, Ekaterini Skamnia 2, Polychronis Economou 2,* and Sotirios Bersimis 1,*

1 Department of Business Administration, University of Piraeus, 18534 Piraeus, Greece
2 Department of Civil Engineering, University of Patras, 26504 Patras, Greece
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(1), 97; https://doi.org/10.3390/jmse12010097
Submission received: 3 December 2023 / Revised: 31 December 2023 / Accepted: 31 December 2023 / Published: 3 January 2024
(This article belongs to the Special Issue Machine Learning and Modeling for Ship Design)

Abstract: The maritime industry is facing a gradual proliferation of data, frequently coupled with subpar information that contains missing and duplicate data, erroneous records, and flawed entries as a result of human intervention or a lack of access to sensitive and important collaborative information. Data limitations and restrictions have a crucial impact on inefficient data-driven decisions, leading to decreased productivity, augmented operating expenses, and a consequent substantial decline in competitive edge. The missing or inadequate presentation of significant information, such as the vessel's main engine model, critically affects its capabilities and operating expenses as well as its environmental impact. In this study, a comprehensive analysis was carried out, applying and comparing several machine learning classification techniques to classify a ship's main engine model, along with different imputation methods for handling the missing values and dimensionality reduction methods. The classification is based on the technical and operational characteristics of the vessel, including the physical dimensions, various capacities, speeds and consumption. Briefly, three dimensionality reduction methods (Principal Component Analysis, Uniform Manifold Approximation and Projection, and t-Distributed Stochastic Neighbor Embedding) were considered and combined with a variety of classifiers and the appropriate parameters of the dimensionality reduction methods. According to the classification results, the ExtraTreesClassifier with PCA with 4 components, the ExtraTreesClassifier with t-SNE with perplexity equal to 10 and 3 components, and the same classifier with UMAP with 10 neighbors and 3 components outperformed the rest of the combinations. This classification could provide significant information for shipowners, helping them optimize and enhance the vessel's operation.

1. Introduction

The marine industry might greatly benefit from machine learning in a variety of areas, including, among others, increased productivity [1], security [2] and decision making [3]. Specific tasks of machine learning, for instance, may involve the examination of sensor data from equipment and ships to determine when a repair is required [4]. Further, using the real-time monitoring of ship conditions, potential safety hazards or security threats could be identified [5]. Another example is an inexplicably increased consumption of oil, for which oil spills might be the cause [6]. Machine learning techniques can assist in developing preventative strategies that lower downtime and maintenance expenses, while increasing vessel safety. Similarly, they can assist in analyzing the performance of vessels by monitoring the engine and other components, which is a great help to operators in making data-driven decisions related to increased productivity and savings in operating expenses [7]. In addition, machine learning can improve monitoring and ensure compliance with environmental regulations, such as those pertaining to emissions, by tracking and examining data on environmental effects [8]. Although the maritime sector does not provide open-access data, it is crucial to have access to high-quality data from numerous sensors and sources on board the ship to properly use the ship's data in machine learning applications [9].
In general, the marine sector has many diverse stakeholders, including shipowners, operators, shipping firms, port agents and regulatory organizations. Data may get scattered among numerous entities as a result of this fragmentation, making it difficult to centralize and communicate information. Further, the marine industry is highly concerned about data privacy and security; thus, sharing certain types of data, such as vessel tracking information, can raise security and commercial concerns. Shipowners may also be hesitant to provide vital operating information, such as routes and cargo, due to worries concerning piracy, theft, or corporate espionage [10]. Businesses might consider the data they have in their possession as a competitive advantage in the fiercely competitive shipping sector [11]. As a result, they can be reluctant to divulge information that would help their rivals or weaken their negotiating position in the market.
Furthermore, numerous vessels, particularly older ones, lack advanced data-capturing and transmitting capabilities. Modernizing the entire global fleet to provide instantaneous data would present a considerable and costly undertaking. It is also worth noting that the marine business is significantly impacted by a complex network of international and national regulations [12]. Concerning data reporting requirements, the range of standards can be expansive, and adhering to them may prove challenging. For all these reasons, harmonizing data standards and ensuring global compliance pose a fundamental challenge [13].
A contentious issue that can be detected in the marine industry is the ownership and control of data. In particular, shipowners may assert rights over the data generated by their vessels, whereas other parties advocate for greater openness and exchange. Moreover, dependable internet and a communication infrastructure are often scarce at sea, particularly in heavily trafficked or remote maritime areas [14]. This lack of connectivity could impede real-time data transmission by ships. Additionally, establishing systems for data collection and exchange may prove costly for smaller enterprises, which means that the financial resources to invest in digital infrastructure may simply not be available.
Despite these difficulties, attempts are being undertaken to increase the accessibility to marine data [15]. The amount of data available in the maritime industry is gradually growing, as a result of initiatives like the International Maritime Organization’s (IMO) mandatory reporting requirements, the use of satellite-based tracking systems like the Automatic Identification System (AIS), and improvements on ‘Internet of Things’ (IoT) technologies [16]. Increased standardization and coordination efforts among industry players can also aid in addressing some of the issues related to marine data sharing.
To create and implement effective machine learning techniques in the maritime sector, it is also essential to collaborate with specialists in machine learning and data science who have domain expertise in maritime operations. In general, the models that are created can be unsupervised, where structure is sought in unlabeled data (mainly used for cluster analysis [17]), or supervised, when a model is built on input data to describe and analyze the output data. Regression (the prediction of continuous numeric values) and classification (the identification of groups or categories for data points) are typical examples of supervised machine learning tasks. Indeed, in the marine industry, supervised machine learning has a wide range of applications [18].
In this work, a procedure with the aid of supervised machine learning is proposed to classify the most frequent vessel’s main engine model types. The resulting classification model may be exploited for optimizing the design and operation of a vessel, for faster engine selection, for developing optimal strategies for the new vessel’s use, as well as for evaluating the ship’s performance before engine placement since it is based on the technical and operational characteristics of the vessel. Furthermore, in this work, to conclude with the optimal model, multiple imputation methods are used and compared in order to bypass the problem of missing data, while multiple dimensionality reduction methods are considered and combined with a variety of classifiers and the appropriate parameters of the dimensionality reduction methods.
The rest of the paper is organized as follows. Section 2 presents the motivation behind this work. In Section 3, the methods along with the materials employed are introduced. More specifically, this contains a short reference to the data collection and preprocessing, the data imputation procedure, the dimensionality reduction methods that were considered, and finally the classification techniques along with a short description of their performance evaluation. Next, in Section 4, the data of the application are presented along with a description and an exploratory analysis. The main and most significant results are displayed in Section 4.5, and last but not least, the conclusions that can be drawn from the proposed analysis are presented in Section 5.

2. Motivation

An attribute that is oftentimes missing, both for privacy reasons and for lack of data availability, is the model of the engine. It could be considered among the most advantageous pieces of information for shipowners. Indeed, the choice of an engine model for a ship can have a profound impact on the shipping industry in a competitive context. The engine model plays a key role in determining the fuel efficiency of a ship, leading to lower costs for shipping companies since fuel consumption is among the most significant costs [19]. More efficient engines can effectively reduce operating costs, making a company more competitive by offering lower shipping rates.
The quantity of data collected is pivotal in data analysis, as it reduces ambiguity and leads to stronger findings. Thus, it plays a crucial role in the process. Regarding marine transportation, however, the amount collected is relatively small [3]. As a consequence, compared to other industries, the use of machine learning techniques in marine transportation is limited [3]. In addition, given the poor quality and the limited size of the data, bias is unavoidable in most cases [20]. In an effort to overcome such issues, as well as to cover a broad framework of interest spanning a variety of issues, we propose the use of supervised machine learning for the classification of the vessel's main engine model type.
As environmental regulations become more stringent [21], ships with more environmentally friendly engines, such as those with lower emissions or those burning alternative fuels such as LNG (liquefied natural gas), can gain a competitive advantage. Companies investing in engine technology can comply with the regulations and potentially benefit from favorable incentive treatment. Besides the above, engine models also affect the speed and performance of a ship [22]. In some cases, faster ships may command higher prices for faster delivery times, while others may prioritize slow sailing for fuel savings. Engine selection should, however, be aligned with the company's strategic objectives and the market's demands.
There are occasions where shipping companies employ engine model selection as a marketing strategy to distinguish themselves in the market. They may highlight their dedication to sustainability, fuel efficiency, or state-of-the-art technology to appeal to customers who are environmentally conscious or focused on efficiency. During the ship design process, the selection of the main engine model is a crucial factor. The selection process is influenced by various factors, including power, fuel consumption, purchase price, service life and maintenance cost. At the beginning of the design process, several types of main diesel engines may be considered: while some main engines may be expensive, they may offer high reliability and low fuel consumption, whereas inexpensive ones often have high fuel consumption and failure rates [23]. Compliance with international and regional regulations related to emissions, fuel quality, and safety is of vital importance. The engine model that is chosen must be in compliance with these standards to avoid legal issues, penalties, and/or operational disruptions [24]. In the global shipping industry, competition comes from companies around the world. Engine model selection can help a company compete on a global scale, ensuring that it meets international standards and customer expectations.
The reliability and maintenance requirements of an engine can also affect the competitiveness of a shipping company. Engines requiring less maintenance and higher reliability can lead to reduced downtime, lower repair costs and a better reputation for on-time deliveries overall.
While the initial cost of an engine is important, companies must also consider long-term costs. Engines with higher initial costs but lower operating costs over their lifetime can provide a competitive advantage in the long term. Furthermore, companies that invest in engines designed to be adaptable and compatible with future technologies (such as alternative fuels or hybrid systems) can position themselves for long-term competitiveness as the industry evolves.
It should also be noted that maintaining competitiveness in the marine industry often involves the adoption of the latest technological developments in engine design. Newer engine models may incorporate advanced features, such as improved automation, data analysis and remote monitoring, which can improve operational efficiency [25].
In conclusion, the selection of a ship’s engine model has significant effects on a shipping company’s ability to compete. Cost, adherence to environmental regulations, operational effectiveness and market position are all impacted. Therefore, companies should carefully assess and select engine types that match their strategic aims and market demands if they want to compete in the extremely demanding and competitive shipping sector.
Last but not least, the classification model provided in the following sections can be exploited as a calibration model for optimizing vessel design (i.e., which engine is optimal for a vessel of specific dimensions and characteristics) under the general restrictions imposed by the company (strategy, etc.). Furthermore, the resulting classification model can be used for faster engine selection, for developing optimal strategies for the new vessel's use, as well as for developing digital twin applications (or simulators), allowing the evaluation of the ship's performance before engine placement, which results in reduced operating costs and improved environmental compliance. Moreover, digital twins can be valuable in performance monitoring, predictive maintenance, optimization of operations, enhanced safety, risk management, support for design, innovation, decision making, further data analysis by exploiting simulation results, lifecycle management, etc. [26,27,28].

3. Materials and Methods

In general, the main idea of the classification of a vessel’s main engine model is achieved through the concept of learning from examples, which has simply been formalized through supervised learning.
More specifically, the design of the process includes four basic steps (see Figure 1) that were implemented in the Python (version 3.9.7) programming language. The initial step, as in most studies, is the selection of the key variables and the preprocessing of the observed/available data. The next step is to substitute the missing values using a suitable imputation method, such as "Multiple Imputation by Chained Equations" (MICE). At that point, the data-forming process is complete, and machine learning techniques come into play. In the present paper, several types of classification algorithms along with dimensionality reduction methods are applied and compared.

3.1. Data Collection

The initial and most principal part in conducting an analysis or an application is undoubtedly related to the data collection. Considering that there is a variety of data sources and data sets, the necessity of deriving a more unified and complete data set has emerged. More precisely, assume $n$ data sources (or providers), indexed by $i = 1, \dots, n$, with each one maintaining information in $k_i$ data sets (complementary or not), where every data set comprises $p_{ik_i}$ variables. The notation $p_{ik_i}$ refers to the total count of variables that exist in the $k_i$-th data set of the $i$-th provider.
After collecting the data, data integration involves combining these various sets of data into a single, cohesive data set. This procedure is frequently carried out when there is a need to consolidate information stemming from different sources, databases, or formats for analytical purposes, comparison or to extract valuable insights.
Typically, merging data requires identifying common elements or keys in the data sets in order to accurately link and consolidate the information. Of course, the variables of all the data sets do not have to be common since it is logical to assume that they have been created for different purposes. As a consequence, an appropriate selection of variables is crucial. This selection might depend on the purpose of the analysis; thus, only the necessary—for the scientist—variables are kept or could be an outcome of statistics, for instance, a procedure that could be incorporated in the stage of preprocessing. Furthermore, source selection and data collection must be continuous processes, with regular evaluations and updates, to preserve data precision and relevance.

3.2. Preprocessing/Exploratory Data Analysis

As a primary stage of research, data collection is typically the most time-consuming task. Gathering valuable and relevant information is an unstructured process that also demands the collector’s judgment. To ensure data quality and the choice of quality sources, apart from establishing clear data collection procedures, it is essential to validate data through audits or checks, use established best practices and conduct thorough research on the sources’ reputations.
The quality of outcomes is contingent upon data collection and preprocessing procedures. Hence, the methodical selection, consolidation and exploration of data features hold key importance. Data stored in complex systems for scientific studies are often unstructured, making it a prerequisite to execute fundamental steps beforehand, wherever feasible.
Therefore, data preprocessing and data exploration are usually a necessary part of data analysis before proceeding to the construction of prediction models. More precisely, during this procedure, a feature selection could be employed if required, while the available data set is also explored for missing values, duplicated rows and, generally, the total counts of each variable, and is then handled appropriately. Furthermore, possible existing relations between the variables could lead to useful conclusions concerning the construction of machine learning models. As a consequence, correlations between variables should be taken into consideration.

3.3. Data Imputation

After observing the number of missing values in each variable, the next aim is to try to predict these missing values, as a process of gathering additional information that can be used for safer and more robust predictions. Another approach could be the deletion of those variables having an excessive number of missing values; however, if there is a large number of missing values in the data set, important information will be excluded, leading to biased and unreliable prediction models [29]. Thus, the best option considered is the estimation of the missing values of the data set using a statistical method.
In order to achieve that, and taking into consideration the missing values of the data set, an imputation method is required. Commonly, imputation is the procedure of calculating alternative values, based on the current data and appropriate statistical methods, in order to replace those that are missing. For the imputation of missing values, the "Multiple Imputation by Chained Equations" (MICE) approach is considered [30].
The purpose of this statistical technique is to impute missing values by using other features from the data set through an iterative series of predictive models. Initially, the MICE algorithm sets as the response variable $Y$ each variable that contains missing values and as $X_i$ the other variables that are considered predictors.
The model is constructed by using observations where $Y$ is present. Then, the estimated model is applied to predict the missing observations. The process is repeated several times, after selecting a random sample of data in each iteration of the algorithm, and finally, the missing values are replaced with the mean of the predictions. The MICE algorithm assumes that missing data are either missing at random (MAR) or missing completely at random (MCAR) [31]. If this assumption is not true, then the resulting estimates might be biased and therefore not reliable. In any case, MCAR is preferable, since multiple imputation is then not only valid but also expected to be unbiased, whereas under MAR a negligible bias can be present [32]. It is of note that Little's test can be implemented to check whether data are MCAR or not [33]. Several models can be used to describe the relationship between the variables in the context of MICE (e.g., the Light Gradient Boosting [34], Extreme Gradient Boosting [35] and Bayesian Ridge regressors [36]).
In order to test the quality, accuracy and reliability of the MICE imputation approach, three evaluation metrics are used to compare the actual values to the predicted values, on a subset of known data that was hidden randomly in each testing trial. These three evaluation metrics are the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) [37,38]. In addition to these three evaluation metrics, the accuracy of the algorithms in predicting missing values was assessed using the F-test for the general linear hypothesis. More specifically, the F-test was used to test, simultaneously, whether the intercept and slope of the population linear relationship are equal to zero and unity, respectively, i.e., to test whether the generated predictions are randomly scattered around the 45° line in a scatterplot of the generated predictions versus the real values, demonstrating that the predictions do not systematically over- or underestimate the corresponding real values. This test has been employed in several previous studies (see, for example, [39,40,41,42]) in order to evaluate the performance of a prediction algorithm. A detailed presentation of this test can be found in [43].
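As an illustration of this evaluation scheme, the following is a minimal sketch (not the authors' code) that pairs scikit-learn's IterativeImputer, a MICE-style imputer, with the three metrics and the F-test described above; the data, the hidden-value fraction and the Bayesian Ridge estimator choice are all hypothetical.

```python
# A hedged sketch of MICE-style imputation and its evaluation; all data and
# parameter values below are illustrative, not those of the study.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])
df["x4"] = 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(scale=0.1, size=500)

# Hide 10% of the known x4 values to form a test set for the imputer.
mask = rng.random(len(df)) < 0.10
true_vals = df.loc[mask, "x4"].to_numpy()
df_missing = df.copy()
df_missing.loc[mask, "x4"] = np.nan

imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10,
                           random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df_missing), columns=df.columns)
pred = imputed.loc[mask, "x4"].to_numpy()

mse = mean_squared_error(true_vals, pred)
print(f"MSE={mse:.4f}  RMSE={np.sqrt(mse):.4f}  "
      f"MAE={mean_absolute_error(true_vals, pred):.4f}")

# F-test of the general linear hypothesis: intercept = 0 and slope = 1 in
# the regression of predictions on real values (the 45-degree line).
ols = sm.OLS(pred, sm.add_constant(true_vals)).fit()
print("F-test p-value:", ols.f_test("const = 0, x1 = 1").pvalue)
```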

3.4. Dimensionality Reduction Methods

Dimensionality reduction is often used before a classification algorithm is applied in order to remove redundant features and noisy, irrelevant data, and thus improve the learning accuracy [44]. For the purpose of determining the optimal technique for dimensionality reduction on the basis of our data set, various approaches were assessed.
In particular, Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) and t-Distributed Stochastic Neighbor Embedding (t-SNE) were used. These three methods are briefly discussed next.

3.4.1. Principal Components Analysis—PCA

As an additional step in the analysis, Principal Component Analysis (PCA) is performed on the available data set, after the imputation process, in order to create new uncorrelated variables called components that are linear combinations of the initial data. The advantage of PCA implementation is the dimensionality reduction in the initial data set and the fact that the newly created components will explain as much of the variation of the original data as possible. Furthermore, these components might describe latent relations that could be a corollary of the existence of some common factors that would possibly develop these secret relations. For details on the methodology and the applications of PCA, the reader may refer to [45,46,47,48].
The total number of principal components that need to be preserved can be obtained by using either graphical techniques, such as the “Scree plot” or the “Explained Variance plot”, or by using a widely known criterion introduced by Kaiser in 1960, which assumes that principal components that have eigenvalues greater than 1 should be preserved [49].
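As a brief illustration, the sketch below applies the Kaiser criterion with scikit-learn; the data set is a hypothetical placeholder, and the eigenvalues are read off the fitted PCA's explained variances.

```python
# A hedged sketch of the Kaiser criterion: keep the principal components of
# the standardized data whose eigenvalues exceed one. X is placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(200, 8))  # hypothetical data set
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
eigenvalues = pca.explained_variance_  # eigenvalues of the covariance matrix
n_keep = int(np.sum(eigenvalues > 1.0))  # Kaiser criterion
print(f"components kept: {n_keep}, variance explained: "
      f"{pca.explained_variance_ratio_[:n_keep].sum():.2%}")

scores = PCA(n_components=n_keep).fit_transform(X_std)  # reduced data
```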

3.4.2. Uniform Manifold Approximation and Projection—UMAP

Another method for dimensionality reduction is the UMAP technique. This method was introduced a few years ago [50], and its purpose is to reduce high-dimensional data to a lower-dimensional space, preserving as much of both the data's global and local structure as possible. Its assumption is that the data are uniformly distributed across a manifold which can be projected onto a lower-dimensional space. In contrast to PCA, the UMAP reduction method can capture the non-linear structure in high-dimensional data. Usually, it is used for visualization purposes [51] or as a preprocessing technique before classification [52] or clustering [53]. In the literature, many techniques besides UMAP are used to visualize multivariate data in real-life applications (e.g., Andrews curves; see, for example, [54,55]).
Concerning the dimensionality reduction, deciding the optimal values for the hyperparameters of the algorithm is considered a challenging task when using this method. In this case, many values for the hyperparameters of the UMAP algorithm, such as the number of neighbors, the metric that will be used to compute the distances between data points and the number of components, should be tested and compared. For visualization purposes, the UMAP algorithm can represent the high-dimensional data adequately when two or three components are being used. However, the number of components can be tuned in order to achieve the best possible performance when using additional machine learning techniques, such as clustering and classification.
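A minimal sketch of such a hyperparameter sweep, assuming the umap-learn package and placeholder data, is given below; the value grids mirror those examined in Section 4.5.

```python
# A hedged sketch of a UMAP hyperparameter sweep with the umap-learn package;
# the data are placeholders and the grids mirror the values of Section 4.5.
import numpy as np
import umap  # pip install umap-learn

X_std = np.random.default_rng(2).normal(size=(300, 8))  # hypothetical data

embeddings = {}
for n_neighbors in (10, 30, 123):          # neighborhood size
    for n_components in (2, 3):            # target dimension
        reducer = umap.UMAP(n_neighbors=n_neighbors,
                            n_components=n_components,
                            metric="euclidean",  # distance between points
                            random_state=42)
        embeddings[(n_neighbors, n_components)] = reducer.fit_transform(X_std)
```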

3.4.3. t-Distributed Stochastic Neighbor Embedding

In addition to the aforementioned methods of PCA and UMAP, the t-Distributed Stochastic Neighbor Embedding (t-SNE) visualizes high-dimensional data by giving every data point a location on a two-dimensional or three-dimensional map. The t-SNE technique was introduced in [56] and has been used for visualization in a wide range of applications, including natural language processing [57], geological data [58,59,60], health domain [61,62,63] and many others.
This method is a version of Stochastic Neighbor Embedding [64] that reduces the propensity of points to concentrate in the center of the map, making it much easier to optimize and yielding much better visuals. With regard to producing a single map that displays structure at a variety of scales, t-SNE performs better than previous methods. The main parameter of this method that needs to be initialized is the so-called perplexity parameter, which estimates the number of nearby points each point has. According to the original article, t-SNE displays reasonably robust performance under changes in perplexity, with typical values ranging from 5 to 50. The results of this study (see Section 4.5) include a comparison of the performance of t-SNE with that of PCA and UMAP.
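The following is a minimal sketch of t-SNE with scikit-learn over the perplexity values considered later; the input data are placeholders.

```python
# A hedged sketch of t-SNE embeddings with scikit-learn over the perplexity
# values considered in Section 4.5; the input data are placeholders.
import numpy as np
from sklearn.manifold import TSNE

X_std = np.random.default_rng(3).normal(size=(300, 8))  # hypothetical data

for perplexity in (10, 30, 123):
    emb = TSNE(n_components=3, perplexity=perplexity,
               init="pca", random_state=42).fit_transform(X_std)
    print(f"perplexity={perplexity}: embedding shape {emb.shape}")
```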

3.5. Classification

Classification involves the systematic categorization of data or objects into predefined distinct classes or groups, based on shared characteristics or attributes. This formal process relies on the definition of specific rules or algorithms, often derived from training data, to assign items to appropriate categories. It provides a structured framework for organizing and understanding complex data sets, enabling efficient information retrieval and pattern recognition. Evaluating this process is of paramount importance to ensure the reliability and effectiveness of classification outcomes. Evaluation serves as a critical checkpoint in the entire workflow, allowing practitioners to gauge the performance of their classification models.
Classification algorithms are formalized computational methods that play a pivotal role in machine learning and data analysis. These algorithms are designed to categorize data points or instances into predefined classes or categories based on their attributes. The formalization of classification algorithms involves defining a set of mathematical or logical rules, which are often learned from training data. Common formal classification algorithms include, among others, decision trees, k-nearest neighbors, support vector machines, and neural networks. The following subsections describe the classifiers with the highest performance among those tested. Due to the availability of numerous algorithms that could have been included in this study, the selection of those that will be reported was based on their performance as detailed in Section 4.5.

3.5.1. Decision Tree Classifier

Decision trees (DTs) are a non-parametric method for supervised learning, commonly used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules derived from the characteristics of the data; in effect, a decision tree approximates the target function with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more precise the model.
Decision trees offer several advantages that make them a popular choice in machine learning and data analysis. Firstly, they are simple to understand and interpret, as they can be visualized and represented in an intuitive manner. Moreover, decision trees require minimal data preparation, eliminating the need for extensive data normalization or the creation of virtual variables. Additionally, the computational cost of using a decision tree for data prediction scales logarithmically with the number of data points used to train the tree, while they can handle both numerical and categorical data. They are particularly beneficial for multi-output problems and are known for their transparency, as they are considered white box models.
However, decision trees are not without limitations. They can become overly complex, resulting in poor generalization to new data, a phenomenon known as overfitting. To mitigate this issue, techniques like pruning, setting a minimum number of samples at leaf nodes, or specifying a maximum tree depth are necessary. Decision trees can also be unstable, with small changes in the data leading to entirely different tree structures. The predictions made by decision trees are not continuous, making them less suitable for extrapolation tasks. Moreover, learning an optimal decision tree is a computationally challenging problem, often relying on heuristic algorithms that may not guarantee a globally optimal solution. Finally, decision tree models can become biased if some classes dominate the data set, making it essential to balance the data before fitting a decision tree.
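As a concrete, hedged illustration of these mitigation techniques (not code from the paper), the scikit-learn sketch below caps tree depth, sets a minimum leaf size and applies cost-complexity pruning on synthetic data.

```python
# A hedged sketch of the overfitting controls listed above, applied to a
# synthetic multi-class problem (all parameter values are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

tree = DecisionTreeClassifier(max_depth=6,          # cap the tree depth
                              min_samples_leaf=10,  # minimum samples per leaf
                              ccp_alpha=1e-3,       # cost-complexity pruning
                              random_state=0)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
```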

3.5.2. Random Forest Classifier

The RandomForestClassifier and RandomForestRegressor classes, provided in the scikit-learn library, represent implementations of the random forest algorithm [65] for classification and regression purposes, respectively. This approach is a prominent ensemble learning method extensively utilized in machine learning. The ensemble comprises various decision trees constructed during the training phase. Each tree is trained on a randomized subset of the data, including a randomly selected subset of features for each split.
In the case of classification, the final prediction is determined through a majority vote among the constituent trees. In the case of regression, the predicted value is derived from the average prediction across the entirety of trees. This method fortifies the model, counteracting overfitting and improving generalizability. Parameters common to the two classes include the number of trees in the forest (n_estimators), the maximum depth of every tree (max_depth) and the number of features assessed for every split (max_features).
In contrast to the original paper [66], the scikit-learn implementation combines classifiers by averaging their probabilistic predictions, rather than letting each classifier vote for a single class.
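A minimal sketch of these parameters on placeholder data follows; predict_proba exposes the probability averaging described above.

```python
# A hedged sketch of the random forest parameters named above on synthetic
# data; predict_proba exposes the probability averaging across the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100,     # trees in the forest
                                max_depth=None,       # grow each tree fully
                                max_features="sqrt",  # features per split
                                random_state=0)
forest.fit(X, y)
proba = forest.predict_proba(X[:5])  # class probabilities averaged over trees
```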

3.5.3. Extra Trees Classifier

Extra Trees is an ensemble learning method that operates on the foundation of decision trees. It sets itself apart through a unique and deliberate injection of randomness during the tree construction process. Unlike traditional decision trees that carefully select the optimal split point for each node based on a subset of features, Extra Trees takes a more radical approach: it not only considers a random subset of features for each split but also randomly selects the split point without assessing optimality. The outcome of this process is the creation of a forest of fully randomized trees, dissociated from the nuances of the training data's output values.
Essentially, Extra Trees uses the strength of chance to build a forest of decision trees that, when combined, provide a powerful solution for predictive modeling. The algorithm's ability to strike a balance between predictive accuracy and randomness makes it a useful tool in machine learning, especially when dealing with noisy or high-dimensional data. Lastly, possible data set imbalances can lead to biased models, making data balancing essential before fitting an ExtraTreesClassifier [67].
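A hedged sketch on synthetic imbalanced data follows; class_weight="balanced" is shown as one simple counterweight to imbalance, not necessarily the remedy used in this study.

```python
# A hedged sketch of ExtraTreesClassifier on a synthetic imbalanced problem;
# class_weight="balanced" is one illustrative way to reweight rare classes.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, weights=[0.7, 0.2, 0.1],
                           random_state=0)

extra = ExtraTreesClassifier(n_estimators=100,
                             class_weight="balanced",  # reweight rare classes
                             random_state=0)
extra.fit(X, y)
```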

3.5.4. Naive Bayes Classifiers

The Naive Bayes classifiers are a family of classification algorithms whose operation is based on Bayes' theorem. That is, in data classification problems, these kinds of algorithms predict that the class to which an observation belongs is the category that maximizes the posterior probability. These classification algorithms assume that the variables are independent, and thus, that there are no pairs of correlated variables. Further, each variable contributes equally to predicting the value of the dependent variable and hence the category to which the data belong. The Naive Bayes classifier, the Gaussian Naive Bayes classifier, and the Bernoulli Naive Bayes classifier belong to this category of classification algorithms.
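A minimal sketch of the Gaussian and Bernoulli variants on placeholder data, assuming scikit-learn's naive_bayes module, follows.

```python
# A hedged sketch of the Gaussian and Bernoulli Naive Bayes variants named
# above, fitted to placeholder data (BernoulliNB binarizes features at 0.0).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

for model in (GaussianNB(), BernoulliNB()):
    model.fit(X, y)
    print(type(model).__name__, f"training accuracy {model.score(X, y):.3f}")
```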

3.5.5. K-Nearest Neighbor Classifier

The k-nearest neighbors algorithm [68] is a non-parametric supervised machine learning model for classification and regression that makes no assumptions about the data set. The KNN algorithm attempts to predict the category to which new data belong by collecting information from a training data set. This is achieved by considering the similarity between observations: it assumes that data points that are in close proximity are more similar and hence will belong to the same category. This method assigns a category label to a new data point based on the category to which its k-nearest neighboring data points in the training data set belong. In order for the algorithm to find the "k-nearest neighbors" of a new data point, it must calculate the distance between this new data point and the existing data points in the p-dimensional feature space. An algorithm that works in a similar way to KNN is the "Nearest Centroid" classifier [69], and this algorithm was evaluated as well for its predictive accuracy.
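A minimal sketch comparing the two distance-based classifiers on placeholder data follows; standardizing the features first is common practice for such methods, not a step prescribed by the paper.

```python
# A hedged sketch comparing k-nearest neighbors and nearest centroid; both
# are distance-based, so the features are standardized first.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
centroid = make_pipeline(StandardScaler(), NearestCentroid())
for model in (knn, centroid):
    model.fit(X, y)
```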

3.5.6. Other Classifiers

The category of linear classifiers includes all those algorithms that attempt to predict the category in which the data should be classified, based on the linear combinations of the independent variables in the data set [70]. There are many algorithms that use linear combinations of the features in order to make a classification decision. In this work, the classifiers of this category that were tested and evaluated were the support vector machines classifier, Linear SVC, Logistic Regression, Ridge classifier, Ridge CV, SGD classifier, Passive Aggressive classifier [71], linear Perceptron classifier, and Linear Discriminant analysis.
Another broad category of classifiers that were considered and employed is the Label Propagation Algorithms. This classification method and its algorithms are considered a semi-supervised machine learning technique, which uses the labels of the already labeled data points with the aim to predict the labels of the unlabeled data points of the data set [72]. The two algorithms that were employed in this paper are Label Propagation and Label Spreading.

3.5.7. Ensemble Learning

The Bagging Classifier, also known as Bootstrap Aggregating, operates by generating several subsets of the training data, utilizing random sampling with replacement. Each subset trains a base learning algorithm independently, producing a varied set of models. The ultimate prediction is subsequently determined by merging the individual model predictions by either majority voting for classification tasks or averaging for regression tasks.
When random subsets of the samples are drawn without replacement, this algorithm is referred to as Pasting [73]. If samples are drawn with replacement, the method is referred to as Bagging [74]. When random subsets of the data set are chosen as random subsets of the features, the resulting technique is referred to as Random Subspaces [75]. Finally, if the base estimators are built on subsets of both samples and features, the method is referred to as Random Patches [76].
Apart from Bagging, ensemble learning also includes the Boosting classifiers. Boosting combines weak classifiers of mediocre prediction accuracy into a strong classifier that can predict the class of given data points with high accuracy and small error [77]. From the related set of Boosting algorithms, and in the context of this study, the Light Gradient Boosting Machine classifier and the AdaBoost classifier were used and evaluated.
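A hedged sketch of the two ensemble families on synthetic data follows; the estimator keyword assumes a recent scikit-learn version (older releases name it base_estimator).

```python
# A hedged sketch of the two ensemble families above: Bagging over decision
# trees (bootstrap=True means sampling with replacement) and AdaBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50,
                            max_samples=0.8,  # fraction of samples per model
                            bootstrap=True,   # with replacement => Bagging
                            random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
for model in (bagging, boosting):
    model.fit(X, y)
```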

3.6. Evaluation

Evaluation metrics for classifiers are essential tools for assessing the performance of machine learning models in solving classification problems. These metrics provide valuable insights into the model’s ability to correctly classify instances and help data scientists and machine learning practitioners make informed decisions. One commonly employed metric is accuracy, which measures the proportion of correctly classified instances to the total number of instances in the data set. While accuracy is a fundamental metric, it may not always be the most appropriate choice, especially when dealing with imbalanced data sets.
Balanced accuracy is an alternative evaluation metric that addresses the limitations of accuracy in imbalanced data sets. In imbalanced scenarios, where one class significantly outnumbers the other(s), a classifier may achieve high accuracy by simply predicting the majority class, while ignoring the minority class. Balanced accuracy, on the other hand, takes into account the sensitivity (True Positive Rate) and specificity (True Negative Rate) of the classifier. It calculates the arithmetic mean of these two rates, providing a more comprehensive view of the model’s performance. This metric ensures that both the minority and majority classes are given equal importance, making it particularly valuable when the cost of misclassifying the minority class is higher, or when one class is more critical than the other in real-world applications.
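A tiny worked example makes the difference concrete: on a 90/10 imbalanced problem, a classifier that always predicts the majority class attains 0.90 plain accuracy but only (1 + 0)/2 = 0.50 balanced accuracy.

```python
# A small worked example of accuracy versus balanced accuracy on a 90/10
# imbalanced problem: the majority-class predictor looks accurate but
# achieves only (1 + 0) / 2 = 0.5 balanced accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance
y_pred = np.zeros(100, dtype=int)       # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.90
print(balanced_accuracy_score(y_true, y_pred))  # 0.50
```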

4. Application

In addressing missing data within our data set, a formalized approach that combines imputation methods for handling numerical missing values and classification techniques for predicting categorical values is performed. Imputation is preferred because it preserves partial records and simultaneously fills in the missing data by making educated guesses. This method is especially pertinent when the columns with missing data are valuable and make a significant contribution to the data set as a whole. Furthermore, imputation is consistent with the goal of preserving the integrity and completeness of the data set, which is essential for a comprehensive and objective analysis and guarantees that the classifier can efficiently learn from the available data across all characteristics [78]. Deletion, even coupled with data augmentation, was avoided because it could result in a reduction in data set size and a potential loss of valuable information [79]. The preference for imputation over classification techniques specifically designed to handle missing information, such as the classification algorithms developed by [80,81], is driven by the intention to explore a wider range of classification algorithms. While classification techniques specialized for missing information can be effective in certain contexts, choosing imputation allows for a more expansive investigation of diverse classification algorithms that might not automatically address missing data.

4.1. Data Collection

In order to apply the proposed methodology and compare the different classifiers regarding their ability to correctly classify a ship's engine, data were combined from two different data providers, using additional advanced text mining techniques to extract data from various unstructured sources of information, such as emails (free text) and PDF files.
The first data provider was 27 Research PC, which provided three different databases, denoted as DB1, DB2, and DB3, each one consisting of the ships' characteristics extracted from different sources and using different methods, covering the period from January 2019 until November 2023. These data are available from the authors upon request, subject to the permission of the data provider.
The second data source was Thetis-MRV, an open-access database in which CO₂ emissions from ships are Monitored, Reported, and Verified (MRV) on a yearly basis according to EU Regulation 2015/757 (available from 2018 to 2022). Thetis-MRV is a framework utilized in relation to climate change and emissions reduction to monitor and report greenhouse gas emissions and other pertinent data. It is typically linked with actions to lower emissions and combat climate change, especially in sectors like transport and industry.
In Table 1, all data providers are presented along with the number of observations and features they have offered.
It should be noted that, despite the seemingly large number of observations (285,801 records), the databases contain duplicate rows and several sparse or irrelevant characteristics of the ships, such as previous ship names, principal place of operation, the name and address of the shipowner, the contact person's address, telephone, e-mail, etc. Additionally, the classification of ships considers various fundamental variables (Length Overall, Beam, Design Draft, Gross Tonnage, Deadweight Tonnage, etc.) that describe their physical characteristics and capabilities. Because these variables are interconnected and together define the ship's purpose, design, and operational aspects, the ship type was removed. Furthermore, additional information on engine models, such as engine stroke (mm), engine builder, engine cylinders, and propeller, was not included due to its direct dependence on the engine type. It should also be noted that we exploited the entirety of the available features (and data) related to operational conditions. Finally, using the process of elimination to remove irrelevant features (as mentioned above), combined with the removal of features with a significantly small number of observations, the resulting common features from the different data sets were selected for further analysis.
As a result, merging all these data and selecting the most useful and informative features for this study was crucial and a necessary first step for the rest of the analysis. A key feature for the data merging was the IMO ship identification number, which provides the reliable identification of each ship since multiple ships may have the same name or a single ship may change its name multiple times during its lifetime. Note that instead of describing all features and to avoid any unnecessary repetitions due to the large number of variables, only the set of features (groups) used in further analysis are described in detail in Table 2 and Table 3.
Generally, there are two main groups of variables, the first one consisting of the ship's characteristics that remain unchanged, while the second group corresponds to the operational aspects. The dependent, categorical variable, which is the engine model of the vessels, was also included. Given that the main engine model variable is not of a numerical type, the dependent variable was converted from a string type into an integer.

4.2. Preprocessing/Exploratory Data Analysis

It has already been mentioned that data preprocessing and data exploration are necessary parts of data analysis before proceeding to the construction of prediction models. In terms of this work, the procedure described below was followed, involving the appropriate handling of missing and duplicated data.
As depicted in Table 2 and Table 3, the data set consists of 17 variables related to physical dimensions and environmental characteristics, such as the ship's consumption and speeds, plus the Ship Identification Number (IMO) and the target variable that represents the engine model of a vessel, leading to 19 variables in total. The final set of variables taken into consideration was derived after considering only the common features with a satisfactory number of observations among all the available databases. The final unified data set was the result of a merging method (inner join) based on the Ship Identification Number (IMO) as the key variable.
More specifically, prior to implementing the classification methods on the data set, a comprehensive preprocessing methodology was employed. The code for the preprocessing of the data and for the following steps of the analysis is available upon request by the authors.
Duplicated rows were removed using the variables of Table 2 and Table 3, resulting in a decrease from 285,801 to 57,004 observations. Deleting every row containing even one missing value (including in the main engine model variable) then left a data set of 3219 complete observations. This data set will later be utilized to evaluate the imputation procedure.
After the duplicated rows were dropped, missing values remained in 17 out of the 18 explanatory variables, since the variable named IMO had no missing values. The percentages of missing values for these variables ranged from 20% to 70%. These missing values need either to be dropped or to be imputed. In this work, in order to retain as much information as possible, an imputation approach was employed, as described in the following subsection.
Further, apart from dealing with the missing values, an important step to gain an initial insight into the data structure is the computation of the correlations between the variables. Correlations between the variables can serve as a first check to verify the necessity of a feature selection method (i.e., to select a subset of variables, or to combine their information), with the intention of reducing the dimension of the data space [82]. In particular, variables that are strongly and highly correlated indicate the need to apply some dimensionality reduction method to achieve more concise and explanatory results.
Toward that aim, a correlation matrix was also constructed. The Spearman's (Sp) correlation coefficient was used for the 3219 observations since none of the variables were normally distributed according to the Kolmogorov–Smirnov normality test that was conducted. From the analysis, it is clear that the variables that are related to physical dimensions present a high positive correlation between them (Sp > 0.7). More specifically, the variables $G_{a2}$ to $G_{a7}$, $G_{a9}$ and $G_{a10}$ exhibited strong and positive correlations between them (Figure 2). Regarding the rest of the variables, their correlation is less strong and does not seem to create a wide group of highly correlated variables. In particular, considering the variables that are associated with the environmental characteristics and the performance of vessels, only a few pairs of variables seem to present a high positive correlation. These pairs of variables are $G_{b1}$ and $G_{b2}$, $G_{b3}$ and $G_{b4}$, and $G_{b5}$ and $G_{b6}$ (Figure 2), which is not surprising since they refer to the same measurement under two different conditions (i.e., when a vessel travels empty or largely empty and when a vessel travels loaded).
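A minimal sketch of this screening step, with a hypothetical DataFrame standing in for the study's variables, follows.

```python
# A hedged sketch of the correlation screening: a Kolmogorov-Smirnov
# normality check followed by a Spearman correlation matrix; the DataFrame
# and its column names are hypothetical stand-ins for the study's variables.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.lognormal(size=(3219, 4)),
                  columns=["Ga2", "Ga3", "Gb1", "Gb2"])

# Kolmogorov-Smirnov test of one standardized variable against N(0, 1).
stat, p_value = stats.kstest(stats.zscore(df["Ga2"]), "norm")
print(f"KS p-value for Ga2: {p_value:.4f}")

corr = df.corr(method="spearman")
strong = (corr.abs() > 0.7) & (corr.abs() < 1.0)  # flag Sp > 0.7 pairs
print(strong)
```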

4.3. Imputation

Following the procedure presented in Section 3, after applying the cleaning process of the data, the MICE algorithm was applied to impute the missing values of the explanatory variables. By applying the imputation method to the 57,004 observations, to leverage any available information on the data, a total of 7773 observations were finally obtained after removing again any duplicate entries in the imputed data set and eliminating any record with missing information on the main engine model. This process ensured a more streamlined and representative data set for subsequent analytical procedures.
Before the MICE algorithm was implemented, Little's test was used to check whether the data were MCAR or not. The p-value of the test indicated that the missing data were indeed MCAR (p-value > 0.05), and so any missing value can be considered unrelated to the unknown value of the variable or to the other variables. The MICE algorithm was run for a maximum of ten iterations, and three regressors were compared and used as estimator parameters to predict the response variable. The Light Gradient Boosting [34], Extreme Gradient Boosting [35] and Bayesian Ridge regressors [36] were employed, each one for a specific number of variables. The appropriate regressor was selected according to the scores of the evaluation metrics to achieve the best possible imputation.
In particular, a two-step imputation method was utilized to maximize the usefulness of the lower proportion of missing values presented in the first set of variables. Specifically, solely the variables in this set were taken into account during the imputation process for any missing values. Subsequently, all variables from the “operational” and “exterior measurements” groups were included to impute the missing values.
The comparison of the values of the three evaluation metrics described above and the results of the F-test are depicted in Table 4, Table 5 and Table 6. From the results, it is clear that when the Bayesian Ridge estimator was used as the estimator parameter in the MICE algorithm, the algorithm failed to generate accurate and reliable estimates for all the variables (see, for example, the p-value of the F-test for the $G_{b5}$ and $G_{b6}$ variables in Table 4). On the other hand, the other two approaches present better behavior by generating unbiased predictions. Of these two, the most accurate implementation of MICE imputation was achieved using the Extreme Gradient Boosting regressor as the estimator parameter, since the values of the three evaluation metrics were generally smaller, or at least comparable, for all the variables with respect to the corresponding values obtained using the Light Gradient Boosting estimator.

4.4. Dimensionality Reduction

After the implementation of the imputation process, two executions of PCA were conducted using the variables from Table 2 and Table 3. The first was performed on the initial data set (the one that consists of 3219 observations) and the other on the imputed data set (7773 observations) to compare the outcomes of both implementations. To ensure equal contribution to the Principal Component Analysis, the variables were standardized before executing PCA.
The number of resulting variables used in the dimensionality reduction application is not high. However, dimensionality reduction techniques are applied for comparison purposes and to discover whether there are any latent variables. For both applications of PCA, the number of components kept in the analysis was equal to 4, since the eigenvalues of each of the first four principal components were greater than one (see the left plot in Figure 3). In the case of the imputed data set, the first four Principal Components together explained 83.27% of the variability of the initial data. On the other hand, when the imputation procedure was not taken into consideration, the first four Principal Components explained a slightly smaller percentage (82.47%) of the variability of the initial data (see the right plot in Figure 3). This suggests that the imputation not only avoided introducing noise into the data set but also maintained its underlying structure, and therefore the imputed data were used in the rest of the study.
The latter is also demonstrated by the similar values of the loadings of the four Principal Components obtained from the two different data sets. For that reason, in Table 7, the PCA loadings for the first four PCs are presented only for the imputed data set. All large loadings of the first PC are positive, evidencing a positive correlation between the ships' physical dimensions and the first PC. Regarding the second PC, it seems that the first four performance variables, related to speed and VLSFO consumption, have the largest contribution. The other four performance variables seem to have a larger association with the third and the fourth PC. Furthermore, in the fourth PC, the variable $G_{a8}$ (the ship's year of completion) also seems to have a relatively large contribution, probably reflecting the different standards and technological achievements over the years and their impact on the performance of the ships.
The loadings in Table 7 can be used to calculate the score of Principal Components for any ship. For example, the score of the first Principal Component can be obtained as follows:
$$
\begin{aligned}
PC_1 ={}& 0.322 \cdot \frac{G_{a2} - 53{,}838}{31{,}941.41} + 0.304 \cdot \frac{G_{a3} - 11.85}{2.09} + 0.315 \cdot \frac{G_{a4} - 193.48}{26.74} + 0.310 \cdot \frac{G_{a5} - 31.22}{3.89} \\
&+ 0.311 \cdot \frac{G_{a6} - 16.75}{2.68} + 0.324 \cdot \frac{G_{a7} - 66{,}295.6}{33{,}194} - 0.054 \cdot \frac{G_{a8} - 2010.85}{3.54} + 0.322 \cdot \frac{G_{a9} - 30{,}954.1}{15{,}610.1} \\
&+ 0.322 \cdot \frac{G_{a10} - 17{,}930.3}{10{,}150.8} - 0.110 \cdot \frac{G_{b1} - 13.26}{0.92} - 0.068 \cdot \frac{G_{b2} - 12.82}{0.87} + 0.261 \cdot \frac{G_{b3} - 25.2}{6.66} \\
&+ 0.262 \cdot \frac{G_{b4} - 26.09}{6.63} + 0.053 \cdot \frac{G_{b5} - 0.14}{0.16} + 0.050 \cdot \frac{G_{b6} - 0.18}{0.27} + 0.180 \cdot \frac{G_{b7} - 2.92}{0.63} + 0.040 \cdot \frac{G_{b8} - 0.29}{0.30}.
\end{aligned}
$$
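As a hedged illustration of how such a score is computed, the sketch below standardizes each variable with its mean and standard deviation and forms the loading-weighted sum; only the first three terms of $PC_1$ are included, and the ship's raw values are invented.

```python
# A hedged sketch of computing a Principal Component score: standardize each
# variable with its mean and standard deviation, then take the loading-
# weighted sum. Only the first three PC1 terms are used, and the ship's
# raw values are invented for illustration.
import numpy as np

loadings = np.array([0.322, 0.304, 0.315])   # PC1 loadings for Ga2, Ga3, Ga4
means = np.array([53838.0, 11.85, 193.48])   # variable means from the text
stds = np.array([31941.41, 2.09, 26.74])     # variable standard deviations

ship = np.array([60000.0, 12.5, 200.0])      # hypothetical ship measurements
pc1_partial = float(np.sum(loadings * (ship - means) / stds))
print(f"partial PC1 score (first three terms): {pc1_partial:.3f}")
```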
In addition to the PCA method, the t-SNE and UMAP methods were also applied to compare their results by reassessing the classification of the ships’ main engine models. The tuning of the parameters for UMAP and t-SNE along with the impact of the three dimensionality reduction methods on the classification performance are discussed in the next section.
The graphical representation is achieved by reducing the dimensionality of the data to a smaller-dimensional space, in this case a two-dimensional space for both the UMAP and t-SNE methods. Additionally, for clarity of illustration, the separation of the five leading classes of ships' main engine models is illustrated in Figure 4 using the imputed data (similar figures were obtained for the non-imputed data). The analytical encoding of the engine models follows in Table 8. The UMAP and t-SNE methods, as well as the PCA method, appear to separate the engine type classes similarly, while overlapping classes are also distinguishable.

4.5. Classification Results

The large number of different engine models restricted the analysis to engine models with at least 20 observations each, resulting in a total of 7159 observations. Table 8 presents the engine models that were retained, along with their encoding and their frequency in the final data set.
Although data augmentation is frequently used to correct class imbalance, it is not always the optimal solution for every classification problem. While SMOTE has been used extensively and has proven useful in some situations, its application is constrained by inherent limitations and potential adverse effects. Ref. [83] elucidates that data sets with extreme imbalances exhibit suboptimal performance even after the generation of synthetic samples. Additionally, ref. [84] asserts that the benefits of balancing techniques such as SMOTE are discernible for weak classifiers but not necessarily for robust ones. Thus, since data augmentation was not used, the balanced accuracy metric was employed as a robust measure of classifier performance, mitigating the impact of the imbalanced data set on the evaluation. The variable “IMO” was excluded from the classification process, as it is merely an identifier and does not constitute a relevant feature for the analysis. A comprehensive evaluation of several classification methods for correctly identifying the ship’s engine type under different dimensionality reduction methods is presented in this section. More specifically, 25 classifiers in total were compared with respect to the balanced accuracy metric under the three dimensionality reduction methods, and each dimensionality reduction method was also tested under different scenarios.
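For clarity, balanced accuracy is the macro-average of per-class recall, so a classifier that simply favors the dominant engine models gains nothing from the imbalance. A minimal sketch with scikit-learn (the toy labels below are illustrative, not from the study's data):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels: engine model 0 dominates the sample (8 of 10 observations)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0] * 10  # a degenerate majority-class predictor

print(accuracy_score(y_true, y_pred))           # 0.80, flattered by the imbalance
print(balanced_accuracy_score(y_true, y_pred))  # 0.33, the mean per-class recall
```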
The PCA method was systematically evaluated by considering the first two, three, and four principal components. Additionally, UMAP was employed for dimensionality reduction with two and three components, investigating three numbers of neighbors, specifically 10, 30, and 123, leading to six different combinations. Furthermore, t-SNE was applied with two and three components and perplexity values of 10, 30, and 123; thus, six combinations of the t-SNE parameters were again evaluated. This comprehensive comparative analysis aimed to determine the efficacy of these techniques under various parameter combinations.
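The parameter grid described above can be reproduced with scikit-learn and the umap-learn package. The following is a sketch under the assumption that the scaled features are stored in a matrix X (sklearn's barnes_hut t-SNE supports up to three components, and a perplexity of 123 is valid here since it is far below the sample size):

```python
from itertools import product

import umap  # from the umap-learn package
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

reducers = {}
for n in (2, 3, 4):                                # PCA-2, PCA-3, PCA-4
    reducers[f"PCA-{n}"] = PCA(n_components=n)
for perp, n in product((10, 30, 123), (2, 3)):     # six t-SNE configurations
    reducers[f"Tsne{perp}-{n}"] = TSNE(n_components=n, perplexity=perp)
for nn, n in product((10, 30, 123), (2, 3)):       # six UMAP configurations
    reducers[f"Umap{nn}-{n}"] = umap.UMAP(n_neighbors=nn, n_components=n)

embeddings = {name: red.fit_transform(X) for name, red in reducers.items()}
```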
The data set was initially partitioned into training and test sets using the “train_test_split” function from the “model_selection” module of the scikit-learn library. Specifically, 80% of the data were allocated for training purposes, while the remaining 20% were reserved for evaluating the models’ performance. This division ensures that each model is trained on a substantial portion of the data and then tested on an independent subset to assess its generalization capabilities. Moreover, a k-fold cross-validator was also used to systematically evaluate each model’s performance by dividing the data into five consecutive folds.
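A sketch of both evaluation setups follows; the random_state value is an assumption added for reproducibility, and X, y are placeholders for the features and engine-model labels:

```python
from sklearn.model_selection import KFold, train_test_split

# 80/20 split of features X and engine-model labels y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Five consecutive folds, as described in the text (no shuffling)
kfold = KFold(n_splits=5, shuffle=False)
```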
The candidate classification methods were all applied through the “LazyClassifier” function from the “lazypredict.Supervised” library in Python (version 3.9.7). The “lazypredict” library was initially adopted, as it facilitates the comparison of various machine learning models by using the default hyperparameters of the scikit-learn classifiers. Further on, for the optimal classifier (ExtraTreesClassifier), an additional evaluation across different values of the number of trees (n_estimators = [10, 50, 100, 150, 500]) was performed for both imputation techniques and all dimensionality reduction methods. Furthermore, all the classifiers were also compared on the basis of the duration of their training processes.
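A sketch of this screening step, assuming the split defined above; lazypredict fits each scikit-learn classifier with its default hyperparameters and reports, among other metrics, balanced accuracy and training time:

```python
from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(verbose=0, ignore_warnings=True, predictions=False)
models, _ = clf.fit(X_train, X_test, y_train, y_test)

# Rank the fitted classifiers by balanced accuracy
print(models.sort_values("Balanced Accuracy", ascending=False).head(10))
```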
Table 9 delineates the optimal performance results corresponding to each specific feature selection method (PCA, UMAP, and t-SNE) using k-fold cross validation, having compared the various parameters tested for each technique, that is, the number of principal components, the number of neighbors, and the perplexity. The dimensionality reduction methods presented are PCA with four components (PCA-4), t-SNE with perplexity equal to 10 and three components (Tsne10-3), and UMAP with 10 neighbors and three components (Umap10-3). Additionally, the largest value(s) of the balanced accuracy metric is marked in bold. Moreover, for illustration purposes, the values of the balanced accuracy of the classifiers are also depicted in Figure 5 for the imputed data case.
The optimal performance results presented in Table 9 emerged from the results depicted in Table A7, Table A8, Table A9, Table A10 and Table A11 of Appendix A. In particular, Table A7 shows that the four-component PCA paired with an ExtraTreesClassifier outperforms all the other combinations of classifiers and principal components. Among the combinations of t-SNE with three components, the combination with a perplexity of 10, paired with an ExtraTreesClassifier, appears to yield the most favorable outcome (Table A9).
Similarly, Table A1 delineates the optimal performance results corresponding to each specific feature selection method (PCA, UMAP, and t-SNE) under the “train–test split”, having compared the various parameters tested for each technique, that is, the number of principal components, the number of neighbors, and the perplexity. The dimensionality reduction methods presented are PCA with four components (PCA-4), t-SNE with perplexity equal to 10 and two components (Tsne10-2), and UMAP with 10 neighbors and two components (Umap10-2). Additionally, the largest value(s) of the balanced accuracy metric is marked in bold.
Concerning the combinations of t-SNE with two components, the best performance is again obtained with an ExtraTreesClassifier (Table A8). However, t-SNE with three components and a perplexity of 10 performs better than t-SNE with two components and a perplexity of 10, specifically, 90.20% against 89.21%. From the results of Table A11, it is clear that the combination of UMAP with three components and 10 neighbors, paired with an ExtraTreesClassifier, presents the best performance (83.64%), while the corresponding two-component combination in Table A10 reaches 81.10%. Therefore, comparing the latter two, UMAP with three components and 10 neighbors was optimal.
In the context of the imputed data analysis, the ExtraTreesClassifier emerges as the best classifier, exhibiting exceptionally good performance with a balanced accuracy of 95.07%. Notably, as derived from the above, when evaluated across the various dimensionality reduction methods, the ExtraTreesClassifier consistently demonstrates exceptional performance, outperforming the other classifiers. While these findings highlight the consistent proficiency of the ExtraTreesClassifier across diverse dimensionality reduction techniques, it is worth mentioning that t-SNE stands out as the top-performing reduction method, displaying the highest accuracy in all evaluated scenarios.
The optimal performance results under the “train–test split” emerged from the results depicted in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6 of Appendix A. In particular, Table A2 shows that the four-component PCA paired with an ExtraTreesClassifier outperforms all the other combinations of classifiers and principal components. Among the combinations of t-SNE with two components, t-SNE with a perplexity of 10 paired with an ExtraTreesClassifier appears to yield the most favorable outcomes (Table A3).
Further, given the results, it is concluded that classifiers based on decision tree logic (i.e., ExtraTreesClassifier, RandomForestClassifier, BaggingClassifier, DecisionTreeClassifier, and ExtraTreeClassifier) perform better and successfully classify the engine model type of a ship with great accuracy (up to 96%). In addition to the most accurate results obtained for the imputed data, the reduced data achieved satisfactory outcomes in the dimensionality reduction tests, even with only two components instead of the original 18 variables. Although there may be various explanations for why the aforementioned algorithms outperform the others, the most common ones are highlighted below.
Decision trees are capable of adapting well to the characteristics of the data. They can create complex decision boundaries when necessary and are effective in capturing non-linear relationships in the structure of the data. Other classifiers, such as linear models, may struggle when the underlying patterns are not linear. Further, decision trees inherently perform feature selection by ranking the importance of features. That is, they can focus on the most important features for decision making, which can be a significant advantage when dealing with high-dimensional data.
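As an illustration of this built-in feature ranking, tree ensembles in scikit-learn expose impurity-based importances after fitting. The following is a minimal sketch, assuming the training split from above and a list feature_names (a placeholder) holding the labels of the input variables:

```python
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(n_estimators=100).fit(X_train, y_train)

# Impurity-based importance of each input variable, highest first
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```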
They can also serve as a base model for ensemble methods, such as random forests and gradient boosting. Ensemble methods combine multiple decision trees to reduce overfitting and improve prediction accuracy, which may lead to superior performance compared to standalone classifiers. Furthermore, decision trees can handle categorical data without requiring one-hot encoding or other preprocessing steps, simplifying the modeling process and potentially leading to better results when a mix of categorical and numerical features exists.
Another explanation could be the fact that decision trees are relatively robust to outliers and missing data, which can be common problems in real-world data sources. They can handle these situations gracefully, potentially leading to better overall performance.
It is important to note that the relative performance of classifiers and the feature selection technique can vary depending on the data set and the specific problem. While decision tree-based classifiers have these advantages, there are cases where other types of classifiers, such as support vector machines, neural networks, or k-nearest neighbors, may be more appropriate. The choice of classifier and dimensionality reduction method should be based on the characteristics of the data and the goals of the machine learning task.
Table 10 shows the balanced accuracy results of the ExtraTreesClassifier over several values of the “n_estimators” hyperparameter. The columns correspond to the different configurations of the “n_estimators” parameter, while the rows denote the imputed data set as well as the different dimensionality reduction methods. The data set is methodically divided into five consecutive folds using the k-fold cross-validation approach, and the final performance metric is derived by averaging the scores obtained across the five iterations.
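A sketch of how such a table can be produced, assuming the kfold splitter defined earlier and one of the reduced feature sets (X_pca4 is a placeholder for the PCA-4 scores):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

for n in (10, 50, 100, 150, 500):
    model = ExtraTreesClassifier(n_estimators=n)
    scores = cross_val_score(model, X_pca4, y, cv=kfold,
                             scoring="balanced_accuracy")
    print(n, scores.mean())  # averaged over the five folds, as in Table 10
```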
While the classifiers’ results are deemed satisfactory, it is important to examine the source of the remaining minor classification error. Considering Table 11, based on the results of the top classifier (ExtraTreesClassifier), together with the illustration in Figure 4, a patchwork effect can be discerned among certain encoded engine types (the five most frequent). Combining the visualization obtained with the dimensionality reduction techniques and the confusion matrix based on the imputed data set, it is evident that, in all instances, misclassification occurs towards specific engine types, which can be attributed to the similar characteristics and variables accompanying them.
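This misclassification pattern can be inspected directly from the confusion matrix. The following is a sketch assuming a previously fitted ExtraTreesClassifier (model) and the test split from the earlier steps:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, model.predict(X_test))
# Rows are true engine models, columns are predicted ones; large
# off-diagonal entries between similar models produce the "patchwork"
# effect visible in Figure 4 and Table 11.
print(cm)
```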

5. Conclusions

The ability to correctly classify a ship’s engine type using mostly readily available data could provide an important advantage in a highly competitive market such as the shipping industry. To achieve this, a thorough examination was conducted to identify the best dimensionality reduction method and the best classifier, and the best combination was identified through a detailed comparison and evaluation of their performance.
This classification procedure could lead to some important, innovative, and valuable conclusions for the marine industry. Shipping companies could reduce their operating costs if the main engine type of their vessels is known. Thus, given the physical dimensions of the ships and their operational behavior, shipowners could select and deploy the appropriate engine model for their fleet depending on their strategic plan and operational schedule.
Furthermore, if shipping companies know which main engine models cause higher fuel consumption, they will be able to decide which engine types to use, with the ultimate aim of reducing emissions from ships and thus operating in compliance with rigorous international environmental regulations without additional burdens.
The engine models identified as optimal in most cases may encourage shipping companies to purchase them, as these models may improve the durability and sustainability of their ships in terms of both voyage time and operating costs. This reinforced sustainability could, in turn, reduce the occurrence of vessel malfunctions, helping to avoid unexpected and undesirable accidents that may result in serious damage and loss of life.
Another inference of this study is that marine industries, provided they have a large amount of data at their disposal, could make safe and reliable decisions for their strategic plan using a smaller subset of data referring mainly to physical and environmental attributes.
A more detailed account of the advantages of this method over traditional ones includes, for instance, a faster selection of the engine, the development of optimal strategies for a new vessel’s use, and even the development of digital twin applications or simulators, allowing the evaluation of the ship’s performance before engine placement; the latter would result in reduced operating costs. Such digital twins, as already mentioned, can be an asset in performance monitoring and predictive maintenance.
Considering the faster engine selection, the proposed model can be used during the design phase of a ship, providing a quick approximate assessment and identification of the optimal marine engine. It should be noted that there are hundreds of different engine types on the market; thus, the capability provided by the proposed methodology for a quick screening and selection of the optimal engine type is of great value.
Selecting the optimal engine for a vessel, which emerges naturally from the proposed methodology, can also be integrated into the development of optimal use strategies. This can be achieved by considering the physical dimensions of the vessel and strategically determining the intended use of the ship as well as the desired operational behavior; for instance, the target consumption rates, speeds, etc., all derive from the shipowner’s strategy.
Also, as already mentioned, digital twin applications could be developed. These digital models are built using operational behavior data, historical information, and advanced analytics, and they mirror the physical counterpart’s characteristics, behavior, and performance in a virtual environment. Thus, the shipbuilding industry can reduce operating costs by knowing which type of main engine is optimal for a vessel.
Furthermore, the proposed methodology offers a tool for reducing harmful ship emissions and supports the wider deployment of vessels in accordance with strict international environmental standards without restrictions. More generally, such classification procedures may lead to the identification of important, innovative, and valuable findings for the maritime industry.
The main limitation of this study is that there was insufficient information on all of the 123 available engine models. Further, despite exploiting all the available features and data, encompassing both operational and design data, information was lacking on the operating conditions of the ships and on whether particular characteristics reflect customized design choices. Additionally, it should be highlighted that a considerable number of values are missing in the variables pertaining to the operational efficiency of the ships.
The proposed methodology reflects the actual data used to train the model in this research, i.e., the current reality in ships’ design and ships’ engine selection. However, the current reality, as depicted in the data, encompasses both best practices and less or non-optimal engine choices. Naturally, we expect that good or best practices have been applied to the majority of the vessels at our disposal. Through the extensive data set available to us, the proposed method identifies the underlying data structure and selects the prevailing strategy. Therefore, it validates the best strategies on one hand, while on the other, it provides a quick and straightforward way for the universal application of the best strategies during the design phase.
In conclusion, although the estimated missing values were accurate, the outcomes would have been more reliable and robust if more observations were available for this group of variables. Consequently, it would be prudent to gather as much data as possible regarding the operational efficiency of ships in future studies, if the challenging circumstances and competition permit.

Author Contributions

Conceptualization, S.B. and K.S.; methodology, S.B., K.S. and P.E.; software, K.S., G.P. and P.B.; validation, S.B. and P.E.; formal analysis, K.S., S.B., P.E. and E.S.; resources, K.S. and P.B.; data curation, K.S.; writing–original draft preparation, K.S., P.B., G.P., E.S., S.B. and P.E.; writing–review and editing, K.S., P.B., G.P., E.S., S.B. and P.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The first data source used in this study (EU Monitoring, Reporting, Verification (MRV) mechanism) can be accessed through https://mrv.emsa.europa.eu/#public/emission-report (accessed on 15 November 2023). The second data set provided by the 27 Research PC is available upon request by the authors under the permission of the data provider.

Acknowledgments

The authors thank the 27 Research PC and especially its Director for providing the data for this analysis.

Conflicts of Interest

No conflicts of interest exist in the submission of this manuscript, and the manuscript is approved by all authors for publication.

Appendix A

Table A1. Performance of evaluation metrics (balanced accuracy) for imputation data and the top performance for each of the three dimensionality reduction methods using “train test split”. The time taken for each algorithm is also included.

Classification Algorithm | Balanced Accuracy: Imput | PCA-4 | Tsne10-2 | Umap10-2 | Time Taken: Imput | PCA-4 | Tsne10-2 | Umap10-2
ExtraTreesClassifier | 0.9593 | 0.8968 | 0.9206 | 0.8537 | 0.3884 | 0.4583 | 0.3577 | 0.4488
LGBMClassifier | 0.9562 | 0.8862 | 0.9105 | 0.7620 | 1.2307 | 1.4661 | 1.4541 | 1.5109
RandomForestClassifier | 0.9529 | 0.8942 | 0.9136 | 0.8416 | 0.5582 | 1.1150 | 0.6894 | 0.7291
DecisionTreeClassifier | 0.9458 | 0.8699 | 0.9064 | 0.8148 | 0.0329 | 0.0349 | 0.0239 | 0.0239
BaggingClassifier | 0.9438 | 0.8705 | 0.8951 | 0.8093 | 0.1686 | 0.1954 | 0.1177 | 0.1207
ExtraTreeClassifier | 0.9315 | 0.8390 | 0.9049 | 0.7948 | 0.0110 | 0.0110 | 0.0100 | 0.0100
LabelPropagation | 0.9159 | 0.5606 | 0.3408 | 0.3324 | 0.9220 | 0.6638 | 0.6498 | 0.6034
LabelSpreading | 0.9159 | 0.5372 | 0.3315 | 0.3295 | 1.4094 | 0.9950 | 0.9384 | 0.9421
KNeighborsClassifier | 0.6964 | 0.5847 | 0.6946 | 0.6546 | 0.1481 | 0.0439 | 0.0379 | 0.0375
GaussianNB | 0.5676 | 0.2970 | 0.1435 | 0.0846 | 0.0189 | 0.0189 | 0.0180 | 0.0210
LinearDiscriminantAnalysis | 0.5548 | 0.1928 | 0.0642 | 0.0592 | 0.0376 | 0.0140 | 0.0100 | 0.0100
LogisticRegression | 0.5190 | 0.2042 | 0.0746 | 0.0589 | 0.7321 | 0.7281 | 0.7288 | 0.7118
SVC | 0.4608 | 0.2677 | 0.2216 | 0.1628 | 1.3545 | 1.4457 | 1.5451 | 1.8055
LinearSVC | 0.4554 | 0.0956 | 0.0639 | 0.0577 | 2.0496 | 3.0429 | 0.5207 | 0.7550
NearestCentroid | 0.4481 | 0.2532 | 0.1928 | 0.1596 | 0.0110 | 0.0090 | 0.0090 | 0.0090
CalibratedClassifierCV | 0.4287 | 0.0929 | 0.0562 | 0.0524 | 8.3219 | 11.5316 | 2.5174 | 3.3163
SGDClassifier | 0.3982 | 0.1119 | 0.0805 | 0.0623 | 0.3361 | 0.1725 | 0.1272 | 0.1227
Perceptron | 0.3587 | 0.0920 | 0.0396 | 0.0211 | 0.1287 | 0.0768 | 0.0478 | 0.0598
PassiveAggressiveClassifier | 0.3157 | 0.0695 | 0.0704 | 0.0348 | 0.1546 | 0.0678 | 0.0618 | 0.0598
BernoulliNB | 0.2333 | 0.0504 | 0.0570 | 0.0386 | 0.0352 | 0.0121 | 0.0110 | 0.0119
RidgeClassifier | 0.1438 | 0.0497 | 0.0390 | 0.0321 | 0.0249 | 0.0199 | 0.0190 | 0.0179
RidgeClassifierCV | 0.1438 | 0.0497 | 0.0390 | 0.0321 | 0.0284 | 0.0199 | 0.0219 | 0.0189
AdaBoostClassifier | 0.1266 | 0.0766 | 0.0671 | 0.0389 | 0.4466 | 0.4468 | 0.3829 | 0.3832
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0256 | 0.0090 | 0.0090 | 0.0110 | 0.0070
QuadraticDiscriminantAnalysis | 0.0256 | 0.3961 | 0.1391 | 0.1194 | 0.0199 | 0.0130 | 0.0189 | 0.0130
Table A2. Performance of evaluation metric (balanced accuracy) for imputation data using PCA method for different components using “train test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: PCA-2 | PCA-3 | PCA-4 | Time Taken: PCA-2 | PCA-3 | PCA-4
ExtraTreesClassifier | 0.7914 | 0.8601 | 0.8968 | 0.3680 | 0.4675 | 0.4599
LGBMClassifier | 0.7213 | 0.8557 | 0.8862 | 1.5523 | 2.0348 | 1.5234
RandomForestClassifier | 0.7613 | 0.8610 | 0.8942 | 0.6910 | 0.7451 | 1.1152
DecisionTreeClassifier | 0.7509 | 0.8247 | 0.8699 | 0.0250 | 0.0299 | 0.0397
BaggingClassifier | 0.7297 | 0.8336 | 0.8705 | 0.1192 | 0.1630 | 0.1985
ExtraTreeClassifier | 0.7110 | 0.7700 | 0.8390 | 0.0100 | 0.0110 | 0.0110
LabelPropagation | 0.2029 | 0.3442 | 0.5606 | 0.6273 | 0.6662 | 0.6717
LabelSpreading | 0.1980 | 0.3391 | 0.5372 | 1.0045 | 1.0298 | 1.1671
KNeighborsClassifier | 0.4938 | 0.5456 | 0.5847 | 0.0379 | 0.0409 | 0.0439
GaussianNB | 0.1352 | 0.2187 | 0.2970 | 0.0209 | 0.0259 | 0.0222
LinearDiscriminantAnalysis | 0.1393 | 0.1564 | 0.1928 | 0.0100 | 0.0110 | 0.0140
LogisticRegression | 0.1171 | 0.1252 | 0.2042 | 0.7938 | 0.8086 | 0.7721
SVC | 0.1445 | 0.1746 | 0.2677 | 1.5649 | 1.5430 | 1.4709
LinearSVC | 0.0604 | 0.0684 | 0.0956 | 0.5096 | 2.7077 | 3.0354
NearestCentroid | 0.1761 | 0.1765 | 0.2532 | 0.0090 | 0.0090 | 0.0110
CalibratedClassifierCV | 0.0562 | 0.0616 | 0.0929 | 2.5534 | 10.7399 | 11.6460
SGDClassifier | 0.0679 | 0.0776 | 0.1119 | 0.1057 | 0.1411 | 0.1725
Perceptron | 0.0388 | 0.0723 | 0.0920 | 0.0529 | 0.0598 | 0.0738
PassiveAggressiveClassifier | 0.0568 | 0.0617 | 0.0695 | 0.0549 | 0.0588 | 0.0819
BernoulliNB | 0.0504 | 0.0504 | 0.0504 | 0.0110 | 0.0119 | 0.0109
RidgeClassifier | 0.0502 | 0.0497 | 0.0497 | 0.0180 | 0.0199 | 0.0169
RidgeClassifierCV | 0.0501 | 0.0497 | 0.0497 | 0.0379 | 0.0229 | 0.0220
AdaBoostClassifier | 0.0766 | 0.0766 | 0.0766 | 0.4220 | 0.4561 | 0.4828
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0080 | 0.0080
QuadraticDiscriminantAnalysis | 0.1575 | 0.2326 | 0.3961 | 0.0150 | 0.0131 | 0.0130
Table A3. Performance of evaluation metrics (balanced accuracy) for imputation data using t-SNE with 2 components and different perplexity values using “train test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-2 | Tsne30-2 | Tsne123-2 | Time Taken: Tsne10-2 | Tsne30-2 | Tsne123-2
ExtraTreesClassifier | 0.9206 | 0.9173 | 0.9132 | 0.3680 | 0.3640 | 0.3705
LGBMClassifier | 0.9105 | 0.8904 | 0.8719 | 1.5523 | 1.9767 | 1.8871
RandomForestClassifier | 0.9136 | 0.9139 | 0.9074 | 0.6910 | 0.6855 | 0.7211
DecisionTreeClassifier | 0.9064 | 0.8964 | 0.8991 | 0.0250 | 0.0239 | 0.0239
BaggingClassifier | 0.8951 | 0.8887 | 0.8839 | 0.1192 | 0.1141 | 0.1219
ExtraTreeClassifier | 0.9049 | 0.8954 | 0.8948 | 0.0100 | 0.0110 | 0.0110
LabelPropagation | 0.3408 | 0.2747 | 0.2292 | 0.6273 | 0.6551 | 0.6260
LabelSpreading | 0.3315 | 0.2724 | 0.2240 | 1.0045 | 1.0087 | 1.0044
KNeighborsClassifier | 0.6946 | 0.6890 | 0.6573 | 0.0379 | 0.0389 | 0.0389
GaussianNB | 0.1435 | 0.1832 | 0.1400 | 0.0209 | 0.0209 | 0.0219
LinearDiscriminantAnalysis | 0.0642 | 0.1241 | 0.1320 | 0.0100 | 0.0120 | 0.0100
LogisticRegression | 0.0746 | 0.1270 | 0.1328 | 0.7938 | 0.7783 | 0.7849
SVC | 0.2216 | 0.2309 | 0.1747 | 1.5649 | 1.4505 | 1.3898
LinearSVC | 0.0639 | 0.0458 | 0.1009 | 0.5096 | 0.4235 | 0.4019
NearestCentroid | 0.1928 | 0.2227 | 0.1823 | 0.0090 | 0.0115 | 0.0100
CalibratedClassifierCV | 0.0562 | 0.0592 | 0.0856 | 2.5534 | 1.9490 | 2.1004
SGDClassifier | 0.0805 | 0.0928 | 0.1019 | 0.1057 | 0.1097 | 0.1177
Perceptron | 0.0396 | 0.0491 | 0.0927 | 0.0529 | 0.0520 | 0.0519
PassiveAggressiveClassifier | 0.0704 | 0.0605 | 0.1028 | 0.0549 | 0.0610 | 0.0588
BernoulliNB | 0.0570 | 0.0442 | 0.0692 | 0.0110 | 0.0110 | 0.0113
RidgeClassifier | 0.0390 | 0.0409 | 0.0538 | 0.0180 | 0.0189 | 0.0199
RidgeClassifierCV | 0.0390 | 0.0409 | 0.0538 | 0.0379 | 0.0232 | 0.0229
AdaBoostClassifier | 0.0671 | 0.0634 | 0.0731 | 0.4220 | 0.4099 | 0.4118
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0070 | 0.0070
QuadraticDiscriminantAnalysis | 0.1391 | 0.2037 | 0.2103 | 0.0150 | 0.0130 | 0.0120
Table A4. Performance of evaluation metric (balanced accuracy) for imputation data using t-SNE with 3 components and different perplexity values using a “train–test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-3 | Tsne30-3 | Tsne123-3 | Time Taken: Tsne10-3 | Tsne30-3 | Tsne123-3
ExtraTreesClassifier | 0.9181 | 0.9146 | 0.9169 | 0.3680 | 0.3612 | 0.3671
LGBMClassifier | 0.0427 | 0.9104 | 0.0453 | 1.5523 | 2.2633 | 0.9634
RandomForestClassifier | 0.9171 | 0.9135 | 0.9141 | 0.6910 | 0.7550 | 0.7540
DecisionTreeClassifier | 0.9091 | 0.9054 | 0.8967 | 0.0250 | 0.0321 | 0.0320
BaggingClassifier | 0.8912 | 0.8883 | 0.8852 | 0.1192 | 0.1725 | 0.1649
ExtraTreeClassifier | 0.8884 | 0.8951 | 0.8978 | 0.0100 | 0.0130 | 0.0100
LabelPropagation | 0.6550 | 0.5999 | 0.5336 | 0.6273 | 0.6228 | 0.6521
LabelSpreading | 0.6397 | 0.5715 | 0.5151 | 1.0045 | 0.9969 | 1.0462
KNeighborsClassifier | 0.7047 | 0.6905 | 0.6748 | 0.0379 | 0.0399 | 0.0399
GaussianNB | 0.1815 | 0.2226 | 0.2120 | 0.0209 | 0.0219 | 0.0228
LinearDiscriminantAnalysis | 0.1132 | 0.1791 | 0.1520 | 0.0100 | 0.0120 | 0.0129
LogisticRegression | 0.1398 | 0.1891 | 0.1622 | 0.7938 | 0.9065 | 0.9106
SVC | 0.3146 | 0.2956 | 0.2561 | 1.5649 | 1.3408 | 1.3334
LinearSVC | 0.0925 | 0.1385 | 0.1255 | 0.5096 | 0.4877 | 0.6084
NearestCentroid | 0.2424 | 0.2867 | 0.2457 | 0.0090 | 0.0100 | 0.0110
CalibratedClassifierCV | 0.0866 | 0.1378 | 0.1429 | 2.5534 | 2.3304 | 2.5602
SGDClassifier | 0.0920 | 0.1101 | 0.0887 | 0.1057 | 0.1336 | 0.1307
Perceptron | 0.0858 | 0.0947 | 0.0609 | 0.0529 | 0.0648 | 0.0598
PassiveAggressiveClassifier | 0.0911 | 0.1228 | 0.1137 | 0.0549 | 0.0648 | 0.0598
BernoulliNB | 0.0426 | 0.0607 | 0.0774 | 0.0110 | 0.0120 | 0.0120
RidgeClassifier | 0.0437 | 0.0601 | 0.0499 | 0.0180 | 0.0189 | 0.0209
RidgeClassifierCV | 0.0437 | 0.0601 | 0.0499 | 0.0379 | 0.0219 | 0.0229
AdaBoostClassifier | 0.0574 | 0.0551 | 0.0787 | 0.4220 | 0.4538 | 0.4643
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0090 | 0.0079
QuadraticDiscriminantAnalysis | 0.2661 | 0.2965 | 0.2812 | 0.0150 | 0.0130 | 0.0160
Table A5. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 2 components and different near-neighbor values using a “train–test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-2 | UMAP30-2 | UMAP123-2 | Time Taken: UMAP10-2 | UMAP30-2 | UMAP123-2
ExtraTreesClassifier | 0.8537 | 0.7868 | 0.6164 | 0.3680 | 0.4836 | 0.5185
LGBMClassifier | 0.7620 | 0.0534 | 0.5915 | 1.5523 | 0.8876 | 2.0281
RandomForestClassifier | 0.8416 | 0.7250 | 0.6037 | 0.6910 | 0.7556 | 0.7909
DecisionTreeClassifier | 0.8148 | 0.7153 | 0.5816 | 0.0250 | 0.0260 | 0.0269
BaggingClassifier | 0.8093 | 0.7071 | 0.6060 | 0.1192 | 0.1267 | 0.1297
ExtraTreeClassifier | 0.7948 | 0.6567 | 0.5167 | 0.0100 | 0.0130 | 0.0120
LabelPropagation | 0.3324 | 0.2198 | 0.1450 | 0.6273 | 0.6445 | 0.6467
LabelSpreading | 0.3295 | 0.2127 | 0.1436 | 1.0045 | 1.0696 | 0.9904
KNeighborsClassifier | 0.6546 | 0.5961 | 0.5631 | 0.0379 | 0.0379 | 0.0379
GaussianNB | 0.0846 | 0.1651 | 0.1901 | 0.0209 | 0.0279 | 0.0315
LinearDiscriminantAnalysis | 0.0592 | 0.1074 | 0.1084 | 0.0100 | 0.0130 | 0.0110
LogisticRegression | 0.0589 | 0.1042 | 0.1234 | 0.7938 | 0.8240 | 0.8344
SVC | 0.1628 | 0.1898 | 0.1604 | 1.5649 | 1.3771 | 1.3968
LinearSVC | 0.0577 | 0.0884 | 0.1109 | 0.5096 | 0.6259 | 0.6231
NearestCentroid | 0.1596 | 0.2173 | 0.1927 | 0.0090 | 0.0110 | 0.0100
CalibratedClassifierCV | 0.0524 | 0.0809 | 0.1010 | 2.5534 | 2.7130 | 2.7518
SGDClassifier | 0.0623 | 0.0980 | 0.1540 | 0.1057 | 0.1207 | 0.1099
Perceptron | 0.0211 | 0.1231 | 0.1019 | 0.0529 | 0.0595 | 0.0562
PassiveAggressiveClassifier | 0.0348 | 0.0973 | 0.0827 | 0.0549 | 0.0598 | 0.0611
BernoulliNB | 0.0386 | 0.0732 | 0.0754 | 0.0110 | 0.0110 | 0.0120
RidgeClassifier | 0.0321 | 0.0478 | 0.0757 | 0.0180 | 0.0199 | 0.0200
RidgeClassifierCV | 0.0321 | 0.0478 | 0.0757 | 0.0379 | 0.0239 | 0.0209
AdaBoostClassifier | 0.0389 | 0.1004 | 0.0936 | 0.4220 | 0.4199 | 0.4144
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0080 | 0.0080
QuadraticDiscriminantAnalysis | 0.1194 | 0.2193 | 0.1941 | 0.0150 | 0.0130 | 0.0120
Table A6. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 3 components and different near-neighbor values using a “train–test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-3 | UMAP30-3 | UMAP123-3 | Time Taken: UMAP10-3 | UMAP30-3 | UMAP123-3
ExtraTreesClassifier | 0.8501 | 0.8152 | 0.7342 | 0.3680 | 0.4879 | 0.5206
LGBMClassifier | 0.8059 | 0.0527 | 0.6550 | 1.5523 | 0.8830 | 2.0156
RandomForestClassifier | 0.8327 | 0.7849 | 0.6810 | 0.6910 | 0.7508 | 0.9259
DecisionTreeClassifier | 0.8270 | 0.7603 | 0.6591 | 0.0250 | 0.0359 | 0.0329
BaggingClassifier | 0.8035 | 0.7696 | 0.6690 | 0.1192 | 0.1676 | 0.1750
ExtraTreeClassifier | 0.8102 | 0.7291 | 0.6009 | 0.0100 | 0.0130 | 0.0130
LabelPropagation | 0.4042 | 0.2543 | 0.2140 | 0.6273 | 0.6401 | 0.6611
LabelSpreading | 0.4047 | 0.2541 | 0.1995 | 1.0045 | 1.0271 | 1.0150
KNeighborsClassifier | 0.6502 | 0.6539 | 0.5882 | 0.0379 | 0.0399 | 0.0399
GaussianNB | 0.1618 | 0.1966 | 0.2085 | 0.0209 | 0.0279 | 0.0289
LinearDiscriminantAnalysis | 0.0987 | 0.1297 | 0.1382 | 0.0100 | 0.0100 | 0.0139
LogisticRegression | 0.1051 | 0.1389 | 0.1403 | 0.7938 | 0.7934 | 1.0130
SVC | 0.2495 | 0.2222 | 0.1655 | 1.5649 | 1.3185 | 1.3388
LinearSVC | 0.0696 | 0.1279 | 0.1383 | 0.5096 | 1.0410 | 0.8294
NearestCentroid | 0.1851 | 0.2014 | 0.2403 | 0.0090 | 0.0100 | 0.0169
CalibratedClassifierCV | 0.0696 | 0.1152 | 0.1161 | 2.5534 | 4.5479 | 3.8081
SGDClassifier | 0.0678 | 0.1025 | 0.1533 | 0.1057 | 0.1563 | 0.1351
Perceptron | 0.0787 | 0.1017 | 0.1128 | 0.0529 | 0.0792 | 0.0957
PassiveAggressiveClassifier | 0.0532 | 0.1223 | 0.1028 | 0.0549 | 0.0607 | 0.0947
BernoulliNB | 0.0459 | 0.0761 | 0.1018 | 0.0110 | 0.0120 | 0.0120
RidgeClassifier | 0.0647 | 0.0758 | 0.1018 | 0.0180 | 0.0189 | 0.0199
RidgeClassifierCV | 0.0637 | 0.0758 | 0.1018 | 0.0379 | 0.0209 | 0.0200
AdaBoostClassifier | 0.1124 | 0.0762 | 0.0761 | 0.4220 | 0.4830 | 0.4514
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0100 | 0.0070
QuadraticDiscriminantAnalysis | 0.2144 | 0.2767 | 0.2725 | 0.0150 | 0.0160 | 0.0200
Table A7. Performance of evaluation metric (balanced accuracy) for imputation data using the PCA method for different components using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: PCA-2 | PCA-3 | PCA-4 | Time Taken: PCA-2 | PCA-3 | PCA-4
ExtraTreesClassifier | 0.7769 | 0.8358 | 0.8709 | 0.4277 | 0.4301 | 0.4136
RandomForestClassifier | 0.7451 | 0.8253 | 0.8577 | 0.7073 | 0.7143 | 1.0609
BaggingClassifier | 0.7012 | 0.7793 | 0.8319 | 0.1218 | 0.1573 | 0.1886
DecisionTreeClassifier | 0.7308 | 0.7820 | 0.8267 | 0.0228 | 0.0286 | 0.0344
ExtraTreeClassifier | 0.7311 | 0.7674 | 0.8240 | 0.0104 | 0.0097 | 0.0100
LabelPropagation | 0.1994 | 0.3401 | 0.5514 | 0.6835 | 0.7046 | 0.7152
LabelSpreading | 0.1932 | 0.3267 | 0.5311 | 1.3347 | 1.3334 | 1.3164
KNeighborsClassifier | 0.4742 | 0.5219 | 0.5683 | 0.0334 | 0.0362 | 0.0377
LGBMClassifier | 0.7255 | 0.8084 | 0.5301 | 1.7637 | 1.8761 | 1.5696
LinearDiscriminantAnalysis | 0.1332 | 0.1479 | 0.2008 | 0.0098 | 0.0106 | 0.0113
GaussianNB | 0.1361 | 0.2129 | 0.3126 | 0.0104 | 0.0110 | 0.0112
LogisticRegression | 0.1109 | 0.1298 | 0.2023 | 0.7755 | 0.7843 | 0.7778
SVC | 0.1442 | 0.1816 | 0.2498 | 1.1508 | 1.1726 | 1.1509
LinearSVC | 0.0610 | 0.0670 | 0.0917 | 1.2491 | 2.0262 | 2.1846
NearestCentroid | 0.1630 | 0.1893 | 0.2534 | 0.0084 | 0.0084 | 0.0086
CalibratedClassifierCV | 0.0561 | 0.0661 | 0.0812 | 5.1969 | 8.2681 | 8.6005
SGDClassifier | 0.0741 | 0.0789 | 0.1144 | 0.1237 | 0.1397 | 0.1587
QuadraticDiscriminantAnalysis | 0.1475 | 0.2270 | 0.4012 | 0.0106 | 0.0116 | 0.0116
Perceptron | 0.0725 | 0.0650 | 0.0869 | 0.0577 | 0.0619 | 0.0651
PassiveAggressiveClassifier | 0.0595 | 0.0611 | 0.0777 | 0.0596 | 0.0639 | 0.0696
BernoulliNB | 0.0503 | 0.0503 | 0.0503 | 0.0093 | 0.0090 | 0.0093
RidgeClassifier | 0.0496 | 0.0490 | 0.0481 | 0.0115 | 0.0114 | 0.0122
RidgeClassifierCV | 0.0496 | 0.0490 | 0.0481 | 0.0203 | 0.0211 | 0.0212
AdaBoostClassifier | 0.1009 | 0.1008 | 0.0865 | 0.4045 | 0.4412 | 0.4750
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0058 | 0.0061 | 0.0060
Table A8. Performance of evaluation metrics (balanced accuracy) for imputation data using t-SNE with 2 components and different perplexity values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-2 | Tsne30-2 | Tsne123-2 | Time Taken: Tsne10-2 | Tsne30-2 | Tsne123-2
ExtraTreesClassifier | 0.8921 | 0.8929 | 0.8722 | 0.3265 | 0.3241 | 0.3290
RandomForestClassifier | 0.8886 | 0.8835 | 0.8628 | 0.6431 | 0.6288 | 0.6483
BaggingClassifier | 0.8669 | 0.8621 | 0.8373 | 0.1090 | 0.1071 | 0.1132
DecisionTreeClassifier | 0.8863 | 0.8815 | 0.8514 | 0.0208 | 0.0204 | 0.0219
ExtraTreeClassifier | 0.8753 | 0.8700 | 0.8470 | 0.0088 | 0.0086 | 0.0090
LabelPropagation | 0.3385 | 0.2772 | 0.2379 | 0.6733 | 0.6728 | 0.6687
LabelSpreading | 0.3294 | 0.2734 | 0.2304 | 1.3099 | 1.2979 | 1.2775
KNeighborsClassifier | 0.6750 | 0.6863 | 0.6650 | 0.0328 | 0.0333 | 0.0332
LGBMClassifier | 0.5336 | 0.7035 | 0.8494 | 1.5011 | 1.6371 | 1.7765
LinearDiscriminantAnalysis | 0.0656 | 0.1204 | 0.1369 | 0.0099 | 0.0098 | 0.0096
GaussianNB | 0.1334 | 0.1789 | 0.1535 | 0.0104 | 0.0105 | 0.0100
LogisticRegression | 0.0740 | 0.1223 | 0.1377 | 0.7673 | 0.7828 | 0.7868
SVC | 0.2208 | 0.2254 | 0.1828 | 1.1766 | 1.0702 | 1.0515
LinearSVC | 0.0643 | 0.0488 | 0.1015 | 0.4130 | 0.3105 | 0.3483
NearestCentroid | 0.1886 | 0.2232 | 0.1755 | 0.0082 | 0.0087 | 0.0083
CalibratedClassifierCV | 0.0517 | 0.0574 | 0.0903 | 1.9721 | 1.5766 | 1.6991
SGDClassifier | 0.0739 | 0.0949 | 0.1271 | 0.1121 | 0.1167 | 0.1138
QuadraticDiscriminantAnalysis | 0.1271 | 0.2039 | 0.2138 | 0.0103 | 0.0108 | 0.0106
Perceptron | 0.0630 | 0.0751 | 0.0775 | 0.0552 | 0.0553 | 0.0556
PassiveAggressiveClassifier | 0.0777 | 0.0642 | 0.0660 | 0.0577 | 0.0590 | 0.0584
BernoulliNB | 0.0566 | 0.0445 | 0.0710 | 0.0089 | 0.0092 | 0.0089
RidgeClassifier | 0.0393 | 0.0417 | 0.0505 | 0.0119 | 0.0116 | 0.0116
RidgeClassifierCV | 0.0393 | 0.0416 | 0.0505 | 0.0208 | 0.0208 | 0.0202
AdaBoostClassifier | 0.0680 | 0.0643 | 0.0739 | 0.3950 | 0.3911 | 0.3930
DummyClassifier | 0.0256 | 0.0256 | 0.0258 | 0.0060 | 0.0061 | 0.0056
Table A9. Performance of evaluation metric (balanced accuracy) for imputation data using t-SNE with 3 components and different perplexity values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-3 | Tsne30-3 | Tsne123-3 | Time Taken: Tsne10-3 | Tsne30-3 | Tsne123-3
ExtraTreesClassifier | 0.9020 | 0.8949 | 0.8782 | 0.3401 | 0.3438 | 0.3352
RandomForestClassifier | 0.8931 | 0.8874 | 0.8698 | 0.6713 | 0.6871 | 0.6670
BaggingClassifier | 0.8700 | 0.8680 | 0.8491 | 0.1562 | 0.1625 | 0.1540
DecisionTreeClassifier | 0.8859 | 0.8789 | 0.8594 | 0.0282 | 0.0291 | 0.0284
ExtraTreeClassifier | 0.8596 | 0.8642 | 0.8472 | 0.0096 | 0.0092 | 0.0090
LabelPropagation | 0.6281 | 0.5808 | 0.5165 | 0.6709 | 0.6964 | 0.6811
LabelSpreading | 0.6157 | 0.5648 | 0.4993 | 1.2951 | 1.3629 | 1.3262
KNeighborsClassifier | 0.7082 | 0.6895 | 0.6872 | 0.0350 | 0.0353 | 0.0349
LGBMClassifier | 0.5525 | 0.8778 | 0.2000 | 1.5646 | 1.8609 | 1.2253
LinearDiscriminantAnalysis | 0.1147 | 0.1780 | 0.1422 | 0.0104 | 0.0109 | 0.0103
GaussianNB | 0.1871 | 0.2149 | 0.1996 | 0.0106 | 0.0112 | 0.0109
LogisticRegression | 0.1378 | 0.1809 | 0.1572 | 0.8007 | 0.8137 | 0.7964
SVC | 0.3073 | 0.2797 | 0.2390 | 1.1099 | 1.0340 | 1.0214
LinearSVC | 0.0845 | 0.1382 | 0.1246 | 0.4434 | 0.3702 | 0.4347
NearestCentroid | 0.2421 | 0.2732 | 0.2344 | 0.0086 | 0.0088 | 0.0083
CalibratedClassifierCV | 0.0847 | 0.1343 | 0.1344 | 2.0596 | 1.7766 | 1.9859
SGDClassifier | 0.0899 | 0.1176 | 0.1127 | 0.1249 | 0.1255 | 0.1270
QuadraticDiscriminantAnalysis | 0.2680 | 0.2825 | 0.2649 | 0.0116 | 0.0116 | 0.0112
Perceptron | 0.0893 | 0.1057 | 0.0892 | 0.0555 | 0.0563 | 0.0571
PassiveAggressiveClassifier | 0.0835 | 0.1015 | 0.0978 | 0.0623 | 0.0652 | 0.0604
BernoulliNB | 0.0451 | 0.0631 | 0.0794 | 0.0092 | 0.0091 | 0.0090
RidgeClassifier | 0.0438 | 0.0597 | 0.0514 | 0.0114 | 0.0124 | 0.0120
RidgeClassifierCV | 0.0437 | 0.0597 | 0.0514 | 0.0209 | 0.0217 | 0.0209
AdaBoostClassifier | 0.0680 | 0.0532 | 0.0632 | 0.4350 | 0.4517 | 0.4383
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0064 | 0.0062 | 0.0060
Table A10. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 2 components and different near-neighbor values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-2 | UMAP30-2 | UMAP123-2 | Time Taken: UMAP10-2 | UMAP30-2 | UMAP123-2
ExtraTreesClassifier | 0.8110 | 0.7424 | 0.6083 | 0.4148 | 0.4404 | 0.5185
RandomForestClassifier | 0.7918 | 0.6940 | 0.5798 | 0.6966 | 0.6987 | 0.7151
BaggingClassifier | 0.7718 | 0.6826 | 0.5693 | 0.1154 | 0.1199 | 0.1241
DecisionTreeClassifier | 0.7774 | 0.6911 | 0.5701 | 0.0220 | 0.0232 | 0.0238
ExtraTreeClassifier | 0.7567 | 0.6468 | 0.5354 | 0.0098 | 0.0098 | 0.0102
LabelPropagation | 0.3291 | 0.2181 | 0.1495 | 0.6906 | 0.6815 | 0.6858
LabelSpreading | 0.3192 | 0.2117 | 0.1475 | 1.3000 | 1.3359 | 1.3080
KNeighborsClassifier | 0.6586 | 0.6123 | 0.5288 | 0.0332 | 0.0334 | 0.0340
LGBMClassifier | 0.7266 | 0.3032 | 0.2506 | 1.8212 | 1.3384 | 1.3348
LinearDiscriminantAnalysis | 0.0597 | 0.1092 | 0.1059 | 0.0098 | 0.0097 | 0.0098
GaussianNB | 0.0837 | 0.1664 | 0.1777 | 0.0104 | 0.0104 | 0.0106
LogisticRegression | 0.0597 | 0.1090 | 0.1301 | 0.7845 | 0.7887 | 0.7860
SVC | 0.1746 | 0.1861 | 0.1589 | 1.3408 | 1.0326 | 1.0518
LinearSVC | 0.0572 | 0.0889 | 0.1008 | 0.6045 | 0.4664 | 0.4674
NearestCentroid | 0.1542 | 0.2195 | 0.1912 | 0.0084 | 0.0084 | 0.0086
CalibratedClassifierCV | 0.0515 | 0.0826 | 0.0980 | 2.7329 | 2.1119 | 2.1660
SGDClassifier | 0.0623 | 0.0999 | 0.1255 | 0.1183 | 0.1174 | 0.1116
QuadraticDiscriminantAnalysis | 0.1039 | 0.2194 | 0.1824 | 0.0106 | 0.0106 | 0.0109
Perceptron | 0.0439 | 0.0973 | 0.0933 | 0.0555 | 0.0573 | 0.0559
PassiveAggressiveClassifier | 0.0478 | 0.1005 | 0.1054 | 0.0589 | 0.0584 | 0.0593
BernoulliNB | 0.0384 | 0.0729 | 0.0756 | 0.0094 | 0.0088 | 0.0090
RidgeClassifier | 0.0323 | 0.0478 | 0.0755 | 0.0114 | 0.0116 | 0.0116
RidgeClassifierCV | 0.0323 | 0.0478 | 0.0755 | 0.0202 | 0.0202 | 0.0208
AdaBoostClassifier | 0.0750 | 0.1011 | 0.0828 | 0.4024 | 0.4058 | 0.4120
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0060 | 0.0062 | 0.0058
Table A11. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 3 components and different near-neighbor values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-3 | UMAP30-3 | UMAP123-3 | Time Taken: UMAP10-3 | UMAP30-3 | UMAP123-3
ExtraTreesClassifier | 0.8364 | 0.7895 | 0.7095 | 0.4025 | 0.4238 | 0.4366
RandomForestClassifier | 0.8125 | 0.7496 | 0.6620 | 0.7047 | 0.6986 | 0.7160
BaggingClassifier | 0.7875 | 0.7276 | 0.6565 | 0.1689 | 0.1582 | 0.1657
DecisionTreeClassifier | 0.7928 | 0.7254 | 0.6328 | 0.0317 | 0.0290 | 0.0304
ExtraTreeClassifier | 0.7730 | 0.6983 | 0.6043 | 0.0094 | 0.0098 | 0.0098
LabelPropagation | 0.4010 | 0.2603 | 0.2189 | 0.6862 | 0.6830 | 0.6763
LabelSpreading | 0.3954 | 0.2556 | 0.2128 | 1.3110 | 1.3243 | 1.3079
KNeighborsClassifier | 0.6526 | 0.6310 | 0.6115 | 0.0346 | 0.0342 | 0.0344
LGBMClassifier | 0.7635 | 0.4674 | 0.6154 | 1.8766 | 1.5890 | 1.8907
LinearDiscriminantAnalysis | 0.1032 | 0.1261 | 0.1419 | 0.0103 | 0.0102 | 0.0102
GaussianNB | 0.1583 | 0.2012 | 0.1972 | 0.0108 | 0.0108 | 0.0106
LogisticRegression | 0.1041 | 0.1424 | 0.1455 | 0.7614 | 0.7653 | 0.7951
SVC | 0.2490 | 0.2245 | 0.1585 | 1.1175 | 0.9994 | 1.0043
LinearSVC | 0.0725 | 0.1315 | 0.1385 | 0.7941 | 0.8242 | 0.7081
NearestCentroid | 0.2129 | 0.1877 | 0.2450 | 0.0082 | 0.0086 | 0.0086
CalibratedClassifierCV | 0.0698 | 0.1198 | 0.1306 | 3.3967 | 3.4009 | 3.0388
SGDClassifier | 0.0881 | 0.1174 | 0.1538 | 0.1399 | 0.1354 | 0.1227
QuadraticDiscriminantAnalysis | 0.2310 | 0.2862 | 0.2889 | 0.0115 | 0.0116 | 0.0112
Perceptron | 0.0788 | 0.1166 | 0.0929 | 0.0595 | 0.0622 | 0.0590
PassiveAggressiveClassifier | 0.0832 | 0.1077 | 0.0796 | 0.0622 | 0.0648 | 0.0624
BernoulliNB | 0.0459 | 0.0759 | 0.1015 | 0.0090 | 0.0094 | 0.0092
RidgeClassifier | 0.0619 | 0.0739 | 0.1017 | 0.0116 | 0.0120 | 0.0119
RidgeClassifierCV | 0.0615 | 0.0738 | 0.1017 | 0.0204 | 0.0218 | 0.0205
AdaBoostClassifier | 0.0714 | 0.0755 | 0.0761 | 0.4289 | 0.4329 | 0.4294
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0059 | 0.0060 | 0.0062

References

  1. Hu, Z.; Jin, Y.; Hu, Q.; Sen, S.; Zhou, T.; Osman, M.T. Prediction of fuel consumption for enroute ship based on machine learning. IEEE Access 2019, 7, 119497–119505. [Google Scholar] [CrossRef]
  2. Rawson, A.; Brito, M.; Sabeur, Z.; Tran-Thanh, L. A machine learning approach for monitoring ship safety in extreme weather events. Saf. Sci. 2021, 141, 105336. [Google Scholar] [CrossRef]
  3. Akyuz, E.; Cicek, K.; Celik, M. A comparative research of machine learning impact to future of maritime transportation. Procedia Comput. Sci. 2019, 158, 275–280. [Google Scholar] [CrossRef]
  4. İnceişçi, F.K.; Ayça, A. Fault Analysis of Ship Machinery Using Machine Learning Techniques. Int. J. Marit. Eng. 2022, 164. [Google Scholar] [CrossRef]
  5. Hwang, T.; Youn, I.H. Navigation Situation Clustering Model of Human-Operated Ships for Maritime Autonomous Surface Ship Collision Avoidance Tests. J. Mar. Sci. Eng. 2021, 9, 1458. [Google Scholar] [CrossRef]
  6. Yekeen, S.T.; Balogun, A.L.; Yusof, K.B.W. A novel deep learning instance segmentation model for automated marine oil spill detection. ISPRS J. Photogramm. Remote Sens. 2020, 167, 190–200. [Google Scholar] [CrossRef]
  7. Uyanık, T.; Karatuğ, Ç.; Arslanoğlu, Y. Machine learning approach to ship fuel consumption: A case of container vessel. Transp. Res. Part D Transp. Environ. 2020, 84, 102389. [Google Scholar] [CrossRef]
  8. Huang, L.; Pena, B.; Liu, Y.; Anderlini, E. Machine learning in sustainable ship design and operation: A review. Ocean Eng. 2022, 266, 112907. [Google Scholar] [CrossRef]
  9. Du, Y.; Chen, Y.; Li, X.; Schönborn, A.; Sun, Z. Data fusion and machine learning for ship fuel efficiency modeling: Part III–Sensor data and meteorological data. Commun. Transp. Res. 2022, 2, 100072. [Google Scholar] [CrossRef]
  10. Oruc, A. Claims of state-sponsored cyberattack in the maritime industry. In Proceedings of the Conference Proceedings of INEC, Online, 5–9 October 2020. [Google Scholar]
  11. Lee, C.B.; Wan, J.; Shi, W.; Li, K. A cross-country study of competitiveness of the shipping industry. Transp. Policy 2014, 35, 366–376. [Google Scholar] [CrossRef]
  12. Zaman, I.; Pazouki, K.; Norman, R.; Younessi, S.; Coleman, S. Challenges and opportunities of big data analytics for upcoming regulations and future transformation of the shipping industry. Procedia Eng. 2017, 194, 537–544. [Google Scholar] [CrossRef]
  13. Bui, K.Q.; Perera, L.P. The compliance challenges in emissions control regulations to reduce air pollution from shipping. In Proceedings of the OCEANS 2019-Marseille, Marseille, France, 17–20 June 2019; pp. 1–8. [Google Scholar]
  14. Buixadé Farré, A.; Stephenson, S.R.; Chen, L.; Czub, M.; Dai, Y.; Demchev, D.; Efimov, Y.; Graczyk, P.; Grythe, H.; Keil, K.; et al. Commercial Arctic shipping through the Northeast Passage: Routes, resources, governance, technology, and infrastructure. Polar Geogr. 2014, 37, 298–324. [Google Scholar] [CrossRef]
  15. Shepherd, I. European efforts to make marine data more accessible. Ethics Sci. Environ. Politics 2018, 18, 75–81. [Google Scholar] [CrossRef]
  16. Arifin, M.D. Application of Internet of Things (IoT) and Big Data in the Maritime Industries: Ship Allocation Model. Int. J. Mar. Eng. Innov. Res. 2023, 8, 97–108. [Google Scholar] [CrossRef]
  17. Skarlatos, K.; Fousteris, A.; Georgakellos, D.; Economou, P.; Bersimis, S. Assessing Ships’ Environmental Performance Using Machine Learning. Energies 2023, 16, 2544. [Google Scholar] [CrossRef]
  18. Rawson, A.; Brito, M. A survey of the opportunities and challenges of supervised machine learning in maritime risk analysis. Transp. Rev. 2023, 43, 108–130. [Google Scholar] [CrossRef]
  19. Tsaganos, G.; Nikitakos, N.; Dalaklis, D.; Ölcer, A.; Papachristos, D. Machine learning algorithms in shipping: Improving engine fault detection and diagnosis via ensemble methods. WMU J. Marit. Aff. 2020, 19, 51–72. [Google Scholar] [CrossRef]
  20. Gu, J.; Oelke, D. Understanding bias in machine learning. arXiv 2019, arXiv:1909.01866. [Google Scholar]
  21. Lindstad, H.E.; Eskeland, G.S. Environmental regulations in shipping: Policies leaning towards globalization of scrubbers deserve scrutiny. Transp. Res. Part D Transp. Environ. 2016, 47, 67–76. [Google Scholar] [CrossRef]
  22. Psaraftis, H.N.; Kontovas, C.A. Speed models for energy-efficient maritime transportation: A taxonomy and survey. Transp. Res. Part C Emerg. Technol. 2013, 26, 331–351. [Google Scholar] [CrossRef]
  23. Geng, J.B.; Cai, J.B.; Luo, M.J.; Niu, J.Z. Main Diesel Engine Selection for Ships Based on Life Cycle Costing. In 2015 International Conference on Management Science and Management Innovation (MSMI 2015); Atlantis Press: Amsterdam, The Netherlands, 2015; pp. 361–366. [Google Scholar]
  24. Tadros, M.; Ventura, M.; Soares, C.G. Surrogate models of the performance and exhaust emissions of marine diesel engines for ship conceptual design. Transport 2018, 2, 105–112. [Google Scholar]
  25. Papanikolaou, A. Ship Design: Methodologies of Preliminary Design; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  26. Avgeridis, L.; Lentzos, K.; Skoutas, D.; Emiris, I.Z. Time Series Analysis for Digital Twins in Green Shipping. In SNAME International Symposium on Ship Operations, Management and Economics; SNAME: Alexandria, VA, USA, 2023; p. D011S003R003. [Google Scholar]
  27. Giering, J.E.; Dyck, A. Maritime Digital Twin architecture: A concept for holistic Digital Twin application for shipbuilding and shipping. at-Automatisierungstechnik 2021, 69, 1081–1095. [Google Scholar] [CrossRef]
  28. Zavareh, B.; Foroozan, H.; Gheisarnejad, M.; Khooban, M.H. New trends on digital twin-based blockchain technology in zero-emission ship applications. Nav. Eng. J. 2021, 133, 115–135. [Google Scholar]
  29. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 1–37. [Google Scholar] [CrossRef] [PubMed]
  30. Bouhlila, D.S.; Sellaouti, F. Multiple imputation using chained equations for missing data in TIMSS: A case study. Large-Scale Assess. Educ. 2013, 1, 4. [Google Scholar] [CrossRef]
  31. Seu, K.; Kang, M.S.; Lee, H. An intelligent missing data imputation techniques: A review. JOIV Int. J. Inform. Vis. 2022, 6, 278–283. [Google Scholar] [CrossRef]
  32. Henry, A.J.; Hevelone, N.D.; Lipsitz, S.; Nguyen, L.L. Comparative methods for handling missing data in large databases. J. Vasc. Surg. 2013, 58, 1353–1359.e6. [Google Scholar] [CrossRef]
  33. Little, R.J. A test of missing completely at random for multivariate data with missing values. J. Am. Stat. Assoc. 1988, 83, 1198–1202. [Google Scholar] [CrossRef]
  34. Shehadeh, A.; Alshboul, O.; Al Mamlook, R.E.; Hamedat, O. Machine learning models for predicting the residual value of heavy construction equipment: An evaluation of modified decision tree, LightGBM, and XGBoost regression. Autom. Constr. 2021, 129, 103827. [Google Scholar] [CrossRef]
  35. Jeganathan, S.; Lakshminarayanan, A.R.; Ramachandran, N.; Tunze, G.B. Predicting Academic Performance of Immigrant Students Using XGBoost Regressor. Int. J. Inf. Technol. Web Eng. (IJITWE) 2022, 17, 1–19. [Google Scholar] [CrossRef]
  36. Imane, M.; Aoula, E.S.; Achouyab, E.H. Using Bayesian ridge regression to predict the overall equipment effectiveness performance. In Proceedings of the 2022 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Meknes, Morocco, 3–4 March 2022; pp. 1–4. [Google Scholar]
  37. Botchkarev, A. Evaluating Performance of Regression Machine Learning Models Using Multiple Error Metrics in Azure Machine Learning Studio. 2018. Available online: https://ssrn.com/abstract=3177507 (accessed on 15 November 2023).
  38. Handelman, G.S.; Kok, H.K.; Chandra, R.V.; Razavi, A.H.; Huang, S.; Brooks, M.; Lee, M.J.; Asadi, H. Peering into the black box of artificial intelligence: Evaluation metrics of machine learning methods. Am. J. Roentgenol. 2019, 212, 38–43. [Google Scholar] [CrossRef] [PubMed]
  39. Bekri, E.; Yannopoulos, P.; Economou, P. Methodology for improving reliability of river discharge measurements. J. Environ. Manag. 2019, 247, 371–384. [Google Scholar] [CrossRef] [PubMed]
  40. Alexopoulos, P.; Skondra, M.; Kontogianni, E.; Vratsista, A.; Frounta, M.; Konstantopoulou, G.; Aligianni, S.I.; Charalampopoulou, M.; Lentzari, I.; Gourzis, P.; et al. Validation of the cognitive telephone screening instruments COGTEL and COGTEL+ in identifying clinically diagnosed neurocognitive disorder due to Alzheimer’s disease in a naturalistic clinical setting. J. Alzheimer’s Dis. 2021, 83, 259–268. [Google Scholar] [CrossRef] [PubMed]
  41. Tsikas, P.K.; Chassiakos, A.P.; Papadimitropoulos, V.C. Seismic damage assessment of highway bridges by means of soft computing techniques. In Structure and Infrastructure Engineering; Taylor & Francis: London, UK, 2022. [Google Scholar]
  42. Zhang, L.; Zhou, L.; Yuan, B.; Hu, F.; Zhang, Q.; Wei, W.; Sun, D. Spatiotemporal Evolution Characteristics of Urban Land Surface Temperature Based on Local Climate Zones in Xi’an Metropolitan, China. In Chinese Geographical Science; Springer: New York, NY, USA, 2023; pp. 1–16. [Google Scholar]
  43. Economou, P.; Batsidis, A.; Kounetas, K. Evaluation of the OECD’s prediction algorithm for the annual GDP growth rate. Commun. Stat. Case Stud. Data Anal. Appl. 2021, 7, 67–87. [Google Scholar] [CrossRef]
  44. Velliangiri, S.; Alagumuthukrishnan, S.; Thankumar joseph, S.I. A Review of Dimensionality Reduction Techniques for Efficient Computation. Procedia Comput. Sci. 2019, 165, 104–111. [Google Scholar] [CrossRef]
  45. Jackson, J.E. A User’s Guide to Principal Components; John Wiley & Sons: Hoboken, NJ, USA, 2005. [Google Scholar]
  46. Bersimis, S.; Georgakellos, D. A probabilistic framework for the evaluation of products’ environmental performance using life cycle approach and Principal Component Analysis. J. Clean. Prod. 2013, 42, 103–115. [Google Scholar] [CrossRef]
  47. Bersimis, S.; Sgora, A.; Psarakis, S. Methods for interpreting the out-of-control signal of multivariate control charts: A comparison study. Qual. Reliab. Eng. Int. 2017, 33, 2295–2326. [Google Scholar] [CrossRef]
  48. Maravelakis, P.; Bersimis, S.; Panaretos, J.; Psarakis, S. Identifying the out of control variable in a multivariate control chart. Commun. Stat.-Theory Methods 2002, 31, 2391–2408. [Google Scholar] [CrossRef]
  49. Kaiser, H.F. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960, 20, 141–151. [Google Scholar] [CrossRef]
  50. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
  51. Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; Walton, M. Application of Uniform Manifold Approximation and Projection (UMAP) in spectral imaging of artworks. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 252, 119547. [Google Scholar] [CrossRef]
  52. Milošević, D.; Medeiros, A.S.; Piperac, M.S.; Cvijanović, D.; Soininen, J.; Milosavljević, A.; Predić, B. The application of Uniform Manifold Approximation and Projection (UMAP) for unconstrained ordination and classification of biological indicators in aquatic ecology. Sci. Total Environ. 2022, 815, 152365. [Google Scholar] [CrossRef] [PubMed]
  53. Yu, T.T.; Chen, C.Y.; Wu, T.H.; Chang, Y.C. Application of high-dimensional uniform manifold approximation and projection (UMAP) to cluster existing landfills on the basis of geographical and environmental features. Sci. Total Environ. 2023, 904, 167013. [Google Scholar] [CrossRef] [PubMed]
  54. Maravelakis, P.E.; Bersimis, S. The use of Andrews curves for detecting the out-of-control variables when a multivariate control chart signals. Stat. Pap. 2009, 50, 51–65. [Google Scholar] [CrossRef]
  55. Skamnia, E.; Economou, P.; Bersimis, S.; Frouda, M.; Politis, A.; Alexopoulos, P. Hot spot identification method based on Andrews curves: An application on the COVID-19 crisis effects on caregiver distress in neurocognitive disorder. J. Appl. Stat. 2023, 50, 2388–2407. [Google Scholar] [CrossRef] [PubMed]
  56. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  57. Hamel, P.; Eck, D. Learning features from music audio with deep belief networks. In Proceedings of the ISMIR, Utrecht, The Netherlands, 9–13 August 2010; Volume 10, p. 341. [Google Scholar]
  58. Balamurali, M.; Silversides, K.L.; Melkumyan, A. A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data. Comput. Geosci. 2019, 125, 78–89. [Google Scholar] [CrossRef]
  59. Balamurali, M.; Melkumyan, A. t-SNE based visualisation and clustering of geological domain. In Proceedings of the Neural Information Processing: 23rd International Conference, ICONIP 2016, Kyoto, Japan, 16–21 October 2016; Proceedings, Part IV 23. pp. 565–572. [Google Scholar]
  60. Leung, R.; Balamurali, M.; Melkumyan, A. Sample truncation strategies for outlier removal in geochemical data: The MCD robust distance approach versus t-SNE ensemble clustering. Math. Geosci. 2021, 53, 105–130. [Google Scholar] [CrossRef]
  61. Jamieson, A.R.; Giger, M.L.; Drukker, K.; Li, H.; Yuan, Y.; Bhooshan, N. Exploring nonlinear feature space dimension reduction and data representation in breast CADx with Laplacian eigenmaps and-SNE. Med. Phys. 2010, 37, 339–351. [Google Scholar] [CrossRef]
  62. Wallach, I.; Lilien, R. The protein–small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 2009, 25, 615–620. [Google Scholar] [CrossRef]
  63. Birjandtalab, J.; Pouyan, M.B.; Nourani, M. Nonlinear dimension reduction for EEG-based epileptic seizure detection. In Proceedings of the 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Las Vegas, NV, USA, 24–27 February 2016; pp. 595–598. [Google Scholar]
  64. Hinton, G.E.; Roweis, S. Stochastic neighbor embedding. In Proceedings of the Advances in Neural Information Processing Systems 15 (NIPS 2002), Vancouver, BC, Canada, 9–14 December 2002; Volume 15. [Google Scholar]
  65. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  66. Breiman, L. Arcing Classifiers; Technical Report; University of California, Department of Statistics: Berkeley, CA, USA, 1996. [Google Scholar]
  67. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
  68. Kramer, O.; Kramer, O. K-nearest neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Springer: Berlin/Heidelberg, Germany, 2013; pp. 13–23. [Google Scholar]
  69. Gou, J.; Yi, Z.; Du, L.; Xiong, T. A local mean-based k-nearest centroid neighbor classifier. Comput. J. 2012, 55, 1058–1071. [Google Scholar] [CrossRef]
  70. Yuan, G.X.; Ho, C.H.; Lin, C.J. Recent advances of large-scale linear classification. Proc. IEEE 2012, 100, 2584–2603. [Google Scholar] [CrossRef]
  71. Crammer, K.; Dekel, O.; Keshet, J.; Shalev-Shwartz, S.; Singer, Y. Online passive aggressive algorithms. J. Mach. Learn. Res. 2006, 7, 551–585. [Google Scholar]
  72. Zhu, X.; Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. ProQuest Number Inf. All Users 2002. [Google Scholar]
  73. Breiman, L. Pasting small votes for classification in large databases and on-line. Mach. Learn. 1999, 36, 85–103. [Google Scholar] [CrossRef]
  74. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  75. Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar]
  76. Louppe, G.; Geurts, P. Ensembles on random patches. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, 24–28 September 2012; Proceedings, Part I 23. pp. 346–361. [Google Scholar]
  77. Ferreira, A.J.; Figueiredo, M.A. Boosting algorithms: A review of methods, theory, and applications. In Ensemble Machine Learning: Methods and Applications; Springer: New York, NY, USA, 2012; pp. 35–85. [Google Scholar]
  78. Jordanov, I.; Petrov, N.; Petrozziello, A. Classifiers Accuracy Improvement Based on Missing Data Imputation. J. Artif. Intell. Soft Comput. Res. 2018, 8, 31–48. [Google Scholar] [CrossRef]
  79. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793. [Google Scholar]
  80. Ramoni, M.; Sebastiani, P. Robust Bayes classifiers. Artif. Intell. 2001, 125, 209–226. [Google Scholar] [CrossRef]
81. Zhang, X.; Song, S.; Wu, C. Robust Bayesian classification with incomplete data. Cogn. Comput. 2013, 5, 170–187. [Google Scholar] [CrossRef]
  82. Guyon, I. Practical feature selection: From correlation to causality. In Mining Massive Data Sets for Security: Advances in Data Mining, Search, Social Networks and Text Mining, and Their Applications to Security; IOS Press: Amsterdam, The Netherlands, 2008; pp. 27–43. [Google Scholar]
83. Anis, M.; Ali, M. Investigating the performance of SMOTE for class imbalanced learning: A case study of credit scoring datasets. Eur. Sci. J. 2017, 13, 340–353. [Google Scholar] [CrossRef]
  84. Elor, Y.; Averbuch-Elor, H. To SMOTE, or not to SMOTE? arXiv 2022, arXiv:2201.08528. [Google Scholar]
Figure 1. The 4 main steps of the implementation process.
Figure 2. Spearman's correlation coefficients between the independent variables.
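As a minimal, illustrative sketch (not the authors' code), a correlation matrix like the one shown in Figure 2 can be obtained directly with pandas; the frame contents below are synthetic stand-ins for the ship features of Tables 2 and 3.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ship-feature frame; replace with the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["DWT", "LOA", "Beam", "Bal Speed"])

# Spearman's rank correlation tolerates skewed scales and monotone,
# non-linear relations better than Pearson's correlation does.
corr = df.corr(method="spearman")
print(corr.round(2))
```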
Figure 3. Left plot: Scree plot of Principal Components for both the imputed and non-imputed data sets. Right plot: Variance explained by Principal Components for both the imputed and non-imputed data sets.
Figure 4. The 5 most frequent engine types in the reduced two-dimensional space obtained by the three dimensionality reduction methods (upper left plot: PCA; upper right plot: UMAP; bottom plot: t-SNE).
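A hedged sketch of how the three two-dimensional embeddings of Figure 4 could be computed is given below, with the t-SNE perplexity (10) and UMAP neighborhood size (10) matching the best-performing settings reported later; the feature matrix is a synthetic placeholder.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))           # stand-in for the imputed feature matrix
X_std = StandardScaler().fit_transform(X)

emb_pca = PCA(n_components=2).fit_transform(X_std)
emb_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_std)
emb_umap = umap.UMAP(n_components=2, n_neighbors=10,
                     random_state=0).fit_transform(X_std)
```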
Figure 5. Performance of evaluation metrics (balanced accuracy) for imputed data using k-fold cross-validation.
Table 1. Description of each subset of the data set, based on data providers.

Source | Observations | Features | Period
DB1 | 141,080 | 72 | January 2019–November 2023
DB2 | 78,065 | 81 | January 2019–November 2023
DB3 | 4733 | 60 | January 2019–November 2023
MRV1 | 12,255 | 61 | 2018
MRV2 | 12,399 | 61 | 2019
MRV3 | 12,067 | 61 | 2020
MRV4 | 12,290 | 61 | 2021
MRV5 | 12,912 | 61 | 2022
Table 2. Description and variables' names of the first group (exterior measurements of a ship).

Name | Variable | Description
IMO | G_a1 | Ship identification number.
DWT | G_a2 | Deadweight of the ship.
Design Draft | G_a3 | Vertical distance between the waterline and the bottom of the hull.
LOA | G_a4 | The length of the ship (length overall).
Beam | G_a5 | The width of the ship at its widest point.
Depth | G_a6 | The depth measured at the middle of the length, from the top of the keel to the top of the deck beam at the side of the uppermost continuous deck.
Grain Capacity | G_a7 | The capacity of cargo spaces measured laterally to the outside of frames, and vertically from the tank tops to the top of the under-weatherdeck beams, including the area contained within a ship's hatchway coamings.
Built year | G_a8 | The year of completion of the ship.
Gross Tonnage | G_a9 | The volume of the ship in cubic meters below the main deck and the enclosed spaces above the main deck.
Net Tonnage | G_a10 | The volume of the cargo space.
Table 3. Description and variables' names of the second group (operational).

Name | Variable | Description
Bal Speed | G_b1 | Speed at which the vessel travels empty or largely empty (in ballast).
Lad Speed | G_b2 | Speed at which the vessel travels loaded (laden).
Bal VLSFO | G_b3 | VLSFO (1) consumption when the vessel travels empty or largely empty.
Lad VLSFO | G_b4 | VLSFO consumption when the vessel travels loaded.
Bal MGO | G_b5 | MDO/MGO (2) consumption when the vessel travels empty or largely empty.
Lad MGO | G_b6 | MDO/MGO consumption when the vessel travels loaded.
VLSFO pi | G_b7 | VLSFO consumption during the idle state.
MGO pi | G_b8 | MDO/MGO consumption during the idle state.

(1) Very Low Sulfur Fuel Oil; (2) Marine Diesel Oil/Marine Gas Oil.
Table 4. Evaluation metrics of MICE imputation, using the Bayesian Ridge estimator within the imputation algorithm.

Variable | MSE | RMSE | MAE | F-test p-value
G_a2 | 13.12 × 10^6 | 3623.057 | 1851.556 | 0.063
G_a3 | 0.403 | 0.635 | 0.339 | 0.054
G_a4 | 34.446 | 5.869 | 4.282 | 0.152
G_a5 | 1.422 | 1.192 | 0.702 | 0.130
G_a6 | 0.567 | 0.753 | 0.494 | 0.075
G_a7 | 19.13 × 10^6 | 4443.108 | 2357.83 | 0.086
G_a8 | 13.099 | 3.612 | 2.666 | 0.073
G_a9 | 3.74 × 10^6 | 1934.556 | 975.555 | 0.078
G_a10 | 2.46 × 10^6 | 1570.111 | 832.247 | 0.076
G_b1 | 0.425 | 0.652 | 0.470 | 0.143
G_b2 | 0.391 | 0.626 | 0.446 | 0.098
G_b3 | 10.664 | 3.265 | 2.310 | 0.127
G_b4 | 27.881 | 5.281 | 2.123 | 0.103
G_b5 | 0.015 | 0.122 | 0.054 | 0.004
G_b6 | 0.020 | 0.143 | 0.053 | 0.007
G_b7 | 1.168 | 1.108 | 0.508 | 0.057
G_b8 | 0.084 | 0.290 | 0.167 | 0.065
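For concreteness, a MICE-style imputation of this kind can be sketched with scikit-learn's IterativeImputer; this is an illustrative snippet under synthetic data, not the authors' implementation, and the missingness rate and iteration count are placeholders.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic matrix with ~10% missing entries; replace with the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
X[rng.random(X.shape) < 0.10] = np.nan

# Chained-equation (MICE-style) imputation with a Bayesian Ridge estimator,
# as evaluated in Table 4.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```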
Table 5. Evaluation metrics of MICE imputation, using the Light Gradient Boosting estimator within the imputation algorithm.

Variable | MSE | RMSE | MAE | F-test p-value
G_a2 | 1.06 × 10^6 | 1030.534 | 377.893 | 0.232
G_a3 | 0.399 | 0.632 | 0.199 | 0.258
G_a4 | 1.757 | 1.325 | 0.445 | 0.136
G_a5 | 0.163 | 0.404 | 0.113 | 0.143
G_a6 | 0.028 | 0.169 | 0.060 | 0.185
G_a7 | 2.84 × 10^6 | 1685.302 | 535.409 | 0.097
G_a8 | 2.248 | 1.499 | 1.034 | 0.137
G_a9 | 6.07 × 10^5 | 779.478 | 261.106 | 0.096
G_a10 | 1.29 × 10^6 | 1139.792 | 219.017 | 0.146
G_b1 | 0.182 | 0.427 | 0.278 | 0.201
G_b2 | 0.187 | 0.433 | 0.271 | 0.233
G_b3 | 5.620 | 2.370 | 1.189 | 0.219
G_b4 | 5.721 | 2.392 | 1.142 | 0.182
G_b5 | 0.014 | 0.120 | 0.029 | 0.056
G_b6 | 0.022 | 0.150 | 0.039 | 0.061
G_b7 | 0.202 | 0.449 | 0.327 | 0.153
G_b8 | 0.032 | 0.179 | 0.090 | 0.166
Table 6. Evaluation metrics of MICE imputation, using the Extreme Gradient Boosting estimator within the imputation algorithm.

Variable | MSE | RMSE | MAE | F-test p-value
G_a2 | 2.9 × 10^6 | 1711.004 | 265.092 | 0.304
G_a3 | 0.288 | 0.536 | 0.166 | 0.279
G_a4 | 1.280 | 1.131 | 0.236 | 0.455
G_a5 | 0.137 | 0.370 | 0.076 | 0.226
G_a6 | 0.027 | 0.166 | 0.037 | 0.314
G_a7 | 4.60 × 10^6 | 2146.545 | 432.547 | 0.151
G_a8 | 1.378 | 1.174 | 0.745 | 0.186
G_a9 | 4.4 × 10^5 | 670.529 | 155.101 | 0.115
G_a10 | 1.78 × 10^6 | 1337.567 | 168.707 | 0.247
G_b1 | 0.192 | 0.438 | 0.258 | 0.293
G_b2 | 0.188 | 0.434 | 0.255 | 0.356
G_b3 | 4.262 | 2.064 | 1.038 | 0.346
G_b4 | 2.568 | 1.602 | 0.905 | 0.206
G_b5 | 0.012 | 0.111 | 0.024 | 0.078
G_b6 | 0.020 | 0.142 | 0.032 | 0.081
G_b7 | 0.174 | 0.417 | 0.295 | 0.254
G_b8 | 0.034 | 0.184 | 0.082 | 0.188
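The boosting variants of Tables 5 and 6 can be sketched by swapping the estimator inside the same chained-equation scheme; the snippet below uses the scikit-learn-compatible LGBMRegressor and XGBRegressor with default hyperparameters, which need not match those used in the study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
X[rng.random(X.shape) < 0.10] = np.nan   # synthetic missingness

# Same MICE-style loop, tree-boosting estimators instead of Bayesian Ridge.
X_lgbm = IterativeImputer(estimator=LGBMRegressor(verbose=-1),
                          max_iter=10, random_state=0).fit_transform(X)
X_xgb = IterativeImputer(estimator=XGBRegressor(),
                         max_iter=10, random_state=0).fit_transform(X)
```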
Table 7. Loadings of PCA for the imputed data set.

Variable | PC1 | PC2 | PC3 | PC4
G_a2 | 0.322 | −0.023 | −0.018 | −0.003
G_a3 | 0.304 | −0.019 | 0.005 | 0.010
G_a4 | 0.315 | −0.035 | −0.018 | 0.014
G_a5 | 0.310 | −0.013 | −0.047 | −0.025
G_a6 | 0.311 | −0.016 | −0.007 | 0.024
G_a7 | 0.324 | −0.025 | −0.018 | 0.002
G_a8 | −0.054 | 0.098 | −0.233 | 0.564
G_a9 | 0.322 | −0.021 | −0.027 | 0.007
G_a10 | 0.322 | −0.027 | −0.013 | −0.006
G_b1 | −0.110 | 0.594 | −0.002 | 0.024
G_b2 | −0.068 | 0.644 | −0.012 | 0.025
G_b3 | 0.261 | 0.304 | −0.014 | −0.092
G_b4 | 0.262 | 0.339 | −0.009 | −0.076
G_b5 | 0.053 | −0.018 | 0.644 | 0.261
G_b6 | 0.050 | 0.038 | 0.595 | 0.394
G_b7 | 0.180 | 0.010 | −0.218 | 0.329
G_b8 | 0.040 | 0.078 | 0.350 | −0.575
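A loadings table of this form can be extracted as the transposed eigenvector coefficients of a fitted PCA; the sketch below assumes standardized, imputed inputs (here synthetic), and note that some texts instead rescale these coefficients by the square roots of the eigenvalues.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = [f"G_a{i}" for i in range(2, 11)] + [f"G_b{i}" for i in range(1, 9)]
X_std = StandardScaler().fit_transform(rng.normal(size=(500, len(names))))

pca = PCA(n_components=4).fit(X_std)
loadings = pd.DataFrame(pca.components_.T, index=names,
                        columns=["PC1", "PC2", "PC3", "PC4"])
print(loadings.round(3))
print(pca.explained_variance_ratio_.round(3))  # cf. the scree plot of Figure 3
```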
Table 8. The final main engine models of the procedure along with their encoding and their frequency in the data set.

Encoding | Main Engine Model | Frequency
2 | 6S50MC-C | 1791
7 | 6S42MC | 1082
1 | 6UEC45LSE | 369
31 | 5S50MC-C | 369
13 | 6S46MC-C | 333
20 | 5S50ME-B9 | 231
43 | 6S46ME-B8 | 194
17 | 6S60MC-C | 189
19 | 6S50MC-C8 | 174
30 | 7S50MC-C | 168
4 | 6S60MC | 153
5 | 6UEC52LA | 135
23 | 6S50MC | 133
27 | 6RT-FL50 | 131
10 | 5S60ME-C8 | 127
25 | 6S70MC-C | 127
64 | 6S46MC-C8 | 115
0 | 6S70MC | 107
12 | 5S50MC | 107
22 | 6S50ME-C8 | 100
38 | 6S50ME-B9 | 95
49 | 6RTA48T | 94
9 | 6S60ME-C8 | 93
11 | 5S60MC-C | 93
28 | 6RT-FLEX50 | 92
6 | 6UEC45LSE-ECOB2 | 72
3 | 6S60MC-C8 | 52
15 | 5RT-FL50D | 52
59 | 5S60MC-C8 | 46
21 | 6S42MC7 | 43
71 | 7S35MC | 42
33 | 5G60ME-C9 | 38
110 | 5UEC45LSE | 36
97 | 5S50MC-C8 | 33
57 | 6RTA48T-B | 32
81 | 6RT-FL48T | 31
32 | 6UEC50LSII | 30
88 | 5RT-FLEX58T-B | 26
77 | 6G70ME-C9 | 24
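One way such integer encodings can arise is from a label encoder fitted on the full engine-model vocabulary before rare classes are filtered out, which would explain why the surviving encodings in Table 8 are not consecutive. The sketch below is illustrative only; the column name, data, and frequency threshold (24, the rarest model retained in Table 8) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the main-engine-model column.
rng = np.random.default_rng(0)
models = pd.Series(rng.choice(["6S50MC-C", "6S42MC", "6UEC45LSE", "7S35MC"],
                              size=500))

y = LabelEncoder().fit_transform(models)

# Keep only classes at least as frequent as the rarest model in Table 8.
counts = models.value_counts()
mask = models.isin(counts[counts >= 24].index).to_numpy()
y_kept = y[mask]
```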
Table 9. Performance of the evaluation metric (balanced accuracy) for the imputed data and the top-performing configuration of each of the three dimensionality reduction methods, using k-fold cross-validation. The time taken by each algorithm is also included. (BA = balanced accuracy; Time = time taken.)

Classification Algorithm | BA Imput | BA PCA-4 | BA t-SNE-10-3 | BA UMAP-10-3 | Time Imput | Time PCA-4 | Time t-SNE-10-3 | Time UMAP-10-3
ExtraTreesClassifier | 0.9507 | 0.8709 | 0.9020 | 0.8364 | 0.3544 | 0.4136 | 0.3401 | 0.4025
RandomForestClassifier | 0.9502 | 0.8577 | 0.8931 | 0.8125 | 0.5524 | 1.0609 | 0.6713 | 0.7047
BaggingClassifier | 0.9408 | 0.8319 | 0.8700 | 0.7875 | 0.1792 | 0.1886 | 0.1562 | 0.1689
DecisionTreeClassifier | 0.9340 | 0.8267 | 0.8859 | 0.7928 | 0.0299 | 0.0344 | 0.0282 | 0.0317
ExtraTreeClassifier | 0.9034 | 0.8240 | 0.8596 | 0.7730 | 0.0106 | 0.0100 | 0.0096 | 0.0094
LabelPropagation | 0.8848 | 0.5514 | 0.6281 | 0.4010 | 1.0217 | 0.7152 | 0.6709 | 0.6862
LabelSpreading | 0.8845 | 0.5311 | 0.6157 | 0.3954 | 1.6083 | 1.3164 | 1.2951 | 1.3110
KNeighborsClassifier | 0.7034 | 0.5683 | 0.7082 | 0.6526 | 0.1459 | 0.0377 | 0.0350 | 0.0346
LGBMClassifier | 0.5920 | 0.5301 | 0.5525 | 0.7635 | 1.4434 | 1.5696 | 1.5646 | 1.8766
LinearDiscriminantAnalysis | 0.5778 | 0.2008 | 0.1147 | 0.1032 | 0.0207 | 0.0113 | 0.0104 | 0.0103
GaussianNB | 0.5775 | 0.3126 | 0.1871 | 0.1583 | 0.0146 | 0.0112 | 0.0106 | 0.0108
LogisticRegression | 0.5093 | 0.2023 | 0.1378 | 0.1041 | 0.8430 | 0.7778 | 0.8007 | 0.7614
SVC | 0.4757 | 0.2498 | 0.3073 | 0.2490 | 1.0982 | 1.1509 | 1.1099 | 1.1175
LinearSVC | 0.4678 | 0.0917 | 0.0845 | 0.0725 | 1.4458 | 2.1846 | 0.4434 | 0.7941
NearestCentroid | 0.4491 | 0.2534 | 0.2421 | 0.2129 | 0.0107 | 0.0086 | 0.0086 | 0.0082
CalibratedClassifierCV | 0.4146 | 0.0812 | 0.0847 | 0.0698 | 5.6612 | 8.6005 | 2.0596 | 3.3967
SGDClassifier | 0.3895 | 0.1144 | 0.0899 | 0.0881 | 0.3062 | 0.1587 | 0.1249 | 0.1399
QuadraticDiscriminantAnalysis | 0.3193 | 0.4012 | 0.2680 | 0.2310 | 0.0259 | 0.0116 | 0.0116 | 0.0115
Perceptron | 0.3112 | 0.0869 | 0.0893 | 0.0788 | 0.1129 | 0.0651 | 0.0555 | 0.0595
PassiveAggressiveClassifier | 0.2612 | 0.0777 | 0.0835 | 0.0832 | 0.1347 | 0.0696 | 0.0623 | 0.0622
BernoulliNB | 0.2585 | 0.0503 | 0.0451 | 0.0459 | 0.0118 | 0.0093 | 0.0092 | 0.0090
RidgeClassifier | 0.1477 | 0.0481 | 0.0438 | 0.0619 | 0.0144 | 0.0122 | 0.0114 | 0.0116
RidgeClassifierCV | 0.1477 | 0.0481 | 0.0437 | 0.0615 | 0.0281 | 0.0212 | 0.0209 | 0.0204
AdaBoostClassifier | 0.1263 | 0.0865 | 0.0680 | 0.0714 | 0.4822 | 0.4750 | 0.4350 | 0.4289
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0256 | 0.0078 | 0.0060 | 0.0064 | 0.0059
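A benchmark of this kind can be sketched as a loop over scikit-learn estimators, scoring balanced accuracy under stratified k-fold cross-validation and recording wall-clock time; this is a hedged illustration with synthetic data and only two of the classifiers from Table 9.

```python
import time
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in features (e.g., 4 PCs)
y = rng.integers(0, 5, size=500)         # stand-in class labels

models = {
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    start = time.perf_counter()
    score = cross_val_score(model, X, y, scoring="balanced_accuracy", cv=cv).mean()
    print(f"{name}: balanced accuracy {score:.4f}, "
          f"time {time.perf_counter() - start:.2f}s")
```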
Table 10. The average balanced accuracy scores from five iterations of the k-fold cross-validation technique of the ExtraTreesClassifier across different "n_estimators" values (for imputed and non-imputed data as well as for multiple dimensionality reduction methods).

n_estimators | 10 | 50 | 100 | 150 | 500
Imput | 0.9491 | 0.9537 | 0.9479 | 0.9513 | 0.9541
UMAP-2-10 | 0.8089 | 0.8267 | 0.8205 | 0.8189 | 0.8273
UMAP-2-30 | 0.7309 | 0.7454 | 0.7468 | 0.7555 | 0.7523
UMAP-2-123 | 0.5896 | 0.6174 | 0.6142 | 0.6144 | 0.6186
UMAP-3-10 | 0.8251 | 0.8224 | 0.8322 | 0.8218 | 0.8357
UMAP-3-30 | 0.7672 | 0.7886 | 0.8015 | 0.7911 | 0.7963
UMAP-3-123 | 0.6823 | 0.7089 | 0.7200 | 0.7076 | 0.7177
PCA-2 | 0.7474 | 0.7673 | 0.7701 | 0.7753 | 0.7693
PCA-3 | 0.8228 | 0.8285 | 0.8280 | 0.8354 | 0.8282
PCA-4 | 0.8522 | 0.8594 | 0.8630 | 0.8614 | 0.8667
t-SNE-2-10 | 0.8880 | 0.9048 | 0.8954 | 0.8947 | 0.8941
t-SNE-2-30 | 0.8820 | 0.8910 | 0.8934 | 0.8910 | 0.8996
t-SNE-2-123 | 0.8786 | 0.8725 | 0.8801 | 0.8805 | 0.8813
t-SNE-3-10 | 0.8961 | 0.8913 | 0.8861 | 0.9029 | 0.9005
t-SNE-3-30 | 0.8919 | 0.8954 | 0.8927 | 0.8870 | 0.8899
t-SNE-3-123 | 0.8753 | 0.8840 | 0.8777 | 0.8841 | 0.8830
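The "n_estimators" sweep behind Table 10 can be sketched as follows; this minimal version uses synthetic data and a single cross-validation run rather than the five averaged repetitions described in the caption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in features
y = rng.integers(0, 5, size=500)         # stand-in class labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for n in [10, 50, 100, 150, 500]:        # ensemble sizes from Table 10
    clf = ExtraTreesClassifier(n_estimators=n, random_state=0)
    score = cross_val_score(clf, X, y, scoring="balanced_accuracy", cv=cv).mean()
    print(f"n_estimators={n}: {score:.4f}")
```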
Table 11. Confusion matrix of the 5 most frequent engine models (imputed data) and n_estimators = 50.

Index | 1 | 2 | 7 | 13 | 31
1 | 68 | 0 | 1 | 1 | 0
2 | 0 | 358 | 0 | 0 | 0
7 | 5 | 0 | 212 | 0 | 0
13 | 0 | 1 | 0 | 65 | 0
31 | 0 | 0 | 0 | 0 | 71
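A confusion matrix restricted to the five most frequent encodings (1, 2, 7, 13, 31) can be produced as sketched below; the data, split strategy, and sample sizes are illustrative assumptions rather than the study's setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
top5 = [1, 2, 7, 13, 31]                 # encodings of the 5 most frequent models
X = rng.normal(size=(600, 4))            # stand-in features
y = rng.choice(top5, size=600)           # stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te), labels=top5))
```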