Article

Ship Engine Model Selection by Applying Machine Learning Classification Techniques Using Imputation and Dimensionality Reduction

by Kyriakos Skarlatos 1, Grigorios Papageorgiou 2, Panagiotis Biris 2, Ekaterini Skamnia 2, Polychronis Economou 2,* and Sotirios Bersimis 1,*

1 Department of Business Administration, University of Piraeus, 18534 Piraeus, Greece
2 Department of Civil Engineering, University of Patras, 26504 Patras, Greece
* Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2024, 12(1), 97; https://doi.org/10.3390/jmse12010097
Submission received: 3 December 2023 / Revised: 31 December 2023 / Accepted: 31 December 2023 / Published: 3 January 2024
(This article belongs to the Special Issue Machine Learning and Modeling for Ship Design)

Abstract: The maritime industry is facing a gradual proliferation of data, frequently coupled with subpar information that contains missing and duplicate data, erroneous records, and flawed entries as a result of human intervention or a lack of access to sensitive and important collaborative information. Data limitations and restrictions have a crucial impact on inefficient data-driven decisions, leading to decreased productivity, augmented operating expenses, and a consequent substantial decline in competitive edge. The missing or inadequate presentation of significant information, such as the vessel's main engine model, critically affects its capabilities and operating expenses as well as its environmental impact. In this study, a comprehensive analysis was carried out, applying and comparing several machine learning classification techniques to classify a ship's main engine model, along with different imputation methods for handling the missing values and dimensionality reduction methods. The classification is based on the technical and operational characteristics of the vessel, including the physical dimensions, various capacities, speeds and consumption. Briefly, three dimensionality reduction methods (Principal Component Analysis, Uniform Manifold Approximation and Projection, and t-Distributed Stochastic Neighbor Embedding) were considered and combined with a variety of classifiers and the appropriate parameters of the dimensionality reduction methods. According to the classification results, the ExtraTreesClassifier with PCA with 4 components, the ExtraTreesClassifier with t-SNE with perplexity equal to 10 and 3 components, and the same classifier with UMAP with 10 neighbors and 3 components outperformed the rest of the combinations. This classification could provide significant information for shipowners, helping them optimize and enhance the vessel's operation.

1. Introduction

The marine industry might greatly benefit from machine learning in a variety of areas, including, among others, increased productivity [1], security [2] and decision making [3]. Specific tasks of machine learning, for instance, may involve the examination of sensor data from equipment and ships to determine when a repair is required [4]. Further, using the real-time monitoring of ship conditions, potential safety hazards or security threats could be identified [5]. Another example is an inexplicably increased consumption of oil, for which oil spills might be the cause [6]. Machine learning techniques can assist in developing preventative strategies that lower downtime and maintenance expenses, while increasing vessel safety. Similarly, they can assist in analyzing the performance of vessels by monitoring the engine and other components, which is a great help to operators in making data-driven decisions related to increased productivity and savings in operating expenses [7]. In addition, machine learning can improve monitoring and ensure compliance with environmental regulations, such as those pertaining to emissions, by tracking and examining data on environmental effects [8]. Although the maritime sector does not provide open-access data, it is crucial to have access to high-quality data from numerous sensors and sources on board the ship to properly use the ship's data in machine learning applications [9].
In general, the marine sector has many diverse stakeholders, including shipowners, operators, shipping firms, port agents and regulatory organizations. Data may get scattered among numerous entities as a result of this fragmentation, making it difficult to centralize and communicate information. Further, the marine industry is highly concerned about data privacy and security; thus, sharing certain types of data, such as vessel tracking information, can raise security and commercial concerns. Shipowners may also be hesitant to provide vital operating information, such as routes and cargo, due to worries concerning piracy, theft, or corporate espionage [10]. Businesses might consider the data they have in their possession as a competitive advantage in the fiercely competitive shipping sector [11]. As a result, they can be reluctant to divulge information that would help their rivals or weaken their negotiating position in the market.
Furthermore, numerous vessels, particularly older ones, lack advanced data-capturing and transmitting capabilities. Modernizing the entire global fleet to provide instantaneous data would present a considerable and costly undertaking. It is also worth noting that the marine business is significantly impacted by a complex network of international and national regulations [12]. Concerning data reporting requirements, the range of standards can be expansive, and adhering to them may prove challenging. For all these reasons, harmonizing data standards and ensuring global compliance pose a fundamental challenge [13].
A contentious issue that can be detected in the marine industry is the ownership and control of data. In particular, shipowners may assert rights over the data generated by their vessels, whereas other parties advocate for greater openness and exchange. Moreover, dependable internet and a communication infrastructure are often scarce at sea, particularly in heavily trafficked or remote maritime areas [14]. This lack of connectivity could impede real-time data transmission by ships. Additionally, establishing systems for data collection and exchange may prove costly for smaller enterprises, which means that the financial resources to invest in digital infrastructure may simply not be available.
Despite these difficulties, attempts are being undertaken to increase the accessibility to marine data [15]. The amount of data available in the maritime industry is gradually growing, as a result of initiatives like the International Maritime Organization’s (IMO) mandatory reporting requirements, the use of satellite-based tracking systems like the Automatic Identification System (AIS), and improvements on ‘Internet of Things’ (IoT) technologies [16]. Increased standardization and coordination efforts among industry players can also aid in addressing some of the issues related to marine data sharing.
To create and implement effective machine learning techniques in the maritime sector, it is also essential to collaborate with specialists in machine learning and data science who have domain expertise in maritime operations. In general, the models that are created can be unsupervised, where structure is sought in unlabeled data (mainly used for cluster analysis [17]), or supervised, when a model is built on input data to describe and analyze the output data. Regression (the prediction of continuous numeric values) and classification (the identification of groups or categories for data points) are typical examples of supervised machine learning tasks. Indeed, in the marine industry, supervised machine learning has a wide range of applications [18].
In this work, a procedure with the aid of supervised machine learning is proposed to classify the most frequent vessel’s main engine model types. The resulting classification model may be exploited for optimizing the design and operation of a vessel, for faster engine selection, for developing optimal strategies for the new vessel’s use, as well as for evaluating the ship’s performance before engine placement since it is based on the technical and operational characteristics of the vessel. Furthermore, in this work, to conclude with the optimal model, multiple imputation methods are used and compared in order to bypass the problem of missing data, while multiple dimensionality reduction methods are considered and combined with a variety of classifiers and the appropriate parameters of the dimensionality reduction methods.
The rest of the paper is organized as follows. Section 2 presents the motivation behind this work. In Section 3, the methods along with the materials employed are introduced. More specifically, this contains a short reference to the data collection and preprocessing, the data imputation procedure, the dimensionality reduction methods that were considered, and finally the classification techniques along with a short description of their performance evaluation. Next, in Section 4, the data of the application are presented along with a description and an exploratory analysis. The main and most significant results are displayed in Section 4.5, and last but not least, the conclusions that can be drawn from the proposed analysis are presented in Section 5.

2. Motivation

An attribute that is oftentimes missing, both for privacy reasons and for lack of data availability, is the model of the engine. It could be considered among the most advantageous pieces of information for shipowners. Indeed, the choice of an engine model for a ship can have a profound impact on the shipping industry in a competitive context. The engine model plays a key role in determining the fuel efficiency of a ship, leading to lower costs for shipping companies since fuel consumption is among the most significant costs [19]. More efficient engines can effectively reduce operating costs, making a company more competitive by offering lower shipping rates.
The quantity of data collected is pivotal in data analysis, as it reduces ambiguity and leads to stronger findings. Thus, it plays a crucial role in the process. Regarding marine transportation, however, the amount collected is relatively small [3]. As a consequence, compared to other industries, the use of machine learning techniques in marine transportation is limited [3]. In addition, given the poor quality and the limited size of the data, bias is unavoidable in most cases [20]. In an effort to overcome such issues, as well as to cover a broad framework of interest spanning a variety of issues, we propose the use of supervised machine learning for the classification of the vessel's main engine model type.
As environmental regulations become more stringent [21], ships with more environmentally friendly engines, such as those with lower emissions or those burning alternative fuels such as LNG (liquefied natural gas), can gain a competitive advantage. Companies investing in engine technology can comply with the regulations and potentially benefit from favorable incentive treatment. Besides the above, engine models also affect the speed and performance of a ship [22]. In some cases, faster ships may command higher prices for faster delivery times, while others may prioritize slow sailing for fuel savings. Engine selection should, however, be aligned with the company's strategic objectives and the market's demands.
There are occasions where shipping companies employ engine model selection as a marketing strategy to distinguish themselves in the market. They may highlight their dedication to sustainability, fuel efficiency, or state-of-the-art technology to appeal to customers who are environmentally conscious or focused on efficiency. During the ship design process, the selection of the main engine model is a crucial factor. The selection process is influenced by various factors, including power, fuel consumption, purchase price, service life and maintenance cost. At the beginning of the design process, several types of main diesel engines may be considered: while some main engines may be expensive, they may offer high reliability and low fuel consumption, whereas inexpensive ones often have high fuel consumption and failure rates [23]. Compliance with international and regional regulations related to emissions, fuel quality, and safety is of vital importance. The engine model that is chosen must be in compliance with these standards to avoid legal issues, penalties, and/or operational disruptions [24]. In the global shipping industry, competition comes from companies around the world. Engine model selection can help a company compete on a global scale, ensuring that it meets international standards and customer expectations.
The reliability and maintenance requirements of an engine can also affect the competitiveness of a shipping company. Engines requiring less maintenance and higher reliability can lead to reduced downtime, lower repair costs and a better reputation for on-time deliveries overall.
While the initial cost of an engine is important, companies must also consider long-term costs. Engines with higher initial costs but lower operating costs over their lifetime can provide a competitive advantage in the long term. Furthermore, companies that invest in engines designed to be adaptable and compatible with future technologies (such as alternative fuels or hybrid systems) can position themselves for long-term competitiveness as the industry evolves.
It should also be noted that maintaining competitiveness in the marine industry often involves the adoption of the latest technological developments in engine design. Newer engine models may incorporate advanced features, such as improved automation, data analysis and remote monitoring, which can improve operational efficiency [25].
In conclusion, the selection of a ship’s engine model has significant effects on a shipping company’s ability to compete. Cost, adherence to environmental regulations, operational effectiveness and market position are all impacted. Therefore, companies should carefully assess and select engine types that match their strategic aims and market demands if they want to compete in the extremely demanding and competitive shipping sector.
Last but not least, the classification model provided in the following sections can be exploited as a calibration model for optimizing vessel design (i.e., which engine is optimal for a vessel of specific dimensions and characteristics) under the general restrictions imposed by the company (strategy, etc.). Furthermore, the resulting classification model can be used for faster engine selection, for developing optimal strategies for the new vessel's use, as well as for developing digital twin applications (or simulators), allowing the evaluation of the ship's performance before engine placement, which results in reduced operating costs and improved environmental compliance. Moreover, digital twins can be valuable in performance monitoring, predictive maintenance, optimization of operations, enhanced safety, risk management, support for design, innovation, decision making, further data analysis by exploiting simulation results, lifecycle management, etc. [26,27,28].

3. Materials and Methods

In general, the main idea of the classification of a vessel’s main engine model is achieved through the concept of learning from examples, which has simply been formalized through supervised learning.
More specifically, the design of the process includes four basic steps (see Figure 1) that were implemented in the Python (version 3.9.7) programming language. The initial step, as in most studies, is the selection of the key variables and the preprocessing of the observed/available data. The next step is to substitute the missing values using a suitable imputation method, such as "Multiple Imputation by Chained Equations" (MICE). At that point, the data-forming process is complete, and machine learning techniques come into play. In the present paper, several types of classification algorithms along with dimensionality reduction methods are applied and compared.

3.1. Data Collection

The initial and most principal part in conducting an analysis or an application is undoubtedly related to the data collection. Considering that there is a variety of data sources and data sets, the necessity of deriving a more unified and complete data set has emerged. More precisely, assume $n$ data sources (or providers), indexed by $i = 1, \dots, n$, with each one maintaining information in $k_i$ data sets (complementary or not), where every data set comprises $p_{ik_i}$ variables. The notation $p_{ik_i}$ refers to the total count of variables that exist in the $k_i$-th data set of the $i$-th provider.
After collecting the data, data integration involves combining these various sets of data into a single, cohesive data set. This procedure is frequently carried out when there is a need to consolidate information stemming from different sources, databases, or formats for analytical purposes, comparison or to extract valuable insights.
Typically, merging data requires identifying common elements or keys in the data sets in order to accurately link and consolidate the information. Of course, the variables of all the data sets do not have to be common since it is logical to assume that they have been created for different purposes. As a consequence, an appropriate selection of variables is crucial. This selection might depend on the purpose of the analysis; thus, only the necessary—for the scientist—variables are kept or could be an outcome of statistics, for instance, a procedure that could be incorporated in the stage of preprocessing. Furthermore, source selection and data collection must be continuous processes, with regular evaluations and updates, to preserve data precision and relevance.

3.2. Preprocessing/Exploratory Data Analysis

As a primary stage of research, data collection is typically the most time-consuming task. Gathering valuable and relevant information is an unstructured process that also demands the collector’s judgment. To ensure data quality and the choice of quality sources, apart from establishing clear data collection procedures, it is essential to validate data through audits or checks, use established best practices and conduct thorough research on the sources’ reputations.
The quality of outcomes is contingent upon data collection and preprocessing procedures. Hence, the methodical selection, consolidation and exploration of data features hold key importance. Data stored in complex systems for scientific studies are often unstructured, making it a prerequisite to execute fundamental steps beforehand, wherever feasible.
Therefore, data preprocessing and data exploration are usually a necessary part of data analysis before proceeding to the construction of prediction models. More precisely, during this procedure, a feature selection could be employed if required, while the available data set is also explored for missing values, duplicated rows and, generally, the total counts of each variable, and is then handled appropriately. Furthermore, possible existing relations between the variables could lead to useful conclusions concerning the construction of machine learning models. As a consequence, correlations between variables should be taken into consideration.

3.3. Data Imputation

After observing the number of missing values in each variable, the next aim is to try to predict these missing values, as a process of gathering additional information that can be used for safer and more robust predictions. Another approach could be the deletion of those variables having an excessive number of missing values; however, if there is a large number of missing values in the data set, important information will be excluded, leading to biased and unreliable prediction models [29]. Thus, the best option considered is the estimation of the missing values of the data set using a statistical method.
In order to achieve that, and taking into consideration the missing values of the data set, an imputation method is required. Commonly, imputation is the procedure of calculating alternative values, based on the current data and appropriate statistical methods, in order to replace those that are missing. For the imputation of missing values, the "Multiple Imputation by Chained Equations" (MICE) approach is considered [30].
The purpose of this statistical technique is to impute missing values by using other features from the data set through an iterative series of predictive models. Initially, the MICE algorithm sets as the response variable $Y$ each variable that contains missing values and as $X_i$ the other variables that are considered predictors.
The model is constructed by using observations where $Y$ is present. Then, the estimated model is applied to predict the missing observations. The process is repeated several times, after selecting a random sample of data in each iteration of the algorithm, and finally, the missing values are replaced with the mean of the predictions. The MICE algorithm assumes that missing data are either missing at random (MAR) or missing completely at random (MCAR) [31]. If this assumption is not true, then the resulting estimates might be biased and therefore not reliable. In any case, MCAR is preferable, since multiple imputation is then not only valid but also expected to be unbiased, whereas under MAR a negligible bias can be present [32]. It is of note that Little's test can be implemented to check whether data are MCAR or not [33]. Several models can be used to describe the relationship between the variables in the context of MICE (e.g., the Light Gradient Boosting [34], Extreme Gradient Boosting [35] and Bayesian Ridge regressors [36]).
In order to test the quality, accuracy and reliability of the MICE imputation approach, three evaluation metrics are used to compare the actual values to the predicted values, on a subset of known data that was hidden randomly in each testing trial. These three evaluation metrics are the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) [37,38]. In addition to these three evaluation metrics, the accuracy of the algorithms in predicting missing values was assessed using the F-test for the general linear hypothesis. More specifically, the F-test was used to test, simultaneously, whether the intercept and slope of the population linear relationship are equal to zero and unity, respectively, i.e., to test whether the generated predictions are randomly scattered around the 45° line in a scatterplot of the generated predictions versus the real values, demonstrating that the predictions do not systematically over- or underestimate the corresponding real values. This test has been employed in several previous studies (see, for example, [39,40,41,42]) in order to evaluate the performance of a prediction algorithm. A detailed presentation of this test can be found in [43].
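As an illustration of this evaluation scheme, the following is a minimal sketch (not the authors' code) that pairs scikit-learn's IterativeImputer, a MICE-style imputer, with the three metrics and the F-test described above; the data, the hidden-value fraction and the Bayesian Ridge estimator choice are all hypothetical.

```python
# A hedged sketch of MICE-style imputation and its evaluation; all data and
# parameter values below are illustrative, not those of the study.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)), columns=["x1", "x2", "x3", "x4"])
df["x4"] = 0.5 * df["x1"] - 0.3 * df["x2"] + rng.normal(scale=0.1, size=500)

# Hide 10% of the known x4 values to form a test set for the imputer.
mask = rng.random(len(df)) < 0.10
true_vals = df.loc[mask, "x4"].to_numpy()
df_missing = df.copy()
df_missing.loc[mask, "x4"] = np.nan

imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10,
                           random_state=0)
imputed = pd.DataFrame(imputer.fit_transform(df_missing), columns=df.columns)
pred = imputed.loc[mask, "x4"].to_numpy()

mse = mean_squared_error(true_vals, pred)
print(f"MSE={mse:.4f}  RMSE={np.sqrt(mse):.4f}  "
      f"MAE={mean_absolute_error(true_vals, pred):.4f}")

# F-test of the general linear hypothesis: intercept = 0 and slope = 1 in
# the regression of predictions on real values (the 45-degree line).
ols = sm.OLS(pred, sm.add_constant(true_vals)).fit()
print("F-test p-value:", ols.f_test("const = 0, x1 = 1").pvalue)
```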

3.4. Dimensionality Reduction Methods

Dimensionality reduction is often used before a classification algorithm is applied in order to remove redundant features and noisy, irrelevant data, and thus improve the learning accuracy [44]. For the purpose of determining the optimal technique for dimensionality reduction on the basis of our data set, various approaches were assessed.
In particular, Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP) and t-Distributed Stochastic Neighbor Embedding (t-SNE) were used. These three methods are briefly discussed next.

3.4.1. Principal Components Analysis—PCA

As an additional step in the analysis, Principal Component Analysis (PCA) is performed on the available data set, after the imputation process, in order to create new uncorrelated variables called components that are linear combinations of the initial data. The advantage of PCA implementation is the dimensionality reduction in the initial data set and the fact that the newly created components will explain as much of the variation of the original data as possible. Furthermore, these components might describe latent relations that could be a corollary of the existence of some common factors that would possibly develop these secret relations. For details on the methodology and the applications of PCA, the reader may refer to [45,46,47,48].
The total number of principal components that need to be preserved can be obtained by using either graphical techniques, such as the “Scree plot” or the “Explained Variance plot”, or by using a widely known criterion introduced by Kaiser in 1960, which assumes that principal components that have eigenvalues greater than 1 should be preserved [49].
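As a brief illustration, the sketch below applies the Kaiser criterion with scikit-learn; the data set is a hypothetical placeholder, and the eigenvalues are read off the fitted PCA's explained variances.

```python
# A hedged sketch of the Kaiser criterion: keep the principal components of
# the standardized data whose eigenvalues exceed one. X is placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(1).normal(size=(200, 8))  # hypothetical data set
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)
eigenvalues = pca.explained_variance_  # eigenvalues of the covariance matrix
n_keep = int(np.sum(eigenvalues > 1.0))  # Kaiser criterion
print(f"components kept: {n_keep}, variance explained: "
      f"{pca.explained_variance_ratio_[:n_keep].sum():.2%}")

scores = PCA(n_components=n_keep).fit_transform(X_std)  # reduced data
```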

3.4.2. Uniform Manifold Approximation and Projection—UMAP

Another method for dimensionality reduction is the UMAP technique. This method was introduced a few years ago [50], and its purpose is to reduce high-dimensional data to a lower-dimensional space, preserving as much of both the data's global and local structure as possible. Its assumption is that the data are uniformly distributed across a manifold which can be projected onto a lower-dimensional space. In contrast to PCA, the UMAP reduction method can capture the non-linear structure in high-dimensional data. Usually, it is used for visualization purposes [51] or as a preprocessing technique before classification [52] or clustering [53]. In the literature, many techniques besides UMAP are used to visualize multivariate data in real-life applications (e.g., Andrews curves; see, for example, [54,55]).
Concerning the dimensionality reduction, deciding the optimal values for the hyperparameters of the algorithm is considered a challenging task when using this method. In this case, many values for the hyperparameters of the UMAP algorithm, such as the number of neighbors, the metric that will be used to compute the distances between data points and the number of components, should be tested and compared. For visualization purposes, the UMAP algorithm can represent the high-dimensional data adequately when two or three components are being used. However, the number of components can be tuned in order to achieve the best possible performance when using additional machine learning techniques, such as clustering and classification.
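A minimal sketch of such a hyperparameter sweep, assuming the umap-learn package and placeholder data, is given below; the value grids mirror those examined in Section 4.5.

```python
# A hedged sketch of a UMAP hyperparameter sweep with the umap-learn package;
# the data are placeholders and the grids mirror the values of Section 4.5.
import numpy as np
import umap  # pip install umap-learn

X_std = np.random.default_rng(2).normal(size=(300, 8))  # hypothetical data

embeddings = {}
for n_neighbors in (10, 30, 123):          # neighborhood size
    for n_components in (2, 3):            # target dimension
        reducer = umap.UMAP(n_neighbors=n_neighbors,
                            n_components=n_components,
                            metric="euclidean",  # distance between points
                            random_state=42)
        embeddings[(n_neighbors, n_components)] = reducer.fit_transform(X_std)
```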

3.4.3. t-Distributed Stochastic Neighbor Embedding

In addition to the aforementioned methods of PCA and UMAP, the t-Distributed Stochastic Neighbor Embedding (t-SNE) visualizes high-dimensional data by giving every data point a location on a two-dimensional or three-dimensional map. The t-SNE technique was introduced in [56] and has been used for visualization in a wide range of applications, including natural language processing [57], geological data [58,59,60], health domain [61,62,63] and many others.
This method is a version of Stochastic Neighbor Embedding [64] that reduces the propensity of points to concentrate in the center of the map, making it much easier to optimize and yielding much better visuals. With regard to producing a single map that displays structure at a variety of scales, t-SNE performs better than previous methods. The main parameter of this method that needs to be initialized is the so-called perplexity parameter, which estimates the number of nearby points each point has. According to the original article, t-SNE displays reasonably robust performance under changes in perplexity, with typical values ranging from 5 to 50. The results of this study (see Section 4.5) include a comparison of the performance of t-SNE with that of PCA and UMAP.
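The following is a minimal sketch of t-SNE with scikit-learn over the perplexity values considered later; the input data are placeholders.

```python
# A hedged sketch of t-SNE embeddings with scikit-learn over the perplexity
# values considered in Section 4.5; the input data are placeholders.
import numpy as np
from sklearn.manifold import TSNE

X_std = np.random.default_rng(3).normal(size=(300, 8))  # hypothetical data

for perplexity in (10, 30, 123):
    emb = TSNE(n_components=3, perplexity=perplexity,
               init="pca", random_state=42).fit_transform(X_std)
    print(f"perplexity={perplexity}: embedding shape {emb.shape}")
```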

3.5. Classification

Classification involves the systematic categorization of data or objects into predefined distinct classes or groups, based on shared characteristics or attributes. This formal process relies on the definition of specific rules or algorithms, often derived from training data, to assign items to appropriate categories. It provides a structured framework for organizing and understanding complex data sets, enabling efficient information retrieval and pattern recognition. Evaluating this process is of paramount importance to ensure the reliability and effectiveness of classification outcomes. Evaluation serves as a critical checkpoint in the entire workflow, allowing practitioners to gauge the performance of their classification models.
Classification algorithms are formalized computational methods that play a pivotal role in machine learning and data analysis. These algorithms are designed to categorize data points or instances into predefined classes or categories based on their attributes. The formalization of classification algorithms involves defining a set of mathematical or logical rules, which are often learned from training data. Common formal classification algorithms include, among others, decision trees, k-nearest neighbors, support vector machines, and neural networks. The following subsections describe the classifiers with the highest performance among those tested. Due to the availability of numerous algorithms that could have been included in this study, the selection of those that will be reported was based on their performance as detailed in Section 4.5.

3.5.1. Decision Tree Classifier

Decision trees (DTs) are a non-parametric method for supervised learning, commonly used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules derived from the characteristics of the data; in effect, a decision tree approximates the target function with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more precise the model.
Decision trees offer several advantages that make them a popular choice in machine learning and data analysis. Firstly, they are simple to understand and interpret, as they can be visualized and represented in an intuitive manner. Moreover, decision trees require minimal data preparation, eliminating the need for extensive data normalization or the creation of virtual variables. Additionally, the computational cost of using a decision tree for data prediction scales logarithmically with the number of data points used to train the tree, while they can handle both numerical and categorical data. They are particularly beneficial for multi-output problems and are known for their transparency, as they are considered white box models.
However, decision trees are not without limitations. They can become overly complex, resulting in poor generalization to new data, a phenomenon known as overfitting. To mitigate this issue, techniques like pruning, setting a minimum number of samples at leaf nodes, or specifying a maximum tree depth are necessary. Decision trees can also be unstable, with small changes in the data leading to entirely different tree structures. The predictions made by decision trees are not continuous, making them less suitable for extrapolation tasks. Moreover, learning an optimal decision tree is a computationally challenging problem, often relying on heuristic algorithms that may not guarantee a globally optimal solution. Finally, decision tree models can become biased if some classes dominate the data set, making it essential to balance the data before fitting a decision tree.
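As a concrete, hedged illustration of these mitigation techniques (not code from the paper), the scikit-learn sketch below caps tree depth, sets a minimum leaf size and applies cost-complexity pruning on synthetic data.

```python
# A hedged sketch of the overfitting controls listed above, applied to a
# synthetic multi-class problem (all parameter values are illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

tree = DecisionTreeClassifier(max_depth=6,          # cap the tree depth
                              min_samples_leaf=10,  # minimum samples per leaf
                              ccp_alpha=1e-3,       # cost-complexity pruning
                              random_state=0)
tree.fit(X_train, y_train)
print(f"test accuracy: {tree.score(X_test, y_test):.3f}")
```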

3.5.2. Random Forest Classifier

The RandomForestClassifier and RandomForestRegressor classes, provided in the scikit-learn library, represent implementations of the random forest algorithm [65] for classification and regression purposes, respectively. This approach is a prominent ensemble learning method extensively utilized in machine learning. The ensemble comprises various decision trees constructed during the training phase. Each tree is trained on a randomized subset of the data, including a randomly selected subset of features for each split.
In the case of classification, the final prediction is determined through a majority vote among the constituent trees. In the case of regression, the predicted value is derived from the average prediction across the entirety of trees. This method fortifies the model, counteracting overfitting and improving generalizability. Parameters common to the two classes include the number of trees in the forest (n_estimators), the maximum depth of every tree (max_depth) and the number of features assessed for every split (max_features).
In contrast to the original paper [66], the scikit-learn implementation combines classifiers by averaging their probabilistic predictions, rather than letting each classifier vote for a single class.
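A minimal sketch of these parameters on placeholder data follows; predict_proba exposes the probability averaging described above.

```python
# A hedged sketch of the random forest parameters named above on synthetic
# data; predict_proba exposes the probability averaging across the trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

forest = RandomForestClassifier(n_estimators=100,     # trees in the forest
                                max_depth=None,       # grow each tree fully
                                max_features="sqrt",  # features per split
                                random_state=0)
forest.fit(X, y)
proba = forest.predict_proba(X[:5])  # class probabilities averaged over trees
```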

3.5.3. Extra Trees Classifier

Extra Trees is an ensemble learning method that operates on the foundation of decision trees. It sets itself apart through a unique and deliberate injection of randomness during the tree construction process. Unlike traditional decision trees that carefully select the optimal split point for each node based on a subset of features, Extra Trees takes a more radical approach: it not only considers a random subset of features for each split but also randomly selects the split point without assessing optimality. The outcome of this process is the creation of a forest of fully randomized trees, dissociated from the nuances of the training data's output values.
Essentially, Extra Trees uses the strength of chance to build a forest of decision trees that, when combined, provide a powerful solution for predictive modeling. The algorithm's ability to strike a balance between predictive accuracy and randomness makes it a useful tool in machine learning, especially when dealing with noisy or high-dimensional data. Lastly, possible data set imbalances can lead to biased models, making data balancing essential before fitting an ExtraTreesClassifier [67].
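A hedged sketch on synthetic imbalanced data follows; class_weight="balanced" is shown as one simple counterweight to imbalance, not necessarily the remedy used in this study.

```python
# A hedged sketch of ExtraTreesClassifier on a synthetic imbalanced problem;
# class_weight="balanced" is one illustrative way to reweight rare classes.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, weights=[0.7, 0.2, 0.1],
                           random_state=0)

extra = ExtraTreesClassifier(n_estimators=100,
                             class_weight="balanced",  # reweight rare classes
                             random_state=0)
extra.fit(X, y)
```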

3.5.4. Naive Bayes Classifiers

The Naive Bayes classifiers are a family of classification algorithms whose operation is based on Bayes' theorem. That is, in data classification problems, these kinds of algorithms predict that the class to which an observation belongs is the category that maximizes the posterior probability. These classification algorithms assume that the variables are independent, and thus, that there are no pairs of correlated variables. Further, each variable contributes equally to predicting the value of the dependent variable and hence the category to which the data belong. The Naive Bayes classifier, the Gaussian Naive Bayes classifier, and the Bernoulli Naive Bayes classifier belong to this category of classification algorithms.
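A minimal sketch of the Gaussian and Bernoulli variants on placeholder data, assuming scikit-learn's naive_bayes module, follows.

```python
# A hedged sketch of the Gaussian and Bernoulli Naive Bayes variants named
# above, fitted to placeholder data (BernoulliNB binarizes features at 0.0).
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB, GaussianNB

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

for model in (GaussianNB(), BernoulliNB()):
    model.fit(X, y)
    print(type(model).__name__, f"training accuracy {model.score(X, y):.3f}")
```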

3.5.5. K-Nearest Neighbor Classifier

The k-nearest neighbors algorithm [68] is a non-parametric supervised machine learning model for classification and regression that makes no assumptions about the data set. The KNN algorithm attempts to predict the category to which new data belong by collecting information from a training data set. This is achieved by considering the similarity between observations: it assumes that data points that are in close proximity are more similar and hence will belong to the same category. This method assigns a category label to a new data point based on the category to which its k-nearest neighboring data points in the training data set belong. In order for the algorithm to find the "k-nearest neighbors" of a new data point, it must calculate the distance between this new data point and the existing data points in the p-dimensional feature space. An algorithm that works in a similar way to KNN is the "Nearest Centroid" classifier [69], and this algorithm was evaluated as well for its predictive accuracy.
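A minimal sketch comparing the two distance-based classifiers on placeholder data follows; standardizing the features first is common practice for such methods, not a step prescribed by the paper.

```python
# A hedged sketch comparing k-nearest neighbors and nearest centroid; both
# are distance-based, so the features are standardized first.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
centroid = make_pipeline(StandardScaler(), NearestCentroid())
for model in (knn, centroid):
    model.fit(X, y)
```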

3.5.6. Other Classifiers

The category of linear classifiers includes all those algorithms that attempt to predict the category in which the data should be classified, based on the linear combinations of the independent variables in the data set [70]. There are many algorithms that use linear combinations of the features in order to make a classification decision. In this work, the classifiers of this category that were tested and evaluated were the support vector machines classifier, Linear SVC, Logistic Regression, Ridge classifier, Ridge CV, SGD classifier, Passive Aggressive classifier [71], linear Perceptron classifier, and Linear Discriminant analysis.
Another broad category of classifiers that were considered and employed is the Label Propagation Algorithms. This classification method and its algorithms are considered a semi-supervised machine learning technique, which uses the labels of the already labeled data points with the aim to predict the labels of the unlabeled data points of the data set [72]. The two algorithms that were employed in this paper are Label Propagation and Label Spreading.

3.5.7. Ensemble Learning

The Bagging Classifier, also known as Bootstrap Aggregating, operates by generating several subsets of the training data, utilizing random sampling with replacement. Each subset trains a base learning algorithm independently, producing a varied set of models. The ultimate prediction is subsequently determined by merging the individual model predictions by either majority voting for classification tasks or averaging for regression tasks.
When random subsets of the samples are drawn without replacement, this algorithm is referred to as Pasting [73]. If samples are drawn with replacement, the method is referred to as Bagging [74]. When random subsets of the data set are chosen as random subsets of the features, the resulting technique is referred to as Random Subspaces [75]. Finally, if the base estimators are built on subsets of both samples and features, the method is referred to as Random Patches [76].
Apart from Bagging, ensemble learning also includes the Boosting classifiers. Boosting combines weak classifiers of mediocre prediction accuracy into a strong classifier that can predict the class of given data points with high accuracy and small error [77]. From the related set of Boosting algorithms, and in the context of this study, the Light Gradient Boosting Machine classifier and the AdaBoost classifier were used and evaluated.
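A hedged sketch of the two ensemble families on synthetic data follows; the estimator keyword assumes a recent scikit-learn version (older releases name it base_estimator).

```python
# A hedged sketch of the two ensemble families above: Bagging over decision
# trees (bootstrap=True means sampling with replacement) and AdaBoost.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)

bagging = BaggingClassifier(estimator=DecisionTreeClassifier(),
                            n_estimators=50,
                            max_samples=0.8,  # fraction of samples per model
                            bootstrap=True,   # with replacement => Bagging
                            random_state=0)
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
for model in (bagging, boosting):
    model.fit(X, y)
```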

3.6. Evaluation

Evaluation metrics for classifiers are essential tools for assessing the performance of machine learning models in solving classification problems. These metrics provide valuable insights into the model’s ability to correctly classify instances and help data scientists and machine learning practitioners make informed decisions. One commonly employed metric is accuracy, which measures the proportion of correctly classified instances to the total number of instances in the data set. While accuracy is a fundamental metric, it may not always be the most appropriate choice, especially when dealing with imbalanced data sets.
Balanced accuracy is an alternative evaluation metric that addresses the limitations of accuracy in imbalanced data sets. In imbalanced scenarios, where one class significantly outnumbers the other(s), a classifier may achieve high accuracy by simply predicting the majority class, while ignoring the minority class. Balanced accuracy, on the other hand, takes into account the sensitivity (True Positive Rate) and specificity (True Negative Rate) of the classifier. It calculates the arithmetic mean of these two rates, providing a more comprehensive view of the model’s performance. This metric ensures that both the minority and majority classes are given equal importance, making it particularly valuable when the cost of misclassifying the minority class is higher, or when one class is more critical than the other in real-world applications.
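A tiny worked example makes the difference concrete: on a 90/10 imbalanced problem, a classifier that always predicts the majority class attains 0.90 plain accuracy but only (1 + 0)/2 = 0.50 balanced accuracy.

```python
# A small worked example of accuracy versus balanced accuracy on a 90/10
# imbalanced problem: the majority-class predictor looks accurate but
# achieves only (1 + 0) / 2 = 0.5 balanced accuracy.
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)  # 90/10 class imbalance
y_pred = np.zeros(100, dtype=int)       # always predict the majority class

print(accuracy_score(y_true, y_pred))           # 0.90
print(balanced_accuracy_score(y_true, y_pred))  # 0.50
```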

4. Application

In addressing missing data within our data set, a formalized approach that combines imputation methods for handling numerical missing values and classification techniques for predicting categorical values is performed. Imputation is preferred because it preserves partial records and simultaneously fills in the missing data by making educated guesses. This method is especially pertinent when the columns with missing data are valuable and make a significant contribution to the data set as a whole. Furthermore, imputation is consistent with the goal of preserving the integrity and completeness of the data set, which is essential for a comprehensive and objective analysis and guarantees that the classifier can efficiently learn from the available data across all characteristics [78]. Deletion, even coupled with data augmentation, was avoided because it could result in a reduction in data set size and a potential loss of valuable information [79]. The preference for imputation over classification techniques specifically designed to handle missing information, such as the classification algorithms developed by [80,81], is driven by the intention to explore a wider range of classification algorithms. While classification techniques specialized for missing information can be effective in certain contexts, choosing imputation allows for a more expansive investigation of diverse classification algorithms that might not automatically address missing data.

4.1. Data Collection

In order to apply the proposed methodology and compare the different classifiers regarding their ability to correctly classify a ship's engine, data were combined from two different data providers, using additional advanced text mining techniques to extract data from various unstructured sources of information, such as emails (free text) and PDF files.
The first data provider was 27 Research PC, which provided three different databases, denoted as DB1, DB2, and DB3, each one consisting of the ships' characteristics extracted from different sources and using different methods, covering the period from January 2019 until November 2023. These data are available from the authors upon request, subject to the permission of the data provider.
The second data source was Thetis-MRV, an open-access database in which CO₂ emissions from ships are Monitored, Reported, and Verified (MRV) on a yearly basis according to EU Regulation 2015/757 (available from 2018 to 2022). Thetis-MRV is a framework utilized in relation to climate change and emissions reduction to monitor and report greenhouse gas emissions and other pertinent data. It is typically linked with actions to lower emissions and combat climate change, especially in sectors like transport and industry.
In Table 1, all data providers are presented along with the number of observations and features they have offered.
It should be noted that, despite the seemingly large number of observations (285,801 records), the databases contain duplicate rows and several sparse or irrelevant characteristics of the ships, such as previous ship names, principal place of operation, the name and address of the shipowner, the contact person's address, telephone, e-mail, etc. Additionally, the classification of ships considers various fundamental variables (Length Overall, Beam, Design Draft, Gross Tonnage, Deadweight Tonnage, etc.) that describe their physical characteristics and capabilities. Because these variables are interconnected and together define the ship's purpose, design, and operational aspects, the ship type was removed. Furthermore, additional information on engine models, such as engine stroke (mm), engine builder, engine cylinders, and propeller, was not included due to its direct dependence on the engine type. It should also be noted that we exploited the entirety of the available features (and data) related to operational conditions. Finally, using the process of elimination to remove irrelevant features (as mentioned above), combined with the removal of features with a significantly small number of observations, the resulting common features from the different data sets were selected for further analysis.
As a result, merging all these data and selecting the most useful and informative features for this study was crucial and a necessary first step for the rest of the analysis. A key feature for the data merging was the IMO ship identification number, which provides the reliable identification of each ship since multiple ships may have the same name or a single ship may change its name multiple times during its lifetime. Note that instead of describing all features and to avoid any unnecessary repetitions due to the large number of variables, only the set of features (groups) used in further analysis are described in detail in Table 2 and Table 3.
Generally, there are two main groups of variables, the first one consisting of the ship's characteristics that remain unchanged, while the second group corresponds to the operational aspects. The dependent, categorical variable, which is the engine model of the vessels, was also included. Given that the main engine model variable is not of a numerical type, the dependent variable was converted from a string type into an integer.

4.2. Preprocessing/Exploratory Data Analysis

It has already been mentioned that data preprocessing and data exploration are necessary parts of data analysis before proceeding to the construction of prediction models. In terms of this work, the procedure described below was followed, involving the appropriate handling of missing and duplicated data.
As depicted in Table 2 and Table 3, the data set consists of 17 variables related to physical dimensions and environmental characteristics, such as the ship's consumption and speeds, plus the Ship Identification Number (IMO) and the target variable that represents the engine model of a vessel, leading to 19 variables in total. The final set of variables taken into consideration was derived after considering only the common features with a satisfactory number of observations among all the available databases. The final unified data set was the result of a merging method (inner join) based on the Ship Identification Number (IMO) as the key variable.
More specifically, prior to implementing the classification methods on the data set, a comprehensive preprocessing methodology was employed. The code for the preprocessing of the data and for the following steps of the analysis is available upon request by the authors.
Duplicated rows were removed using the variables of Table 2 and Table 3, resulting in a decrease from 285,801 to 57,004 observations. Deleting every row containing even one missing value (including in the main engine model variable) then left a data set of 3219 complete observations. This data set will later be utilized to evaluate the imputation procedure.
After the duplicated rows were dropped, missing values remained in 17 out of the 18 explanatory variables, since the variable named IMO had no missing values. The percentages of missing values for these variables ranged from 20% to 70%. These missing values need either to be dropped or to be imputed. In this work, in order to retain as much information as possible, an imputation approach was employed, as described in the following subsection.
Further, apart from dealing with the missing values, an important step to gain an initial insight into the data structure is the computation of the correlations between the variables. Correlations between the variables can serve as a first check to verify the necessity of a feature selection method (i.e., to select a subset of variables, or to combine their information), with the intention of reducing the dimension of the data space [82]. In particular, variables that are strongly and highly correlated indicate the need to apply some dimensionality reduction method to achieve more concise and explanatory results.
Toward that aim, a correlation matrix was also constructed. The Spearman's (Sp) correlation coefficient was used for the 3219 observations since none of the variables were normally distributed according to the Kolmogorov–Smirnov normality test that was conducted. From the analysis, it is clear that the variables that are related to physical dimensions present a high positive correlation between them (Sp > 0.7). More specifically, the variables $G_{a2}$ to $G_{a7}$, $G_{a9}$ and $G_{a10}$ exhibited strong and positive correlations between them (Figure 2). Regarding the rest of the variables, their correlation is less strong and does not seem to create a wide group of highly correlated variables. In particular, considering the variables that are associated with the environmental characteristics and the performance of vessels, only a few pairs of variables seem to present a high positive correlation. These pairs of variables are $G_{b1}$ and $G_{b2}$, $G_{b3}$ and $G_{b4}$, and $G_{b5}$ and $G_{b6}$ (Figure 2), which is not surprising since they refer to the same measurement under two different conditions (i.e., when a vessel travels empty or largely empty and when a vessel travels loaded).
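A minimal sketch of this screening step, with a hypothetical DataFrame standing in for the study's variables, follows.

```python
# A hedged sketch of the correlation screening: a Kolmogorov-Smirnov
# normality check followed by a Spearman correlation matrix; the DataFrame
# and its column names are hypothetical stand-ins for the study's variables.
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(4)
df = pd.DataFrame(rng.lognormal(size=(3219, 4)),
                  columns=["Ga2", "Ga3", "Gb1", "Gb2"])

# Kolmogorov-Smirnov test of one standardized variable against N(0, 1).
stat, p_value = stats.kstest(stats.zscore(df["Ga2"]), "norm")
print(f"KS p-value for Ga2: {p_value:.4f}")

corr = df.corr(method="spearman")
strong = (corr.abs() > 0.7) & (corr.abs() < 1.0)  # flag Sp > 0.7 pairs
print(strong)
```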

4.3. Imputation

Following the procedure presented in Section 3, after applying the cleaning process of the data, the MICE algorithm was applied to impute the missing values of the explanatory variables. By applying the imputation method to the 57,004 observations, to leverage any available information on the data, a total of 7773 observations were finally obtained after removing again any duplicate entries in the imputed data set and eliminating any record with missing information on the main engine model. This process ensured a more streamlined and representative data set for subsequent analytical procedures.
Before the MICE algorithm was implemented, Little's test was used to check whether the data were MCAR or not. The p-value of the test indicated that the missing data were indeed MCAR (p-value > 0.05), and so any missing value can be considered unrelated to the unknown value of the variable or to the other variables. The MICE algorithm was run for a maximum of ten iterations, and three regressors were compared and used as estimator parameters to predict the response variable. The Light Gradient Boosting [34], Extreme Gradient Boosting [35] and Bayesian Ridge regressors [36] were employed, each one for a specific number of variables. The appropriate regressor was selected according to the scores of the evaluation metrics to achieve the best possible imputation.
In particular, a two-step imputation method was utilized to maximize the usefulness of the lower proportion of missing values presented in the first set of variables. Specifically, solely the variables in this set were taken into account during the imputation process for any missing values. Subsequently, all variables from the “operational” and “exterior measurements” groups were included to impute the missing values.
The comparison of the values of the three evaluation metrics described above and the results of the F-test are depicted in Table 4, Table 5 and Table 6. From the results, it is clear that when the Bayesian Ridge estimator was used as the estimator parameter in the MICE algorithm, the algorithm failed to generate accurate and reliable estimates for all the variables (see, for example, the p-value of the F-test for the $G_{b5}$ and $G_{b6}$ variables in Table 4). On the other hand, the other two approaches present better behavior by generating unbiased predictions. Of these two, the most accurate implementation of MICE imputation was achieved using the Extreme Gradient Boosting regressor as the estimator parameter, since the values of the three evaluation metrics were generally smaller, or at least comparable, for all the variables with respect to the corresponding values obtained using the Light Gradient Boosting estimator.

4.4. Dimensionality Reduction

After the implementation of the imputation process, two executions of PCA were conducted using the variables from Table 2 and Table 3. The first was performed on the initial data set (the one that consists of 3219 observations) and the other on the imputed data set (7773 observations) to compare the outcomes of both implementations. To ensure equal contribution to the Principal Component Analysis, the variables were standardized before executing PCA.
The number of resulting variables used in the dimensionality reduction application is not high. However, dimensionality reduction techniques are applied for comparison purposes and to discover whether there are any latent variables. For both applications of PCA, the number of components kept in the analysis was equal to 4, since the eigenvalues of each of the first four principal components were greater than one (see the left plot in Figure 3). In the case of the imputed data set, the first four Principal Components together explained 83.27% of the variability of the initial data. On the other hand, when the imputation procedure was not taken into consideration, the first four Principal Components explained a slightly smaller percentage (82.47%) of the variability of the initial data (see the right plot in Figure 3). This suggests that the imputation not only avoided introducing noise into the data set but also maintained its underlying structure, and therefore the imputed data were used in the rest of the study.
The latter is also demonstrated by the similar values of the loadings of the four Principal Components obtained from the two different data sets. For that reason, in Table 7, the PCA loadings for the first four PCs are presented only for the imputed data set. All large loadings of the first PC are positive, evidencing a positive correlation between the ships' physical dimensions and the first PC. Regarding the second PC, it seems that the first four performance variables, related to speed and VLSFO consumption, have the largest contribution. The other four performance variables seem to have a larger association with the third and the fourth PC. Furthermore, in the fourth PC, the variable $G_{a8}$ (the ship's year of completion) also seems to have a relatively large contribution, probably reflecting the different standards and technological achievements over the years and their impact on the performance of the ships.
The loadings in Table 7 can be used to calculate the score of Principal Components for any ship. For example, the score of the first Principal Component can be obtained as follows:
$$
\begin{aligned}
PC_1 ={}& 0.322 \cdot \frac{G_{a2} - 53{,}838}{31{,}941.41} + 0.304 \cdot \frac{G_{a3} - 11.85}{2.09} + 0.315 \cdot \frac{G_{a4} - 193.48}{26.74} + 0.310 \cdot \frac{G_{a5} - 31.22}{3.89} \\
&+ 0.311 \cdot \frac{G_{a6} - 16.75}{2.68} + 0.324 \cdot \frac{G_{a7} - 66{,}295.6}{33{,}194} - 0.054 \cdot \frac{G_{a8} - 2010.85}{3.54} + 0.322 \cdot \frac{G_{a9} - 30{,}954.1}{15{,}610.1} \\
&+ 0.322 \cdot \frac{G_{a10} - 17{,}930.3}{10{,}150.8} - 0.110 \cdot \frac{G_{b1} - 13.26}{0.92} - 0.068 \cdot \frac{G_{b2} - 12.82}{0.87} + 0.261 \cdot \frac{G_{b3} - 25.2}{6.66} \\
&+ 0.262 \cdot \frac{G_{b4} - 26.09}{6.63} + 0.053 \cdot \frac{G_{b5} - 0.14}{0.16} + 0.050 \cdot \frac{G_{b6} - 0.18}{0.27} + 0.180 \cdot \frac{G_{b7} - 2.92}{0.63} + 0.040 \cdot \frac{G_{b8} - 0.29}{0.30}.
\end{aligned}
$$
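As a hedged illustration of how such a score is computed, the sketch below standardizes each variable with its mean and standard deviation and forms the loading-weighted sum; only the first three terms of $PC_1$ are included, and the ship's raw values are invented.

```python
# A hedged sketch of computing a Principal Component score: standardize each
# variable with its mean and standard deviation, then take the loading-
# weighted sum. Only the first three PC1 terms are used, and the ship's
# raw values are invented for illustration.
import numpy as np

loadings = np.array([0.322, 0.304, 0.315])   # PC1 loadings for Ga2, Ga3, Ga4
means = np.array([53838.0, 11.85, 193.48])   # variable means from the text
stds = np.array([31941.41, 2.09, 26.74])     # variable standard deviations

ship = np.array([60000.0, 12.5, 200.0])      # hypothetical ship measurements
pc1_partial = float(np.sum(loadings * (ship - means) / stds))
print(f"partial PC1 score (first three terms): {pc1_partial:.3f}")
```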
In addition to the PCA method, the t-SNE and UMAP methods were also applied to compare their results by reassessing the classification of the ships’ main engine models. The tuning of the parameters for UMAP and t-SNE along with the impact of the three dimensionality reduction methods on the classification performance are discussed in the next section.
The graphical representation is achieved by reducing the dimensionality of the data to a smaller-dimensional space, in this case a two-dimensional space for both the UMAP and t-SNE methods. Additionally, for clarity of illustration, the separation of the five leading classes of ships' main engine models is illustrated in Figure 4 using the imputed data (similar figures were obtained for the non-imputed data). The analytical encoding of the engine models follows in Table 8. The UMAP and t-SNE methods, as well as the PCA method, appear to separate the engine type classes similarly, while overlapping classes are also distinguishable.

4.5. Classification Results

The large number of different engine models restricted the analysis to engine models with at least 20 observations each, resulting in a total of 7159 observations. Table 8 presents the engine models that were retained, along with their encoding and their frequency in the final data set.
Although data augmentation is frequently used to correct class imbalance, it is not always the optimal solution for every classification problem. While SMOTE has been used extensively and has proven useful in some situations, its application is constrained by inherent limitations and potential adverse effects. Ref. [83] elucidates that data sets with extreme imbalances exhibit suboptimal performance even after the generation of synthetic samples. Additionally, ref. [84] asserts that the benefits of balancing techniques such as SMOTE are discernible for weak classifiers but not necessarily for robust ones. Thus, since data augmentation was not used, the balanced accuracy metric was employed as a robust measure of classifier performance, mitigating the impact of the imbalanced data set on the evaluation. The variable “IMO” was excluded from the classification process, as it is merely an identifier and does not constitute a relevant feature for the analysis. A comprehensive evaluation of several classification methods for correctly identifying the ship’s engine type under different dimensionality reduction methods is presented in this section. More specifically, 25 classifiers in total were compared with respect to the balanced accuracy metric under the three dimensionality reduction methods, and each dimensionality reduction method was also tested under different scenarios.
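For clarity, balanced accuracy is the macro-average of per-class recall, so a classifier that simply favors the dominant engine models gains nothing from the imbalance. A minimal sketch with scikit-learn (the toy labels below are illustrative, not from the study's data):

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels: engine model 0 dominates the sample (8 of 10 observations)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0] * 10  # a degenerate majority-class predictor

print(accuracy_score(y_true, y_pred))           # 0.80, flattered by the imbalance
print(balanced_accuracy_score(y_true, y_pred))  # 0.33, the mean per-class recall
```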
The PCA method was systematically evaluated by considering the first two, three, and four principal components. Additionally, UMAP was employed for dimensionality reduction with two and three components, investigating three numbers of neighbors, specifically 10, 30, and 123, leading to six different combinations. Furthermore, t-SNE was applied with two and three components and perplexity values of 10, 30, and 123; thus, six combinations of the t-SNE parameters were again evaluated. This comprehensive comparative analysis aimed to determine the efficacy of these techniques under various parameter combinations.
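The parameter grid described above can be reproduced with scikit-learn and the umap-learn package. The following is a sketch under the assumption that the scaled features are stored in a matrix X (sklearn's barnes_hut t-SNE supports up to three components, and a perplexity of 123 is valid here since it is far below the sample size):

```python
from itertools import product

import umap  # from the umap-learn package
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

reducers = {}
for n in (2, 3, 4):                                # PCA-2, PCA-3, PCA-4
    reducers[f"PCA-{n}"] = PCA(n_components=n)
for perp, n in product((10, 30, 123), (2, 3)):     # six t-SNE configurations
    reducers[f"Tsne{perp}-{n}"] = TSNE(n_components=n, perplexity=perp)
for nn, n in product((10, 30, 123), (2, 3)):       # six UMAP configurations
    reducers[f"Umap{nn}-{n}"] = umap.UMAP(n_neighbors=nn, n_components=n)

embeddings = {name: red.fit_transform(X) for name, red in reducers.items()}
```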
The data set was initially partitioned into training and test sets using the “train_test_split” function from the “model_selection” module of the scikit-learn library. Specifically, 80% of the data were allocated for training purposes, while the remaining 20% were reserved for evaluating the models’ performance. This division ensures that each model is trained on a substantial portion of the data and then tested on an independent subset to assess its generalization capabilities. Moreover, a k-fold cross-validator was also used to systematically evaluate each model’s performance by dividing the data into five consecutive folds.
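A sketch of both evaluation setups follows; the random_state value is an assumption added for reproducibility, and X, y are placeholders for the features and engine-model labels:

```python
from sklearn.model_selection import KFold, train_test_split

# 80/20 split of features X and engine-model labels y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Five consecutive folds, as described in the text (no shuffling)
kfold = KFold(n_splits=5, shuffle=False)
```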
The candidate classification methods were all applied through the “LazyClassifier” function from the “lazypredict.Supervised” library in Python (version 3.9.7). The “lazypredict” library was initially adopted, as it facilitates the comparison of various machine learning models by using the default hyperparameters of the scikit-learn classifiers. Further on, for the optimal classifier (ExtraTreesClassifier), an additional evaluation across different values of the number of trees (n_estimators = [10, 50, 100, 150, 500]) was performed for both imputation techniques and all dimensionality reduction methods. Furthermore, all the classifiers were also compared on the basis of the duration of their training processes.
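A sketch of this screening step, assuming the split defined above; lazypredict fits each scikit-learn classifier with its default hyperparameters and reports, among other metrics, balanced accuracy and training time:

```python
from lazypredict.Supervised import LazyClassifier

clf = LazyClassifier(verbose=0, ignore_warnings=True, predictions=False)
models, _ = clf.fit(X_train, X_test, y_train, y_test)

# Rank the fitted classifiers by balanced accuracy
print(models.sort_values("Balanced Accuracy", ascending=False).head(10))
```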
Table 9 delineates the optimal performance results corresponding to each specific feature selection method (PCA, UMAP, and t-SNE) using k-fold cross validation, having compared the various parameters tested for each technique, that is, the number of principal components, the number of neighbors, and the perplexity. The dimensionality reduction methods presented are PCA with four components (PCA-4), t-SNE with perplexity equal to 10 and three components (Tsne10-3), and UMAP with 10 neighbors and three components (Umap10-3). Additionally, the largest value(s) of the balanced accuracy metric is marked in bold. Moreover, for illustration purposes, the values of the balanced accuracy of the classifiers are also depicted in Figure 5 for the imputed data case.
The optimal performance results presented in Table 9 emerged from the results depicted in Table A7, Table A8, Table A9, Table A10 and Table A11 of Appendix A. In particular, Table A7 shows that the four-component PCA paired with an ExtraTreesClassifier outperforms all the other combinations of classifiers and principal components. Among the combinations of t-SNE with three components, the combination with a perplexity of 10, paired with an ExtraTreesClassifier, appears to yield the most favorable outcome (Table A9).
Similarly, Table A1 delineates the optimal performance results corresponding to each specific feature selection method (PCA, UMAP, and t-SNE) under the “train–test split”, having compared the various parameters tested for each technique, that is, the number of principal components, the number of neighbors, and the perplexity. The dimensionality reduction methods presented are PCA with four components (PCA-4), t-SNE with perplexity equal to 10 and two components (Tsne10-2), and UMAP with 10 neighbors and two components (Umap10-2). Additionally, the largest value(s) of the balanced accuracy metric is marked in bold.
Concerning the combinations of t-SNE with two components, the best performance is again obtained with an ExtraTreesClassifier (Table A8). However, t-SNE with three components and a perplexity of 10 performs better than t-SNE with two components and a perplexity of 10, specifically, 90.20% against 89.21%. From the results of Table A11, it is clear that the combination of UMAP with three components and 10 neighbors, paired with an ExtraTreesClassifier, presents the best performance (83.64%), while the corresponding two-component combination in Table A10 reaches 81.10%. Therefore, comparing the latter two, UMAP with three components and 10 neighbors was optimal.
In the context of the imputed data analysis, the ExtraTreesClassifier emerges as the best classifier, exhibiting exceptionally good performance with a balanced accuracy of 95.07%. Notably, as derived from the above, when evaluated across the various dimensionality reduction methods, the ExtraTreesClassifier consistently demonstrates exceptional performance, outperforming the other classifiers. While these findings highlight the consistent proficiency of the ExtraTreesClassifier across diverse dimensionality reduction techniques, it is worth mentioning that t-SNE stands out as the top-performing reduction method, displaying the highest accuracy in all evaluated scenarios.
The optimal performance results under the “train–test split” emerged from the results depicted in Table A1, Table A2, Table A3, Table A4, Table A5 and Table A6 of Appendix A. In particular, Table A2 shows that the four-component PCA paired with an ExtraTreesClassifier outperforms all the other combinations of classifiers and principal components. Among the combinations of t-SNE with two components, t-SNE with a perplexity of 10 paired with an ExtraTreesClassifier appears to yield the most favorable outcomes (Table A3).
Further, given the results, it is concluded that classifiers based on decision tree logic (i.e., ExtraTreesClassifier, RandomForestClassifier, BaggingClassifier, DecisionTreeClassifier, and ExtraTreeClassifier) perform better and successfully classify the engine model type of a ship with great accuracy (up to 96%). In addition to the most accurate results obtained for the imputed data, the reduced data achieved satisfactory outcomes in the dimensionality reduction tests, even with only two components instead of the original 18 variables. Although there may be various explanations for why the aforementioned algorithms outperform the others, the most common ones are highlighted below.
Decision trees are capable of adapting well to the characteristics of the data. They can create complex decision boundaries when necessary and are effective in capturing non-linear relationships in the structure of the data. Other classifiers, such as linear models, may struggle when the underlying patterns are not linear. Further, decision trees inherently perform feature selection by ranking the importance of features. That is, they can focus on the most important features for decision making, which can be a significant advantage when dealing with high-dimensional data.
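As an illustration of this built-in feature ranking, tree ensembles in scikit-learn expose impurity-based importances after fitting. The following is a minimal sketch, assuming the training split from above and a list feature_names (a placeholder) holding the labels of the input variables:

```python
from sklearn.ensemble import ExtraTreesClassifier

model = ExtraTreesClassifier(n_estimators=100).fit(X_train, y_train)

# Impurity-based importance of each input variable, highest first
ranking = sorted(zip(feature_names, model.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```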
They can also serve as a base model for ensemble methods, such as random forests and gradient boosting. Ensemble methods combine multiple decision trees to reduce overfitting and improve prediction accuracy, which may lead to superior performance compared to standalone classifiers. Furthermore, decision trees can handle categorical data without requiring one-hot encoding or other preprocessing steps, simplifying the modeling process and potentially leading to better results when a mix of categorical and numerical features exists.
Another explanation could be the fact that decision trees are relatively robust to outliers and missing data, which can be common problems in real-world data sources. They can handle these situations gracefully, potentially leading to better overall performance.
It is important to note that the relative performance of classifiers and the feature selection technique can vary depending on the data set and the specific problem. While decision tree-based classifiers have these advantages, there are cases where other types of classifiers, such as support vector machines, neural networks, or k-nearest neighbors, may be more appropriate. The choice of classifier and dimensionality reduction method should be based on the characteristics of the data and the goals of the machine learning task.
Table 10 shows the balanced accuracy results of the ExtraTreesClassifier over several values of the “n_estimators” hyperparameter. The columns correspond to the different configurations of the “n_estimators” parameter, while the rows denote the imputed data set as well as the different dimensionality reduction methods. The data set is methodically divided into five consecutive folds using the k-fold cross-validation approach, and the final performance metric is derived by averaging the scores obtained across the five iterations.
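A sketch of how such a table can be produced, assuming the kfold splitter defined earlier and one of the reduced feature sets (X_pca4 is a placeholder for the PCA-4 scores):

```python
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

for n in (10, 50, 100, 150, 500):
    model = ExtraTreesClassifier(n_estimators=n)
    scores = cross_val_score(model, X_pca4, y, cv=kfold,
                             scoring="balanced_accuracy")
    print(n, scores.mean())  # averaged over the five folds, as in Table 10
```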
While the classifiers’ results are deemed satisfactory, it is important to examine the source of the remaining minor classification error. Considering Table 11, based on the results of the top classifier (ExtraTreesClassifier), together with the illustration in Figure 4, a patchwork effect can be discerned among certain encoded engine types (the five most frequent). Combining the visualization obtained with the dimensionality reduction techniques and the confusion matrix based on the imputed data set, it is evident that, in all instances, misclassification occurs towards specific engine types, which can be attributed to the similar characteristics and variables accompanying them.
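This misclassification pattern can be inspected directly from the confusion matrix. The following is a sketch assuming a previously fitted ExtraTreesClassifier (model) and the test split from the earlier steps:

```python
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, model.predict(X_test))
# Rows are true engine models, columns are predicted ones; large
# off-diagonal entries between similar models produce the "patchwork"
# effect visible in Figure 4 and Table 11.
print(cm)
```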

5. Conclusions

The ability to correctly classify a ship’s engine type using mostly readily available data could provide an important advantage in a highly competitive market such as the shipping industry. To achieve this, a thorough examination was conducted to identify the best dimensionality reduction method and the best classifier, and the best combination was identified through a detailed comparison and evaluation of their performance.
This classification procedure could lead to some important, innovative, and valuable conclusions for the marine industry. Shipping companies could reduce their operating costs if the main engine type of their vessels is known. Thus, given the physical dimensions of the ships and their operational behavior, shipowners could select and deploy the appropriate engine model for their fleet depending on their strategic plan and operational schedule.
Furthermore, if shipping companies know which main engine models cause higher fuel consumption, they will be able to decide which engine types to use, with the ultimate aim of reducing emissions from ships and thus operating in compliance with rigorous international environmental regulations without additional burdens.
The engine models identified as optimal in most cases may encourage shipping companies to purchase them, as these models may improve the durability and sustainability of their ships in terms of both voyage time and operating costs. This reinforced sustainability could, in turn, reduce the occurrence of vessel malfunctions, helping to avoid unexpected and undesirable accidents that may result in serious damage and loss of life.
Another inference of this study is that marine industries, provided they have a large amount of data at their disposal, could make safe and reliable decisions for their strategic plan using a smaller subset of data referring mainly to physical and environmental attributes.
A more detailed account of the advantages of this method over traditional ones includes, for instance, a faster selection of the engine, the development of optimal strategies for a new vessel’s use, and even the development of digital twin applications or simulators, allowing the evaluation of the ship’s performance before engine placement; the latter would result in reduced operating costs. Such digital twins, as already mentioned, can be an asset in performance monitoring and predictive maintenance.
Considering the faster engine selection, the proposed model can be used during the design phase of a ship, providing a quick approximate assessment and identification of the optimal marine engine. It should be noted that there are hundreds of different engine types on the market; thus, the capability provided by the proposed methodology for a quick screening and selection of the optimal engine type is of great value.
Selecting the optimal engine for a vessel, which emerges naturally from the proposed methodology, can also be integrated into the development of optimal use strategies. This can be achieved by considering the physical dimensions of the vessel and strategically determining the intended use of the ship as well as the desired operational behavior; for instance, the target consumption rates, speeds, etc., all derive from the shipowner’s strategy.
Also, as already mentioned, digital twin applications could be developed. These digital models are built using operational behavior data, historical information, and advanced analytics, and they mirror the physical counterpart’s characteristics, behavior, and performance in a virtual environment. Thus, the shipbuilding industry can reduce operating costs by knowing which type of main engine is optimal for a vessel.
Furthermore, the proposed methodology offers a tool for reducing harmful ship emissions and supports the wider deployment of vessels in accordance with strict international environmental standards without restrictions. More generally, such classification procedures may lead to the identification of important, innovative, and valuable findings for the maritime industry.
The main limitation of this study is that there was insufficient information on all of the 123 available engine models. Further, despite exploiting all the available features and data, encompassing both operational and design data, information was lacking on the operating conditions of the ships and on whether particular characteristics reflect customized design choices. Additionally, it should be highlighted that a considerable number of values are missing in the variables pertaining to the operational efficiency of the ships.
The proposed methodology reflects the actual data used to train the model in this research, i.e., the current reality in ships’ design and ships’ engine selection. However, the current reality, as depicted in the data, encompasses both best practices and less or non-optimal engine choices. Naturally, we expect that good or best practices have been applied to the majority of the vessels at our disposal. Through the extensive data set available to us, the proposed method identifies the underlying data structure and selects the prevailing strategy. Therefore, it validates the best strategies on one hand, while on the other, it provides a quick and straightforward way for the universal application of the best strategies during the design phase.
In conclusion, although the estimated missing values were accurate, the outcomes would have been more reliable and robust if more observations were available for this group of variables. Consequently, it would be prudent to gather as much data as possible regarding the operational efficiency of ships in future studies, if the challenging circumstances and competition permit.

Author Contributions

Conceptualization, S.B. and K.S.; methodology, S.B., K.S. and P.E.; software, K.S., G.P. and P.B.; validation, S.B. and P.E.; formal analysis, K.S., S.B., P.E. and E.S.; resources, K.S. and P.B.; data curation, K.S.; writing–original draft preparation, K.S., P.B., G.P., E.S., S.B. and P.E.; writing–review and editing, K.S., P.B., G.P., E.S., S.B. and P.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The first data source used in this study (EU Monitoring, Reporting, Verification (MRV) mechanism) can be accessed through https://mrv.emsa.europa.eu/#public/emission-report (accessed on 15 November 2023). The second data set provided by the 27 Research PC is available upon request by the authors under the permission of the data provider.

Acknowledgments

The authors thank the 27 Research PC and especially its Director for providing the data for this analysis.

Conflicts of Interest

No conflicts of interest exist in the submission of this manuscript, and the manuscript is approved by all authors for publication.

Appendix A

Table A1. Performance of evaluation metrics (balanced accuracy) for imputation data and the top performance for each of the three dimensionality reduction methods using “train test split”. The time taken for each algorithm is also included.

Classification Algorithm | Balanced Accuracy: Imput | PCA-4 | Tsne10-2 | Umap10-2 | Time Taken: Imput | PCA-4 | Tsne10-2 | Umap10-2
ExtraTreesClassifier | 0.9593 | 0.8968 | 0.9206 | 0.8537 | 0.3884 | 0.4583 | 0.3577 | 0.4488
LGBMClassifier | 0.9562 | 0.8862 | 0.9105 | 0.7620 | 1.2307 | 1.4661 | 1.4541 | 1.5109
RandomForestClassifier | 0.9529 | 0.8942 | 0.9136 | 0.8416 | 0.5582 | 1.1150 | 0.6894 | 0.7291
DecisionTreeClassifier | 0.9458 | 0.8699 | 0.9064 | 0.8148 | 0.0329 | 0.0349 | 0.0239 | 0.0239
BaggingClassifier | 0.9438 | 0.8705 | 0.8951 | 0.8093 | 0.1686 | 0.1954 | 0.1177 | 0.1207
ExtraTreeClassifier | 0.9315 | 0.8390 | 0.9049 | 0.7948 | 0.0110 | 0.0110 | 0.0100 | 0.0100
LabelPropagation | 0.9159 | 0.5606 | 0.3408 | 0.3324 | 0.9220 | 0.6638 | 0.6498 | 0.6034
LabelSpreading | 0.9159 | 0.5372 | 0.3315 | 0.3295 | 1.4094 | 0.9950 | 0.9384 | 0.9421
KNeighborsClassifier | 0.6964 | 0.5847 | 0.6946 | 0.6546 | 0.1481 | 0.0439 | 0.0379 | 0.0375
GaussianNB | 0.5676 | 0.2970 | 0.1435 | 0.0846 | 0.0189 | 0.0189 | 0.0180 | 0.0210
LinearDiscriminantAnalysis | 0.5548 | 0.1928 | 0.0642 | 0.0592 | 0.0376 | 0.0140 | 0.0100 | 0.0100
LogisticRegression | 0.5190 | 0.2042 | 0.0746 | 0.0589 | 0.7321 | 0.7281 | 0.7288 | 0.7118
SVC | 0.4608 | 0.2677 | 0.2216 | 0.1628 | 1.3545 | 1.4457 | 1.5451 | 1.8055
LinearSVC | 0.4554 | 0.0956 | 0.0639 | 0.0577 | 2.0496 | 3.0429 | 0.5207 | 0.7550
NearestCentroid | 0.4481 | 0.2532 | 0.1928 | 0.1596 | 0.0110 | 0.0090 | 0.0090 | 0.0090
CalibratedClassifierCV | 0.4287 | 0.0929 | 0.0562 | 0.0524 | 8.3219 | 11.5316 | 2.5174 | 3.3163
SGDClassifier | 0.3982 | 0.1119 | 0.0805 | 0.0623 | 0.3361 | 0.1725 | 0.1272 | 0.1227
Perceptron | 0.3587 | 0.0920 | 0.0396 | 0.0211 | 0.1287 | 0.0768 | 0.0478 | 0.0598
PassiveAggressiveClassifier | 0.3157 | 0.0695 | 0.0704 | 0.0348 | 0.1546 | 0.0678 | 0.0618 | 0.0598
BernoulliNB | 0.2333 | 0.0504 | 0.0570 | 0.0386 | 0.0352 | 0.0121 | 0.0110 | 0.0119
RidgeClassifier | 0.1438 | 0.0497 | 0.0390 | 0.0321 | 0.0249 | 0.0199 | 0.0190 | 0.0179
RidgeClassifierCV | 0.1438 | 0.0497 | 0.0390 | 0.0321 | 0.0284 | 0.0199 | 0.0219 | 0.0189
AdaBoostClassifier | 0.1266 | 0.0766 | 0.0671 | 0.0389 | 0.4466 | 0.4468 | 0.3829 | 0.3832
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0256 | 0.0090 | 0.0090 | 0.0110 | 0.0070
QuadraticDiscriminantAnalysis | 0.0256 | 0.3961 | 0.1391 | 0.1194 | 0.0199 | 0.0130 | 0.0189 | 0.0130
Table A2. Performance of evaluation metric (balanced accuracy) for imputation data using PCA method for different components using “train test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: PCA-2 | PCA-3 | PCA-4 | Time Taken: PCA-2 | PCA-3 | PCA-4
ExtraTreesClassifier | 0.7914 | 0.8601 | 0.8968 | 0.3680 | 0.4675 | 0.4599
LGBMClassifier | 0.7213 | 0.8557 | 0.8862 | 1.5523 | 2.0348 | 1.5234
RandomForestClassifier | 0.7613 | 0.8610 | 0.8942 | 0.6910 | 0.7451 | 1.1152
DecisionTreeClassifier | 0.7509 | 0.8247 | 0.8699 | 0.0250 | 0.0299 | 0.0397
BaggingClassifier | 0.7297 | 0.8336 | 0.8705 | 0.1192 | 0.1630 | 0.1985
ExtraTreeClassifier | 0.7110 | 0.7700 | 0.8390 | 0.0100 | 0.0110 | 0.0110
LabelPropagation | 0.2029 | 0.3442 | 0.5606 | 0.6273 | 0.6662 | 0.6717
LabelSpreading | 0.1980 | 0.3391 | 0.5372 | 1.0045 | 1.0298 | 1.1671
KNeighborsClassifier | 0.4938 | 0.5456 | 0.5847 | 0.0379 | 0.0409 | 0.0439
GaussianNB | 0.1352 | 0.2187 | 0.2970 | 0.0209 | 0.0259 | 0.0222
LinearDiscriminantAnalysis | 0.1393 | 0.1564 | 0.1928 | 0.0100 | 0.0110 | 0.0140
LogisticRegression | 0.1171 | 0.1252 | 0.2042 | 0.7938 | 0.8086 | 0.7721
SVC | 0.1445 | 0.1746 | 0.2677 | 1.5649 | 1.5430 | 1.4709
LinearSVC | 0.0604 | 0.0684 | 0.0956 | 0.5096 | 2.7077 | 3.0354
NearestCentroid | 0.1761 | 0.1765 | 0.2532 | 0.0090 | 0.0090 | 0.0110
CalibratedClassifierCV | 0.0562 | 0.0616 | 0.0929 | 2.5534 | 10.7399 | 11.6460
SGDClassifier | 0.0679 | 0.0776 | 0.1119 | 0.1057 | 0.1411 | 0.1725
Perceptron | 0.0388 | 0.0723 | 0.0920 | 0.0529 | 0.0598 | 0.0738
PassiveAggressiveClassifier | 0.0568 | 0.0617 | 0.0695 | 0.0549 | 0.0588 | 0.0819
BernoulliNB | 0.0504 | 0.0504 | 0.0504 | 0.0110 | 0.0119 | 0.0109
RidgeClassifier | 0.0502 | 0.0497 | 0.0497 | 0.0180 | 0.0199 | 0.0169
RidgeClassifierCV | 0.0501 | 0.0497 | 0.0497 | 0.0379 | 0.0229 | 0.0220
AdaBoostClassifier | 0.0766 | 0.0766 | 0.0766 | 0.4220 | 0.4561 | 0.4828
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0080 | 0.0080
QuadraticDiscriminantAnalysis | 0.1575 | 0.2326 | 0.3961 | 0.0150 | 0.0131 | 0.0130
Table A3. Performance of evaluation metrics (balanced accuracy) for imputation data using t-SNE with 2 components and different perplexity values using “train test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-2 | Tsne30-2 | Tsne123-2 | Time Taken: Tsne10-2 | Tsne30-2 | Tsne123-2
ExtraTreesClassifier | 0.9206 | 0.9173 | 0.9132 | 0.3680 | 0.3640 | 0.3705
LGBMClassifier | 0.9105 | 0.8904 | 0.8719 | 1.5523 | 1.9767 | 1.8871
RandomForestClassifier | 0.9136 | 0.9139 | 0.9074 | 0.6910 | 0.6855 | 0.7211
DecisionTreeClassifier | 0.9064 | 0.8964 | 0.8991 | 0.0250 | 0.0239 | 0.0239
BaggingClassifier | 0.8951 | 0.8887 | 0.8839 | 0.1192 | 0.1141 | 0.1219
ExtraTreeClassifier | 0.9049 | 0.8954 | 0.8948 | 0.0100 | 0.0110 | 0.0110
LabelPropagation | 0.3408 | 0.2747 | 0.2292 | 0.6273 | 0.6551 | 0.6260
LabelSpreading | 0.3315 | 0.2724 | 0.2240 | 1.0045 | 1.0087 | 1.0044
KNeighborsClassifier | 0.6946 | 0.6890 | 0.6573 | 0.0379 | 0.0389 | 0.0389
GaussianNB | 0.1435 | 0.1832 | 0.1400 | 0.0209 | 0.0209 | 0.0219
LinearDiscriminantAnalysis | 0.0642 | 0.1241 | 0.1320 | 0.0100 | 0.0120 | 0.0100
LogisticRegression | 0.0746 | 0.1270 | 0.1328 | 0.7938 | 0.7783 | 0.7849
SVC | 0.2216 | 0.2309 | 0.1747 | 1.5649 | 1.4505 | 1.3898
LinearSVC | 0.0639 | 0.0458 | 0.1009 | 0.5096 | 0.4235 | 0.4019
NearestCentroid | 0.1928 | 0.2227 | 0.1823 | 0.0090 | 0.0115 | 0.0100
CalibratedClassifierCV | 0.0562 | 0.0592 | 0.0856 | 2.5534 | 1.9490 | 2.1004
SGDClassifier | 0.0805 | 0.0928 | 0.1019 | 0.1057 | 0.1097 | 0.1177
Perceptron | 0.0396 | 0.0491 | 0.0927 | 0.0529 | 0.0520 | 0.0519
PassiveAggressiveClassifier | 0.0704 | 0.0605 | 0.1028 | 0.0549 | 0.0610 | 0.0588
BernoulliNB | 0.0570 | 0.0442 | 0.0692 | 0.0110 | 0.0110 | 0.0113
RidgeClassifier | 0.0390 | 0.0409 | 0.0538 | 0.0180 | 0.0189 | 0.0199
RidgeClassifierCV | 0.0390 | 0.0409 | 0.0538 | 0.0379 | 0.0232 | 0.0229
AdaBoostClassifier | 0.0671 | 0.0634 | 0.0731 | 0.4220 | 0.4099 | 0.4118
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0070 | 0.0070
QuadraticDiscriminantAnalysis | 0.1391 | 0.2037 | 0.2103 | 0.0150 | 0.0130 | 0.0120
Table A4. Performance of evaluation metric (balanced accuracy) for imputation data using t-SNE with 3 components and different perplexity values using a “train–test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-3 | Tsne30-3 | Tsne123-3 | Time Taken: Tsne10-3 | Tsne30-3 | Tsne123-3
ExtraTreesClassifier | 0.9181 | 0.9146 | 0.9169 | 0.3680 | 0.3612 | 0.3671
LGBMClassifier | 0.0427 | 0.9104 | 0.0453 | 1.5523 | 2.2633 | 0.9634
RandomForestClassifier | 0.9171 | 0.9135 | 0.9141 | 0.6910 | 0.7550 | 0.7540
DecisionTreeClassifier | 0.9091 | 0.9054 | 0.8967 | 0.0250 | 0.0321 | 0.0320
BaggingClassifier | 0.8912 | 0.8883 | 0.8852 | 0.1192 | 0.1725 | 0.1649
ExtraTreeClassifier | 0.8884 | 0.8951 | 0.8978 | 0.0100 | 0.0130 | 0.0100
LabelPropagation | 0.6550 | 0.5999 | 0.5336 | 0.6273 | 0.6228 | 0.6521
LabelSpreading | 0.6397 | 0.5715 | 0.5151 | 1.0045 | 0.9969 | 1.0462
KNeighborsClassifier | 0.7047 | 0.6905 | 0.6748 | 0.0379 | 0.0399 | 0.0399
GaussianNB | 0.1815 | 0.2226 | 0.2120 | 0.0209 | 0.0219 | 0.0228
LinearDiscriminantAnalysis | 0.1132 | 0.1791 | 0.1520 | 0.0100 | 0.0120 | 0.0129
LogisticRegression | 0.1398 | 0.1891 | 0.1622 | 0.7938 | 0.9065 | 0.9106
SVC | 0.3146 | 0.2956 | 0.2561 | 1.5649 | 1.3408 | 1.3334
LinearSVC | 0.0925 | 0.1385 | 0.1255 | 0.5096 | 0.4877 | 0.6084
NearestCentroid | 0.2424 | 0.2867 | 0.2457 | 0.0090 | 0.0100 | 0.0110
CalibratedClassifierCV | 0.0866 | 0.1378 | 0.1429 | 2.5534 | 2.3304 | 2.5602
SGDClassifier | 0.0920 | 0.1101 | 0.0887 | 0.1057 | 0.1336 | 0.1307
Perceptron | 0.0858 | 0.0947 | 0.0609 | 0.0529 | 0.0648 | 0.0598
PassiveAggressiveClassifier | 0.0911 | 0.1228 | 0.1137 | 0.0549 | 0.0648 | 0.0598
BernoulliNB | 0.0426 | 0.0607 | 0.0774 | 0.0110 | 0.0120 | 0.0120
RidgeClassifier | 0.0437 | 0.0601 | 0.0499 | 0.0180 | 0.0189 | 0.0209
RidgeClassifierCV | 0.0437 | 0.0601 | 0.0499 | 0.0379 | 0.0219 | 0.0229
AdaBoostClassifier | 0.0574 | 0.0551 | 0.0787 | 0.4220 | 0.4538 | 0.4643
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0090 | 0.0079
QuadraticDiscriminantAnalysis | 0.2661 | 0.2965 | 0.2812 | 0.0150 | 0.0130 | 0.0160
Table A5. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 2 components and different near-neighbor values using a “train–test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-2 | UMAP30-2 | UMAP123-2 | Time Taken: UMAP10-2 | UMAP30-2 | UMAP123-2
ExtraTreesClassifier | 0.8537 | 0.7868 | 0.6164 | 0.3680 | 0.4836 | 0.5185
LGBMClassifier | 0.7620 | 0.0534 | 0.5915 | 1.5523 | 0.8876 | 2.0281
RandomForestClassifier | 0.8416 | 0.7250 | 0.6037 | 0.6910 | 0.7556 | 0.7909
DecisionTreeClassifier | 0.8148 | 0.7153 | 0.5816 | 0.0250 | 0.0260 | 0.0269
BaggingClassifier | 0.8093 | 0.7071 | 0.6060 | 0.1192 | 0.1267 | 0.1297
ExtraTreeClassifier | 0.7948 | 0.6567 | 0.5167 | 0.0100 | 0.0130 | 0.0120
LabelPropagation | 0.3324 | 0.2198 | 0.1450 | 0.6273 | 0.6445 | 0.6467
LabelSpreading | 0.3295 | 0.2127 | 0.1436 | 1.0045 | 1.0696 | 0.9904
KNeighborsClassifier | 0.6546 | 0.5961 | 0.5631 | 0.0379 | 0.0379 | 0.0379
GaussianNB | 0.0846 | 0.1651 | 0.1901 | 0.0209 | 0.0279 | 0.0315
LinearDiscriminantAnalysis | 0.0592 | 0.1074 | 0.1084 | 0.0100 | 0.0130 | 0.0110
LogisticRegression | 0.0589 | 0.1042 | 0.1234 | 0.7938 | 0.8240 | 0.8344
SVC | 0.1628 | 0.1898 | 0.1604 | 1.5649 | 1.3771 | 1.3968
LinearSVC | 0.0577 | 0.0884 | 0.1109 | 0.5096 | 0.6259 | 0.6231
NearestCentroid | 0.1596 | 0.2173 | 0.1927 | 0.0090 | 0.0110 | 0.0100
CalibratedClassifierCV | 0.0524 | 0.0809 | 0.1010 | 2.5534 | 2.7130 | 2.7518
SGDClassifier | 0.0623 | 0.0980 | 0.1540 | 0.1057 | 0.1207 | 0.1099
Perceptron | 0.0211 | 0.1231 | 0.1019 | 0.0529 | 0.0595 | 0.0562
PassiveAggressiveClassifier | 0.0348 | 0.0973 | 0.0827 | 0.0549 | 0.0598 | 0.0611
BernoulliNB | 0.0386 | 0.0732 | 0.0754 | 0.0110 | 0.0110 | 0.0120
RidgeClassifier | 0.0321 | 0.0478 | 0.0757 | 0.0180 | 0.0199 | 0.0200
RidgeClassifierCV | 0.0321 | 0.0478 | 0.0757 | 0.0379 | 0.0239 | 0.0209
AdaBoostClassifier | 0.0389 | 0.1004 | 0.0936 | 0.4220 | 0.4199 | 0.4144
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0080 | 0.0080
QuadraticDiscriminantAnalysis | 0.1194 | 0.2193 | 0.1941 | 0.0150 | 0.0130 | 0.0120
Table A6. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 3 components and different near-neighbor values using a “train–test split”, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-3 | UMAP30-3 | UMAP123-3 | Time Taken: UMAP10-3 | UMAP30-3 | UMAP123-3
ExtraTreesClassifier | 0.8501 | 0.8152 | 0.7342 | 0.3680 | 0.4879 | 0.5206
LGBMClassifier | 0.8059 | 0.0527 | 0.6550 | 1.5523 | 0.8830 | 2.0156
RandomForestClassifier | 0.8327 | 0.7849 | 0.6810 | 0.6910 | 0.7508 | 0.9259
DecisionTreeClassifier | 0.8270 | 0.7603 | 0.6591 | 0.0250 | 0.0359 | 0.0329
BaggingClassifier | 0.8035 | 0.7696 | 0.6690 | 0.1192 | 0.1676 | 0.1750
ExtraTreeClassifier | 0.8102 | 0.7291 | 0.6009 | 0.0100 | 0.0130 | 0.0130
LabelPropagation | 0.4042 | 0.2543 | 0.2140 | 0.6273 | 0.6401 | 0.6611
LabelSpreading | 0.4047 | 0.2541 | 0.1995 | 1.0045 | 1.0271 | 1.0150
KNeighborsClassifier | 0.6502 | 0.6539 | 0.5882 | 0.0379 | 0.0399 | 0.0399
GaussianNB | 0.1618 | 0.1966 | 0.2085 | 0.0209 | 0.0279 | 0.0289
LinearDiscriminantAnalysis | 0.0987 | 0.1297 | 0.1382 | 0.0100 | 0.0100 | 0.0139
LogisticRegression | 0.1051 | 0.1389 | 0.1403 | 0.7938 | 0.7934 | 1.0130
SVC | 0.2495 | 0.2222 | 0.1655 | 1.5649 | 1.3185 | 1.3388
LinearSVC | 0.0696 | 0.1279 | 0.1383 | 0.5096 | 1.0410 | 0.8294
NearestCentroid | 0.1851 | 0.2014 | 0.2403 | 0.0090 | 0.0100 | 0.0169
CalibratedClassifierCV | 0.0696 | 0.1152 | 0.1161 | 2.5534 | 4.5479 | 3.8081
SGDClassifier | 0.0678 | 0.1025 | 0.1533 | 0.1057 | 0.1563 | 0.1351
Perceptron | 0.0787 | 0.1017 | 0.1128 | 0.0529 | 0.0792 | 0.0957
PassiveAggressiveClassifier | 0.0532 | 0.1223 | 0.1028 | 0.0549 | 0.0607 | 0.0947
BernoulliNB | 0.0459 | 0.0761 | 0.1018 | 0.0110 | 0.0120 | 0.0120
RidgeClassifier | 0.0647 | 0.0758 | 0.1018 | 0.0180 | 0.0189 | 0.0199
RidgeClassifierCV | 0.0637 | 0.0758 | 0.1018 | 0.0379 | 0.0209 | 0.0200
AdaBoostClassifier | 0.1124 | 0.0762 | 0.0761 | 0.4220 | 0.4830 | 0.4514
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0070 | 0.0100 | 0.0070
QuadraticDiscriminantAnalysis | 0.2144 | 0.2767 | 0.2725 | 0.0150 | 0.0160 | 0.0200
Table A7. Performance of evaluation metric (balanced accuracy) for imputation data using the PCA method for different components using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: PCA-2 | PCA-3 | PCA-4 | Time Taken: PCA-2 | PCA-3 | PCA-4
ExtraTreesClassifier | 0.7769 | 0.8358 | 0.8709 | 0.4277 | 0.4301 | 0.4136
RandomForestClassifier | 0.7451 | 0.8253 | 0.8577 | 0.7073 | 0.7143 | 1.0609
BaggingClassifier | 0.7012 | 0.7793 | 0.8319 | 0.1218 | 0.1573 | 0.1886
DecisionTreeClassifier | 0.7308 | 0.7820 | 0.8267 | 0.0228 | 0.0286 | 0.0344
ExtraTreeClassifier | 0.7311 | 0.7674 | 0.8240 | 0.0104 | 0.0097 | 0.0100
LabelPropagation | 0.1994 | 0.3401 | 0.5514 | 0.6835 | 0.7046 | 0.7152
LabelSpreading | 0.1932 | 0.3267 | 0.5311 | 1.3347 | 1.3334 | 1.3164
KNeighborsClassifier | 0.4742 | 0.5219 | 0.5683 | 0.0334 | 0.0362 | 0.0377
LGBMClassifier | 0.7255 | 0.8084 | 0.5301 | 1.7637 | 1.8761 | 1.5696
LinearDiscriminantAnalysis | 0.1332 | 0.1479 | 0.2008 | 0.0098 | 0.0106 | 0.0113
GaussianNB | 0.1361 | 0.2129 | 0.3126 | 0.0104 | 0.0110 | 0.0112
LogisticRegression | 0.1109 | 0.1298 | 0.2023 | 0.7755 | 0.7843 | 0.7778
SVC | 0.1442 | 0.1816 | 0.2498 | 1.1508 | 1.1726 | 1.1509
LinearSVC | 0.0610 | 0.0670 | 0.0917 | 1.2491 | 2.0262 | 2.1846
NearestCentroid | 0.1630 | 0.1893 | 0.2534 | 0.0084 | 0.0084 | 0.0086
CalibratedClassifierCV | 0.0561 | 0.0661 | 0.0812 | 5.1969 | 8.2681 | 8.6005
SGDClassifier | 0.0741 | 0.0789 | 0.1144 | 0.1237 | 0.1397 | 0.1587
QuadraticDiscriminantAnalysis | 0.1475 | 0.2270 | 0.4012 | 0.0106 | 0.0116 | 0.0116
Perceptron | 0.0725 | 0.0650 | 0.0869 | 0.0577 | 0.0619 | 0.0651
PassiveAggressiveClassifier | 0.0595 | 0.0611 | 0.0777 | 0.0596 | 0.0639 | 0.0696
BernoulliNB | 0.0503 | 0.0503 | 0.0503 | 0.0093 | 0.0090 | 0.0093
RidgeClassifier | 0.0496 | 0.0490 | 0.0481 | 0.0115 | 0.0114 | 0.0122
RidgeClassifierCV | 0.0496 | 0.0490 | 0.0481 | 0.0203 | 0.0211 | 0.0212
AdaBoostClassifier | 0.1009 | 0.1008 | 0.0865 | 0.4045 | 0.4412 | 0.4750
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0058 | 0.0061 | 0.0060
Table A8. Performance of evaluation metrics (balanced accuracy) for imputation data using t-SNE with 2 components and different perplexity values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-2 | Tsne30-2 | Tsne123-2 | Time Taken: Tsne10-2 | Tsne30-2 | Tsne123-2
ExtraTreesClassifier | 0.8921 | 0.8929 | 0.8722 | 0.3265 | 0.3241 | 0.3290
RandomForestClassifier | 0.8886 | 0.8835 | 0.8628 | 0.6431 | 0.6288 | 0.6483
BaggingClassifier | 0.8669 | 0.8621 | 0.8373 | 0.1090 | 0.1071 | 0.1132
DecisionTreeClassifier | 0.8863 | 0.8815 | 0.8514 | 0.0208 | 0.0204 | 0.0219
ExtraTreeClassifier | 0.8753 | 0.8700 | 0.8470 | 0.0088 | 0.0086 | 0.0090
LabelPropagation | 0.3385 | 0.2772 | 0.2379 | 0.6733 | 0.6728 | 0.6687
LabelSpreading | 0.3294 | 0.2734 | 0.2304 | 1.3099 | 1.2979 | 1.2775
KNeighborsClassifier | 0.6750 | 0.6863 | 0.6650 | 0.0328 | 0.0333 | 0.0332
LGBMClassifier | 0.5336 | 0.7035 | 0.8494 | 1.5011 | 1.6371 | 1.7765
LinearDiscriminantAnalysis | 0.0656 | 0.1204 | 0.1369 | 0.0099 | 0.0098 | 0.0096
GaussianNB | 0.1334 | 0.1789 | 0.1535 | 0.0104 | 0.0105 | 0.0100
LogisticRegression | 0.0740 | 0.1223 | 0.1377 | 0.7673 | 0.7828 | 0.7868
SVC | 0.2208 | 0.2254 | 0.1828 | 1.1766 | 1.0702 | 1.0515
LinearSVC | 0.0643 | 0.0488 | 0.1015 | 0.4130 | 0.3105 | 0.3483
NearestCentroid | 0.1886 | 0.2232 | 0.1755 | 0.0082 | 0.0087 | 0.0083
CalibratedClassifierCV | 0.0517 | 0.0574 | 0.0903 | 1.9721 | 1.5766 | 1.6991
SGDClassifier | 0.0739 | 0.0949 | 0.1271 | 0.1121 | 0.1167 | 0.1138
QuadraticDiscriminantAnalysis | 0.1271 | 0.2039 | 0.2138 | 0.0103 | 0.0108 | 0.0106
Perceptron | 0.0630 | 0.0751 | 0.0775 | 0.0552 | 0.0553 | 0.0556
PassiveAggressiveClassifier | 0.0777 | 0.0642 | 0.0660 | 0.0577 | 0.0590 | 0.0584
BernoulliNB | 0.0566 | 0.0445 | 0.0710 | 0.0089 | 0.0092 | 0.0089
RidgeClassifier | 0.0393 | 0.0417 | 0.0505 | 0.0119 | 0.0116 | 0.0116
RidgeClassifierCV | 0.0393 | 0.0416 | 0.0505 | 0.0208 | 0.0208 | 0.0202
AdaBoostClassifier | 0.0680 | 0.0643 | 0.0739 | 0.3950 | 0.3911 | 0.3930
DummyClassifier | 0.0256 | 0.0256 | 0.0258 | 0.0060 | 0.0061 | 0.0056
Table A9. Performance of evaluation metric (balanced accuracy) for imputation data using t-SNE with 3 components and different perplexity values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: Tsne10-3 | Tsne30-3 | Tsne123-3 | Time Taken: Tsne10-3 | Tsne30-3 | Tsne123-3
ExtraTreesClassifier | 0.9020 | 0.8949 | 0.8782 | 0.3401 | 0.3438 | 0.3352
RandomForestClassifier | 0.8931 | 0.8874 | 0.8698 | 0.6713 | 0.6871 | 0.6670
BaggingClassifier | 0.8700 | 0.8680 | 0.8491 | 0.1562 | 0.1625 | 0.1540
DecisionTreeClassifier | 0.8859 | 0.8789 | 0.8594 | 0.0282 | 0.0291 | 0.0284
ExtraTreeClassifier | 0.8596 | 0.8642 | 0.8472 | 0.0096 | 0.0092 | 0.0090
LabelPropagation | 0.6281 | 0.5808 | 0.5165 | 0.6709 | 0.6964 | 0.6811
LabelSpreading | 0.6157 | 0.5648 | 0.4993 | 1.2951 | 1.3629 | 1.3262
KNeighborsClassifier | 0.7082 | 0.6895 | 0.6872 | 0.0350 | 0.0353 | 0.0349
LGBMClassifier | 0.5525 | 0.8778 | 0.2000 | 1.5646 | 1.8609 | 1.2253
LinearDiscriminantAnalysis | 0.1147 | 0.1780 | 0.1422 | 0.0104 | 0.0109 | 0.0103
GaussianNB | 0.1871 | 0.2149 | 0.1996 | 0.0106 | 0.0112 | 0.0109
LogisticRegression | 0.1378 | 0.1809 | 0.1572 | 0.8007 | 0.8137 | 0.7964
SVC | 0.3073 | 0.2797 | 0.2390 | 1.1099 | 1.0340 | 1.0214
LinearSVC | 0.0845 | 0.1382 | 0.1246 | 0.4434 | 0.3702 | 0.4347
NearestCentroid | 0.2421 | 0.2732 | 0.2344 | 0.0086 | 0.0088 | 0.0083
CalibratedClassifierCV | 0.0847 | 0.1343 | 0.1344 | 2.0596 | 1.7766 | 1.9859
SGDClassifier | 0.0899 | 0.1176 | 0.1127 | 0.1249 | 0.1255 | 0.1270
QuadraticDiscriminantAnalysis | 0.2680 | 0.2825 | 0.2649 | 0.0116 | 0.0116 | 0.0112
Perceptron | 0.0893 | 0.1057 | 0.0892 | 0.0555 | 0.0563 | 0.0571
PassiveAggressiveClassifier | 0.0835 | 0.1015 | 0.0978 | 0.0623 | 0.0652 | 0.0604
BernoulliNB | 0.0451 | 0.0631 | 0.0794 | 0.0092 | 0.0091 | 0.0090
RidgeClassifier | 0.0438 | 0.0597 | 0.0514 | 0.0114 | 0.0124 | 0.0120
RidgeClassifierCV | 0.0437 | 0.0597 | 0.0514 | 0.0209 | 0.0217 | 0.0209
AdaBoostClassifier | 0.0680 | 0.0532 | 0.0632 | 0.4350 | 0.4517 | 0.4383
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0064 | 0.0062 | 0.0060
Table A10. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 2 components and different near-neighbor values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-2 | UMAP30-2 | UMAP123-2 | Time Taken: UMAP10-2 | UMAP30-2 | UMAP123-2
ExtraTreesClassifier | 0.8110 | 0.7424 | 0.6083 | 0.4148 | 0.4404 | 0.5185
RandomForestClassifier | 0.7918 | 0.6940 | 0.5798 | 0.6966 | 0.6987 | 0.7151
BaggingClassifier | 0.7718 | 0.6826 | 0.5693 | 0.1154 | 0.1199 | 0.1241
DecisionTreeClassifier | 0.7774 | 0.6911 | 0.5701 | 0.0220 | 0.0232 | 0.0238
ExtraTreeClassifier | 0.7567 | 0.6468 | 0.5354 | 0.0098 | 0.0098 | 0.0102
LabelPropagation | 0.3291 | 0.2181 | 0.1495 | 0.6906 | 0.6815 | 0.6858
LabelSpreading | 0.3192 | 0.2117 | 0.1475 | 1.3000 | 1.3359 | 1.3080
KNeighborsClassifier | 0.6586 | 0.6123 | 0.5288 | 0.0332 | 0.0334 | 0.0340
LGBMClassifier | 0.7266 | 0.3032 | 0.2506 | 1.8212 | 1.3384 | 1.3348
LinearDiscriminantAnalysis | 0.0597 | 0.1092 | 0.1059 | 0.0098 | 0.0097 | 0.0098
GaussianNB | 0.0837 | 0.1664 | 0.1777 | 0.0104 | 0.0104 | 0.0106
LogisticRegression | 0.0597 | 0.1090 | 0.1301 | 0.7845 | 0.7887 | 0.7860
SVC | 0.1746 | 0.1861 | 0.1589 | 1.3408 | 1.0326 | 1.0518
LinearSVC | 0.0572 | 0.0889 | 0.1008 | 0.6045 | 0.4664 | 0.4674
NearestCentroid | 0.1542 | 0.2195 | 0.1912 | 0.0084 | 0.0084 | 0.0086
CalibratedClassifierCV | 0.0515 | 0.0826 | 0.0980 | 2.7329 | 2.1119 | 2.1660
SGDClassifier | 0.0623 | 0.0999 | 0.1255 | 0.1183 | 0.1174 | 0.1116
QuadraticDiscriminantAnalysis | 0.1039 | 0.2194 | 0.1824 | 0.0106 | 0.0106 | 0.0109
Perceptron | 0.0439 | 0.0973 | 0.0933 | 0.0555 | 0.0573 | 0.0559
PassiveAggressiveClassifier | 0.0478 | 0.1005 | 0.1054 | 0.0589 | 0.0584 | 0.0593
BernoulliNB | 0.0384 | 0.0729 | 0.0756 | 0.0094 | 0.0088 | 0.0090
RidgeClassifier | 0.0323 | 0.0478 | 0.0755 | 0.0114 | 0.0116 | 0.0116
RidgeClassifierCV | 0.0323 | 0.0478 | 0.0755 | 0.0202 | 0.0202 | 0.0208
AdaBoostClassifier | 0.0750 | 0.1011 | 0.0828 | 0.4024 | 0.4058 | 0.4120
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0060 | 0.0062 | 0.0058
Table A11. Performance of evaluation metric (balanced accuracy) for imputation data using UMAP with 3 components and different near-neighbor values using k-fold cross validation, along with the time taken for the training of each algorithm.

Classification Algorithm | Balanced Accuracy: UMAP10-3 | UMAP30-3 | UMAP123-3 | Time Taken: UMAP10-3 | UMAP30-3 | UMAP123-3
ExtraTreesClassifier | 0.8364 | 0.7895 | 0.7095 | 0.4025 | 0.4238 | 0.4366
RandomForestClassifier | 0.8125 | 0.7496 | 0.6620 | 0.7047 | 0.6986 | 0.7160
BaggingClassifier | 0.7875 | 0.7276 | 0.6565 | 0.1689 | 0.1582 | 0.1657
DecisionTreeClassifier | 0.7928 | 0.7254 | 0.6328 | 0.0317 | 0.0290 | 0.0304
ExtraTreeClassifier | 0.7730 | 0.6983 | 0.6043 | 0.0094 | 0.0098 | 0.0098
LabelPropagation | 0.4010 | 0.2603 | 0.2189 | 0.6862 | 0.6830 | 0.6763
LabelSpreading | 0.3954 | 0.2556 | 0.2128 | 1.3110 | 1.3243 | 1.3079
KNeighborsClassifier | 0.6526 | 0.6310 | 0.6115 | 0.0346 | 0.0342 | 0.0344
LGBMClassifier | 0.7635 | 0.4674 | 0.6154 | 1.8766 | 1.5890 | 1.8907
LinearDiscriminantAnalysis | 0.1032 | 0.1261 | 0.1419 | 0.0103 | 0.0102 | 0.0102
GaussianNB | 0.1583 | 0.2012 | 0.1972 | 0.0108 | 0.0108 | 0.0106
LogisticRegression | 0.1041 | 0.1424 | 0.1455 | 0.7614 | 0.7653 | 0.7951
SVC | 0.2490 | 0.2245 | 0.1585 | 1.1175 | 0.9994 | 1.0043
LinearSVC | 0.0725 | 0.1315 | 0.1385 | 0.7941 | 0.8242 | 0.7081
NearestCentroid | 0.2129 | 0.1877 | 0.2450 | 0.0082 | 0.0086 | 0.0086
CalibratedClassifierCV | 0.0698 | 0.1198 | 0.1306 | 3.3967 | 3.4009 | 3.0388
SGDClassifier | 0.0881 | 0.1174 | 0.1538 | 0.1399 | 0.1354 | 0.1227
QuadraticDiscriminantAnalysis | 0.2310 | 0.2862 | 0.2889 | 0.0115 | 0.0116 | 0.0112
Perceptron | 0.0788 | 0.1166 | 0.0929 | 0.0595 | 0.0622 | 0.0590
PassiveAggressiveClassifier | 0.0832 | 0.1077 | 0.0796 | 0.0622 | 0.0648 | 0.0624
BernoulliNB | 0.0459 | 0.0759 | 0.1015 | 0.0090 | 0.0094 | 0.0092
RidgeClassifier | 0.0619 | 0.0739 | 0.1017 | 0.0116 | 0.0120 | 0.0119
RidgeClassifierCV | 0.0615 | 0.0738 | 0.1017 | 0.0204 | 0.0218 | 0.0205
AdaBoostClassifier | 0.0714 | 0.0755 | 0.0761 | 0.4289 | 0.4329 | 0.4294
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0059 | 0.0060 | 0.0062

References

  1. Hu, Z.; Jin, Y.; Hu, Q.; Sen, S.; Zhou, T.; Osman, M.T. Prediction of fuel consumption for enroute ship based on machine learning. IEEE Access 2019, 7, 119497–119505. [Google Scholar] [CrossRef]
  2. Rawson, A.; Brito, M.; Sabeur, Z.; Tran-Thanh, L. A machine learning approach for monitoring ship safety in extreme weather events. Saf. Sci. 2021, 141, 105336. [Google Scholar] [CrossRef]
  3. Akyuz, E.; Cicek, K.; Celik, M. A comparative research of machine learning impact to future of maritime transportation. Procedia Comput. Sci. 2019, 158, 275–280. [Google Scholar] [CrossRef]
  4. İnceişçi, F.K.; Ayça, A. Fault Analysis of Ship Machinery Using Machine Learning Techniques. Int. J. Marit. Eng. 2022, 164. [Google Scholar] [CrossRef]
  5. Hwang, T.; Youn, I.H. Navigation Situation Clustering Model of Human-Operated Ships for Maritime Autonomous Surface Ship Collision Avoidance Tests. J. Mar. Sci. Eng. 2021, 9, 1458. [Google Scholar] [CrossRef]
  6. Yekeen, S.T.; Balogun, A.L.; Yusof, K.B.W. A novel deep learning instance segmentation model for automated marine oil spill detection. ISPRS J. Photogramm. Remote Sens. 2020, 167, 190–200. [Google Scholar] [CrossRef]
  7. Uyanık, T.; Karatuğ, Ç.; Arslanoğlu, Y. Machine learning approach to ship fuel consumption: A case of container vessel. Transp. Res. Part D Transp. Environ. 2020, 84, 102389. [Google Scholar] [CrossRef]
  8. Huang, L.; Pena, B.; Liu, Y.; Anderlini, E. Machine learning in sustainable ship design and operation: A review. Ocean Eng. 2022, 266, 112907. [Google Scholar] [CrossRef]
  9. Du, Y.; Chen, Y.; Li, X.; Schönborn, A.; Sun, Z. Data fusion and machine learning for ship fuel efficiency modeling: Part III–Sensor data and meteorological data. Commun. Transp. Res. 2022, 2, 100072. [Google Scholar] [CrossRef]
  10. Oruc, A. Claims of state-sponsored cyberattack in the maritime industry. In Proceedings of the Conference Proceedings of INEC, Online, 5–9 October 2020. [Google Scholar]
  11. Lee, C.B.; Wan, J.; Shi, W.; Li, K. A cross-country study of competitiveness of the shipping industry. Transp. Policy 2014, 35, 366–376. [Google Scholar] [CrossRef]
  12. Zaman, I.; Pazouki, K.; Norman, R.; Younessi, S.; Coleman, S. Challenges and opportunities of big data analytics for upcoming regulations and future transformation of the shipping industry. Procedia Eng. 2017, 194, 537–544. [Google Scholar] [CrossRef]
  13. Bui, K.Q.; Perera, L.P. The compliance challenges in emissions control regulations to reduce air pollution from shipping. In Proceedings of the OCEANS 2019-Marseille, Marseille, France, 17–20 June 2019; pp. 1–8. [Google Scholar]
  14. Buixadé Farré, A.; Stephenson, S.R.; Chen, L.; Czub, M.; Dai, Y.; Demchev, D.; Efimov, Y.; Graczyk, P.; Grythe, H.; Keil, K.; et al. Commercial Arctic shipping through the Northeast Passage: Routes, resources, governance, technology, and infrastructure. Polar Geogr. 2014, 37, 298–324. [Google Scholar] [CrossRef]
  15. Shepherd, I. European efforts to make marine data more accessible. Ethics Sci. Environ. Politics 2018, 18, 75–81. [Google Scholar] [CrossRef]
  16. Arifin, M.D. Application of Internet of Things (IoT) and Big Data in the Maritime Industries: Ship Allocation Model. Int. J. Mar. Eng. Innov. Res. 2023, 8, 97–108. [Google Scholar] [CrossRef]
  17. Skarlatos, K.; Fousteris, A.; Georgakellos, D.; Economou, P.; Bersimis, S. Assessing Ships’ Environmental Performance Using Machine Learning. Energies 2023, 16, 2544. [Google Scholar] [CrossRef]
  18. Rawson, A.; Brito, M. A survey of the opportunities and challenges of supervised machine learning in maritime risk analysis. Transp. Rev. 2023, 43, 108–130. [Google Scholar] [CrossRef]
  19. Tsaganos, G.; Nikitakos, N.; Dalaklis, D.; Ölcer, A.; Papachristos, D. Machine learning algorithms in shipping: Improving engine fault detection and diagnosis via ensemble methods. WMU J. Marit. Aff. 2020, 19, 51–72. [Google Scholar] [CrossRef]
  20. Gu, J.; Oelke, D. Understanding bias in machine learning. arXiv 2019, arXiv:1909.01866. [Google Scholar]
  21. Lindstad, H.E.; Eskeland, G.S. Environmental regulations in shipping: Policies leaning towards globalization of scrubbers deserve scrutiny. Transp. Res. Part D Transp. Environ. 2016, 47, 67–76. [Google Scholar] [CrossRef]
  22. Psaraftis, H.N.; Kontovas, C.A. Speed models for energy-efficient maritime transportation: A taxonomy and survey. Transp. Res. Part C Emerg. Technol. 2013, 26, 331–351. [Google Scholar] [CrossRef]
  23. Geng, J.B.; Cai, J.B.; Luo, M.J.; Niu, J.Z. Main Diesel Engine Selection for Ships Based on Life Cycle Costing. In 2015 International Conference on Management Science and Management Innovation (MSMI 2015); Atlantis Press: Amsterdam, The Netherlands, 2015; pp. 361–366. [Google Scholar]
  24. Tadros, M.; Ventura, M.; Soares, C.G. Surrogate models of the performance and exhaust emissions of marine diesel engines for ship conceptual design. Transport 2018, 2, 105–112. [Google Scholar]
  25. Papanikolaou, A. Ship Design: Methodologies of Preliminary Design; Springer: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
  26. Avgeridis, L.; Lentzos, K.; Skoutas, D.; Emiris, I.Z. Time Series Analysis for Digital Twins in Green Shipping. In SNAME International Symposium on Ship Operations, Management and Economics; SNAME: Alexandria, VA, USA, 2023; p. D011S003R003. [Google Scholar]
  27. Giering, J.E.; Dyck, A. Maritime Digital Twin architecture: A concept for holistic Digital Twin application for shipbuilding and shipping. at-Automatisierungstechnik 2021, 69, 1081–1095. [Google Scholar] [CrossRef]
  28. Zavareh, B.; Foroozan, H.; Gheisarnejad, M.; Khooban, M.H. New trends on digital twin-based blockchain technology in zero-emission ship applications. Nav. Eng. J. 2021, 133, 115–135. [Google Scholar]
  29. Emmanuel, T.; Maupong, T.; Mpoeleng, D.; Semong, T.; Mphago, B.; Tabona, O. A survey on missing data in machine learning. J. Big Data 2021, 8, 1–37. [Google Scholar] [CrossRef] [PubMed]
  30. Bouhlila, D.S.; Sellaouti, F. Multiple imputation using chained equations for missing data in TIMSS: A case study. Large-Scale Assess. Educ. 2013, 1, 4. [Google Scholar] [CrossRef]
  31. Seu, K.; Kang, M.S.; Lee, H. An intelligent missing data imputation techniques: A review. JOIV Int. J. Inform. Vis. 2022, 6, 278–283. [Google Scholar] [CrossRef]
  32. Henry, A.J.; Hevelone, N.D.; Lipsitz, S.; Nguyen, L.L. Comparative methods for handling missing data in large databases. J. Vasc. Surg. 2013, 58, 1353–1359.e6. [Google Scholar] [CrossRef]
  33. Little, R.J. A test of missing completely at random for multivariate data with missing values. J. Am. Stat. Assoc. 1988, 83, 1198–1202. [Google Scholar] [CrossRef]
  34. Shehadeh, A.; Alshboul, O.; Al Mamlook, R.E.; Hamedat, O. Machine learning models for predicting the residual value of heavy construction equipment: An evaluation of modified decision tree, LightGBM, and XGBoost regression. Autom. Constr. 2021, 129, 103827. [Google Scholar] [CrossRef]
  35. Jeganathan, S.; Lakshminarayanan, A.R.; Ramachandran, N.; Tunze, G.B. Predicting Academic Performance of Immigrant Students Using XGBoost Regressor. Int. J. Inf. Technol. Web Eng. (IJITWE) 2022, 17, 1–19. [Google Scholar] [CrossRef]
  36. Imane, M.; Aoula, E.S.; Achouyab, E.H. Using Bayesian ridge regression to predict the overall equipment effectiveness performance. In Proceedings of the 2022 2nd International Conference on Innovative Research in Applied Science, Engineering and Technology (IRASET), Meknes, Morocco, 3–4 March 2022; pp. 1–4. [Google Scholar]
  37. Botchkarev, A. Evaluating Performance of Regression Machine Learning Models Using Multiple Error Metrics in Azure Machine Learning Studio. 2018. Available online: https://ssrn.com/abstract=3177507 (accessed on 15 November 2023).
  38. Handelman, G.S.; Kok, H.K.; Chandra, R.V.; Razavi, A.H.; Huang, S.; Brooks, M.; Lee, M.J.; Asadi, H. Peering into the black box of artificial intelligence: Evaluation metrics of machine learning methods. Am. J. Roentgenol. 2019, 212, 38–43. [Google Scholar] [CrossRef] [PubMed]
  39. Bekri, E.; Yannopoulos, P.; Economou, P. Methodology for improving reliability of river discharge measurements. J. Environ. Manag. 2019, 247, 371–384. [Google Scholar] [CrossRef] [PubMed]
  40. Alexopoulos, P.; Skondra, M.; Kontogianni, E.; Vratsista, A.; Frounta, M.; Konstantopoulou, G.; Aligianni, S.I.; Charalampopoulou, M.; Lentzari, I.; Gourzis, P.; et al. Validation of the cognitive telephone screening instruments COGTEL and COGTEL+ in identifying clinically diagnosed neurocognitive disorder due to Alzheimer’s disease in a naturalistic clinical setting. J. Alzheimer’s Dis. 2021, 83, 259–268. [Google Scholar] [CrossRef] [PubMed]
  41. Tsikas, P.K.; Chassiakos, A.P.; Papadimitropoulos, V.C. Seismic damage assessment of highway bridges by means of soft computing techniques. In Structure and Infrastructure Engineering; Taylor & Francis: London, UK, 2022. [Google Scholar]
  42. Zhang, L.; Zhou, L.; Yuan, B.; Hu, F.; Zhang, Q.; Wei, W.; Sun, D. Spatiotemporal Evolution Characteristics of Urban Land Surface Temperature Based on Local Climate Zones in Xi’an Metropolitan, China. In Chinese Geographical Science; Springer: New York, NY, USA, 2023; pp. 1–16. [Google Scholar]
  43. Economou, P.; Batsidis, A.; Kounetas, K. Evaluation of the OECD’s prediction algorithm for the annual GDP growth rate. Commun. Stat. Case Stud. Data Anal. Appl. 2021, 7, 67–87. [Google Scholar] [CrossRef]
  44. Velliangiri, S.; Alagumuthukrishnan, S.; Thankumar joseph, S.I. A Review of Dimensionality Reduction Techniques for Efficient Computation. Procedia Comput. Sci. 2019, 165, 104–111. [Google Scholar] [CrossRef]
  45. Jackson, J.E. A User’s Guide to Principal Components; John Wiley & Sons: Hoboken, NJ, USA, 2005. [Google Scholar]
  46. Bersimis, S.; Georgakellos, D. A probabilistic framework for the evaluation of products’ environmental performance using life cycle approach and Principal Component Analysis. J. Clean. Prod. 2013, 42, 103–115. [Google Scholar] [CrossRef]
  47. Bersimis, S.; Sgora, A.; Psarakis, S. Methods for interpreting the out-of-control signal of multivariate control charts: A comparison study. Qual. Reliab. Eng. Int. 2017, 33, 2295–2326. [Google Scholar] [CrossRef]
  48. Maravelakis, P.; Bersimis, S.; Panaretos, J.; Psarakis, S. Identifying the out of control variable in a multivariate control chart. Commun. Stat.-Theory Methods 2002, 31, 2391–2408. [Google Scholar] [CrossRef]
  49. Kaiser, H.F. The application of electronic computers to factor analysis. Educ. Psychol. Meas. 1960, 20, 141–151. [Google Scholar] [CrossRef]
  50. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
  51. Vermeulen, M.; Smith, K.; Eremin, K.; Rayner, G.; Walton, M. Application of Uniform Manifold Approximation and Projection (UMAP) in spectral imaging of artworks. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2021, 252, 119547. [Google Scholar] [CrossRef]
  52. Milošević, D.; Medeiros, A.S.; Piperac, M.S.; Cvijanović, D.; Soininen, J.; Milosavljević, A.; Predić, B. The application of Uniform Manifold Approximation and Projection (UMAP) for unconstrained ordination and classification of biological indicators in aquatic ecology. Sci. Total Environ. 2022, 815, 152365. [Google Scholar] [CrossRef] [PubMed]
  53. Yu, T.T.; Chen, C.Y.; Wu, T.H.; Chang, Y.C. Application of high-dimensional uniform manifold approximation and projection (UMAP) to cluster existing landfills on the basis of geographical and environmental features. Sci. Total Environ. 2023, 904, 167013. [Google Scholar] [CrossRef] [PubMed]
  54. Maravelakis, P.E.; Bersimis, S. The use of Andrews curves for detecting the out-of-control variables when a multivariate control chart signals. Stat. Pap. 2009, 50, 51–65. [Google Scholar] [CrossRef]
  55. Skamnia, E.; Economou, P.; Bersimis, S.; Frouda, M.; Politis, A.; Alexopoulos, P. Hot spot identification method based on Andrews curves: An application on the COVID-19 crisis effects on caregiver distress in neurocognitive disorder. J. Appl. Stat. 2023, 50, 2388–2407. [Google Scholar] [CrossRef] [PubMed]
  56. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  57. Hamel, P.; Eck, D. Learning features from music audio with deep belief networks. In Proceedings of the ISMIR, Utrecht, The Netherlands, 9–13 August 2010; Volume 10, p. 341. [Google Scholar]
  58. Balamurali, M.; Silversides, K.L.; Melkumyan, A. A comparison of t-SNE, SOM and SPADE for identifying material type domains in geological data. Comput. Geosci. 2019, 125, 78–89. [Google Scholar] [CrossRef]
  59. Balamurali, M.; Melkumyan, A. t-SNE based visualisation and clustering of geological domain. In Proceedings of the Neural Information Processing: 23rd International Conference, ICONIP 2016, Kyoto, Japan, 16–21 October 2016; Proceedings, Part IV 23. pp. 565–572. [Google Scholar]
  60. Leung, R.; Balamurali, M.; Melkumyan, A. Sample truncation strategies for outlier removal in geochemical data: The MCD robust distance approach versus t-SNE ensemble clustering. Math. Geosci. 2021, 53, 105–130. [Google Scholar] [CrossRef]
  61. Jamieson, A.R.; Giger, M.L.; Drukker, K.; Li, H.; Yuan, Y.; Bhooshan, N. Exploring nonlinear feature space dimension reduction and data representation in breast CADx with Laplacian eigenmaps and-SNE. Med. Phys. 2010, 37, 339–351. [Google Scholar] [CrossRef]
  62. Wallach, I.; Lilien, R. The protein–small-molecule database, a non-redundant structural resource for the analysis of protein-ligand binding. Bioinformatics 2009, 25, 615–620. [Google Scholar] [CrossRef]
  63. Birjandtalab, J.; Pouyan, M.B.; Nourani, M. Nonlinear dimension reduction for EEG-based epileptic seizure detection. In Proceedings of the 2016 IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), Las Vegas, NV, USA, 24–27 February 2016; pp. 595–598. [Google Scholar]
  64. Hinton, G.E.; Roweis, S. Stochastic neighbor embedding. In Proceedings of the Advances in Neural Information Processing Systems 15 (NIPS 2002), Vancouver, BC, Canada, 9–14 December 2002; Volume 15. [Google Scholar]
  65. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  66. Breiman, L. Arcing Classifiers; Technical Report; University of California, Department of Statistics: Berkeley, CA, USA, 1996. [Google Scholar]
  67. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef]
  68. Kramer, O.; Kramer, O. K-nearest neighbors. In Dimensionality Reduction with Unsupervised Nearest Neighbors; Springer: Berlin/Heidelberg, Germany, 2013; pp. 13–23. [Google Scholar]
  69. Gou, J.; Yi, Z.; Du, L.; Xiong, T. A local mean-based k-nearest centroid neighbor classifier. Comput. J. 2012, 55, 1058–1071. [Google Scholar] [CrossRef]
  70. Yuan, G.X.; Ho, C.H.; Lin, C.J. Recent advances of large-scale linear classification. Proc. IEEE 2012, 100, 2584–2603. [Google Scholar] [CrossRef]
  71. Crammer, K.; Dekel, O.; Keshet, J.; Shalev-Shwartz, S.; Singer, Y. Online passive aggressive algorithms. J. Mach. Learn. Res. 2006, 7, 551–585. [Google Scholar]
  72. Zhu, X.; Ghahramani, Z. Learning from labeled and unlabeled data with label propagation. ProQuest Number Inf. All Users 2002. [Google Scholar]
  73. Breiman, L. Pasting small votes for classification in large databases and on-line. Mach. Learn. 1999, 36, 85–103. [Google Scholar] [CrossRef]
  74. Breiman, L. Bagging predictors. Mach. Learn. 1996, 24, 123–140. [Google Scholar] [CrossRef]
  75. Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar]
  76. Louppe, G.; Geurts, P. Ensembles on random patches. In Proceedings of the Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2012, Bristol, UK, 24–28 September 2012; Proceedings, Part I 23. pp. 346–361. [Google Scholar]
  77. Ferreira, A.J.; Figueiredo, M.A. Boosting algorithms: A review of methods, theory, and applications. In Ensemble Machine Learning: Methods and Applications; Springer: New York, NY, USA, 2012; pp. 35–85. [Google Scholar]
  78. Jordanov, I.; Petrov, N.; Petrozziello, A. Classifiers Accuracy Improvement Based on Missing Data Imputation. J. Artif. Intell. Soft Comput. Res. 2018, 8, 31–48. [Google Scholar] [CrossRef]
  79. Little, R.J.; Rubin, D.B. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019; Volume 793. [Google Scholar]
  80. Ramoni, M.; Sebastiani, P. Robust Bayes classifiers. Artif. Intell. 2001, 125, 209–226. [Google Scholar] [CrossRef]
81. Zhang, X.; Song, S.; Wu, C. Robust Bayesian classification with incomplete data. Cogn. Comput. 2013, 5, 170–187. [Google Scholar] [CrossRef]
  82. Guyon, I. Practical feature selection: From correlation to causality. In Mining Massive Data Sets for Security: Advances in Data Mining, Search, Social Networks and Text Mining, and Their Applications to Security; IOS Press: Amsterdam, The Netherlands, 2008; pp. 27–43. [Google Scholar]
83. Anis, M.; Ali, M. Investigating the performance of SMOTE for class imbalanced learning: A case study of credit scoring datasets. Eur. Sci. J. 2017, 13, 340–353. [Google Scholar] [CrossRef]
  84. Elor, Y.; Averbuch-Elor, H. To SMOTE, or not to SMOTE? arXiv 2022, arXiv:2201.08528. [Google Scholar]
Figure 1. The 4 main steps of the implementation process.
Figure 2. Spearman's correlation coefficients between the independent variables.
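As a minimal, illustrative sketch (not the authors' code), a correlation matrix like the one shown in Figure 2 can be obtained directly with pandas; the frame contents below are synthetic stand-ins for the ship features of Tables 2 and 3.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the ship-feature frame; replace with the real data.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(500, 4)),
                  columns=["DWT", "LOA", "Beam", "Bal Speed"])

# Spearman's rank correlation tolerates skewed scales and monotone,
# non-linear relations better than Pearson's correlation does.
corr = df.corr(method="spearman")
print(corr.round(2))
```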
Figure 3. Left plot: Scree plot of Principal Components for both the imputed and non-imputed data sets. Right plot: Variance explained by Principal Components for both the imputed and non-imputed data sets.
Figure 4. The 5 most frequent engine types in the reduced two-dimensional space obtained by the three dimensionality reduction methods (upper left plot: PCA; upper right plot: UMAP; bottom plot: t-SNE).
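A hedged sketch of how the three two-dimensional embeddings of Figure 4 could be computed is given below, with the t-SNE perplexity (10) and UMAP neighborhood size (10) matching the best-performing settings reported later; the feature matrix is a synthetic placeholder.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap  # provided by the umap-learn package

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))           # stand-in for the imputed feature matrix
X_std = StandardScaler().fit_transform(X)

emb_pca = PCA(n_components=2).fit_transform(X_std)
emb_tsne = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_std)
emb_umap = umap.UMAP(n_components=2, n_neighbors=10,
                     random_state=0).fit_transform(X_std)
```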
Figure 5. Performance of evaluation metrics (balanced accuracy) for imputed data using k-fold cross-validation.
Table 1. Description of each subset of the data set, based on data providers.

Source | Observations | Features | Period
DB1 | 141,080 | 72 | January 2019–November 2023
DB2 | 78,065 | 81 | January 2019–November 2023
DB3 | 4733 | 60 | January 2019–November 2023
MRV1 | 12,255 | 61 | 2018
MRV2 | 12,399 | 61 | 2019
MRV3 | 12,067 | 61 | 2020
MRV4 | 12,290 | 61 | 2021
MRV5 | 12,912 | 61 | 2022
Table 2. Description and variables' names of the first group (exterior measurements of a ship).

Name | Variable | Description
IMO | G_a1 | Ship identification number.
DWT | G_a2 | Deadweight of the ship.
Design Draft | G_a3 | Vertical distance between the waterline and the bottom of the hull.
LOA | G_a4 | The length of the ship (length overall).
Beam | G_a5 | The width of the ship at its widest point.
Depth | G_a6 | The depth measured at the middle of the length, from the top of the keel to the top of the deck beam at the side of the uppermost continuous deck.
Grain Capacity | G_a7 | The capacity of cargo spaces measured laterally to the outside of frames, and vertically from the tank tops to the top of the under-weatherdeck beams, including the area contained within a ship's hatchway coamings.
Built year | G_a8 | The year of completion of the ship.
Gross Tonnage | G_a9 | The volume of the ship in cubic meters below the main deck and the enclosed spaces above the main deck.
Net Tonnage | G_a10 | The volume of the cargo space.
Table 3. Description and variables' names of the second group (operational).

Name | Variable | Description
Bal Speed | G_b1 | Speed at which the vessel travels empty or largely empty (in ballast).
Lad Speed | G_b2 | Speed at which the vessel travels loaded (laden).
Bal VLSFO | G_b3 | VLSFO (1) consumption when the vessel travels empty or largely empty.
Lad VLSFO | G_b4 | VLSFO consumption when the vessel travels loaded.
Bal MGO | G_b5 | MDO/MGO (2) consumption when the vessel travels empty or largely empty.
Lad MGO | G_b6 | MDO/MGO consumption when the vessel travels loaded.
VLSFO pi | G_b7 | VLSFO consumption during the idle state.
MGO pi | G_b8 | MDO/MGO consumption during the idle state.

(1) Very Low Sulfur Fuel Oil; (2) Marine Diesel Oil/Marine Gas Oil.
Table 4. Evaluation metrics of MICE imputation, using the Bayesian Ridge estimator within the imputation algorithm.

Variable | MSE | RMSE | MAE | F-test p-value
G_a2 | 13.12 × 10^6 | 3623.057 | 1851.556 | 0.063
G_a3 | 0.403 | 0.635 | 0.339 | 0.054
G_a4 | 34.446 | 5.869 | 4.282 | 0.152
G_a5 | 1.422 | 1.192 | 0.702 | 0.130
G_a6 | 0.567 | 0.753 | 0.494 | 0.075
G_a7 | 19.13 × 10^6 | 4443.108 | 2357.83 | 0.086
G_a8 | 13.099 | 3.612 | 2.666 | 0.073
G_a9 | 3.74 × 10^6 | 1934.556 | 975.555 | 0.078
G_a10 | 2.46 × 10^6 | 1570.111 | 832.247 | 0.076
G_b1 | 0.425 | 0.652 | 0.470 | 0.143
G_b2 | 0.391 | 0.626 | 0.446 | 0.098
G_b3 | 10.664 | 3.265 | 2.310 | 0.127
G_b4 | 27.881 | 5.281 | 2.123 | 0.103
G_b5 | 0.015 | 0.122 | 0.054 | 0.004
G_b6 | 0.020 | 0.143 | 0.053 | 0.007
G_b7 | 1.168 | 1.108 | 0.508 | 0.057
G_b8 | 0.084 | 0.290 | 0.167 | 0.065
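For concreteness, a MICE-style imputation of this kind can be sketched with scikit-learn's IterativeImputer; this is an illustrative snippet under synthetic data, not the authors' implementation, and the missingness rate and iteration count are placeholders.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Synthetic matrix with ~10% missing entries; replace with the real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
X[rng.random(X.shape) < 0.10] = np.nan

# Chained-equation (MICE-style) imputation with a Bayesian Ridge estimator,
# as evaluated in Table 4.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X)
```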
Table 5. Evaluation metrics of MICE imputation, using the Light Gradient Boosting estimator within the imputation algorithm.

Variable | MSE | RMSE | MAE | F-test p-value
G_a2 | 1.06 × 10^6 | 1030.534 | 377.893 | 0.232
G_a3 | 0.399 | 0.632 | 0.199 | 0.258
G_a4 | 1.757 | 1.325 | 0.445 | 0.136
G_a5 | 0.163 | 0.404 | 0.113 | 0.143
G_a6 | 0.028 | 0.169 | 0.060 | 0.185
G_a7 | 2.84 × 10^6 | 1685.302 | 535.409 | 0.097
G_a8 | 2.248 | 1.499 | 1.034 | 0.137
G_a9 | 6.07 × 10^5 | 779.478 | 261.106 | 0.096
G_a10 | 1.29 × 10^6 | 1139.792 | 219.017 | 0.146
G_b1 | 0.182 | 0.427 | 0.278 | 0.201
G_b2 | 0.187 | 0.433 | 0.271 | 0.233
G_b3 | 5.620 | 2.370 | 1.189 | 0.219
G_b4 | 5.721 | 2.392 | 1.142 | 0.182
G_b5 | 0.014 | 0.120 | 0.029 | 0.056
G_b6 | 0.022 | 0.150 | 0.039 | 0.061
G_b7 | 0.202 | 0.449 | 0.327 | 0.153
G_b8 | 0.032 | 0.179 | 0.090 | 0.166
Table 6. Evaluation metrics of MICE imputation, using the Extreme Gradient Boosting estimator within the imputation algorithm.

Variable | MSE | RMSE | MAE | F-test p-value
G_a2 | 2.9 × 10^6 | 1711.004 | 265.092 | 0.304
G_a3 | 0.288 | 0.536 | 0.166 | 0.279
G_a4 | 1.280 | 1.131 | 0.236 | 0.455
G_a5 | 0.137 | 0.370 | 0.076 | 0.226
G_a6 | 0.027 | 0.166 | 0.037 | 0.314
G_a7 | 4.60 × 10^6 | 2146.545 | 432.547 | 0.151
G_a8 | 1.378 | 1.174 | 0.745 | 0.186
G_a9 | 4.4 × 10^5 | 670.529 | 155.101 | 0.115
G_a10 | 1.78 × 10^6 | 1337.567 | 168.707 | 0.247
G_b1 | 0.192 | 0.438 | 0.258 | 0.293
G_b2 | 0.188 | 0.434 | 0.255 | 0.356
G_b3 | 4.262 | 2.064 | 1.038 | 0.346
G_b4 | 2.568 | 1.602 | 0.905 | 0.206
G_b5 | 0.012 | 0.111 | 0.024 | 0.078
G_b6 | 0.020 | 0.142 | 0.032 | 0.081
G_b7 | 0.174 | 0.417 | 0.295 | 0.254
G_b8 | 0.034 | 0.184 | 0.082 | 0.188
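The boosting variants of Tables 5 and 6 can be sketched by swapping the estimator inside the same chained-equation scheme; the snippet below uses the scikit-learn-compatible LGBMRegressor and XGBRegressor with default hyperparameters, which need not match those used in the study.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 17))
X[rng.random(X.shape) < 0.10] = np.nan   # synthetic missingness

# Same MICE-style loop, tree-boosting estimators instead of Bayesian Ridge.
X_lgbm = IterativeImputer(estimator=LGBMRegressor(verbose=-1),
                          max_iter=10, random_state=0).fit_transform(X)
X_xgb = IterativeImputer(estimator=XGBRegressor(),
                         max_iter=10, random_state=0).fit_transform(X)
```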
Table 7. Loadings of PCA for the imputed data set.

Variable | PC1 | PC2 | PC3 | PC4
G_a2 | 0.322 | −0.023 | −0.018 | −0.003
G_a3 | 0.304 | −0.019 | 0.005 | 0.010
G_a4 | 0.315 | −0.035 | −0.018 | 0.014
G_a5 | 0.310 | −0.013 | −0.047 | −0.025
G_a6 | 0.311 | −0.016 | −0.007 | 0.024
G_a7 | 0.324 | −0.025 | −0.018 | 0.002
G_a8 | −0.054 | 0.098 | −0.233 | 0.564
G_a9 | 0.322 | −0.021 | −0.027 | 0.007
G_a10 | 0.322 | −0.027 | −0.013 | −0.006
G_b1 | −0.110 | 0.594 | −0.002 | 0.024
G_b2 | −0.068 | 0.644 | −0.012 | 0.025
G_b3 | 0.261 | 0.304 | −0.014 | −0.092
G_b4 | 0.262 | 0.339 | −0.009 | −0.076
G_b5 | 0.053 | −0.018 | 0.644 | 0.261
G_b6 | 0.050 | 0.038 | 0.595 | 0.394
G_b7 | 0.180 | 0.010 | −0.218 | 0.329
G_b8 | 0.040 | 0.078 | 0.350 | −0.575
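A loadings table of this form can be extracted as the transposed eigenvector coefficients of a fitted PCA; the sketch below assumes standardized, imputed inputs (here synthetic), and note that some texts instead rescale these coefficients by the square roots of the eigenvalues.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
names = [f"G_a{i}" for i in range(2, 11)] + [f"G_b{i}" for i in range(1, 9)]
X_std = StandardScaler().fit_transform(rng.normal(size=(500, len(names))))

pca = PCA(n_components=4).fit(X_std)
loadings = pd.DataFrame(pca.components_.T, index=names,
                        columns=["PC1", "PC2", "PC3", "PC4"])
print(loadings.round(3))
print(pca.explained_variance_ratio_.round(3))  # cf. the scree plot of Figure 3
```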
Table 8. The final main engine models of the procedure along with their encoding and their frequency in the data set.

Encoding | Main Engine Model | Frequency
2 | 6S50MC-C | 1791
7 | 6S42MC | 1082
1 | 6UEC45LSE | 369
31 | 5S50MC-C | 369
13 | 6S46MC-C | 333
20 | 5S50ME-B9 | 231
43 | 6S46ME-B8 | 194
17 | 6S60MC-C | 189
19 | 6S50MC-C8 | 174
30 | 7S50MC-C | 168
4 | 6S60MC | 153
5 | 6UEC52LA | 135
23 | 6S50MC | 133
27 | 6RT-FL50 | 131
10 | 5S60ME-C8 | 127
25 | 6S70MC-C | 127
64 | 6S46MC-C8 | 115
0 | 6S70MC | 107
12 | 5S50MC | 107
22 | 6S50ME-C8 | 100
38 | 6S50ME-B9 | 95
49 | 6RTA48T | 94
9 | 6S60ME-C8 | 93
11 | 5S60MC-C | 93
28 | 6RT-FLEX50 | 92
6 | 6UEC45LSE-ECOB2 | 72
3 | 6S60MC-C8 | 52
15 | 5RT-FL50D | 52
59 | 5S60MC-C8 | 46
21 | 6S42MC7 | 43
71 | 7S35MC | 42
33 | 5G60ME-C9 | 38
110 | 5UEC45LSE | 36
97 | 5S50MC-C8 | 33
57 | 6RTA48T-B | 32
81 | 6RT-FL48T | 31
32 | 6UEC50LSII | 30
88 | 5RT-FLEX58T-B | 26
77 | 6G70ME-C9 | 24
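One way such integer encodings can arise is from a label encoder fitted on the full engine-model vocabulary before rare classes are filtered out, which would explain why the surviving encodings in Table 8 are not consecutive. The sketch below is illustrative only; the column name, data, and frequency threshold (24, the rarest model retained in Table 8) are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the main-engine-model column.
rng = np.random.default_rng(0)
models = pd.Series(rng.choice(["6S50MC-C", "6S42MC", "6UEC45LSE", "7S35MC"],
                              size=500))

y = LabelEncoder().fit_transform(models)

# Keep only classes at least as frequent as the rarest model in Table 8.
counts = models.value_counts()
mask = models.isin(counts[counts >= 24].index).to_numpy()
y_kept = y[mask]
```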
Table 9. Performance of the evaluation metric (balanced accuracy) for the imputed data and the top-performing configuration of each of the three dimensionality reduction methods, using k-fold cross-validation. The time taken by each algorithm is also included. (BA = balanced accuracy; Time = time taken.)

Classification Algorithm | BA Imput | BA PCA-4 | BA t-SNE-10-3 | BA UMAP-10-3 | Time Imput | Time PCA-4 | Time t-SNE-10-3 | Time UMAP-10-3
ExtraTreesClassifier | 0.9507 | 0.8709 | 0.9020 | 0.8364 | 0.3544 | 0.4136 | 0.3401 | 0.4025
RandomForestClassifier | 0.9502 | 0.8577 | 0.8931 | 0.8125 | 0.5524 | 1.0609 | 0.6713 | 0.7047
BaggingClassifier | 0.9408 | 0.8319 | 0.8700 | 0.7875 | 0.1792 | 0.1886 | 0.1562 | 0.1689
DecisionTreeClassifier | 0.9340 | 0.8267 | 0.8859 | 0.7928 | 0.0299 | 0.0344 | 0.0282 | 0.0317
ExtraTreeClassifier | 0.9034 | 0.8240 | 0.8596 | 0.7730 | 0.0106 | 0.0100 | 0.0096 | 0.0094
LabelPropagation | 0.8848 | 0.5514 | 0.6281 | 0.4010 | 1.0217 | 0.7152 | 0.6709 | 0.6862
LabelSpreading | 0.8845 | 0.5311 | 0.6157 | 0.3954 | 1.6083 | 1.3164 | 1.2951 | 1.3110
KNeighborsClassifier | 0.7034 | 0.5683 | 0.7082 | 0.6526 | 0.1459 | 0.0377 | 0.0350 | 0.0346
LGBMClassifier | 0.5920 | 0.5301 | 0.5525 | 0.7635 | 1.4434 | 1.5696 | 1.5646 | 1.8766
LinearDiscriminantAnalysis | 0.5778 | 0.2008 | 0.1147 | 0.1032 | 0.0207 | 0.0113 | 0.0104 | 0.0103
GaussianNB | 0.5775 | 0.3126 | 0.1871 | 0.1583 | 0.0146 | 0.0112 | 0.0106 | 0.0108
LogisticRegression | 0.5093 | 0.2023 | 0.1378 | 0.1041 | 0.8430 | 0.7778 | 0.8007 | 0.7614
SVC | 0.4757 | 0.2498 | 0.3073 | 0.2490 | 1.0982 | 1.1509 | 1.1099 | 1.1175
LinearSVC | 0.4678 | 0.0917 | 0.0845 | 0.0725 | 1.4458 | 2.1846 | 0.4434 | 0.7941
NearestCentroid | 0.4491 | 0.2534 | 0.2421 | 0.2129 | 0.0107 | 0.0086 | 0.0086 | 0.0082
CalibratedClassifierCV | 0.4146 | 0.0812 | 0.0847 | 0.0698 | 5.6612 | 8.6005 | 2.0596 | 3.3967
SGDClassifier | 0.3895 | 0.1144 | 0.0899 | 0.0881 | 0.3062 | 0.1587 | 0.1249 | 0.1399
QuadraticDiscriminantAnalysis | 0.3193 | 0.4012 | 0.2680 | 0.2310 | 0.0259 | 0.0116 | 0.0116 | 0.0115
Perceptron | 0.3112 | 0.0869 | 0.0893 | 0.0788 | 0.1129 | 0.0651 | 0.0555 | 0.0595
PassiveAggressiveClassifier | 0.2612 | 0.0777 | 0.0835 | 0.0832 | 0.1347 | 0.0696 | 0.0623 | 0.0622
BernoulliNB | 0.2585 | 0.0503 | 0.0451 | 0.0459 | 0.0118 | 0.0093 | 0.0092 | 0.0090
RidgeClassifier | 0.1477 | 0.0481 | 0.0438 | 0.0619 | 0.0144 | 0.0122 | 0.0114 | 0.0116
RidgeClassifierCV | 0.1477 | 0.0481 | 0.0437 | 0.0615 | 0.0281 | 0.0212 | 0.0209 | 0.0204
AdaBoostClassifier | 0.1263 | 0.0865 | 0.0680 | 0.0714 | 0.4822 | 0.4750 | 0.4350 | 0.4289
DummyClassifier | 0.0256 | 0.0256 | 0.0256 | 0.0256 | 0.0078 | 0.0060 | 0.0064 | 0.0059
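A benchmark of this kind can be sketched as a loop over scikit-learn estimators, scoring balanced accuracy under stratified k-fold cross-validation and recording wall-clock time; this is a hedged illustration with synthetic data and only two of the classifiers from Table 9.

```python
import time
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in features (e.g., 4 PCs)
y = rng.integers(0, 5, size=500)         # stand-in class labels

models = {
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    start = time.perf_counter()
    score = cross_val_score(model, X, y, scoring="balanced_accuracy", cv=cv).mean()
    print(f"{name}: balanced accuracy {score:.4f}, "
          f"time {time.perf_counter() - start:.2f}s")
```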
Table 10. The average balanced accuracy scores from five iterations of the k-fold cross-validation technique of the ExtraTreesClassifier across different "n_estimators" values (for imputed and non-imputed data as well as for multiple dimensionality reduction methods).

n_estimators | 10 | 50 | 100 | 150 | 500
Imput | 0.9491 | 0.9537 | 0.9479 | 0.9513 | 0.9541
UMAP-2-10 | 0.8089 | 0.8267 | 0.8205 | 0.8189 | 0.8273
UMAP-2-30 | 0.7309 | 0.7454 | 0.7468 | 0.7555 | 0.7523
UMAP-2-123 | 0.5896 | 0.6174 | 0.6142 | 0.6144 | 0.6186
UMAP-3-10 | 0.8251 | 0.8224 | 0.8322 | 0.8218 | 0.8357
UMAP-3-30 | 0.7672 | 0.7886 | 0.8015 | 0.7911 | 0.7963
UMAP-3-123 | 0.6823 | 0.7089 | 0.7200 | 0.7076 | 0.7177
PCA-2 | 0.7474 | 0.7673 | 0.7701 | 0.7753 | 0.7693
PCA-3 | 0.8228 | 0.8285 | 0.8280 | 0.8354 | 0.8282
PCA-4 | 0.8522 | 0.8594 | 0.8630 | 0.8614 | 0.8667
t-SNE-2-10 | 0.8880 | 0.9048 | 0.8954 | 0.8947 | 0.8941
t-SNE-2-30 | 0.8820 | 0.8910 | 0.8934 | 0.8910 | 0.8996
t-SNE-2-123 | 0.8786 | 0.8725 | 0.8801 | 0.8805 | 0.8813
t-SNE-3-10 | 0.8961 | 0.8913 | 0.8861 | 0.9029 | 0.9005
t-SNE-3-30 | 0.8919 | 0.8954 | 0.8927 | 0.8870 | 0.8899
t-SNE-3-123 | 0.8753 | 0.8840 | 0.8777 | 0.8841 | 0.8830
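The "n_estimators" sweep behind Table 10 can be sketched as follows; this minimal version uses synthetic data and a single cross-validation run rather than the five averaged repetitions described in the caption.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))            # stand-in features
y = rng.integers(0, 5, size=500)         # stand-in class labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for n in [10, 50, 100, 150, 500]:        # ensemble sizes from Table 10
    clf = ExtraTreesClassifier(n_estimators=n, random_state=0)
    score = cross_val_score(clf, X, y, scoring="balanced_accuracy", cv=cv).mean()
    print(f"n_estimators={n}: {score:.4f}")
```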
Table 11. Confusion matrix of the 5 most frequent engine models (imputed data) and n_estimators = 50.

Index | 1 | 2 | 7 | 13 | 31
1 | 68 | 0 | 1 | 1 | 0
2 | 0 | 358 | 0 | 0 | 0
7 | 5 | 0 | 212 | 0 | 0
13 | 0 | 1 | 0 | 65 | 0
31 | 0 | 0 | 0 | 0 | 71
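A confusion matrix restricted to the five most frequent encodings (1, 2, 7, 13, 31) can be produced as sketched below; the data, split strategy, and sample sizes are illustrative assumptions rather than the study's setup.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
top5 = [1, 2, 7, 13, 31]                 # encodings of the 5 most frequent models
X = rng.normal(size=(600, 4))            # stand-in features
y = rng.choice(top5, size=600)           # stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
print(confusion_matrix(y_te, clf.predict(X_te), labels=top5))
```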