Article

Imputing Missing Data in One-Shot Devices Using Unsupervised Learning Approach

by Hon Yiu So 1,*,†, Man Ho Ling 2 and Narayanaswamy Balakrishnan 3
1 Department of Mathematics and Statistics, Oakland University, Rochester, MI 48309, USA
2 Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong, China
3 Department of Mathematics and Statistics, McMaster University, Hamilton, ON L8S 4K1, Canada
* Author to whom correspondence should be addressed.
† Current Address: 146 Library Drive, Rochester, MI 48309, USA.
Mathematics 2024, 12(18), 2884; https://doi.org/10.3390/math12182884
Submission received: 9 August 2024 / Revised: 1 September 2024 / Accepted: 3 September 2024 / Published: 15 September 2024
(This article belongs to the Special Issue Statistical Simulation and Computation: 3rd Edition)

Abstract:
One-shot devices are products that can only be used once. Typical one-shot devices include airbags, fire extinguishers, inflatable life vests, ammunition, and handheld flares. Most of them are life-saving products and should be highly reliable in an emergency, so controlling the quality of their production and predicting their reliability over time are critically important. To assess the reliability of such products, manufacturers usually test them under controlled conditions rather than actual user conditions. We may instead rely on public datasets that reflect their reliability in actual use, but these datasets often come with missing observations, for instance when covariate readings are lost through human error. Traditional missing-data-handling methods may not work well for one-shot device data, as each record contains only the survival status of the device. In this research, we propose Multiple Imputation with Unsupervised Learning (MIUL) to impute the missing data using Hierarchical Clustering, k-prototypes, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). Our simulation study shows that the MIUL algorithms have superior performance. We also illustrate the method using datasets from the Crash Report Sampling System (CRSS) of the National Highway Traffic Safety Administration (NHTSA).

1. Introduction

One-shot devices are products that can only be used once. For example, airbags deploy during a collision and have to be replaced because they cannot deploy again. EpiPens are auto-injectors capable of delivering a dose of epinephrine to patients experiencing life-threatening allergic reactions; they likewise have to be replaced after a single use. Since most one-shot devices are life-saving, they have to be extremely dependable in times of emergency.
As most products come with warranties and insurance, even small systematic errors in one-shot device production can lead to colossal life and financial losses; the recent bankruptcy of the Takata airbag company is an infamous example [1]. Controlling the quality of these products and predicting their reliability over time are therefore critically important. Accelerated Life Tests (ALTs) are popular procedures for assessing and analyzing the quality of one-shot devices, and they depend heavily on extrapolating the life model from the high stress levels used in the experiment to the user conditions. Any bias in the model estimates is amplified during this extrapolation.
Under parametric settings, the one-shot device data with N observations would have the likelihood function,
L(\theta; \mathbf{x}) = \prod_{i=1}^{N} \left( F(t_i; \theta, x_i) \right)^{1-\delta_i} \left( 1 - F(t_i; \theta, x_i) \right)^{\delta_i},
where \theta is the parameter of the lifetime distribution with cumulative distribution function F(\cdot), x_i is the vector of covariates of the ith subject, \mathbf{x} = \{x_1, x_2, \ldots, x_N\}, t_i is the observed time when the ith one-shot device is activated, and \delta_i is the indicator equal to 1 if the device functions correctly.
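To make the estimation concrete, the following minimal R sketch evaluates this log-likelihood under the additional assumption of exponential lifetimes with a log-linear hazard in the covariates (the lifetime model also used in the simulation study of Section 5); the function and variable names are illustrative only.

# Log-likelihood of one-shot device data, assuming exponential lifetimes with a
# log-linear hazard exp(X %*% theta); t = inspection times, delta = 1 if the
# device functions correctly, X = covariate (design) matrix.
loglik_oneshot <- function(theta, t, delta, X) {
  lambda <- exp(X %*% theta)          # hazard rate for each device
  Fi <- 1 - exp(-lambda * t)          # F(t_i; theta, x_i): probability of failure by t_i
  sum((1 - delta) * log(Fi) + delta * log(1 - Fi))
}
# Maximum likelihood estimate by numerical maximization:
# theta_hat <- optim(rep(0, ncol(X)), function(th) -loglik_oneshot(th, t, delta, X))$par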
Manufacturers often carry out these tests before products enter the market, and the devices are rarely tested under actual user conditions. To address this issue, we can refer to publicly available datasets that collect information from actual usage and retrospectively analyze the reliability of those products. For example, the Crash Report Sampling System (CRSS) of the National Highway Traffic Safety Administration (NHTSA) [2] has collected national crash data since 2016. It records details of car crashes, including the status of airbag deployment, and thus provides a good data source for evaluating different car safety systems under actual user conditions. This kind of retrospective study is vitally important as it can indicate potential problems in safety devices early and avoid disastrous recalls similar to the Takata airbag recall; see [1].
In most reliability testing, there are seldom any missing data, as the experiments are conducted under well-controlled conditions. However, datasets used in retrospective studies often have missing data. For example, in CRSS, the proportion of missing observations is substantial (at least 16.43% in 2017, 13.39% in 2018, 10.00% in 2019 and 11.78% in 2020). This is due to various difficulties in data collection, such as a lack of human resources, vehicle conditions or the severity of the accidents. Popular statistical methods for handling missing data, like multiple imputation, often require the missingness not to depend on the response [3,4,5], which may not be reasonable in a retrospective study, because some latent factors not recorded in the datasets may affect both the responses and the missing mechanism. One obvious latent factor is whether the car was parked in a covered space, which affects both the airbag lifetime and the overall car condition but is not reported in the CRSS. The existence of latent variables leads to heterogeneous data and produces biased imputation results.
Unsupervised learning (UL) techniques such as k-prototypes, DBSCAN, and Hierarchical Clustering can effectively discover hidden variables in various domains. In the medical field, refs. [6,7] utilize these methods to classify patients more precisely; they improve the categorization of primary breast cancer and heart failure with preserved ejection fraction by finding separate patient clusters with distinct clinical profiles and outcomes. In engineering, ref. [8] applies the algorithms to railway vibration data to extract useful features for defect detection. UL therefore seems helpful for categorizing a dataset into homogeneous subsets whose observations share similar latent factors.
To analyze one-shot device data with missing observations and latent factors, we propose using an unsupervised learning algorithm to form clusters within which the data can be considered homogeneous, so that standard statistical imputation techniques can then be applied. This paper is organized as follows: Section 2 introduces missing data in one-shot device analysis and inverse probability weighting (IPW). We review the traditional imputation methods in Section 3. We propose the novel multiple imputation with unsupervised learning (MIUL) in Section 4. We compare the proposed algorithm with some traditional missing-data-handling methods in Section 5 and illustrate its usefulness using the CRSS datasets in Section 6. We provide concluding remarks and future research directions in Section 7.

2. Missing Mechanisms and IPW

There are many reasons why missing data arise in one-shot device analysis. In particular, when we analyze data from consumers, covariates are often lost: customers may not recall certain variables or outcomes, or data entries may be deleted due to human error.

2.1. Missing Data Mechanisms

The missing data mechanism describes how the missingness appears in the dataset. Let d_i denote the missing-data indicator for the ith observation: d_i = 1 if all the covariates, the age of the device at the instance and the survival status are observed, and d_i = 0 otherwise. We denote by X^* the variables that are completely observed for all data entries, and X_i^* is the corresponding value for the ith observation. If \Pr(d_i = 0 \mid X_i) = \Pr(d_i = 0) is a constant, the missing mechanism is referred to as missing completely at random (MCAR); if \Pr(d_i = 0 \mid X_i) is not constant but is independent of the missing values, it is referred to as missing at random (MAR). Otherwise, we say the mechanism is missing not at random (MNAR).
If data are MCAR, analysis based on complete cases remains unbiased but loses statistical power [9,10,11]. If data are MAR, the analysis may yield biased results, but this can be overcome using appropriate statistical methods [12]. If data are MNAR, the probability that a value is missing depends on the missing value itself and cannot be fully explained by the remaining observed variables; in that case, the analysis tends to yield biased results if the missing data are not handled appropriately, and sensitivity analyses are usually recommended to examine the effect of different assumptions about the missing data. In most cases, the missing mechanism is not MCAR, and handling the missing data properly is essential for valid estimation results.

2.2. Inverse Probability Weighting

Dropping the observations with missing information (i.e., complete-case analysis) is generally not recommended, as it discards much information and may also introduce biases into the model estimation. Therefore, in the analysis of one-shot device data with competing risks and missing values, extra caution is required: one should assess the extent and types of missing data before the analysis, explore the potential mechanisms that contribute to the missingness, use appropriate missing-data strategies, and conduct sensitivity analyses to assess the robustness of the research findings.
The inverse probability weighting (IPW) method is a popular statistical approach to handling missing data; see [13]. IPW can correct the bias resulting from complete-case analysis and is also utilized to adjust for unequal missingness fractions. It addresses the issue by assigning each individual in the analysis a weight equal to the inverse of the corresponding probability of being a complete case. These weights rebalance the distribution of the observed data to resemble a distribution with no missing data, thus reducing the bias introduced by a complete-case analysis.
Usually, we denote the non-missing probability as p_i = p(x_i^*), where x^* denotes the covariates observable for all observations and x_i^* is the corresponding value for the ith entry. Under the IPW method, the log-likelihood function of the one-shot device data is
l_{IPW}(\theta; \mathbf{x}) = \sum_{i=1}^{N} \frac{d_i}{p_i} \left[ (1 - \delta_i) \ln\left( F(t_i; \theta, x_i) \right) + \delta_i \ln\left( 1 - F(t_i; \theta, x_i) \right) \right],
and the score function is
S_{\theta}(\theta) = \frac{\partial l_{IPW}(\theta; \mathbf{x})}{\partial \theta} = \sum_{i=1}^{N} \frac{d_i}{p_i} \left[ \frac{1 - \delta_i}{F(t_i; \theta, x_i)} - \frac{\delta_i}{1 - F(t_i; \theta, x_i)} \right] \frac{\partial F(t_i; \theta, x_i)}{\partial \theta}.
If we assume a parametric model for the non-missing probability, p_i = p(x_i^*; \alpha), the score function for estimating \alpha would be
S_{\alpha}(\alpha) = \sum_{i=1}^{N} \left[ \frac{d_i}{p(x_i^*; \alpha)} - \frac{1 - d_i}{1 - p(x_i^*; \alpha)} \right] \frac{\partial p(x_i^*; \alpha)}{\partial \alpha},
where \hat{\alpha} solves S_{\alpha}(\hat{\alpha}) = 0 and \hat{p}_i = p(x_i^*; \hat{\alpha}) is the corresponding estimate.
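As a rough illustration of how the IPW estimate could be computed, the sketch below (under the same exponential-lifetime assumption as above, with all names illustrative) models the non-missing probability by logistic regression on the fully observed covariates and weights each complete case by the inverse of its estimated probability.

# IPW estimation sketch: d = 1 for complete cases, Xstar = fully observed
# covariates x_i^*; each complete case is weighted by 1 / p_hat_i.
ipw_fit <- function(t, delta, X, d, Xstar) {
  p_hat <- fitted(glm(d ~ Xstar, family = binomial))   # estimated p(x_i^*; alpha_hat)
  cc <- d == 1                                          # keep complete cases only
  w  <- 1 / p_hat[cc]                                   # inverse-probability weights
  negloglik <- function(theta) {
    lambda <- exp(X[cc, , drop = FALSE] %*% theta)
    Fi <- 1 - exp(-lambda * t[cc])
    -sum(w * ((1 - delta[cc]) * log(Fi) + delta[cc] * log(1 - Fi)))
  }
  optim(rep(0, ncol(X)), negloglik, method = "BFGS")$par
}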

3. Literature Review on Imputation

Besides the IPW method, imputation strategies provide alternative ways to handle missing data. They are more intuitive, as they fill in the missing values and yield "complete datasets" that researchers can use directly without complicated mathematical adjustments. Below are a few popular methods for imputing missing data.

3.1. Single Imputations

3.1.1. Mean (or Mode) Imputation

Imputation is a common strategy for dealing with missing data. It fills in the blanks with appropriate values to create a "completed" dataset that can be analyzed using traditional statistical procedures. The most direct and intuitive strategy for continuous variables is to replace missing observations with the mean of the observed values, known as mean imputation; see [14]. Similarly, unobserved categorical and ordinal variables can be substituted by the mode of the observed ones (mode imputation). However, because this method does not take into account associations across variables [15,16], it produces biased and over-fitted results [17]. In Figure 1, we show how to perform a mean imputation. The variable Region represents the region of a car accident, the variable Age represents the age of the car involved, and the variable Success? represents whether the airbags of the car functioned properly.
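For concreteness, the small R sketch below performs mean/mode imputation on made-up values loosely patterned after Figure 1; the numbers themselves are illustrative.

# Mean imputation for a numeric variable and mode imputation for a categorical one.
age    <- c(2.0, 7.5, NA, 4.0, 11.0)                       # car ages, one missing
region <- factor(c("South", NA, "West", "South", "Midwest"))
age[is.na(age)] <- mean(age, na.rm = TRUE)                 # fill with the observed mean
mode_cat <- names(which.max(table(region)))                # most frequent observed category
region[is.na(region)] <- mode_cat                          # fill with the mode
data.frame(age, region)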

3.1.2. Expectation Maximization Imputations

The expectation maximization (EM) imputation is a single imputation technique based on the EM algorithm, and it has been studied extensively in the literature [5,18]. Refs. [19,20] also consider using EM for one-shot devices with masked failure causes. The EM algorithm is an iterative procedure of expectation steps (E-steps) and maximization steps (M-steps). In one iteration, the expected values of the missing responses or covariates are updated as the means conditional on the observed covariates, such as the components' manufacturer and current stress levels (E-step). The complete-data formulas are then applied to the dataset, with the missing data filled in from the E-step, to obtain updated estimates by maximizing the complete likelihood function (M-step). The E-step and M-step are performed iteratively until the imputed values converge. To illustrate how EM imputation works, we use the data example in Figure 1. We assume that Age follows the exponential distribution with rate \lambda and that the probability of success, S = 1, given the Age of the car is
P(S = 1) = \begin{cases} \beta, & Age \le 6, \\ 0.5\beta, & Age > 6, \end{cases} \qquad \text{for } 0 < \beta < 1.
The observed likelihood function is
L(\lambda, \beta) = 0.5\, \beta^3 (1 - \beta)^2 (1 - 0.5\beta)\, \lambda^7 \exp(-33.5\lambda)\, \beta \left( 1 - 0.5 \exp(-6\lambda) \right),
and the complete log-likelihood is
l_c(\lambda, \beta) = \left( 1 + I_{\{Age_5 > 6\}} \right) \ln 0.5 + 4 \ln \beta + 2 \ln(1 - \beta) + \ln(1 - 0.5\beta) + 8 \ln \lambda - (33.5 + Age_5)\lambda,
with relevant expected values given the current parameter estimates  λ ( t )  and  β ( t )  being
E\left[ I_{\{Age_5 > 6\}} \right] = \exp\left( -6\lambda^{(t)} \right) \quad \text{and} \quad E(Age_5) = 1/\lambda^{(t)}.
We then maximize Q^{(t)}(\lambda, \beta) = E\left[ l_c(\lambda, \beta) \mid \lambda^{(t)}, \beta^{(t)} \right] to obtain the next parameter estimates, \lambda^{(t+1)} and \beta^{(t+1)}. Figure 2 illustrates how the EM imputation can be implemented.
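A minimal R sketch of how these E- and M-steps could be iterated for the toy example is given below; it uses the expectations displayed above for the E-step and base R's optim for the M-step. The function name, starting values and tolerance are illustrative.

# EM iteration for the toy example: Age_5 is the missing car age.
em_toy <- function(lambda0 = 0.1, beta0 = 0.5, max_iter = 100, tol = 1e-8) {
  lambda <- lambda0; beta <- beta0
  for (iter in seq_len(max_iter)) {
    E_ind  <- exp(-6 * lambda)     # E-step: E[I(Age_5 > 6)] at the current lambda
    E_age5 <- 1 / lambda           # E-step: E[Age_5] at the current lambda
    # M-step: maximize the expected complete log-likelihood Q(lambda, beta)
    Q <- function(par) {
      l <- par[1]; b <- par[2]
      (1 + E_ind) * log(0.5) + 4 * log(b) + 2 * log(1 - b) +
        log(1 - 0.5 * b) + 8 * log(l) - (33.5 + E_age5) * l
    }
    new <- optim(c(lambda, beta), function(p) -Q(p), method = "L-BFGS-B",
                 lower = c(1e-6, 1e-6), upper = c(Inf, 1 - 1e-6))$par
    if (max(abs(new - c(lambda, beta))) < tol) { lambda <- new[1]; beta <- new[2]; break }
    lambda <- new[1]; beta <- new[2]
  }
  c(lambda = lambda, beta = beta)
}
em_toy()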

3.2. Multiple Imputations

Multiple imputation (MI) has been considered a gold standard for accounting for missing data, and more specifically the fully conditional specification (FCS) method, also labeled the multiple imputation by chained equations (MICE) algorithm [21]. Multiple imputation by FCS imputes multivariate missing data on a variable-by-variable basis, which is particularly useful for studies with large datasets and complex data structures; see [22]. It requires the specification of an imputation model for each incomplete variable and iteratively creates imputations per variable [23]. Compared to a single imputation, MI replaces each missing value with multiple plausible values, allowing the uncertainty about the missing data to be taken into account. MI consists of two stages. The first stage involves creating multiple imputed datasets. In practice, we can first apply a Box–Cox transformation to the numerical covariates so that they approximately follow a multivariate normal distribution. The MI method imputes the missing responses or covariates stochastically based on the distribution of the observed data, and the process is repeated five to ten times to create five to ten imputed datasets. Once again, we use the data example in Figure 1 to show how multiple imputation works. We assume that the variable Age has a linear relationship with the variables Region and Success and that the log-odds of Success are linearly related to the variables Region and Age. We first impute the missing data with the average of the observed values. Then, we form a regression model for the variable under imputation and predict the missing value with the model. If the variable is numerical, we often use the predictive mean matching method, which imputes the missing entries with the observed values closest to the predictions. For binary variables, we can perform a logistic regression and draw Bernoulli random variables with the predicted probabilities to impute the missing values. Figure 3 shows how multiple imputation works step by step. For brevity, we do not consider the uncertainty of the model coefficients in this example.
In the second stage, we analyze each of the five or ten imputed datasets separately using standard statistical models and then combine the results from the five or ten analyses to report the conclusion; see [24]. One popular choice is to use Rubin's Rules [25,26] to combine the estimates \tilde{\beta}^{(m)}, m = 1, \ldots, M, from the multiple imputed datasets and obtain the pooled estimate and variance,
\hat{\beta} = \frac{1}{M} \sum_{m=1}^{M} \tilde{\beta}^{(m)} \quad \text{and} \quad V_{Total} = V_W + V_B + V_B / M,
respectively, where
V_W = \frac{1}{M} \sum_{m=1}^{M} \left[ SE\left( \tilde{\beta}^{(m)} \right) \right]^2 \quad \text{and} \quad V_B = \frac{1}{M-1} \sum_{m=1}^{M} \left( \tilde{\beta}^{(m)} - \hat{\beta} \right)^2,
and SE(\tilde{\beta}^{(m)}) is the standard error of the estimate from the mth imputed dataset.
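The pooling step can be written in a few lines of R, as sketched below with hypothetical estimates and standard errors from M = 5 imputed datasets; the pool() function in the mice package implements the same rules.

# Rubin's Rules: pool point estimates and standard errors from M imputations.
pool_rubin <- function(est, se) {
  M <- length(est)
  beta_hat <- mean(est)                         # pooled estimate
  V_W <- mean(se^2)                             # within-imputation variance
  V_B <- sum((est - beta_hat)^2) / (M - 1)      # between-imputation variance
  V_total <- V_W + V_B + V_B / M                # total variance
  c(estimate = beta_hat, se = sqrt(V_total))
}
pool_rubin(est = c(0.41, 0.38, 0.44, 0.40, 0.43), se = rep(0.10, 5))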

4. Multiple Imputation with Unsupervised Learning

This research proposes new imputation methods for handling missing data in one-shot devices that hybridize multiple imputation with unsupervised learning (MIUL). Traditional imputation assumes the data come from a homogeneous group, which is not always valid in actual ALTs or observational studies on one-shot devices. The components of devices/systems may be coupled in the manufacturing process or assembly, so the components within a device may have interrelationships, leading to data with latent heterogeneity and dependence, which can be described by frailty models [27,28].
It is well known that latent variables affect the model coefficients when they are correlated with the observed variables. Since the latent variables are not observed, an unsupervised ML algorithm is a powerful way to discover hidden clusters of the devices based on the observed variables [29]. We cluster the one-shot devices into groups with different characteristics using machine-learning techniques. Within each group, the latent factors should be similar across devices. Therefore, the unknown latent factors are "controlled", and the regression model parameters can be correctly adjusted. Several conventional unsupervised clustering techniques can be applied to the observed covariates to discover hidden structures. For example, the k-means clustering technique is a popular method for clustering analysis in data mining; it is an unsupervised ML method partitioning the dataset into k clusters according to the distance of each observation to the cluster means. Once we cluster the data, we can impute the data using the following Algorithm 1.
Algorithm 1: Multiple imputations with unsupervised learning (MIUL)
This algorithm describes our proposed method: multiple imputation with unsupervised learning (MIUL). Ref. [30] also discusses similar ideas, but it mainly focuses on imputing datasets for clustering, while this research focuses on clustering the datasets for imputation. Several popular unsupervised learning algorithms can be used in the clustering step, and they are described below.
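As a rough R sketch of this workflow (a simplified reading of Algorithm 1 under assumptions, not the authors' exact implementation), one can cluster the observations on Gower dissimilarities, impute within each cluster with the mice package, and then recombine the completed clusters:

# MIUL sketch: cluster first, then impute within each cluster and recombine.
library(cluster)   # daisy: Gower dissimilarities for mixed-type, partially observed data
library(mice)      # multiple imputation within each cluster

miul_impute <- function(data, n_clusters = 5, m = 5) {
  d  <- daisy(data, metric = "gower")
  cl <- cutree(hclust(as.dist(d), method = "average"), k = n_clusters)
  # impute each cluster separately, keeping all m completed versions
  fits <- lapply(split(data, cl), function(sub) mice(sub, m = m, printFlag = FALSE))
  # the j-th imputed dataset stacks the j-th completion of every cluster
  lapply(seq_len(m), function(j) do.call(rbind, lapply(fits, complete, j)))
}
# The m completed datasets can then be analyzed separately and pooled by Rubin's Rules.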

4.1. Hierarchical Clustering

Hierarchical Clustering is a technique that creates an ordered sequence of data clusters using a specific dissimilarity measure [31]. The concept of grouping hierarchically was first introduced by [32]. The term "Hierarchical Clustering" was first coined by [33], who discussed the approaches and the choice of distance measure, and [34] gives further details on how Hierarchical Clustering can be carried out. It is also an unsupervised ML method that builds a hierarchy of clusters. In the agglomerative approach, each observation is first treated as a distinct cluster; these single-observation clusters are then merged gradually according to the smallest dissimilarity. Given a distance measure dist(X_i, X_{i'}) between two data points X_i and X_{i'}, there are three common dissimilarity measures between two clusters [35]. Suppose G and H are two groups formed during the clustering process:
Single linkage (SL) dissimilarity,
d_{SL}(G, H) = \min_{X_i \in G,\, X_{i'} \in H} dist(X_i, X_{i'}),
complete linkage (CL) dissimilarity,
d_{CL}(G, H) = \max_{X_i \in G,\, X_{i'} \in H} dist(X_i, X_{i'}),
and group average (GA) dissimilarity,
d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{X_i \in G} \sum_{X_{i'} \in H} dist(X_i, X_{i'}),
where N_G and N_H are the numbers of data points in Groups G and H, respectively. Similarly, in the divisive approach, all the observations are first treated as one cluster, which is split repeatedly according to dissimilarity until each observation forms a distinct cluster. Both methods create multi-tiered hierarchies as the procedure progresses, spanning the extremes of each observation existing as its own cluster and all observations combining into a single, all-encompassing cluster. A dendrogram is a visual representation of the cluster-merging process. The process can stop at k clusters if the average dissimilarity between a pair of clusters changes substantially. Many irrelevant variables may obscure clusters that exist only within a small subset of variables: observations that seem close on the relevant variables can be pushed far apart in a high-dimensional space by the curse of dimensionality, which challenges the effectiveness of dissimilarity measures and the accuracy of the resulting clusters.
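The small R sketch below contrasts the three linkage choices on made-up numeric data using base R's hclust; the data and the choice of k = 3 are illustrative.

# Agglomerative clustering under the three dissimilarity (linkage) measures above.
set.seed(1)
X <- matrix(rnorm(40), ncol = 2)               # 20 toy observations in two dimensions
d <- dist(X)                                   # pairwise Euclidean distances
hc_single   <- hclust(d, method = "single")    # single linkage, d_SL
hc_complete <- hclust(d, method = "complete")  # complete linkage, d_CL
hc_average  <- hclust(d, method = "average")   # group average, d_GA
clusters <- cutree(hc_average, k = 3)          # cut the dendrogram at k = 3 clusters
table(clusters)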

4.2. K-Prototype

K-prototype was initially proposed by [36]. The term “prototypes” refers to the centroids or representative points of clusters in a dataset. It extends the well-known k-means clustering algorithm to the categorical variables or attributes with mixed numerical and categorical values. The K-means clustering algorithm first normalizes each variable of the ith observation,  X i * = ( x i 1 * , , x i j * , , x i J * ) , by
x_{ij} = \frac{x_{ij}^* - \min(x_{1j}^*, \ldots, x_{Ij}^*)}{\max(x_{1j}^*, \ldots, x_{Ij}^*) - \min(x_{1j}^*, \ldots, x_{Ij}^*)},
where X_i = (x_{i1}, \ldots, x_{ij}, \ldots, x_{iJ}). It then partitions the numerical observations \mathbf{X} = \{X_1, X_2, \ldots, X_n\} into k clusters by minimizing the within-groups sum of squared errors (WGSS),
P(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{i,l}\, dist(X_i, Q_l),
where w_{i,l} \in \{0, 1\}, \sum_{l=1}^{k} w_{i,l} = 1, 1 \le i \le n, is the indicator of the ith observation being in the lth cluster, Q_l \in Q = \{Q_1, Q_2, \ldots, Q_k\} is the set of observations in the same cluster, and dist(X_i, Q_l) is the squared Euclidean distance between X_i and the cluster center of Q_l,
dist_1(X_i, Q_l) = \sum_{j=1}^{p} (x_{i,j} - q_{l,j})^2,
where x_{i,j} is the value of the jth variable of the ith observation and there are p numerical variables. We adopt the elbow method to determine k, choosing the smallest k for which adding another cluster reduces the minimized WGSS by less than 10%.
However, datasets often contain both numerical and categorical variables. Suppose the first p variables are numerical and the remaining m − p variables are categorical in a dataset. The k-prototype algorithm replaces dist(X_i, Q_l) with the following dissimilarity measure,
dist_2(X_i, Q_l) = \sum_{j=1}^{p} (x_{i,j} - q_{l,j})^2 + \gamma \sum_{j=p+1}^{m} \delta(x_{i,j}, q_{l,j}),
where
q_{l,j} = \begin{cases} \frac{1}{N_l} \sum_{i \in Q_l} x_{i,j}, & \text{for } 1 \le j \le p, \\ c_{l,j}, & \text{for } p+1 \le j \le m, \end{cases}
c_{l,j} is the most frequent category of the jth variable in the cluster Q_l, and \delta(x_{i,j}, q_{l,j}) = 1 if x_{i,j} \ne q_{l,j} and zero otherwise. The weight \gamma is chosen to balance the effects of the numerical and categorical attributes. Details on how to select \gamma can be found in [37]. Ref. [38] reduces the misclassification of data points near the boundary region by considering the distribution centroids of the categorical variables in a cluster, and [39] further improves it by proposing a new dissimilarity measure between data objects and cluster centers.
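A minimal R sketch of k-prototypes clustering with an elbow-style choice of k is given below, using the clustMixType package (also used in Section 5); the toy mixed-type data are made up, and the 10% threshold follows the rule described above.

# k-prototypes on mixed numeric/categorical data with an elbow rule for k.
library(clustMixType)
set.seed(1)
dat <- data.frame(age    = rexp(200, rate = 0.1),
                  region = factor(sample(c("NE", "MW", "S", "W"), 200, replace = TRUE)),
                  make   = factor(sample(c("A", "B", "C"), 200, replace = TRUE)))
wgss <- sapply(2:8, function(k) kproto(dat, k = k, verbose = FALSE)$tot.withinss)
rel_drop <- -diff(wgss) / head(wgss, -1)      # relative WGSS reduction from adding a cluster
k_opt <- if (any(rel_drop < 0.10)) (2:7)[which(rel_drop < 0.10)[1]] else 8
fit <- kproto(dat, k = k_opt, verbose = FALSE)
table(fit$cluster)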

4.3. Density-Based Spatial Clustering of Applications with Noise

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) was originally proposed by [40]. The algorithm is designed to discover clusters of arbitrary shape, and it requires minimal domain knowledge to determine the input parameters under very general assumptions [41]. It is efficient for large databases, making it a practical choice for various applications; its effectiveness and efficiency were evaluated in [40] using synthetic and real data.
The concept of density-based clusters is central to the operation of DBSCAN. The Eps-neighborhood of a point p in the dataset D is
N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \},
where dist(p, q) is a distance measure between points p and q. If p \in N_{Eps}(q) and the neighborhood size |N_{Eps}(q)| is greater than a preset minimum number of points, MinPts, then p is directly density-reachable from q. We also define p as density-reachable from q if there is a chain of points p_1 = p, p_2, \ldots, p_n = q that are consecutively directly density-reachable. Points p and q are density-connected if they are both density-reachable from a common point o. A cluster is then defined as a set of points that are density-reachable from each other (maximality) and density-connected (connectivity). The points not belonging to any cluster are considered noise. Ref. [42] addresses the issue of detecting clusters in data of varying densities by assigning the noise points to the closest eligible cluster. Ref. [43] revisits the DBSCAN algorithm and discusses how to obtain appropriate parameters for DBSCAN. Ref. [44] extends the algorithm to produce DBSCAN clusters in a nested, hierarchical way, and [45] implements the hierarchical DBSCAN in Python, which gives the best DBSCAN parameters.
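The sketch below runs DBSCAN in R on mixed-type data by combining Gower's distance (Section 4.4) with the fpc implementation used in Section 5; the toy data, the Eps heuristic and the MinPts value are illustrative.

# DBSCAN on a Gower dissimilarity matrix (method = "dist" treats the input as distances).
library(cluster)   # daisy
library(fpc)       # dbscan
set.seed(1)
dat <- data.frame(age    = rexp(300, rate = 0.1),
                  region = factor(sample(c("NE", "MW", "S", "W"), 300, replace = TRUE)))
d <- as.matrix(daisy(dat, metric = "gower"))
MinPts <- 10                                             # minimum points per cluster (illustrative)
kdist  <- apply(d, 1, function(r) sort(r)[MinPts + 1])   # distance to the MinPts-th neighbour
eps    <- quantile(kdist, 0.90)                          # knee-style heuristic for Eps
fit <- fpc::dbscan(d, eps = eps, MinPts = MinPts, method = "dist")
table(fit$cluster)                                       # cluster 0 holds the noise points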

4.4. Gower’s Distance

Most dissimilarity measures for numerical variables are based on the Euclidean distance, the square root of (6). If discrete variables exist, the k-prototype algorithm adopts the dissimilarity measure stated in (7). However, both of them require the variables to be completely observed. We may use Gower's distance, proposed by [46], to handle partially observed data entries. It provides another dissimilarity measure between the ith and kth observations,
dist_G(i, k) = \frac{\sum_{j=1}^{m} w_j\, d_{ik}^{(j)}\, dist_{ik}^{(j)}}{\sum_{j=1}^{m} w_j\, d_{ik}^{(j)}},
where d_{ik}^{(j)} is the indicator of whether the jth variable is observed for both the ith and kth data entries, and dist_{ik}^{(j)} is the distance measure of the jth variable between the ith and kth observations, x_i^{(j)} and x_k^{(j)}. If the jth variable is binary or nominal, dist_{ik}^{(j)} = 1 if x_i^{(j)} \ne x_k^{(j)} and 0 otherwise. If the jth variable is numerical or ordinal, dist_{ik}^{(j)} is the absolute difference standardized by the range, |x_i^{(j)} - x_k^{(j)}| / range(x^{(j)}). The weight of the jth variable, w_j, indicating the importance of the variable, is usually set to one, based on the Adansonian principle of equal weighting; see [47].
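A tiny worked sketch of Gower's distance between two hypothetical records is given below, showing how a missing entry simply drops out of both the numerator and the denominator (equal weights w_j = 1 throughout).

# Gower's distance between two records with mixed types and a missing entry.
gower_pair <- function(xi, xk, ranges) {
  obs <- mapply(function(a, b) !is.na(a) && !is.na(b), xi, xk)   # indicator d_ik^(j)
  per <- mapply(function(a, b, r) {
    if (is.numeric(a)) abs(a - b) / r          # numeric/ordinal: range-standardized difference
    else as.numeric(a != b)                    # binary/nominal: mismatch indicator
  }, xi, xk, ranges)
  sum(per[obs], na.rm = TRUE) / sum(obs)
}
gower_pair(xi = list(Age = 4,  Region = "South"),
           xk = list(Age = NA, Region = "West"),
           ranges = list(10, NA))
# Only Region is compared (Age is missing for the second record), so the distance is 1.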

5. Simulation Study

5.1. Simulation Settings

We develop a simulation study to compare various methods when handling missing data in one-shot devices. Suppose we are monitoring the quality of airbags in the United States. For simplicity, we assume that the airbags have exponential lifetimes with the hazard rate,  λ ( x ) ,
f(t_i; x_i) = \lambda(x_i) \exp\left( -\lambda(x_i)\, t_i \right), \quad \text{for } t_i > 0,
where x_i denotes the variables associated with airbag quality: the car make (A, B, and C), the car registration region (Northeast, Midwest, South, and West) and the parking location (indoor or outdoor). To ensure that the hazard rate is positive, we adopt a log-linear function,
\lambda_{actual}(x_i) = \exp\left( \beta_0 + \beta_{make} + \beta_{region} + \beta_{park} \right),
where \beta_0 is the baseline log-hazard for all airbags, equal to −6, −5 and −4, representing high, medium and low reliability (Relb.) levels of the devices (reliabilities of around 93%, 82% and 62%, respectively). \beta_{makeB} = 0.4 and \beta_{makeC} = 0.8 correspond to the log-hazard rates of airbags from car makes B and C relative to A, and (\beta_{regionMW}, \beta_{regionS}, \beta_{regionW}) = (0.6, 0.4, 0.8) represents the log-hazard rates of cars from the Midwest, South and West relative to those from the Northeast, accordingly.
The effect of outdoor parking is assumed to be  β p a r k o u t = 0.6  relative to parking inside. Parking location is quite possibly related to various factors, namely, car owner’s location (urban or rural), car body type (sedan, Sport Utility Vehicles and others), whether the car is in the West region, and whether the driver is an alcohol drinker. Here, we assume the probability of outdoor parking is linearly related to the factors mentioned. Parking location is unlikely to be recorded, and we consider it a hidden variable. Therefore, the regression equation would be limited to
\lambda_{model}(x_i) = \exp\left( \beta_0 + \beta_{make} + \beta_{region} \right),
which resembles the model misspecification in actual modeling.
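For concreteness, the R sketch below (not the authors' exact code) generates data from this setup with \beta_0 = -5; the outdoor-parking probability and the car-age distribution are simplified here, and the full specifications are given in Appendix A.

# Simulating one-shot device (airbag) data from the log-linear exponential model.
set.seed(1)
n      <- 2000
make   <- sample(c("A", "B", "C"), n, replace = TRUE)
region <- sample(c("NE", "MW", "S", "W"), n, replace = TRUE)
park   <- rbinom(n, 1, 0.4)                    # hidden outdoor-parking indicator (simplified)
b_make   <- c(A = 0, B = 0.4, C = 0.8)
b_region <- c(NE = 0, MW = 0.6, S = 0.4, W = 0.8)
lambda   <- exp(-5 + b_make[make] + b_region[region] + 0.6 * park)
lifetime <- rexp(n, rate = lambda)             # latent airbag lifetime
car_age  <- rexp(n, rate = 1 / 10)             # age at the accident (simplified distribution)
delta    <- as.numeric(lifetime > car_age)     # 1 = airbag still functions at the accident
head(data.frame(make, region, car_age, delta))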
Compared to other reliability experiments, this is an observational study, and some data are inevitably lost during data collection. We simulate two different types of missing indicators. One is unrelated to the one-shot device response and to any hidden variable, corresponding to the MAR mechanism. We also generate another missing indicator related to the hidden variable, parking location, representing the MNAR scenario. IPW uses logistic regression with all the fully observed variables to model the probability of non-missingness. MI and all the MIUL methods use all the variables in the datasets for imputation. Details of the parameter settings can be found in Appendix A.
In the simulation, we consider scenarios with 1000, 2000, and 4000 cars examined, representing small, medium, and large datasets. We also consider three missingness levels: around 10%, 15%, and 25%. The simulation study aims to analyze the car make coefficients to see which make has potential problems.

5.2. The Competitors’ Details

5.2.1. Traditional Methods

In the simulation, we compare six missing-data-handling methods. The first three are mean imputation (Mean.I), IPW and MI, representing traditional and popular approaches.
In Mean.I, we impute the missing numerical observations with the average of the observed values. If the variables are categorical, it would be intuitive to use the mode (the most frequently observed category). However, this always imputes the survival status as "success", since the one-shot devices are highly reliable. To obtain a more sensible imputation, we impute categorical variables by linear discriminant analysis, in which linear equations based on the other variables are formed to discriminate the categorical responses.
For IPW, we use all the fully observed variables to perform a logistic regression on the non-missing indicator (one minus the missing indicator of the missing variable). We then predict the probabilities of non-missingness, p_i, and weight the score function by 1/p_i when we estimate the parameters of the one-shot device model.
For MI, we use the default settings of the mice() function from the mice package [48] in R [49]. By default, the program creates linear equations for imputing missing data. Numerical variables are imputed by the predictive mean matching method (which takes the observed values closest to the missing-value prediction from a linear model). Categorical values are imputed by predictions from logistic regression, multinomial regression and proportional-odds regression for binary, nominal and ordinal variables, respectively. We then estimate the parameters \tilde{\beta}_{make}^{(m)} for the mth imputed dataset and use Rubin's Rules [25,26], mentioned in Section 3.2, to obtain the pooled estimate and variance.

5.2.2. Our Proposed Methods

The other three methods are based on our proposed MIUL approach with the unsupervised learning algorithms k-prototypes (K.MI), DBSCAN with Gower's distance (DB.MI) and Hierarchical Clustering with Gower's distance (HC.MI). Before clustering, we calculate the distances between data points with the daisy function in the cluster package [50].
For k-prototype, we use the kproto function from the clustMixType package by [51]. We first group the data with the number of clusters  k = 3  and then keep increasing the cluster numbers until the reduction of the within-cluster sum of squares is less than 10%.
For DBSCAN, we modify the suggestion by [52] and select Eps by considering the distance from each point to its third-nearest neighbor, and we define the minimum number of data points for a cluster, MinPts, as 50. We then use the dbscan function from the fpc package by [53] to perform DBSCAN clustering.
For Hierarchical Clustering, we use the hclust function from the stats package by [49], and we cut the trees to form five clusters.
After clustering, we use the MI methods mentioned above to impute the data within each cluster. We then combine the clusters into imputed datasets and follow Rubin's Rules again to obtain the final estimates.

5.3. Simulation Results

To evaluate the traditional methods and our proposed methods, we compare the biases and the mean squared error (MSE) of the car make coefficient  β m a k e , which are defined as
Bias = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\beta}_{make,i} - \beta_{make} \right), \qquad MSE = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\beta}_{make,i} - \beta_{make} \right)^2,
for the simulation size n, which is 1000 in our simulation study. We have six scenarios focusing on different missing data cases.
Scenario 1 focuses on the response variable, the survival statuses of the one-shot devices, being missing at random. Table A2, Table A3, Table A4 and Table A5 show the bias and mean squared error of \hat{\beta}_{makeB} and \hat{\beta}_{makeC} when the response variable is missing at random. Although mean imputation generally has lower biases when the one-shot devices have lower reliability, our proposed methods become more accurate as the reliability increases and failure responses become rarer. The reason is quite apparent: the majority of the devices are successful, and mean imputation imputes the missing response as a success, the mode of the response. As a result, the imputed responses are biased in favor of the success case, which distorts the estimates when the few observed failures carry most of the information. When the reliability is high, the multiple imputations with unsupervised learning (MIUL) are generally better. When we consider the mean squared errors of \hat{\beta}_{makeB} and \hat{\beta}_{makeC}, the traditional statistical methods do not have any advantage: DB.MI performs better when the product reliability is low, while K.MI works better when the product reliability is high.
Scenario 2 considers the survival statuses of the one-shot devices being MNAR due to hidden variables. The simulation results are presented in Table A6, Table A7, Table A8 and Table A9. The biases show a pattern similar to the previous scenario: mean imputation has smaller biases with low product reliability, while the MIUL algorithms gain an advantage as product reliability increases. For MSE, the MIUL algorithms generally perform better and, in addition, K.MI has superior performance when product reliability is high.
Scenario 3 considers the age of the one-shot devices being missing at random. The simulation results are presented in Table A10, Table A11, Table A12 and Table A13. Similar to Scenarios 1 and 2, Mean.I has the most negligible bias when product reliability is low. However, as the reliability becomes high, DB.MI's performance surpasses the other methods, with HC.MI coming next. When we look at the MSE, DB.MI has the smallest and HC.MI the second smallest. This indicates that DB.MI and HC.MI outperform K.MI when the missing variable is numeric instead of categorical.
Scenario 4 considers the age of the one-shot devices being missing not at random due to the hidden variable; Table A14, Table A15, Table A16 and Table A17 summarize the results. The conclusion is similar to Scenario 3: Mean.I has a low bias when the reliability is low, DB.MI works best in terms of bias and MSE when the reliability is high, and HC.MI is the second best. Occasionally, MI has a smaller MSE than the others.
Scenario 5 represents the case when the covariate Region is MAR, and the simulation results are shown in Table A18, Table A19, Table A20 and Table A21. Mean imputation usually has the smallest biases when the product reliability is low. However, our proposed MIUL methods work relatively well in most cases, and HC.MI is slightly better than the other MIUL methods.
Scenario 6 represents the case when the covariate Region is missing not at random, and we present the results in Table A22, Table A23, Table A24 and Table A25. Again, Mean.I has the most negligible bias when the reliability of the one-shot devices is low, while the MIUL methods, especially HC.MI, have better performance in bias for higher-reliability products. When we look at the MSE, DB.MI seems the best when the reliability is low to medium, while MI performs reasonably well when the reliability is low.
The simulation shows that the MIUL algorithms consistently outperform the traditional approaches, namely mean imputation, IPW and MI. This tendency is more obvious when the products have medium or high reliability, because high-reliability products produce fewer failure cases; the traditional methods then tend to weight the success cases more heavily and create biases, as they cannot observe the latent variables in the dataset.
With MIUL, the clustering of the observations attempts to "discover" the latent variables and gives more flexible equations for imputing the missing variable. The target parameters \beta_{makeB} and \beta_{makeC} are estimated accurately when different variables are missing under various missing mechanisms. Generally speaking, when the survival statuses of the one-shot devices are missing, K.MI is the best; when the ages of the one-shot devices are missing, both DB.MI and HC.MI work well; when a covariate is missing, HC.MI works well in most cases. This gives HC.MI the best overall performance among all the MIUL algorithms.
To summarize, the MIUL algorithms outperform the traditional missing-data-handling methods, especially when the product reliability is medium to high, which is the case for most one-shot devices. The missing mechanism does not impact their relative performance much, suggesting that MIUL should be preferred most of the time.

6. Applications with the CRSS Data

6.1. CRSS Datasets

In this section, we illustrate the application of the MIUL algorithm on the Crash Report Sampling System (CRSS) datasets for 2016–2020. The National Highway Traffic Safety Administration (NHTSA) developed and implemented CRSS to help reduce motor vehicle crashes in the United States [54]. It is an annual survey designed independently of other NHTSA surveys, and it is a valuable instrument for understanding and analyzing crash data, yielding vital insights that can be used to enhance road safety. For simplicity, we ignore the complex sample design and treat the observations as independent in this illustration.

6.2. Defining the Airbag Success in CRSS Data

In CRSS, the airbag variable indicates whether the airbag deployed during the accident. It does not explicitly indicate whether the airbags functioned correctly. Here, we make a few modifications to the variable. First, we define the situations in which the airbags should be deployed. It is reasonable to assume that an airbag should be deployed when the driver or passengers are critically injured; after all, airbags are designed to reduce the severity of injury. Since the airbag sensors are at the front, and given the size of a car, the airbag should also be deployed if those areas are impacted. We further assume the airbag should be deployed if the car has to be towed after the accident, as the collision is then severe. We present the details of how such a situation is exactly defined in Appendix B. We can then define the airbags as functioning correctly during the accident if they deploy when they should or do not deploy when they should not. Figure 4 presents the airbag success rate by manufacturing year for different accident years. The success rates are around 95%, which is reasonable, and each manufacturing year's airbag success rates are consistent across the different accident years. Thus, we can conclude that this definition of airbag success looks reasonable.

6.3. Data Analysis

To simplify the modeling process, we use the same model (10) as in the simulation study, and we limit our target to comparing airbag reliability across the origins of the car makes, namely, America, Asia and Europe. Throughout the accident years 2017–2020, the missing rates are 16.43%, 13.39%, 10.00% and 11.78%, respectively, close to the medium and low missing levels in the simulation study. Excluding the observations with missing variables, the success rate of the airbags is about 95%, as shown in Figure 4, confirming that airbags are highly reliable products. The CRSS generally publishes around 95,000 observations each year. Due to technical and resource limitations, we usually work with datasets smaller than the CRSS. In our illustration, we sample only 1000 (small), 2000 (medium) and 4000 (large) observations from the CRSS dataset, repeat this process 1000 times, and record the estimates of the log-hazard rate of airbags in Asian and European cars relative to American cars. There are two reasons for the sub-sampling. First, when hidden variables that we cannot observe are present, the biases tend to increase with the sample size, as observed in the simulation study above; we therefore sample observations from the large datasets and check whether the estimate averages vary across sample sizes. Second, it allows a more straightforward confidence interval calculation using the bootstrap method while requiring fewer computational resources, and it resembles standard airbag tests of particular makes and models, in which the sample sizes rarely exceed 4000. We report the means, standard errors (SEs) and the 95% confidence intervals (lower limit, CI.L, and upper limit, CI.U) in Table 1 and Table 2.
Table 1 presents the estimates of \beta_{makeAsian} under the various imputation methods. The estimates are consistent across sample sizes, indicating that a sample size of 2000 to 4000 should be enough for the parameter estimation. All the \hat{\beta}_{makeAsian} are most likely around zero. Therefore, we can safely claim that the airbag reliability of Asian brands is very similar to that of American ones.
Next, we present the estimates of \beta_{makeEuropean} in Table 2. The estimates increase with the sample size for the traditional methods, while the estimates from the proposed MIUL methods remain relatively stable. This suggests that the MIUL methods are more robust to the sample size. The CIs of \hat{\beta}_{makeEuropean} do not cover zero under K.MI when the sample size is large for accident years 2019 and 2020, with the corresponding Wald statistics 0.503/0.231 = 2.175 (p-value = 0.030) and 0.529/0.253 = 2.095 (p-value = 0.036). Since K.MI usually has the most negligible bias and MSE, especially when the sample size and product reliability are high with medium to low missing levels, the quality of the airbags of European manufacturers is likely significantly worse than that of the American ones, requiring further investigation.

7. Conclusions and Discussions

Retrospective studies are important in reliability analysis as they measure the actual lifetimes of one-shot devices under actual user conditions. They can detect whether the one-shot devices are designed and manufactured correctly and give early warnings if there are systematic defects in the devices. This is crucial as most safety and life-saving products are one-shot devices, and early detection of potential flaws can save many lives and prevent casualties. In this study, we have proposed a retrospective way to analyze the reliability of one-shot devices through publicly available data.
Different from the usual reliability experiment, those datasets are not collected in a controlled environment, and therefore missing observations are inevitable. The traditional statistical methods may not handle the missing data properly since the observations may not be missing at random. With hidden variables, like parking locations and maintenance habits, those methods may not work well as the model cannot be specified correctly.
When machine learning algorithms are applied to missing data, unsupervised learning may be useful for imputing the missing observations, but it is not obvious how the imputation should be carried out. Our study proposed an innovative way to impute missing data using unsupervised learning, and it works reasonably well when hidden variables are present in the dataset under a missing-not-at-random assumption. With an accurate imputation strategy, retrospective studies on one-shot devices become possible.
Using a simulation study, we have shown that the MIUL methods perform better than the traditional methods in the context of one-shot device reliability evaluation. We illustrate the methods using the CRSS datasets provided by NHTSA. Under our definition of airbag success, we find that the airbags in European cars may be significantly worse than those made by American manufacturers.
The CRSS datasets are collected through a complex survey design, which this research ignores for simplicity. Therefore, in future studies, we could incorporate the survey design structure into MIUL, enhancing the estimation accuracy. We could also extend the unsupervised learning part to more advanced algorithms, such as autoencoders and self-organizing maps, which may provide more precise results. It would also be interesting to study the estimation performance when the one-shot devices have Weibull or gamma lifetimes with frailty, which are more realistic than the exponential distribution.
This manuscript also provides a potential way to detect possible manufacturing issues with one-shot devices using public data. It would be interesting to regularly track the number of airbag failures and report on any defective models. One method is to monitor failures using control charts. Control charts are commonly used in quality control to identify underlying problems in industrial processes; see [55,56,57]. Modifying control charts for public datasets would be an intriguing issue worthy of more investigation.

Author Contributions

Writing—original draft, H.Y.S.; Writing—review & editing, M.H.L.; Supervision, N.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the University Research Committee of Oakland University (Grant Type: Faculty Research Fellows).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Datasets are available from the Crash Report Sampling System of National Highway Traffic Safety Administration. https://www.nhtsa.gov/crash-data-systems/crash-report-sampling-system accessed on 20 June 2023.

Acknowledgments

The first author thanks the University Research Committee of Oakland University for supporting this research. The authors appreciate the anonymous referees who provided useful and detailed comments on a previous version of the manuscript. The authors also thank the associate editor and editor for handling our paper.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Details of Simulation Settings

Appendix A.1. One-Shot Device Model Setting

The reliability function of one-shot devices given covariate  x i :
R(t \mid x) = \Pr(T > t \mid x) = \exp\left( -t\, \lambda_{actual}(x) \right).
The actual hazard rates of one-shot devices:
\lambda_{actual}(x) = \exp\left( \beta_0 + \beta_{make} + \beta_{region} + \beta_{park} \right),
where \beta_0 = -6, -5 and -4 represent high, medium and low reliability of the one-shot devices, respectively, (\beta_{makeB}, \beta_{makeC}) = (0.4, 0.8), (\beta_{regionMW}, \beta_{regionS}, \beta_{regionW}) = (0.6, 0.4, 0.8) and \beta_{park} = 0.6.
The probability of parking outside is
\Pr(Park = Out \mid x) = 0.2 + 0.2\, I_{Rural} + 0.2\, I_{West} - 0.1\, I_{Sedan} + 0.2\, I_{Alcohol},
where the indicators are I_{Rural}, \Pr(I_{Rural} = 1) = 23.5\%, the car is in a rural area; I_{West}, \Pr(I_{West} = 1) = 16.9\%, the car is in the West region; I_{Sedan}, \Pr(I_{Sedan} = 1) = 47.5\%, the car is a sedan; and I_{Alcohol}, \Pr(I_{Alcohol} = 1) = 6\%, the driver is an alcohol drinker. We also assume that the driver's age follows a gamma distribution with shape and rate parameters (\alpha, \beta) = (20.01, 0.83) if the car is parked outdoors and (\alpha, \beta) = (12.04, 0.24) otherwise. Finally, Car_Age, the car age at the time of the accident, follows the multinomial distribution below.
Table A1. The distribution of car age at the time of the accident in the simulation study.
Car_Age (Prob.): 0.5 (0.36%), 1.5 (5.02%), 2.5 (7.65%), 3.5 (7.39%), 4.5 (7.42%), 5.5 (7.22%), 6.5 (4.94%), 7.5 (4.44%), 8.5 (4.08%), 9.5 (3.95%), 10.5 (4.11%), 11.5 (4.31%), 12.5 (4.61%), 13.5 (4.88%), 14.5 (4.64%), 15.5 (4.24%), 16.5 (3.79%),
17.5 (3.37%), 18.5 (2.87%), 19.5 (2.36%), 20.5 (1.93%), 21.5 (1.56%), 22.5 (1.18%), 23.5 (0.88%), 24.5 (0.66%), 25.5 (0.47%), 26.5 (0.33%), 27.5 (0.28%), 28.5 (0.21%), 29.5 (0.16%), 30.5 (0.13%), 31.5 (0.11%), 32.5 (0.07%), 33.5 (0.06%),
34.5 (0.06%), 35.5 (0.04%), 36.5 (0.03%), 37.5 (0.02%), 38.5 (0.03%), 39.5 (0.02%), 40.5 (0.02%), 41.5 (0.02%), 42.5 (0.01%), 43.5 (0.01%), 44.5 (0.01%), 45.5 (0.01%), 46.5 (0.01%), 47.5 (0.01%), 48.5 (0.01%), 49.5 (0.01%), 50+ (0.01%)

Appendix A.2. The Missing Mechanism for Different Scenarios

Scenario 1 in Section 5 stands for the survival statuses of one-shot devices being MAR,
\Pr(I_{Miss,i} = 1) = \alpha_0 + 0.05\, I_{Other} + 0.01 \log(Car\_Age),
where  α 0 = 0.05 , 0.1  and  0.2  represent low, medium and high missing levels, respectively, and the variable  I O t h e r , Pr ( I O t h e r = 1 ) = 32.7 %  indicates that the car is not a Sedan nor a Utility Vehicle.
Scenario 2 in Section 5 stands for the survival statuses of one-shot devices being MNAR,
\Pr(I_{Miss,i} = 1) = \alpha_0 + 0.05\, I_{Other} + 0.01 \log(Car\_Age) + 0.1\, I_{Out},
where  α 0 = 0.05 , 0.1  and  0.2  represent low, medium and high missing levels, respectively, and the indicator  I O t h e r = 1  if the car is not a Sedan nor a utility vehicle and  I O u t = 1  if the car is parked outdoors.
Scenario 3 in Section 5 stands for the  C a r _ A g e  being MAR,
\Pr(I_{Miss,i} = 1) = \alpha_0 + 0.01\, I_{South} + 0.05\, I_{Rural},
where  α 0 = 0.05 , 0.1  and  0.2  represent low, medium and high missing levels, respectively, and the indicators,  I S o u t h , Pr ( I S o u t h = 1 ) = 53.6 % , the accident location in the South region and  I R u r a l  the car is in a rural area.
Scenario 4 in Section 5 stands for the  C a r _ A g e  being MNAR,
\Pr(I_{Miss,i} = 1) = \alpha_0 + 0.01\, I_{South} + 0.05\, I_{Rural} + 0.1\, I_{Out},
where  α 0 = 0.05 , 0.1  and  0.2  represent low, medium and high missing levels, respectively, and the indicators  I S o u t h = 1  if the accident is located in the South region,  I R u r a l = 1  if the accident location is in a rural area and  I O u t = 1  if the car is parked outdoors.
Scenario 5 in Section 5 stands for the  R e g i o n  being MAR,
\Pr(I_{Miss,i} = 1) = \alpha_0 + 0.05\, I_{Rural},
where  α 0 = 0.05 , 0.1  and  0.2  represent low, medium and high missing levels, respectively, and the indicator  I R u r a l = 1  means the car is in a rural area.
Scenario 6 in Section 5 stands for the  R e g i o n  being MNAR,
\Pr(I_{Miss,i} = 1) = \alpha_0 + 0.01\, I_{South} + 0.05\, I_{Rural} + 0.1\, I_{Out},
where  α 0 = 0.05 , 0.1  and  0.2  represent low, medium, and high missing levels, respectively, and the meanings of indicators are  I S o u t h = 1  if the accident location is in the South region,  I R u r a l = 1  if the car is in a rural area and  I O u t = 1  if the car is parked outdoors.

Appendix B. Details of the Real Data Analysis

For each year, the CRSS publishes more than 20 datasets, including PERSON.CSV, which contains motorists’ and passengers’ data, VEHICLE.CSV, which contains the data of the vehicles involved in the accidents and VEVENT.CSV, which contains the harmful and non-harmful events for the vehicles; see [2]. We focus on these three data files for real data analysis.
We define the variable AirBag_Should to indicate the situations in which the airbags should deploy, i.e., when at least one of the following conditions is satisfied:
  • If there are any motorists or passengers severely injured or dead (INJ_SEV in categories 3 or 4);
  • If the area of impact on the vehicle is not at the back (AOI1 in categories 1–5 or 7–12);
  • If the car has to be towed (TOWED = 1).
The variable AirBag_Deploy is an indicator of whether any airbag deployed during the accident, and the variable AirBag_Success indicates whether AirBag_Deploy is equal to AirBag_Should. It shows whether the airbags worked properly on a vehicle during an accident.
Then, we treat AirBag_Success as the one-shot device status,  δ i , and the vehicle age (accident year minus car model year) as the one-shot device’s observed time  t i . The rest of the covariates,  x i , are the origins of the car makes (America, Europe or Asia), accident region (Northeast, Midwest, South, West) and if it happens in the urban areas (urban or rural), the vehicle body type (Sedans, Sport Utility Vehicle and others) and the driver’s age at the accident.
We then sample 1000, 2000 or 4000 observations from the dataset to mimic situations in which smaller datasets are collected in practice, and apply the traditional missing-data-handling methods and our proposed method, MIUL.
The R code, which modifies the CRSS datasets, is posted on GitHub: https://github.com/hso-OU/OneShotMIUL/blob/main/MLOnshotDataAnalysisV2.R accessed on 20 June 2023. The dataset created, CRSSData.RData, is uploaded to Kaggle: https://www.kaggle.com/datasets/honyiuso/airbagcrss/data accessed on 20 June 2023.

Appendix C. Tables of Simulation Results

Table A2. Bias of  β ^ m a k e B  when the response variable is missing at random. Bold values represent the best result.
Relb. | Miss. | Size | Mean.I | IPW | MI | K.MI | DB.MI | HC.MI
Low | Hig | Sml | −0.0227 | −0.0229 | −0.0272 | −0.0359 | −0.0385 | −0.0386
Low | Hig | Med | −0.0263 | −0.0267 | −0.0311 | −0.0494 | −0.0440 | −0.0462
Low | Hig | Lrg | −0.0245 | −0.0250 | −0.0271 | −0.0687 | −0.0404 | −0.0451
Low | Med | Sml | −0.0225 | −0.0225 | −0.0252 | −0.0284 | −0.0325 | −0.0306
Low | Med | Med | −0.0279 | −0.0281 | −0.0315 | −0.0388 | −0.0375 | −0.0393
Low | Med | Lrg | 0.0001 | 0.0001 | −0.0046 | −0.0253 | −0.0118 | −0.0157
Low | Low | Sml | −0.0233 | −0.0238 | −0.0258 | −0.0284 | −0.0301 | −0.0290
Low | Low | Med | −0.0235 | −0.0236 | −0.0253 | −0.0297 | −0.0296 | −0.0304
Low | Low | Lrg | −0.0188 | −0.0184 | −0.0225 | −0.0325 | −0.0255 | −0.0273
Med | Hig | Sml | −0.0017 | −0.0022 | −0.0064 | −0.0417 | −0.0215 | −0.0268
Med | Hig | Med | 0.0151 | 0.0143 | 0.0147 | −0.0596 | −0.0070 | −0.0213
Med | Hig | Lrg | 0.0408 | 0.0418 | 0.0285 | −0.0982 | 0.0177 | −0.0079
Med | Med | Sml | −0.0125 | −0.0128 | −0.0131 | −0.0295 | −0.0248 | −0.0271
Med | Med | Med | 0.0072 | 0.0072 | 0.0034 | −0.0323 | −0.0017 | −0.0148
Med | Med | Lrg | 0.0594 | 0.0594 | 0.0525 | −0.0266 | 0.0356 | 0.0232
Med | Low | Sml | −0.0068 | −0.0071 | −0.0084 | −0.0196 | −0.0149 | −0.0155
Med | Low | Med | 0.0216 | 0.0216 | 0.0187 | −0.0040 | 0.0136 | 0.0089
Med | Low | Lrg | 0.0360 | 0.0362 | 0.0341 | −0.0196 | 0.0252 | 0.0131
Hig | Hig | Sml | 0.0280 | 0.0273 | 0.0218 | −0.0876 | 0.0008 | −0.0229
Hig | Hig | Med | 0.0966 | 0.0965 | 0.0747 | −0.1276 | 0.0566 | 0.0208
Hig | Hig | Lrg | 0.4869 | 0.4870 | 0.3553 | −0.1799 | 0.2995 | 0.2463
Hig | Med | Sml | 0.0218 | 0.0218 | 0.0183 | −0.0411 | 0.0098 | −0.0061
Hig | Med | Med | 0.0919 | 0.0915 | 0.0884 | −0.0431 | 0.0691 | 0.0522
Hig | Med | Lrg | 0.3949 | 0.3941 | 0.3289 | −0.0285 | 0.3023 | 0.2683
Hig | Low | Sml | 0.0185 | 0.0186 | 0.0165 | −0.0194 | 0.0094 | 0.0015
Hig | Low | Med | 0.0670 | 0.0670 | 0.0622 | −0.0186 | 0.0547 | 0.0406
Hig | Low | Lrg | 0.3280 | 0.3289 | 0.2902 | 0.0756 | 0.2836 | 0.2698
Table A3. Bias of  β ^ m a k e C  when the response variable is missing at random. Bold values represent the best result.
Relb. | Miss. | Size | Mean.I | IPW | MI | K.MI | DB.MI | HC.MI
Low | Hig | Sml | −0.0057 | −0.0056 | −0.0070 | −0.0140 | −0.0175 | −0.0185
Low | Hig | Med | −0.0077 | −0.0080 | −0.0096 | −0.0209 | −0.0218 | −0.0224
Low | Hig | Lrg | −0.0047 | −0.0049 | −0.0027 | −0.0318 | −0.0149 | −0.0185
Low | Med | Sml | −0.0068 | −0.0067 | −0.0077 | −0.0100 | −0.0143 | −0.0132
Low | Med | Med | −0.0179 | −0.0182 | −0.0199 | −0.0241 | −0.0245 | −0.0270
Low | Med | Lrg | 0.0099 | 0.0103 | 0.0075 | −0.0058 | 0.0007 | −0.0024
Low | Low | Sml | −0.0070 | −0.0073 | −0.0076 | −0.0106 | −0.0120 | −0.0111
Low | Low | Med | −0.0070 | −0.0072 | −0.0076 | −0.0103 | −0.0117 | −0.0123
Low | Low | Lrg | −0.0020 | −0.0013 | −0.0035 | −0.0094 | −0.0062 | −0.0069
Med | Hig | Sml | 0.0046 | 0.0042 | 0.0024 | −0.0235 | −0.0170 | −0.0192
Med | Hig | Med | 0.0211 | 0.0205 | 0.0228 | −0.0306 | −0.0024 | −0.0121
Med | Hig | Lrg | 0.0425 | 0.0439 | 0.0355 | −0.0550 | 0.0181 | 0.0008
Med | Med | Sml | −0.0079 | −0.0081 | −0.0075 | −0.0187 | −0.0213 | −0.0221
Med | Med | Med | 0.0221 | 0.0221 | 0.0211 | −0.0065 | 0.0109 | 0.0024
Med | Med | Lrg | 0.0608 | 0.0613 | 0.0555 | −0.0028 | 0.0367 | 0.0279
Med | Low | Sml | −0.0005 | −0.0007 | −0.0005 | −0.0096 | −0.0090 | −0.0090
Med | Low | Med | 0.0243 | 0.0239 | 0.0231 | 0.0060 | 0.0155 | 0.0125
Med | Low | Lrg | 0.0416 | 0.0417 | 0.0408 | −0.0001 | 0.0318 | 0.0213
Hig | Hig | Sml | 0.0284 | 0.0276 | 0.0257 | −0.0552 | −0.0027 | −0.0183
Hig | Hig | Med | 0.0775 | 0.0773 | 0.0646 | −0.0859 | 0.0361 | 0.0083
Hig | Hig | Lrg | 0.4698 | 0.4697 | 0.3459 | −0.0927 | 0.2805 | 0.2407
Hig | Med | Sml | 0.0256 | 0.0255 | 0.0230 | −0.0217 | 0.0122 | 0.0000
Hig | Med | Med | 0.0901 | 0.0901 | 0.0916 | −0.0090 | 0.0657 | 0.0553
Hig | Med | Lrg | 0.3859 | 0.3854 | 0.3255 | 0.0213 | 0.2906 | 0.2670
Hig | Low | Sml | 0.0134 | 0.0136 | 0.0116 | −0.0144 | 0.0030 | −0.0035
Hig | Low | Med | 0.0671 | 0.0671 | 0.0641 | 0.0025 | 0.0535 | 0.0424
Hig | Low | Lrg | 0.3302 | 0.3312 | 0.2951 | 0.1150 | 0.2883 | 0.2764
Table A4. Mean squared error of  β ^ m a k e B  when the response variable is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01660.01660.01740.01770.01700.0179
Med0.03440.03450.03570.03590.03410.0358
Lrg0.06300.06330.06540.06530.06160.0642
MedSml0.01420.01420.01460.01470.01440.0144
Med0.02730.02720.02800.02780.02750.0275
Lrg0.06080.06060.06280.05830.05890.0599
LowSml0.01400.01390.01420.01450.01400.0143
Med0.03050.03060.03080.03110.03050.0311
Lrg0.05170.05190.05170.05180.05120.0518
MedHigSml0.03960.03950.04110.04030.03640.0383
Med0.07860.07860.08080.07100.07110.0722
Lrg0.16640.16650.16480.12550.14730.1486
MedSml0.03180.03190.03320.03170.03100.0318
Med0.06990.07040.06950.06480.06650.0662
Lrg0.23050.23170.23060.12840.15780.2156
LowSml0.03010.03010.03030.02940.02910.0294
Med0.06860.06890.06870.06490.06650.0671
Lrg0.14860.14920.14970.12680.14180.1408
HigHigSml0.10480.10470.10880.08800.09060.0917
Med0.29470.29460.27020.15270.20790.2103
Lrg3.66903.67162.37230.42071.73491.8630
MedSml0.08550.08580.08760.07920.08050.0805
Med0.26860.26910.26540.17140.22720.2320
Lrg2.95732.96002.27440.62491.94091.9558
LowSml0.08370.08350.08430.07760.08060.0818
Med0.19160.19280.19050.15420.17820.1782
Lrg2.05872.06101.68310.81591.68291.7118
Table A5. Mean squared error of  β ^ m a k e B  when the response variable is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01750.01750.01810.01790.01690.0177
Med0.03400.03400.03500.03420.03250.0341
Lrg0.06560.06590.06830.06410.06270.0653
MedSml0.01470.01480.01520.01490.01450.0147
Med0.02840.02830.02930.02810.02810.0284
Lrg0.06180.06160.06380.05800.06010.0602
LowSml0.01440.01440.01460.01470.01420.0147
Med0.03000.03000.02990.03010.02950.0302
Lrg0.05420.05430.05400.05340.05300.0543
MedHigSml0.04260.04250.04350.04140.03930.0406
Med0.08260.08230.08460.07220.07390.0747
Lrg0.18210.18220.18090.12850.15800.1634
MedSml0.03340.03340.03490.03340.03180.0333
Med0.07300.07340.07260.06630.06900.0689
Lrg0.23940.24030.24030.13830.16780.2231
LowSml0.03120.03130.03160.03070.02990.0308
Med0.06930.06960.07000.06560.06690.0674
Lrg0.15580.15630.15590.13220.14750.1469
HigHigSml0.11380.11370.11870.09180.09950.0997
Med0.28840.28840.26760.14430.20260.2119
Lrg3.69883.70232.40990.40051.76171.8864
MedSml0.09520.09550.09650.08620.08810.0889
Med0.27470.27540.27500.17610.22980.2366
Lrg2.93952.94152.26490.62451.92711.9404
LowSml0.08750.08740.08730.08200.08360.0853
Med0.20650.20750.20780.16860.19220.1918
Lrg2.05022.05251.67970.81991.68071.7181
Table A6. Bias of  β ^ m a k e B  when the response variable is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0184−0.0196−0.0255−0.0359−0.0392−0.0371
Med−0.0198−0.0218−0.0265−0.0500−0.0411−0.0428
Lrg−0.0181−0.0196−0.0256−0.0725−0.0373−0.0494
MedSml−0.0225−0.0238−0.0275−0.0328−0.0358−0.0338
Med−0.0252−0.0262−0.0303−0.0416−0.0387−0.0374
Lrg0.0005−0.0004−0.0023−0.0298−0.0162−0.0166
LowSml−0.0216−0.0228−0.0256−0.0295−0.0317−0.0296
Med−0.0206−0.0214−0.0242−0.0317−0.0313−0.0305
Lrg−0.0136−0.0148−0.0176−0.0337−0.0238−0.0257
MedHigSml0.00310.0028−0.0032−0.0410−0.0171−0.0269
Med0.02140.02080.0146−0.0686−0.0052−0.0247
Lrg0.06010.06110.0408−0.11460.0232−0.0097
MedSml−0.0059−0.0066−0.0069−0.0315−0.0192−0.0218
Med0.01160.01090.0044−0.0374−0.0016−0.0170
Lrg0.07210.07110.0635−0.02900.05070.0296
LowSml−0.0033−0.0046−0.0057−0.0211−0.0130−0.0165
Med0.02360.02300.0186−0.01460.01200.0053
Lrg0.04190.04150.0349−0.02750.02400.0150
HigHigSml0.03580.03490.0245−0.09960.0054−0.0231
Med0.12270.12290.0965−0.14890.06800.0281
Lrg0.51940.51750.3956−0.22380.33160.2786
MedSml0.02770.02710.0194−0.04750.0089−0.0132
Med0.08880.08770.0777−0.07220.06200.0349
Lrg0.46950.46950.3848−0.08710.32680.2880
LowSml0.01660.01660.0159−0.03320.0037−0.0094
Med0.08000.07930.0705−0.02360.05830.0399
Lrg0.34700.34960.31540.02410.28120.2654
Table A7. Bias of  β ^ m a k e C  when the response variable is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0047−0.0050−0.0073−0.0153−0.0204−0.0186
Med−0.0037−0.0048−0.0052−0.0221−0.0195−0.0192
Lrg0.00570.00500.0051−0.0302−0.0087−0.0164
MedSml−0.0072−0.0079−0.0090−0.0134−0.0169−0.0147
Med−0.0157−0.0160−0.0167−0.0251−0.0260−0.0243
Lrg0.00930.00890.0075−0.0088−0.0014−0.0035
LowSml−0.0064−0.0069−0.0080−0.0114−0.0143−0.0123
Med−0.0077−0.0079−0.0087−0.0142−0.0153−0.0142
Lrg0.00120.0007−0.0012−0.0110−0.0055−0.0081
MedHigSml0.00830.00830.0056−0.0224−0.0137−0.0193
Med0.02900.02860.0247−0.0366−0.0001−0.0130
Lrg0.06180.06260.0490−0.06510.02480.0006
MedSml−0.0022−0.0028−0.0018−0.0201−0.0165−0.0167
Med0.02590.02590.0213−0.00700.0136−0.0003
Lrg0.06900.06860.0656−0.00120.04750.0329
LowSml0.00170.00060.0008−0.0106−0.0080−0.0098
Med0.02680.02660.02360.00020.01490.0133
Lrg0.04550.04520.0412−0.00470.02680.0209
HigHigSml0.03640.03510.0284−0.06200.0000−0.0173
Med0.10280.10300.0829−0.10290.04640.0170
Lrg0.51500.51400.4107−0.11450.33100.2895
MedSml0.03010.02970.0251−0.02570.0076−0.0080
Med0.09040.09060.0831−0.02970.06330.0418
Lrg0.45560.45760.3814−0.02990.32000.2834
LowSml0.00920.00910.0102−0.0265−0.0052−0.0153
Med0.07690.07680.07090.00120.05420.0429
Lrg0.34900.35230.32420.07990.28450.2755
Table A8. Mean squared error of  β ^ m a k e B  when the covariate variable is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01750.01750.01830.01850.01770.0184
Med0.03580.03590.03760.03590.03520.0368
Lrg0.06980.07060.07160.06740.06820.0691
MedSml0.01530.01530.01590.01600.01580.0158
Med0.02840.02850.02990.02970.02890.0292
Lrg0.06370.06370.06410.06150.06110.0638
LowSml0.01400.01400.01420.01480.01460.0145
Med0.03100.03110.03200.03150.03070.0315
Lrg0.05670.05670.05830.05470.05510.0550
MedHigSml0.04320.04300.04480.04110.03970.0414
Med0.08540.08520.08890.07520.07780.0750
Lrg0.34050.34450.29250.15300.23120.2076
MedSml0.03270.03280.03340.03270.03210.0325
Med0.07000.07000.07220.06520.06670.0651
Lrg0.17920.17940.18250.13570.15930.1521
LowSml0.03240.03220.03340.03250.03140.0317
Med0.07490.07550.07470.06920.07010.0730
Lrg0.15480.15480.15250.13080.14300.1450
HigHigSml0.11520.11530.11530.09720.09710.0999
Med0.50200.50450.39650.17290.28590.3570
Lrg3.69513.70952.39740.34271.80471.9310
MedSml0.09010.09000.09160.07890.08190.0810
Med0.21820.21850.21180.15260.19140.1796
Lrg3.58673.59712.65710.45582.14842.0664
LowSml0.08540.08550.08660.07900.08140.0819
Med0.20300.20260.20060.15580.17930.1801
Lrg2.32102.33102.06430.75201.70861.7976
Table A9. Mean squared error of  β ^ m a k e C  when the response variable is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01820.01820.01890.01870.01750.0184
Med0.03640.03650.03850.03540.03500.0367
Lrg0.07150.07200.07270.06570.06730.0685
MedSml0.01490.01490.01530.01500.01470.0149
Med0.03030.03020.03130.03080.02960.0303
Lrg0.06880.06840.06890.06450.06520.0679
LowSml0.01510.01500.01490.01550.01510.0152
Med0.03050.03050.03100.03050.02970.0309
Lrg0.06070.06060.06220.05850.05790.0588
MedHigSml0.04400.04390.04600.04100.03960.0418
Med0.09270.09230.09540.07580.08270.0800
Lrg0.35270.35640.30720.15180.24630.2183
MedSml0.03520.03520.03630.03480.03390.0345
Med0.07500.07480.07770.06880.06960.0694
Lrg0.19220.19220.19780.14610.16960.1669
LowSml0.03360.03350.03440.03340.03230.0327
Med0.07510.07550.07520.06940.06960.0738
Lrg0.15830.15800.15700.13350.14560.1498
HigHigSml0.12290.12300.12290.09820.10350.1069
Med0.50400.50760.39970.16530.29010.3615
Lrg3.70863.72122.43200.32391.81691.9330
MedSml0.09820.09810.10010.08480.08890.0881
Med0.23040.23090.22650.15620.19970.1911
Lrg3.58983.60522.66130.45132.14732.0800
LowSml0.09110.09110.09400.08410.08690.0879
Med0.21690.21670.21590.16710.19090.1941
Lrg2.29122.30422.03780.75151.68621.7798
Table A10. Bias of  β ^ m a k e B  when Car_Age is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0088−0.0088−0.0095−0.0088−0.0135−0.0123
Med−0.0097−0.0101−0.0085−0.0097−0.0129−0.0135
Lrg0.00270.0027−0.0059−0.0051−0.0099−0.0095
MedSml−0.0074−0.0075−0.0090−0.0089−0.0107−0.0113
Med−0.0121−0.0122−0.0173−0.0168−0.0196−0.0192
Lrg0.00700.00710.00060.00250.0004−0.0004
LowSml−0.0053−0.0055−0.0064−0.0068−0.0080−0.0077
Med−0.0056−0.0060−0.0067−0.0071−0.0081−0.0077
Lrg−0.0039−0.0048−0.0014−0.0013−0.0024−0.0020
MedHigSml0.00790.00800.00050.0008−0.0011−0.0011
Med0.02320.02310.01760.01890.01630.0164
Lrg0.03510.03360.01720.01800.01510.0143
MedSml−0.0070−0.0072−0.0069−0.0066−0.0078−0.0075
Med0.01930.01930.01800.01860.01630.0169
Lrg0.05790.05810.05070.05090.04960.0488
LowSml0.00110.0011−0.0003−0.0002−0.0005−0.0006
Med0.02300.02310.02270.02290.02240.0224
Lrg0.03930.03890.03550.03610.03490.0349
HigHigSml0.03320.03300.02400.02500.02370.0236
Med0.07360.07370.06020.06150.05860.0588
Lrg0.44920.45170.24970.24980.24710.2477
MedSml0.01840.01850.01800.01820.01770.0177
Med0.08240.08210.07170.07270.07070.0711
Lrg0.31390.31470.27870.27900.27720.2785
LowSml0.01350.01380.00950.00970.00930.0096
Med0.07570.07570.06230.06280.06210.0625
Lrg0.30520.30550.27120.27130.27070.2703
Table A11. Bias of  β ^ m a k e C  when Car_Age is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01640.01630.01310.01310.01350.0136
Med0.03360.03380.02710.02710.02670.0271
Lrg0.06640.06610.05110.05070.05020.0505
MedSml0.01450.01450.01300.01310.01330.0132
Med0.02560.02550.02280.02280.02300.0231
Lrg0.05870.05850.05080.05100.05000.0505
LowSml0.01300.01290.01240.01240.01250.0125
Med0.02870.02850.02680.02680.02680.0267
Lrg0.04970.04970.04670.04660.04670.0467
MedHigSml0.03720.03720.02960.02940.02940.0292
Med0.07360.07370.05790.05800.05740.0577
Lrg0.15880.15790.11920.11900.11950.1190
MedSml0.03140.03140.02770.02770.02750.0275
Med0.06500.06480.06020.06010.05970.0599
Lrg0.14170.14250.12780.12800.12720.1271
LowSml0.03040.03050.02750.02760.02760.0275
Med0.06720.06700.06330.06360.06330.0633
Lrg0.13870.13850.12730.12780.12700.1271
HigHigSml0.09930.09970.07470.07500.07450.0747
Med0.30440.30570.23530.23520.23390.2341
Lrg3.27733.29611.49351.49061.48611.4905
MedSml0.08140.08170.06970.06950.06930.0693
Med0.19480.19370.16950.16970.16930.1692
Lrg2.39892.40212.05472.05122.04582.0457
LowSml0.08220.08230.07290.07290.07300.0728
Med0.18600.18620.16780.16780.16740.1674
Lrg1.90621.90111.63041.63061.62811.6278
Table A12. Mean squared error of  β ^ m a k e B  when Car_Age is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01640.01630.01310.01310.01350.0136
Med0.03360.03380.02710.02710.02670.0271
Lrg0.06640.06610.05110.05070.05020.0505
MedSml0.01450.01450.01300.01310.01330.0132
Med0.02560.02550.02280.02280.02300.0231
Lrg0.05870.05850.05080.05100.05000.0505
LowSml0.01300.01290.01240.01240.01250.0125
Med0.02870.02850.02680.02680.02680.0267
Lrg0.04970.04970.04670.04660.04670.0467
MedHigSml0.03720.03720.02960.02940.02940.0292
Med0.07360.07370.05790.05800.05740.0577
Lrg0.15880.15790.11920.11900.11950.1190
MedSml0.03140.03140.02770.02770.02750.0275
Med0.06500.06480.06020.06010.05970.0599
Lrg0.14170.14250.12780.12800.12720.1271
LowSml0.03040.03050.02750.02760.02760.0275
Med0.06720.06700.06330.06360.06330.0633
Lrg0.13870.13850.12730.12780.12700.1271
HigHigSml0.09930.09970.07470.07500.07450.0747
Med0.30440.30570.23530.23520.23390.2341
Lrg3.27733.29611.49351.49061.48611.4905
MedSml0.08140.08170.06970.06950.06930.0693
Med0.19480.19370.16950.16970.16930.1692
Lrg2.39892.40212.05472.05122.04582.0457
LowSml0.08220.08230.07290.07290.07300.0728
Med0.18600.18620.16780.16780.16740.1674
Lrg1.90621.90111.63041.63061.62811.6278
Table A13. Mean squared error of  β ^ m a k e C  when Car_Age is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01670.01670.01330.01330.01310.0132
Med0.03240.03250.02590.02580.02510.0256
Lrg0.06600.06590.05110.05160.05040.0509
MedSml0.01450.01450.01300.01300.01290.0128
Med0.02690.02680.02360.02370.02360.0236
Lrg0.05890.05870.05180.05220.05140.0520
LowSml0.01360.01360.01310.01310.01310.0130
Med0.02850.02840.02630.02630.02620.0261
Lrg0.05330.05310.04960.04970.04950.0497
MedHigSml0.03870.03870.03160.03170.03150.0315
Med0.07600.07610.06080.06090.06010.0607
Lrg0.17450.17380.13040.13070.13030.1299
MedSml0.03360.03370.02920.02920.02900.0290
Med0.06940.06920.06380.06400.06360.0638
Lrg0.15400.15440.13870.13900.13830.1384
LowSml0.03230.03240.02890.02900.02890.0289
Med0.06940.06920.06450.06460.06430.0643
Lrg0.14260.14250.13030.13110.13000.1300
HigHigSml0.10870.10900.08110.08140.08100.0809
Med0.29880.30030.23080.23080.22930.2298
Lrg3.23003.25011.48811.48701.48321.4833
MedSml0.08860.08900.07690.07700.07660.0768
Med0.20250.20150.17590.17640.17560.1756
Lrg2.39602.40052.05512.05132.04652.0463
LowSml0.08520.08520.07630.07630.07640.0761
Med0.19720.19800.18090.18080.18050.1805
Lrg1.90071.89791.62431.62341.62151.6214
Table A14. Bias of  β ^ m a k e B  when Car_Age is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0218−0.0251−0.0255−0.0260−0.0375−0.0367
Med−0.0273−0.0308−0.0286−0.0291−0.0392−0.0392
Lrg−0.0204−0.0244−0.0301−0.0304−0.0381−0.0369
MedSml−0.0199−0.0226−0.0251−0.0257−0.0323−0.0320
Med−0.0236−0.0262−0.0295−0.0302−0.0360−0.0357
Lrg−0.0027−0.0068−0.0126−0.0119−0.0182−0.0174
LowSml−0.0206−0.0232−0.0243−0.0250−0.0288−0.0284
Med−0.0186−0.0213−0.0240−0.0243−0.0275−0.0277
Lrg−0.0165−0.0202−0.0193−0.0184−0.0225−0.0226
MedHigSml−0.0026−0.0037−0.0094−0.0081−0.0128−0.0130
Med0.02420.02170.00820.00940.00540.0045
Lrg0.03250.02920.01650.01800.01220.0125
MedSml−0.0091−0.0110−0.0124−0.0116−0.0142−0.0140
Med0.01230.01110.00170.0025−0.0011−0.0005
Lrg0.06420.06120.04890.04970.04570.0459
LowSml−0.0066−0.0086−0.0091−0.0086−0.0101−0.0098
Med0.02070.01940.01670.01760.01530.0162
Lrg0.03520.03220.02920.03050.02810.0279
HigHigSml0.03170.03060.02030.02180.01800.0182
Med0.10060.09840.07500.07530.07350.0726
Lrg0.48170.48200.25130.25410.24930.2504
MedSml0.02290.02240.01470.01480.01370.0139
Med0.09730.09590.07190.07350.07150.0714
Lrg0.44760.44950.28960.29030.28740.2881
LowSml0.01560.01530.01330.01360.01340.0134
Med0.07200.07030.06090.06130.05970.0600
Lrg0.31400.31700.26760.26950.26740.2672
Table A15. Bias of  β ^ m a k e C  when Car_Age is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0095−0.0110−0.0103−0.0095−0.0142−0.0137
Med−0.0130−0.0144−0.0097−0.0096−0.0138−0.0135
Lrg−0.0013−0.0035−0.0060−0.0068−0.0092−0.0085
MedSml−0.0049−0.0062−0.0084−0.0083−0.0112−0.0107
Med−0.0134−0.0143−0.0174−0.0177−0.0195−0.0196
Lrg0.00620.0046−0.0014−0.0011−0.0042−0.0035
LowSml−0.0039−0.0049−0.0071−0.0075−0.0088−0.0090
Med−0.0048−0.0056−0.0079−0.0080−0.0094−0.0094
Lrg−0.0025−0.0045−0.0027−0.0017−0.0034−0.0033
MedHigSml0.00030.0003−0.0021−0.0007−0.0026−0.0033
Med0.02950.02780.01570.01630.01460.0135
Lrg0.02830.02610.02000.02060.01850.0188
MedSml−0.0046−0.0057−0.0074−0.0068−0.0081−0.0075
Med0.02490.02440.01770.01850.01690.0167
Lrg0.06220.06080.05060.05100.04850.0488
LowSml−0.0001−0.0010−0.0014−0.0010−0.0015−0.0012
Med0.02440.02430.02180.02240.02110.0217
Lrg0.04210.04040.03510.03560.03450.0344
HigHigSml0.03390.03320.02330.02430.02190.0223
Med0.07940.07800.05870.05980.05860.0576
Lrg0.48180.48330.24730.24920.24620.2469
MedSml0.02720.02710.01770.01810.01760.0174
Med0.09320.09220.07180.07290.07170.0719
Lrg0.43590.43920.27770.27770.27600.2767
LowSml0.01090.01120.01020.01030.01040.0102
Med0.07130.07050.06160.06200.06110.0614
Lrg0.31700.32070.27160.27250.27100.2708
Table A16. Mean squared error of  β ^ m a k e B  when Car_Age is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01580.01590.01330.01310.01370.0137
Med0.03310.03320.02640.02620.02660.0268
Lrg0.06730.06730.04890.04930.04890.0479
MedSml0.01530.01530.01300.01300.01320.0132
Med0.02780.02780.02340.02340.02340.0234
Lrg0.06060.06030.05130.05120.05070.0509
LowSml0.01360.01350.01250.01260.01290.0127
Med0.02960.02940.02680.02660.02670.0265
Lrg0.05300.05320.04680.04710.04640.0466
MedHigSml0.03900.03890.02940.02940.02920.0292
Med0.07870.07810.05990.05940.05920.0592
Lrg0.16970.16860.12010.11950.11800.1187
MedSml0.03330.03320.02750.02760.02750.0274
Med0.07350.07300.05980.05930.05930.0594
Lrg0.17070.17050.12770.12770.12700.1275
LowSml0.03050.03050.02770.02770.02770.0277
Med0.06920.06920.06330.06340.06310.0633
Lrg0.14390.14290.12670.12710.12670.1269
HigHigSml0.10550.10480.07590.07580.07570.0756
Med0.31130.31170.23440.23340.23300.2340
Lrg3.52783.55781.48871.49111.48241.4866
MedSml0.08970.08980.06980.06980.06940.0696
Med0.32400.32640.16970.17040.16950.1695
Lrg3.35923.38012.05262.05002.04742.0448
LowSml0.08360.08350.07340.07350.07350.0735
Med0.18920.18870.16730.16750.16720.1671
Lrg2.00902.02071.63201.63131.63051.6307
Table A17. Mean squared error of  β ^ m a k e C  when Car_Age is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01590.01590.01340.01330.01340.0133
Med0.03260.03260.02560.02550.02510.0253
Lrg0.06860.06830.05020.05050.04990.0487
MedSml0.01520.01510.01290.01280.01280.0129
Med0.02920.02900.02410.02400.02380.0237
Lrg0.06090.06060.05250.05200.05220.0521
LowSml0.01430.01410.01330.01330.01340.0133
Med0.02900.02880.02620.02600.02610.0260
Lrg0.05590.05600.05010.05030.04930.0497
MedHigSml0.04090.04070.03170.03180.03160.0317
Med0.08450.08410.06300.06250.06230.0621
Lrg0.18610.18590.13040.13060.12840.1293
MedSml0.03490.03470.02930.02940.02920.0291
Med0.07770.07720.06350.06330.06310.0632
Lrg0.17690.17550.13860.13800.13850.1385
LowSml0.03270.03290.02900.02900.02890.0289
Med0.06980.06970.06450.06470.06450.0646
Lrg0.15070.14960.13040.13070.13020.1299
HigHigSml0.11250.11190.08180.08160.08170.0816
Med0.30750.30810.23010.22970.22910.2305
Lrg3.49573.53031.48671.48781.47661.4827
MedSml0.09840.09830.07690.07710.07680.0769
Med0.32570.32810.17530.17580.17510.1751
Lrg3.34883.37502.05102.04712.04652.0436
LowSml0.08650.08630.07660.07680.07670.0767
Med0.20450.20440.18060.18100.18040.1806
Lrg2.02132.03451.62591.62501.62421.6235
Table A18. Bias of  β ^ m a k e B  when Region is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0213−0.0218−0.0245−0.0235−0.0300−0.0303
Med−0.0233−0.0237−0.0269−0.0261−0.0322−0.0324
Lrg−0.0232−0.0235−0.0278−0.0274−0.0329−0.0340
MedSml−0.0245−0.0249−0.0250−0.0244−0.0281−0.0280
Med−0.0286−0.0289−0.0287−0.0283−0.0317−0.0319
Lrg−0.0073−0.0086−0.0082−0.0081−0.0114−0.0124
LowSml−0.0233−0.0236−0.0240−0.0236−0.0258−0.0258
Med−0.0216−0.0219−0.0226−0.0222−0.0242−0.0244
Lrg−0.0153−0.0158−0.0178−0.0176−0.0194−0.0194
MedHigSml−0.0018−0.0020−0.0074−0.0068−0.0092−0.0095
Med0.02710.02690.01000.01060.00840.0076
Lrg0.03270.03260.01740.01670.01500.0138
MedSml−0.0102−0.0104−0.0108−0.0105−0.0121−0.0123
Med0.00570.00500.00200.00230.00070.0005
Lrg0.05550.05480.04970.04970.04810.0478
LowSml−0.0080−0.0082−0.0080−0.0079−0.0086−0.0087
Med0.02030.02010.01810.01820.01720.0173
Lrg0.03720.03610.03060.03030.03000.0293
HigHigSml0.02270.02250.02110.02130.02020.0201
Med0.09260.09150.07570.07540.07470.0743
Lrg0.42000.42240.25150.25110.25170.2508
MedSml0.02070.02090.01510.01520.01470.0147
Med0.08010.07980.07200.07200.07130.0716
Lrg0.36010.35940.29130.29100.29160.2913
LowSml0.01810.01850.01310.01320.01290.0130
Med0.07120.07090.06230.06220.06180.0620
Lrg0.28060.27890.26840.26830.26880.2682
Table A19. Bias of  β ^ m a k e C  when Region is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0062−0.0065−0.0083−0.0082−0.0105−0.0108
Med−0.0064−0.0065−0.0078−0.0082−0.0101−0.0104
Lrg0.00170.0016−0.0044−0.0046−0.0069−0.0076
MedSml−0.0092−0.0094−0.0085−0.0084−0.0095−0.0095
Med−0.0186−0.0187−0.0174−0.0173−0.0183−0.0184
Lrg0.00230.00140.00180.00160.0005−0.0004
LowSml−0.0058−0.0060−0.0070−0.0069−0.0078−0.0078
Med−0.0062−0.0063−0.0069−0.0067−0.0076−0.0075
Lrg0.00140.0010−0.0017−0.0018−0.0025−0.0023
MedHigSml0.00230.0023−0.0005−0.0004−0.0009−0.0012
Med0.03110.03100.01630.01640.01600.0153
Lrg0.03380.03350.02000.01840.01920.0179
MedSml−0.0057−0.0060−0.0060−0.0058−0.0066−0.0067
Med0.01890.01830.01770.01770.01730.0170
Lrg0.05520.05480.04930.04910.04870.0484
LowSml 4 × 10^−4 4 × 10^−4 4 × 10^−4 4 × 10^−4 5 × 10^−4 7 × 10^−4
Med0.02310.02280.02250.02230.02210.0223
Lrg0.04070.03990.03530.03470.03520.0343
HigHigSml0.02580.02570.02350.02390.02320.0232
Med0.07580.07470.06070.06010.06040.0600
Lrg0.41270.41620.24640.24580.24700.2462
MedSml0.02170.02200.01830.01840.01800.0182
Med0.07940.07960.07210.07210.07180.0719
Lrg0.34100.34160.27770.27720.27860.2782
LowSml0.01450.01490.00940.00950.00930.0093
Med0.07250.07210.06240.06210.06200.0622
Lrg0.28580.28490.27130.27130.27200.2714
Table A20. Mean squared error of  β ^ m a k e B  when Region is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01550.01560.01260.01250.01270.0127
Med0.03320.03320.02630.02630.02620.0262
Lrg0.06100.06090.04850.04860.04840.0484
MedSml0.01460.01460.01290.01280.01290.0129
Med0.02510.02520.02300.02290.02290.0229
Lrg0.05610.05610.05040.05040.05020.0502
LowSml0.01330.01330.01240.01240.01250.0125
Med0.02760.02750.02660.02660.02670.0267
Lrg0.04950.04960.04630.04620.04620.0462
MedHigSml0.03650.03640.02930.02930.02910.0291
Med0.08750.08790.05820.05830.05780.0580
Lrg0.15860.15870.11800.11860.11750.1174
MedSml0.03140.03130.02740.02740.02730.0274
Med0.06940.06900.05950.05960.05930.0592
Lrg0.14340.14430.12760.12780.12740.1273
LowSml0.02900.02910.02760.02760.02750.0275
Med0.06670.06670.06330.06340.06330.0632
Lrg0.14040.14020.12680.12700.12640.1265
HigHigSml0.10040.10010.07490.07520.07480.0749
Med0.35420.35320.23400.23480.23490.2354
Lrg3.04613.06721.48551.48511.49781.4996
MedSml0.08140.08150.06980.06980.06970.0697
Med0.19780.19730.17000.17000.16950.1694
Lrg2.67312.67652.05332.05672.06532.0650
LowSml0.08060.08090.07320.07320.07310.0730
Med0.18120.18130.16810.16810.16780.1679
Lrg1.72101.71561.63061.63111.63501.6328
Table A21. Mean squared error of  β ^ m a k e C  when Region is missing at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01600.01610.01300.01300.01290.0129
Med0.03290.03290.02510.02520.02490.0248
Lrg0.06460.06430.04970.04980.04910.0492
MedSml0.01470.01470.01260.01260.01260.0126
Med0.02640.02650.02360.02350.02330.0233
Lrg0.05810.05790.05210.05220.05180.0517
LowSml0.01400.01400.01310.01310.01310.0131
Med0.02750.02740.02620.02620.02620.0262
Lrg0.05380.05360.04950.04950.04940.0494
MedHigSml0.03950.03940.03150.03160.03130.0313
Med0.08990.09040.06150.06150.06110.0612
Lrg0.16850.16840.12850.12890.12830.1282
MedSml0.03330.03320.02910.02910.02910.0291
Med0.07320.07290.06360.06370.06330.0633
Lrg0.15590.15640.13840.13840.13810.1381
LowSml0.03040.03050.02890.02900.02890.0290
Med0.06750.06770.06440.06450.06450.0644
Lrg0.14500.14500.13030.13040.13000.1300
HigHigSml0.10970.10940.08070.08110.08070.0808
Med0.34470.34350.22950.23010.23040.2309
Lrg3.04953.06971.48341.48331.49591.4974
MedSml0.08930.08940.07720.07720.07720.0772
Med0.20210.20150.17580.17580.17540.1750
Lrg2.64732.64942.05162.05472.06492.0637
LowSml0.08400.08410.07650.07650.07640.0764
Med0.19560.19570.18130.18120.18080.1809
Lrg1.71201.70681.62371.62471.62891.6262
Table A22. Bias of  β ^ m a k e B  when Region is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0218−0.0256−0.0261−0.0252−0.0318−0.0318
Med−0.0273−0.0313−0.0269−0.0263−0.0324−0.0333
Lrg−0.0204−0.0250−0.0273−0.0277−0.0330−0.0340
MedSml−0.0199−0.0231−0.0252−0.0247−0.0287−0.0288
Med−0.0236−0.0265−0.0280−0.0279−0.0316−0.0322
Lrg−0.0027−0.0073−0.0100−0.0097−0.0135−0.0142
LowSml−0.0206−0.0236−0.0240−0.0237−0.0262−0.0262
Med−0.0186−0.0218−0.0228−0.0223−0.0246−0.0251
Lrg−0.0165−0.0205−0.0178−0.0173−0.0196−0.0202
MedHigSml−0.0026−0.0040−0.0073−0.0066−0.0092−0.0095
Med0.02420.02160.01060.01130.00830.0076
Lrg0.03250.02960.01630.01650.01420.0132
MedSml−0.0091−0.0110−0.0114−0.0109−0.0126−0.0128
Med0.01230.01090.00210.00170.00050.0003
Lrg0.06420.06100.05020.04990.04910.0480
LowSml−0.0066−0.0089−0.0081−0.0079−0.0089−0.0090
Med0.02070.01920.01810.01810.01690.0171
Lrg0.03520.03210.02910.02890.02840.0278
HigHigSml0.03170.03070.02130.02150.02020.0201
Med0.10060.09850.07570.07530.07410.0741
Lrg0.48170.48250.25030.24990.25120.2497
MedSml0.02290.02230.01520.01520.01440.0144
Med0.09730.09670.07340.07320.07210.0724
Lrg0.44760.44890.29020.29050.29130.2908
LowSml0.01560.01500.01290.01330.01280.0125
Med0.07200.07070.06190.06170.06160.0614
Lrg0.31400.31740.26750.26740.26800.2671
Table A23. Bias of  β ^ m a k e C  when Region is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml−0.0095−0.0112−0.0093−0.0094−0.0116−0.0116
Med−0.0130−0.0147−0.0081−0.0083−0.0101−0.0110
Lrg−0.0013−0.0040−0.0051−0.0067−0.0074−0.0082
MedSml−0.0049−0.0063−0.0079−0.0079−0.0093−0.0094
Med−0.0134−0.0144−0.0163−0.0168−0.0177−0.0183
Lrg0.00620.00420.00050.0003−0.0010−0.0017
LowSml−0.0039−0.0051−0.0067−0.0067−0.0077−0.0076
Med−0.0048−0.0059−0.0070−0.0069−0.0075−0.0080
Lrg−0.0025−0.0045−0.0018−0.0014−0.0023−0.0030
MedHigSml0.00030.0001−0.0002−0.0001−0.0007−0.0011
Med0.02950.02770.01740.01750.01660.0159
Lrg0.02830.02660.01970.01890.01910.0183
MedSml−0.0046−0.0056−0.0063−0.0063−0.0068−0.0070
Med0.02490.02440.01820.01730.01750.0173
Lrg0.06220.06080.05040.05010.05030.0493
LowSml−0.0001−0.0012−0.0004−0.0003−0.0005−0.0007
Med0.02440.02430.02260.02240.02190.0220
Lrg0.04210.04040.03390.03350.03410.0335
HigHigSml0.03390.03330.02360.02360.02290.0230
Med0.07940.07820.06040.05980.05990.0599
Lrg0.48180.48410.24570.24460.24660.2451
MedSml0.02720.02710.01840.01830.01780.0179
Med0.09320.09320.07360.07320.07240.0729
Lrg0.43590.43910.27810.27840.27940.2787
LowSml0.01090.01090.00930.00940.00920.0091
Med0.07130.07090.06230.06210.06230.0620
Lrg0.31700.32090.27140.27110.27170.2711
Table A24. Mean squared error of  β ^ m a k e B  when Region is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01580.01600.01280.01270.01300.0129
Med0.03310.03330.02640.02630.02630.0264
Lrg0.06730.06720.04870.04930.04850.0486
MedSml0.01530.01530.01280.01280.01290.0129
Med0.02780.02780.02280.02300.02290.0229
Lrg0.06060.06000.05050.05060.05010.0503
LowSml0.01360.01350.01240.01250.01250.0125
Med0.02960.02930.02670.02670.02670.0267
Lrg0.05300.05290.04670.04660.04650.0465
MedHigSml0.03900.03890.02920.02910.02900.0289
Med0.07870.07850.05840.05860.05800.0580
Lrg0.16970.16900.11770.11810.11710.1171
MedSml0.03330.03310.02750.02750.02740.0274
Med0.07350.07300.05960.05960.05930.0593
Lrg0.17070.17050.12790.12790.12730.1275
LowSml0.03050.03050.02740.02750.02740.0274
Med0.06920.06920.06370.06360.06340.0634
Lrg0.14390.14250.12710.12700.12670.1268
HigHigSml0.10550.10490.07500.07490.07480.0748
Med0.31130.31170.23450.23560.23500.2354
Lrg3.52783.56111.47781.48111.50351.4995
MedSml0.08970.08990.06980.07000.06980.0698
Med0.32400.32690.16970.16970.16930.1695
Lrg3.35923.38252.05272.05522.06902.0670
LowSml0.08360.08340.07320.07330.07320.0732
Med0.18920.18880.16780.16790.16760.1675
Lrg2.00902.02161.62961.63481.63851.6370
Table A25. Mean squared error of  β ^ m a k e C  when Region is missing not at random. Bold values represent the best result.
Methods
Relb. Miss. Size Mean.I IPW MI K.MI DB.MI HC.MI
LowHigSml0.01590.01590.01310.01310.01290.0129
Med0.03260.03260.02550.02540.02510.0251
Lrg0.06860.06820.04970.04990.04910.0493
MedSml0.01520.01510.01270.01270.01250.0125
Med0.02920.02900.02370.02380.02360.0235
Lrg0.06090.06030.05190.05200.05140.0515
LowSml0.01430.01410.01310.01310.01310.0131
Med0.02900.02870.02630.02620.02620.0261
Lrg0.05590.05580.04990.05000.04960.0497
MedHigSml0.04090.04070.03140.03140.03120.0311
Med0.08450.08440.06120.06130.06080.0608
Lrg0.18610.18600.12950.12970.12860.1286
MedSml0.03490.03470.02910.02910.02900.0290
Med0.07770.07730.06370.06360.06320.0632
Lrg0.17690.17570.13840.13850.13770.1378
LowSml0.03270.03290.02890.02890.02880.0288
Med0.06980.06980.06480.06490.06460.0646
Lrg0.15070.14900.13060.13040.13010.1302
HigHigSml0.11250.11190.08090.08090.08070.0808
Med0.30750.30800.22980.23090.23040.2307
Lrg3.49573.53341.47251.47621.49961.4955
MedSml0.09840.09840.07730.07750.07720.0772
Med0.32570.32880.17550.17550.17530.1755
Lrg3.34883.37752.04932.05212.06662.0649
LowSml0.08650.08620.07650.07670.07660.0766
Med0.20450.20450.18080.18100.18050.1805
Lrg2.02132.03501.62471.62921.63281.6328

References

  1. Tabuchi, H. Air Bag Flaw, Long Known to Honda and Takata, Led to Recalls. The New York Times, 11 September 2014. [Google Scholar]
  2. National Center for Statistics and Analysis. Crash Report Sampling System Analytical User’s Manual, 2016–2021; (Report No. DOT HS 813 436); National Highway Traffic Safety Administration: Washington, DC, USA, 2023.
  3. Rubin, D.B. Inference and Missing Data. Biometrika 1976, 63, 581–592. [Google Scholar] [CrossRef]
  4. Rubin, D.B. Multiple Imputation for Nonresponse in Surveys; John Wiley & Sons: New York, NY, USA, 1987. [Google Scholar]
  5. Little, R.; Rubin, D. Statistical Analysis with Missing Data; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
  6. Ferro, S.; Bottigliengo, D.; Gregori, D.; Fabricio, A.S.C.; Gion, M.; Baldi, I. Phenomapping of Patients with Primary Breast Cancer Using Machine Learning-Based Unsupervised Cluster Analysis. J. Pers. Med. 2021, 11, 272. [Google Scholar] [CrossRef] [PubMed]
  7. Nouraei, H.; Nouraei, H.; Rabkin, S.W. Comparison of Unsupervised Machine Learning Approaches for Cluster Analysis to Define Subgroups of Heart Failure with Preserved Ejection Fraction with Different Outcomes. Bioengineering 2022, 9, 175. [Google Scholar] [CrossRef] [PubMed]
  8. Zuo, Y.; Lundberg, J.; Chandran, P.; Rantatalo, M. Squat Detection and Estimation for Railway Switches and Crossings Utilising Unsupervised Machine Learning. Appl. Sci. 2023, 13, 5376. [Google Scholar] [CrossRef]
  9. Groenwold, R.H.; White, I.R.; Donders, A.R.; Carpenter, J.R.; Altman, D.G.; Moons, K.G. Missing covariate data in clinical research: When and when not to use the missing-indicator method for analysis. Can. Med. Assoc. J. 2012, 184, 1265–1269. [Google Scholar] [CrossRef]
  10. Pedersen, A.B.; Mikkelsen, E.M.; Cronin-Fenton, D.; Kristensen, N.R.; Pham, T.M.; Pedersen, L.; Petersen, I. Missing data and multiple imputation in clinical epidemiological research. Clin. Epidemiol. 2017, 9, 157–166. [Google Scholar] [CrossRef]
  11. Yang, S.; Berdine, G. Missing values in data analysis. Southwest Respir. Crit. Care Chronicles 2022, 10, 57–60. [Google Scholar] [CrossRef]
  12. Sterne, J.A.; White, I.R.; Carlin, J.B.; Spratt, M.; Royston, P.; Kenward, M.G.; Wood, A.M.; Carpenter, J.R. Multiple imputation for missing data in epidemiological and clinical research: Potential and pitfalls. Br. Med. J. 2009, 338, b2393. [Google Scholar] [CrossRef]
  13. Seaman, S.R.; White, I.R. Review of inverse probability weighting for dealing with missing data. Stat. Methods Med. Res. 2013, 22, 278–295. [Google Scholar] [CrossRef]
  14. Jerez, J.M.; Molina, I.; García-Laencina, P.J.; Alba, E.; Ribelles, N.; Martín, M.; Franco, L. Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif. Intell. Med. 2010, 50, 105–115. [Google Scholar] [CrossRef]
  15. García-Laencina, P.J.; Sancho-Gómez, J.L.; Figueiras-Vidal, A.R. Pattern classification with missing data: A review. Neural Comput. Appl. 2010, 19, 263–282. [Google Scholar] [CrossRef]
  16. Waljee, A.K.; Mukherjee, A.; Singal, A.G.; Zhang, Y.; Warren, J.; Balis, U.; Marrero, J.; Zhu, J.; Higgins, P.D.R. Comparison of imputation methods for missing laboratory data in medicine. BMJ Open 2013, 3, 1–7. [Google Scholar] [CrossRef] [PubMed]
  17. Barakat, M.S.; Field, M.; Ghose, A.; Stirling, D.; Holloway, L.; Vinod, S.; Dekker, A.; Thwaites, D. The effect of imputing missing clinical attribute values on training lung cancer survival prediction model performance. Health Inf. Sci. Syst. 2017, 5, 16. [Google Scholar] [CrossRef] [PubMed]
  18. Gmel, G. Imputation of missing values in the case of a multiple item instrument measuring alcohol consumption. Stat. Med. 2001, 20, 2369–2381. [Google Scholar] [CrossRef]
  19. Balakrishnan, N.; So, H.Y.; Ling, M.H. EM algorithm for one-shot device testing with competing risks under exponential distribution. Reliab. Eng. Syst. Saf. 2015, 137, 129–140. [Google Scholar] [CrossRef]
  20. Balakrishnan, N.; So, H.Y.; Ling, M.H. EM Algorithm for One-Shot Device Testing with Competing Risks under Weibull Distribution. IEEE Trans. Reliab. 2016, 65, 973–991. [Google Scholar] [CrossRef]
  21. Azur, M.J.; Stuart, E.A.; Frangakis, C.; Leaf, P.J. Multiple imputation by chained equations: What is it and how does it work? Int. J. Methods Psychiatr. Res. 2011, 20, 40–49. [Google Scholar] [CrossRef]
  22. Liu, Y.; De, A. Multiple Imputation by Fully Conditional Specification for Dealing with Missing Data in a Large Epidemiologic Study. Int. J. Stat. Med. Res. 2015, 4, 287–295. [Google Scholar] [CrossRef]
  23. Murray, J.S. Multiple Imputation: A Review of Practical and Theoretical Findings. Stat. Sci. 2018, 33, 142–159. [Google Scholar] [CrossRef]
  24. Lee, K.; Carlin, J.B. Multiple imputation in the presence of non-normal data. Stat. Med. 2016, 36, 606–617. [Google Scholar] [CrossRef]
  25. Barnard, J.; Rubin, D.B. Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika 1999, 86, 948–955. [Google Scholar] [CrossRef]
  26. Heymans, M.W.; Eekhout, I. Applied Missing Data Analysis with SPSS and (R)Studio. First Draft. 2019. Available online: https://bookdown.org/mwheymans/bookmi/missing-data-in-questionnaires.html (accessed on 22 February 2024).
  27. Ling, M.H.; Balakrishnan, N.; Yu, C.; So, H.Y. Inference for One-Shot Devices with Dependent k-Out-of-M Structured Components under Gamma Frailty. Mathematics 2021, 9, 3032. [Google Scholar] [CrossRef]
  28. Ling, M.H.; Balakrishnan, N.; Bae, S.J. On the application of inverted Dirichlet distribution for reliability inference of completely censored components with dependent structure. Comput. Ind. Eng. 2024, 196, 110452. [Google Scholar] [CrossRef]
  29. Hand, D.J.; Bolton, R.J. Pattern discovery and detection: A unified statistical methodology. J. Appl. Stat. 2004, 31, 885–924. [Google Scholar] [CrossRef]
  30. Aschenbruck, R.; Szepannek, G.; Wilhelm, A.F.X. Imputation Strategies for Clustering Mixed Type Data with Missing Values. J. Classif. 2023, 40, 2–24. [Google Scholar] [CrossRef]
  31. Agresti, A. An Introduction to Categorical Data Analysis, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2019. [Google Scholar]
  32. Ward, J.H. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 1963, 58, 236–244. [Google Scholar] [CrossRef]
  33. Johnson, S.C. Hierarchical clustering schemes. Psychometrika 1967, 32, 241–254. [Google Scholar] [CrossRef]
  34. Lance, G.N.; Williams, W.T. A general theory of classificatory sorting strategies: 1. Hierarchical systems. Comput. J. 1967, 9, 373–380. [Google Scholar] [CrossRef]
  35. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning, 2nd ed.; Springer: New York, NY, USA, 2009. [Google Scholar]
  36. Huang, Z. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Min. Knowl. Discov. 1998, 2, 283–304. [Google Scholar] [CrossRef]
  37. Huang, Z. Clustering large data sets with mixed numeric and categorical values. In Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, 23–24 February 1997; World Scientific: Singapore, 1997; pp. 21–34. [Google Scholar]
  38. Ji, J.; Pang, W.; Zhou, C.; Han, X.; Wang, Z. A fuzzy k-prototype clustering algorithm for mixed numeric and categorical data. Knowl.-Based Syst. 2012, 30, 129–135. [Google Scholar] [CrossRef]
  39. Ji, J.; Bai, T.; Zhou, C.; Ma, C.; Wang, Z. An improved k-prototypes clustering algorithm for mixed numeric and categorical data. Neurocomputing 2013, 120, 590–596. [Google Scholar] [CrossRef]
  40. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, USA, 2–4 August 1996. [Google Scholar]
  41. Shukla, N.; Hagenbuchner, M.; Win, K.T.; Yang, J. Breast cancer data analysis for survivability studies and prediction. Comput. Methods Programs Biomed. 2018, 155, 199–208. [Google Scholar] [CrossRef] [PubMed]
  42. Ankerst, M.; Breunig, M.M.; Kriegel, H.-P.; Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD ’99), Philadelphia, PA, USA, 1–3 June 1999; Association for Computing Machinery: New York, NY, USA, 1999; pp. 49–60. [Google Scholar]
  43. Schubert, E.; Sander, J.; Ester, M.; Kriegel, H.-P. DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN. ACM Trans. Database Syst. 2017, 42, 19. [Google Scholar] [CrossRef]
  44. Campello, R.J.G.B.; Moulavi, D.; Zimek, A.; Sander, J. Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. ACM Trans. Knowl. Discov. Data 2017, 10, 1–51. [Google Scholar] [CrossRef]
  45. McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density-based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar] [CrossRef]
  46. Gower, J.C. A General Coefficient of Similarity and Some of Its Properties. Biometrics 1971, 27, 857–871. [Google Scholar] [CrossRef]
  47. Gower, J.C. A note on Burnaby’s character-weighted similarity coefficient. J. Int. Assoc. Math. Geol. 1970, 2, 39–45. [Google Scholar] [CrossRef]
  48. van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate Imputation by Chained Equations in R. J. Stat. Softw. 2011, 45, 1–67. [Google Scholar] [CrossRef]
  49. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2023. [Google Scholar]
  50. Maechler, M.; Rousseeuw, P.; Struyf, A.; Hubert, M.; Hornik, K. Cluster: Cluster Analysis Basics and Extensions, R package version 2.1.4. 2022.
  51. Szepannek, G. clustMixType: User-friendly clustering of mixed-type data in R. R J. 2018, 10, 200–208. [Google Scholar] [CrossRef]
  52. Rahmah, N.; Sitanggang, I.S. Determination of Optimal Epsilon (Eps) Value on DBSCAN Algorithm to Clustering Data on Peatland Hotspots in Sumatra. IOP Conf. Ser. Earth Environ. Sci. 2016, 31, 012012.
  53. Hennig, C. fpc: Flexible Procedures for Clustering, R package version 2.2-12. 2024.
  54. Zhang, F.; Subramanian, R.; Chen, C.-L.; Noh, E.Y. Crash Report Sampling System: Design Overview, Analytic Guidance, and FAQs (Report No. DOT HS 812 688); National Highway Traffic Safety Administration: Washington, DC, USA, 2019.
  55. Uncu, N.; Koyuncu, M. Enhancing Control: Unveiling the Performance of Poisson EWMA Charts through Simulation with Poisson Mixture Data. Appl. Sci. 2023, 13, 11160. [Google Scholar] [CrossRef]
  56. Vivancos, J.-L.; Buswell, R.A.; Cosar-Jorda, P.; Aparicio-Fernández, C. The application of quality control charts for identifying changes in time-series home energy data. Energy Build. 2020, 215, 109841. [Google Scholar] [CrossRef]
  57. Yeganeh, A.; Shadman, A. Using evolutionary artificial neural networks in monitoring binary and polytomous logistic profiles. J. Manuf. Syst. 2021, 61, 546–561. [Google Scholar] [CrossRef]
Figure 1. Illustration of how to perform mean imputation. The highlighted entries are the missing observations, and the red numbers represent the values imputed by the mean imputation. The table on the left shows the complete data, the table in the middle shows the data with missing observations, and the table on the right shows the missing entries imputed by the mean of the observed values.
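A minimal sketch of this mean-imputation step, on a toy data frame rather than the figure’s actual numbers, is:

```r
# Toy example of mean imputation: each NA is replaced by the mean of the
# observed values in its column.
x <- data.frame(t = c(2, 5, NA, 7), temp = c(30, NA, 45, 50))
x_imp <- as.data.frame(lapply(x, function(col) {
  col[is.na(col)] <- mean(col, na.rm = TRUE)
  col
}))
x_imp
```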
Figure 2. An illustration of how we implement the EM-algorithm-based imputation. The highlighted entries are the missing observations, and the red numbers represent the values imputed during the imputation procedure. The initial parameter estimates are  λ ( 0 ) = 0.1 , β ( 0 ) = 0.5 . We impute the missing data with their expectations based on the current parameter estimates,  λ ( t ) , β ( t ) .
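A schematic sketch of such an EM-style imputation is given below. It assumes, purely for illustration, an exponential one-shot model with failure probability P(failure by t | x) = 1 − exp(−λ e^{βx} t); this is an assumption for the sketch and not necessarily the exact model behind the figure. The E-step replaces missing statuses by their expected values under the current (λ, β), and the M-step refits the parameters.

```r
# Schematic EM-style imputation for one-shot data under an assumed exponential
# model: P(failure by t | x) = 1 - exp(-lambda * exp(beta * x) * t).
neg_loglik <- function(par, t, x, d) {
  lambda <- exp(par[1]); beta <- par[2]
  p <- 1 - exp(-lambda * exp(beta * x) * t)
  p <- pmin(pmax(p, 1e-10), 1 - 1e-10)           # numerical guard
  -sum(d * log(p) + (1 - d) * log(1 - p))
}

em_impute <- function(t, x, delta, n_iter = 50) {
  miss   <- is.na(delta)
  lambda <- 0.1; beta <- 0.5                     # initial values as in the caption
  d_full <- delta
  for (it in seq_len(n_iter)) {
    # E-step: impute missing statuses by their expectation under the current fit
    d_full[miss] <- 1 - exp(-lambda * exp(beta * x[miss]) * t[miss])
    # M-step: update (lambda, beta) by maximizing the completed-data likelihood
    fit    <- optim(c(log(lambda), beta), neg_loglik, t = t, x = x, d = d_full)
    lambda <- exp(fit$par[1]); beta <- fit$par[2]
  }
  list(lambda = lambda, beta = beta, delta_imputed = d_full)
}
```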
Figure 3. An illustration of how the multiple imputation procedure works. The highlighted entries are the missing observations, and the red numbers represent the values imputed during the procedure. In this example, we assume there is no model coefficient uncertainty.
Figure 4. Airbag success rate by manufacturing year for the accidents that happened between 2017 and 2020.
Table 1. The mean, standard error (SE), and 95% confidence interval lower and upper limits (CI.L, CI.U) of the estimates of  β m a k e A s i a n  under various imputation methods, with bootstrap samples of size 1000 (Sml), 2000 (Med), and 4000 (Lrg).
Methods
Year Size Stat Mean.I IPW MI K.MI DB.MI HC.MI
2017SmlMean−0.00520−0.00883−0.010230.00636−0.01036−0.01041
SE0.357210.357740.355170.324780.357490.35748
CI.L−0.73138−0.74240−0.74146−0.65902−0.73862−0.73884
CI.U0.701970.684700.696470.621630.687640.68775
MedMean0.014840.010490.009300.017700.010300.01026
SE0.236140.240190.236390.228980.235490.23552
CI.L−0.45937−0.47115−0.46515−0.43956−0.46083−0.46396
CI.U0.469830.468260.459990.447590.463990.46300
LrgMean0.024930.020570.019090.027140.020970.02092
SE0.165600.169030.167070.165640.165360.16541
CI.L−0.29847−0.31457−0.31544−0.31185−0.31176−0.31219
CI.U0.336000.339240.338590.344270.333200.33312
2018SmlMean0.019700.027230.016850.020730.016310.01623
SE0.332880.336810.332630.308760.331490.33135
CI.L−0.63370−0.62636−0.63507−0.57207−0.63620−0.63582
CI.U0.655280.673210.668050.586580.640480.64060
MedMean0.030630.038120.031440.034970.028860.02874
SE0.231440.232790.229250.225620.230770.23081
CI.L−0.41830−0.40815−0.42225−0.41337−0.41929−0.41835
CI.U0.489700.497130.485280.476630.479020.47678
LrgMean0.019740.027410.019670.025830.017450.01735
SE0.160320.161430.159510.159190.159760.15975
CI.L−0.30033−0.29550−0.30266−0.30400−0.30021−0.30070
CI.U0.327690.342100.315990.331770.323830.32292
2019SmlMean0.023220.022890.027260.021500.025310.02537
SE0.305920.305700.303480.291860.305430.30532
CI.L−0.56986−0.57100−0.57995−0.54529−0.56919−0.56984
CI.U0.600670.605910.619620.584950.609040.60940
MedMean0.030050.029880.035600.032890.032520.03247
SE0.222680.222800.222700.217880.221740.22169
CI.L−0.41365−0.41014−0.39257−0.40485−0.39461−0.39340
CI.U0.450680.448560.461570.447090.450150.45006
LrgMean0.037130.036950.043280.042740.039570.03952
SE0.140950.141450.141930.140640.140530.14055
CI.L−0.23866−0.25076−0.23560−0.22885−0.23287−0.23235
CI.U0.316600.315280.327640.325410.322080.32310
2020SmlMean−0.05207−0.05195−0.04961−0.04693−0.04882−0.04900
SE0.326790.326860.324990.311550.325590.32565
CI.L−0.75252−0.75609−0.72875−0.71641−0.73368−0.73715
CI.U0.626210.617030.589050.553510.606400.61026
MedMean−0.05094−0.05105−0.04930−0.04889−0.04853−0.04860
SE0.231190.231570.233000.226110.230530.23046
CI.L−0.49147−0.49482−0.50387−0.47786−0.48809−0.48786
CI.U0.395350.400190.402790.405220.398010.39755
LrgMean−0.06173−0.06173−0.06056−0.05789−0.05855−0.05867
SE0.157460.157780.157260.154850.156310.15630
CI.L−0.40157−0.40060−0.39700−0.39529−0.38936−0.38874
CI.U0.251950.255430.244740.236980.241690.23980
Table 2. The mean, standard error (SE), and 95% confidence interval lower and upper limits (CI.L, CI.U) of the estimates of  β m a k e E u r o p e a n  under various imputation methods, with bootstrap samples of size 1000 (Sml), 2000 (Med), and 4000 (Lrg).
Methods
Year Size Stat Mean.I IPW MI K.MI DB.MI HC.MI
2017SmlMean−0.02011−0.032890.311720.700790.177390.17519
SE1.804281.864201.147480.561351.466581.46665
CL.L−8.06343−8.22431−1.48400−0.39785−1.84854−1.84319
CL.U1.428671.539361.545891.629221.494231.48724
MedMean0.367580.386030.490930.672420.443690.44238
SE0.531640.596590.433070.378900.430060.43005
CL.L−0.64688−0.86281−0.41189−0.12741−0.46466−0.48352
CL.U1.203801.256911.254471.348701.172801.17846
LrgMean0.385320.416060.487530.602360.456730.45434
SE0.299400.342180.288160.277330.285290.28575
CL.L−0.24345−0.32304−0.13967−0.01447−0.15739−0.15425
CL.U0.907931.038410.979011.085620.956910.95491
2018SmlMean0.166470.150270.354150.629880.256790.25582
SE1.343321.377010.927300.582641.137521.13778
CL.L−1.36720−1.77539−1.06060−0.48064−1.17999−1.22472
CL.U1.388711.412561.416761.529521.364571.36863
MedMean0.366990.362440.457300.606850.414070.41211
SE0.415620.434500.401770.365900.397310.39717
CL.L−0.47720−0.51849−0.41055−0.11368−0.42388−0.41376
CL.U1.071511.076041.158271.293201.108381.10896
LrgMean0.374970.372190.456420.545080.418740.41656
SE0.288110.303150.277930.268830.270540.27107
CL.L−0.24697−0.29888−0.16556−0.04139−0.18245−0.18405
CL.U0.862090.896250.948451.018560.898490.90448
2019SmlMean0.127990.147510.310030.597990.236110.23435
SE1.124831.140240.897270.502080.978980.97823
CL.L−1.28459−1.37239−1.07873−0.47576−1.17632−1.18715
CL.U1.241071.270651.300101.432911.245041.24685
MedMean0.253130.271370.367030.517530.330180.32845
SE0.406230.412120.370210.346870.369150.36883
CL.L−0.62547−0.55190−0.41421−0.19964−0.43137−0.42785
CL.U0.934570.949051.016821.141740.973960.97342
LrgMean0.306750.326080.414380.503070.382520.37995
SE0.246170.249040.239750.231260.232440.23235
CL.L−0.21820−0.18353−0.076940.02680−0.08056−0.08262
CL.U0.758150.799000.855560.903540.799910.80472
2020SmlMean0.144600.162180.365510.640160.287100.28645
SE1.076761.087710.732920.562550.825080.82700
CL.L−1.23825−1.31103−0.84157−0.40260−0.96034−1.03690
CL.U1.317531.313341.371221.516301.272461.29429
MedMean0.276790.298280.403390.578730.368090.36792
SE0.412220.418260.385540.361510.378470.37922
CL.L−0.61672−0.55402−0.40512−0.14819−0.46587−0.47197
CL.U0.970360.998961.035701.205281.026711.01124
LrgMean0.301380.321740.414950.529140.388130.38716
SE0.273870.277110.255970.252600.249960.24982
CL.L−0.28459−0.27273−0.146510.01791−0.11541−0.12547
CL.U0.783580.811760.889050.981160.839440.83161