Article

A Copula-Based Joint Modeling Framework for Hospitalization Costs and Length of Stay in Massive Healthcare Data

by Xuan Xu 1 and Yijun Wang 2,*
1 Asset and Laboratory Management Office, Zhejiang University of Finance and Economics, Hangzhou 310018, China
2 School of Statistics and Data Science, Zhejiang Gongshang University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Systems 2026, 14(2), 226; https://doi.org/10.3390/systems14020226
Submission received: 29 January 2026 / Revised: 11 February 2026 / Accepted: 19 February 2026 / Published: 23 February 2026

Abstract

In large-scale medical data, the connection between hospital length of stay and medical expenses shows a complex and nonlinear relationship instead of a straightforward positive link. This study proposes a Cox–Log-Logistic–Copula joint modeling framework to describe the marginal distributions and latent dependence between the two variables. Specifically, a semi-parametric Cox proportional hazards model is used for hospitalization duration, while a Log-Logistic model handles medical costs. The two margins are flexibly coupled through a Copula function to capture dynamic variations in cost levels during different hospitalization stages. To address computational challenges in large datasets, this study also includes subsample correction and one-step adjustment algorithms, combined with parallel computing strategies, to enhance estimation efficiency and accuracy. Empirical results show that the length of hospital stays and medical costs are not always positively related. In some cases, higher medical expenses occur during shorter stays, suggesting possible over-treatment or uneven resource distribution. The proposed framework proves to have strong explanatory power in identifying nonlinear patterns in healthcare behavior and offers a new quantitative tool for optimizing medical resource allocation and controlling costs.

1. Introduction

Medical expenses have grown rapidly over the past few decades and continue to pose major challenges for healthcare systems worldwide. In both research and policy discussions, hospital length of stay is often seen as a key driver of inpatient medical costs [1,2], since it is usually taken as an indicator of disease severity, treatment complexity, and the need for follow-up outpatient care. For example, John et al. (2003) [3] highlighted how the length of stay in the intensive care unit (ICU) can influence total medical expenses; Angus et al. (1996) [4] pointed out that patients with prolonged hospital stays often require more intensive inpatient services, as well as additional outpatient follow-up, medication, or rehabilitation after discharge, leading to higher overall healthcare costs; and Haidar et al. (2023) [5] found that complications or treatment adjustments during hospitalization may further increase expenditures. For this reason, limiting length of stay is often pursued as a means of controlling medical costs.
At the same time, the evidence shows that the relationship between length of stay and medical expenditures is shaped by substantial patient heterogeneity. Existing studies have shown that length of stay not only correlates with medical costs, but also varies systematically with individual characteristics such as gender and comorbidities. For example, Hawkes et al. (2006) [6] document significant differences in hospital stay duration across gender and comorbidity status among elderly patients with hip fractures. These findings indicate that the cost implications of hospitalization duration cannot be fully understood without accounting for underlying clinical and demographic differences. Apart from individual differences, the link between length of stay and medical expenses is also shaped by how healthcare facilities are organized and how payment systems operate. If hospital stays are too brief, treatment effectiveness may suffer, potentially leading to higher costs later on [7]. In particular, spending is increasingly influenced by diagnosis-related group (DRG)-based payment schemes, insurance reimbursement ceilings, and standardized clinical pathways, which may weaken the direct association between hospitalization duration and total costs [8,9]. Under such systems, patients with similar lengths of stay may incur vastly different medical expenditures due to differences in treatment intensity, disease severity, and insurance coverage [10]. In some cases, patients with relatively short hospital stays may incur high costs due to intensive diagnostic procedures or surgical interventions, whereas longer stays may not necessarily translate into proportionally higher expenditures [11].
As a result, existing findings on the relationship between hospital length of stay and medical expenditures remain mixed. Most studies focus on marginal effects, using regression methods to explore how length of stay affects medical expenses [12,13]. While such analyses provide useful insights, they offer limited information on the joint behavior of hospitalization duration and medical expenditures. Several studies report that the explanatory power of length of stay diminishes substantially once disease severity, treatment intensity, and insurance arrangements are taken into account [14]. This raises two questions: how much does length of stay really influence medical costs, and why do research results vary so widely? One possible reason for the mixed results lies in the modeling strategies used in previous research. Many studies treat medical expenses as the main outcome and hospital length of stay as just an explanatory variable, which implicitly assumes that the two can be modeled separately. Other studies look only at the determinants of length of stay, without considering medical costs as part of the same picture. Both approaches overlook the fact that length of stay and medical expenses can be interconnected and influenced by common unobserved factors such as disease severity, treatment choices, and hospital policies. When this dependence is ignored, simple models may not fully reflect the real relationship between length of stay and medical costs, especially when there is substantial variation among patients.
Motivated by the above considerations, this study aims to re-examine the relationship between hospital length of stay and medical expenditures in large-scale healthcare data. Rather than modeling these two outcomes separately, we employ a joint modeling framework to investigate whether hospital length of stay continues to be a predominant driver of medical expenditures after adequately accounting for heterogeneity, nonlinear effects, and complex dependence structures.
Copula-based methods provide a natural and powerful tool for joint modeling, as they allow the marginal distributions of multiple outcomes to be specified separately while explicitly characterizing their dependence structure (Sklar, 1959) [15]. In recent years, health economics has increasingly embraced copula-based models as a flexible framework for analyzing complex relationships between healthcare outcomes. A key advantage of these approaches lies in their ability to capture dependence arising from shared but partially unobserved factors, such as disease severity, treatment intensity, and institutional practices. Early studies demonstrate the usefulness of copulas in addressing selectivity bias, simultaneity, and joint determination in healthcare utilization and health outcomes (e.g., Smith, 2003 [16]; Zimmer and Trivedi, 2006 [17]; Dancer et al., 2008 [18]; Quinn, 2007 [19]). More recently, Rezaei Gharroodi et al. (2019) [20] develop a Gaussian copula–based regression framework to jointly model trivariate mixed outcomes and apply the approach to household health services utilization data, capturing dependence among discrete, continuous, and censored responses. These applications illustrate the flexibility of copula-based joint modeling in complex healthcare settings. Length of stay and medical costs are inherently interconnected and jointly influenced by unobserved clinical and institutional factors, rendering separate modeling strategies potentially misleading. By allowing the marginal distributions of length of stay and expenditures to be specified independently while explicitly modeling their dependence, copula-based approaches facilitate a more nuanced examination of their joint behavior, even in the presence of censoring, heavy-tailed cost distributions, and nonlinear effects.
However, most existing copula-based tools in health economics have been applied to smaller, more manageable data sets or to specific outcomes. They often fall short when handling the complexities of large-scale administrative inpatient data, which include huge sample sizes, diverse patient characteristics, censored stay lengths, and highly skewed medical costs [21]. These challenges demand models that are both flexible and computationally efficient. We therefore develop a copula-based joint modeling framework specifically designed for large healthcare datasets, which allows the relationship between hospital stay durations and medical expenses to be examined while keeping the computations practical.
In summary, this study revisits the relationship between hospital length of stay and medical expenditures by adopting a joint modeling framework that explicitly accounts for complex marginal behaviors, dependence structures, and the large-scale nature of modern healthcare data. Rather than modeling hospitalization duration and medical costs separately, the proposed approach allows the two outcomes to be analyzed within a unified framework, enabling a more comprehensive assessment of their interdependence at the individual level. Importantly, the framework is designed to remain computationally feasible in settings characterized by massive sample sizes and rich covariate information, thereby facilitating reliable inference without sacrificing model flexibility. Through simulation studies and an empirical application based on large administrative healthcare data, this study illustrates how joint modeling can provide deeper insights into the interplay between hospitalization duration and medical expenditures.
The remainder of this paper is organized as follows: Section 2 and Section 3 introduce the model and the estimation strategy, respectively. Section 4 presents the results of simulation studies. Section 5 reports an empirical application using real-world healthcare data. Section 6 concludes with a discussion of the main findings and directions for future research.

2. Model Specification

2.1. Modeling Hospital Length of Stay

In real-world healthcare data, hospitalization duration is often subject to right censoring, as a non-negligible proportion of patients remain hospitalized at the end of the observation period [22]. Moreover, length of stay reflects a dynamic discharge process influenced by clinical progression, treatment decisions, and institutional practices, rather than a static outcome observed uniformly across patients. Prior studies have shown that the Cox proportional hazards model performs well in analyzing such censored time-to-event data [23]. Moreover, research on medical claims has indicated that length of stay is closely related to factors such as gender, disease severity, and type of surgery [11]. Accordingly, this study employs the Cox proportional hazards regression model to analyze hospital length of stay, fully leveraging the available covariate information to more accurately characterize its marginal distribution.
Let $T_i$ be the length of hospital stay for patient $i$, $i = 1, \dots, n$, and let $X_i = (X_{i1}, X_{i2}, \dots, X_{ip})^T$ represent factors such as gender and disease severity that influence the stay duration. Denote by $C_i$ the censoring time; right censoring is common in medical studies, and in hospital stay data in particular. The observed time is given by $\tilde{T}_i = \min(T_i, C_i)$, with $\Delta_i = I(T_i \le C_i)$, where $I(\cdot)$ is the indicator function. This study assumes that, conditional on the covariates $X_i$, the event time $T_i$ and the censoring time $C_i$ are independent. Under this assumption, the Cox proportional hazards model is employed to model hospital length of stay. Following Cox (1972) [24], the hazard function is specified as
$$h(t \mid X) = h_0(t) \exp(X^T \beta),$$
where $\beta$ is a $p$-dimensional parameter vector and $h_0(t)$ denotes the baseline hazard function. The corresponding conditional survival function is given by
$$S_i(t \mid X_i) = \exp(-H_i(t \mid X_i)) = \exp\left(-H_0(t)\, e^{X_i^T \beta}\right),$$
where $H_0(t)$ is the cumulative baseline hazard function. Consequently, the conditional distribution function of the hospital length of stay $T_i$ can be expressed as
$$F_{T \mid X}(t \mid x) = P(T \le t \mid X = x) = 1 - \exp\left(-H_0(t)\, e^{x^T \beta}\right).$$
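As a minimal illustration of the conditional distribution function above, the following Python sketch evaluates $F_{T \mid X}(t \mid x)$ for a given coefficient vector and cumulative baseline hazard. The function names, the coefficient values, and the linear baseline $H_0(t) = 0.1t$ are assumptions made only for this example; the paper leaves $H_0$ unspecified and estimates it from data.

```python
import math

def cox_conditional_cdf(t, x, beta, cum_baseline_hazard):
    """F(t | x) = 1 - exp(-H0(t) * exp(x' beta)) under a Cox model."""
    linear_predictor = sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 - math.exp(-cum_baseline_hazard(t) * math.exp(linear_predictor))

# Hypothetical baseline: constant hazard of 0.1 per day, so H0(t) = 0.1 * t.
def example_baseline(t):
    return 0.1 * t

# Probability of discharge within 7 days for two illustrative covariate profiles.
example_beta = [0.5, -0.3]  # illustrative values, not estimates from the paper
p_low = cox_conditional_cdf(7.0, [0.0, 0.0], example_beta, example_baseline)
p_high = cox_conditional_cdf(7.0, [1.0, 0.0], example_beta, example_baseline)
```

A larger linear predictor raises the discharge hazard, so `p_high` exceeds `p_low` at every horizon.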

2.2. Modeling Medical Expenditures

Research on modeling medical expenditures covers many aspects, such as efficient resource allocation, payment system reform, and cost prediction for specific diseases. Among these, estimating total medical expenses for individual patients is especially important, because it plays a key role in developing effective cost-control strategies [25]. However, medical expenditure data typically deviate substantially from the normal distribution and exhibit pronounced right skewness, whereby a small proportion of patients account for a large portion of total medical costs. This makes modeling these costs challenging. Several methods, such as logarithmic transformations, have been used to reduce skewness and support linear modeling. These approaches have limitations: while a log transformation sometimes helps normalize data, it can also worsen skewness if the original distribution is very different from normal [26]. Recently, more flexible parametric models such as the Gamma, Pareto, and Weibull distributions have gained popularity [27]. These models are better suited to capturing the skewness and heavy tails often seen in medical expenditure data.
Building on the findings of Chou et al. (2013) [28], heavy-tailed distribution families have been shown to perform particularly well in modeling medical expenditures. In particular, when medical expenditures are assumed to follow a Log-Logistic distribution, parameter estimates tend to exhibit smaller standard errors, and the fitted cumulative distribution function aligns closely with the empirical distribution—especially in the estimation of tail probabilities—demonstrating superior goodness of fit. Accordingly, this study adopts the Log-Logistic distribution to model medical expenditure data. On the one hand, the Log-Logistic distribution is parsimonious and capable of capturing skewness and heavy tails with only a small number of parameters, leading to high estimation efficiency in large-scale datasets. On the other hand, its heavy-tailed property allows for robust accommodation of extreme values commonly observed in massive healthcare data. Furthermore, individual medical expenditures are influenced by multiple covariates, such as income level, regional disparities, insurance type, and disease severity, while some variables (e.g., age) may exhibit nonlinear effects on expenditures [29]. Therefore, this study incorporates covariate information into the modeling framework by extending the scale parameter of the Log-Logistic distribution, thereby enhancing its suitability for large-scale medical expenditure data and enabling a more accurate representation of expenditure heterogeneity and structural characteristics.
Thus, let us assume that medical expenditure $Y$ follows a Log-Logistic distribution, whose probability density function is given by
$$f(y \mid \alpha, \theta) = \frac{(\theta/\alpha)\,(y/\alpha)^{\theta-1}}{\left[1 + (y/\alpha)^{\theta}\right]^{2}}, \quad y > 0,$$
where $\alpha > 0$ is the scale parameter and $\theta > 0$ is the shape parameter.
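The density above, together with its distribution and quantile functions, can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the example values `alpha = 5000`, `theta = 1.8` are assumptions, not quantities taken from the paper's data.

```python
def loglogistic_pdf(y, alpha, theta):
    """Density (theta/alpha) (y/alpha)^(theta-1) / [1 + (y/alpha)^theta]^2 for y > 0."""
    if y <= 0:
        return 0.0
    r = y / alpha
    return (theta / alpha) * r ** (theta - 1) / (1.0 + r ** theta) ** 2

def loglogistic_cdf(y, alpha, theta):
    """Distribution function y^theta / (alpha^theta + y^theta)."""
    if y <= 0:
        return 0.0
    r = y / alpha
    return r ** theta / (1.0 + r ** theta)

def loglogistic_quantile(u, alpha, theta):
    """Inverse of the distribution function for 0 < u < 1."""
    return alpha * (u / (1.0 - u)) ** (1.0 / theta)

# Illustrative values: the scale alpha is the median of the distribution.
median_cost = loglogistic_quantile(0.5, 5000.0, 1.8)
tail_prob = 1.0 - loglogistic_cdf(50000.0, 5000.0, 1.8)  # P(Y > 10 * alpha)
```

Because the distribution function equals 0.5 at $y = \alpha$, covariate effects on $\log \alpha$ can be read as effects on the log of the median expenditure.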
In addition, several covariates such as income, region, and type of health insurance are key factors influencing medical expenditures. To enhance the interpretability and flexibility of the model, a logarithmic link function is adopted, with log ( α i ) used to characterize the relationship between covariates and mean medical expenditures. This specification not only ensures the positivity of the scale parameter but also provides an intuitive interpretation of the direction and magnitude of covariate effects on medical expenditures. It is worth noting, however, that some covariates (e.g., age) may exhibit nonlinear relationships with medical expenditures [29]. Traditional regression models typically assume a linear relationship between covariates and the transformed mean function, which makes it difficult to accurately capture complex nonlinear effects and thus limits model flexibility. To address this issue, we follow the modeling framework proposed by Chen (2013) [29] and specify
$$\log \alpha_i = w_i^T \gamma + f_1(z_{i1}) + f_2(z_{i2}) + \cdots + f_k(z_{ik}),$$
where $w_i = (w_{i1}, w_{i2}, \dots, w_{iq})^T$ denotes the set of covariates linearly associated with medical expenditures for individual $i$, $\gamma$ is a $q$-dimensional vector of parameters to be estimated, and $(z_{i1}, z_{i2}, \dots, z_{ik})$ represent covariates that have nonlinear effects on medical expenditures. This formulation allows linear and nonlinear covariate effects to be modeled simultaneously, substantially improving the adaptability and interpretability of the model in the analysis of medical expenditure data.

2.3. Copula-Based Dependence Structure

While modeling medical expenditures and hospital length of stay separately allows their marginal behaviors to be appropriately characterized, existing research has increasingly recognized that these two outcomes are jointly determined at the individual level. In practice, length of stay and medical expenditures are realized simultaneously for the same patient and may be influenced by shared but unobserved factors such as disease severity, treatment intensity, and institutional practices. Accordingly, several studies have adopted joint modeling frameworks to explicitly account for the dependence between hospitalization duration and costs. For example, Tang, Luo, and Gardiner (2016) [30] develop a bivariate random-effects copula model to jointly analyze length of stay and medical expenditures, demonstrating the importance of explicitly modeling their dependence structure.
Despite these methodological advances, joint modeling of length of stay and medical expenditures remains challenging in large-scale administrative healthcare data. In particular, the two outcomes exhibit markedly different marginal characteristics: length of stay is naturally a time-to-event outcome subject to censoring, whereas medical expenditures are continuous, highly right-skewed, and often heavy-tailed, with potentially nonlinear covariate effects. In such settings, reliance on restrictive joint distributional assumptions may lead to model misspecification and obscure the dependence structure of substantive interest. To address these issues, this study adopts a copula-based joint modeling framework that decouples marginal specifications from the dependence structure, allowing each outcome to be modeled with an appropriate marginal distribution while explicitly characterizing their dependence.
Within copula-based joint modeling, existing approaches can be broadly classified into three categories. The first consists of fully parametric models, which assume specific Copula families (e.g., Clayton, Gumbel, Gaussian) to characterize joint dependence [31,32]. These models offer computational simplicity and clear interpretation but may suffer from bias if the assumed functional form is misspecified. The second category comprises nonparametric models, which avoid restrictive functional assumptions but typically require large sample sizes and substantial computational resources, rendering them less suitable for massive datasets [33,34]. The third category includes semiparametric models, which strike a balance between flexibility and stability by adopting parametric specifications for either the marginal distributions or the dependence structure while modeling the remaining components nonparametrically [35]. Compared with the first two approaches, semiparametric copula models preserve the interpretability of marginal distributions while allowing sufficient flexibility to capture complex dependence patterns. This balance is particularly important in large-scale healthcare data settings. Accordingly, this study employs a semiparametric copula framework, combining marginal estimates of length of stay and medical expenditures to analyze their dependence structure.
Specifically, the Copula function is employed to link the conditional distribution functions $F_{T \mid X}(t \mid x)$ and $F_{Y \mid W,Z}(y \mid w, z)$, such that
$$P(T \le t,\, Y \le y \mid X, W, Z) = C_{\gamma}\left(F_{T \mid X}(t \mid x),\, F_{Y \mid W,Z}(y \mid w, z)\right),$$
where $C_{\gamma}(\cdot)$ denotes a Copula function with parameter vector $\gamma$. Under this framework, the joint distribution of hospital length of stay and medical expenditures is constructed by separately modeling their marginal distributions and their dependence structure. This separation preserves the interpretability of the marginal models while providing substantial flexibility in capturing complex dependence patterns.
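To make the construction concrete, the sketch below couples two conditional marginal distribution functions through a Clayton copula, one of the parametric families mentioned in this section. The Clayton choice, the dependence value `dep = 2.0`, and both example margins are assumptions for illustration; they are not the fitted model.

```python
import math

def clayton_copula(u, v, dep):
    """Clayton copula C(u, v) = (u^-dep + v^-dep - 1)^(-1/dep), dep > 0."""
    if u <= 0.0 or v <= 0.0:
        return 0.0
    return (u ** -dep + v ** -dep - 1.0) ** (-1.0 / dep)

def joint_cdf(t, y, marginal_t, marginal_y, dep):
    """P(T <= t, Y <= y | covariates) built from two conditional marginal CDFs."""
    return clayton_copula(marginal_t(t), marginal_y(y), dep)

# Hypothetical margins for one patient: exponential length of stay,
# log-logistic cost with scale 5000 and shape 1.8.
def los_cdf(t):
    return 1.0 - math.exp(-0.1 * t)

def cost_cdf(y):
    return y ** 1.8 / (5000.0 ** 1.8 + y ** 1.8)

p_joint = joint_cdf(7.0, 8000.0, los_cdf, cost_cdf, dep=2.0)
```

Swapping `clayton_copula` for another family changes only the dependence structure; the two margins are left untouched, which is the separation the text describes.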

3. Methods

Once the joint modeling framework is established, the primary methodological challenge concerns the estimation of model parameters and functional components. Because the proposed model contains both parametric and nonparametric elements, different estimation strategies are required for different parts of the model. For the Cox proportional hazards model, the regression parameter vector $\beta$ is typically estimated by maximizing the partial likelihood. The corresponding partial likelihood function is given by
$$L(\beta) = \prod_{i=1}^{N} \left[ \frac{\exp(X_i^T \beta)}{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)} \right]^{\delta_i},$$
where $\delta_i$ denotes the event indicator. Parameter estimation is commonly carried out via the Newton–Raphson iterative algorithm, with the updating scheme given by $\beta^{(k+1)} = \beta^{(k)} + I(\beta^{(k)})^{-1} U(\beta^{(k)})$, where
$$U(\beta) = \sum_{i=1}^{N} \delta_i \left[ X_i - \frac{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)\, X_j}{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)} \right],$$
$$I(\beta) = \sum_{i=1}^{N} \delta_i \left\{ \frac{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)\, X_j X_j^T}{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)} - \left[ \frac{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)\, X_j}{\sum_{j=1}^{N} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \beta)} \right]^{\otimes 2} \right\},$$
and here, $a^{\otimes 2} = a a^T$.
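The Newton–Raphson iteration for the partial likelihood can be sketched as follows. To keep the example free of matrix algebra it assumes a single scalar covariate and no tied event times, and it uses a naive $O(N^2)$ risk-set loop; the toy data are invented for illustration. It is a sketch of the standard full-sample iteration, not of the large-scale procedure the paper adopts.

```python
import math

def cox_score_and_information(beta, times, events, x):
    """Score U(beta) and information I(beta) of the Cox partial likelihood
    for one scalar covariate (no tied event times assumed)."""
    score, info = 0.0, 0.0
    for i in range(len(times)):
        if events[i] == 0:
            continue  # censored observations contribute only through risk sets
        s0 = s1 = s2 = 0.0
        for j in range(len(times)):
            if times[j] >= times[i]:  # subject j is still at risk at time T_i
                w = math.exp(beta * x[j])
                s0 += w
                s1 += w * x[j]
                s2 += w * x[j] * x[j]
        mean = s1 / s0
        score += x[i] - mean
        info += s2 / s0 - mean * mean
    return score, info

def fit_cox_newton(times, events, x, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson iterations beta <- beta + U(beta) / I(beta)."""
    for _ in range(max_iter):
        score, info = cox_score_and_information(beta, times, events, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta

# Toy data (invented): observed times, event indicators, one binary covariate.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
events = [1, 1, 0, 1, 1, 1]
x = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
beta_hat = fit_cox_newton(times, events, x)
```

Each iteration scans every risk set, which is the cost that becomes prohibitive at the sample sizes discussed next.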
In the context of massive datasets, directly applying the Newton–Raphson algorithm for iterative estimation often becomes computationally prohibitive due to excessive memory requirements, high time complexity, and substantial computational cost. Existing extensions such as the Lasso-Cox [36] and Elastic Net-Cox [37] models perform well in moderate-sized samples but continue to face computational bottlenecks in massive data settings.
To alleviate these issues, this study adopts the Cox–OSES approach [38], which reduces computational burden through information matrix approximation and subsample-based correction. Specifically, the full sample of size $N$ is first partitioned into two disjoint subsets, denoted by $J_1$ and $J_2$, where the size $n_1$ of $J_1$ satisfies $n_1 \ll N$. Based on the subsample $J_1$, an initial estimator $\hat{\beta}_1$ is obtained by maximizing the Cox partial likelihood.
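Subsequently, a one-shot update is performed by incorporating information from the remaining subset $J_2$ to yield a bias-corrected estimator. This procedure allows the model to exploit gradient information from the full dataset while substantially reducing computational cost. In particular, the information derived from subset $J_1$ is used to compute the efficient score contribution for observations in $J_2$, given by $\sum_{i \in J_2} S_{O_i}(\hat{\beta}_1, \hat{H}_1)$. The efficient score function is defined as
$$S_O(\hat{\beta}_1, \hat{H}_1) = \int_0^{\infty} \left\{ X - E(X \mid \tilde{T} \ge t) \right\} dM(t),$$
where $M(t) = N(t) - \int_0^t I(\tilde{T} \ge s) \exp(X^T \hat{\beta}_1)\, d\hat{H}_1(s)$, $N(t) = \Delta\, I(\tilde{T} \le t)$, $\hat{H}_1(t) = \sum_{i \in J_1} \frac{\Delta_i\, I(\tilde{T}_i \le t)}{\sum_{j \in J_1} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \hat{\beta}_1)}$, and the conditional expectation is estimated from $J_1$ by $\hat{E}(t) = \frac{\sum_{j \in J_1} I(\tilde{T}_j \ge t) \exp(X_j^T \hat{\beta}_1)\, X_j}{\sum_{j \in J_1} I(\tilde{T}_j \ge t) \exp(X_j^T \hat{\beta}_1)}$.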
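For computation, the integral in the efficient score reduces to a finite sum, because $\hat{H}_1$ is a step function that jumps only at the uncensored times in $J_1$. The worked form below is a sketch under two assumptions that the text does not state: there are no tied event times, and the $J_1$ risk set is non-empty at every time where $\hat{E}$ is evaluated.

```latex
S_{O_i}(\hat\beta_1,\hat H_1)
  = \Delta_i\bigl\{X_i - \hat E(\tilde T_i)\bigr\}
  - \sum_{k \in J_1:\ \Delta_k = 1,\ \tilde T_k \le \tilde T_i}
      \bigl\{X_i - \hat E(\tilde T_k)\bigr\}\,
      \exp(X_i^T\hat\beta_1)\,\Delta\hat H_1(\tilde T_k),
\qquad
\Delta\hat H_1(\tilde T_k)
  = \frac{1}{\sum_{j \in J_1} I(\tilde T_j \ge \tilde T_k)\exp(X_j^T\hat\beta_1)} .
```

The first term comes from the counting-process part $dN_i(t)$, which jumps only at $\tilde{T}_i$ when $\Delta_i = 1$; the second collects the compensator increments over the $J_1$ event times at which patient $i$ is still at risk.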
Using the one-shot updating scheme, the bias-corrected estimator of the regression parameter is obtained as $\hat{\beta}_f = \hat{\beta}_1 + \left( n \hat{I}^{(1)} / n_1 \right)^{-1} \sum_{i \in J_2} S_{O_i}(\hat{\beta}_1, \hat{H}_1)$, where $\hat{I}^{(1)}$ denotes the information matrix estimated from the first subsample. Next, the baseline cumulative hazard function $H_0(t)$ is estimated.
In large-scale datasets, direct full-sample estimation can be computationally expensive. Therefore, a divide-and-conquer strategy is adopted: estimation is carried out independently within each data block, and the baseline cumulative hazard $H_0(t)$ is obtained using the Breslow estimator, given by
$$\hat{H}_0(t) = \int_0^t \sum_{i=1}^{n} \frac{dN_i(s)}{\sum_{j=1}^{n} I(\tilde{T}_j \ge \tilde{T}_i) \exp(X_j^T \hat{\beta})}.$$
The block-specific estimates are then aggregated and combined through interpolation. Finally, the conditional distribution function of the hospital length of stay, $F_{T \mid X}(t \mid x) = P(T \le t \mid X = x) = 1 - \exp\left(-H_0(t)\, e^{x^T \beta}\right)$, is estimated by plugging in $\hat{H}_0$ and $\hat{\beta}$, yielding the estimator $\hat{F}_T(t \mid X)$.
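The Breslow estimator and the resulting plug-in distribution function can be sketched as below for a single data block. The sketch assumes one scalar covariate and no tied event times, and it omits the block-wise aggregation and interpolation step, whose details are not given in the text.

```python
import math

def breslow_cumulative_hazard(times, events, x, beta):
    """Return a function t -> H0_hat(t) (Breslow estimator, one scalar covariate)."""
    jumps = []  # (event time, increment of the cumulative baseline hazard)
    for i in range(len(times)):
        if events[i] == 1:
            at_risk = sum(math.exp(beta * x[j])
                          for j in range(len(times)) if times[j] >= times[i])
            jumps.append((times[i], 1.0 / at_risk))
    jumps.sort()

    def cumulative_hazard(t):
        # Sum all jumps that occurred at or before time t.
        return sum(inc for time, inc in jumps if time <= t)

    return cumulative_hazard

def conditional_cdf(t, x_new, beta, cumulative_hazard):
    """F_hat(t | x) = 1 - exp(-H0_hat(t) * exp(x * beta))."""
    return 1.0 - math.exp(-cumulative_hazard(t) * math.exp(beta * x_new))

# With beta = 0 the estimator reduces to the Nelson-Aalen estimator.
H0_hat = breslow_cumulative_hazard([1.0, 2.0, 3.0], [1, 1, 1], [0.0, 0.0, 0.0], 0.0)
F_hat = conditional_cdf(2.5, 0.0, 0.0, H0_hat)
```

In the toy call all three patients are uncensored, so the jumps are 1/3, 1/2, and 1 at times 1, 2, and 3.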
In the estimation of the medical expenditure distribution, the scale parameter $\alpha_i$ is assumed to depend on observed covariates, as well as several unknown nonparametric functions. Because the functional forms of these nonparametric components are unspecified, this study adopts the P-spline approach [39] to parameterize each function $f_j(\cdot)$, that is,
$$f_j(t) = \eta_{j1} t + \cdots + \eta_{jq} t^{q} + \sum_{k=1}^{K_j} \eta_{j,q+k} (t - h_{jk})_+^{q},$$
where $\{h_{jk}\}_{k=1}^{K_j}$ denotes the spline knots, $(a)_+ = \max(a, 0)$, and $B_j(t) = \left(t, \dots, t^{q}, (t - h_{j1})_+^{q}, \dots, (t - h_{jK_j})_+^{q}\right)^T$ is a truncated power basis of degree $q$. In this study, the degree is set to $q = 2$. Let $\eta_j = (\eta_{j1}, \dots, \eta_{j,q+K_j})^T$; the log-scale parameter can then be expressed as $\log \alpha_i = w_i^T \gamma + B_1(z_{i1})^T \eta_1 + \cdots + B_k(z_{ik})^T \eta_k$.
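A truncated power basis of this form is straightforward to build. The sketch below constructs one basis row without an intercept and evaluates a spline from its coefficients; the knot locations and coefficient values are made up for illustration.

```python
def truncated_power_basis(t, knots, degree=2):
    """Row (t, ..., t^q, (t-h_1)_+^q, ..., (t-h_K)_+^q) without an intercept."""
    polynomial = [t ** d for d in range(1, degree + 1)]
    truncated = [max(t - h, 0.0) ** degree for h in knots]
    return polynomial + truncated

def spline_value(t, coefficients, knots, degree=2):
    """Evaluate f_j(t) as the inner product of the basis row with eta_j."""
    basis = truncated_power_basis(t, knots, degree)
    return sum(b * c for b, c in zip(basis, coefficients))

# Quadratic basis with two illustrative knots at 1 and 2, evaluated at t = 3.
row = truncated_power_basis(3.0, knots=[1.0, 2.0])
value = spline_value(3.0, [1.0, 0.0, 0.0, -1.0], knots=[1.0, 2.0])
```

Stacking such rows for every nonlinear covariate, next to the linear covariates, gives the design row used in the likelihood.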
Define $U_i = \left(w_i^T, B_1(z_{i1})^T, \dots, B_k(z_{ik})^T\right)$ and $\xi = \left(\gamma^T, \eta_1^T, \dots, \eta_k^T\right)^T$, so that $\log \alpha_i = U_i \xi$. It is worth noting that, to avoid identifiability issues, intercept terms are excluded from the P-spline representations of the nonparametric functions. Under this specification, the nonparametric components can be expressed using a finite-dimensional parameter vector and thus incorporated into a likelihood-based estimation framework. With respect to the distributional assumption, medical expenditure $Y$ is assumed to follow a Log-Logistic distribution with probability density function
$$f(y_i \mid \alpha_i, \theta) = \frac{(\theta/\alpha_i)\,(y_i/\alpha_i)^{\theta-1}}{\left[1 + (y_i/\alpha_i)^{\theta}\right]^{2}}.$$
Accordingly, the likelihood function is given by
$$L(\xi, \theta) = \prod_{i=1}^{n} f(y_i \mid \alpha_i, \theta) = \prod_{i=1}^{n} \frac{(\theta/\alpha_i)\,(y_i/\alpha_i)^{\theta-1}}{\left[1 + (y_i/\alpha_i)^{\theta}\right]^{2}}.$$
Substituting $\alpha_i = \exp(U_i \xi)$ into the likelihood function and taking logarithms yields the log-likelihood
$$l(\xi, \theta) = \sum_{i=1}^{n} \left\{ \log \theta - U_i \xi + (\theta - 1)(\log y_i - U_i \xi) - 2 \log\left[ 1 + \left( y_i e^{-U_i \xi} \right)^{\theta} \right] \right\}.$$
To prevent overfitting of the spline functions, a second-order difference penalty is imposed on the spline coefficients, resulting in the penalized log-likelihood $l_p(\xi, \theta; \lambda) = l(\xi, \theta) - \frac{\lambda}{2} \| D \xi \|^2$, where $D$ is a penalty matrix applied only to the truncated spline coefficients, and $\lambda > 0$ is a smoothing parameter.
Parameter estimation is carried out using a two-step procedure. In the first step, $\theta$ is fixed and $\xi$ is estimated. Given $\theta$, the parameter vector $\xi$ is updated iteratively using the Newton–Raphson algorithm, $\xi_{new} = \xi_{old} + H^{-1} g$, where the gradient vector $g$ and Hessian matrix $H$ are given by $g = U^T \theta(-1 + 2s) - \lambda D \xi$ and $H = U^T W U + \lambda D$, with $s = (s_1, \dots, s_n)^T$, $s_i = \frac{r_i^{\theta}}{1 + r_i^{\theta}}$, and $r_i = y_i e^{-U_i \xi}$.
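The text does not define the weight matrix $W$. Differentiating the unpenalized log-likelihood contribution with respect to the linear predictor $\eta_i = U_i \xi$ gives the following, which is offered as a consistency check on the gradient and as one choice of $W$ that matches the update $\xi_{new} = \xi_{old} + H^{-1} g$, not as the authors' stated definition.

```latex
\begin{aligned}
\ell_i &= \log\theta - \eta_i + (\theta-1)(\log y_i - \eta_i)
          - 2\log\!\left(1 + r_i^{\theta}\right),
  \qquad \eta_i = U_i\xi,\quad r_i = y_i e^{-\eta_i},\\[2pt]
\frac{\partial \ell_i}{\partial \eta_i}
  &= -\theta + 2\theta\,\frac{r_i^{\theta}}{1+r_i^{\theta}}
   = \theta\,(2 s_i - 1),\\[2pt]
\frac{\partial^2 \ell_i}{\partial \eta_i^2}
  &= -2\theta^{2}\, s_i (1 - s_i),
  \qquad\text{so that}\quad
  W = \operatorname{diag}\bigl\{2\theta^{2} s_i(1-s_i)\bigr\}.
\end{aligned}
```

With this $W$, the matrix $U^T W U$ is the negative of the unpenalized Hessian and is positive semi-definite, which is why the update adds $H^{-1} g$ rather than subtracting it.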
However, in the presence of massive datasets, directly computing the Hessian matrix using the full sample within the Newton–Raphson algorithm becomes computationally expensive. To address this issue, we extend the estimation procedure to accommodate large-scale data by fully exploiting the block structure of the information matrix and adopting a cumulative, sample-wise updating scheme that enables parallel computation. In addition, a bias-corrected estimation strategy is employed to further improve computational efficiency and numerical stability. Specifically, an initial estimator $\hat{\xi}_1$ is first obtained using a relatively small subsample, along with the corresponding subsample-based information matrix $\hat{H}_1$. Subsequently, a one-step update is performed on a second subsample using the score function, yielding a bias-corrected estimator $\hat{\xi}$.
In the second step, $\xi$ is fixed and the shape parameter $\theta$ is estimated. Given $\hat{\xi}$, the Newton–Raphson updating scheme for $\theta$ is given by
$$\theta_{new} = \theta_{old} - \frac{\partial l_p / \partial \theta}{\partial^2 l_p / \partial \theta^2},$$
where
$$\frac{\partial l_p}{\partial \theta} = \sum_{i=1}^{n} \left[ \frac{1}{\theta} + \log r_i - \frac{2 r_i^{\theta} \log r_i}{1 + r_i^{\theta}} \right], \qquad \frac{\partial^2 l_p}{\partial \theta^2} = \sum_{i=1}^{n} \left[ -\frac{1}{\theta^2} - \frac{2 r_i^{\theta} (\log r_i)^2}{\left(1 + r_i^{\theta}\right)^2} \right],$$
with $r_i = y_i e^{-U_i \xi}$. This procedure is repeated over a grid of candidate smoothing parameters $\lambda$, and the value $\hat{\lambda}$ that maximizes the penalized log-likelihood is selected. The final parameter estimates $(\hat{\xi}, \hat{\theta})$ are then obtained.
Based on the estimated spline coefficients and covariates, the nonparametric functions $\hat{f}(z)$ are recovered, which in turn yield the estimated scale parameters $\hat{\alpha}_i$. Accordingly, the cumulative distribution function of medical expenditures is given by $F(y; \alpha, \theta) = P(Y \le y) = \frac{y^{\theta}}{\alpha^{\theta} + y^{\theta}}$, which leads to the estimated conditional distribution function $\hat{F}_Y(y \mid U)$.
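The second step, updating the shape parameter with the scale held fixed, can be sketched as a scalar Newton–Raphson iteration. The ratios `r[i] = y_i / alpha_i` below are invented numbers, and the step-halving safeguard that keeps the shape parameter positive is an addition for numerical safety rather than part of the described procedure.

```python
import math

def theta_score_and_hessian(theta, r):
    """First and second derivatives of the log-likelihood in theta,
    where r[i] = y_i / alpha_i is held fixed."""
    score, hessian = 0.0, 0.0
    for ri in r:
        log_r = math.log(ri)
        r_pow = ri ** theta
        score += 1.0 / theta + log_r - 2.0 * r_pow * log_r / (1.0 + r_pow)
        hessian += -1.0 / theta ** 2 - 2.0 * r_pow * log_r ** 2 / (1.0 + r_pow) ** 2
    return score, hessian

def update_theta(r, theta=1.0, tol=1e-12, max_iter=100):
    """Iterate theta <- theta - score / hessian until the step is negligible."""
    for _ in range(max_iter):
        score, hessian = theta_score_and_hessian(theta, r)
        step = score / hessian
        # Safeguard (not in the paper): halve the step until theta stays positive.
        while theta - step <= 0.0:
            step /= 2.0
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Invented ratios y_i / alpha_i for a handful of patients.
ratios = [0.5, 0.8, 1.0, 1.5, 2.5, 0.3, 4.0]
theta_hat = update_theta(ratios)
```

The second derivative is strictly negative for every positive value of the shape parameter, so the log-likelihood is concave in it and the iteration has a unique maximizer to converge to.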
After obtaining the estimated marginal distributions, pseudo-observations are constructed as $\hat{u}_i = \hat{F}_{T \mid X}(t_i \mid x_i)$ and $\hat{v}_i = \hat{F}_{Y \mid W,Z}(y_i \mid w_i, z_i)$. Let the Copula density function be defined as $c(u, v; \rho) = \frac{\partial^2}{\partial u\, \partial v} C(u, v; \rho)$. The corresponding pseudo log-likelihood function is then given by $l(\rho) = \sum_{i=1}^{n} \log c(\hat{u}_i, \hat{v}_i; \rho)$, and the dependence parameter is estimated via maximum likelihood as $\hat{\rho} = \arg\max_{\rho} \sum_{i=1}^{n} \log c(\hat{u}_i, \hat{v}_i; \rho)$.
In large-scale data settings, direct global iteration based on maximum likelihood estimation requires repeated computation of gradients and Hessian matrices, leading to computational costs that grow approximately linearly with sample size. When the number of observations reaches the order of tens of millions, such procedures often suffer from numerical instability and slow or unreliable convergence. To address these challenges, this study adopts a “consistent initial estimation followed by one-step correction” strategy. Specifically, Kendall’s tau is first used to obtain a consistent estimator of the dependence parameter [40], which serves as an approximately unbiased initial value. Starting from this initial estimator, only a single Newton–Raphson update is required to obtain an estimate close to the full maximum likelihood solution. This approach substantially reduces computational burden while retaining statistical efficiency. In addition, to further alleviate computational costs, a block-wise parallelization strategy is employed when evaluating the log-likelihood and gradient functions. The full dataset is randomly partitioned into multiple subsets, within which local gradients and likelihood contributions are computed independently. These local quantities are then aggregated to form a global updating direction, thereby avoiding repeated full-sample scans and significantly enhancing computational scalability. Finally, among a set of candidate Copula families—including Gaussian, Gumbel, Clayton, and Frank Copulas—the optimal Copula model is selected based on the Akaike Information Criterion (AIC) [41]. This information-based selection procedure ensures both the robustness of dependence modeling and computational efficiency in large-scale data environments.
Figure 1 summarizes the overall workflow of the proposed copula-based joint modeling framework, including data preprocessing, marginal modeling for length of stay and medical expenditures, copula-based dependence estimation with model selection, and subsequent inference and evaluation.

4. Simulation

The simulation settings are designed to mirror key features of modern large-scale administrative healthcare data. First, the sample size mimics massive inpatient datasets, where the number of observations poses notable computational and estimation challenges. This allows us to examine how well the proposed framework scales and whether it remains stable under conditions typical of real-world applications. Second, medical expenditures are generated from heavy-tailed, highly skewed distributions, reflecting the well-known concentration of healthcare costs among a small percentage of patients. Reproducing this skewness is essential for testing the robustness of the marginal and joint modeling strategies. Third, the simulation includes high-dimensional covariates with varying effects, some of which have little to no impact on outcomes. This mirrors clinical settings in which many patient characteristics have limited influence, and it shows how the method performs in the presence of irrelevant covariates. Finally, certain covariates, such as age, are allowed to have nonlinear effects on medical expenditures, capturing observed deviations from a simple linear relationship. Rather than recreating a specific dataset, the simulation framework represents a range of plausible healthcare scenarios in order to assess how the proposed method behaves under realistic data conditions.
Thus, a large-scale synthetic dataset is generated with sample size $n = 10^7$. Let $X_i = (X_{i1}, \ldots, X_{ip})$ denote the covariate vector with dimension $p = 12$, where each component is independently generated from a standard normal distribution $N(0, 1)$. Let $W_i = (W_{i1}, \ldots, W_{iq})$ be another set of covariates with dimension $q = 8$. Specifically, $W_{i1}, W_{i2}, W_{i3} \sim N(0, 1)$, $W_{i4} \sim \mathrm{Unif}(-2, 2)$, $W_{i5} \sim \mathrm{Bern}(0.4)$, $W_{i6} \sim \mathrm{Bern}(0.2)$, $W_{i7} \sim \mathrm{Bern}(0.8)$, and $W_{i8} \sim \mathrm{Exp}(1)$. In addition, the covariates $Z_{i1}$ and $Z_{i2}$ are independently generated from the uniform distribution $\mathrm{Unif}[0, 4]$. Under this setting, the medical expenditure variable $Y_i$, conditional on $(W_i, Z_i)$, follows a Log-Logistic distribution, $Y_i \mid W_i, Z_i \sim \text{Log-Logistic}(\alpha_i, \theta)$, with conditional density function $f_{Y\mid W,Z}(y_i \mid w_i, z_i; \gamma, \theta) = \dfrac{(\theta/\alpha_i)\,(y_i/\alpha_i)^{\theta - 1}}{\left[1 + (y_i/\alpha_i)^{\theta}\right]^{2}}$, where the scale parameter satisfies $\log \alpha_i = W_i^{T}\gamma + f_1(Z_{i1}) + f_2(Z_{i2})$. The parameter values are set as $\theta = 1.8$ and $\gamma = (0.2, 0.3, 0.15, 0.1, 1.2, 2.4, 0.4, 0.2)^{T}$, with nonlinear functions specified as $f_1(Z) = 0.6 \sin(2\pi Z)$ and $f_2(Z) = 0.4 \cos(2\pi Z)$. The hospital length of stay variable $T_i$ is generated from a Cox proportional hazards model, $h(t \mid X) = h_0(t) \exp(X^{T}\beta)$, which implies the conditional distribution function $F_{T\mid X}(t \mid x) = 1 - \exp\{-H_0(t)\, e^{x^{T}\beta}\}$, where $\beta = (1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1)^{T}$.

To characterize the dependence structure between hospital length of stay $T$ and medical expenditures $Y$, a Copula-based approach is employed to link their marginal distributions. Specifically, two dependence scenarios are considered.

Case I (Gaussian Copula): The Gaussian Copula is defined as $C_{\rho}(u, v) = \Phi_{\rho}(\Phi^{-1}(u), \Phi^{-1}(v))$, where $\Phi_{\rho}(\cdot, \cdot)$ denotes the cumulative distribution function of a bivariate normal distribution with correlation coefficient $\rho = 0.8$, and $\Phi^{-1}(\cdot)$ is the inverse of the standard normal distribution function.

Case II (Clayton Copula): $C_{\varsigma}(u, v) = (u^{-\varsigma} + v^{-\varsigma} - 1)^{-1/\varsigma}$, with $\varsigma = 2$.

In the data generation process, pairs $(U_i, V_i)$ are first sampled from the specified Copula. The corresponding realizations of $(T_i, Y_i)$ are then obtained through the inverse marginal distribution functions. In addition, an independent censoring time $C_i$ is introduced, with a censoring rate of 25%. The observed time is defined as $\tilde{T}_i = \min(T_i, C_i)$, and the event indicator is given by $\mathrm{event}_i = I(T_i \le C_i)$.
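The data-generating process for the Clayton scenario can be sketched as below. This is an illustrative sketch, not the authors' code, and several ingredients are assumptions that the text does not specify: the baseline cumulative hazard is taken to be $H_0(t) = t$, the censoring time is exponential with a hypothetical rate of 0.3 (not calibrated to the 25% censoring rate), the flattened expression for $W_{i4}$ is read as $\mathrm{Unif}(-2, 2)$, and the sample size is kept small so that the example runs quickly in pure Python. The coefficient values are copied as they appear in the text.

```python
import math
import random

BETA = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1]          # Cox coefficients as listed in the text
GAMMA = [0.2, 0.3, 0.15, 0.1, 1.2, 2.4, 0.4, 0.2]    # expenditure coefficients as listed
THETA = 1.8       # Log-Logistic shape parameter
VARSIGMA = 2.0    # Clayton dependence parameter (Case II)
CENS_RATE = 0.3   # hypothetical censoring rate parameter, an assumption
EPS = 1e-12       # keeps uniforms strictly inside (0, 1)


def draw_covariates(rng):
    x = [rng.gauss(0, 1) for _ in range(12)]
    w = [rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1),
         rng.uniform(-2, 2),                  # assumed Unif(-2, 2)
         float(rng.random() < 0.4),
         float(rng.random() < 0.2),
         float(rng.random() < 0.8),
         rng.expovariate(1.0)]
    z = [rng.uniform(0, 4), rng.uniform(0, 4)]
    return x, w, z


def draw_clayton_pair(rng, varsigma):
    """Conditional-inversion sampler for the Clayton copula."""
    u = 1.0 - rng.random()
    q = 1.0 - rng.random()
    v = (u ** (-varsigma) * (q ** (-varsigma / (1 + varsigma)) - 1) + 1) ** (-1 / varsigma)
    clamp = lambda p: min(max(p, EPS), 1 - EPS)
    return clamp(u), clamp(v)


def simulate_one(rng):
    x, w, z = draw_covariates(rng)
    u, v = draw_clayton_pair(rng, VARSIGMA)
    # Cox margin with assumed H0(t) = t: solve 1 - exp(-t * exp(x'beta)) = u for t
    lin_t = sum(b * xi for b, xi in zip(BETA, x))
    t = -math.log1p(-u) / math.exp(lin_t)
    # Log-Logistic margin: solve 1 / (1 + (y / alpha)^(-theta)) = v for y
    log_alpha = (sum(g * wi for g, wi in zip(GAMMA, w))
                 + 0.6 * math.sin(2 * math.pi * z[0])
                 + 0.4 * math.cos(2 * math.pi * z[1]))
    y = math.exp(log_alpha) * (v / (1 - v)) ** (1 / THETA)
    c = rng.expovariate(CENS_RATE)            # independent censoring time
    return min(t, c), int(t <= c), y          # (observed time, event indicator, cost)


def simulate(n, seed=0):
    rng = random.Random(seed)
    return [simulate_one(rng) for _ in range(n)]


rows = simulate(2000)
censored_share = 1 - sum(e for _, e, _ in rows) / len(rows)
print(f"empirical censoring share: {censored_share:.3f}")
```

In practice the censoring distribution would be tuned until the empirical censoring share matches the target rate.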
Next, a series of Monte Carlo simulations are conducted to illustrate the performance of the proposed estimation methods. Table 1 and Table 2 report the estimation results for the parameters $\beta$, $\gamma$, and $\theta$ under the Gaussian Copula setting. Estimation accuracy is assessed using bias and root mean squared error (RMSE), while the robustness of model selection based on the Akaike Information Criterion (AIC) is also examined. Figure 2 presents the estimated baseline cumulative hazard function $H_0(t)$ under the Gaussian Copula. Figure 3 and Figure 4 display the estimated nonparametric functions $f_1(Z_1)$ and $f_2(Z_2)$, respectively, and Figure 5 provides a comparison of Copula density contour plots. Similarly, Table 3 and Table 4 summarize the estimation results for $\beta$, $\gamma$, and $\theta$ under the Clayton Copula setting, with performance evaluated in terms of bias and RMSE, along with the robustness of AIC-based model selection. Figure 6 shows the estimated baseline cumulative hazard function $H_0(t)$ under the Clayton Copula. Figure 7 and Figure 8 present the estimated nonparametric functions $f_1(Z_1)$ and $f_2(Z_2)$, respectively, and Figure 9 compares the corresponding Copula density contour plots.
Based on the results reported in Table 1, Table 2, Table 3 and Table 4, it can be seen that under both the Gaussian Copula and the Clayton Copula settings, the estimated values of the regression parameters $\beta$, $\gamma$, and $\theta$ are all close to their true values. The corresponding biases are generally on the order of $10^{-3}$ to $10^{-2}$, and the RMSE values remain small, indicating that the proposed estimation method achieves high accuracy and consistency in large-scale data settings. The estimation error of the Copula dependence parameter also remains at a low level, suggesting that the proposed procedure provides stable inference even when modeling complex dependence structures in massive datasets.
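For reference, the two accuracy measures used in these tables can be computed from Monte Carlo replications as in the short sketch below. The function name and the four replicate values are made up for illustration and are not taken from the paper's tables.

```python
import math


def bias_and_rmse(estimates, truth):
    """Bias = mean(estimate) - truth; RMSE = sqrt(mean((estimate - truth)^2))."""
    r = len(estimates)
    bias = sum(estimates) / r - truth
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / r)
    return bias, rmse


# Hypothetical replicate estimates of a coefficient whose true value is 1.0
bias, rmse = bias_and_rmse([1.01, 0.99, 1.02, 0.98], truth=1.0)
print(f"bias = {bias:.4f}, RMSE = {rmse:.4f}")
```

RMSE combines bias and sampling variability, so a small bias together with a small RMSE indicates that the estimates are both centered on the truth and tightly clustered.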
Moreover, when comparing Gaussian, Clayton, and Frank copulas while keeping the marginal specifications fixed, the AIC-based model identification rate reaches 100% across all simulation scenarios; that is, the true copula family used in the data-generating process is always correctly selected. This result demonstrates that the proposed framework can accurately distinguish between alternative dependence structures and reliably support copula family selection in large-scale applications. Figure 2, Figure 3 and Figure 4 and Figure 6, Figure 7 and Figure 8 show that the estimated baseline cumulative hazard function $H_0(t)$ and the estimated nonparametric functions $f_1(Z_1)$ and $f_2(Z_2)$ closely overlap with their true counterparts. This finding suggests that the proposed method exhibits strong fitting capability under complex functional structures. In addition, the Copula density contour plots in Figure 5 and Figure 9 clearly illustrate that the estimated dependence structures are highly consistent with the true underlying structures, indicating that the proposed approach can stably recover the dependence relationship between hospital length of stay and medical expenditures.
To further benchmark the explanatory advantage of the proposed copula-based joint framework, we conduct a direct comparison with a conventional “separate” specification that assumes independence between the two outcomes. Specifically, after estimating the marginal models for length of stay and medical expenditures, we evaluate multiple dependence structures on the same pseudo-observations $(\hat{u}_i, \hat{v}_i)$. In addition to the independence model, which corresponds to the independence copula, we consider two commonly used copula families, the Gaussian and the Clayton copula, under two representative levels of dependence. For the Gaussian copula, the dependence parameter $\rho$ is set to $\rho = 0.8$ (strong dependence) and $\rho = 0.2$ (weak dependence), while for the Clayton copula, the dependence parameter $\varsigma$ is analogously set to $\varsigma = 2$ and $\varsigma = 0.2$. This design allows us to assess the improvement of the proposed joint modeling framework over the independence model across both different copula families and varying strengths of dependence. Model adequacy is assessed using the Akaike Information Criterion (AIC) computed from the copula log-likelihood, $\mathrm{AIC} = -2\ell + 2k$, where $\ell$ is the maximized copula log-likelihood and $k$ denotes the number of free dependence parameters ($k = 1$ for the joint copula model and $k = 0$ under independence). We report $\Delta\mathrm{AIC} = \mathrm{AIC}_{\mathrm{joint}} - \mathrm{AIC}_{\mathrm{independent}}$; negative values indicate that allowing dependence yields a better trade-off between goodness-of-fit and model complexity. Because the independence copula has density equal to one, its copula log-likelihood is zero, so $\Delta\mathrm{AIC}$ reduces to $-2\ell_{\mathrm{joint}} + 2$. Across the simulation settings (see Table 5), $\Delta\mathrm{AIC}$ is consistently negative, implying that the copula-based joint specification provides a materially improved fit relative to an independence-based separate modeling approach, even after accounting for the additional dependence parameter.
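The comparison against the independence copula can be illustrated with a small sketch for the Gaussian family. The code below is a simplified stand-in rather than the procedure used in the paper: the pseudo-observations are simulated directly with $\rho = 0.8$, the sample size of 3000 is arbitrary, and the dependence parameter is estimated by a coarse grid search instead of the one-step estimator. All function and variable names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def delta_aic_gaussian(pairs):
    """Return (rho_hat, AIC_joint - AIC_independent) for a Gaussian copula."""
    scores = [(STD_NORMAL.inv_cdf(u), STD_NORMAL.inv_cdf(v)) for u, v in pairs]
    n = len(scores)
    s2 = sum(a * a + b * b for a, b in scores)   # sum of a^2 + b^2
    sab = sum(a * b for a, b in scores)          # sum of a * b

    def loglik(rho):
        # sum of log c(u, v; rho) for the Gaussian copula density
        return (-0.5 * n * math.log(1 - rho ** 2)
                - (rho ** 2 * s2 - 2 * rho * sab) / (2 * (1 - rho ** 2)))

    grid = [i / 100 for i in range(-99, 100)]    # coarse search over (-1, 1)
    rho_hat = max(grid, key=loglik)
    aic_joint = -2 * loglik(rho_hat) + 2 * 1     # k = 1 dependence parameter
    aic_indep = -2 * 0.0 + 2 * 0                 # independence: log c = 0, k = 0
    return rho_hat, aic_joint - aic_indep


rng = random.Random(1)
pairs = []
for _ in range(3000):
    a = rng.gauss(0, 1)
    b = 0.8 * a + math.sqrt(1 - 0.8 ** 2) * rng.gauss(0, 1)
    pairs.append((STD_NORMAL.cdf(a), STD_NORMAL.cdf(b)))  # synthetic pseudo-observations

rho_hat, d_aic = delta_aic_gaussian(pairs)
print(f"rho_hat = {rho_hat:.2f}, delta AIC = {d_aic:.1f}")
```

A strongly negative value of `d_aic` in this setting reflects the same pattern described above: when the outcomes are truly dependent, the single extra parameter is more than paid for by the gain in log-likelihood.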
Overall, the simulation results show that the proposed estimation method is both efficient and reliable in large-scale data settings. The goal of the simulation study is to assess the performance of the method rather than to replicate any specific real-world dataset. To this end, the data-generating processes are designed to mimic key features of administrative healthcare data: large sample sizes, heavy-tailed medical expenditures, covariate effects that vary widely and are negligible for some variables, and complex nonlinear relationships. This controlled setting allows the accuracy, robustness, and scalability of the estimation method to be evaluated directly, and the simulation evidence supports the overall performance of the approach. The framework is expected to be applicable in practice when (i) the distributions of length of stay and medical costs can be reasonably estimated or approximated, (ii) the dependence between the outcomes arises from shared but only partially observed factors, and (iii) the data exhibit the large-sample characteristics typical of administrative healthcare records. These conditions are generally met in modern inpatient datasets, which supports the practical usefulness of the proposed method.

5. Application

In health services research, hospital length of stay is often regarded as a main determinant of medical costs. Patients who stay longer in the hospital usually consume more medical resources, which can lead to higher healthcare expenses. However, this link is not always straightforward. Factors such as disease type, patient heterogeneity, and the healthcare system all shape how length of stay affects costs. Evidence shows that length of stay not only affects direct resource use (hospital beds, nursing, and medical technology) but also influences costs indirectly through factors such as complication risk and treatment intensity.
Understanding how hospital length of stay and medical costs are connected is important for making healthcare costs more transparent, improving payment systems, and allocating resources more efficiently. To explore this connection, this study applies a Copula-based approach that models hospital stays and medical expenses jointly. The method flexibly captures the dependence between the two outcomes while preserving the distinct marginal features of each.
The empirical analysis is based on inpatient hospital discharge data from the Statewide Planning and Research Cooperative System (SPARCS) in New York State, a comprehensive administrative healthcare database. The original dataset contains 2,544,543 inpatient admissions, covering a broad spectrum of hospitalizations across the state. SPARCS provides rich patient- and hospitalization-level information, including demographic characteristics (age, sex, and race), admission types (e.g., emergency, elective, trauma), clinical indicators such as length of stay, APR-DRG diagnosis group, severity of illness, and mortality risk, as well as discharge outcomes and economic measures including total charges and payment methods.
Prior to analysis, the data were carefully cleaned according to predefined criteria to ensure data quality, with hospitalizations containing excessive missing information excluded. This process resulted in a final analytical sample of 2,444,747 inpatient admissions. Table 6 and Table 7 present descriptive statistics for the cleaned sample. The average length of stay is 5.37 days, with a median of 3 days, indicating substantial heterogeneity in hospitalization duration. Medical expenditures exhibit pronounced right skewness: the mean total charge is 33,269.40, while the median is 18,209.11, suggesting that a small proportion of hospitalizations account for a large share of total costs. To illustrate this distributional feature, Figure 10 displays the distribution of total charges on both the original and log-transformed scales. While the log transformation substantially reduces skewness for descriptive visualization, it is used solely for illustrative purposes and does not impose any parametric assumption on the subsequent modeling framework.
While most hospitalizations incur relatively moderate costs, a small proportion of cases are associated with extremely high expenditures, as reflected by the large maximum value. Female patients account for 56.3% of hospitalizations, while male patients represent 43.7%. With respect to race, White patients comprise the largest group (57.7%), followed by Black or African American patients (19.5%) and patients of other racial backgrounds (22.8%). Hospitalizations are observed across all age groups, with a relatively balanced distribution. In terms of disease severity, most hospitalizations are classified as moderate (36.8%) or minor (34.9%), whereas major and extreme cases account for 21.7% and 6.6%, respectively. Regarding payment arrangements, hospitalizations involving two payment methods are most common (43.9%), followed by those with one (28.2%) or three payment methods (27.9%). Overall, these distributions highlight the heterogeneity of patient characteristics and clinical conditions in the study sample.
In the original data, hospital admissions are categorized into five types: elective, emergency, newborn, urgent, and trauma. Elective admissions serve as the reference category, and the other types enter the model as indicator variables.
Regarding discharge outcomes, although 19 discharge categories are recorded, following classifications commonly adopted in the literature [42], discharge destinations are grouped into four categories: home, long-term care facilities, rehabilitation facilities, and death or abnormal discharge. Discharge to home is used as the reference category, with the remaining categories included as indicator variables. Additional patient-level covariates include sex, race, disease severity, number of payment methods, indicators for commercial insurance and public insurance, whether surgery was performed, and whether the admission was classified as an emergency. A detailed summary of these variables is provided in Table 8.
We also compute the pairwise Pearson correlations among the study variables, as shown in Figure 11. Length of stay and total charges exhibit a relatively strong positive correlation. Beyond this relationship, several covariates display non-negligible correlations with either length of stay or total charges, indicating that medical utilization and costs are jointly associated with multiple patient- and admission-related characteristics.
Considering that hospital length of stay is closely related to admission characteristics, discharge destination, and whether surgical procedures are performed, covariates $X_1$–$X_{10}$ are selected as predictors of hospital length of stay $T$. These variables capture key clinical and administrative factors that are expected to influence hospitalization duration.
Given the strong association between insurance coverage, disease characteristics, and medical expenditures, covariates $W_1$–$W_6$ are included as determinants of medical costs. In addition, prior studies suggest that certain covariates may exert nonlinear effects on medical expenditures. For example, age is known to have a potentially nonlinear relationship with healthcare costs. Although age is recorded in categorical form in the dataset, it is treated as a continuous variable in the analysis to facilitate flexible modeling of its effect on medical expenditures.
Furthermore, disease severity is incorporated into the medical expenditure model, as more severe conditions are typically associated with higher treatment intensity and greater healthcare costs. Based on these considerations, medical expenditures are assumed to follow a Log-Logistic distribution, while hospital length of stay is modeled using a Cox proportional hazards framework. Several copula families, including Clayton, Gaussian, and Gumbel, are considered, and the AIC values for the candidate copula models are presented in Table 9. Based on the AIC, the Clayton copula is selected as the linking copula. The modeling strategy and estimation procedures described in the previous sections are then applied, and the empirical results are presented below.
Based on the Cox proportional hazards model results reported in Table 10, the effects of different covariates on hospital length of stay vary substantially. The coefficient for gender is negative, indicating that female patients tend to have longer hospital stays than male patients. This difference may partly reflect pregnancy-related hospitalizations, such as pre-term labor and other obstetric complications, beyond childbirth itself. The coefficient associated with race is small in magnitude relative to other covariates, indicating that race does not have a strong effect on hospital length of stay in this model.
Admission pathways also play an important role in determining length of stay. Patients admitted due to urgent or emergency conditions tend to have shorter hospital stays than those admitted electively, whereas hospitalizations related to newborn delivery are associated with relatively longer stays. Discharge destination exhibits particularly strong explanatory power. Patients discharged to long-term care facilities or rehabilitation institutions have significantly shorter hospital stays, suggesting that hospitals often adopt strategies of early transfer combined with continued post-acute care for these patients. Overall, the Cox model reveals the complex set of driving factors underlying hospital length of stay and confirms the joint influence of sociodemographic characteristics and healthcare pathways.
In the Log-Logistic medical expenditure model presented in Table 11, the number of payment methods and indicators for commercial insurance and public insurance are all positively associated with medical expenditures. This finding implies that diversified payment mechanisms and broader insurance coverage substantially increase healthcare spending during hospitalization. On the one hand, insurance coverage enhances patients’ ability to pay and facilitates access to additional medical resources; on the other hand, it also reflects institutional differences in healthcare financing. Emergency admissions are likewise associated with significantly higher medical expenditures, which is consistent with the fact that emergency care typically involves more intensive treatment and higher costs.
Gender and race both exhibit positive coefficients in the expenditure model, indicating that male patients and certain racial groups incur higher medical expenditures than others. These differences may be related to variations in disease types, care-seeking behavior, and treatment choices.
Figure 12 shows that the estimated baseline cumulative hazard function $H_0(t)$ increases monotonically, indicating that the risk of discharge (or event occurrence) accumulates over time as hospitalization progresses. This pattern is consistent with the empirical distributional characteristics of hospital length of stay. Figure 13 illustrates the nonlinear effect of age on medical expenditures, revealing pronounced fluctuations rather than a simple monotonic relationship. This result suggests that the association between age and medical expenditures is complex and may be influenced by interaction effects with other factors.
Figure 14 further demonstrates the nonlinear effect of disease severity on medical expenditures. Notably, patients with mild conditions exhibit relatively higher expenditure levels, which may reflect potential overuse of diagnostic testing or overtreatment in clinical practice. In contrast, although treatment for severe conditions is more complex, expenditures for severely ill patients do not increase without bound. This pattern suggests that healthcare spending for high-severity patients may be constrained by insurance reimbursement policies, clinical decision-making, and resource allocation mechanisms. These findings indicate that medical expenditures are shaped not only by disease severity but also by institutional and behavioral factors.
Finally, the copula estimation results indicate a dependence parameter of 0.64736 between medical expenditures and hospital length of stay under the Clayton copula (see Figure 15). While this estimate confirms a statistically meaningful dependence between the two outcomes, the magnitude suggests a moderate rather than strong association. This finding is consistent with clinical and institutional realities: although length of stay is an important determinant of inpatient medical expenditures, healthcare costs are simultaneously shaped by multiple additional factors, including treatment intensity, disease severity, insurance coverage, and hospital-specific practices. As a result, variation in medical expenditures cannot be fully explained by hospitalization duration alone.
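To put the reported dependence parameter on a more familiar scale, it can be converted into Kendall's tau and a lower-tail dependence coefficient using the standard Clayton identities $\tau = \theta/(\theta + 2)$ and $\lambda_L = 2^{-1/\theta}$. The sketch below assumes that 0.64736 is the Clayton parameter in its standard parameterization, which the text does not state explicitly; the derived values are illustrative and are not reported in the paper.

```python
theta = 0.64736  # reported dependence parameter, assumed to be the Clayton parameter

kendall_tau = theta / (theta + 2.0)       # Clayton: tau = theta / (theta + 2)
lower_tail_dep = 2.0 ** (-1.0 / theta)    # Clayton: lambda_L = 2^(-1/theta)

print(f"Kendall's tau         = {kendall_tau:.3f}")
print(f"lower-tail dependence = {lower_tail_dep:.3f}")
```

Under this assumption Kendall's tau is roughly 0.24, which agrees with the reading of a moderate rather than strong association.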
Several considerations should be taken into account when interpreting these empirical results. First, as is common with large-scale administrative healthcare datasets, detailed clinical information capturing physician decision-making, treatment pathways, and quality of care is not fully observed. Although the proposed copula-based framework explicitly accounts for dependence driven by shared unobserved factors, some residual unobserved heterogeneity may still remain. Second, the empirical analysis is restricted to hospitalized patients, implying that the sample reflects a selected population shaped by prior admission decisions, and potential selection bias related to the hospitalization process cannot be entirely ruled out. Third, since the data originate from a specific healthcare system and institutional setting, caution is warranted when generalizing the findings to other healthcare systems or payment regimes. These limitations do not detract from the methodological contribution of this study but instead point to promising directions for future research, such as incorporating richer clinical information or explicitly modeling admission and selection mechanisms.

6. Conclusions

This study proposes a copula-based joint modeling framework to analyze the relationship between hospital length of stay and medical expenditures in large-scale healthcare data. Methodologically, the framework accommodates distinct marginal behaviors—censored time-to-event outcomes for length of stay and heavy-tailed cost distributions for medical expenditures—while flexibly characterizing their dependence structure. The proposed estimation strategy is specifically designed for massive datasets, combining scalable marginal estimation with efficient copula-based dependence modeling.
From an empirical perspective, the results reveal a moderate but statistically meaningful dependence between hospitalization duration and medical expenditures, suggesting that length of stay is an important, yet not dominant, driver of inpatient costs. Moreover, the analysis uncovers substantial heterogeneity in cost structures: patients with relatively mild conditions may incur disproportionately high expenditures, whereas expenditure growth among severely ill patients appears more constrained by existing payment rules and standardized clinical pathways.
These findings yield several important implications for hospital management and healthcare policy. First, although hospitalization duration is related to medical expenditures, cost-containment strategies that focus primarily on shortening length of stay are unlikely to be sufficient. Excessive emphasis on reducing hospital days may shift costs toward higher treatment intensity, increased diagnostic testing, or post-discharge services without achieving meaningful reductions in total expenditures. Hospital managers may therefore benefit from complementing length-of-stay indicators with additional measures when evaluating cost performance.
Second, the observation that low-risk patients can generate unexpectedly high costs highlights the need for improved management of resource utilization in mild cases. Closer monitoring of diagnostic intensity and procedural use, stronger adherence to clinical guidelines, and the incorporation of treatment appropriateness measures into internal performance evaluation systems may help reduce unnecessary services and improve efficiency.
Third, from a policy perspective, the relatively constrained expenditure growth observed among severely ill patients suggests that existing regulatory instruments—such as standardized treatment protocols and reimbursement ceilings—play an effective role in managing high-risk cases. This finding supports the continued expansion of insurance-based payment mechanisms, including diagnosis-related group (DRG) payments and global budgeting, which align reimbursement more closely with case complexity and treatment intensity rather than hospitalization duration alone. In addition, strengthening tiered healthcare delivery systems may help redirect patients with mild conditions toward primary care or outpatient settings, thereby reducing avoidable inpatient utilization and alleviating pressure on large hospitals.
Several limitations should be acknowledged. As with most administrative healthcare datasets, detailed clinical information related to physician decision-making and quality of care is not fully observed. While the proposed copula-based framework does not eliminate bias arising from such unobserved confounding, it provides a principled way to mitigate its impact by relaxing the conditional independence assumption commonly imposed in separate marginal analyses. By explicitly modeling residual dependence between length of stay and medical expenditures after accounting for observed covariates, shared latent influences are captured in the dependence structure rather than being implicitly absorbed into the marginal models. Nevertheless, unobserved heterogeneity may still remain, and caution is warranted when interpreting the results.
Future research may extend the proposed framework in several directions. Integrating richer clinical information or explicitly modeling admission and selection mechanisms—where, for instance, Bayesian methods such as [43,44,45] could be employed to incorporate prior knowledge, handle uncertainty, and enhance interpretability—could further refine empirical inference. The framework may also be extended to dynamic or multivariate settings to examine patient trajectories and healthcare utilization over time. Finally, the proposed approach provides a useful tool for evaluating alternative payment designs and policy interventions, thereby supporting more evidence-based decision-making in healthcare management and policy.

Author Contributions

Conceptualization, X.X.; methodology, X.X.; formal analysis, X.X.; investigation, X.X.; software, X.X.; validation, X.X.; data curation, Y.W.; writing—original draft preparation, X.X.; writing—review and editing, Y.W.; visualization, X.X.; supervision, Y.W.; project administration, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Zhejiang Provincial Philosophy and Social Sciences Planning Project under Grant No. 25NDJC126YBMS, the Zhejiang Provincial Natural Science Foundation under Grant No. LMS25G010001, and the National Bureau of Statistics of China under Grant No. 2025LZ023. Additional support was provided by the Fundamental Research Funds for the Provincial Universities of Zhejiang under Grant No. 2025ZDPY06.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are publicly available from the Health Data NY website (SPARCS De-Identified Hospital Inpatient Discharges).

Acknowledgments

The authors would like to thank Xiaobing Zhao for his valuable support and helpful suggestions during the preparation of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Thomas, W.J.; Guire, K.E.; Horvat, G.G. Is patient length of stay related to quality of care? J. Healthc. Manag. 1997, 42, 489–507.
  2. Burns, L.R.; Wholey, D.R. The effects of patient, hospital, and physician characteristics on length of stay and mortality. Med. Care 1991, 29, 251–271.
  3. Rapoport, J.; Teres, D.; Zhao, Y.; Lemeshow, S. Length of stay data as a guide to hospital economic performance for ICU patients. Med. Care 2003, 41, 386–397.
  4. Angus, D.C.; Linde-Zwirble, W.T.; Sirio, C.A.; Rotondi, A.J.; Chelluri, L.; Newbold, R.C.; Lave, J.R.; Pinsky, M.R. The effect of managed care on ICU length of stay: Implications for medicare. JAMA 1996, 276, 1075–1082.
  5. Haidar, S.; Vazquez, R.; Medic, G. Impact of surgical complications on hospital costs and revenues: Retrospective database study of Medicare claims. J. Comp. Eff. Res. 2023, 12, e230080.
  6. Hawkes, W.G.; Wehren, L.; Orwig, D.; Hebel, J.R.; Magaziner, J. Gender differences in functioning after hip fracture. J. Gerontol. Ser. A Biol. Sci. Med. Sci. 2006, 61, 495–499.
  7. Han, T.S.; Murray, P.; Robin, J.; Wilkinson, P.; Fluck, D.; Fry, C.H. Evaluation of the association of length of stay in hospital and outcomes. Int. J. Qual. Health Care 2022, 34, mzab160.
  8. Zhao, X.; Xia, Y.; Xu, X. Sufficient dimension reduction on partially nonlinear index models with applications to medical costs analysis. PLoS ONE 2025, 20, e0321796.
  9. Jang, S.I.; Nam, C.M.; Lee, S.G.; Kim, T.H.; Park, S.; Park, E.C. Impact of payment system change from per-case to per-diem on high severity patient’s length of stay. Medicine 2016, 95, e4839.
  10. Newhouse, J.P. Medical care costs: How much welfare loss? J. Econ. Perspect. 1992, 6, 3–21.
  11. Taylor, G.; McGuire, G.; Sullivan, J. Individual claim loss reserving conditioned by case estimates. Ann. Actuar. Sci. 2008, 3, 215–256.
  12. Polverejan, E.; Gardiner, J.C.; Bradley, C.J.; Holmes-Rovner, M.; Rovner, D. Estimating mean hospital cost as a function of length of stay and patient characteristics. Health Econ. 2003, 12, 935–947.
  13. May, P.; Garrido, M.M.; Cassel, J.B.; Morrison, R.S.; Normand, C. Using length of stay to control for unobserved heterogeneity when estimating treatment effect on hospital costs with observational data: Issues of reliability, robustness, and usefulness. Health Serv. Res. 2016, 51, 2020–2043.
  14. Lorenzoni, A.M.L.; Marino, A. Understanding Variations in Hospital Length of Stay and Cost; OECD: Paris, France, 2017.
  15. Sklar, M. Fonctions de répartition à n dimensions et leurs marges. Ann. l’ISUP 1959, 8, 229–231.
  16. Smith, M.D. Modelling sample selection using Archimedean copulas. Econom. J. 2003, 6, 99–123.
  17. Zimmer, D.M.; Trivedi, P.K. Using trivariate copulas to model sample selection and treatment effects: Application to family health care demand. J. Bus. Econ. Stat. 2006, 24, 63–76.
  18. Dancer, D.; Rammohan, A.; Smith, M.D. Infant mortality and child nutrition in Bangladesh. Health Econ. 2008, 17, 1015–1035.
  19. Quinn, C. Using Copulas to Estimate Reduced-Form Systems of Equations; Technical Report, HEDG, c/o; Department of Economics, University of York: York, UK, 2007.
  20. Ghahroodi, Z.R.; Saba, R.A.; Baghfalaki, T. Gaussian copula–based regression models for the analysis of mixed outcomes: An application on household’s utilization of health services data. J. Stat. Theory Appl. 2019, 18, 182–197.
  21. Mittal, S.; Madigan, D.; Burd, R.S.; Suchard, M.A. High-dimensional, massive sample-size Cox proportional hazards regression for survival analysis. Biostatistics 2014, 15, 207–221.
  22. Wang, W.; Wang, Y.; Zhao, X. Estimation and inference for fixed center effects on panel count data. Stat. Pap. 2026, 67.
  23. Deresa, N.W.; Keilegom, I.V. Copula based Cox proportional hazards models for dependent censoring. J. Am. Stat. Assoc. 2024, 119, 1044–1054.
  24. Cox, D.R. Regression models and life-tables. J. R. Stat. Soc. Ser. B Stat. Methodol. 1972, 34, 187–202. [Google Scholar] [CrossRef]
  25. Thompson, S.G.; Barber, J.A. How should cost data in pragmatic randomised trials be analysed? BMJ 2000, 320, 1197–1200. [Google Scholar] [CrossRef]
  26. Briggs, A.; Gray, A. The distribution of health care costs and their statistical analysis for economic evaluation. J. Health Serv. Res. Policy 1998, 3, 233–245. [Google Scholar] [CrossRef]
  27. Zhu, Z.; Wang, J.; Chen, C.; Zhou, J. Hospitalization charges for extremely preterm infants: A ten-year analysis in Shanghai, China. J. Med Econ. 2020, 23, 1610–1617. [Google Scholar] [CrossRef] [PubMed]
  28. Qiu, C.; Chen, T.; Wu, X. Evaluate the Medical Insurance Premium under Heavy-tailed Distribution: Empirical Research on NCMS of minhang (Shanghai). J. Appl. Stat. Manag. 2013, 32, 974–983. (In Chinese) [Google Scholar]
  29. Chen, J.; Liu, L.; Zhang, D.; Shih, Y.C.T. A flexible model for the mean and variance functions, with application to medical cost data. Stat. Med. 2013, 32, 4306–4318. [Google Scholar] [CrossRef] [PubMed]
  30. Tang, X.; Luo, Z.; Gardiner, J.C. A Bivariate Random-Effects Copula Model for Length of Stay and Cost. In Statistical Applications from Clinical Trials and Personalized Medicine to Finance and Business Analytics: Selected Papers from the 2015 ICSA/Graybill Applied Statistics Symposium, Colorado State University, Fort Collins; Springer: Cham, Switzerland, 2016; pp. 339–352. [Google Scholar]
  31. Oakes, D. A model for association in bivariate survival data. J. R. Stat. Soc. Ser. B Stat. Methodol. 1982, 44, 414–422. [Google Scholar] [CrossRef]
  32. Di Clemente, A.; Romano, C. Calibrating and simulating copula functions in financial applications. Front. Appl. Math. Stat. 2021, 7, 642210. [Google Scholar] [CrossRef]
  33. Boucher, J.P.; Denuit, M.; Guillen, M. Number of accidents or number of claims? An approach with zero-inflated Poisson models for panel data. J. Risk Insur. 2009, 76, 821–846. [Google Scholar] [CrossRef]
  34. Chen, S.X.; Huang, T.M. Nonparametric estimation of copula functions for dependence modelling. Can. J. Stat. 2007, 35, 265–282. [Google Scholar] [CrossRef]
  35. Chen, X.; Fan, Y.; Tsyrennikov, V. Efficient estimation of semiparametric multivariate copula models. J. Am. Stat. Assoc. 2006, 101, 1228–1240. [Google Scholar] [CrossRef]
  36. Park, M.Y.; Hastie, T. L 1-regularization path algorithm for generalized linear models. J. R. Stat. Soc. Ser. B Stat. Methodol. 2007, 69, 659–677. [Google Scholar] [CrossRef]
  37. Engler, D.; Li, Y. Survival analysis with high-dimensional covariates: An application in microarray studies. Stat. Appl. Genet. Mol. Biol. 2009, 8, 14. [Google Scholar] [CrossRef] [PubMed]
  38. Wang, J.; Zeng, D.; Lin, D.Y. Fitting the Cox proportional hazards model to big data. Biometrics 2024, 80, ujae018. [Google Scholar] [CrossRef] [PubMed]
  39. Ruppert, D.; Wand, M.P.; Carroll, R.J. Semiparametric Regression; Cambridge University Press: Cambridge, UK, 2003; p. 12. [Google Scholar]
  40. Houari, R.; Bounceur, A.; Kechadi, M.T.; Tari, A.K.; Euler, R. Dimensionality reduction in data mining: A Copula approach. Expert Syst. Appl. 2016, 64, 247–260. [Google Scholar] [CrossRef]
  41. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
  42. Picone, G.; Mark Wilson, R.; Chou, S.Y. Analysis of hospital length of stay and discharge destination using hazard functions with unmeasured heterogeneity. Health Econ. 2003, 12, 1021–1034.
  43. Xu, A.; Wang, W. Recursive Bayesian prediction of remaining useful life for gamma degradation process under conjugate priors. Scand. J. Stat. 2025, 53, 175–206.
  44. Yin, H.; Wang, Y.; Xu, A. Kernel-based marginal testing for covariate effects in high-dimensional settings. Scand. J. Stat. 2026, 53, 498–531.
  45. Zhu, D.; Xu, A.; Chen, Z.; Ding, S.; Fang, G. An Online Bayesian Framework for Identifying Latent System Degradation States. IEEE Trans. Reliab. 2026, 75, 542–554.
Figure 1. Overall Workflow of the Proposed Copula-Based Joint Modeling Framework.
Figure 2. Estimation of the baseline cumulative hazard function $H_0(t)$ in Case I (Gaussian Copula).
Figure 3. Estimation of the nonparametric function $f_1(Z_1)$ in Case I (Gaussian Copula).
Figure 4. Estimation of the nonparametric function $f_2(Z_2)$ in Case I (Gaussian Copula).
Figure 5. Comparison of copula density contour plots: true (left) and estimated (right) Gaussian copula densities in Case I (Gaussian Copula).
Figure 6. Estimation of the baseline cumulative hazard function $H_0(t)$ in Case II (Clayton Copula).
Figure 7. Estimation of the nonparametric function $f_1(Z_1)$ in Case II (Clayton Copula).
Figure 8. Estimation of the nonparametric function $f_2(Z_2)$ in Case II (Clayton Copula).
Figure 9. Comparison of copula density contour plots: true (left) and estimated (right) Clayton copula densities in Case II (Clayton Copula).
Figure 10. Distribution of total hospital charges.
Figure 11. Pearson correlation coefficients among study variables.
Figure 12. Estimated cumulative hazard function of hospital length of stay.
Figure 13. Effect of age on medical expenditures.
Figure 14. Effect of APR Severity of Illness on medical expenditures.
Figure 15. Relationship between medical expenditures and hospital length of stay.
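Table 1. Estimation results for $\beta$, $\gamma$, and $\theta$ under the Gaussian Copula.

| Parameter | True Value | Mean Estimate | Bias | RMSE |
|---|---|---|---|---|
| $\beta_1$ | 1 | 0.9998 | −0.0002 | 0.0015 |
| $\beta_2$ | 0 | 0.0001 | 0.0001 | 0.0009 |
| $\beta_3$ | 0 | −0.0002 | −0.0002 | 0.0012 |
| $\beta_4$ | 1 | 1.0004 | 0.0004 | 0.0014 |
| $\beta_5$ | 0 | −0.0002 | −0.0002 | 0.0014 |
| $\beta_6$ | 0 | 0.0001 | 0.0001 | 0.0014 |
| $\beta_7$ | 0 | 0.0005 | 0.0005 | 0.0014 |
| $\beta_8$ | 0 | 0.0005 | 0.0005 | 0.0012 |
| $\beta_9$ | 0 | 0.0001 | 0.0001 | 0.0011 |
| $\beta_{10}$ | 0 | −0.0001 | −0.0001 | 0.0010 |
| $\beta_{11}$ | 0 | 0.0000 | 0.0000 | 0.0011 |
| $\beta_{12}$ | 1 | 1.0005 | 0.0005 | 0.0015 |
| $\gamma_1$ | 0.2 | 0.2000 | 0.0000 | 0.0010 |
| $\gamma_2$ | −0.3 | −0.2999 | 0.0001 | 0.0010 |
| $\gamma_3$ | 0.15 | 0.1501 | 0.0001 | 0.0010 |
| $\gamma_4$ | 0.1 | 0.1001 | 0.0001 | 0.0010 |
| $\gamma_5$ | 1.2 | 1.2060 | 0.0060 | 0.0063 |
| $\gamma_6$ | 2.4 | 2.4035 | 0.0035 | 0.0047 |
| $\gamma_7$ | 0.4 | 0.4172 | 0.0172 | 0.0174 |
| $\gamma_8$ | −0.2 | −0.1963 | 0.0037 | 0.0039 |
| $\theta$ | 1.8 | 1.7984 | −0.0016 | 0.0022 |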
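
The Bias and RMSE columns in the simulation tables can be read with the usual Monte Carlo definitions in mind. The definitions are not restated in this part of the article, so the sketch below assumes the standard ones: bias is the mean estimate minus the true value, and RMSE is the square root of the mean squared deviation from the true value. The function name `bias_and_rmse` and the simulated replicates are hypothetical and serve only as illustration; they are not the study's data.

```python
import math
import random


def bias_and_rmse(estimates, true_value):
    """Return (mean estimate, bias, RMSE) over Monte Carlo replicates.

    Assumes the standard definitions: bias = mean(est) - true,
    RMSE = sqrt(mean((est - true)^2)).
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    bias = mean_est - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return mean_est, bias, rmse


# Hypothetical replicate estimates of a parameter whose true value is 1.0
# (illustrative noise level only, not taken from the paper).
random.seed(0)
replicates = [1.0 + random.gauss(0.0, 0.0015) for _ in range(500)]

mean_est, bias, rmse = bias_and_rmse(replicates, 1.0)
print(f"mean={mean_est:.4f}  bias={bias:+.4f}  rmse={rmse:.4f}")
```

Under these definitions, RMSE squared equals bias squared plus the variance of the estimates, so RMSE can never be smaller than the absolute bias. This gives a quick consistency check when reading any row of the tables.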
Table 2. Estimation results for Copula parameters under the Gaussian Copula.

| Copula Type | True Value | Mean Estimate | Bias | RMSE | AIC Selection Rate |
|---|---|---|---|---|---|
| Gaussian | 0.8 | 0.7947 | −0.0053 | 0.0111 | 100% |
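Table 3. Estimation results for $\beta$, $\gamma$, and $\theta$ under the Clayton Copula.

| Parameter | True Value | Mean Estimate | Bias | RMSE |
|---|---|---|---|---|
| $\beta_1$ | 1 | 1.0000 | −0.0000 | 0.0013 |
| $\beta_2$ | 0 | 0.0001 | 0.0001 | 0.0010 |
| $\beta_3$ | 0 | −0.0002 | −0.0002 | 0.0012 |
| $\beta_4$ | 1 | 1.0000 | 0.0000 | 0.0013 |
| $\beta_5$ | 0 | −0.0001 | −0.0001 | 0.0011 |
| $\beta_6$ | 0 | 0.0000 | 0.0000 | 0.0011 |
| $\beta_7$ | 0 | 0.0002 | 0.0002 | 0.0011 |
| $\beta_8$ | 0 | 0.0000 | 0.0000 | 0.0011 |
| $\beta_9$ | 0 | 0.0000 | 0.0000 | 0.0011 |
| $\beta_{10}$ | 0 | −0.0001 | −0.0001 | 0.0010 |
| $\beta_{11}$ | 0 | −0.0000 | 0.0000 | 0.0011 |
| $\beta_{12}$ | 1 | 1.0000 | 0.0000 | 0.0014 |
| $\gamma_1$ | 0.2 | 0.2000 | 0.0000 | 0.0011 |
| $\gamma_2$ | −0.3 | −0.3000 | 0.0000 | 0.0010 |
| $\gamma_3$ | 0.15 | 0.1499 | −0.0001 | 0.0009 |
| $\gamma_4$ | 0.1 | 0.0999 | −0.0001 | 0.0008 |
| $\gamma_5$ | 1.2 | 1.2056 | 0.0056 | 0.0060 |
| $\gamma_6$ | 2.4 | 2.4046 | 0.0046 | 0.0052 |
| $\gamma_7$ | 0.4 | 0.4179 | 0.0179 | 0.0180 |
| $\gamma_8$ | −0.2 | −0.1965 | 0.0035 | 0.0037 |
| $\theta$ | 1.8 | 1.7983 | −0.0015 | 0.0018 |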
Table 4. Estimation results for Copula parameters under the Clayton Copula.

| Copula Type | True Value | Mean Estimate | Bias | RMSE | AIC Selection Rate |
|---|---|---|---|---|---|
| Clayton | 2 | 2.0949 | 0.094882 | 0.012788 | 100% |
Table 5. Improvement over the independence model measured by ΔAIC.

| Copula Family | True Dependence | ΔAIC |
|---|---|---|
| Gaussian | $\rho = 0.8$ | 756314.46 |
| | $\rho = 0.2$ | 313167.28 |
| Clayton | $\gamma = 2$ | 4612144.33 |
| | $\gamma = 0.2$ | 193159.49 |
Table 6. Descriptive statistics of study variables.

| Variable | Category | n (%) |
|---|---|---|
| Gender | Female | 1,376,058 (56.3) |
| | Male | 1,068,689 (43.7) |
| Race | White | 1,410,096 (57.7) |
| | Black/African American | 477,633 (19.5) |
| | Other | 557,018 (22.8) |
| Age | 0 to 17 | 356,823 (14.6) |
| | 18 to 29 | 260,930 (10.67) |
| | 30 to 49 | 496,586 (20.31) |
| | 50 to 69 | 653,430 (26.73) |
| | 70 or Older | 676,978 (27.69) |
| APR Severity of Illness | Minor | 852,428 (34.9) |
| | Moderate | 899,137 (36.8) |
| | Major | 529,750 (21.7) |
| | Extreme | 163,432 (6.6) |
| Number of payment methods | 1 | 690,564 (28.2) |
| | 2 | 1,072,974 (43.9) |
| | 3 | 681,209 (27.9) |
| Admission_Elective | Yes | 473,279 (19.4) |
| Admission_Urgent | Yes | 202,173 (8.3) |
| Admission_Emergency | Yes | 1,541,911 (63.1) |
| Admission_Newborn | Yes | 224,232 (9.2) |
| Admission_Trauma | Yes | 3,152 (0.1) |
| Disposition_Home | Yes | 1,961,479 (80.2) |
| Disposition_Death/abnormal | Yes | 114,659 (4.7) |
| Disposition_Nursing Facility | Yes | 242,950 (9.9) |
| Disposition_Rehab | Yes | 125,659 (5.1) |
| Has Commercial Insurance | Yes | 1,088,911 (44.5) |
| Has Public Insurance | Yes | 1,718,771 (70.3) |
| Emergency Department Indicator | Yes | 1,439,810 (58.9) |
| APR_MedSurg | Yes | 574,177 (23.5) |
Table 7. Descriptive statistics of length of stay and total charges.

| Variable | Mean | SD | Min | P25 | Median | P75 | Max |
|---|---|---|---|---|---|---|---|
| Length of Stay (days) | 5.37 | 7.53 | 1 | 2 | 3 | 6 | 119 |
| Total Charges (USD) | 33,269.40 | 55,499.32 | 0.31 | 9,150.78 | 18,209.11 | 36,485.57 | 6,196,973.50 |
Table 8. Definition of Variables in the SPARCS Dataset.

| Symbol | Variable | Definition |
|---|---|---|
| $X_1$ / $W_1$ | Gender | Indicator equal to 1 if male, and 0 if female |
| $X_2$ / $W_2$ | Race | Indicator equal to 1 if White, and 0 otherwise |
| $Z_1$ | Age | Ordinal variable coded as: 1 (0–17), 2 (18–29), 3 (30–49), 4 (50–69), and 5 (70 and above) |
| $Z_2$ | APR Severity of Illness | Severity level coded as: 1 (Minor), 2 (Moderate), 3 (Major), and 4 (Extreme) |
| $X_3$ | Admission_Urgent | Indicator equal to 1 if hospitalized due to urgent admission, and 0 otherwise |
| $X_4$ | Admission_Emergency | Indicator equal to 1 if hospitalized due to emergency admission, and 0 otherwise |
| $X_5$ | Admission_Newborn | Indicator equal to 1 if hospitalization is for newborn care, and 0 otherwise |
| $X_6$ | Admission_Trauma | Indicator equal to 1 if hospitalized due to trauma, and 0 otherwise |
| $X_7$ | Disposition_Death/abnormal | Indicator equal to 1 if in-hospital death or abnormal discharge occurred, and 0 otherwise |
| $X_8$ | Disposition_Nursing Facility | Indicator equal to 1 if transferred to a long-term care facility after discharge, and 0 otherwise |
| $X_9$ | Disposition_Rehab | Indicator equal to 1 if transferred to a rehabilitation facility after discharge, and 0 otherwise |
| $X_{10}$ | APR_MedSurg | Indicator equal to 1 if classified as an APR medical–surgical case, and 0 otherwise |
| $W_3$ | Number of payment methods | Number of distinct payment methods used during hospitalization (1, 2, or 3) |
| $W_4$ | Has Commercial Insurance | Indicator equal to 1 if covered by commercial insurance, and 0 otherwise |
| $W_5$ | Has Public Insurance | Indicator equal to 1 if covered by public (social) insurance, and 0 otherwise |
| $W_6$ | Emergency Department Indicator | Indicator equal to 1 if emergency department services were utilized, and 0 otherwise |
Table 9. Model comparison of different Copula functions.

| Copula Type | AIC | Log-Likelihood | Parameter |
|---|---|---|---|
| Clayton | −230,737.935 | 115,369.967 | 0.64763 |
| Gaussian | −151,305.107 | 75,653.554 | 0.27196 |
| Gumbel | −103,485.443 | 51,743.721 | 1.15956 |
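
As a reading aid for the copula comparison, the following sketch recomputes AIC from the reported log-likelihoods and converts each fitted parameter to Kendall's tau so that the three families can be compared on a common dependence scale. Two assumptions are made here and are not stated in this part of the article: AIC is taken as $2k - 2\ell$ with $k = 1$ (one copula parameter), and the Parameter column is assumed to follow the standard parameterizations (Clayton $\theta > 0$, Gaussian correlation $\rho$, Gumbel $\theta \ge 1$). The helper names `aic` and `kendall_tau` are hypothetical.

```python
import math


def aic(log_lik, n_params=1):
    """Akaike information criterion, AIC = 2k - 2*logL (k = 1 assumed)."""
    return 2 * n_params - 2 * log_lik


def kendall_tau(family, param):
    """Kendall's tau implied by a copula parameter (standard closed forms)."""
    if family == "Clayton":
        return param / (param + 2.0)
    if family == "Gaussian":
        return 2.0 / math.pi * math.asin(param)
    if family == "Gumbel":
        return 1.0 - 1.0 / param
    raise ValueError(f"unsupported copula family: {family}")


# (log-likelihood, parameter) pairs copied from the copula comparison table.
fits = {
    "Clayton": (115369.967, 0.64763),
    "Gaussian": (75653.554, 0.27196),
    "Gumbel": (51743.721, 1.15956),
}

for family, (log_lik, param) in fits.items():
    print(f"{family:<9s} AIC={aic(log_lik):.3f}  tau={kendall_tau(family, param):.3f}")

# The preferred family is the one with the smallest AIC.
best = min(fits, key=lambda fam: aic(fits[fam][0]))
print("lowest AIC:", best)
```

With $k = 1$, the recomputed AIC values agree with the tabulated ones up to rounding of the reported log-likelihoods, and the Clayton copula has the lowest AIC, matching the ordering in the table.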
Table 10. Estimation results of covariates in the Cox proportional hazards model.

| Variable | Covariate Description | Coefficient |
|---|---|---|
| $X_1$ | Gender | 0.1028 |
| $X_2$ | Race | 0.0365 |
| $X_3$ | Hospitalized due to urgent admission | 0.1768 |
| $X_4$ | Hospitalized due to emergency admission | 0.1480 |
| $X_5$ | Hospitalized for newborn care | 0.2513 |
| $X_6$ | Hospitalized due to trauma | 0.0999 |
| $X_7$ | In-hospital death or abnormal discharge | 0.2249 |
| $X_8$ | Transferred to long-term care facility after discharge | 0.6110 |
| $X_9$ | Transferred to rehabilitation facility after discharge | 0.4288 |
| $X_{10}$ | APR Medical Surgical Description | 0.0584 |
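
Cox coefficients are often easier to interpret as hazard ratios. The sketch below exponentiates the coefficients listed above, assuming they are log hazard ratios from a standard proportional hazards specification and that the signs are exactly as printed (a sign lost in typesetting or extraction would invert the corresponding ratio). The dictionary labels and the helper `hazard_ratio` are hypothetical names used for illustration.

```python
import math

# Coefficients copied from the Cox model table, signs as printed.
cox_coefficients = {
    "X1 Gender": 0.1028,
    "X2 Race": 0.0365,
    "X3 Admission_Urgent": 0.1768,
    "X4 Admission_Emergency": 0.1480,
    "X5 Admission_Newborn": 0.2513,
    "X6 Admission_Trauma": 0.0999,
    "X7 Disposition_Death/abnormal": 0.2249,
    "X8 Disposition_Nursing Facility": 0.6110,
    "X9 Disposition_Rehab": 0.4288,
    "X10 APR_MedSurg": 0.0584,
}


def hazard_ratio(coef):
    """Hazard ratio for a one-unit increase in the covariate: exp(coef)."""
    return math.exp(coef)


for name, coef in cox_coefficients.items():
    print(f"{name:<32s} coef={coef:.4f}  HR={hazard_ratio(coef):.3f}")
```

A ratio above 1 means the event modeled by the Cox margin occurs at a higher instantaneous rate for that group, holding the other covariates fixed. How this maps onto longer or shorter stays depends on how the event and any censoring are defined in the model section.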
Table 11. Estimation results of covariates in the Log-Logistic model.

| Variable | Covariate Description | Coefficient |
|---|---|---|
| $W_1$ | Gender | 0.5650 |
| $W_2$ | Race | 0.4210 |
| $W_3$ | Number of payment methods | 0.8121 |
| $W_4$ | Commercial insurance | 1.1060 |
| $W_5$ | Public insurance | 0.8687 |
| $W_6$ | Emergency Department Indicator | 0.0684 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xu, X.; Wang, Y. A Copula-Based Joint Modeling Framework for Hospitalization Costs and Length of Stay in Massive Healthcare Data. Systems 2026, 14, 226. https://doi.org/10.3390/systems14020226
