Next Article in Journal
Mortality in Patients with 22q11.2 Rearrangements
Previous Article in Journal
Phenotypic Variability of LGMD 2C/R5 in a Genetically Homogenous Group of Bulgarian Muslim Roma
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

On the Optimal Combination of Elliptically Distributed Biomarkers to Improve Diagnostic Accuracy

by
Shiqi Dong
1,
Zhaohai Li
1,
Yuanzhang Li
2 and
Aiyi Liu
3,*
1
Department of Statistics, The George Washington University, Washington, DC 20052, USA
2
Department of Defense (Retired), Silver Spring, MD 20910, USA
3
Eunice Kennedy Shriver National Institute of Child Health and Human Development, Bethesda, MD 20817, USA
*
Author to whom correspondence should be addressed.
Genes 2024, 15(9), 1145; https://doi.org/10.3390/genes15091145
Submission received: 26 July 2024 / Revised: 23 August 2024 / Accepted: 25 August 2024 / Published: 30 August 2024
(This article belongs to the Special Issue Advances in Bioinformatics and Environmental Health)

Abstract

:
Diagnostic biomarkers play a critical role in biomedical research, particularly for the diagnosis and prediction of diseases, etc. To enhance diagnostic accuracy, extensive research about combining multiple biomarkers has been developed based on the multivariate normality, which is often not true in practice, as most biomarkers follow distributions that deviate from normality. While the likelihood ratio combination is recognized to be the optimal approach, it is complicated to calculate. To achieve a more accurate and effective combination of biomarkers, especially when these biomarkers deviate from normality, we propose using a receiver operating characteristic (ROC) curve methodology based on the optimal combination of elliptically distributed biomarkers. In this paper, we derive the ROC curve function for the elliptical likelihood ratio combination. Further, proceeding from the derived best combinations of biomarkers, we propose an efficient technique via nonparametric maximum likelihood estimate (NPMLE) to build empirical estimation. Simulation results show that the proposed elliptical combination method consistently provided better performance, demonstrating its robustness in handling various distribution types of biomarkers. We apply the proposed method to two real datasets: Autism/autism spectrum disorder (ASD) and neural tube defects (NTD). In both applications, the elliptical likelihood ratio combination improves the AUC value compared to the multivariate normal likelihood ratio combination and the best linear combination.

1. Introduction

Diagnostic biomarkers are essential in the diagnosis and prediction of diseases, as well as in clinical trials. Some well-known biomarkers include glycated haemoglobin (HbA1c) for diagnosis of diabetes [1], procalcitonin (PCT) and C-reactive protein (CRP) for COVID-19 [2,3], and prostate-specific-antigen (PSA) [4] for cancer. As new biomarkers for early disease detection and prevention continue to be developed, it is imperative to assess their accuracy in diagnosing diseases by comparing them with existing biomarkers.
Diagnostic accuracy indices are commonly used to measure the ability of a biomarker to discriminate or predict a disease. Consider a single biomarker M, which is measured on a continuous scale. A subject will be diagnosed as diseased if M is larger than a predefined cut point value c. Usually, it is assumed that subjects with higher biomarker values are more likely to be diseased. If this assumption does not hold, appropriate transformations can be applied. Corresponding to a cut point value c, the sensitivity (also called the true positive rate) of this biomarker is the proportion that a diseased person is correctly identified, and the specificity (also called the true negative rate) is the proportion of a non-diseased person that is correctly identified, that is,
s ( c ) = P ( M > c | the subject is diseased ) p ( c ) = P ( M < c | the subject is not diseased ) .
where s ( c ) is the sensitivity (TPR) and p ( c ) is the specificity (TPR).
The ROC curve is plotted by the sensitivity against the 1-specificity (false positive rate) over all possible cut point values of c R . Most measure indices, such as the area under the ROC curve (AUC), are derived from the ROC curve [5].
However, it has been widely acknowledged by medical researchers that diagnosis based on one single biomarker may not provide sufficient accuracy. To improve diagnostic performance, researchers have proposed numerous methods to combine multiple biomarkers and the corresponding measurements to help clinicians make better diagnostic judgments. Su and Liu (1993) derived the optimal linear combinations that could maximize AUC value [6]. They also proved that the best linear combination is equivalent to the optimal combination if both the case and control population follow multivariate normal distributions with the same covariance matrix. In 2000, Pepe and Thompson discussed an empirical estimation for solutions of the optimal linear combinations, which is non-parametric and robust against misspecified distributions [7]. However, the empirical estimation is limited by the sample size. In 2005, Liu et al. derived alternative linear combinations that exhibit higher sensitivity over a range of high (or low) specificity [8]. Furthermore, Pepe et al. (2006) proved that the likelihood ratio combination is the optimal combination [9]. The likelihood ratio, though optimal, is sensitive to distributional assumptions, which are often difficult to justify.
Although many research results have been derived based on multivariate normal distribution, this assumption is often too stringent. In clinical trials, data may follow distributions with heavy tails, such as the multivariate t distribution or light tails. In this paper, we aim to extend the optimal combination of biomarkers to a broader distribution family, elliptical distribution, to enhance its practical utility.
Let X be a d-dimensional random vector. X is said to be ‘elliptically distributed’, if and only if there exists a vector μ R d , a positive semidefinite matrix Σ R d × d , and a function ϕ : R + R such that the characteristic function t φ X μ ( t ) of X μ corresponds to t ϕ t T Σ t , t R d [10].
For an elliptically distributed random vector X, denoted as X E d ( μ , Σ , ϕ ) where μ R d is the location parameter and Σ R d × d scale matrix, it can be represented stochastically by X = d μ + R Λ U ( d ) with Λ Λ T = Σ , where U ( d ) is uniformly distributed on the unit hypersphere with d 1 topological dimensions, R is a non-negative random variable stochastically independent of U ( d ) . Then, the p.d.f of X is given by
f X ( x ) = det Σ 1 · g R ( x μ ) T Σ 1 ( x μ ) , x μ
where
g R ( t ) : = Γ d 2 2 π d / 2 · t ( d 1 ) · f R ( t ) , t > 0
and f R is the p.d.f. of R [11].
R is the generating variate, g R is the density generator and Σ is the scalar matrix. The elliptical vector X is unimodal if and only if its density generator g R is monotonically non-increasing. Most elliptical random vectors are unimodal, including multivariate normal distribution, multivariate T-distribution, multivariate Laplace distribution, and many others [12].
The elliptical distribution family has great properties, including summation stability, which ensures the sum of independent elliptically distributed random vectors with the same covariance generating matrix is elliptical [13]. The elliptical distribution family is broadly used in statistics studies. Many findings related to the normal distribution perform well when applied to the elliptical distribution. Based on the assumption of elliptical distribution, we develop the ROC curve function of likelihood ratio combination. Additionally, we derived the estimation of the elliptical likelihood ratio combination.

2. The Combination of Biomarkers Based on Elliptical Distribution

Elliptical distributions have several notable properties, such as summation stability (meaning the sum of independent elliptically distributed random vectors with the same covariance generating matrix remains elliptical). Another important property is infinite divisibility, allowing the distribution to be expressed as the sum of an arbitrary number of independent and identically distributed random variables. Additionally, elliptical distributions are self-decomposable. These properties make the elliptical distribution family particularly useful in various statistical applications, including robust statistics and generalized multivariate analysis. Given that many statistical techniques developed for the normal distribution also perform well when extended to elliptical distributions, it is both practical and feasible to extend the optimal combination methodology from the multivariate normal distribution to the broader elliptical family.

2.1. Likelihood Ratio Combination of Biomarkers with Elliptical Distribution

Proposition 1.
Suppose there are d-dimensional biomarkers which are elliptically distributed, denoted as X E d μ x , Σ x ; ϕ x and Y E d μ y , Σ y ; ϕ y in case and control groups, respectively. The vectors X , Y could be stochastically represented as X = d μ x + R x Λ x U x ( d ) and Y = d μ y + R y Λ y U y ( d ) where Λ x Λ x T = Σ x and Λ y Λ y T = Σ y . Then, the likelihood ratio (LR) combination, which is the optimal combination, could be expressed as
R ( M ) = g R x ( M μ x ) T Σ x 1 ( M μ x ) g R y ( M μ y ) T Σ y 1 ( M μ y )
where g R x and g R y are density generators of X and Y, respectively.
A subject will be diagnosed as diseased if the likelihood ratio combination R ( M ) exceeds a certain threshold c R . Generally, obtaining a likelihood ratio combination is complicated. However, under the assumption of elliptical distributions, this calculation can be simplified to the ratio of the density generators.
Example 1.
(Multivariate Normal Distribution) Suppose there are d biomarkers following multivariate normal distributions in the case and control group: X M N d ( μ x , Σ x ) , Y M N d ( μ y , Σ y ) .
The density generator of d-dimensional multivariate normal distribution is g ( t ) = 1 ( 2 π ) d / 2 · exp t 2 [10]. Then, the likelihood ratio combination is:
R ( M ) = g R x ( M μ x ) T Σ x 1 ( M μ x ) g R y ( M μ y ) T Σ y 1 ( M μ y ) = exp 1 2 ( M μ x ) T Σ x 1 ( M μ x ) ( M μ y ) T Σ y 1 ( M μ y ) ( M μ x ) T Σ x 1 ( M μ x ) ( M μ y ) T Σ y 1 ( M μ y )
A subject M will be diagnosed as diseased if R ( M ) > c , which is equivalent to that ( ( M μ x ) T Σ x 1 ( M μ x ) ( M μ y ) T Σ y 1 ( M μ y ) ) less than a cut point value. In particular, when Σ x = Σ y = Σ , the likelihood ratio becomes:
R ( M ) = exp ( μ x μ y ) T Σ 1 M 1 2 μ x T Σ 1 μ x μ y T Σ 1 μ y ( μ x μ y ) T Σ 1 M
A subject M will be diagnosed as diseased if R ( M ) > c , which is equivalent to that ( μ x μ y ) T Σ 1 M is larger than a cut point value. Under this condition, the likelihood ratio combination is a linear combination with linear coefficient λ = Σ 1 ( μ x μ y ) [6].
Example 2.
(Multivariate T Distribution) Suppose there are d biomarkers following multivariate T distributions in the case and control group: X t d ( μ x , Σ , v x ) and Y t d ( μ y , Σ , v y ) .
The density generator of multivariate t distribution t d ( μ , Σ , v ) is g R ( t ) = Γ d + v 2 Γ v 2 · 1 ( v π ) d / 2 · 1 + t v d + v 2 [10]. Then, the likelihood ratio combination is:
R ( M ) = g R x ( M μ x ) T Σ x 1 ( M μ x ) g R y ( M μ y ) T Σ y 1 ( M μ y ) 1 + ( M μ x ) T Σ x 1 ( M μ x ) v x d + v x 2 / 1 + ( M μ y ) T Σ y 1 ( M μ y ) v y d + v y 2 v x + ( M μ x ) T Σ x 1 ( M μ x ) d + v x 2 v y + ( M μ y ) T Σ y 1 ( M μ y ) d + v y 2
In particular, if v x = v y = v , i.e., the degree freedoms of the case and control group are equal, the likelihood ratio combination is given by:
R ( M ) = v + ( M μ x ) T Σ x 1 ( M μ x ) v + ( M μ y ) T Σ y 1 ( M μ y ) d + v 2
Under this condition, A subject M will be diagnosed as diseased if the value of v + ( M μ y ) T Σ y 1 ( M μ y ) v + ( M μ x ) T Σ x 1 ( M μ x ) is larger than a cut point c.
Example 3.
(Mixed Distribution) Suppose X M N d ( μ x , Σ x ) and Y t d ( μ y , Σ y , v ) . Here, the distributions in the case and control group have different distribution “types”.
Then the likelihood ratio combination is:
R ( M ) = g R x ( M μ x ) T Σ x 1 ( M μ x ) g R y ( M μ y ) T Σ y 1 ( M μ y ) exp 1 2 ( M μ x ) T Σ x 1 ( M μ x ) 1 + ( M μ y ) T Σ y 1 ( M μ y ) v d + v 2 ( d + v ) log v + ( M μ y ) T Σ y 1 ( M μ y ) ( M μ x ) T Σ x 1 ( M μ x )

2.2. Linear Combination of Biomarkers with Elliptical Distribution

Consider the linear combination of d biomarkers. Let l ( M ) = λ T M where λ = λ 1 , λ 2 , , λ d T is the combination coefficients. A subject will be diagnosed as diseased if l ( M ) is larger than a cut point. In 1993, Su and Liu derived the linear combined ROC curve under multivariate normal distribution. If X N d μ x , Σ x and Y N d μ y , Σ y are distribution of case and control group, respectively, given any specificity p, the corresponding sensitivity s could be represented by S ( p , λ ) = Φ λ T μ x μ y Φ 1 p λ T Σ x λ λ T Σ y λ , where λ is the linear combination coefficient [6]. Here we obtained the linear combined ROC curve based on elliptical distribution.
Proposition 2.
Suppose X E d μ x , Σ x ; ϕ x and Y E d μ y , Σ y ; ϕ y are distribution of case and control group, respectively, then the linear combined ROC curve could be expressed as
s ( t , λ ) = F 0 λ T μ x μ y λ T Σ x λ λ T Σ y λ λ T Σ x λ G 0 1 ( 1 t )
where s is the sensitivity (TPR) and t is the 1-specificity (FPR). F 0 is the c.d.f of λ T X λ T μ x λ T Σ x λ and G 0 is the c.d.f of λ T Y λ T μ y λ T Σ y λ
Proof. 
Let X 0 = X λ T μ x λ T Σ x λ E 1 0 , 1 ; ϕ x with p.d.f f 0 and c.d.f F 0 , Y 0 = Y λ T μ y λ T Σ y λ E 1 0 , 1 ; ϕ y with p.d.f g 0 and c.d.f G 0 . Since f 0 , g 0 are both symmetric functions, we have f 0 ( x ) = f 0 ( x ) , g 0 ( y ) = g 0 ( y ) , 1 F 0 ( x ) = F 0 ( x ) and 1 G 0 ( y ) = G 0 ( y ) .
For each cut point < c < , the false positive rate (1-specificity) t could be represented by:
t = P ( Y > c ) = P Y λ T μ y λ T Σ y λ > c λ T μ y λ T Σ y λ = G 0 λ T μ y c λ T Σ y λ
So c = λ T μ y λ T Σ y λ G 0 1 ( t ) . Then the sensitivity s is given by:
s ( t , λ ) = P ( X > c ) = F 0 λ T μ x c λ T Σ x λ = F 0 λ T μ x λ T μ y + λ T Σ y λ G 0 1 ( t ) λ T Σ x λ = F 0 λ T μ x μ y λ T Σ x λ λ T Σ y λ λ T Σ x λ G 0 1 ( 1 t )
In particular, when Σ x = Σ y = Σ , the ROC curve function could be simplified to s ( t , λ ) = F 0 λ T μ x μ y λ T Σ λ G 0 1 ( 1 t ) .

3. Non-Parametric Estimation of Elliptical Combined ROC Curve

The non-parametric maximum likelihood estimate (NPMLE) of the density generator for the elliptical unimodal distribution was derived by [12]. Sun et al. proposed parameters estimation of a multivariate elliptical distribution to fit data via Tyler’s method [14]. Building upon these, we propose the empirical estimation of ROC curve functions for the likelihood ratio combinations in an elliptical distribution.
Suppose there are n independent observations ( X 1 , X 2 , , X n ) E d μ x , Σ x ; ϕ x in case group, and m independent observations ( Y 1 , Y 2 , , Y m ) E d μ y , Σ y ; ϕ y in control group. The density generators in both groups are monotonically non-increasing, which means these biomarkers are elliptical and unimodally distributed.
The following are the main steps to obtain the empirical estimation of the likelihood ratio combination ROC curve:
Step 1: 
Define μ ^ x and Σ ^ x as the mean vector and scatter matrix estimates, respectively, obtained for the case group using Tyler’s method [14]. Similarly, let μ ^ y and Σ ^ y denote the mean vector and scatter matrix estimates for the control group observations.
Step 2: 
Let Z ( 1 ) , , Z ( n ) be the order statistics of ( Z 1 , Z 2 , , Z n ) , where Z i = ( X i μ ^ x ) T Σ 1 ( X i μ ^ x ) , i ( 1 , 2 , , n ) .
Step 3: 
Define
g ^ i = d 2 c d min s i 1 max t i F n Z ( t ) F n Z ( s ) Z ( t ) d / 2 Z ( s ) d / 2
where F n is the empirical cumulative distribution function (CDF) of the data Z 1 , , Z n , c d = π d / 2 Γ ( d / 2 ) and z R + .
Step 4: 
g ^ x ( z ) = g ^ i , if Z ( i 1 ) < z Z ( i ) 0 , otherwise is the non-parametric maximum likelihood estimate (NPMLE) of density generator g R x .
Step 5: 
Similarly, g ^ y , the NPMLE of the density generator g R y , can be obtained by repeating steps 2 to 4 for the control group observations.
Step 6: 
For i ( 1 , 2 , , n ) , let R ^ i ( X ) = g ^ x ( X i μ ^ x ) T Σ x 1 ( X i μ ^ x ) g ^ y ( X i μ ^ y ) T Σ y 1 ( X i μ ^ y ) .
Step 7: 
For j ( 1 , 2 , , m ) , let R ^ j ( Y ) = g ^ x ( Y j μ ^ x ) T Σ x 1 ( Y j μ ^ x ) g ^ y ( Y j μ ^ y ) T Σ y 1 ( Y j μ ^ y ) .
Step 8: 
For a cut point value c R , Let s ^ ( c ) = i = 1 n I ( R ^ i ( X ) > c ) n , p ^ ( c ) = j = 1 m I ( R ^ j ( Y ) > c ) m , where I is the indicator function.
Step 9: 
The estimated ROC curve of likelihood ration combination is obtained by plotting s ^ ( c ) and 1 p ^ ( c ) over all values of c R .

4. Simulation

To evaluate the performance of the estimation method described above, we conducted simulations comparing the AUC values obtained using three different combinations: the proposed elliptical combination, optimal normal combination, and best linear combination. We assume that biomarkers in the case and control groups follow a multivariate gamma distribution:
X M G s x , r x , Σ x , Y M G s y , r y , Σ y
Here, s x is a vector of shape parameters for the marginal distributions of the columns of X, r x is a vector of rate parameters for the marginal distributions of the columns of X, and Σ x is the correlation matrix. Similarly, s y , r y , and Σ y represent the parameters for the control group biomarkers
We chose the multivariate gamma distribution for simulation for several reasons. First, multivariate gamma variables inherently take values greater than 0, which aligns with the typical range of biomarker values encountered in practical scenarios. Additionally, this distribution allows flexibility in adjusting the shape parameters s and rate parameters r, thereby enabling us to model both symmetric and skewed distributions as needed. This choice of distribution facilitates a realistic simulation environment that reflects the characteristics observed in actual biomarker data.

4.1. Both X and Y are Approximately Symmetric

Firstly, we begin with the condition of approximately symmetric distributions. Suppose there are four biomarkers, let s x = 20 13 20 12 , s y = 15 13 10 20 , r x = 5 1 2 1 , r y = 3 1 1.2 1
Σ x = 1 0.2 0.3 0.1 0.2 2 0.2 0.02 0.3 0.2 1 0.1 0.1 0.02 0.1 2 and Σ y = 1 0.1 0.1 0.6 0.1 1.5 0.2 0.15 0.1 0.2 1 0.6 0.6 0.15 0.6 1 .
The theoretical optimal AUC is 0.90. Figure 1 and Figure 2 demonstrate the marginal distributions of these biomarkers in the case and control group, respectively. In either the case or control group, the marginal distributions of four biomarkers are nearly symmetric.
Based on Pepe et al. (2006), the optimal combination is the likelihood ratio combination:
R ( M ) = f X M 1 , , M 4 f Y M 1 , , M 4
where f X and f Y are density functions for case and control group, respectively [9].
We generated n x observations in the case group and n y observations in the control group, with n x < n y , which is typically the case in practice. We can evaluate the results for different sample sizes of n x and n y . The AUC was calculated using the proposed elliptical combination, multivariate normal combination, and linear combination methods. The optimal combination can be determined using Formula (6), with the corresponding AUC representing the true optimal AUC. This true AUC was used to evaluate the performance of the three methods. Bias, defined as the difference between the estimated AUC and the true AUC, was used as a measure of performance, with a bias closer to 0 indicating better performance.
We repeated the above process 500 times and calculate the median and quantiles of the biases for the three methods. Table 1 presents the simulation results for different sample sizes in the case and control groups:
From the simulation results of the median bias, we observe that the proposed elliptical combination generally outperforms the multivariate normal combination and the linear combination when the sample size is large. However, as the sample size decreases, the multivariate normal combination tends to perform better than the elliptical estimation, indicating that the Non-parametric Maximum Likelihood Estimator (NPMLE) for the elliptical combination requires a relatively large sample size to achieve optimal performance.
Overall, both the elliptical and normal combinations have better AUC values than the linear combination when all biomarkers exhibit approximately symmetric distributions in both the case and control groups. However, the elliptical combination is superior when we have a sufficient sample size, making it the preferred method in scenarios with larger datasets.

4.2. Both X and Y Are Skewed

Now, we consider the condition where all biomarkers are skewed in both the case and control groups. Let s x = 2.5 2.8 3 5 , s y = 3 2 3.5 3 , r x = 1.5 1 3 1 , r y = 2 2 3 0.6
Σ x = 1.5 0.2 0.3 0.1 0.2 2 0.2 0.02 0.3 0.2 1 0.1 0.1 0.02 0.1 2 and Σ y = 1.4 0.1 0.1 0.6 0.1 1.5 0.2 0.15 0.1 0.2 1 0.6 0.6 0.15 0.6 1 .
The theoretical optimal AUC is 0.885. Figure 3 and Figure 4 demonstrate the marginal distributions of these biomarkers in the case and control group, respectively. We can see that in both the case and control groups, the marginal distributions of four biomarkers are skewed.
Using the same simulation process as before, we generated the results under the condition that all biomarkers are skewed in both the case and control groups. Below are the results for different sample sizes of n x and n y :
From the simulation results in Table 2, it is evident that both the elliptical and multivariate normal combinations perform obviously better than the linear combination across all conditions. Furthermore, the proposed elliptical combination consistently outperforms the multivariate normal combination. For all sample sizes, the medians of biases for the elliptical combination are always closer to 0 than those of the multivariate normal combination. Additionally, the interquartile ranges (IQRs) of the elliptical combination consistently cover 0, indicating balanced performance across sample sizes, while the IQRs of the multivariate normal combination are always less than 0, suggesting a tendency to underestimate the AUC.
Notably, the elliptical combination shows superior performance over the multivariate normal and linear combinations when both the case and control groups exhibit skewed distributions. This makes the elliptical combination particularly advantageous in scenarios involving skewed biomarker distributions, as it provides more accurate and reliable AUC estimates.

4.3. X is Approximately Symmetric and Y is Skewed

Next, we consider the condition where one of the case and control groups is skewed, and the other is nearly symmetric. There is no difference in which group is skewed. For simplicity, we assume the case group X is near symmetric while the control group Y is skewed.
let s x = 10 14 21 14 , s y = 2.4 2 2.5 2 , r x = 5.2 6.7 7.9 5 , r y = 1.2 1 1 0.8
Σ x = 2 0.2 0.3 0.1 0.2 2.1 0.2 0.02 0.3 0.2 1.5 0.1 0.1 0.02 0.1 2.3 and Σ y = 1 0.1 0.1 0.6 0.1 1 0.2 0.15 0.1 0.2 1.6 0.6 0.6 0.15 0.6 1 .
The theoretical optimal AUC is 0.899. Figure 5 and Figure 6 demonstrate the marginal distributions of these biomarkers in the case and control group, respectively. Using the same simulation process as before, Table 3 presents the results across various sample sizes:
The simulation results demonstrate that the proposed elliptical combination consistently performs better than both the multivariate normal and linear combinations. Across all sample sizes, the elliptical combination exhibits medians of biases that are consistently closer to 0 compared to the other methods.
Additionally, the IQRs of the elliptical combination reliably encompass 0, reflecting a more balanced estimation. In contrast, the IQRs of the multivariate normal and linear combinations are persistently negative, indicating a systematic tendency to underestimate the AUC.
These findings underscore the effectiveness of the elliptical combination in situations where one group exhibits skewness while the other remains symmetric. This approach proves to be more reliable and accurate in estimating the AUC compared to the multivariate normal and linear combinations, making it a preferred method in such scenarios.
Table 3. Biases of Combined AUC values for Different Methods when X is Near Symmetric and Y is Skewed.
Table 3. Biases of Combined AUC values for Different Methods when X is Near Symmetric and Y is Skewed.
Bias of Estimated AUC
Elliptical CombinationMulti-Normal CombinationLinear Combination
n x = 500
n y = 700
Median−0.0066−0.0148−0.2670
Q1, Q3−0.0133, 0.0002−0.0203, −0.0096−0.2913, −0.2391
Min, Max−0.0257, 0.0064−0.0272, −0.0063−0.3142, −0.2202
n x = 375
n y = 525
Median−0.0039−0.0139−0.2658
Q1, Q3−0.0125, 0.0043−0.0208, −0.0075−0.2975, −0.2340
Min, Max−0.0175, 0.0113−0.0253, −0.0024−0.3141, −0.2214
n x = 300
n y = 500
Median−0.0029−0.0137−0.2656
Q1, Q3−0.0108, 0.0053−0.0206, −0.0071−0.2969, −0.2351
Min, Max−0.0189, 0.0135−0.0250, 0.0012−0.3196, −0.1956
n x = 250
n y = 350
Median−0.0005−0.0127−0.2628
Q1, Q3−0.0115, 0.0099−0.0210, −0.0049−0.3017, −0.2273
Min, Max−0.0258, 0.0172−0.0288, 0.0025−0.3339, −0.1849
n x = 200
n y = 280
Median0.0027−0.0122−0.2593
Q1, Q3−0.0097, 0.0139−0.0223, −0.0031−0.3012, −0.2221
Min, Max−0.0195, 0.0273−0.0331, 0.0039−0.3521, −0.1863

4.4. Mixed Distributions in X and Y: Approximately Symmetric and Skewed

Lastly, we consider the condition where the biomarkers exhibit a combination of skewed and symmetric distributions in both the case and control groups.
let s x = 2 4 3 1 , s y = 3 2 4 1 , r x = 0.3 2 0.2 1 , r y = 0.5 0.7 0.2 0.5
Σ x = 0.5 0.2 0.3 0.1 0.2 2 0.2 0.02 0.3 0.2 1 0.1 0.1 0.02 0.1 2 and Σ y = 1 0.1 0.1 0.6 0.1 1.5 0.2 0.15 0.1 0.2 1 0.6 0.6 0.15 0.6 2 .
The theoretical optimal AUC is 0.785. Figure 7 and Figure 8 demonstrate the marginal distributions of these biomarkers in the case and control group, respectively. The second and fourth biomarkers in both groups are skewed, while the first and third biomarkers are more symmetric.
The simulation results in the Table 4 indicate that the proposed elliptical combination performs better than both the multivariate normal combination and the linear combination when the biomarkers display a mixture of skewed and symmetric distributions in both the case and control groups. Although the multivariate normal and linear methods tend to produce negative biases, signaling a propensity to underestimate, the elliptical combination maintains IQRs that include 0. This suggests that the elliptical method provides a more accurate and stable estimation of the AUC.
Figure 7. Marginal Distributions of X.
Figure 7. Marginal Distributions of X.
Genes 15 01145 g007
Figure 8. Marginal Distributions of Y.
Figure 8. Marginal Distributions of Y.
Genes 15 01145 g008
Table 4. Biases of Combined AUC values for Different Methods when Mixed Distributions in X and Y.
Table 4. Biases of Combined AUC values for Different Methods when Mixed Distributions in X and Y.
Bias of Estimated AUC
Elliptical CombinationMulti-Normal CombinationLinear Combination
n x = 500
n y = 700
Median−0.0147−0.0369−0.0997
Q1, Q3−0.0248, 0.0037−0.0496, −0.0254−0.1333, −0.0648
Min, Max−0.0403, 0.0073−0.0661, −0.0115−0.2701, −0.0390
n x = 375
n y = 525
Median−0.0097−0.0358−0.1021
Q1, Q3−0.0223, 0.0037−0.0492, −0.0215−0.1354, −0.0633
Min, Max−0.0295, 0.0186−0.0602, −0.0073−0.3415, −0.0355
n x = 300
n y = 500
Median−0.0068−0.0354−0.1030
Q1, Q3−0.0201, 0.0077−0.0507, −0.0196−0.1392, −0.0659
Min, Max−0.0411, 0.0199−0.0717, −0.0108−0.3408, −0.0434
n x = 250
n y = 350
Median−0.0010−0.0343−0.1026
Q1, Q3−0.0204, 0.0170−0.0519, −0.0160−0.1410, −0.0625
Min, Max−0.0330, 0.0347−0.0866, −0.0015−0.2508, −0.0359
n x = 200
n y = 280
Median0.0053−0.0316−0.1051
Q1, Q3−0.0136, 0.0230−0.0512, −0.0125−0.1474, −0.0612
Min, Max−0.0308, 0.0320−0.0603, 0.0018−0.3638, −0.0425
Overall, the elliptical combination method consistently provided better performance across most conditions, demonstrating its robustness in handling various distribution types of biomarkers. The multivariate normal combination also performed well but showed tendencies to underestimate, particularly when dealing with skewed distributions. The linear combination was less effective, highlighting its limitations in comparison to the elliptical and multivariate normal methods. The proposed elliptical combination is more accurate and reliable in diverse conditions.

5. Application

We applied the proposed methods to case-control studies and compared the results with those from other established approaches. To evaluate the diagnostic performance of the biomarkers, we analyzed and compared the ROC curves and corresponding AUC values of the proposed elliptical likelihood ratio combination against two other methods: the multivariate normal combination proposed by Su and Liu [6] and the best linear combination introduced by Pepe et al. [9]. We expected that the proposed elliptical combination would yield a higher AUC than the alternatives. We began by demonstrating the application of the proposed combination method to the diagnosis of autism.

5.1. Biomarkers for Autism/Autistic Disorder

Autism/autism spectrum disorder (ASD) is a collection of developmental disorders characterized by challenges in communication, repetitive behaviors, and increased irritability [15]. The diagnosis of autism/autism spectrum disorders (ASD) has gained higher attention due to its diagnostic complexities. Failure to correctly diagnose children with autism/ASD may result in missed opportunities for appropriate treatments, while misdiagnosis of autism/ASD also leads to potentially harmful consequences for children [16].
In 2002–2005, the Eunice Kennedy Shriver National Institute of Child Health and Human Development conducted a case-control study to investigate the potential role of growth-related hormones in diagnosing autism in children aged 4 to 8 years [17]. The study included 59 subjects in the control group and 71 subjects in the case group. Each participant was tested for multiple diagnostic biomarkers, including insulin-like growth factors (IGF1, IGF2, IGFBP3 ), as well as dehydroepiandrosterone (DHEA) DHEA sulfate (DHEAS), and so on. Detailed descriptions of subject enrollment and data collection are provided by Mills et al. [17]. We will combine five biomarkers, including IGF1, IGF2, IGFBP3, DHEA, and DHEAS, from the data to diagnose autism/ASD and calculate combined AUC values.
The results of the application are presented in Figure 9a. Figure 9b shows the smoothed ROC curves for different combinations. The ROC curve of the elliptical likelihood ratio combination (red line) has the largest AUC value among all methods. Notably, the red line is obviously higher than all other ROC curves, particularly when the specificity is small. The multivariate normal likelihood ratio combination achieved an AUC value of 0.8009, which is higher than the empirical linear combination. The proposed elliptical combination further enhanced the AUC to 0.8157, demonstrating its superior ability to discriminate between case and control groups in this context.

5.2. Diagnosis of Neural Tube Defects (NTDs)

Neural tube defects (NTDs) are a group of birth defects, including spina bifida, anencephaly, and others [18]. NTDs are one of the most common birth defects. Maternal biomarkers play a crucial role in the early diagnosis and prevention of NTDs. By assessing the ROC curve and AUC values, we can gain insights into the potential biomarker combinations in diagnosing NTDs. This information is crucial for enhancing the accuracy and effectiveness of the screening program and providing valuable insights into the early detection and prevention of neural tube defects.
In the study conducted on the Irish National Rubella Screening Program, data were collected from blood samples obtained from pregnant women, with a focus on recording neural tube defects (NTDs). We chose four biomarkers related to the NTD diagnosis: vitamin B12, reduced red cell folate, plasma total folate, and free choline. After removing missing values, the dataset consisted of 71 observations in the case group (women with NTDs) and 230 observations in the control group (women without NTDs).
The results of the application are presented in Figure 10a. The multivariate normal likelihood ratio combination achieves an AUC value of 0.6923, which is slightly higher than the AUC value obtained with the best linear combinations. The proposed elliptical likelihood ratio combination exhibited an obviously improved AUC value of 0.7714. The elliptical likelihood ratio combination could improve the AUC by 11.43 % and 21.79 % compared to the multivariate normal likelihood ratio combination and best linear combination, respectively. This marked enhancement in performance demonstrated the effectiveness of utilizing elliptical distributions in the estimation process. In addition, the ROC curve of the elliptical likelihood ratio combination (represented by the red line) is consistently higher than all other combinations across various specificity values. Figure 10b shows the smoothed ROC curves for different combinations. Therefore, according to the combined ROC curves and AUC values, we can safely conclude that the elliptical combination significantly improved the diagnosis performance of biomarkers related to the NTDs.

6. Conclusions and Discussion

In this paper, we discuss the likelihood ratio combination of elliptically distributed biomarkers to enhance diagnostic accuracy. The non-parametric maximum likelihood estimation for the proposed method is developed. Simulation results show the advantages of using the elliptical combination for more accurate and reliable AUC calculations in diverse conditions. Compared to the multivariate normal combination, the elliptical likelihood ratio combination benefits from greater flexibility and robustness across a wider range of data distributions, including those with heavy or light tails, symmetric or skewed, making it more suitable for real-world applications. By relaxing the assumption of normality and allowing for elliptical distributions, we can fully leverage the potential of the biomarkers and improve their diagnostic accuracy.
We compare the performance of the normal likelihood ratio combination, the best linear combination, and the proposed elliptical likelihood ratio combination using autism/ASD and NTDs datasets. Our findings consistently demonstrate that the elliptical likelihood ratio combination outperforms the other combinations in terms of diagnostic accuracy.

Author Contributions

Conceptualization, A.L.; Methodology, S.D. and Y.L.; Algorithms and Data Analysis, S.D.; Writing—original draft, S.D.; Writing—review & editing, Y.L. and A.L.; Supervision, Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Acknowledgments

Research of A. Liu was supported by the Intramural Research Program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD). The authors thank Jim Mills for sharing and helpful discussion on the neural tube defects data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. World Health Organization. Use of Glycated Haemoglobin (HbA1c) in the Diagnosis of Diabetes Mellitus: Abbreviated Report of a WHO Consultation; World Health Organization: Geneva, Switzerland, 2011. [Google Scholar]
  2. Kermali, M.; Khalsa, R.K.; Pillai, K.; Ismail, Z.; Harky, A. The role of biomarkers in diagnosis of COVID-19—A systematic review. Life Sci. 2020, 254, 117788. [Google Scholar] [CrossRef] [PubMed]
  3. Zhang, L.; Guo, H. Biomarkers of COVID-19 and technologies to combat SARS-CoV-2. Adv. Biomark. Sci. Technol. 2020, 2, 1–23. [Google Scholar] [CrossRef] [PubMed]
  4. Barry, M.J. Clinical practice. prostate-specific-antigen testing for early diagnosis of prostate cancer. N. Engl. J. Med. 2001, 344, 1373–1377. [Google Scholar] [CrossRef] [PubMed]
  5. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (roc) curve. Radiology 1982, 143, 29–36. [Google Scholar] [CrossRef] [PubMed]
  6. Su, J.Q.; Liu, J.S. Linear combinations of multiple diagnostic markers. J. Am. Stat. Assoc. 1993, 88, 1350–1355. [Google Scholar] [CrossRef]
  7. Pepe, M.; Thompson, M. Combining diagnostic test results to increase accuracy. Biostatistics 2000, 1, 123–140. [Google Scholar] [CrossRef] [PubMed]
  8. Liu, A.; Schisterman, E.F.; Zhu, Y. On linear combinations of biomarkers to improve diagnostic accuracy. Stat. Med. 2005, 24, 37–47. [Google Scholar] [CrossRef] [PubMed]
  9. Pepe, M.S.; Cai, T.; Longton, G. Combining predictors for classification using the area under the receiver operating characteristic curve. Biometrics 2006, 62, 221–229. [Google Scholar] [CrossRef] [PubMed]
  10. Frahm, G. Generalized Elliptical Distributions: Theory and Applications. Ph.D. Thesis, Universität zu Köln, Köln, Germany, 2004. [Google Scholar]
  11. Fang, K.; Kotz, S.; Ng, K.W. Symmetric Multivariate and Related Distributions; Springer: New York, NY, USA, 1990. [Google Scholar]
  12. Bhattacharyya, S.; Bickel, P. Adaptive Estimation in Elliptical Distributions with Extensions to High Dimensions. 2013. Available online: https://www.semanticscholar.org/paper/ADAPTIVE-ESTIMATION-IN-ELLIPTICAL-DISTRIBUTIONS-TO-Bhattacharyya-Bickel/7df76b5d2626f4f7292d057bbf561fba4ebe697a (accessed on 1 June 2021).
  13. Hult, H.; Lindskog, F. Multivariate extremes, aggregation and dependence in elliptical distributions. Adv. Appl. Probab. 2002, 34, 587–608. [Google Scholar] [CrossRef]
  14. Sun, Y.; Babu, P.; Palomar, D.P. Regularized tyler’s scatter estimator: Existence, uniqueness, and algorithms. IEEE Trans. Signal Process. 2014, 62, 5143–5156. [Google Scholar] [CrossRef]
  15. Pinto-Martin, J.A.; Young, L.M.; Mandell, D.S.; Poghosyan, L.; Giarelli, E.; Levy, S.E. Screening strategies for autism spectrum disorders in pediatric primary care. J. Dev. Behav. Pediatr. 2008, 345–350. [Google Scholar] [CrossRef] [PubMed]
  16. Falkmer, T.; Anderson, K.; Falkmer, M.; Horlin, C. Diagnostic procedures in autism spectrum disorders: A systematic literature review. Eur. Child Adolesc. Psychiatry 2013, 22, 329–340. [Google Scholar] [CrossRef] [PubMed]
  17. Mills, J.; Hediger, M.; Molloy, C.; Chrousos, G.; Manning-Courtney, P.; Yu, K.; Brasington, M.; England, L. Elevated levels of growth-related hormones in autism and autism spectrum disorder. Clin. Endocrinol. 2007, 67, 230–237. [Google Scholar] [CrossRef] [PubMed]
  18. National Institute of Child Health and Human Development: About Neural Tube Defects (NTDs). 2017. Available online: https://www.nichd.nih.gov/health/topics/ntds/conditioninfo/ (accessed on 1 June 2021).
Figure 1. Marginal Distributions of X.
Figure 1. Marginal Distributions of X.
Genes 15 01145 g001
Figure 2. Marginal Distributions of Y.
Figure 2. Marginal Distributions of Y.
Genes 15 01145 g002
Figure 3. Marginal Distributions of X.
Figure 3. Marginal Distributions of X.
Genes 15 01145 g003
Figure 4. Marginal Distributions of Y.
Figure 4. Marginal Distributions of Y.
Genes 15 01145 g004
Figure 5. Marginal Distributions of X.
Figure 5. Marginal Distributions of X.
Genes 15 01145 g005
Figure 6. Marginal Distributions of Y.
Figure 6. Marginal Distributions of Y.
Genes 15 01145 g006
Figure 9. Combinations of Autism Data and Corresponding AUC Values.
Figure 9. Combinations of Autism Data and Corresponding AUC Values.
Genes 15 01145 g009
Figure 10. Combinations of NTDs Data and Corresponding AUC Values.
Figure 10. Combinations of NTDs Data and Corresponding AUC Values.
Genes 15 01145 g010
Table 1. Biases of Combined AUC Values for Different Methods when Both X and Y are Close to Symmetric Distributions.
Table 1. Biases of Combined AUC Values for Different Methods when Both X and Y are Close to Symmetric Distributions.
Bias of Estimated AUC
Elliptical CombinationMulti-Normal CombinationLinear Combination
n x = 500
n y = 700
Median0.0027−0.0055−0.0952
Q1, Q3−0.0020, 0.0074−0.0095, −0.0013−0.1273, −0.0736
Min, Max−0.0056, 0.0107−0.0148, 0.0016−0.1552, −0.0586
n x = 375
n y = 525
Median0.0047−0.0052−0.0978
Q1, Q3−0.0013, 0.0100−0.0105, 0.0003−0.1323, −0.0717
Min, Max−0.0069, 0.0178−0.0153, 0.0041−0.1858, −0.0524
n x = 300
n y = 500
Median0.0051−0.0050−0.0984
Q1, Q3−0.0007, 0.0115−0.0101, −0.0001−0.1377, −0.0724
Min, Max−0.0048, 0.0168−0.0146, 0.0044−0.1935, −0.0517
n x = 250
n y = 350
Median0.0075−0.0040−0.1031
Q1, Q30, 0.0148−0.0101, 0.0022−0.1434, −0.0712
Min, Max−0.0056, 0.0229−0.0161, 0.0069−0.1992, −0.0476
n x = 200
n y = 280
Median0.0096−0.0032−0.1028
Q1, Q30.0015, 0.0195−0.0104, 0.0038−0.1431, −0.0664
Min, Max−0.0109, 0.0283−0.0144, 0.0102−0.1901, −0.0370
Table 2. Biases of Combined AUC values for Different Methods when Both X and Y are Skewed.
Table 2. Biases of Combined AUC values for Different Methods when Both X and Y are Skewed.
Bias of Estimated AUC
Elliptical CombinationMulti-Normal CombinationLinear Combination
n x = 500
n y = 700
Median−0.0084−0.0188−0.0449
Q1, Q3−0.0159, −0.0006−0.0273, −0.0111−0.0582, −0.0324
Min, Max−0.0247, 0.0046−0.0353, −0.0028−0.0695, −0.0246
n x = 375
n y = 525
Median−0.0064−0.0181−0.0452
Q1, Q3−0.0150, 0.0019−0.0286, −0.0093−0.0608, −0.0309
Min, Max−0.0208, 0.0079−0.0347, −0.0037−0.0748, −0.0161
n x = 300
n y = 500
Median−0.0052−0.0181−0.0444
Q1, Q3−0.0148, 0.0044−0.0283, −0.0079−0.0640, −0.0269
Min, Max−0.0225, 0.0145−0.0368, 0.0031−0.0792, −0.0168
n x = 250
n y = 350
Median−0.0028−0.0177−0.0451
Q1, Q3−0.0136, 0.0080−0.0298, −0.0061−0.0650, −0.0268
Min, Max−0.0214, 0.0246−0.0381, 0.0074−0.0761, −0.0167
n x = 200
n y = 280
Median0.0001−0.0168−0.0428
Q1, Q3−0.0128, 0.0128−0.0295, −0.0032−0.0701, −0.0229
Min, Max−0.0256, 0.0198−0.0431, 0.0072−0.0994, −0.0014
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Dong, S.; Li, Z.; Li, Y.; Liu, A. On the Optimal Combination of Elliptically Distributed Biomarkers to Improve Diagnostic Accuracy. Genes 2024, 15, 1145. https://doi.org/10.3390/genes15091145

AMA Style

Dong S, Li Z, Li Y, Liu A. On the Optimal Combination of Elliptically Distributed Biomarkers to Improve Diagnostic Accuracy. Genes. 2024; 15(9):1145. https://doi.org/10.3390/genes15091145

Chicago/Turabian Style

Dong, Shiqi, Zhaohai Li, Yuanzhang Li, and Aiyi Liu. 2024. "On the Optimal Combination of Elliptically Distributed Biomarkers to Improve Diagnostic Accuracy" Genes 15, no. 9: 1145. https://doi.org/10.3390/genes15091145

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop