1. Introduction
The receiver operating characteristic (ROC) curve is a valuable statistical tool in medical research for evaluating the performance of binary classifiers across different thresholds. It is widely applied in fields such as radiology, oncology, and genomics [1,2]. In medical studies, ROC curves are particularly useful when evaluating a continuous biomarker to classify individuals as diseased or healthy. Graphically, the ROC curve plots sensitivity (the true positive rate) versus one minus specificity (the false positive rate) at all possible biomarker thresholds. Statistical inference for ROC curves has been studied extensively; for detailed reviews, refer to [3,4,5,6].
Let $F_0$ denote the cumulative distribution function (CDF) of the healthy population and $F_1$ denote that of the diseased population. Without loss of generality, let us assume that biomarker values are higher in the diseased group than in the healthy group, and an individual is classified as diseased when their biomarker value exceeds a given threshold ($x$). Under this assumption, the sensitivity is $1 - F_1(x)$, and the specificity is $F_0(x)$. The ROC curve is then given by
$$\mathrm{ROC}(s) = 1 - F_1\{F_0^{-1}(1 - s)\}$$
for $s \in (0, 1)$, where $F_0^{-1}(s) = \inf\{x : F_0(x) \ge s\}$.
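As a minimal sketch of the definition above, the following code traces an empirical ROC curve by plugging empirical CDFs into the sensitivity $1 - F_1(x)$ and the false positive rate $1 - F_0(x)$. The function name `empirical_roc` and the toy data are illustrative assumptions, not part of the original study.

```python
import numpy as np

def empirical_roc(healthy, diseased):
    """Return (fpr, tpr) points of the empirical ROC curve.

    An individual is classified as diseased when the biomarker exceeds x,
    so sensitivity(x) = 1 - F1(x) and 1 - specificity(x) = 1 - F0(x),
    with F0 and F1 replaced by empirical CDFs.
    """
    healthy = np.asarray(healthy, dtype=float)
    diseased = np.asarray(diseased, dtype=float)
    # Candidate thresholds: -inf plus every distinct observed value, increasing.
    pooled = np.unique(np.concatenate((healthy, diseased)))
    thresholds = np.concatenate(([-np.inf], pooled))
    fpr = np.array([(healthy > x).mean() for x in thresholds])
    tpr = np.array([(diseased > x).mean() for x in thresholds])
    return fpr, tpr

# Toy data (hypothetical): diseased values tend to be larger.
fpr, tpr = empirical_roc([1.0, 2.0, 3.0], [2.5, 3.5, 4.0])
```

Both coordinates decrease from 1 to 0 as the threshold increases, so the points run from the upper-right corner of the unit square to the lower-left corner.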
In ROC analysis, two common summary indices are used to assess a biomarker’s diagnostic accuracy: the area under the ROC curve ($\mathrm{AUC}$) [7,8] and the Youden index ($J$) [9,10,11]. They are defined mathematically as
$$\mathrm{AUC} = \int_0^1 \mathrm{ROC}(s)\,ds, \qquad J = \max_x \{F_0(x) - F_1(x)\} = F_0(c) - F_1(c), \tag{1}$$
where $c$ is the “optimal” threshold. By definition, the $\mathrm{AUC}$ summarizes the overall performance of a classifier across all possible thresholds. While valuable, it does not directly provide an “optimal” threshold. On the other hand, $J$ is the maximum value of the sensitivity plus the specificity minus 1. Not only does $J$ quantify the biomarker’s effectiveness (with a value of 1 indicating complete separation of the biomarker’s distributions for diseased and healthy populations, and 0 indicating complete overlap), but it also offers a distinct advantage over the $\mathrm{AUC}$ by providing a criterion for selecting the “optimal” threshold $c$. However, $J$ only measures the diagnostic accuracy at the “optimal” threshold $c$ and not at other thresholds.
In practical scenarios where medical practitioners encounter multiple biomarkers, they often use the $\mathrm{AUC}$ to choose the most diagnostically useful biomarker [12,13]. However, relying solely on the $\mathrm{AUC}$ has limitations. The biomarker with the highest $\mathrm{AUC}$ might not have the best overall accuracy at the “optimal” threshold. Similarly, focusing only on the Youden index selects the biomarker with the highest total accuracy at the “optimal” threshold. But this “best” biomarker by the Youden index may not perform well overall. If the threshold changes, the biomarker may no longer maintain satisfactory diagnostic accuracy. For real examples and further discussions, refer to [1,14]. In summary, both the $\mathrm{AUC}$ and the Youden index are valuable tools for evaluating a biomarker’s effectiveness, each emphasizing distinct aspects of its performance. Simultaneously examining $\mathrm{AUC}$ and $J$, which provide complementary information, may help us make better decisions [1]. This motivates us to develop joint inference procedures for $\mathrm{AUC}$ and $J$ in this paper.
In the literature, ref. [1] considered both parametric and nonparametric methods for constructing confidence regions of $\mathrm{AUC}$ and $J$. Later, ref. [2] proposed both parametric and nonparametric tests to determine if a biomarker exceeds predefined target values with hypotheses $H_0: \mathrm{AUC} \le \mathrm{AUC}_0$ or $J \le J_0$ versus $H_a: \mathrm{AUC} > \mathrm{AUC}_0$ and $J > J_0$. For the parametric inference procedures, it is assumed that the original biomarkers or the biomarkers after the Box–Cox transformation follow normal distributions in both the healthy and diseased groups. For the nonparametric inference procedures, the empirical CDF or kernel method is used to estimate $F_0$ and $F_1$.
Generally, parametric joint inference procedures are highly efficient when the underlying parametric models are correct. This means that the resulting confidence region for $(\mathrm{AUC}, J)$ has a smaller area, and the joint testing procedure has greater power. However, these procedures may not be robust to misspecification of the models for $F_0$ and $F_1$. See Section 4 for more details. On the other hand, nonparametric methods are free from assumptions about the models of $F_0$ and $F_1$. In medical research, it has been observed that healthy and diseased populations often share certain common characteristics [4,15,16,17]. However, fully nonparametric methods ignore this information, potentially leading to inefficient inference procedures.
In this paper, we develop new semiparametric joint inference procedures for $(\mathrm{AUC}, J)$ based on a semiparametric density ratio model (DRM; refs. [18,19,20]), which effectively utilizes information from both healthy and diseased populations. Let $f_0$ and $f_1$ be the probability density functions of $F_0$ and $F_1$, respectively. The DRM assumes that
$$f_1(x) = \exp\{\alpha + \beta^\top q(x)\}\, f_0(x), \tag{2}$$
where $q(x)$ is a prespecified, $p$-variate vector-valued nontrivial function of $x$, and $\alpha$ and $\beta$ are unknown parameters. The unspecified baseline distribution $F_0$ makes DRM a semiparametric model. This flexibility allows DRM to encompass many distributions commonly used in studying ROC curves [21]. For instance, if we set $q(x)$ to $\log x$, the DRM encompasses the lognormal distributions (with equal variance on the log scale) and the beta distributions (sharing the same power parameter for $1 - x$). Similarly, setting $q(x)$ to $x$, it includes the normal distributions with the same variance and exponential distributions. The DRM has a close relationship with the logistic regression model. To illustrate this point, let us define $D = 0$ and $1$ as indicators for individuals from the healthy and diseased populations, respectively. As shown by [18,19], the DRM is equivalent to the logistic regression model through the following equation:
$$P(D = 1 \mid x) = \frac{\exp\{\alpha^* + \beta^\top q(x)\}}{1 + \exp\{\alpha^* + \beta^\top q(x)\}},$$
where $\alpha^* = \alpha + \log\{P(D = 1)/P(D = 0)\}$.
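The link between the DRM and logistic regression can be checked numerically. The sketch below assumes $q(x) = x$ with healthy values from $N(0, 1)$ and diseased values from $N(1, 1)$, for which the log density ratio is $-0.5 + x$, so the DRM parameters are $\alpha = -0.5$ and $\beta = 1$. It fits a logistic regression by Newton-Raphson and, under the usual case-control argument, replaces $P(D = 1)/P(D = 0)$ by the sample-size ratio $n_1/n_0$ to recover $\alpha$. The function name, sample sizes, and seed are illustrative assumptions, and this is not the empirical likelihood procedure developed later in the paper.

```python
import numpy as np

def fit_logistic(q, d, n_iter=25):
    """Newton-Raphson fit of P(D=1 | x) = expit(a + b' q(x)); returns (a, b...)."""
    X = np.column_stack((np.ones(len(d)), q))
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        grad = X.T @ (d - p)                       # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        theta = theta + np.linalg.solve(hess, grad)
    return theta

rng = np.random.default_rng(0)
n0, n1 = 3000, 2000
healthy = rng.normal(0.0, 1.0, n0)    # plays the role of F0
diseased = rng.normal(1.0, 1.0, n1)   # plays the role of F1

x = np.concatenate((healthy, diseased))
d = np.concatenate((np.zeros(n0), np.ones(n1)))

alpha_star_hat, beta_hat = fit_logistic(x, d)    # q(x) = x
alpha_hat = alpha_star_hat - np.log(n1 / n0)     # undo the sampling-ratio offset
```

With these sample sizes the estimates should land close to $\alpha = -0.5$ and $\beta = 1$; the intercept correction matters whenever $n_1 \ne n_0$.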
The DRM has proven itself as a valuable tool for inference on ROC curves and their summary indices [4,15,17,22]. Existing theoretical and numerical studies have shown that point estimators of $\mathrm{AUC}$ and $J$ under the DRM are more efficient than fully nonparametric estimators. However, as far as we are aware, semiparametric joint inference procedures for $(\mathrm{AUC}, J)$, such as confidence regions and joint hypothesis testing procedures, remain uninvestigated under the DRM (2). This paper aims to fill this gap.
Our contributions are three-fold. First, we establish the joint asymptotic normality of the maximum empirical likelihood estimator (MELE) of $(\mathrm{AUC}, J)$ under the DRM (2). This allows us to construct an asymptotically valid Wald-type confidence region for $(\mathrm{AUC}, J)$. We further propose a nonparametric bootstrap procedure to improve the coverage accuracy of the Wald-type confidence region. Second, we develop a joint testing procedure for the null hypothesis $H_0: \mathrm{AUC} \le \mathrm{AUC}_0$ or $J \le J_0$ versus the alternative hypothesis $H_a: \mathrm{AUC} > \mathrm{AUC}_0$ and $J > J_0$. We introduce a novel bootstrap procedure to obtain its $p$-value. Finally, we evaluate the performance of our proposed methods through simulation studies and an application to real data on Duchenne Muscular Dystrophy. The numerical studies demonstrate that the proposed method produces confidence regions for $(\mathrm{AUC}, J)$ with smaller areas. Additionally, the newly proposed joint testing procedure maintains controlled type-I error rates while achieving satisfactory power.
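For orientation, a Wald-type region for a two-dimensional parameter has the following generic form. The symbols $\theta$, $\hat\theta$, $\hat\Sigma$, $n$, and $\gamma$ are notation assumed here for illustration; the exact normalization and variance estimator used in this paper are those given in its later sections. Suppose $\theta = (\mathrm{AUC}, J)^\top$, and that an estimator $\hat\theta$ satisfies $\sqrt{n}(\hat\theta - \theta) \to N(0, \Sigma)$ in distribution, with $\hat\Sigma$ a consistent estimator of a nonsingular $\Sigma$. Then

```latex
\[
  \mathcal{C}_{1-\gamma}
  = \left\{ \theta :
      n\,(\hat\theta - \theta)^\top \hat\Sigma^{-1} (\hat\theta - \theta)
      \le \chi^2_{2,\,1-\gamma}
    \right\},
  \qquad
  \chi^2_{2,\,1-\gamma} = -2\log\gamma ,
\]
```

is an ellipse centered at $\hat\theta$ with asymptotic coverage probability $1 - \gamma$. At the 95% level, $\chi^2_{2,\,0.95} = -2\log(0.05) \approx 5.991$. The area of this ellipse is $\pi \chi^2_{2,\,1-\gamma} \sqrt{\det \hat\Sigma}/n$, so a more efficient estimator (smaller $\det\Sigma$) yields a region with a smaller area.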
The rest of the paper is structured as follows. In Section 2, we introduce the MELE of $(\mathrm{AUC}, J)$ and prove its joint asymptotic normality. Section 3 details the proposed joint inference procedures. This includes constructing confidence regions and conducting joint hypothesis tests for $(\mathrm{AUC}, J)$. Section 4 presents simulation results and Section 5 contains a real application. A summary and discussion of the findings are given in Section 6 and Section 7, respectively.
5. Real Data Analysis
In this section, we evaluate the performance of the proposed methods using a dataset on Duchenne Muscular Dystrophy (DMD). DMD is a genetic disorder characterized by progressive muscle weakness and wasting. It is caused by mutations in the dystrophin gene, the largest human gene, located on the X chromosome (Xp21). DMD primarily affects males in early childhood. Interestingly, females with one copy of the mutated gene typically do not show symptoms. Therefore, identifying potential female carriers is crucial.
According to [26], individuals carrying the DMD gene mutation may not exhibit symptoms but often have elevated levels of specific biomarkers. The authors of [27] compiled a dataset encompassing four biomarkers: Creatine Kinase (CK), Hemopexin (H), Lactate Dehydrogenase (LD), and Pyruvate Kinase (PK). These biomarkers were measured in blood serum samples from a healthy control group ($n_0$ = 127) and a group of DMD carriers ($n_1$ = 67).
For illustration, we consider the biomarkers PK and H. We choose
in the proposed methods for each biomarker. We perform the goodness-of-fit test suggested in Remark 1; the $p$-values based on 1000 bootstrap samples are 0.215 and 0.780 for PK and H, respectively. This suggests that the DRM in (2) with
provides reasonable fits for both biomarkers PK and H.
Table 6 presents the point estimates (PEs) of $(\mathrm{AUC}, J)$ and the ACRs for $(\mathrm{AUC}, J)$ at the 95% confidence level based on the BEL, GPQ, and BTAT methods. We omit the results for the EL, AD, and BTI methods because, as shown in Section 4, the BEL method achieves better coverage than the EL method, the AD method has larger ACRs than the GPQ method, and the BTI method performs similarly to the BTAT method. Clearly, the BEL method gives the confidence region with the smallest area. As an illustration, we further plot the 95% confidence regions of $(\mathrm{AUC}, J)$ for the biomarker H based on the BEL, GPQ, and BTAT methods in Figure 1, which leads to the same conclusions as Table 6.
To illustrate the proposed joint test method BELT, we assess whether the biomarker PK exceeds the prespecified target values of
and
simultaneously. These values represent the PE of $(\mathrm{AUC}, J)$ based on the BEL method for biomarker H. Our BELT method with
gives the
$p$-value 0.022. This result provides strong evidence to reject the null hypothesis $H_0: \mathrm{AUC} \le \mathrm{AUC}_0$ or $J \le J_0$ at the 5% significance level. In contrast, neither the PBA test nor the NKS test from [2] rejects the same null hypothesis. In conclusion, our BELT method provides stronger evidence against the null hypothesis $H_0: \mathrm{AUC} \le \mathrm{AUC}_0$ or $J \le J_0$. This implies that the biomarker PK has better discriminatory ability than biomarker H in terms of both the $\mathrm{AUC}$ and the Youden index $J$.