A Dunnett-Type Test and Its Sample Size Calculation for Comparing K ROC Curves with a Control

Sin-Ho Jung

doi:10.3390/diagnostics14161813

Abstract

Diagnostic biomarkers are key components of diagnostics. In this paper, we consider diagnostic biomarkers taking continuous values that are associated with a dichotomous disease status, called malignant or benign. The performance of such a biomarker is evaluated by the area under the curve (AUC) of its receiver operating characteristic curve. We assume that, together with the disease status, one control and multiple experimental biomarkers are collected from each subject to test if any of the experimental biomarkers have a larger AUC than the control. In this case, each experimental biomarker will be compared with the control so that a multiple testing issue is involved in the comparisons. In this paper, we propose a simple non-parametric statistical testing procedure to compare

K (\geq 2)

experimental biomarkers with a control, adjusting for multiplicity, together with its sample size calculation method. Our sample size formula requires the specification of the AUC values (or the standardized effect size of each biomarker between the benign and malignant groups) together with the correlation coefficients between the biomarkers, the prevalence of the malignant group in the study population, the type I error rate, and the power. Through simulations, we show that the statistical test controls the overall type I error rate accurately and the proposed sample size closely maintains the specified statistical power.

Keywords:

AUC; biomarker; family-wise error rate; location-shift model; prevalence; sensitivity; specificity

1. Introduction

Different types of biomarkers are measured from the tumor, blood, or urine using molecular, biochemical, physiological, anatomical, or imaging methods. The observed biomarkers are used for various purposes during the diagnosis and treatment of diseases. For example, diagnostic biomarkers are used to diagnose diseases, predictive biomarkers are used to predict the response to specific treatments, and prognostic biomarkers are used to measure the aggressiveness of a disease for patients with no treatment or a non-targeted treatment. When the outcome is dichotomous and the biomarker is continuous, the performance of a biomarker is evaluated by the area under the curve (AUC) of its receiver operating characteristic (ROC) curve, which plots the sensitivity against one minus the specificity calculated for all the possible cutoff values of the biomarker. In this paper, we consider diagnostic biomarkers taking continuous values that are associated with a dichotomous disease status, called benignity or malignancy, but the proposed methods can be used for any type of biomarker with such properties.

Various parameters have been used to measure the performance of biomarkers and to compare different biomarkers. For instance, the sensitivity at a fixed specificity level, determined by the cutoff value of each biomarker, may be compared between biomarkers [1,2]. Some investigators have proposed testing methods to compare the whole or partial AUC of the ROC curves of biomarkers [3,4,5,6,7].

Emir et al. [8] extended the method of DeLong et al. [6] to the comparison of ROC curves from longitudinally collected biomarkers, and Jung et al. [9] proposed an ROC-type method to compare the performance of biomarkers that are associated with a time-to-event endpoint such as progression-free survival.

Sample size calculation is an important step in the development stage of medical projects. Hwang and Su [10] proposed a sample size formula for the test statistic of DeLong et al. [6]. Their variance formula requires the specification of many parameter values. Furthermore, these parameters are not conceptually interpretable, so it is almost impossible to specify them accurately in sample size calculations. This can prevent us from using their sample size formula in designing real biomarker projects. Jung [11] proposed a simple non-parametric test statistic that is asymptotically identical to that of DeLong et al. [6], together with its sample size formula. Compared to the formula of Hwang and Su [10], his formula requires the specification of a much smaller number of design parameters, and the design parameters are intuitively interpretable. These sample size methods are limited to the comparison of two biomarkers.

In this paper, we consider the cases where a control biomarker (usually an existing one) and

K (\geq 2)

experimental biomarkers are collected from each subject together with a dichotomous disease status. We want to discover the experimental biomarkers that are more closely associated with the disease outcome than the control biomarker. To this end, we compare the AUC of the ROC curve of each experimental biomarker with that of the control using the two-sample test statistic [11]. With K experimental biomarkers to be compared with the control, we will conduct K statistical tests, so the discovery of new biomarkers will involve a multiple testing issue. We propose a multiple testing procedure using a Dunnett-type approach [12,13]. We derive a sample size formula for the multiple testing procedure too. While the statistical test is completely non-parametric, its power depends on the distributions of the biomarkers. Furthermore, the biomarkers to be compared are correlated because they are collected from each patient. This results in additional problems in deriving the statistical testing method and its sample size formula.

Since the AUC of an ROC curve is a rank statistic, it is invariant to monotone transformations of biomarker values. Hence, if we find a transformation of biomarker distributions to symmetric distributions, such as normal distributions, with a location-shift model and unit variances, then, for a sample size calculation, we need to specify only the location parameter between the benign and malignant groups for each biomarker (which corresponds to the AUC) and the correlation coefficient between the paired biomarkers after the transformation. A log-transformation followed by standardization often achieves this goal. Through simulations, we show that our multiple testing procedure accurately controls the overall type I error rate, called the family-wise error rate (FWER), and its sample size formula closely maintains a specified statistical power.

2. A Dunnett-Type Test to Compare K Biomarkers with a Control

Suppose that biomarker 0 denotes a control (or existing) biomarker and biomarkers

1, \dots, K

denote K experimental biomarkers that are collected from each subject. Let

X_{k}

and

Y_{k}

denote the random variables for observations of biomarker

k (= 0, 1, \dots, K)

from malignant and benign cases, respectively. In a data set, we have copies of

X_{k}

,

(x_{k 1}, \dots, x_{k, m})

, from m malignant cases, and copies of

Y_{k}

,

(y_{k 1}, \dots, y_{k, n})

, from n benign cases. For the total sample size

N (= m + n)

,

m / N \to γ \in (0, 1)

denotes the prevalence of malignancy in the target population. As can be seen from our sample size formula and the results of numerical studies in Section 5, our proposed test will have a maximum power if benign and malignant cases are balanced, i.e.,

γ = 1 / 2

. Let

F_{k} (t) = P (X_{k} \leq t)

and

G_{k} (t) = P (Y_{k} \leq t)

denote the cumulative distribution function (CDF) of biomarker k for malignant cases and benign cases, respectively, and

{\hat{F}}_{k} (t) = m^{- 1} \sum_{i = 1}^{m} I (x_{k i} \leq t)

and

{\hat{G}}_{k} (t) = n^{- 1} \sum_{j = 1}^{n} I (y_{k j} \leq t)

be their empirical estimators.

Without loss of generality, we assume that the raw biomarker values are non-negative and a large biomarker value is associated with a higher chance of malignancy. As such, for a cutoff value t,

{\bar{F}}_{k} (t) = 1 - F_{k} (t)

is called the sensitivity and

G_{k} (t)

is called the specificity of biomarker k. For

{\bar{G}}_{k} (t) = 1 - G_{k} (t)

, the ROC curve of biomarker k is the plot of

({\bar{G}}_{k} (t), {\bar{F}}_{k} (t))

with respect to all possible cutoff values t.

The AUC of the ROC curve for biomarker k is

θ_{k} = P (X_{k} \geq Y_{k}) = \int_{0}^{\infty} G_{k} (t) d F_{k} (t)

, which is estimated by

{\hat{θ}}_{k} = \int_{0}^{\infty} {\hat{G}}_{k} (t) d {\hat{F}}_{k} (t) = \frac{1}{m n} \sum_{i = 1}^{m} \sum_{j = 1}^{n} I (x_{k i} \geq y_{k j})
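As an illustration only, the empirical AUC above can be sketched in a few lines of Python. The function name `empirical_auc` and the toy data are hypothetical and are not taken from the paper, whose own software is a Fortran program.

```python
def empirical_auc(x, y):
    """Empirical AUC: share of (malignant, benign) pairs with x_i >= y_j."""
    m, n = len(x), len(y)
    concordant = sum(1 for xi in x for yj in y if xi >= yj)
    return concordant / (m * n)


# Hypothetical toy data: 4 malignant and 3 benign values of one biomarker.
x_malignant = [2.1, 3.4, 1.8, 4.0]
y_benign = [1.0, 2.5, 1.7]
auc = empirical_auc(x_malignant, y_benign)  # 10 of the 12 pairs are concordant
print(round(auc, 4))
```

The double loop costs O(mn) operations, which is adequate for the sample sizes considered in this paper; a rank-based implementation would be preferable for very large data sets.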

For experimental biomarker

k (= 1, \dots, K)

, we want to test the null hypothesis

H_{k} : θ_{k} = θ_{0}

against an alternative hypothesis

{\bar{H}}_{k} : θ_{k} > θ_{0}

using

{\hat{θ}}_{k} - {\hat{θ}}_{0}

. It was shown in [11] that, for

k = 0, 1, \dots, K

, we have

{\hat{θ}}_{k} - θ_{k} = m^{- 1} \sum_{i = 1}^{m} ϵ_{k i} + n^{- 1} \sum_{j = 1}^{n} e_{k j} + o_{p} (N^{- 1 / 2})

(1)

where

ϵ_{k i} = \int_{0}^{\infty} G_{k} (t) d \{I (x_{k i} \leq t) - F_{k} (t)\} = G_{k} (x_{k i}) - θ_{k}

and

e_{k j} = \int_{0}^{\infty} \{I (y_{k j} \leq t) - G_{k} (t)\} d F_{k} (t) = {\bar{F}}_{k} (y_{k j}) - θ_{k}

. Hence, for

k = 1, \dots, K

,

\sqrt{N} ({\hat{θ}}_{k} - {\hat{θ}}_{0}) = \sqrt{N} \{m^{- 1} \sum_{i = 1}^{m} (ϵ_{k i} - ϵ_{0 i}) + n^{- 1} \sum_{j = 1}^{n} (e_{k j} - e_{0 j})\} + Δ_{k} \sqrt{N} + o_{p} (1)

(2)

where

Δ_{k} = θ_{k} - θ_{0}

.

Note that

(ϵ_{0 i}, ϵ_{1 i}, \dots, ϵ_{K i})

and

(e_{0 j}, e_{1 j}, \dots, e_{K j})

are

(K + 1)

-dimensional independent random vectors with mean 0 because malignant and benign groups are independent. Furthermore, under

H_{k}

, we have

Δ_{k} = 0

, so that, by the central limit theorem (CLT),

\sqrt{N} ({\hat{θ}}_{k} - {\hat{θ}}_{0}) / {\hat{σ}}_{k}

converges to

N (0, 1)

, where

{\hat{σ}}_{k}^{2} = N \{m^{- 2} \sum_{i = 1}^{m} {({\hat{ϵ}}_{k i} - {\hat{ϵ}}_{0 i})}^{2} + n^{- 2} \sum_{j = 1}^{n} {({\hat{e}}_{k j} - {\hat{e}}_{0 j})}^{2}\}

and

{\hat{ϵ}}_{k i}

and

{\hat{e}}_{k j}

are obtained from

ϵ_{k i}

and

e_{k j}

, respectively, by replacing

F_{k}

and

G_{k}

with their consistent estimators

{\hat{F}}_{k}

and

{\hat{G}}_{k}

. That is,

{\hat{ϵ}}_{k i} = {\hat{G}}_{k} (x_{k i}) - {\hat{θ}}_{k} = \frac{1}{n} \sum_{j = 1}^{n} I (y_{k j} \leq x_{k i}) - {\hat{θ}}_{k}

and

{\hat{e}}_{k j} = {\hat{\bar{F}}}_{k} (y_{k j}) - {\hat{θ}}_{k} = \frac{1}{m} \sum_{i = 1}^{m} I (x_{k i} \geq y_{k j}) - {\hat{θ}}_{k}

The univariate (or unadjusted) test rejects

H_{k}

if

Z_{k} = \sqrt{N} ({\hat{θ}}_{k} - {\hat{θ}}_{0}) / {\hat{σ}}_{k} > z_{1 - α}

, where

z_{1 - α}

denotes the

100 (1 - α)

th percentile of the

N (0, 1)

distribution. Appendix A of Jung [11] shows that the variance estimator

{\hat{σ}}_{k}^{2}

is asymptotically identical to that of DeLong et al. [6]. In this sense, our univariate test statistic is asymptotically identical to that of DeLong et al. [6].
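A minimal Python sketch of the univariate statistic is given below, assuming that the control and experimental biomarker values are stored in lists aligned by subject. The helper names `auc_terms` and `z_statistic` and the toy data are hypothetical. Because the ratio of \sqrt{N} to the estimated standard deviation reduces to dividing by the square root of the bracketed sum in the variance estimator, N does not appear explicitly in the code.

```python
import math


def auc_terms(x, y):
    """Return (theta_hat, eps_hat, e_hat) for one biomarker.

    x: values from the m malignant cases; y: values from the n benign cases.
    """
    m, n = len(x), len(y)
    # G_hat(x_i): fraction of benign values <= x_i
    g_at_x = [sum(1 for yj in y if yj <= xi) / n for xi in x]
    # F_bar_hat(y_j): fraction of malignant values >= y_j
    fbar_at_y = [sum(1 for xi in x if xi >= yj) / m for yj in y]
    theta = sum(g_at_x) / m
    eps = [g - theta for g in g_at_x]
    e = [f - theta for f in fbar_at_y]
    return theta, eps, e


def z_statistic(x0, y0, xk, yk):
    """Z_k = sqrt(N) (theta_k_hat - theta_0_hat) / sigma_k_hat."""
    m, n = len(x0), len(y0)
    theta0, eps0, e0 = auc_terms(x0, y0)
    thetak, epsk, ek = auc_terms(xk, yk)
    # sigma_k_hat^2 / N = m^-2 sum (eps_k - eps_0)^2 + n^-2 sum (e_k - e_0)^2
    var_diff = (sum((a - b) ** 2 for a, b in zip(epsk, eps0)) / m ** 2
                + sum((a - b) ** 2 for a, b in zip(ek, e0)) / n ** 2)
    return (thetak - theta0) / math.sqrt(var_diff)


# Hypothetical paired data: index i (or j) refers to the same subject.
x0 = [1.2, 0.4, 2.0, 1.5, 0.9]
y0 = [0.3, 1.0, 0.1, 0.8, 1.4, 0.5]
x1 = [2.2, 1.9, 2.5, 1.1, 3.0]
y1 = [0.2, 0.9, 1.3, 0.4, 1.0, 0.6]
print(z_statistic(x0, y0, x1, y1))
```

Since only the ordering of the values enters the indicators, applying a strictly increasing transformation (such as the exponential) to a biomarker leaves the statistic unchanged.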

When there are multiple experimental biomarkers (i.e.,

K \geq 2

), the multiple univariate tests will inflate the false positivity of the biomarker discovery. For accurate control of false positivity, we need a multiple testing adjustment. We propose to control the FWER, i.e., the probability of accepting at least one experimental biomarker when none of the K experimental biomarkers is better than the control. For the overall null hypothesis

H_{0} = \cap_{k = 1}^{K} H_{k}

and a common critical value

c = c_{α}

, controlling the FWER at

α

is obtained by solving

α = P \{\cup_{k = 1}^{K} (Z_{k} > c) | H_{0}\}

(3)

with respect to c. Note that the critical region for our testing has a square shape. DeLong et al. [6] proposed a one-way ANOVA-type test with an ellipsoidal critical region, but our Dunnett-type test will be more appropriate when comparing multiple experimental biomarkers with a common control.

In order to calculate the FWER provided in Equation (3), we need the joint distribution of

(Z_{1}, \dots, Z_{K})

under

H_{0}

. Applying the multivariate CLT to Equation (2), under

H_{0}

,

(Z_{1}, \dots, Z_{K})

is asymptotically normal with mean 0 and variance

Σ

that can be consistently estimated by

\hat{Σ} = {({\hat{ρ}}_{k k^{'}})}_{k, k^{'} = 1, \dots, K}

where

{\hat{ρ}}_{k k^{'}} = {\hat{σ}}_{k k^{'}} / \sqrt{{\hat{σ}}_{k}^{2} {\hat{σ}}_{k^{'}}^{2}}

and

{\hat{σ}}_{k k^{'}} = N \{m^{- 2} \sum_{i = 1}^{m} ({\hat{ϵ}}_{k i} - {\hat{ϵ}}_{0 i}) ({\hat{ϵ}}_{k^{'} i} - {\hat{ϵ}}_{0 i}) + n^{- 2} \sum_{j = 1}^{n} ({\hat{e}}_{k j} - {\hat{e}}_{0 j}) ({\hat{e}}_{k^{'} j} - {\hat{e}}_{0 j})\}

For a given critical value c, the FWER is obtained by calculating the probability on the right-hand side of Equation (3) based on the multivariate normal distribution using simulations or a numerical integration method. Finally, the critical value

c = c_{α}

controlling the FWER at

α

is obtained by solving Equation (3) with respect to c and accepting (or discovering) biomarker

k (= 1, \dots, K)

if

Z_{k} > c_{α}

.
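The simulation route to the critical value can be sketched as follows. Equation (3) states that the critical value is the upper \alpha quantile of the maximum of the K statistics under the overall null hypothesis, so one may draw multivariate normal vectors with the estimated correlation matrix and read off the empirical quantile of the maximum. The function names, the number of draws, and the seed are assumptions made for illustration; this is not the author's Fortran implementation.

```python
import math
import random


def cholesky(a):
    """Lower-triangular Cholesky factor of a positive-definite matrix."""
    k = len(a)
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = sum(low[i][p] * low[j][p] for p in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def critical_value_mc(corr, alpha=0.05, draws=100_000, seed=1):
    """Monte Carlo estimate of c_alpha: the (1 - alpha) quantile of max_k Z_k."""
    rng = random.Random(seed)
    low = cholesky(corr)
    k = len(corr)
    maxima = []
    for _ in range(draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(k)]
        # Correlated normals: z = L g, where L L^T equals the correlation matrix.
        z = [sum(low[i][p] * g[p] for p in range(i + 1)) for i in range(k)]
        maxima.append(max(z))
    maxima.sort()
    return maxima[math.ceil((1.0 - alpha) * draws) - 1]


# Hypothetical correlation matrix for K = 3 test statistics.
corr_hat = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
c_demo = critical_value_mc(corr_hat)
print(c_demo)
```

The Monte Carlo error of the estimated quantile shrinks at the rate of one over the square root of the number of draws, so a larger `draws` value may be needed when the critical value must be stable to several decimal places.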

When

K = 2

, the double integration for the probability calculation can be simplified to a single integration. If

(Z_{1}, Z_{2})

is a bivariate normal random vector with means 0, variances 1, and correlation coefficient

ρ_{12}

, then the conditional distribution of

Z_{1}

given

Z_{2} = z

is normal with mean

ρ_{12} z

and variance

(1 - ρ_{12}^{2})

. Hence, from Equation (3), we have

FWER = P \{(Z_{1} > c) \cup (Z_{2} > c)\} = 1 - P \{(Z_{1} \leq c) \cap (Z_{2} \leq c)\}

so the critical value

c = c_{α}

controlling the FWER at

α

is obtained by solving the equation

α = 1 - \int_{- \infty}^{c} ϕ (z) Φ (\frac{c - {\hat{ρ}}_{12} z}{\sqrt{1 - {\hat{ρ}}_{12}^{2}}}) d z

(4)

with respect to c, where

ϕ (\cdot)

and

Φ (\cdot)

are the probability density function (PDF) and CDF of

N (0, 1)

distribution, respectively.
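For K = 2, Equation (4) can be solved with a one-dimensional quadrature and a bisection search. The sketch below uses only the Python standard library; the Simpson rule, the truncation of the lower limit at -8, and the function names are choices made here for illustration and are not prescribed by the paper.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def fwer_k2(c, rho, steps=2000, lower=-8.0):
    """Right-hand side of Equation (4): 1 - P(Z1 <= c, Z2 <= c), |rho| < 1."""
    h = (c - lower) / steps
    s = sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps + 1):
        z = lower + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)  # Simpson weights
        total += weight * STD.pdf(z) * STD.cdf((c - rho * z) / s)
    return 1.0 - total * h / 3.0


def critical_value_k2(rho, alpha=0.05, tol=1e-8):
    """Bisection on c; the FWER is decreasing in c."""
    lo, hi = 0.0, 5.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if fwer_k2(mid, rho) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(critical_value_k2(0.65))
```

When the correlation is 0, the two statistics are independent and the solution reduces to the standard normal quantile at \sqrt{1 - \alpha}, which gives a convenient check of the numerical routine.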

For biomarker

k (= 1, \dots, K)

with test statistic value

Z_{k} = z_{k}

, we may calculate the FWER-adjusted p-value by

p_{k} = P (Z_{1} > z_{k}, or \dots, or Z_{K} > z_{k} | H_{0}) = P \{\cup_{k^{'} = 1}^{K} (Z_{k^{'}} > z_{k}) | H_{0}\}

3. Sample Size Calculation for the Dunnett-Type Test

Now, we derive a sample size formula for a specified alternative hypothesis

H_{a} = \cup_{k = 1}^{K} {\bar{H}}_{k}

. For

Δ_{k} > 0

, we consider a specific alternative hypothesis

H_{a} : θ_{k} - θ_{0} = Δ_{k} for k = 1, \dots, K

From Equation (2), we have

E ({\hat{θ}}_{k} - {\hat{θ}}_{0}) = Δ_{k}

for

k = 1, \dots, K

under

H_{a}

.

We have to calculate the variances and covariances of

\sqrt{N} ({\hat{θ}}_{0}, {\hat{θ}}_{1}, \dots, {\hat{θ}}_{K})

. Define

σ_{k}^{2} (ϵ) = var (ϵ_{k i})

and

σ_{k}^{2} (e) = var (e_{k j})

, i.e.,

σ_{k}^{2} (ϵ) = var \{G_{k} (X_{k})\} = \int_{0}^{\infty} G_{k} {(x)}^{2} d F_{k} (x) - θ_{k}^{2}

and

σ_{k}^{2} (e) = var \{{\bar{F}}_{k} (Y_{k})\} = \int_{0}^{\infty} {\bar{F}}_{k} {(y)}^{2} d G_{k} (y) - θ_{k}^{2}

For

0 \leq k \neq k^{'} \leq K

, let

f_{k k^{'}} (x_{k}, x_{k^{'}})

and

g_{k k^{'}} (y_{k}, y_{k^{'}})

denote the PDFs of

(X_{k}, X_{k^{'}})

and

(Y_{k}, Y_{k^{'}})

, respectively. We also define

σ_{k k^{'}} (ϵ) = cov (ϵ_{k i}, ϵ_{k^{'} i})

and

σ_{k k^{'}} (e) = cov (e_{k j}, e_{k^{'} j})

, i.e.,

σ_{k k^{'}} (ϵ) = cov \{G_{k} (X_{k}), G_{k^{'}} (X_{k^{'}})\} = \int_{0}^{\infty} \int_{0}^{\infty} G_{k} (x_{k}) G_{k^{'}} (x_{k^{'}}) f_{k k^{'}} (x_{k}, x_{k^{'}}) d x_{k} d x_{k^{'}} - θ_{k} θ_{k^{'}}

and

σ_{k k^{'}} (e) = cov \{{\bar{F}}_{k} (Y_{k}), {\bar{F}}_{k^{'}} (Y_{k^{'}})\} = \int_{0}^{\infty} \int_{0}^{\infty} {\bar{F}}_{k} (y_{k}) {\bar{F}}_{k^{'}} (y_{k^{'}}) g_{k k^{'}} (y_{k}, y_{k^{'}}) d y_{k} d y_{k^{'}} - θ_{k} θ_{k^{'}}

Let

s_{k}^{2} = var ({\hat{θ}}_{k} \sqrt{N})

and

s_{k k^{'}} = cov ({\hat{θ}}_{k} \sqrt{N}, {\hat{θ}}_{k^{'}} \sqrt{N})

for

0 \leq k, k^{'} \leq K

. Then, from Equation (1), we have

s_{k}^{2} = γ^{- 1} σ_{k}^{2} (ϵ) + {\bar{γ}}^{- 1} σ_{k}^{2} (e)

and

s_{k k^{'}} = γ^{- 1} σ_{k k^{'}} (ϵ) + {\bar{γ}}^{- 1} σ_{k k^{'}} (e)

where

\bar{γ} = 1 - γ

.

Hence, we have

V = var \{\sqrt{N} ({\hat{θ}}_{1} - {\hat{θ}}_{0}), \dots, \sqrt{N} ({\hat{θ}}_{K} - {\hat{θ}}_{0})\} = {(σ_{k k^{'}})}_{k, k^{'} = 1, \dots, K}

with

σ_{k k} = σ_{k}^{2} = var \{\sqrt{N} ({\hat{θ}}_{k} - {\hat{θ}}_{0})\} = s_{k}^{2} - 2 s_{k 0} + s_{0}^{2}

and, for

k \neq k^{'} (= 1, \dots, K)

,

σ_{k k^{'}} = cov \{\sqrt{N} ({\hat{θ}}_{k} - {\hat{θ}}_{0}), \sqrt{N} ({\hat{θ}}_{k^{'}} - {\hat{θ}}_{0})\} = s_{k k^{'}} - s_{k 0} - s_{k^{'} 0} + s_{0}^{2}

By the multivariate CLT applied to (2) under

H_{a}

,

\sqrt{N} ({\hat{θ}}_{1} - {\hat{θ}}_{0}, \dots, {\hat{θ}}_{K} - {\hat{θ}}_{0})

is asymptotically normal with mean

\sqrt{N} (Δ_{1}, \dots, Δ_{K})

and variance V. So, under

H_{a}

, we have

Z_{k} = U_{k} + \frac{Δ_{k} \sqrt{N}}{σ_{k}}

where

U_{k} = \frac{\sqrt{N} ({\hat{θ}}_{k} - {\hat{θ}}_{0} - Δ_{k})}{σ_{k}}

Note that

(U_{1}, \dots, U_{K})

is normal with mean 0 and variance–covariance of

Σ = {(ρ_{k k^{'}})}_{k, k^{'} = 1, \dots, K}

with

ρ_{k k} = 1

and

ρ_{k k^{'}} = σ_{k k^{'}} / (σ_{k} σ_{k^{'}})

for

1 \leq k \neq k^{'} \leq K

.

Hence, for given N, the power

1 - β

of the Dunnett-type test controlling the FWER at

α

is provided as

1 - β = P \{\cup_{k = 1}^{K} (Z_{k} > c_{α}) | H_{a}\}

= P \{\cup_{k = 1}^{K} (U_{k} > c_{α} - \frac{Δ_{k} \sqrt{N}}{σ_{k}})\}

(5)

where

c = c_{α}

is obtained by solving

α = P \{\cup_{k = 1}^{K} (U_{k} > c)\}

(6)

For given power

1 - β

, the required sample size N is obtained by solving Equation (5) with respect to N. The probabilities on the right-hand sides of Equations (5) and (6) can be evaluated using simulations or a multivariate numerical integration.

When

K = 2

, the conditional distribution of

U_{1}

given

U_{2} = u

is normal with mean

ρ_{12} u

and variance

(1 - ρ_{12}^{2})

, so Equations (5) and (6) are simplified to

1 - β = 1 - \int_{- \infty}^{{\bar{c}}_{1}} ϕ (u) Φ (\frac{{\bar{c}}_{2} - ρ_{12} u}{\sqrt{1 - ρ_{12}^{2}}}) d u

(7)

and

α = 1 - \int_{- \infty}^{c_{α}} ϕ (u) Φ (\frac{c_{α} - ρ_{12} u}{\sqrt{1 - ρ_{12}^{2}}}) d u

(8)

where

{\bar{c}}_{k} = c_{α} - σ_{k}^{- 1} Δ_{k} \sqrt{N}

for

k = 1, 2

.
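The following Python sketch illustrates how Equations (7) and (8) might be solved numerically for K = 2. It assumes that the standard deviations \sigma_{1}, \sigma_{2} and the correlation \rho_{12} have already been obtained; the input values in the example (\Delta_{k} = 0.1, \sigma_{k} = 0.58, \rho_{12} = 0.65) are illustrative, and only \rho_{12} = 0.65 is a value reported in this paper. The function names, Simpson rule, and integer search are choices made here.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def joint_lower_prob(a, b, rho, steps=2000, lower=-8.0):
    """P(U1 <= a, U2 <= b) for a standard bivariate normal pair, |rho| < 1."""
    if a <= lower:
        return 0.0
    h = (a - lower) / steps
    s = sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps + 1):
        u = lower + i * h
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)  # Simpson weights
        total += weight * STD.pdf(u) * STD.cdf((b - rho * u) / s)
    return total * h / 3.0


def critical_value(rho, alpha):
    """Solve Equation (8) for c_alpha by bisection."""
    lo, hi = 0.0, 5.0
    while hi - lo > 1e-8:
        mid = 0.5 * (lo + hi)
        if 1.0 - joint_lower_prob(mid, mid, rho) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def power_k2(n_total, c_alpha, deltas, sigmas, rho):
    """Equation (7) with c_bar_k = c_alpha - Delta_k * sqrt(N) / sigma_k."""
    c1 = c_alpha - deltas[0] * sqrt(n_total) / sigmas[0]
    c2 = c_alpha - deltas[1] * sqrt(n_total) / sigmas[1]
    return 1.0 - joint_lower_prob(c1, c2, rho)


def sample_size_k2(deltas, sigmas, rho, alpha=0.05, power=0.8):
    """Smallest integer N whose power from Equation (7) reaches the target."""
    c_alpha = critical_value(rho, alpha)
    lo, hi = 1, 2
    while power_k2(hi, c_alpha, deltas, sigmas, rho) < power:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:  # integer bisection; power is increasing in N
        mid = (lo + hi) // 2
        if power_k2(mid, c_alpha, deltas, sigmas, rho) < power:
            lo = mid
        else:
            hi = mid
    return hi


n_req = sample_size_k2(deltas=(0.1, 0.1), sigmas=(0.58, 0.58), rho=0.65)
print(n_req)
```

The doubling step followed by integer bisection keeps the number of power evaluations logarithmic in N, which matters because each evaluation involves a numerical integration.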

4. Transformation of Biomarker Data to Normal Distributions

The Dunnett-type test statistic is completely non-parametric, but the asymptotic distribution and power of the test statistic depend on the distributions of biomarkers for benign and malignant groups through the values of

θ_{k}

under

H_{a}

and the variance–covariance of

({\hat{θ}}_{1} - {\hat{θ}}_{0}, \dots, {\hat{θ}}_{K} - {\hat{θ}}_{0})

, as shown in the previous section. So, we have to assume a parametric model for sample size calculation.

Biomarkers usually take positive values. If we know the distributions of original biomarker values, then we may calculate a sample size based on them. However, we rarely have good information about the original distributions. Furthermore, for most multivariate positive distributions, their means, variances, and correlation coefficients are mutually associated, so it is difficult to specify the parameter values for a sample size calculation.

Since AUC is a rank statistic, it is invariant to monotone transformations of

X_{k}

and

Y_{k}

. We can find a transformation, such as logarithm, to change the distribution of a positive random variable to a (approximately) normal distribution, for which the parameters are not associated with each other and can be easily specified. Such a transformation tends to stabilize the variance of random variables too.

We assume that, after a log-transformation, each biomarker marginally has normal distributions with a common variance (less strictly, a location-shift model) between malignant and benign groups. Then, we standardize the data by subtracting the mean of benign group and dividing by the common standard deviation. Then, for the resulting data of biomarker k,

X_{k} - δ_{k}

and

Y_{k}

are independent

N (0, 1)

with a location parameter

δ_{k} (> 0)

. Note that

δ_{k}

is the standardized effect size of log-transformed biomarker values between malignant and benign groups. It is known that

θ_{k}

and

δ_{k}

have a 1-to-1 relationship, i.e.,

θ_{k} = \int_{- \infty}^{\infty} ϕ (z - δ_{k}) Φ (z) d z

[11].
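Under the normal location-shift model, the integral above equals P(X_{k} \geq Y_{k}), and since X_{k} - Y_{k} is normal with mean \delta_{k} and variance 2, it has the closed form \theta_{k} = \Phi(\delta_{k} / \sqrt{2}). This closed form is a standard identity that is not stated explicitly in the text; the sketch below uses it to convert between the two parameters and includes a numerical evaluation of the integral as a cross-check. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()


def auc_from_delta(delta):
    """theta = Phi(delta / sqrt(2)) under the unit-variance normal shift model."""
    return STD.cdf(delta / sqrt(2.0))


def delta_from_auc(theta):
    """Inverse map: delta = sqrt(2) * Phi^{-1}(theta), for 0 < theta < 1."""
    return sqrt(2.0) * STD.inv_cdf(theta)


def auc_by_integration(delta, steps=4000, half_width=10.0):
    """Midpoint-rule evaluation of the integral of phi(z - delta) * Phi(z)."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        z = delta - half_width + (i + 0.5) * h
        total += STD.pdf(z - delta) * STD.cdf(z)
    return total * h


for theta in (0.7, 0.8):
    print(theta, round(delta_from_auc(theta), 3))
```

For AUCs of 0.7 and 0.8, the conversion can be compared with the location-shift parameters 0.742 and 1.190 quoted in the example of Section 5.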

With such a transformation, it is easy to conceptualize the design parameters for sample size calculations. Let

ϕ_{r} (x, y)

denote the PDF of bivariate normal distribution with 0 means, unit variances, and correlation coefficient r. Then, we have

F_{k} (t + δ_{k}) = G_{k} (t) = Φ (t)

,

f_{k k^{'}} (x_{k}, x_{k^{'}}) = ϕ_{r_{k k^{'}}} (x_{k} - δ_{k}, x_{k^{'}} - δ_{k^{'}})

, and

g_{k k^{'}} (y_{k}, y_{k^{'}}) = ϕ_{r_{k k^{'}}} (y_{k}, y_{k^{'}})

, where

r_{k k^{'}} = corr (X_{k}, X_{k^{'}}) = corr (Y_{k}, Y_{k^{'}})

. Furthermore, after such a transformation by Appendix B of Jung [11], we have

σ_{k}^{2} (ϵ) = σ_{k}^{2} (e)

and

σ_{k k^{'}} (ϵ) = σ_{k k^{'}} (e)

, so we have

s_{k}^{2} = {(γ \bar{γ})}^{- 1} σ_{k}^{2} (ϵ)

and

s_{k k^{'}} = {(γ \bar{γ})}^{- 1} σ_{k k^{'}} (ϵ)

, and

σ_{k}^{2}

and

σ_{k k^{'}}

are simplified to

σ_{k}^{2} = {(γ \bar{γ})}^{- 1} \{σ_{k}^{2} (ϵ) - 2 σ_{k 0} (ϵ) + σ_{0}^{2} (ϵ)\}

(9)

and

σ_{k k^{'}} = {(γ \bar{γ})}^{- 1} \{σ_{k k^{'}} (ϵ) - σ_{k 0} (ϵ) - σ_{k^{'} 0} (ϵ) + σ_{0}^{2} (ϵ)\}

(10)

where

σ_{k}^{2} (ϵ) = \int_{- \infty}^{\infty} Φ^{2} (x) ϕ (x - δ_{k}) d x - θ_{k}^{2}

and

σ_{k k^{'}} (ϵ) = \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} Φ (x_{k}) Φ (x_{k^{'}}) ϕ_{r_{k k^{'}}} (x_{k} - δ_{k}, x_{k^{'}} - δ_{k^{'}}) d x_{k} d x_{k^{'}} - θ_{k} θ_{k^{'}}

Note that the ranges of integrations are changed from

(0, \infty)

for the original data to

(- \infty, \infty)

for the transformed data.

From Equations (9) and (10), we find that the sample size depends on the prevalence of malignancy

γ

through

γ \bar{γ}

, so the required sample size is minimized when

γ = \bar{γ} = 1 / 2

and is unchanged between two

γ

values with identical

| γ - 1 / 2 |

value. Furthermore, since

ρ_{k k^{'}} = σ_{k k^{'}} / (σ_{k} σ_{k^{'}})

, the correlation matrix of

\sqrt{N} ({\hat{θ}}_{1} - {\hat{θ}}_{0}, \dots, {\hat{θ}}_{K} - {\hat{θ}}_{0})

,

Σ = {(ρ_{k, k^{'}})}_{k, k^{'} = 1, \dots, K}

, is free of

γ

.

Assuming that, for biomarker k,

X_{k}

and

Y_{k}

marginally follow normal distributions after a log-transformation, a sample size calculation is conducted as follows.

1. Specify
(a) $FWER = α$ and $power = 1 - β$;
(b) Prevalence of malignancy $= γ$;
(c) For biomarker $k (= 0, 1, \dots, K)$, standardized effect size $δ_{k}$ between benign and malignant groups or AUC $θ_{k} = \int_{- \infty}^{\infty} ϕ (z - δ_{k}) Φ (z) d z$ under $H_{a}$; and
(d) $r_{k k^{'}} = corr (X_{k}, X_{k^{'}}) = corr (Y_{k}, Y_{k^{'}})$.
2. Calculate $ρ_{k k^{'}} = corr ({\hat{θ}}_{k} - {\hat{θ}}_{0}, {\hat{θ}}_{k^{'}} - {\hat{θ}}_{0})$ using Equations (9) and (10) based on the specified $δ_{k}$ (or $θ_{k}$) and $r_{k k^{'}}$.
3. Calculate $c_{α}$ by solving Equation (6).
4. Calculate the required sample size N by solving Equation (5).
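The first two steps of this procedure can be sketched in Python as follows, assuming the normal location-shift model of this section. The variance and covariance integrals are evaluated with a midpoint rule on a grid truncated at 8 standard deviations on each side; the grid, the function names, and the use of the closed form \theta_{k} = \Phi(\delta_{k} / \sqrt{2}) are choices made here for illustration. The remaining steps then solve Equations (6) and (5) with the resulting variances and correlations.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

STD = NormalDist()
HALF_WIDTH, N_POINTS = 8.0, 320
STEP = 2.0 * HALF_WIDTH / N_POINTS
GRID = [-HALF_WIDTH + STEP * (i + 0.5) for i in range(N_POINTS)]
PDF = [STD.pdf(u) for u in GRID]


def auc(delta):
    return STD.cdf(delta / sqrt(2.0))


def var_eps(delta):
    """sigma_k^2(eps): integral of Phi(x)^2 phi(x - delta) dx minus theta_k^2."""
    second = STEP * sum(STD.cdf(u + delta) ** 2 * p for u, p in zip(GRID, PDF))
    return second - auc(delta) ** 2


def cov_eps(delta_a, delta_b, r):
    """sigma_{kk'}(eps): double integral over the bivariate normal, |r| < 1."""
    cdf_a = [STD.cdf(u + delta_a) for u in GRID]
    cdf_b = [STD.cdf(v + delta_b) for v in GRID]
    s = sqrt(1.0 - r * r)
    total = 0.0
    for u, pu, ca in zip(GRID, PDF, cdf_a):
        # Bivariate density = phi(u) times the N(r * u, 1 - r^2) density in v.
        inner = sum(cb * exp(-0.5 * ((v - r * u) / s) ** 2)
                    for v, cb in zip(GRID, cdf_b))
        total += pu * ca * inner
    total *= STEP * STEP / (s * sqrt(2.0 * pi))
    return total - auc(delta_a) * auc(delta_b)


def design_covariance(deltas, r, gamma):
    """Return (V, rho): Equations (9) and (10) and the implied correlations."""
    size = len(deltas)
    s = [[0.0] * size for _ in range(size)]
    for a in range(size):
        for b in range(a, size):
            if a == b:
                val = var_eps(deltas[a])
            else:
                val = cov_eps(deltas[a], deltas[b], r[a][b])
            s[a][b] = s[b][a] = val
    scale = 1.0 / (gamma * (1.0 - gamma))
    v = [[scale * (s[a][b] - s[a][0] - s[b][0] + s[0][0])
          for b in range(1, size)] for a in range(1, size)]
    rho = [[v[a][b] / sqrt(v[a][a] * v[b][b]) for b in range(size - 1)]
           for a in range(size - 1)]
    return v, rho


# Setting of the Section 5 example: AUCs (0.7, 0.8, 0.8), r_01 = r_02 = 0.3,
# r_12 = 0.3 * sqrt(2), gamma = 0.5; the paper reports rho_12 = 0.65 there.
deltas = [sqrt(2.0) * STD.inv_cdf(t) for t in (0.7, 0.8, 0.8)]
r12 = 0.3 * sqrt(2.0)
r = [[1.0, 0.3, 0.3], [0.3, 1.0, r12], [0.3, r12, 1.0]]
V, RHO = design_covariance(deltas, r, gamma=0.5)
print(round(RHO[0][1], 3))
```

Because the factor involving \gamma multiplies every entry of V, it cancels in the correlation matrix, so `RHO` is the same for any prevalence while `V` scales with it.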

5. Numerical Studies

Suppose that there are two experimental biomarkers and a control biomarker following the trivariate normal distribution model with a location-shift as discussed in the previous section. We consider prevalence of malignancy

γ = 0.1

, 0.3, or 0.5,

θ_{0} = 0.6

, 0.7, or 0.8, AUCs for biomarkers

θ_{1} = θ_{2} = θ_{0}

under

H_{0}

, and

θ_{1} = θ_{2} = θ_{0} + Δ

with

Δ = 0.1

or 0.15 under

H_{a}

. Using the one-to-one relationship between AUC

θ_{k}

and location-shift parameter

δ_{k}

between

X_{k}

and

Y_{k}

, we identify the location parameters

(δ_{0}, δ_{1}, δ_{2})

corresponding to the specified

(θ_{0}, θ_{1}, θ_{2})

values. We assume that correlation coefficients between any pair of biomarkers are identically

r = 0.2

or 0.4. Under each design setting of

(θ_{0}, Δ, r, γ)

, we calculate the critical value

c_{α}

for 1-sided FWER

α = 0.05

from Equation (8) and calculate sample size N for power

1 - β = 0.8

, 0.85, or 0.9.

We conduct simulations to check if the calculated sample size is accurately powered or not. At first, we generate

B = 5000

samples of size N under the design setting of the sample size calculation. That is, for

m = γ N

and

n = (1 - γ) N

,

\{(x_{0 i}, x_{1 i}, x_{2 i}), i = 1, \dots, m\}

and

\{(y_{0 j}, y_{1 j}, y_{2 j}), j = 1, \dots, n\}

are independent trivariate normal random vectors with mean vectors

(δ_{0}, δ_{1}, δ_{1})

and

(0, 0, 0)

, respectively, but with variances 1 and compound symmetry structure with correlation coefficient r. Here,

δ_{0}

and

δ_{1}

are location parameters corresponding to AUCs

θ_{0}

and

θ_{1} = θ_{2} = θ_{0} + Δ

, respectively. The empirical power

1 - \hat{β}

is estimated by the proportion of samples rejecting

H_{0}

among the

B = 5000

simulation samples. Also, to check if the proposed Dunnett-type test accurately controls the FWER, we generate

B = 5000

samples of size N under

H_{0}

(i.e., with

Δ = 0

), and the empirical FWER

\hat{α}

is estimated by the proportion of samples rejecting

H_{0}

.
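One way to generate a single simulation sample under this design is sketched below. A compound symmetry structure with a non-negative correlation r can be produced by adding a shared standard normal factor to independent ones; this construction, the function names, and the seed are assumptions made for illustration, and the paper's simulations were run with a Fortran program.

```python
import random
from math import sqrt


def draw_group(size, means, r, rng):
    """Draw `size` vectors with unit variances and common correlation r (0 <= r < 1)."""
    rows = []
    for _ in range(size):
        shared = rng.gauss(0.0, 1.0)  # factor common to all biomarkers of a subject
        rows.append([mu + sqrt(r) * shared + sqrt(1.0 - r) * rng.gauss(0.0, 1.0)
                     for mu in means])
    return rows


def simulate_sample(n_total, gamma, deltas, r, seed=0):
    """Return (malignant, benign) biomarker vectors for one simulated study."""
    rng = random.Random(seed)
    m = round(gamma * n_total)
    n = n_total - m
    malignant = draw_group(m, deltas, r, rng)            # mean vector (delta_0, ..., delta_K)
    benign = draw_group(n, [0.0] * len(deltas), r, rng)  # mean vector (0, ..., 0)
    return malignant, benign


# Hypothetical call: N = 184, gamma = 0.5, location shifts for AUCs (0.7, 0.8, 0.8).
malignant, benign = simulate_sample(184, 0.5, [0.742, 1.190, 1.190], r=0.4)
print(len(malignant), len(benign))
```

Repeating this draw B times, computing the test statistics on each sample, and recording the proportion of rejections gives the empirical FWER (with all location shifts equal) or the empirical power (under the alternative).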

(Table 1 may be placed around here.)

Table 1. Required sample size N (and empirical FWER level

\hat{α}

and power

1 - \hat{β}

from simulations) for 1-sided

α = 0.05

test of

H_{0} : θ_{1} = θ_{2} = θ_{0}

vs.

H_{a} : (θ_{1} > θ_{0}) \cup (θ_{2} > θ_{0})

given

r = corr (X_{k}, X_{k^{'}}) = corr (Y_{k}, Y_{k^{'}})

for

k \neq k^{'}

, AUCs

(θ_{0}, θ_{1}, θ_{2})

under

H_{a}

, and prevalence of malignancy

γ

.

The results of sample size calculations and simulations are summarized in Table 1. As expected, sample size increases in

1 - β

and

| γ - 1 / 2 |

and decreases in

r

,

θ_{0} (> 1 / 2)

, and

Δ = θ_{k} - θ_{0}

for

k = 1, 2

. From the simulation results, we find that our test statistic controls the FWER accurately and the calculated sample sizes are quite accurate in the sense that the empirical powers are close to the corresponding nominal power

1 - β

overall. For some design settings, however, the proposed test is slightly anti-conservative and the calculated sample size is slightly overpowered, especially if the calculated sample size is small and the allocation between cases and controls is very unbalanced (i.e.,

γ = 0.1

). The calculated sample sizes are never underpowered, at least for the design settings considered in this numerical study. Similar results were observed from numerical studies on two-sample version of this test [11]. In designing a study, slightly overpowered sample sizes are less of an issue than underpowered ones. If we want a more accurately powered sample size, we may calculate one using simulations. In this case, the sample size calculated from our formula can be used for validation of the sample size estimated from simulations.

Example: Suppose that there exists a biomarker A and we want to investigate whether we can improve its performance by adding a new biomarker B or C. So, biomarker 0 is A, 1 is A + B, and 2 is

A + C

. After a log-transformation and standardization with the mean of benign group and the common standard deviation as in Section 4, we assume that

(X_{0}, X_{1}, X_{2})

and

(Y_{0}, Y_{1}, Y_{2})

are independent trivariate normal random vectors with unit variances commonly, but with mean vectors

(δ_{0}, δ_{1}, δ_{2})

and

(0, 0, 0)

, respectively. More specifically, if

({\tilde{X}}_{0}, {\tilde{X}}_{1}, {\tilde{X}}_{2})

and

({\tilde{Y}}_{0}, {\tilde{Y}}_{1}, {\tilde{Y}}_{2})

denote the random vectors for the log-transformed biomarker data for malignant and benign groups, respectively, for

k = 0, 1, 2

, let

μ_{k}

denote the mean of

{\tilde{Y}}_{k}

and

s_{k}^{2}

the common variance of

{\tilde{X}}_{k}

and

{\tilde{Y}}_{k}

. Then, the standardized biomarker values will be

X_{k} = ({\tilde{X}}_{k} - μ_{k}) / s_{k}

and

Y_{k} = ({\tilde{Y}}_{k} - μ_{k}) / s_{k}

. In the data analysis,

μ_{k}

and

s_{k}^{2}

will be replaced by their sample estimates. Assuming that B and C add a similar amount of information to A, let

corr (X_{0}, X_{1}) = corr (X_{0}, X_{2}) = corr (Y_{0}, Y_{1}) = corr (Y_{0}, Y_{2}) = 0.3

and

corr (X_{1}, X_{2}) = corr (Y_{1}, Y_{2}) = 0.3 \sqrt{2}

. If we assume

θ_{0} = 0.7

and

θ_{1} = θ_{2} = 0.8

, the corresponding location-shift parameters are

(δ_{0}, δ_{1}, δ_{2}) = (0.742, 1.190, 1.190)

. By assuming a prevalence of malignancy

γ = 0.5

, we have

ρ_{12} = corr ({\hat{θ}}_{1} - {\hat{θ}}_{0}, {\hat{θ}}_{2} - {\hat{θ}}_{0}) = 0.65

. Using

α = 0.05

, the required sample size for 80% power is

N = 184

, or

m = γ N = 92

malignant cases and

n = (1 - γ) N = 92

benign cases. From 5000 simulation samples, we observe an empirical FWER of

\hat{α} = 0.056

under

H_{0}

and power of

1 - \hat{β} = 0.888

under

H_{a}

. We find that this sample size is somewhat overpowered.

6. Discussion

A Dunnett-type testing method is proposed to compare the performance of a control biomarker with

K (\geq 2)

new biomarkers using ROC curves. The test statistic is completely non-parametric and robust. Also proposed is a sample size calculation method for the Dunnett-type test. For a sample size calculation, we have to specify the distribution of the biomarkers. The biomarkers to be compared are collected from each subject, so they are usually correlated. Furthermore, the raw data of the biomarkers are usually positive, so their parameters, such as means, variances, and correlation coefficients, are associated in a complicated manner. As a result, it is difficult to specify the parameter values in a sample size calculation.

In our numerical studies, we assume equal correlation coefficients among the biomarkers for simplicity. However, the true correlation structure among real biomarkers may be more complicated. Our test statistic automatically accounts for any type of true correlation structure by using the so-called GEE-type approach [14]. In a sample size calculation, we have to specify the true correlation coefficients based on our best estimates. If we have pilot data, we may use the estimated correlation coefficients as the true ones in the sample size calculation. In this sense, we may consider a two-stage design of biomarker projects to estimate the correlation coefficients among the biomarkers from the first-stage data and calculate the sample size for the whole project combining the stage 1 and 2 data sets. In this case, we do not need a type I error adjustment for the final testing since we do not conduct any statistical testing at the first stage.

Since the AUC of ROC curves is a rank statistic, the AUCs and the proposed test statistic are invariant to a monotone transformation of biomarker data. Utilizing this nice property, Jung [11] proposed to calculate sample sizes based on the log-transformed data for the easy specification of the parameter values. We have extended his two-sample testing method to

(K + 1)

-sample testing by controlling the FWER for multiple testing adjustment. Through simulations, we find that the Dunnett-type test controls the FWER accurately and the calculated sample sizes maintain the specified power closely overall. One-sided critical regions are used in the presentation of our method, but the extension to a two-sided test is straightforward.

To simplify the discussion, we have assumed that the transformed data follow a normal distribution, but it can be any distribution with a location-shift model between the benign and malignant groups. Whatever the transformed distribution, the correct distribution functions should be used in the sample size calculation. Note that any continuous distribution can be transformed to a normal distribution by some monotone transformation, although it may not be a log-transformation. However, a log-transformation achieves this goal relatively well in the real analysis of positive data.

The Dunnett test is widely used in clinical trials and retrospective clinical studies to compare multiple experimental treatment groups with a common control group while adjusting for multiple testing. Usually, the groups to be compared are independent. For example, a Dunnett-type test and its sample size calculation method were proposed to compare $K (\geq 2)$ experimental arms with a common control arm using a time-to-event endpoint [15]. In this paper, however, the $K + 1$ biomarkers to be compared are correlated since all $K + 1$ biomarkers are collected from each subject.
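To illustrate how a Dunnett-type critical value depends on the correlation among the comparisons, the sketch below assumes that the $K$ standardized test statistics are jointly standard normal under the null hypothesis with a common correlation `rho`, and solves for the one-sided critical value $c$ with $P(\max_k Z_k > c) = α$. The equicorrelation structure, the value of `rho`, and the function names are assumptions of this illustration. In the proposed test, the correlations are estimated from the data rather than fixed in advance.

```python
from math import sqrt
from statistics import NormalDist

_Z = NormalDist()

def prob_max_below(c, k, rho, steps=2000, lim=8.0):
    """P(max_k Z_k <= c) for K equicorrelated standard normal statistics.

    Uses Z_k = sqrt(rho) * U + sqrt(1 - rho) * E_k with independent
    standard normal U, E_1, ..., E_K, and integrates over U by the
    trapezoidal rule. Requires 0 <= rho < 1.
    """
    h = 2.0 * lim / steps
    total = 0.0
    for i in range(steps + 1):
        u = -lim + i * h
        inner = _Z.cdf((c - sqrt(rho) * u) / sqrt(1.0 - rho)) ** k
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * _Z.pdf(u) * inner
    return total * h

def dunnett_critical_value(k, rho, alpha=0.05):
    """One-sided critical value c solving P(max_k Z_k > c) = alpha by bisection."""
    lo, hi = 0.0, 6.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if prob_max_below(mid, k, rho) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_single = dunnett_critical_value(1, 0.0)  # one comparison, no adjustment
c_two = dunnett_critical_value(2, 0.5)     # two comparisons, correlation 0.5
print(round(c_single, 3), round(c_two, 3))
```

With a single comparison the critical value reduces to the usual one-sided normal quantile. With more comparisons it increases, and it increases less when the test statistics are more strongly correlated.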

In contrast to the Dunnett-type setting, we may want to compare $K \geq 3$ groups without a common control. In this case, we usually use a one-way ANOVA test. Jung and Hui [16] proposed a sample size calculation method for ANOVA-type log-rank tests to compare $K (\geq 3)$ treatment groups using a time-to-event endpoint.

Our Fortran program, developed for sample size calculation and simulations, is available from the author upon request.

Funding

This research received no external funding.

Institutional Review Board Statement

This paper proposes statistical methods for clinical projects. It does not report any real clinical data, so there are no issues related to IRB approval or consent forms.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Greenhouse, S.W.; Mantel, N. The evaluation of diagnostic tests. Biometrics 1950, 6, 399–412.
2. Linnet, K. Comparison of quantitative diagnostic tests: Type 1 error, power, and sample size. Stat. Med. 1987, 6, 147–158.
3. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
4. Hanley, J.A.; McNeil, B.J. Method for comparing the area under the ROC curves derived from the same cases. Radiology 1983, 148, 839–843.
5. McClish, D.K. Comparing the areas under more than two ROC curves. Med. Decis. Mak. 1987, 7, 149–155.
6. DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 837–845.
7. Wieand, S.; Gail, M.H.; James, B.R.; James, K.L. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 1989, 76, 585–592.
8. Emir, B.; Wieand, S.; Jung, S.H.; Ying, Z. Comparison of diagnostic markers with repeated measurements: A non-parametric ROC curve approach. Stat. Med. 2000, 19, 511–523.
9. Jung, S.H.; Wieand, S.; Cha, S. A statistic for comparing two correlated markers which are prognostic for time to an event. Stat. Med. 1995, 14, 2217–2226.
10. Hwang, Y.T.; Su, N.C. Sample size determination for comparing accuracies between two diagnostic tests under a paired design. Biom. J. 2022, 64, 771–804.
11. Jung, S.H. Sample size calculation for comparing two ROC curves. Pharm. Stat. 2024, 23, 557–569.
12. Dunnett, C.W. A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 1955, 50, 1096–1121.
13. Dunnett, C.W. New tables for multiple comparisons with a control. Biometrics 1964, 20, 482–491.
14. Liang, K.Y.; Zeger, S.L. Longitudinal data analysis using generalized linear models. Biometrika 1986, 73, 13–22.
15. Jung, S.H.; Kim, C.; Chow, S.C. Sample size calculation for the log-rank tests for multi-arm trials with a common control. J. Korean Stat. Soc. 2008, 37, 11–22.
16. Jung, S.H.; Hui, S. Sample size calculations to compare K different survival distributions. Lifetime Data Anal. 2002, 8, 361–373.

Table 1. Required sample size $N$ (and empirical FWER level $\hat{α}$ and power $1 - \hat{β}$ from simulations) for a 1-sided $α = 0.05$ test of $H_{0} : θ_{1} = θ_{2} = θ_{0}$ vs. $H_{a} : (θ_{1} > θ_{0}) \cup (θ_{2} > θ_{0})$ given $r = corr(X_{k}, X_{k'}) = corr(Y_{k}, Y_{k'})$ for $k \neq k'$, AUCs $(θ_{0}, θ_{1}, θ_{2})$ under $H_{a}$, and prevalence of malignancy $γ$.

$1 - β$	r	$θ_{0}$	$θ_{1} = θ_{2}$	$γ = 0.1$	$γ = 0.3$	$γ = 0.5$
0.8	0.2	$0.6$	$0.7$	$703 (0.057, 0.806)$	$302 (0.049, 0.801)$	$253 (0.052, 0.804)$
			$0.75$	$299 (0.063, 0.813)$	$128 (0.061, 0.814)$	$108 (0.056, 0.812)$
		$0.7$	$0.8$	$565 (0.059, 0.823)$	$242 (0.053, 0.814)$	$204 (0.048, 0.814)$
			$0.85$	$234 (0.065, 0.841)$	$100 (0.054, 0.820)$	$84 (0.058, 0.829)$
		$0.8$	$0.9$	$370 (0.063, 0.847)$	$159 (0.057, 0.837)$	$134 (0.053, 0.848)$
			$0.95$	$153 (0.070, 0.917)$	$66 (0.058, 0.883)$	$55 (0.051, 0.879)$
	0.4	$0.6$	$0.7$	$540 (0.052, 0.813)$	$232 (0.052, 0.815)$	$195 (0.051, 0.797)$
			$0.75$	$231 (0.058, 0.823)$	$99 (0.055, 0.802)$	$84 (0.049, 0.807)$
		$0.7$	$0.8$	$440 (0.055, 0.830)$	$189 (0.046, 0.804)$	$159 (0.055, 0.816)$
			$0.85$	$186 (0.065, 0.846)$	$80 (0.058, 0.834)$	$67 (0.061, 0.821)$
		$0.8$	$0.9$	$300 (0.060, 0.863)$	$129 (0.053, 0.832)$	$108 (0.052, 0.845)$
			$0.95$	$135 (0.070, 0.929)$	$58 (0.064, 0.900)$	$49 (0.049, 0.898)$
0.85	0.2	$0.6$	$0.7$	$816 (0.050, 0.864)$	$350 (0.059, 0.856)$	$294 (0.053, 0.849)$
			$0.75$	$347 (0.066, 0.863)$	$149 (0.051, 0.858)$	$125 (0.055, 0.846)$
		$0.7$	$0.8$	$655 (0.059, 0.871)$	$281 (0.053, 0.866)$	$236 (0.046, 0.851)$
			$0.85$	$271 (0.059, 0.890)$	$117 (0.057, 0.862)$	$98 (0.060, 0.871)$
		$0.8$	$0.9$	$430 (0.062, 0.893)$	$184 (0.060, 0.881)$	$155 (0.050, 0.877)$
			$0.95$	$178 (0.069, 0.938)$	$77 (0.055, 0.933)$	$64 (0.054, 0.922)$
	0.4	$0.6$	$0.7$	$626 (0.047, 0.864)$	$269 (0.056, 0.858)$	$226 (0.051, 0.856)$
			$0.75$	$269 (0.063, 0.864)$	$115 (0.055, 0.863)$	$97 (0.049, 0.855)$
		$0.7$	$0.8$	$510 (0.056, 0.877)$	$219 (0.056, 0.854)$	$184 (0.049, 0.864)$
			$0.85$	$216 (0.066, 0.887)$	$93 (0.057, 0.877)$	$78 (0.056, 0.877)$
		$0.8$	$0.9$	$349 (0.059, 0.895)$	$150 (0.054, 0.886)$	$126 (0.056, 0.874)$
			$0.95$	$157 (0.064, 0.946)$	$67 (0.051, 0.939)$	$57 (0.056, 0.929)$
0.9	0.2	$0.6$	$0.7$	$969 (0.055, 0.912)$	$416 (0.050, 0.901)$	$349 (0.049, 0.903)$
			$0.75$	$412 (0.058, 0.905)$	$177 (0.056, 0.903)$	$149 (0.053, 0.909)$
		$0.7$	$0.8$	$779 (0.055, 0.906)$	$334 (0.050, 0.906)$	$281 (0.049, 0.905)$
			$0.85$	$322 (0.060, 0.922)$	$138 (0.058, 0.913)$	$116 (0.058, 0.909)$
		$0.8$	$0.9$	$511 (0.052, 0.932)$	$219 (0.053, 0.920)$	$184 (0.048, 0.919)$
			$0.95$	$212 (0.065, 0.971)$	$91 (0.060, 0.957)$	$76 (0.051, 0.959)$
	0.4	$0.6$	$0.7$	$744 (0.056, 0.904)$	$319 (0.054, 0.904)$	$268 (0.054, 0.895)$
			$0.75$	$319 (0.061, 0.905)$	$137 (0.057, 0.903)$	$115 (0.055, 0.908)$
		$0.7$	$0.8$	$607 (0.057, 0.911)$	$260 (0.052, 0.911)$	$219 (0.048, 0.913)$
			$0.85$	$257 (0.054, 0.924)$	$110 (0.054, 0.922)$	$93 (0.057, 0.918)$
		$0.8$	$0.9$	$415 (0.062, 0.937)$	$178 (0.053, 0.917)$	$150 (0.049, 0.927)$
			$0.95$	$186 (0.066, 0.974)$	$80 (0.056, 0.968)$	$67 (0.048, 0.965)$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
