Abstract
Diagnostic biomarkers are key components of diagnostics. In this paper, we consider diagnostic biomarkers taking continuous values that are associated with a dichotomous disease status, called malignant or benign. The performance of such a biomarker is evaluated by the area under the curve (AUC) of its receiver operating characteristic curve. We assume that, together with the disease status, one control and multiple experimental biomarkers are collected from each subject to test if any of the experimental biomarkers have a larger AUC than the control. In this case, each experimental biomarker is compared with the control, so the comparisons involve a multiple testing issue. In this paper, we propose a simple non-parametric statistical testing procedure to compare K (≥2) experimental biomarkers with a control while adjusting for multiplicity, together with its sample size calculation method. Our sample size formula requires the specification of the AUC values (or the standardized effect size of each biomarker between the benign and malignant groups) together with the correlation coefficients between the biomarkers, the prevalence of the malignant group in the study population, the type I error rate, and the power. Through simulations, we show that the statistical test controls the overall type I error rate accurately and the proposed sample size closely maintains the specified statistical power.
Keywords:
AUC; biomarker; family-wise error rate; location-shift model; prevalence; sensitivity; specificity
1. Introduction
Different types of biomarkers are measured from the tumor, blood, or urine using molecular, biochemical, physiological, anatomical, or imaging methods. The observed biomarkers are used for various purposes during the diagnosis and treatment of diseases. For example, diagnostic biomarkers are used to diagnose diseases, predictive biomarkers are used to predict the response to specific treatments, and prognostic biomarkers are used to measure the aggressiveness of a disease for patients with no treatment or a non-targeted treatment. When the outcome is dichotomous and the biomarker is continuous, the performance of a biomarker is evaluated by the area under the curve (AUC) of its receiver operating characteristic (ROC) curve, which plots the sensitivity against one minus the specificity calculated for all the possible cutoff values of the biomarker. In this paper, we consider diagnostic biomarkers taking continuous values that are associated with a dichotomous disease status, called benignity or malignancy, but the proposed methods can be used for any type of biomarker with such properties.
Various parameters have been used to measure the performance of biomarkers and compare among different biomarkers. For instance, sensitivity at a fixed specificity level determined by the cutoff value of each biomarker may be compared between biomarkers [1,2]. Some investigators proposed testing methods to compare the whole or partial AUC of ROC curves of biomarkers [3,4,5,6,7].
Emir et al. [8] extended the method of DeLong et al. [6] to the comparison of ROC curves from longitudinally collected biomarkers, and Jung et al. [9] proposed an ROC-type method to compare the performance of biomarkers that are associated with a time-to-event endpoint such as progression-free survival.
Sample size calculation is an important step in the development stage of medical projects. Hwang and Su [10] proposed a sample size formula for the test statistic of DeLong et al. [6]. Their variance formula requires the specification of many parameter values. Furthermore, these parameters are not conceptually interpretable, so it is almost impossible to specify them accurately in sample size calculations. This can prevent us from using their sample size formula in designing real biomarker projects. Jung [11] proposed a simple non-parametric test statistic that is asymptotically identical to that of DeLong et al. [6], together with its sample size formula. Compared to the formula of Hwang and Su [10], his formula requires the specification of a much smaller number of design parameters, and the design parameters are intuitively interpretable. These sample size methods are limited to the comparison of two biomarkers.
In this paper, we consider the cases where a control biomarker (usually an existing one) and K (≥2) experimental biomarkers are collected from each subject together with a dichotomous disease status. We want to discover the experimental biomarkers that are more closely associated with the disease outcome than the control biomarker. To this end, we compare the AUC of the ROC curve of each experimental biomarker with that of the control using the two-sample test statistic [11]. With K experimental biomarkers to be compared with the control, we will conduct K statistical tests, so the discovery of new biomarkers will involve a multiple testing issue. We propose a multiple testing procedure using a Dunnett-type approach [12,13]. We also derive a sample size formula for the multiple testing procedure. While the statistical test is completely non-parametric, its power depends on the distributions of the biomarkers. Furthermore, the biomarkers to be compared are correlated because they are collected from each patient. This results in additional problems in deriving the statistical testing method and its sample size formula.
Since the AUC of an ROC curve is a rank statistic, it is invariant to monotone transformations of biomarker values. Hence, if we find a transformation of biomarker distributions to symmetric distributions, such as normal distributions, with a location-shift model and unit variances, then, for a sample size calculation, we need to specify only the location parameter between the benign and malignant groups for each biomarker (which corresponds to the AUC) and the correlation coefficient between the paired biomarkers after the transformation. A log-transformation followed by standardization often achieves this goal. Through simulations, we show that our multiple testing procedure accurately controls the overall type I error rate, called the family-wise error rate (FWER), and its sample size formula closely maintains a specified statistical power.
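The invariance property can be checked numerically. The following is a minimal sketch, not the author's program: it assumes the usual Mann-Whitney form of the empirical AUC (the proportion of malignant-benign pairs in which the malignant value is larger, counting ties as one half), and the function name `auc_mann_whitney` and the log-normal example data are hypothetical.

```python
import numpy as np


def auc_mann_whitney(malignant, benign):
    """Empirical AUC: proportion of (malignant, benign) pairs in which the
    malignant value is larger; ties are counted as 1/2."""
    x = np.asarray(malignant, dtype=float)[:, None]
    y = np.asarray(benign, dtype=float)[None, :]
    return float(np.mean((x > y) + 0.5 * (x == y)))


# Hypothetical positive-valued biomarker data (log-normal, for illustration only).
rng = np.random.default_rng(1)
benign = rng.lognormal(mean=0.0, sigma=1.0, size=200)
malignant = rng.lognormal(mean=0.8, sigma=1.0, size=150)

auc_raw = auc_mann_whitney(malignant, benign)
# The logarithm is strictly increasing, so the ordering of every pair,
# and hence the AUC estimate, is unchanged.
auc_log = auc_mann_whitney(np.log(malignant), np.log(benign))
```

Because only the ordering of pairs enters the estimate, any strictly increasing transformation gives the same value, which is why design parameters can be specified on the transformed scale.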
2. A Dunnett-Type Test to Compare K Biomarkers with a Control
Suppose that biomarker 0 denotes a control (or existing) biomarker and biomarkers $1, \ldots, K$ denote K experimental biomarkers that are collected from each subject. Let $X_k$ and $Y_k$ denote the random variables for observations of biomarker $k\,(=0, 1, \ldots, K)$ from malignant and benign cases, respectively. In a data set, we have copies of $(X_0, X_1, \ldots, X_K)$, $\{(X_{0i}, X_{1i}, \ldots, X_{Ki}),\ i=1,\ldots,m\}$, from m malignant cases, and copies of $(Y_0, Y_1, \ldots, Y_K)$, $\{(Y_{0j}, Y_{1j}, \ldots, Y_{Kj}),\ j=1,\ldots,n\}$, from n benign cases. For the total sample size $N\,(=m+n)$, $p=m/N$ denotes the prevalence of malignancy in the target population. As can be seen from our sample size formula and the results of numerical studies in Section 5, our proposed test will have a maximum power if benign and malignant cases are balanced, i.e., $p=1/2$. Let $F_k$ and $G_k$ denote the cumulative distribution function (CDF) of biomarker k for malignant cases and benign cases, respectively, and $\hat{F}_k$ and $\hat{G}_k$ be their empirical estimators.
Without loss of generality, we assume that the raw biomarker values are non-negative and a large biomarker value is associated with a higher chance of malignancy. As such, for a cutoff value t, $1-F_k(t)$ is called the sensitivity and $G_k(t)$ is called the specificity of biomarker k. For $k=0,1,\ldots,K$, the ROC curve of biomarker k is the plot of $\{1-G_k(t),\,1-F_k(t)\}$ with respect to all possible cutoff values t.
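As an illustration of this construction, the sketch below computes the empirical ROC points by sweeping the cutoff over all observed values and then takes the area under the resulting curve with the trapezoidal rule. It is a minimal example under the convention that a subject is called positive when the biomarker value exceeds the cutoff; the function names `empirical_roc` and `trapezoid_auc` are hypothetical.

```python
import numpy as np


def empirical_roc(malignant, benign):
    """Return (1 - specificity, sensitivity) at every cutoff, from the
    highest cutoff (nobody positive) to the lowest (everybody positive)."""
    x = np.asarray(malignant, dtype=float)
    y = np.asarray(benign, dtype=float)
    observed = np.unique(np.concatenate([x, y]))[::-1]  # descending cutoffs
    cutoffs = np.concatenate([[np.inf], observed, [-np.inf]])
    sensitivity = np.array([np.mean(x > t) for t in cutoffs])
    one_minus_spec = np.array([np.mean(y > t) for t in cutoffs])
    return one_minus_spec, sensitivity


def trapezoid_auc(one_minus_spec, sensitivity):
    """Area under the ROC curve by the trapezoidal rule."""
    widths = np.diff(one_minus_spec)
    heights = (sensitivity[1:] + sensitivity[:-1]) / 2.0
    return float(np.sum(widths * heights))


fpr, tpr = empirical_roc([3.0, 5.0, 7.0], [1.0, 4.0, 6.0])
auc = trapezoid_auc(fpr, tpr)
```

In this toy data set, 6 of the 9 malignant-benign pairs have the larger value in the malignant case, and the trapezoidal area reproduces that proportion.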
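The AUC of the ROC curve for biomarker k is $\theta_k = P(X_k > Y_k)$, the probability that the value $X_k$ from a malignant case exceeds the value $Y_k$ from a benign case, which is estimated by

$$\hat{\theta}_k = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} I(X_{ki} > Y_{kj}),$$

where $X_{ki}$ and $Y_{kj}$ denote the values of biomarker k for the i-th malignant case and the j-th benign case, respectively, and $I(\cdot)$ is the indicator function.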
For experimental biomarker , we want to test the null hypothesis against an alternative hypothesis using . It was shown that [11], for , we have
where and . Hence, for ,
where .
Note that and are -dimensional independent random vectors with mean 0 because malignant and benign groups are independent. Furthermore, under , we have , so that, by the central limit theorem (CLT), converges to , where
and and are obtained from and , respectively, by replacing and with their consistent estimators and . That is,
and
The univariate (or unadjusted) test rejects if , where denotes the percentile of distribution. Appendix A of Jung [11] shows that the variance estimator is asymptotically identical to that of DeLong et al. [6]. In this sense, our univariate test statistic is asymptotically identical to that of DeLong et al. [6].
When there are multiple experimental biomarkers (i.e., $K \ge 2$), conducting multiple univariate tests inflates the false positivity of the biomarker discovery. For accurate control of false positivity, we need a multiple testing adjustment. We propose to control the FWER, i.e., the probability of accepting at least one experimental biomarker when none of the K experimental biomarkers is better than the control. For the overall null hypothesis $H_0$ and a common critical value $c$, controlling the FWER at $\alpha$ is obtained by solving
with respect to c. Note that the critical region for our testing has a square shape. DeLong et al. [6] proposed a one-way ANOVA-type test with an ellipsoidal critical region, but our Dunnett-type test will be more appropriate when comparing multiple experimental biomarkers with a common control.
In order to calculate the FWER provided in Equation (3), we need the joint distribution of under . Applying the multivariate CLT to Equation (2), under , is asymptotically normal with mean 0 and variance that can be consistently estimated by
where and
For a given critical value c, the FWER is obtained by calculating the probability on the right-hand side of Equation (3) based on the multivariate normal distribution using simulations or a numerical integration method. Finally, the critical value controlling the FWER at $\alpha$ is obtained by solving Equation (3) with respect to c, and biomarker k is accepted (or discovered) if its standardized test statistic exceeds c.
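The simulation route can be sketched as follows. This is a minimal illustration rather than the author's Fortran program: it assumes that, under the overall null hypothesis, the K standardized test statistics are approximately multivariate normal with zero means, unit variances, and a known (or estimated) correlation matrix, so that the one-sided critical value is the upper $\alpha$ quantile of their maximum. The function name `dunnett_critical_value_mc` and the example correlation matrices are hypothetical.

```python
import numpy as np


def dunnett_critical_value_mc(corr, alpha=0.05, n_draws=400_000, seed=0):
    """Monte Carlo critical value c with P(max_k Z_k > c) = alpha, where
    Z ~ N(0, corr) and corr is the K x K correlation matrix of the statistics."""
    corr = np.asarray(corr, dtype=float)
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n_draws)
    # Rejecting when any statistic exceeds c is the same as max_k Z_k > c.
    return float(np.quantile(z.max(axis=1), 1.0 - alpha))


# Two experimental biomarkers: uncorrelated versus positively correlated statistics.
c_independent = dunnett_critical_value_mc([[1.0, 0.0], [0.0, 1.0]])
c_correlated = dunnett_critical_value_mc([[1.0, 0.5], [0.5, 1.0]])
```

Positive correlation between the test statistics lowers the critical value relative to the independent case, which is the gain of a Dunnett-type adjustment over a Bonferroni-type one.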
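When $K=2$, the double integration for the probability calculation can be simplified to a single integration. If $(Z_1, Z_2)$ is a bivariate normal random vector with mean 0, variance 1, and correlation coefficient $\rho$, then the conditional distribution of $Z_2$ given $Z_1=z$ is normal with mean $\rho z$ and variance $1-\rho^2$. Hence, from Equation (3), we have

$$\mathrm{FWER} = 1 - P(Z_1 \le c,\, Z_2 \le c) = 1 - \int_{-\infty}^{c} \Phi\!\left(\frac{c-\rho z}{\sqrt{1-\rho^2}}\right)\phi(z)\,dz,$$

so the critical value controlling the FWER at $\alpha$ is obtained by solving the equation

$$\int_{-\infty}^{c} \Phi\!\left(\frac{c-\rho z}{\sqrt{1-\rho^2}}\right)\phi(z)\,dz = 1-\alpha$$

with respect to c, where $\phi(\cdot)$ and $\Phi(\cdot)$ are the probability density function (PDF) and CDF of the $N(0,1)$ distribution, respectively.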
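For two experimental biomarkers, the single-integral form can be solved directly. The sketch below assumes the two standardized test statistics are bivariate normal with correlation `rho` under the overall null hypothesis, integrates the conditional normal CDF against the standard normal density, and finds the root numerically. The function names `joint_prob` and `critical_value` are hypothetical.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm


def joint_prob(c, rho):
    """P(Z1 <= c, Z2 <= c) for a standard bivariate normal pair with
    correlation rho, written as a single integral over z = Z1."""
    scale = np.sqrt(1.0 - rho**2)

    def integrand(z):
        # P(Z2 <= c | Z1 = z) times the density of Z1 at z.
        return norm.cdf((c - rho * z) / scale) * norm.pdf(z)

    value, _ = quad(integrand, -np.inf, c)
    return value


def critical_value(alpha, rho):
    """Solve joint_prob(c, rho) = 1 - alpha for c."""
    return brentq(lambda c: joint_prob(c, rho) - (1.0 - alpha), 0.0, 6.0)


c_rho_zero = critical_value(0.05, 0.0)
c_rho_half = critical_value(0.05, 0.5)
```

When `rho` is 0 the joint probability factorizes, so the solution must agree with the standard normal quantile at the square root of 0.95, which gives a simple check of the integration.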
For biomarker with test statistic value , we may calculate the FWER-adjusted p-value by
3. Sample Size Calculation for the Dunnett-Type Test
Now, we derive a sample size formula for a specified alternative hypothesis . For , we consider a specific alternative hypothesis
From Equation (2), we have for under .
We have to calculate the variances and covariances of . Define and , i.e.,
and
For , let and denote the PDFs of and , respectively. We also define and , i.e.,
and
Hence, we have with
and, for ,
By the multivariate CLT applied to (2) under , is asymptotically normal with mean and variance V. So, under , we have
where
Note that is normal with mean 0 and variance–covariance of with and for .
Hence, for given N, the power of the Dunnett-type test controlling the FWER at is provided as
where is obtained by solving
For a given power $1-\beta$, the required sample size N is obtained by solving Equation (5) with respect to N. The probabilities on the right-hand sides of Equations (5) and (6) can be calculated using simulations or a multivariate numerical integration.
4. Transformation of Biomarker Data to Normal Distributions
The Dunnett-type test statistic is completely non-parametric, but the asymptotic distribution and power of the test statistic depend on the distributions of biomarkers for benign and malignant groups through the values of under and the variance–covariance of , as shown in the previous section. So, we have to assume a parametric model for sample size calculation.
Biomarkers usually take positive values. If we know the distributions of original biomarker values, then we may calculate a sample size based on them. However, we rarely have good information about the original distributions. Furthermore, for most multivariate positive distributions, their means, variances, and correlation coefficients are mutually associated, so it is difficult to specify the parameter values for a sample size calculation.
Since AUC is a rank statistic, it is invariant to monotone transformations of the biomarker values. We can find a transformation, such as the logarithm, that changes the distribution of a positive random variable to an (approximately) normal distribution, for which the parameters are not associated with each other and can be easily specified. Such a transformation tends to stabilize the variance of random variables too.
We assume that, after a log-transformation, each biomarker marginally has normal distributions with a common variance (less strictly, a location-shift model) between malignant and benign groups. Then, we standardize the data by subtracting the mean of the benign group and dividing by the common standard deviation. Then, for the resulting data of biomarker k, $X_k \sim N(\delta_k, 1)$ and $Y_k \sim N(0, 1)$ are independent with a location parameter $\delta_k$. Note that $\delta_k$ is the standardized effect size of log-transformed biomarker values between malignant and benign groups. It is known that the AUC $\theta_k$ and $\delta_k$ have a 1-to-1 relationship, i.e., $\theta_k = \Phi(\delta_k/\sqrt{2})$, where $\Phi(\cdot)$ is the CDF of the $N(0,1)$ distribution [11].
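The conversion between an AUC and a location-shift parameter, and the standardization step, can be sketched as follows. The conversion uses the standard result for two independent unit-variance normal variables, AUC = Φ(δ/√2). The helper names are hypothetical, and the use of the pooled sample standard deviation as the estimate of the common standard deviation is an assumption made for this illustration.

```python
import numpy as np
from scipy.stats import norm


def auc_to_shift(auc):
    """Location shift delta giving the specified AUC under N(delta, 1) vs. N(0, 1)."""
    return float(np.sqrt(2.0) * norm.ppf(auc))


def shift_to_auc(delta):
    """AUC implied by a location shift delta under N(delta, 1) vs. N(0, 1)."""
    return float(norm.cdf(delta / np.sqrt(2.0)))


def log_standardize(malignant, benign):
    """Log-transform positive data, then center at the benign mean and scale
    by the pooled standard deviation (assumed estimator of the common SD)."""
    x = np.log(np.asarray(malignant, dtype=float))
    y = np.log(np.asarray(benign, dtype=float))
    m, n = len(x), len(y)
    pooled_var = ((m - 1) * x.var(ddof=1) + (n - 1) * y.var(ddof=1)) / (m + n - 2)
    sd = np.sqrt(pooled_var)
    return (x - y.mean()) / sd, (y - y.mean()) / sd


delta_for_auc_070 = auc_to_shift(0.70)
```

Because the two functions are inverses of each other, a design can be specified either through AUC values or through standardized effect sizes, whichever is easier to elicit.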
With such a transformation, it is easy to conceptualize the design parameters for sample size calculations. Let denote the PDF of bivariate normal distribution with 0 means, unit variances, and correlation coefficient r. Then, we have , , and , where . Furthermore, after such a transformation by Appendix B of Jung [11], we have and , so we have and , and and are simplified to
and
where
and
Note that the ranges of integrations are changed from $(0, \infty)$ for the original data to $(-\infty, \infty)$ for the transformed data.
From Equations (9) and (10), we find that the sample size depends on the prevalence of malignancy $p$ through $p(1-p)$, so the required sample size is minimized when $p=1/2$ and is unchanged between two values of $p$ with an identical value of $p(1-p)$. Furthermore, since , the correlation matrix of , , is free of .
Assuming that, for biomarker k, and marginally follow normal distributions after a log-transformation, a sample size calculation is conducted as follows.
5. Numerical Studies
Suppose that there are two experimental biomarkers and a control biomarker following the trivariate normal distribution model with a location-shift as discussed in the previous section. We consider prevalence of malignancy , 0.3, or 0.5, , 0.7, or 0.8, AUCs for biomarkers under , and with or 0.15 under . Using the one-to-one relationship between AUC and location-shift parameter between and , we identify the location parameters corresponding to the specified values. We assume that correlation coefficients between any pair of biomarkers are identically or 0.4. Under each design setting of , we calculate the critical value for 1-sided FWER from Equation (8) and calculate sample size N for power , 0.85, or 0.9.
We conduct simulations to check if the calculated sample size is accurately powered or not. At first, we generate samples of size N under the design setting of the sample size calculation. That is, for and , and are independent trivariate normal random vectors with mean vectors and , respectively, but with variances 1 and compound symmetry structure with correlation coefficient r. Here, and are location parameters corresponding to AUCs and , respectively. The empirical power is estimated by the proportion of samples rejecting among the simulation samples. Also, to check if the proposed Dunnett-type test accurately controls the FWER, we generate samples of size N under (i.e., with ), and the empirical FWER is estimated by the proportion of samples rejecting .
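A simulation of this kind can be sketched in a few lines. The code below is an illustration under stated assumptions, not the author's Fortran program: the covariance of the AUC estimates is computed with the DeLong-type structural components (which the text above describes as asymptotically identical to the proposed variance estimator), the critical value is found from the multivariate normal CDF, and all function names and design values are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import multivariate_normal


def auc_and_cov(x, y):
    """x: (m, K+1) malignant data, y: (n, K+1) benign data. Returns the AUC
    estimates and their DeLong-type covariance matrix (swapped-in estimator)."""
    m, n = len(x), len(y)
    psi = (x[:, None, :] > y[None, :, :]) + 0.5 * (x[:, None, :] == y[None, :, :])
    auc = psi.mean(axis=(0, 1))
    v10 = psi.mean(axis=1)  # one row per malignant case
    v01 = psi.mean(axis=0)  # one row per benign case
    cov = np.cov(v10, rowvar=False) / m + np.cov(v01, rowvar=False) / n
    return auc, cov


def dunnett_test(x, y, alpha=0.05):
    """One-sided Dunnett-type test of each experimental AUC vs. the control (column 0)."""
    auc, cov = auc_and_cov(x, y)
    k = len(auc) - 1
    contrast = np.hstack([-np.ones((k, 1)), np.eye(k)])  # rows give auc_k - auc_0
    var = contrast @ cov @ contrast.T
    se = np.sqrt(np.diag(var))
    z = (contrast @ auc) / se
    mvn = multivariate_normal(mean=np.zeros(k), cov=var / np.outer(se, se),
                              allow_singular=True)
    c = brentq(lambda t: mvn.cdf(np.full(k, t)) - (1 - alpha), 0.0, 6.0, xtol=1e-4)
    return z, c, z > c


def rejection_rate(n_sim, m, n, r, delta, alpha=0.05, seed=0):
    """Proportion of simulated data sets in which at least one experimental
    biomarker is accepted; delta holds the location shifts of the malignant group."""
    rng = np.random.default_rng(seed)
    dim = len(delta)
    sigma = np.full((dim, dim), r) + (1 - r) * np.eye(dim)  # compound symmetry
    hits = 0
    for _ in range(n_sim):
        x = rng.multivariate_normal(delta, sigma, size=m)
        y = rng.multivariate_normal(np.zeros(dim), sigma, size=n)
        hits += dunnett_test(x, y, alpha)[2].any()
    return hits / n_sim


# Hypothetical settings: equal shifts (overall null) and larger experimental shifts.
empirical_fwer = rejection_rate(200, 40, 40, 0.4, [0.7, 0.7, 0.7])
empirical_power = rejection_rate(200, 80, 80, 0.4, [0.5, 1.2, 1.2], seed=1)
```

With equal location shifts the rejection rate estimates the FWER, and with larger shifts for the experimental biomarkers it estimates the probability of accepting at least one of them. The small number of replications here keeps the example fast; a real study would use many more.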
Table 1. Required sample size N (and empirical FWER level and power from simulations) for 1-sided test of vs. given for , AUCs under , and prevalence of malignancy .
The results of sample size calculations and simulations are summarized in Table 1. As expected, sample size increases in and decreases in , , , and for . From the simulation results, we find that our test statistic controls the FWER accurately and the calculated sample sizes are quite accurate in the sense that the empirical powers are close to the corresponding nominal power overall. For some design settings, however, the proposed test is slightly anti-conservative and the calculated sample size is slightly overpowered, especially if the calculated sample size is small and the allocation between cases and controls is very unbalanced (i.e., ). The calculated sample sizes are never underpowered, at least for the design settings considered in this numerical study. Similar results were observed from numerical studies on two-sample version of this test [11]. In designing a study, slightly overpowered sample sizes are less of an issue than underpowered ones. If we want a more accurately powered sample size, we may calculate one using simulations. In this case, the sample size calculated from our formula can be used for validation of the sample size estimated from simulations.
Example: Suppose that there exists a biomarker A and we want to investigate whether we can improve its performance by adding a new biomarker B or C. So, biomarker 0 is A, 1 is A + B, and 2 is A + C. After a log-transformation and standardization with the mean of the benign group and the common standard deviation as in Section 4, we assume that and are independent trivariate normal random vectors with unit variances commonly, but with mean vectors and , respectively. More specifically, if and denote the random vectors for the log-transformed biomarker data for malignant and benign groups, respectively, for , let denote the mean of and the common variance of and . Then, the standardized biomarker values will be and . In the data analysis, and will be replaced by their sample estimates. Assuming that B and C add a similar amount of information to A, let and . If we assume and , the corresponding location-shift parameters are . By assuming a prevalence of malignancy , we have . Using , the required sample size for 80% power is , or malignant cases and benign cases. From 5000 simulation samples, we observe an empirical FWER of under and power of under . We find that this sample size is somewhat overpowered.
6. Discussion
A Dunnett-type testing method is proposed to compare the performance of a control biomarker with K (≥2) new biomarkers using ROC curves. The test statistic is completely non-parametric and robust. Also proposed is a sample size calculation method for the Dunnett-type test. For a sample size calculation, we have to specify the distribution of the biomarkers. The biomarkers to be compared are collected from each subject, so they are usually correlated. Furthermore, the raw data of the biomarkers are usually positive, so their parameters, such as means, variances, and correlation coefficients, are associated with one another in a complicated way. As a result, it is difficult to specify the parameter values in a sample size calculation.
In our numerical studies, we assume equal correlation coefficients among the biomarkers for simplicity. However, the true correlation structure may be more complicated among real biomarkers. Our test statistic automatically accounts for any type of true correlation structure by using the so-called GEE-type approach [14]. In sample size calculation, we have to specify the true correlation coefficients based on our best estimation. If we have pilot data, we may use the estimated correlation coefficients as the true ones in the sample size calculation. In this sense, we may consider a two-stage design of biomarker projects to estimate the correlation coefficients among the biomarkers from the first-stage data and calculate the sample size for the whole project combining the stage 1 and 2 data sets. In this case, we do not need a type I error adjustment for the final testing since we do not conduct any statistical testing at the first stage.
Since the AUC of ROC curves is a rank statistic, the AUCs and the proposed test statistic are invariant to a monotone transformation of biomarker data. Utilizing this property, Jung [11] proposed to calculate sample sizes based on the log-transformed data for the easy specification of the parameter values. We have extended his two-sample testing method to $(K+1)$-sample testing by controlling the FWER for multiple testing adjustment. Through simulations, we find that the Dunnett-type test controls the FWER accurately and the calculated sample sizes maintain the specified power closely overall. One-sided critical regions are used in the presentation of our method, but the extension to a two-sided test is straightforward.
To simplify the discussion, we have assumed that the transformed data follow a normal distribution, but it can be any distribution with a location-shift model between the benign and malignant groups. Whatever the transformed distribution, the correct distribution functions should be used in the sample size calculation. Note that, for any continuous distribution, there exists a transformation to a normal distribution, although it may not be a log-transformation. However, a log-transformation achieves this goal relatively well in the real analysis of positive data.
The Dunnett test is popularly used in clinical trials or retrospective clinical studies to compare multiple experimental treatment groups with a common control group with a multiple testing adjustment. Usually, the groups to be compared are independent. For example, a Dunnett-type test and its sample size calculation method were proposed to compare K (≥2) experimental arms with a common control arm using a time-to-event endpoint [15]. In this paper, however, the biomarkers to be compared are correlated since all the biomarkers are collected from each subject.
Contrary to the case of the Dunnett-type test, we may want to compare the groups without a common control. In this case, we usually use a one-way ANOVA test. Jung and Hui [16] proposed a sample size calculation method for the ANOVA-type log-rank tests to compare K (≥3) treatment groups using a time-to-event endpoint.
Our Fortran program, developed for sample size calculation and simulations, is available from the author upon request.
Funding
This research received no external funding.
Institutional Review Board Statement
This paper proposes statistical methods for clinical projects. It does not report any real clinical data, so there are no issues related to IRB approval or consent forms.
Informed Consent Statement
Not applicable.
Data Availability Statement
No new data were created or analyzed in this study.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Greenhouse, S.W.; Mantel, N. The evaluation of diagnostic tests. Biometrics 1950, 6, 399–412.
2. Linnet, K. Comparison of quantitative diagnostic tests: Type 1 error, power, and sample size. Stat. Med. 1987, 6, 147–158.
3. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36.
4. Hanley, J.A.; McNeil, B.J. Method for comparing the area under the ROC curves derived from the same cases. Radiology 1983, 148, 839–843.
5. McClish, D.K. Comparing the areas under more than two ROC curves. Med. Decis. Mak. 1987, 7, 149–155.
6. DeLong, E.R.; DeLong, D.M.; Clarke-Pearson, D.L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 1988, 44, 837–845.
7. Wieand, S.; Gail, M.H.; James, B.R.; James, K.L. A family of nonparametric statistics for comparing diagnostic markers with paired or unpaired data. Biometrika 1989, 76, 585–592.
8. Emir, B.; Wieand, S.; Jung, S.H.; Ying, Z. Comparison of diagnostic markers with repeated measurements: A non-parametric ROC curve approach. Stat. Med. 2000, 19, 511–523.
9. Jung, S.H.; Wieand, S.; Cha, S. A statistic for comparing two correlated markers which are prognostic for time to an event. Stat. Med. 1995, 14, 2217–2226.
10. Hwang, Y.T.; Su, N.C. Sample size determination for comparing accuracies between two diagnostic tests under a paired design. Biom. J. 2022, 64, 771–804.
11. Jung, S.H. Sample size calculation for comparing two ROC curves. Pharm. Stat. 2024, 23, 557–569.
12. Dunnett, C.W. A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 1955, 50, 1096–1121.
13. Dunnett, C.W. New tables for multiple comparisons with a control. Biometrics 1964, 20, 482–491.
14. Liang, K.Y.; Zeger, S.L. Longitudinal data analysis using generalized linear models. Biometrika 1986, 73, 13–22.
15. Jung, S.H.; Kim, C.; Chow, S.C. Sample size calculation for the log-rank tests for multi-arm trials with a common control. J. Korean Stat. Soc. 2008, 37, 11–22.
16. Jung, S.H.; Hui, S. Sample size calculations to compare K different survival distributions. Lifetime Data Anal. 2002, 8, 361–373.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).