1. Introduction
Sensitivity (Se) and specificity (Sp) are fundamental measures of the performance of a diagnostic test. They, respectively, provide the probability of a positive test result in patients with the index disease (the disease that the test is designed to detect or exclude) and a negative test result in patients without the disease [1]. When combined with pre-test probability (prevalence) of disease in a tested population, they can be used to determine the more clinically useful metrics positive predictive value (PPV) and negative predictive value (NPV) [2]. Alternatively, Se and Sp can be used to determine likelihood ratios, which can be used with the pre-test odds (rather than probability) of disease to estimate the post-test odds of disease [3].
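As a minimal sketch of the relationships described above, the following Python functions compute PPV, NPV, and likelihood ratios from Se, Sp, and prevalence using the standard textbook definitions. The function names and the numerical values in the usage note are illustrative assumptions, not taken from the cited references.

```python
def predictive_values(se, sp, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and pre-test probability."""
    tp = se * prevalence                # proportion diseased and test-positive
    fp = (1 - sp) * (1 - prevalence)    # proportion non-diseased and test-positive
    tn = sp * (1 - prevalence)          # proportion non-diseased and test-negative
    fn = (1 - se) * prevalence          # proportion diseased and test-negative
    return tp / (tp + fp), tn / (tn + fn)


def likelihood_ratios(se, sp):
    """Return (positive LR, negative LR)."""
    return se / (1 - sp), (1 - se) / sp


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert pre-test probability to odds, apply the LR, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)
```

With hypothetical values Se = 0.9, Sp = 0.8, and prevalence = 0.5, the post-test probability obtained from the positive likelihood ratio coincides with the PPV, which shows that the two routes are equivalent.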
The Se and Sp of an investigational test (IT) must therefore be accurately determined before the test can be useful in clinical practice. Typically, accuracy is determined through one or more clinical studies in which study patients undergo both the IT and another diagnostic test, which serves as a reference standard (RS). The RS is used to classify each patient as disease-positive (DP) or disease-negative (DN), and these classifications are used to classify the patient’s IT result as true positive (TP), true negative (TN), false positive (FP), or false negative (FN). The respective numbers of patient classifications (nTP, nTN, nFP, and nFN) are used to determine the sensitivity and specificity of the investigational test (SeI and SpI, respectively; Formulas (1) and (2)).
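Formulas (1) and (2) are not reproduced in this text. As a hedged sketch, the standard definitions of sensitivity and specificity from classification counts, which those formulas presumably express, can be written as follows. The function names and the example counts are hypothetical.

```python
def sensitivity(n_tp, n_fn):
    """Se = nTP / (nTP + nFN): fraction of disease-positive patients who test positive."""
    return n_tp / (n_tp + n_fn)


def specificity(n_tn, n_fp):
    """Sp = nTN / (nTN + nFP): fraction of disease-negative patients who test negative."""
    return n_tn / (n_tn + n_fp)


# Hypothetical study: 100 disease-positive and 100 disease-negative patients.
se_i = sensitivity(n_tp=80, n_fn=20)
sp_i = specificity(n_tn=90, n_fp=10)
```

These values are correct only when the RS classifies every patient correctly, which is the case examined in the next paragraph.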
If the RS is perfect (SeR = SpR = 1 [100%]), then it is a true standard of truth, and each patient is correctly classified as DP or DN. SeI and SpI can then be determined straightforwardly and correctly. When a perfect RS is used to evaluate a patient, there are only two possible outcomes for the RS (either a true-positive RS result classifying the patient as DP or a true-negative RS result classifying the patient as DN); and when a perfect RS is used to classify positive and negative IT results, there can only be four possible IT classifications (TP, TN, FP, and FN).
However, if the RS is imperfect (i.e., SeR and/or SpR < 1 [<100%]), then it can misclassify patients, resulting in incorrect numbers of patients with and without disease (nDP, nDN, nTP, and nTN), and thus incorrect values of SeI and SpI. In this case, there can be four (rather than two) possible classifications of a patient by the imperfect RS: a true-positive result classifying the patient as DP, a true-negative result classifying the patient as DN, a false-positive result misclassifying the patient as DP, or a false-negative result misclassifying the patient as DN. When the imperfect RS is used to classify the positive and negative IT results, the IT can yield true or false results, which could agree or disagree with the RS, which in turn could be true or false. As a result, there could be eight categories: TP, TN, FP, and FN from comparison of the IT results to accurate RS results, plus apparent TP, TN, FP, and FN from comparison of the IT results to inaccurate RS results. Consequently, determination of the true Se and Sp of the IT when an imperfect RS is used is not as straightforward as when the RS is perfect.
An example of an imperfect reference standard that inspired this paper was the report by McKeith and coworkers of a clinical study that determined the Se and Sp of [123I]ioflupane with single-photon emission computed tomography (SPECT) for assessing patients with suspected dementia with Lewy bodies (DLB) [4]. In that study, the International Consensus Criteria (ICC [5]) for diagnosing DLB were used as the RS. However, it was known from a validation study [6] that the Se and Sp of the ICC were less than perfect (0.83 and 0.95, respectively). Thus, the true Se and Sp of ioflupane for DLB were uncertain. For this reason, I sought ways to adjust the apparent values of Se and Sp of ioflupane imaging, as reported by McKeith et al. [4], to account for the known Se and Sp of the diagnostic criteria used as the reference standard.
A literature search yielded several relevant articles. Nihashi et al. [7] reported using a Bayesian latent class model for adjusting the Se and Sp of DaTscan for DLB for eight clinical studies (including follow-up data from the McKeith 2007 study [4], but not the original data), although neither the published article nor the Supplemental Materials provided sufficient details to allow one to reproduce their results.
Umemneku Chikere et al. [8] discussed three methods of correcting for the effects of an imperfect RS: Brenner [9], Gart and Buck [10], and Staquet et al. [11]. All of these authors took different approaches and reported different equations. They did not report enough detail to allow one to determine if their derivations were correct.
Trikalinos et al. mentioned the possibility of adjusting results that are based on an imperfect RS but did not report derivation of the formulas needed to do so [12].
Therefore, this work was initiated to derive formulas needed to correct for patient misclassifications by an imperfect RS and to determine the true values of a diagnostic test’s Se and Sp when they were determined by using an imperfect RS, in diagnostic terms readily understandable to clinicians, with full transparency of the derivations. Those results are reported here; application of the results to the McKeith ioflupane study [4], along with a review of relevant literature, will be reported separately.
2. Materials and Methods
Formulas were derived on the basis of an analysis of how a reference standard (RS) is used to classify patients as disease-positive or disease-negative and how misclassifications by an imperfect RS affect the apparent values of SeI and SpI. Throughout, conditional independence of the RS and IT is assumed; i.e., the RS and IT misclassify patients independently. This assumption is reasonable if, for example, the RS and IT work by different mechanisms.
Two diagrams were created to depict patient misclassifications by the RS (Figure 1) and their effect on the apparent values of SeI and SpI (Figure 2). In both figures, the prefix a was used to denote an apparent value. In Figure 2, to differentiate between the reference and investigational tests with respect to the numbers of true positives (nTP), true negatives (nTN), false positives (nFP), false negatives (nFN), Se, and Sp, these variables had the subscripts R (for reference test) or I (for investigational test) added.
Figure 1 shows how an imperfect RS results in patient misclassifications: multiplying SeR and SpR by the true numbers of disease-positive (nDP) and disease-negative (nDN) patients results in the apparent number of disease-positive patients (anDP) and the apparent number of disease-negative patients (anDN). In Figure 1, nTPR is the number of patients with true-positive RS results, nFNR is the number of patients with false-negative RS results, nTNR is the number of patients with true-negative RS results, nFPR is the number of patients with false-positive RS results, SeR is the sensitivity of the reference standard, SpR is the specificity of the reference standard, anDP is the apparent number of disease-positive patients, and anDN is the apparent number of disease-negative patients.
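The classification step that Figure 1 describes can be sketched in Python as follows. This is an illustration built from the variable definitions above, not the figure itself. The function name, the dictionary layout, and the patient numbers (100 DP, 200 DN) are assumptions. The RS values 0.83 and 0.95 are those quoted in the Introduction for the ICC.

```python
def reference_standard_counts(n_dp, n_dn, se_r, sp_r):
    """Expected numbers of patients in each RS category for an imperfect RS."""
    n_tpr = se_r * n_dp          # truly DP, RS positive (correct)
    n_fnr = (1 - se_r) * n_dp    # truly DP, RS negative (misclassified as DN)
    n_tnr = sp_r * n_dn          # truly DN, RS negative (correct)
    n_fpr = (1 - sp_r) * n_dn    # truly DN, RS positive (misclassified as DP)
    return {
        "nTPR": n_tpr, "nFNR": n_fnr, "nTNR": n_tnr, "nFPR": n_fpr,
        "anDP": n_tpr + n_fpr,   # everyone the RS calls disease-positive
        "anDN": n_tnr + n_fnr,   # everyone the RS calls disease-negative
    }


counts = reference_standard_counts(n_dp=100, n_dn=200, se_r=0.83, sp_r=0.95)
```

Here the apparent number of disease-positive patients mixes correctly classified DP patients with misclassified DN patients, so anDP generally differs from nDP unless the two kinds of error happen to cancel.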
Figure 2 shows how the patient misclassifications result in incorrect values of the IT’s sensitivity (SeI) and specificity (SpI): multiplying anDP and anDN by the true (but initially unknown) values of SeI and SpI gives the apparent numbers of true-positive (anTPI) and true-negative (anTNI) IT results, which, when, respectively, divided by anDP and anDN (in accordance with Equations (1) and (2) above), give incorrect apparent values of the IT’s Se and Sp (aSeI and aSpI). In Figure 2,
nDP is the true number of disease-positive patients
nDN is the true number of disease-negative patients
nTPR is the number of patients with true-positive RS results
nFNR is the number of patients with false-negative RS results
nTNR is the number of patients with true-negative RS results
nFPR is the number of patients with false-positive RS results
SeR is the sensitivity of the RS
SpR is the specificity of the RS
anDP is the apparent number of disease-positive patients according to the RS
anDN is the apparent number of disease-negative patients according to the RS
SeI is the sensitivity of the investigational test
SpI is the specificity of the investigational test
anTPI1 is the apparent number of true-positive IT results based on nTPR
anTPI2 is the apparent number of true-positive IT results based on nFPR
anFNI1 is the apparent number of false-negative IT results based on nTPR
anFNI2 is the apparent number of false-negative IT results based on nFPR
anFPI1 is the apparent number of false-positive IT results based on nFNR
anFPI2 is the apparent number of false-positive IT results based on nTNR
anTNI1 is the apparent number of true-negative IT results based on nFNR
anTNI2 is the apparent number of true-negative IT results based on nTNR
anTPI is the apparent total number of true-positive IT results
anTNI is the apparent total number of true-negative IT results.
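The quantities listed above can be combined into a forward model, sketched here in Python. This is one reading of the definitions under the stated assumption of conditional independence, and it is not a reproduction of Figure 2. Patients in nTPR and nFNR are truly disease-positive, so their IT results are governed by SeI, whereas patients in nFPR and nTNR are truly disease-negative, so their IT results are governed by SpI. The function name and the example inputs are hypothetical.

```python
def apparent_it_performance(n_dp, n_dn, se_r, sp_r, se_i, sp_i):
    """Expected apparent IT counts and apparent Se/Sp under conditional independence."""
    # RS classification of the true DP and DN patients.
    n_tpr, n_fnr = se_r * n_dp, (1 - se_r) * n_dp
    n_tnr, n_fpr = sp_r * n_dn, (1 - sp_r) * n_dn

    # RS-positive patients: IT-positive results are counted as apparent TP.
    an_tpi1 = n_tpr * se_i          # truly DP, IT positive
    an_tpi2 = n_fpr * (1 - sp_i)    # truly DN, IT positive
    an_fni1 = n_tpr * (1 - se_i)    # truly DP, IT negative
    an_fni2 = n_fpr * sp_i          # truly DN, IT negative

    # RS-negative patients: IT-negative results are counted as apparent TN.
    an_fpi1 = n_fnr * se_i          # truly DP, IT positive
    an_fpi2 = n_tnr * (1 - sp_i)    # truly DN, IT positive
    an_tni1 = n_fnr * (1 - se_i)    # truly DP, IT negative
    an_tni2 = n_tnr * sp_i          # truly DN, IT negative

    an_tpi, an_fni = an_tpi1 + an_tpi2, an_fni1 + an_fni2
    an_fpi, an_tni = an_fpi1 + an_fpi2, an_tni1 + an_tni2
    a_se_i = an_tpi / (an_tpi + an_fni)   # apparent Se: anTPI / anDP
    a_sp_i = an_tni / (an_tni + an_fpi)   # apparent Sp: anTNI / anDN
    return {"anTPI": an_tpi, "anFNI": an_fni, "anFPI": an_fpi,
            "anTNI": an_tni, "aSeI": a_se_i, "aSpI": a_sp_i}


result = apparent_it_performance(100, 200, se_r=0.83, sp_r=0.95, se_i=0.9, sp_i=0.8)
```

In this hypothetical example, the apparent values are lower than the true values assumed for the IT (0.9 and 0.8), because misclassified patients contribute IT results that appear to disagree with the RS.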
The two diagrams were analyzed to develop formulas that were then solved to give nDP, nDN, SeI and SpI starting from the apparent results of a clinical study.
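As a hedged illustration of what such a solution can look like, the sketch below inverts the relationships implied by the definitions above, assuming conditional independence and SeR + SpR > 1. It is an algebraic consequence of those definitions and is not claimed to be the exact form of the formulas derived in this work. The function name and the input counts are hypothetical.

```python
def correct_for_imperfect_rs(an_tpi, an_fni, an_fpi, an_tni, se_r, sp_r):
    """Recover (nDP, nDN, SeI, SpI) from apparent 2 x 2 counts and known SeR, SpR."""
    j = se_r + sp_r - 1                      # must be positive for the inversion
    if j <= 0:
        raise ValueError("SeR + SpR must exceed 1")
    n = an_tpi + an_fni + an_fpi + an_tni    # total number of patients
    an_dp = an_tpi + an_fni                  # RS-positive patients
    it_pos = an_tpi + an_fpi                 # IT-positive patients
    it_neg = n - it_pos                      # IT-negative patients

    n_dp = (an_dp - (1 - sp_r) * n) / j      # true number of DP patients
    n_dn = n - n_dp                          # true number of DN patients
    true_tp = (an_tpi - (1 - sp_r) * it_pos) / j   # IT positive among true DP
    true_tn = (an_tni - (1 - se_r) * it_neg) / j   # IT negative among true DN
    return n_dp, n_dn, true_tp / n_dp, true_tn / n_dn


# Hypothetical apparent counts for a study with SeR = 0.83 and SpR = 0.95.
n_dp, n_dn, se_i, sp_i = correct_for_imperfect_rs(76.7, 16.3, 53.3, 153.7, 0.83, 0.95)
```

With these inputs the sketch returns approximately nDP = 100, nDN = 200, SeI = 0.9, and SpI = 0.8. With real study data, sampling variation can push the corrected estimates outside the interval from 0 to 1, which this sketch does not guard against.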
4. Discussion
In assessing an investigational new diagnostic test, it is not always feasible to use a perfect RS, and an imperfect RS (one with Se and/or Sp < 1) must sometimes be used, which can result in patient misclassifications and incorrect values of SeI and SpI. Such situations raise questions about the accuracy of the investigational test’s Se and Sp determined using the imperfect RS. In this work, formulas for correctly calculating the investigational test’s true Se and true Sp from any reference standard were derived.
Three prior studies [9,10,11] reported derivations of formulas for correcting Se and Sp for misclassification by an imperfect RS. Their approaches differed from each other and from the approach taken here. In addition, they did not report enough detail to allow one to determine if their derivations were correct. Therefore, comparison of this work to theirs was difficult but was successful for the approaches by Gart and Buck [10] and Staquet et al. [11] (see Supplementary Materials for details).
Brenner [9] reported equations for aSeI and aSpI for a case–control study if SeI, SpI, SeR, SpR and the exposure Pr were all known, but did not report solving for SeI and SpI, contrary to the paper by Umemneku Chikere et al. [8], who I believe may have misinterpreted the Brenner equations for aSeI and aSpI as being for SeI and SpI.
Gart and Buck [10] discussed the use of screening and reference tests for estimating disease Pr in epidemiologic studies. They derived equations for what they termed co-positivity and co-negativity (which I determined to be equivalent to aSeI and aSpI), and solved these for SeI and SpI if Pr, SeR, SpR, aSeI and aSpI are known. I was able to show that my equations for SeI and SpI (after transformation into proportion-based variables) were equivalent to theirs (see Supplementary Materials for details).
Staquet and colleagues [11] reported equations for calculating SeI and SpI provided that SeR, SpR, anTPI, anTNI, anFPI, and anFNI are known. I was able to show that my equations for SeI and SpI were equivalent to theirs (see Supplementary Materials for details).
Trikalinos et al. [12] did not report equations for SeI or SpI, but I compared their equations for the cells of their 2 × 2 contingency table (corresponding to paTPI and paTNI), and they matched what I had derived (data not shown).
Strengths and Weaknesses of This Work
This work builds on that of Trikalinos et al. [12], Gart and Buck [10], Staquet et al. [11] and Brenner [9]. These authors discussed potential methods of handling diagnostic studies that use an imperfect RS but did not provide sufficient detail to allow easy replication of their methods. Although Trikalinos et al. [12] mentioned the possibility of adjusting results that are based on an imperfect RS, they did not report derivation of the formulas needed to do so, as I have. One advantage of my work is that I provide full details of the derivations (see Supplementary Materials) so that others may easily reproduce and confirm my work. In addition, I report formulas that can use either absolute patient counts or proportions (e.g., prevalence), in contrast to prior authors, who reported formulas for only one or the other approach.
5. Conclusions
Validation of a new diagnostic test by use of an imperfect RS (one with Se and/or Sp < 1) introduces patient misclassifications that result in deviation of the apparent sensitivity and specificity of the index test (aSeI and aSpI, respectively) from the true values. By analyzing the role of the reference standard in the determination of the sensitivity and specificity of an index test, it is possible to derive formulas that correct for patient misclassification by an imperfect RS, as well as for the subsequent error introduced into aSeI and aSpI. The analysis showed that the more imperfect the RS (i.e., the lower the Se and Sp of the RS), the greater the error introduced into aSeI and aSpI. Therefore, when an imperfect RS is used to validate a diagnostic test, it may be necessary to apply corrections to arrive at accurate values of Se and Sp for the test.
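The dependence of the error on the quality of the RS can be illustrated numerically. The sketch below uses a proportion-based form of the apparent Se and Sp that follows from conditional independence. The function name, the prevalence of 0.3, the IT values of 0.9 and 0.8, and the choice of setting SeR equal to SpR are all hypothetical assumptions made for illustration.

```python
def apparent_se_sp(prevalence, se_r, sp_r, se_i, sp_i):
    """Apparent (aSeI, aSpI) expressed in proportions rather than patient counts."""
    p, q = prevalence, 1 - prevalence
    rs_pos = p * se_r + q * (1 - sp_r)        # proportion classified DP by the RS
    rs_neg = p * (1 - se_r) + q * sp_r        # proportion classified DN by the RS
    a_se = (p * se_r * se_i + q * (1 - sp_r) * (1 - sp_i)) / rs_pos
    a_sp = (p * (1 - se_r) * (1 - se_i) + q * sp_r * sp_i) / rs_neg
    return a_se, a_sp


TRUE_SE_I, TRUE_SP_I = 0.9, 0.8               # hypothetical true IT performance
errors = []
for rs_quality in (1.0, 0.95, 0.9, 0.8, 0.7): # SeR = SpR = rs_quality
    a_se, a_sp = apparent_se_sp(0.3, rs_quality, rs_quality, TRUE_SE_I, TRUE_SP_I)
    errors.append((abs(a_se - TRUE_SE_I), abs(a_sp - TRUE_SP_I)))
    print(f"SeR=SpR={rs_quality:.2f}  aSeI={a_se:.3f}  aSpI={a_sp:.3f}")
```

In this example both errors are zero for a perfect RS and grow steadily as SeR and SpR decrease, consistent with the conclusion stated above.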
This work builds on that of prior authors who discussed potential methods of handling diagnostic studies that use an imperfect RS but did not provide sufficient detail to allow easy replication of their methods. In contrast, full details of the derivations are provided (in Supplementary Materials) to provide transparency, so that others may confirm and perhaps build upon this work. In addition, formulas based on both patient counts and patient proportions are provided, in contrast to prior authors, who provided either one or the other.
For this corrective method to be feasible, two conditions must be met. First, the assumption of conditional independence of the index test and the RS must be true; this assumption is reasonable if the two tests work by different mechanisms (e.g., if the index test relies on laboratory methods and the RS relies on autopsy). Second, one obviously needs to know the values of the Se and Sp of the RS, which may not always be the case. However, if they are known, then the derived formulas can help provide needed corrections to the apparent values of an index test’s Se and Sp.