1. Introduction
Forensic analysis of crime scene evidence such as fingerprints, tool marks, and bullet and cartridge cases typically involves presenting forensic examiners with two samples: one from the crime scene (e.g., a latent fingerprint), and one from the suspect (e.g., the suspect’s fingerprint). The examiners’ task is to determine whether the suspect’s sample matches or does not match the crime scene sample. This method of conducting forensic feature comparison analyses has been involved in thousands of convictions of innocent individuals, making flawed forensic evidence a leading cause of wrongful conviction in the U.S. (
National Registry of Exonerations, 2025). In light of this discovery, scholars have suggested an alternative procedure for conducting forensic feature comparison analyses: the filler-control method (
Wells et al., 2013), otherwise known as an “evidence lineup” (
Kassin et al., 2013;
Risinger et al., 2002). Akin to an eyewitness lineup, the filler-control method involves presenting examiners with the crime scene sample and a minimum of two comparison samples: one from the suspect, and at least one “filler” sample that is known not to match the crime scene sample. The examiner then must decide whether any of the comparison samples matches the crime scene sample.
The filler-control method is proposed to have several advantages over the standard forensic analysis method. First, the filler-control method may help prevent forensic confirmation bias—the process whereby an examiner’s beliefs, expectations, motivation, or situational context influence their interpretation of forensic evidence (
Kassin et al., 2013). In the standard procedure, examiners’ knowledge of contextual case information (e.g., that the suspect confessed) can lead them to perceive greater similarity between the suspect’s sample and the crime scene sample or lead them to lower their threshold for rendering a match decision (e.g.,
Cooper & Meterko, 2019;
Kukucka & Kassin, 2014). Because examiners using the filler-control method do not know which sample is from the suspect, the filler-control method can reduce the influence of contextual bias on examiners’ judgments (
Quigley-McBride & Wells, 2018). Another key benefit of the filler-control method is that it provides a mechanism for exposing errors—namely, match judgments on fillers—that would go undetected in the standard procedure (
Wells et al., 2013). The filler-control method thus provides a way to estimate the error rate in actual cases (
Wells et al., 2013). Moreover, the filler-control method enables error-rate estimation not only for the forensic technique but for a given laboratory or individual forensic examiner. This feature of the filler-control method underlies the third proposed benefit of the procedure, which is to reduce examiner overconfidence through the provision of error feedback to examiners (
Wells et al., 2013). To date, however, no research has tested this purported benefit of the filler-control method. The goal of the current research, therefore, was to test whether the filler-control method reduces examiner overconfidence compared to the standard feature comparison method.
Overconfidence is rampant in the field of forensic science, where “failure to acknowledge uncertainty in findings is common” (
National Research Council, 2009, p. 47) and expert witnesses have been criticized for providing testimony that goes “far beyond what the relevant science can justify” (
President’s Council of Advisors on Science and Technology, 2016, p. 29). Such high-confidence forensic testimony is persuasive to jurors, who are often more prone to convict when forensic experts downplay or fail to acknowledge the potential for error (
Crozier et al., 2020;
Garrett & Mitchell, 2013;
Koehler, 2011). Indeed, numerous cases of wrongful conviction can be at least partially attributed to testimony from forensic examiners who overstated the validity of their conclusions (
Morgan, 2023;
Innocence Project, n.d.b). As just one example, Ray Krone received a death sentence for a 1992 murder conviction based largely on forensic bite mark testimony from two examiners. One examiner referred to the bite marks on the murder victim as an “excellent match” to Krone’s teeth and stated that “it was Ray Krone’s teeth” (
Garrett & Neufeld, 2009, p. 70). The other examiner testified that Krone’s teeth were a “definite match” (
Garrett & Neufeld, 2009, p. 70). After spending more than a decade in prison, Krone was exonerated when DNA evidence conclusively established his innocence (
Innocence Project, n.d.a).
Examiner overconfidence may be a result of the illusion of validity, a phenomenon in which people are prone to be highly confident in their judgments despite having poor judgmental accuracy (
Kahneman & Tversky, 1973). This phenomenon is pervasive even among trained medical professionals making decisions about patients’ health (
Bushyhead & Christensen-Szalanski, 1981). Importantly, the illusion of validity may be the result of incomplete feedback about one’s mistakes (
Einhorn & Hogarth, 1978). It is possible that this phenomenon occurs among forensic examiners, who do not receive regular and reliable error feedback about their case judgments (
Wells et al., 2013). The standard feature comparison procedure does not provide routine error feedback to forensic examiners because ground truth (i.e., whether the suspect’s sample matches or does not match the crime scene sample) is typically unknown, and false positive errors are often undiscovered until years down the line, if ever. In contrast, the filler-control method provides immediate and regular error feedback to examiners—namely, any time the examiner renders a match judgment on a filler sample. Filler errors are likely to occur after only a few uses of the filler-control method if the technique is unreliable or the examiner is incompetent (
Wells et al., 2013). For valid but imperfect forensic techniques, the filler-control method produces an error rate estimate that can help examiners calibrate their confidence with their accuracy. For example, if a hypothetical forensic examiner knew that her error rate for a certain conclusion was 30%, she might learn that she should never be more than 70% confident in that conclusion.
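To make the link between an error-rate estimate and calibration concrete, the following minimal sketch computes a simple overconfidence score: mean confidence minus the proportion of correct judgments. The function name, the data, and the choice of this particular index are illustrative assumptions; they are not the measures or results reported in this article.

```python
def overconfidence(confidences, correct):
    """Return mean confidence (0-1 scale) minus proportion correct.

    Positive values indicate overconfidence, negative values
    underconfidence, and zero indicates perfect calibration on average.
    """
    if not confidences or len(confidences) != len(correct):
        raise ValueError("confidences and correct must be non-empty and equal in length")
    mean_confidence = sum(confidences) / len(confidences)
    accuracy = sum(correct) / len(correct)
    return mean_confidence - accuracy


# Hypothetical examiner: 10 judgments of one type, 3 of them wrong
# (a 30% error rate), each made with 90% confidence.
confidences = [0.9] * 10
correct = [True] * 7 + [False] * 3

oc = overconfidence(confidences, correct)
print(f"Overconfidence: {oc:.2f}")
```

In this invented example the examiner's confidence exceeds her accuracy by about 20 percentage points, which is the kind of gap that error feedback is proposed to shrink.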
Although no research, to our knowledge, has examined confidence–accuracy calibration in the forensic filler-control method, a variety of studies have demonstrated that the provision of error feedback can reduce overconfidence and improve confidence–accuracy calibration. In an early examination of this phenomenon, Arkes et al. (1987) asked participants to respond to 35 general knowledge questions by indicating which of two response options was the most likely answer and rating their confidence in the accuracy of their response. After responding to five questions, half of the participants received feedback about the accuracy of their answers to the first five questions; the other half did not. Because average accuracy on the first five questions was around chance (50%), information about the correct answers functioned largely as error feedback. The experimenters then measured participants’ confidence calibration on the remaining 30 questions. Consistent with the idea that providing error feedback improves calibration, participants who received feedback after the first five questions were less overconfident on the remaining 30 questions than were participants who did not receive feedback. More recently,
Haddara and Rahnev (
2022) investigated whether feedback-induced performance improvements are attributable to automatic sensory processing improvements (i.e., increased perceptual and/or metacognitive sensitivity) or improvements in decision strategy (i.e., changes in response criteria for perceptual and/or metacognitive judgments). They found that trial-by-trial feedback exerts its effects through the latter mechanism, ultimately improving calibration. Most germane to the current research, the authors explained that “when overconfident participants receive feedback about being wrong, this feedback allows them to lower their confidence ratings, thus improving their confidence calibration” (p. 272). Thus, forensic examiners who receive error feedback after rendering mistaken match judgments on filler samples may likewise adjust their confidence ratings to better reflect their accuracy.
However, the provision of error feedback is not the only feature of the filler-control method that has the potential to affect confidence–accuracy calibration. Procedures that include fillers are inherently more difficult than those that do not (
Smith et al., 2020a) because the presence of fillers adds “noise” that makes it more difficult to discriminate between the presence and absence of signal (
Macmillan, 2002). This increased task difficulty might work against the calibration-enhancing effects of providing immediate error feedback. According to the hard-easy effect, calibration moves systematically from underconfidence to overconfidence as the difficulty of the task increases (
Lichtenstein et al., 1981). In an early study demonstrating the hard-easy effect,
Lichtenstein and Fischhoff (
1977) asked participants a variety of easy or difficult general knowledge (Experiments 3–5) and psychology (Experiment 4) questions and obtained participants’ probability estimates that their answers were correct. Participants were more overconfident on the hard questions than on the easy questions. This finding has been replicated in a variety of domains, including perceptual judgments (e.g.,
Baranski & Petrusic, 1994;
Petrusic & Baranski, 1997). Thus, the fact that the filler-control method presents examiners with a more challenging perceptual task than the standard method might undermine confidence–accuracy calibration by increasing examiner overconfidence in their judgments.
In two experiments, we examined confidence–accuracy calibration among mock forensic examiners who used either the filler-control method or the standard method to analyze forensic fingerprint evidence. Participants using the filler-control method compared a latent fingerprint to an evidence lineup consisting of four fingerprints—one suspect print and three filler prints—and received error feedback following match judgments on filler samples. Participants using the standard procedure compared a latent fingerprint to a single suspect fingerprint and never received error feedback. To preview, the results from Experiment 1 were consistent with the hard-easy effect in an undergraduate student sample: The filler-control method increased overconfidence and reduced confidence–accuracy calibration compared to the standard procedure. In Experiment 2, we sought to replicate these findings using a sample of forensic science students.
4. General Discussion
We tested the claim that the filler-control method reduces examiner overconfidence compared to the standard forensic analysis method (
Wells et al., 2013). Across two experiments involving undergraduate students (Experiment 1) and forensic science students (Experiment 2), we found evidence of the opposite effect: The filler-control method resulted in greater overconfidence and worse confidence–accuracy calibration compared to the standard method. Why did the filler-control method fail to curb mock examiners’ overconfidence? Theoretically, the capacity of the filler-control method to provide immediate error feedback should enable examiners to calibrate subjective confidence with objective accuracy over time (e.g.,
Arkes et al., 1987;
Haddara & Rahnev, 2022). Indeed, our experiments provided ample opportunity for mock examiners to learn from their mistakes: Mock examiners using the filler-control method made filler-match judgments in approximately half of the trials and thus received error feedback frequently. Yet, even in analyses examining confidence–accuracy calibration across trials, there was no evidence of improvement in calibration in the filler-control method compared to the standard method.
The filler-control method’s potential to curb overconfidence via the provision of error feedback might be counteracted by another feature of the method: It presents an objectively more difficult task than the standard method because the presence of fillers adds “noise,” thereby complicating signal detection (
Macmillan, 2002). Indeed, overall accuracy rates (hits and correct rejections) were lower in the filler-control method than in the standard method in both experiments. Our findings are therefore consistent with the hard-easy effect, the phenomenon in which overconfidence increases as task difficulty increases (
Lichtenstein & Fischhoff, 1977;
Lichtenstein et al., 1981). It may be that the increased difficulty of the filler-control method compared to the standard method overpowered any benefits afforded by the provision of immediate error feedback to mock examiners.
There are additional features that differentiate the filler-control method and the standard forensic analysis method and could drive differences in performance. For example, the two methods might promote different decision strategies among examiners. Preliminary research suggests that the filler-control method makes examiners more conservative in their judgments, presumably to avoid making a known false positive error (e.g.,
Kukucka et al., 2020). Kukucka and colleagues observed this effect only for inconclusive judgments, however; non-match judgments were less frequent in the filler-control method than in the standard method, as was the case in our data. As we discuss in the Limitations section, future research that examines confidence–accuracy calibration when examiners have the option to render inconclusive judgments will be valuable.
The filler-control method might also promote the use of relative judgments, in which examiners compare the evidence samples to one another to determine the best match. Longstanding theories of eyewitness identification posit that the simultaneous presentation of lineup members promotes the use of relative judgments compared to when lineup members are presented individually (e.g.,
Wells, 1984,
1993). However, recent research suggests that eyewitness decision-making from lineups may be better characterized by absolute-judgment models, in which eyewitnesses compare each lineup member individually to their memory (
Smith et al., 2024). Moreover, relative judgments may be limited in forensic decision-making because of the nature of forensic feature comparison. Unlike in eyewitness lineups, where eyewitnesses can often rely on rapid, holistic comparisons of lineup members’ faces, forensic examiners may need to more deliberatively compare features of each comparison sample to the crime scene sample, consistent with an absolute-judgment strategy. Nevertheless, future research could investigate the decision strategies examiners use when analyzing forensic evidence using the filler-control method versus the standard method.
4.1. The Incriminating-Exonerating Tradeoff of the Filler-Control Method
Across both of our experiments, the filler-control method was associated with inferior exonerating value but superior incriminating value compared to the standard method. This incriminating-exonerating tradeoff was evident in the diagnosticity ratios associated with mock examiners’ judgments, regression analyses of non-match judgment accuracy, and PPV analyses, and was confirmed by ROC analyses presented in the
Supplemental Materials.
Why was the filler-control method associated with superior incriminating value but inferior exculpatory value compared to the standard method? A primary benefit of the filler-control method appears to be its potential to protect innocent suspects. Surrounding the suspect’s sample with known-innocent filler samples resulted in a larger proportional reduction in innocent-suspect-match judgments than in guilty-suspect-match judgments. This reduction in suspect-match judgments was not driven by an increase in non-match judgments; it was driven by match judgments on fillers, consistent with a differential filler siphoning mechanism. Originally demonstrated in the context of eyewitness identification from police lineups, differential filler siphoning improves the diagnostic value of incriminating outcomes and enhances PPV (e.g.,
Smith et al., 2017;
Wells et al., 2015).
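The logic of differential filler siphoning can be sketched with a short Bayes' rule calculation. All rates below are invented for illustration and are not the rates observed in our experiments; the function name and the 50% base rate of matching suspects are likewise assumptions. The sketch shows that when fillers draw away proportionally more innocent-suspect matches than guilty-suspect matches, positive predictive value (PPV) rises.

```python
def ppv(guilty_match_rate, innocent_match_rate, base_rate=0.5):
    """Probability that the suspect's sample truly matches, given a suspect-match judgment.

    base_rate is the assumed prior probability that the suspect's sample
    truly matches the crime scene sample.
    """
    true_pos = base_rate * guilty_match_rate
    false_pos = (1 - base_rate) * innocent_match_rate
    return true_pos / (true_pos + false_pos)


# Hypothetical standard method: no fillers to absorb mistaken matches.
standard = ppv(guilty_match_rate=0.80, innocent_match_rate=0.20)

# Hypothetical filler-control method: fillers siphon 25% of guilty-suspect
# matches (0.80 -> 0.60) but 75% of innocent-suspect matches (0.20 -> 0.05).
filler_control = ppv(guilty_match_rate=0.60, innocent_match_rate=0.05)

print(f"PPV, standard method:       {standard:.3f}")
print(f"PPV, filler-control method: {filler_control:.3f}")
```

Under these assumed rates, fewer guilty suspects are matched in absolute terms, yet a suspect-match judgment is more trustworthy because innocent-suspect matches fall by a larger proportion.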
However, this benefit of the filler-control method came at a cost. Aside from impairing confidence–accuracy calibration, the filler-control method reduced non-match judgment accuracy compared to the standard method: fillers reduced accurate non-match judgments more than they reduced inaccurate non-match judgments, decreasing the diagnostic value of non-match judgments and impairing NPV. Although filler-match judgments themselves are diagnostic of innocence (i.e., they are more likely to occur in non-matching than matching trials), they were less diagnostic than were non-match judgments in the standard method. We speculate that the decreased diagnosticity of exonerating outcomes in the filler-control method (filler-match judgments and non-match judgments) compared to exonerating judgments in the standard method (non-match judgments) reflects the effects of task difficulty. By increasing the amount of noise present in the signal-detection task (e.g.,
Macmillan, 2002), fillers undermined decision-making accuracy overall (i.e., lower hits and correct rejections). This accuracy reduction was offset for incriminating outcomes—through a substantial decrease in innocent-suspect matches—but was not offset for exonerating outcomes.
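The cost on the exonerating side can be illustrated in the same way. The sketch below computes a diagnosticity ratio for non-match judgments (their rate in non-matching trials divided by their rate in matching trials) and the corresponding negative predictive value (NPV). The rates, the function names, and the 50% base rate are hypothetical and chosen only to mirror the pattern described above; they are not our data, and the exact computations used in our analyses may differ.

```python
def diagnosticity_ratio(rate_nonmatching_trials, rate_matching_trials):
    """How much more often a non-match judgment occurs when the suspect's sample truly does not match."""
    return rate_nonmatching_trials / rate_matching_trials


def npv(rate_nonmatching_trials, rate_matching_trials, base_rate_match=0.5):
    """Probability that the suspect's sample truly does not match, given a non-match judgment."""
    true_neg = (1 - base_rate_match) * rate_nonmatching_trials
    false_neg = base_rate_match * rate_matching_trials
    return true_neg / (true_neg + false_neg)


# Hypothetical non-match rates as (non-matching trials, matching trials).
# In the filler-control row, accurate non-matches fall sharply (0.70 -> 0.40)
# while inaccurate non-matches fall only slightly (0.15 -> 0.12).
rates = {
    "standard": (0.70, 0.15),
    "filler-control": (0.40, 0.12),
}

results = {
    method: (diagnosticity_ratio(*r), npv(*r)) for method, r in rates.items()
}
for method, (dr, value) in results.items():
    print(f"{method}: diagnosticity ratio = {dr:.2f}, NPV = {value:.3f}")
```

With these assumed rates, both the diagnosticity ratio and the NPV of a non-match judgment are lower in the filler-control method, because accurate non-matches are lost faster than inaccurate ones.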
There is yet another aspect of the filler-control method that might reduce its exculpatory value compared to the standard method. Specifically, the evidentiary value of examiners’ confidence ratings in exonerating outcomes may be lower in the filler-control method than in the standard method. In the standard method, an examiner’s confidence in an exculpatory outcome (i.e., a non-match judgment) always provides direct information about the degree to which the suspect’s sample matches the crime scene sample. In the filler-control method, however, an examiner’s confidence in an exculpatory outcome (i.e., a match decision on a filler sample or a non-match judgment) provides only indirect information about the degree to which the suspect’s sample matches the crime scene sample (e.g.,
Smith & Ayala, 2021). If the examiner renders a match judgment on a filler, their confidence scales the extent to which the filler sample matches the crime scene sample. And if the examiner makes a non-match judgment, the meaning of confidence is ambiguous: It might scale the average extent to which all of the comparison samples match the crime scene sample or the extent to which the closest-to-matching comparison sample matches the crime scene sample (e.g.,
Lindsay et al., 2013;
Weber & Brewer, 2006). Either way, confidence in a non-match judgment likewise does not directly scale the match between the suspect sample and the crime scene sample. As a result, the additional information gleaned from examiners’ confidence is less directly informative about the likely guilt of the suspect in the filler-control method than in the standard method.
4.2. Alternative Approaches to Forensic Feature Comparison
Our findings suggest that neither the filler-control method nor the standard forensic analysis method is an objectively superior method of analyzing forensic evidence. This observation underscores the need for a new approach that better maximizes both the incriminating and exonerating value of forensic evidence. The filler-control method offers multiple advantages that are worth preserving—namely, the ability to neutralize contextual bias, expose invalid techniques and fraudulent analysts, produce error rate estimates for a given technique, laboratory, and analyst, and provide error feedback to analysts (
Wells et al., 2013). Our data also demonstrate that the filler-control method can substantially reduce innocent-suspect-match judgments, suggesting that the method may be especially valuable in settings that run a high risk of false positive errors. To enhance its exonerating potential, the filler-control method could be refined to capture more nuanced information following exculpatory judgments. Borrowing again from innovations in the eyewitness-identification literature, several approaches could accomplish this. These include (1) a ranking procedure in which the lineup members are ranked in terms of their match to the crime scene sample (e.g.,
Carlson et al., 2019;
Tuttle et al., 2025); (2) a perceptual-scaling approach that measures the similarity between the crime scene sample and each comparison sample (e.g.,
Gepshtein et al., 2020); (3) a rule-out procedure in which confidence that each comparison sample does not match the crime scene sample is obtained following a filler-match or non-match judgment (e.g.,
Ayala et al., 2022;
Smith et al., 2023); and (4) a confidence-ratings procedure in which the examiner provides a confidence rating that each sample matches or does not match the crime scene sample (
Sauer et al., 2008,
2012). Future research could test these procedures against the filler-control method and the standard forensic analysis method to determine which might prove superior in forensic contexts.
It remains possible, however, that the costs associated with the filler-control method will ultimately outweigh its benefits. An alternative approach with considerable promise has been proposed by
Guyll and Madon (
2023). In their method, a liaison serves as an intermediary between investigators and forensic examiners, preparing the evidence for analysis. Specifically, the liaison obtains the crime scene sample and the suspect’s sample from investigators and provides them to the examiner along with an additional, case-matched mock evidence pair. In a case involving fingerprint evidence, for example, the liaison would assemble the latent print and suspect print pair from the case as well as a second pair consisting of a mock-latent print and a mock-suspect print that may or may not match one another. The forensic examiner would not know which pair is from the case nor the ground truth of the mock pair (i.e., whether they match). Hence, the examiner would produce two judgments: one for the case and a separate, independently verifiable judgment.
This approach has the potential to preserve the benefits of the filler-control method while addressing its limitations. First, it could reduce innocent-suspect matches, as examiners would know that a false positive on the mock pair would be detected. Second, it protects against contextual bias because even if examiners are exposed to contextual case information (which the liaison should often prevent), they would not know which pair the information applies to. Third, the method would expose invalid techniques and fraudulent analysts, who would make detectable errors on the mock evidence. Fourth, over time, this process would generate error rate estimates for the examiner, lab, and technique. Fifth, the approach allows for feedback on examiner performance, potentially facilitating an improvement in confidence–accuracy calibration over time. Moreover, unlike the filler-control method—where fillers increase task difficulty and undermine performance—the mock and case judgments are made independently, preserving performance advantages of the standard method. While this procedure has yet to undergo empirical testing, we consider it a promising and innovative path for future forensic practice.
4.3. Limitations
As an initial investigation of confidence–accuracy calibration in the forensic filler-control method, the current research had several limitations, the most notable of which concerns the generalizability of the samples. In Experiment 1, participants were undergraduate students with no prior training in forensic evidence analysis, a population that differs in meaningful ways from forensic professionals. Although we attempted to address this limitation by showing all participants a training video before they began the task and by recruiting forensic science students in Experiment 2, it would be ideal to replicate the current findings with a sample of experienced forensic examiners. It is possible, for example, that professional forensic examiners are less subject to the hard-easy effect than our novice student samples. If that were the case, the filler-control method could conceivably improve forensic professionals’ confidence–accuracy calibration.
A second set of limitations relates to characteristics of the measures used in our experiments. As mentioned in
Section 2.1.5, a programming error resulted in a difference in the confidence-scale instructions used in the filler-control and standard procedure conditions. For the filler-control method, participants were instructed to indicate their confidence on a scale ranging from “0 being not confident at all and 100 being very confident.” For the standard method, participants were instructed to indicate their confidence on a scale from “0% confident to 100% confident.” Critically, however, participants in both conditions used the same scale (i.e., a sliding scale ranging from 0 to 100 without verbal labels or percentages) to render their confidence judgments. Thus, although these slight differences in the wording of the instructions are not ideal, we believe that this aspect of our method likely had a trivial effect on our findings.
Another limitation of our measures is that we did not permit participants to render inconclusive judgments, which is a response option available to practicing forensic examiners (
Wells et al., 2013). We imposed this constraint to preserve statistical power for our confidence–accuracy calibration analyses, which exclude inconclusive judgments. Moreover, some have argued that the inconclusive response option should be abandoned in favor of an approach that collects information about the examiner’s decision criterion (e.g., the examiner’s confidence), as the filler-control method does (
Albright, 2022). Nonetheless, the widespread use of the inconclusive option in current forensic practice means that omitting it in our experiments reduces the ecological validity of our findings. For example, it is possible that the filler-control method’s ability to immediately expose errors encourages a more conservative response style, leading professionals to render inconclusive judgments more often (e.g.,
Kukucka et al., 2020; see also
Miller, 1987). Shifting examiners’ responses in this way could conceivably influence their confidence–accuracy calibration. To the extent that examiners render inconclusive judgments on judgment tasks they perceive as difficult (e.g.,
Albright, 2022), the availability of an inconclusive option could moderate the hard-easy effect, thereby reducing overconfidence. Thus, it will be important for future research to replicate these findings using a paradigm that includes an inconclusive option.