*3.3. Statistical Characterization of the Assay*

We investigated the stability of the assay across independent screens and the potential influence of confounding experimental factors using a variance component analysis on the full data set, which included the controls as well as the marketed compounds.

The data reported in this study were recorded in 24 different screens over several years. Because data replication occurred at various levels (i.e., bevacizumab and KLH were measured in all screens, several compounds were measured repeatedly in some screens, and subsets of compounds were tested on all donors within each screen), we could estimate the variance contributions of the treatment, the donor, the treatment-donor interaction, and the screen. As a typical screen is run in 3–4 batches, with a given subset of donors tested against all compounds in each batch, we could also assess the batch effect nested within the screen. Over the course of the study, healthy donors who had given blood could visit the blood donation center again, so the derived cells were sometimes used in two (or more) screens (i.e., same donor, same treatment, but different screens).

The results of this analysis are summarized in Figure 3a. Treatment-related effects (the expected effect of a compound in the assay, here driven primarily by the large number of strong KLH responses) accounted for 54% of the total variance; in contrast, the contribution of purely experimental factors was quite small (screen-to-screen variability: 0.5%; batch-to-batch variability within a screen: 2.0%). The donor factor (i.e., a factor accounting for a generally higher or lower donor-specific IFN-γ release independently of the treatment) accounted for 6.9% of the total variability; a similar proportion of the variance (5.4%) was attributed to the donor-treatment interaction (i.e., a factor capturing a subject-specific response to a given treatment). A relatively high proportion of the total variance (23.6%) could not be readily attributed to the known experimental factors. This could be due, for example, to unavoidable technical variability in the assay protocol, or to unaccounted heterogeneity in the sample material collected from a given donor at different times. In general, it would be very difficult to single out these technical and biological sources of variance and to investigate their relative impact on assay reproducibility without very cumbersome additional quality control processes.

**Figure 3.** Variance and assay power. (**a**) Main factors contributing to assay variability, estimated by a variance component analysis. The fitted model is: log2(SI) ~ (Compound × DonorID) + Screen/Batch. Note the low relative impact of the assay batching variables (Screen, Batch within a screen) compared with the compound component. (**b**) DC:CD4+ T cell restimulation assay power curves for compound comparison. (**c**) Assay power curves showing the statistical power to detect a treatment effect by comparing a compound with a comparator treatment, for various donor cohort sizes. A one-sided paired within-study test (α = 0.05) was used.
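Written out term by term, the fitted formula above corresponds to the following mixed-model decomposition (the symbols are our own shorthand for the model terms, introduced here for clarity rather than taken from the original analysis):

$$
\log_2(\mathrm{SI}_{ijkl}) = \mu + C_i + D_j + (CD)_{ij} + S_k + B_{l(k)} + \varepsilon_{ijkl},
$$

with compound effect $C_i$, donor effect $D_j$, compound-donor interaction $(CD)_{ij}$, screen effect $S_k$, batch effect $B_{l(k)}$ nested within screen, and residual $\varepsilon_{ijkl}$. The percentages in Figure 3a are the shares $\sigma^2_x/\sigma^2_{\mathrm{total}}$ of the corresponding variance components, with the residual capturing the variance not attributable to the named factors.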

The breakdown of the SI readouts into individual variance components enables us to simulate data sets with specified effect sizes for hypothetical treatments. Hence, we can estimate the statistical power (i.e., the probability of detecting a true compound effect) over a wide range of conditions. For example, Figure 3b shows the resulting statistical power when comparing a compound SI fold-change response with that of a reference or comparator treatment; here, we distinguish the case in which both compounds of interest were assessed in the same screen from a comparison conducted across different screens. A major advantage of a within-screen comparison is that paired testing can be applied (i.e., using 'donor' as a covariate), which yields higher statistical power because the donor-to-donor variability is partially accounted for. This is, in our opinion, the recommended setting for a compound ranking study. Moreover, depending on the hypothesis of interest, some additional statistical power may be gained by using a one-sided testing approach. This is legitimate when only a higher (or lower) compound response relative to a reference treatment is of interest, which may in fact be the most relevant scenario. As a rule of thumb, we expect that SI differences of about 75% on a linear scale (i.e., an SI fold-change of 1.75, or 0.8 log2 units) can be detected with a statistical power of 80%, assuming one-sided testing within the same screen, α = 0.05, and n = 30 donors.
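As an illustration of this simulation-based approach, a minimal Monte Carlo sketch in Python is given below. Only the 0.8 log2-unit effect size, the one-sided paired test, α = 0.05, and n = 30 donors come from the text above; the variance parameters `sd_donor` and `sd_resid` are hypothetical placeholders (roughly calibrated so that the simulated power lands near the 80% rule of thumb), not the fitted components of Figure 3a.

```python
# Monte Carlo sketch: power of a one-sided paired t-test on log2(SI)
# for a within-screen compound-vs-comparator comparison.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)

def power_within_screen(effect_log2, n_donors, sd_donor=1.0, sd_resid=1.26,
                        n_sim=5000, alpha=0.05):
    """Estimate power by simulating paired log2(SI) responses."""
    hits = 0
    for _ in range(n_sim):
        donor = rng.normal(0.0, sd_donor, n_donors)        # donor-specific baseline
        ref = donor + rng.normal(0.0, sd_resid, n_donors)  # comparator response
        cmpd = donor + effect_log2 + rng.normal(0.0, sd_resid, n_donors)
        # Pairing on donor cancels the shared donor component,
        # which is why within-screen comparisons gain power.
        _, p = ttest_rel(cmpd, ref, alternative='greater')
        hits += p < alpha
    return hits / n_sim

print(power_within_screen(effect_log2=0.8, n_donors=30))  # ~0.8 with these placeholders
```

Repeating the call over a grid of `effect_log2` and `n_donors` values reproduces power curves of the kind shown in Figure 3b,c, although the published curves were derived from the fitted variance components rather than from these placeholder values.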

Statistical power is also a function of the sample size (here, the number of donors per screen); we next examined this dependency and its impact on the interpretation of our assay results (Figure 3c). We observe a considerable gain in statistical power for studies including up to 30 donors. Increasing the number of donors beyond this point yields noticeably smaller gains in statistical power at the cost of the considerable increase in effort and expense associated with larger experiments. In our experience, a standard study size of 30 donors per screen strikes the right balance from both the experimental and the statistical perspective. If enhanced statistical power is desired, we believe that reducing the residual assay variance through experimental protocol refinements could be a more promising approach than merely increasing the donor count.
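The diminishing returns with increasing donor count can also be seen in the standard normal approximation for the power of a one-sided paired test (a textbook formula given here for intuition; the curves in Figure 3c were obtained by simulation):

$$
\text{power} \approx \Phi\!\left(\frac{\delta\sqrt{n}}{\sigma_d} - z_{1-\alpha}\right),
$$

where $\delta$ is the true mean log2(SI) difference, $\sigma_d$ the standard deviation of the paired within-donor differences, $n$ the number of donors, $\Phi$ the standard normal distribution function, and $z_{1-\alpha}$ the critical value (1.645 for α = 0.05). Because $n$ enters only through $\sqrt{n}$, each additional donor yields a smaller power gain than the previous one as the curve approaches its plateau.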
