1. Introduction
Multispectral, multiplex immunofluorescence (mIF) assays are emerging tools for biomarker discovery. They facilitate not only the study of basic cell population densities in a tissue-sparing manner, but also co-expression analyses, quantification of marker intensities, and analysis of spatial relationships. Multiple studies performed in numerous tumour types have demonstrated the predictive and prognostic benefit of being able to spatially resolve immunoactive cell populations within the tumour microenvironment (TME) and relate these findings to clinical outcomes [1,2,3,4,5,6,7,8,9,10,11,12,13]. In a meta-analysis of different biomarker modalities, mIF assays have been shown to have higher predictive value than tumour mutational burden, IFN-γ gene signatures, and PD-L1 immunohistochemistry for predicting response to anti-PD-1-based therapies [14].
In the research setting, performing high-throughput processing and analysis of mIF samples could lead to faster biomarker discovery. In the clinical setting, rigorously validated mIF assays could enable individualized treatments with immune checkpoint inhibitors (ICI) for patients. Given the potential for mIF assays to be used in both research and clinical settings, it is imperative to ensure these assays are reproducible.
Currently, proof-of-principle studies [15] and guidelines [16] exist for demonstrating the reproducibility of the staining portion of mIF assays. There is still an unmet need for standardizing the microscopes themselves [17,18,19]. Here, we sought to extend reproducibility assessments to the multispectral microscopes necessary for scanning the mIF-stained tissue samples. Using three different microscopes housed at a single academic institution, we developed a relatively simple and deployable correction model capable of standardizing these multispectral microscopes to a single reference microscope.
2. Materials and Methods
Eight formalin-fixed paraffin-embedded (FFPE) advanced melanoma pathology specimens were obtained from the Johns Hopkins archives. Samples were de-identified and a 4 μm section was cut from each block. Automated mIF was performed as previously described [9], but the mIF panel was expanded to include CD3 and a pan-membrane stain (Table S1).
Briefly, samples were baked offline for 3 h at 65 °C, then loaded onto the Leica BOND RX automated research stainer (Leica Biosystems, Buffalo Grove, IL, USA). Samples were then baked online at 60 °C for 30 min, and residual paraffin was removed (Dewax, Leica, Deer Park, IL, USA). Initial antigen retrieval was performed using a pH 9 EDTA buffer (ER2, Leica) for 40 min at 100 °C. After initial blocking for endogenous peroxidases (BLOXALL, Vector Labs, Newark, CA, USA), non-specific antibody binding was blocked (Protein Block, Agilent, Santa Clara, CA, USA). Primary antibodies, polymers, and Opals were applied for Position 1 (Table S1), then antibody stripping was performed using a pH 6 sodium citrate buffer (ER1, Leica) for 20 min at 95 °C. This process was repeated for each position, after which slides were counterstained (Spectral DAPI, Akoya Biosciences, Marlborough, MA, USA) and wet mount coverslipped (Prolong Diamond, Invitrogen, Waltham, MA, USA). Prior to staining, the mIF panel was optimized to reduce cross-talk and/or bleed-through by performing primary, secondary, and fluorophore titrations to balance fluorophore intensities, as previously described [9,15]. All slides used were stained in the same batch so that batch-to-batch variations would not introduce additional artefacts.
The mIF-stained slides were scanned using PhenoImager HT (formerly known as Vectra Polaris) microscopes (Akoya Biosciences, Marlborough, MA, USA), which are automated multispectral microscopes capable of capturing fluorescent signals with wavelengths between 440 nm and 780 nm. These microscopes imaged samples by first passing light emitted from a multiband LED array through one of seven excitation filter cubes. This light illuminated areas of the tissue samples, stimulating the fluorophores and causing them to fluoresce. The emitted fluorescence was received by filter systems composed of seven static broadband filter cubes and 43 liquid crystal tunable narrowband filters. Light passing through each of the narrowband filters was captured by a CCD camera, forming a set of 43 monochromatic image planes. The 43-layer “raw” images were spectrally unmixed using the inForm software [20] (inForm® v2.4.8, Akoya Biosciences, Marlborough, MA, USA), which relies on libraries composed of pure spectra for each fluorophore, also captured on the PhenoImager HT microscopes. The spectral unmixing process transformed the raw images into 10 layers: one layer for each fluorophore, and an additional layer for autofluorescence. The 10 layers of the resulting “unmixed” images were analysed separately as measurements of individual marker expressions within the tissue samples [9,15].
Each slide was scanned twice on each of three different PhenoImager HT microscopes, for a total of six independent scans as summarized in Table S2. An independent scanning protocol was created for each microscope by auto-exposing on the brightest pixels for each broadband filter across the set of eight tissue samples. The broadband filters used to excite each fluorophore are listed in Table S1. Emission spectra for each fluorophore were captured across several broadband and narrowband filters, as shown in Figure S1. The microscope-dependent corrections discussed below were derived from, and applied to, the raw 43-layer multispectral images, so that a common library could be used to perform spectral unmixing on all data coming from different microscopes.
Tiling of the entire sample was achieved by acquiring 20% overlapping “high-power field” (HPF) image tiles, which were assembled into seamless whole-slide images as previously described [9]. On average, 5700 HPFs were acquired per round of scanning, totalling 34,725 HPFs across the entire dataset (Table S3). Each raw HPF image was stored as an array of unsigned 16-bit integers, with dimensions of 1872 × 1404 pixels and 43 layers. Each image layer contained the total brightness of each pixel in a specific, narrow range of light wavelengths. Image layers were grouped by the static broadband filters used to initially select wider ranges of wavelengths of light. The mIF narrow-band wavelengths contributing to each image layer and their corresponding broadband filters are plotted in Figure S1.
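As a rough illustration of the data volume implied by these dimensions, the following sketch estimates the uncompressed size of one raw HPF and of the full dataset. It assumes two bytes per value (unsigned 16-bit) and no compression or file-format overhead, so actual on-disk sizes may differ.

```python
# Back-of-the-envelope size estimate for the raw HPF images described above.
# Assumption: values are stored uncompressed, with no file-format overhead.
WIDTH, HEIGHT, LAYERS = 1872, 1404, 43  # pixels and image layers per HPF
BYTES_PER_VALUE = 2                     # unsigned 16-bit integers
TOTAL_HPFS = 34_725                     # HPFs across the entire dataset

bytes_per_hpf = WIDTH * HEIGHT * LAYERS * BYTES_PER_VALUE
total_bytes = bytes_per_hpf * TOTAL_HPFS

print(f"{bytes_per_hpf / 1e6:.1f} MB per raw HPF")
print(f"{total_bytes / 1e12:.2f} TB for all raw HPFs")
```

Under these assumptions each raw HPF occupies roughly 226 MB and the full dataset roughly 7.8 TB, which is one practical reason to reduce each scan to per-layer summary spectra before modelling.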
A binary image mask was generated for each raw HPF h, in which areas containing empty background or oversaturated pixels were set to 0 and areas showing well-imaged tissue were set to 1. Background pixels were determined using Otsu’s thresholding algorithm as implemented in OpenCV [21]; oversaturated pixels were masked out using hand-tuned, layer-dependent thresholds.
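The masking step can be sketched for a single image layer as follows. This is a minimal, dependency-free illustration rather than the pipeline used in the study: the study used the OpenCV implementation of Otsu's method, whereas here the threshold is computed in pure Python, and the function names, the 256-bin histogram, and the `saturation_cut` value are all hypothetical.

```python
def otsu_threshold(values, n_bins=256):
    """Return the threshold that maximizes between-class variance (Otsu's method)."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return lo
    width = (hi - lo) / n_bins
    hist = [0] * n_bins
    for v in values:
        hist[min(int((v - lo) / width), n_bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_bin, best_var = 0, -1.0
    w0 = sum0 = 0
    for i, h in enumerate(hist):
        w0 += h                      # pixels at or below bin i
        sum0 += i * h
        w1 = total - w0              # pixels above bin i
        if w0 == 0 or w1 == 0:
            continue
        mean0, mean1 = sum0 / w0, (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_bin = between, i
    return lo + (best_bin + 1) * width


def tissue_mask(layer, saturation_cut):
    """1 for pixels above the background threshold and below saturation, else 0."""
    # Saturated pixels are excluded before thresholding so they do not skew it.
    unsaturated = [v for row in layer for v in row if v < saturation_cut]
    if not unsaturated:
        return [[0 for _ in row] for row in layer]
    background_cut = otsu_threshold(unsaturated)
    return [[1 if background_cut < v < saturation_cut else 0 for v in row]
            for row in layer]


# Toy 2 x 4 layer: dim background, three tissue pixels, one saturated pixel.
mask = tissue_mask([[10, 12, 11, 900], [950, 65535, 9, 1000]], saturation_cut=60000)
print(mask)
```

In this toy layer only the three mid-range pixels survive; the dim background and the saturated pixel are both set to 0.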
The raw HPFs were normalized by their exposure times in each image layer to produce images with units of counts/ms. The “mean image” M for each scan of each sample was then calculated from these normalized images, describing the average flux of the tissue at each pixel in counts/ms.
These mean images were averaged over the two-dimensional pixel indices i and j in each image layer k to produce a set of spectra describing the average flux of the tissue in each image layer k observed for scan r of sample n on microscope m, where H and W are the height and width of each image.
The spectra were then normalized so that they impacted measurements relatively equally regardless of their individual brightnesses. First, new spectra were calculated by multiplying each spectrum by the fraction of the total sample coming from its own HPFs, and then these spectra were divided by their average over the samples, microscopes, scans, and raw image layers to produce the normalized spectra, which represent the relative tissue flux variations about one for each microscope, scan, and sample as a function of multispectral image layer.
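The final normalization step, dividing by the grand mean so that the spectra vary about one, can be sketched as below. This is an illustration only: the dictionary layout, the key order (microscope, scan, sample), and the toy three-layer values are assumptions, and the earlier exposure-time and tissue-fraction steps are not shown.

```python
from statistics import mean

# Hypothetical toy input: average tissue flux (counts/ms) per image layer,
# keyed by (microscope, scan, sample). Real spectra would have 43 layers.
flux = {
    (1, 1, "A"): [2.0, 4.0, 6.0],
    (1, 2, "A"): [2.2, 4.1, 6.3],
    (2, 1, "A"): [1.5, 3.0, 4.4],
    (2, 2, "A"): [1.6, 3.1, 4.5],
}


def normalize_about_one(spectra):
    """Divide every value by the grand mean over all spectra and layers."""
    grand_mean = mean([v for layers in spectra.values() for v in layers])
    return {key: [v / grand_mean for v in layers]
            for key, layers in spectra.items()}


relative = normalize_about_one(flux)
for key, layers in relative.items():
    print(key, [round(v, 3) for v in layers])
```

After this step a value of 1.10 in a given layer reads directly as "10% brighter than the dataset-wide average", which is what allows the standard deviations in the Results to be quoted as percentages.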
3. Results
The normalized spectra are pictured in Figure 1, showing an initial variation in overall illumination and relative spectral intensity characterized by a standard deviation of 29.85% on average over all image layers. We used these spectra to develop a method of accounting for those differences, independently modelling contributions from the individual tissue samples themselves and from the three different microscopes.
To simplify calculations, a final set of factors was calculated, representing the normalized average relative intensities of the samples per microscope per scan without any wavelength dependence.
We first used a simple calibration model applied to the entire set of samples, which showed the reduction in variation that was possible to achieve overall. From this form of the calibrations, we propose a source of the observed differences between the individual multispectral microscopes. We then modified the model slightly to factor out the contributions coming only from the differences in the microscopes, and used a bootstrapping procedure to show that those microscope correction factors could be expected to generalize to additional samples. Finally, we performed three different spectral unmixings on the raw image data to evaluate how standardizing images to a single microscope affects marker intensity measurements, rather than raw image fluxes.
3.1. Correcting the Entire Dataset
Our goal in correcting the entire dataset overall was to effect the greatest possible reduction in the variance shown in Figure 1. We first removed tissue sample- and microscope-dependent differences in the overall brightnesses of each set of images, and then accounted for differences in the relative spectral sensitivities exhibited by each tissue sample and each microscope. An overview of the method is shown in Figure 2.
We began by applying two corrections to the overall brightnesses, one as a function of the tissue sample and one as a function of the microscope. Applying these amplitude corrections resulted in the set of spectra pictured in Figure 3. These amplitude corrections alone reduced the standard deviation from 29.85% to 14.83% on average over all image layers.
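A minimal sketch of how such amplitude corrections might be computed is given below. It assumes each amplitude factor is the mean brightness of a group (all spectra from one sample, or from one microscope) taken over every layer, with the sample correction applied first; the exact definitions used in the study may differ, and all names and toy values are hypothetical.

```python
from statistics import mean

# Hypothetical toy data: relative spectra keyed by (microscope, scan, sample),
# built as sample brightness x microscope brightness x a shared spectral shape.
SAMPLE_AMP = {"A": 0.8, "B": 1.2}
SCOPE_AMP = {1: 0.9, 2: 1.0, 3: 1.1}
SHAPE = [0.5, 1.0, 1.5]  # three layers instead of 43, for brevity
spectra = {
    (m, r, n): [SAMPLE_AMP[n] * SCOPE_AMP[m] * c for c in SHAPE]
    for m in SCOPE_AMP for r in (1, 2) for n in SAMPLE_AMP
}


def divide_out_amplitude(spectra, axis):
    """Divide each spectrum by the mean brightness of its group.

    `axis` is the key position defining the group: 0 = microscope, 2 = sample.
    """
    groups = {}
    for key, layers in spectra.items():
        groups.setdefault(key[axis], []).extend(layers)
    factors = {label: mean(values) for label, values in groups.items()}
    corrected = {key: [v / factors[key[axis]] for v in layers]
                 for key, layers in spectra.items()}
    return corrected, factors


by_sample, sample_factors = divide_out_amplitude(spectra, axis=2)
by_scope, scope_factors = divide_out_amplitude(by_sample, axis=0)
print(sample_factors, scope_factors)
```

Because the toy data differ only in overall brightness, the two divisions leave every spectrum identical. In real data a wavelength-dependent residual remains, which is what the layer-wise corrections that follow address.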
We next modelled the effect of varying spectral sensitivities in the specific tissues mounted on each slide. The relative variations in the sample dimension as a function of image layer were calculated by averaging the spectra over all microscopes and scans, and were used to determine wavelength-dependent correction factors. The variations and correction factors are pictured in Figure S2.
Applying the tissue profile corrections gave the set of spectra pictured in Figure 4. The standard deviation of these spectra was 10.56% on average over all image layers.
The variations remaining in these spectra corresponded to the wavelength-dependent relative differences between the three microscopes. The microscope-relative variations and correction factors were calculated similarly to the tissue variation spectra and corrections. The variations and correction factors are shown in Figure S3.
Applying the microscope correction factors resulted in the set of spectra pictured in Figure 5. These spectra exhibited a standard deviation of 2.70% on average over all image layers.
The reduction in overall variation between all samples, microscopes, and scans is shown in Figure 6. The upper plot shows the standard deviation over all samples, microscopes, and scans as a function of the image layer at each stage of correction, and the lower plot shows the averages of these standard deviations over all image layers. An initial standard deviation of 29.85% was reduced to 2.70% after applying corrections accounting for differences between tissue samples and microscopes.
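The layer-wise corrections and the summary metric can be illustrated with the sketch below. It assumes each profile is the per-layer mean of a group's spectra and that a population standard deviation is used; neither detail is confirmed here, and the names and toy values are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical toy spectra keyed by (microscope, scan, sample): each is the
# product of a tissue profile and a microscope profile (3 layers, not 43).
TISSUE = {"A": [0.9, 1.1, 1.0], "B": [1.2, 0.8, 1.0]}
SCOPE = {1: [1.1, 0.9, 1.0], 2: [0.9, 1.1, 1.0]}
spectra = {
    (m, r, n): [t * u for t, u in zip(TISSUE[n], SCOPE[m])]
    for m in SCOPE for r in (1, 2) for n in TISSUE
}


def mean_layer_std_percent(spectra):
    """Std. dev. across all spectra in each layer, averaged over layers, in %."""
    rows = list(spectra.values())
    per_layer = [100 * pstdev([row[k] for row in rows])
                 for k in range(len(rows[0]))]
    return mean(per_layer)


def divide_out_profile(spectra, axis):
    """Divide each spectrum by the layer-wise mean profile of its group.

    `axis` is the key position defining the group: 0 = microscope, 2 = sample.
    """
    groups = {}
    for key, layers in spectra.items():
        groups.setdefault(key[axis], []).append(layers)
    profiles = {g: [mean(col) for col in zip(*rows)]
                for g, rows in groups.items()}
    return {key: [v / p for v, p in zip(layers, profiles[key[axis]])]
            for key, layers in spectra.items()}


initial = mean_layer_std_percent(spectra)
tissue_corrected = divide_out_profile(spectra, axis=2)
after_tissue = mean_layer_std_percent(tissue_corrected)
fully_corrected = divide_out_profile(tissue_corrected, axis=0)
final = mean_layer_std_percent(fully_corrected)
print(f"{initial:.2f}% -> {after_tissue:.2f}% -> {final:.2f}%")
```

In this idealized, noise-free toy case the final spread collapses to zero; the residual 2.70% reported above reflects variation that such a multiplicative sample-and-microscope model does not capture.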
3.2. Contributions to Microscope-Specific Correction Factors
The microscope correction factors can be averaged over all layers to calculate an overall scale for each microscope, which should be approximately equal to 1 since the amplitude corrections were already applied. Dividing out these overall scales and averaging over the layers belonging to each broadband filter group produced a set of broadband factors, which quantify the differences between microscopes that are attributable to inhomogeneities in those microscopes’ specific broadband filters. Lastly, the differences specific to the piezoelectrically tuned narrow-band filters were quantified by dividing out both the overall illumination and broadband filter contributions.
The overall, broadband, and narrow-band contributions to the total correction factors are pictured in Figure S4. It is clear that most of the differences in relative spectral sensitivities between microscopes can be attributed to inhomogeneities in the microscopes’ broadband filter cubes. Each microscope was manufactured with its own static set of broadband filter cubes, and it is expected that the materials used for those filter cubes may differ between instruments from the time of manufacture. Our data indicate that those differences can be as large as 20% with respect to the means of all three microscopes for any given broadband filter group.
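The three-way split described above (overall scale, broadband filter group, narrow-band residual) can be sketched as follows for a single microscope. The function name, the toy factors, and the grouping of six layers into two filter groups are hypothetical; the real data have 43 layers across seven broadband filters.

```python
from statistics import mean


def decompose(total, groups):
    """Split per-layer correction factors into three multiplicative parts.

    total:  per-layer correction factors for one microscope
    groups: lists of layer indices, one list per broadband filter
    """
    overall = mean(total)
    broadband = [mean([total[k] / overall for k in group]) for group in groups]
    narrowband = [0.0] * len(total)
    for group, b in zip(groups, broadband):
        for k in group:
            narrowband[k] = total[k] / (overall * b)
    return overall, broadband, narrowband


# Toy factors: the first filter group runs about 10% high, the second 10% low.
total = [1.10, 1.12, 1.08, 0.90, 0.92, 0.88]
groups = [[0, 1, 2], [3, 4, 5]]
overall, broadband, narrowband = decompose(total, groups)
print(round(overall, 3), [round(b, 3) for b in broadband])
print([round(v, 3) for v in narrowband])
```

By construction, the product of the three parts returns the original factor in every layer, so the decomposition attributes the variation to its sources without changing the correction that is applied.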
This same effect is also visible when calculating the overall covariance matrices of the spectra at each of the four stages of correction in the image layer dimension. These image layer-projected covariance matrices are shown in Figure S5. They show an overall reduction in the scale of the variance as successive corrections are applied, as well as a strong correlation between groups of layers imaged with the same broadband filter, and between image layers corresponding to similar narrow-band wavelengths, as pictured in Figure S1.
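A covariance matrix in the image layer dimension can be sketched as below, treating every (microscope, scan, sample) spectrum as one observation. The normalization by the number of observations (rather than one fewer) is an assumption, as are the function name and toy values.

```python
from statistics import mean


def layer_covariance(rows):
    """Covariance between image layers, with one spectrum per row."""
    n_layers = len(rows[0])
    means = [mean([row[k] for row in rows]) for k in range(n_layers)]
    return [
        [mean([(row[a] - means[a]) * (row[b] - means[b]) for row in rows])
         for b in range(n_layers)]
        for a in range(n_layers)
    ]


# Two toy spectra whose layers rise and fall together.
cov = layer_covariance([[1.0, 2.0], [3.0, 6.0]])
print(cov)
```

Large off-diagonal entries flag layers that vary together across observations, which is the pattern expected for layers sharing a broadband filter.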
3.3. Generalizing Microscope Correction Factors
Having determined a model for using a group of multiply imaged tissue samples to measure microscope-dependent correction factors, we next investigated how generalizable the procedure would be if applied to additional data that were not used to measure the corrections. To this end, we used a bootstrapping procedure to repeatedly calculate microscope-dependent correction factors using particular subsets of the eight tissue samples and then apply those corrections to orthogonal subsets of the tissue samples. This procedure is displayed in Figure 7.
The data used in this procedure were normalized as before and divided by the same tissue sample-dependent amplitude corrections to produce a new set of spectra. These spectra defined new tissue profile factors, analogous to those used in Section 3.1, to account for spectral variations attributable to differences between the tissue samples on each slide. Dividing by these factors produced a set of spectra in which any remaining variations were attributable to the different microscopes.
At each iteration s of the bootstrapping procedure, five “fit” samples were randomly chosen from the full set of eight, and the two sets of microscope-dependent correction factors were calculated using just those five samples. The correction factors were then applied back onto these five samples, and also to the three orthogonal “test” tissue samples. The post-correction standard deviations across the fit and test samples, plus all microscopes and scans, were calculated. This procedure was repeated 56 times, once for each independent choice of the five fit samples.
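The count of 56 iterations corresponds to the number of ways of choosing five samples from eight. A short sketch that enumerates every fit/test split is shown below; the sample labels are hypothetical placeholders.

```python
from itertools import combinations

samples = [f"sample_{i}" for i in range(1, 9)]  # hypothetical labels

# Every choice of five "fit" samples, paired with the three remaining "test" samples.
splits = [
    (fit, tuple(s for s in samples if s not in fit))
    for fit in combinations(samples, 5)
]

print(len(splits))
print(splits[0])
```

Each split can then be processed independently: correction factors are measured on the fit samples and the post-correction standard deviation is evaluated separately on the fit and test samples.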
Figure S6 shows the distribution of the correction factors calculated for each iteration; the microscope-dependent correction factors were all very similar regardless of the subset of samples used to calculate them. Figure 8 shows the standard deviations across all samples, microscopes, and scans of the original data, the tissue-homogenized spectra, and the fully corrected spectra for all fit/test sample subsets. The data points shown are the averages over all bootstrapping iterations, with error bars equal to the standard deviation.
Applying microscope-dependent corrections calculated using orthogonal subsets of samples reduced the standard deviation from 13.87% to 2.91 ± 0.03% on average over all image layers, comparable to the final standard deviation of 2.66 ± 0.01% observed when applying corrections back onto the subsets of samples used to calculate them. This shows that normalizing and homogenizing a set of tissue samples using the tissue sample-dependent correction factors reliably leaves only microscope-dependent variations present, and that corrections for those microscope variations can be reliably applied to new tissue samples from the same microscopes.
3.4. Impact to Measurements of Marker Expressions
Immunofluorescence microscopy is often used to measure the expression of multiple biomarkers simultaneously. The PD-1/PD-L1 immunofluorescence panel used to stain the tissue samples described in Section 2 contained stains targeting CD3, PD-L1, FoxP3, CD8, PD-1, CD163, and Sox10/S100 proteins, as well as a DAPI stain targeting nuclear DNA and a lab-developed combination (“pan-membrane”) stain targeting cellular membranes. The inForm Automated Image Analysis Software (inForm® v2.4.8, Akoya Biosciences, Marlborough, MA, USA) [20] was used to “unmix” the raw, 43-layer images into new, 10-layer images depicting the normalized expressions of each marker plus a layer for autofluorescence. We then quantified the effects of applying corrections for differences between microscopes on those measurements of marker expressions.
The spectral unmixing process depends on “library” slides as input, which provide measurements of individual marker responses and autofluorescence at different wavelength ranges. We investigated three different unmixing scenarios, depicted in Figure 9.
First, we unmixed the raw data using a single library whose slides were imaged on microscope 2. Then we performed a second unmixing using three different libraries whose slides were imaged on each of the three microscopes, where raw data were unmixed using the library from the microscope on which they were scanned. In the final scenario, we first applied factors to standardize all images to measurements from microscope 2, and then unmixed all of the corrected images using the single library from microscope 2.
The standardization factors applied were the means of the two sets of microscope-dependent correction factors shown in Figure S6, divided by the factors for microscope 2, so that microscope 2 data were left unaltered and data from microscopes 1 and 3 were standardized to that single reference. The standardization was performed by dividing each raw image by the product of the two factors, as in Equation (17).
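Standardizing to a reference microscope can be sketched as below. The sketch assumes one overall amplitude factor and one per-layer profile factor per microscope, combined as a product and expressed relative to the reference; the function names and toy numbers are hypothetical and are not the measured factors.

```python
def standardization_factors(amplitude, profile, reference):
    """Per-layer factors for each microscope, relative to the reference microscope."""
    ref = [amplitude[reference] * p for p in profile[reference]]
    return {
        m: [amplitude[m] * p / r for p, r in zip(profile[m], ref)]
        for m in amplitude
    }


def standardize(pixel_layers, factors):
    """Divide one pixel's per-layer fluxes by its microscope's factors."""
    return [value / f for value, f in zip(pixel_layers, factors)]


# Toy factors for three microscopes and two image layers.
amplitude = {1: 0.95, 2: 1.02, 3: 1.03}
profile = {1: [1.10, 0.90], 2: [1.00, 1.00], 3: [0.90, 1.10]}
factors = standardization_factors(amplitude, profile, reference=2)

print(factors[2])                                  # reference is left unaltered
print(standardize([100.0, 200.0], factors[1]))     # microscope 1 mapped to reference
```

Dividing by the reference's own factors is what guarantees that data from the reference microscope pass through unchanged, so a single library imaged on that microscope can be used for all unmixing.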
The three sets of unmixed images were multiplied by their binary image masks and their average brightnesses in each layer were calculated and normalized as in Equations (1)–(4) above, except that the number of image layers was 10 instead of 43. The two tissue sample-specific normalization factors were calculated as in Equations (6) and (14), respectively, and applied as in Equation (15). The resulting three sets of spectra, one for each unmixing method, are shown in Figure S7 (the autofluorescence layer is omitted). The standard deviations across all samples, microscopes, and scans of these spectra are shown in Figure 10, along with their averages over all but the autofluorescence layer.
The uncorrected images unmixed with the microscope 2 library showed a remaining variation characterized by an average standard deviation of 15.84%, slightly larger than that observed in the tissue sample-corrected spectra shown in Figure 6 and Figure 8. The uncorrected images unmixed with the individual microscope libraries had an average standard deviation of 8.49%, showing that using microscope-specific libraries in unmixing does compensate for some, but not all, systematic differences between samples imaged on different microscopes. The microscope-corrected images unmixed using the microscope 2 library showed an average standard deviation of 4.39%, slightly larger than the fully corrected spectra in Figure 6 and Figure 8.
The greatest reduction in the unmixed images’ microscope-specific differences was therefore observed by standardizing the fluxes of the raw images to a single reference microscope, and then unmixing all images using a library from that single reference microscope. The variations remaining in the unmixed images were slightly larger than those remaining in the raw images, likely due to the dimension reduction from 43 to 10 image layers that is inherent to the unmixing process.
4. Discussion
The use of immune checkpoint inhibitors (ICI) has completely changed the landscape of treatment for patients with advanced melanoma and other tumour types [22]. Two recently published clinical trials in treatment-naïve patients with advanced melanoma showed a five-year overall survival (OS)
% in patients treated with anti-PD-1 [23,24]. Additionally, patients treated with a combination of anti-PD-1 and anti-CTLA-4 showed an even higher median 6.5-year OS compared to patients treated with anti-PD-1 alone [25]. A pre-treatment biomarker to help predict which patients are more likely to respond to therapy is of great interest. Currently, there are no FDA-approved companion diagnostics to determine if patients with advanced melanoma should receive ICI [26]. Initially, a PD-L1 immunohistochemistry assay was approved as a complementary diagnostic, but this was ultimately rescinded after levels of PD-L1 expression did not correlate with OS [27]. More recently, a 6-plex mIF assay for predicting objective response, progression-free survival, and OS for patients with advanced melanoma receiving anti-PD-1-based ICI was developed [9].
There are many steps involved with creating a companion diagnostic assay including, but not limited to, demonstrating high intra- and inter-observer reproducibility of the assay [28,29,30]. This includes demonstrating little variability between multiple reagent lots, validating all instruments involved with performing the assay, and potentially creating a “locked-down” analysis algorithm for those assays requiring image analysis. In collaboration with several other groups, we have performed the initial steps for validating and determining the reproducibility of a 6-plex mIF assay by showing a strong inter- and intra-site concordance of both cell population densities and marker intensity measurements [15]. Some limitations of that study were that only the reproducibility of the mIF staining itself was tested, and that only regions of interest within the mIF-stained slides were scanned and analysed. Here, we expanded the scanned image to include the whole slide and standardized the multispectral microscopes used to acquire the imagery.
Through the serial scanning of eight advanced melanoma FFPE mIF-stained sections, we were able to characterize systematic differences between three PhenoImager HT microscopes and showed that these differences are largely due to inhomogeneities in the broadband filter cubes built into each microscope. We developed a simple correction model that shows measurements of microscope-specific correction factors are relatively agnostic to the specific samples used to measure them. Additional work may be needed to determine if these factors remain agnostic when scanning is performed on tissue from other tumour types, as there can be significant differences in staining patterns and background autofluorescence between tumour types. The proposed correction model factors out differences in tissue area across samples and differences between microscopes, making it possible to standardize image data from multiple microscopes to the mean of all microscopes or to a single reference microscope. By standardizing these data to a single reference microscope, we were able to reduce microscope-dependent flux variation in raw images by 79%, and in marker expressions measured in the spectrally unmixed images by 72%. Microscope-specific corrections of this form could allow for the harmonization of mIF assay results across institutions. With such harmonization, it may be possible to use a single set of software phenotyping projects across all samples, which is a prerequisite for the development of “locked-down” analysis algorithms. More work will be needed to measure and test corrections of this form for microscopes housed at different institutions, and any microscope-specific standardization procedures must remain independent of other standardization steps performed to ensure reproducibility of marker panels or other aspects of imaging. Our group is also developing a method to standardize image data from slides stained in multiple batches, which will be the subject of a forthcoming publication.
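The 79% and 72% reductions quoted above appear consistent with the average standard deviations reported in Sections 3.3 and 3.4. The short check below assumes they compare the tissue-corrected values (13.87% for raw images, 15.84% for unmixed images) with the corresponding reference-standardized values (2.91% and 4.39%); that pairing is an inference from the reported numbers rather than a stated definition.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100 * (before - after) / before


# Average standard deviations (in %) taken from Sections 3.3 and 3.4.
raw = percent_reduction(13.87, 2.91)      # raw-image spectra, test samples
unmixed = percent_reduction(15.84, 4.39)  # unmixed marker layers

print(f"raw images: {raw:.1f}% reduction")
print(f"unmixed images: {unmixed:.1f}% reduction")
```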
These investigations imply a procedure to allow high-throughput mIF imaging using more than one PhenoImager HT microscope. For example, if a large number of slides are obtained all at once for imaging, several slides can be reserved for imaging on all available microscopes to determine microscope-dependent correction factors, and the rest can be imaged on only one microscope. HPFs from microscopes other than the chosen reference microscope can be corrected by the measured standardization factors, and then unmixed using only one library imaged on the reference microscope. The results presented here are only applicable to the specific PhenoImager HT microscopes at a single academic institution. It is expected that other PhenoImager HT systems would exhibit comparable differences due to their own static broadband filter cubes, and that the same method of measuring and applying corrections before spectral unmixing using multiple-imaged samples of a single tissue type would be a reasonable method for quantifying those differences as realized within that tissue type. Additional factors would need to be considered in developing correction models for multispectral image data from other systems.