1. Introduction
Current practice in fire debris analysis within the United States results in reporting a categorical statement with the possibility of additional qualifying statements, as prescribed by the standard method ASTM E1618-14 [1]. These categorical statements result from subjective thresholds for rendering a judgement on the presence or absence of ignitable liquid residue in a sample. Previous research has led to the development of machine learning approaches and the direct calculation of likelihood ratios (LR) for observing the evidence (i.e., the total ion spectrum from a fire debris sample) under the competing hypotheses that a sample contains or does not contain ignitable liquid residue [2,3,4,5,6,7]. These calculations provide an easy and objective method for evaluating the evidentiary value of a fire debris sample, thereby obviating the need for making subjective categorical statements. Application of the approaches and results presented in this paper to other forensic problems is possible where training data is available for the relevant population under consideration.
When developing a method for calculating likelihood ratios, it is important to address the question of what constitutes a relevant population. This is important in both classification and identification problems [8,9]. In both problem types, the choice of a relevant population influences the calculation of the multivariate means as well as the variance and covariance used to calculate the likelihood ratios. Equation (1) presents a one-level model (single measurement for each sample) for the calculation of likelihood ratios with the assumption of multivariate normal between-object distributions.
In Equation (1), |S_i| and S_i⁻¹ (i = 1, 2) are the determinant and inverse of the covariance matrix estimated for class i using an available database. In this work, the two classes are samples containing ignitable liquid residue and samples that do not contain ignitable liquid residue, designated IL and SUB respectively. The term x in Equation (1) is the feature vector for the single measurement of the sample for which the LR is being calculated, and x̄_i (i = 1, 2) is the mean feature vector for the database samples from class i [8,9].
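As an illustration, the one-level model of Equation (1) reduces to a ratio of two multivariate normal densities evaluated at the feature vector x. The sketch below demonstrates this on toy two-dimensional data; the function and variable names are illustrative, not part of the published method.

```python
import numpy as np

def gaussian_log_density(x, mean, cov):
    """Log of the multivariate normal density N(x; mean, cov)."""
    p = len(mean)
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)  # Mahalanobis distance squared
    return -0.5 * (p * np.log(2 * np.pi) + logdet + maha)

def likelihood_ratio(x, mean_il, cov_il, mean_sub, cov_sub):
    """Equation (1)-style LR: ratio of class-conditional normal densities (IL vs. SUB)."""
    return np.exp(gaussian_log_density(x, mean_il, cov_il)
                  - gaussian_log_density(x, mean_sub, cov_sub))

# Toy example: well-separated two-dimensional "IL" and "SUB" training data
rng = np.random.default_rng(0)
il = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))
sub = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(200, 2))

lr = likelihood_ratio(np.array([2.0, 2.0]),
                      il.mean(axis=0), np.cov(il, rowvar=False),
                      sub.mean(axis=0), np.cov(sub, rowvar=False))
```

A feature vector near the IL mean yields LR > 1 (supporting the IL hypothesis), while one near the SUB mean yields LR < 1.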
Alternatively, the likelihood ratio can be calculated from Equation (2), which is a one-level model based on the assumption of multivariate between-object Gaussian kernel density distributions [8,9].
The determinant and inverse of the covariance matrix, |S_i| and S_i⁻¹ (i = 1, 2), are estimated for each class i, given an estimate of the relevant population. The term x is the feature vector for the single measurement of the sample for which the LR is being calculated, and x̄_i (i = 1, 2) is the mean feature vector for the database samples from class i. Equation (2) includes the optimal bandwidth parameter h_i (i = 1, 2) for the kernel functions used for each class. Calculation of the bandwidth is by Equation (3), where n_i is the number of samples of class i in the respective database and p is the number of variables in each p-variate feature vector. The kernel density estimate is more appropriate when the population distribution of the data is not normal [8,9].
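The kernel density alternative can be sketched as below, assuming Gaussian kernels whose covariance is the class covariance scaled by h_i², and a Silverman-style rule as one common form of the optimal bandwidth in Equation (3); the exact expression used in the cited references may differ, so treat this as an illustrative sketch rather than the published formula.

```python
import numpy as np

def silverman_bandwidth(n, p):
    """A common optimal-bandwidth rule for a p-variate Gaussian kernel
    estimated from n samples (one form of an Equation (3)-style expression)."""
    return (4.0 / ((p + 2) * n)) ** (1.0 / (p + 4))

def kde_log_density(x, data, cov, h):
    """Log of a Gaussian kernel density estimate at x, kernel covariance h^2 * cov."""
    n, p = data.shape
    kcov = (h ** 2) * cov
    _, logdet = np.linalg.slogdet(kcov)
    diffs = data - x
    maha = np.einsum('ij,ij->i', diffs @ np.linalg.inv(kcov), diffs)
    logk = -0.5 * (p * np.log(2 * np.pi) + logdet + maha)
    return np.logaddexp.reduce(logk) - np.log(n)  # log of the kernel average

def kde_likelihood_ratio(x, il, sub):
    """Equation (2)-style LR: ratio of kernel density estimates for IL and SUB."""
    p = il.shape[1]
    h1 = silverman_bandwidth(len(il), p)
    h2 = silverman_bandwidth(len(sub), p)
    num = kde_log_density(x, il, np.cov(il, rowvar=False), h1)
    den = kde_log_density(x, sub, np.cov(sub, rowvar=False), h2)
    return np.exp(num - den)

rng = np.random.default_rng(1)
il = rng.normal([1.5, 1.5], 1.0, size=(150, 2))    # toy "IL" data
sub = rng.normal([-1.5, -1.5], 1.0, size=(150, 2)) # toy "SUB" data
lr = kde_likelihood_ratio(np.array([1.5, 1.5]), il, sub)
```

Unlike the normal model, the kernel estimate adapts to non-Gaussian class distributions at the cost of the bandwidth choice.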
In a previous report [3], several chemometric methods were evaluated for calculating the evidentiary value of a fire debris sample. Support vector machine (SVM), linear and quadratic discriminant analysis (LDA and QDA, respectively) and k-nearest neighbors (kNN) were evaluated by cross validation on computationally generated training data and by assessing an independent set of data taken from large-scale test burns. The results showed that SVM and QDA performed better in cross validation than LDA and kNN, based on the area under the receiver operating characteristic (ROC) curves (abbreviated as AUC). The AUCs for SVM, QDA, LDA and kNN were 0.99, 0.98, 0.87 and 0.91, respectively. The equal error rates (EER) for the four methods had the reverse ordering relative to the AUC values (i.e., EER(SVM) < EER(QDA) < EER(LDA) < EER(kNN)). The LR produced from the SVM and LDA cross-validation results were better calibrated than the LR produced by QDA and kNN. In that work [3], calibration was not performed on the LR values following cross validation of each chemometric method. Testing the chemometric methods against the large-scale burn validation data gave AUC values of 0.83, 0.92 and 0.84 for the SVM, QDA and kNN methods, respectively. The AUCs for SVM, QDA and kNN showed the largest decreases, while the LDA AUC (0.87) showed no change. The interpretation of these results was that the computational method was producing training data that possibly was not representative of the large-scale burn data, resulting in poor validation performance of the chemometric models. An alternative explanation may reside in the different approaches used to assign the ground truth for the computationally generated data and the large-scale burn data.
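The AUC and EER figures of merit quoted above can be computed directly from decision scores and ground-truth labels. The following self-contained sketch uses a brute-force pairwise AUC estimate and a threshold sweep for the EER; it is an illustration, not the procedure used in the cited work.

```python
import numpy as np

def roc_auc_and_eer(scores, labels):
    """AUC of the ROC curve and equal error rate (EER) from decision
    scores and binary ground-truth labels (1 = IL, 0 = SUB)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # AUC as the probability a positive outscores a negative (ties get half credit)
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    auc = greater + 0.5 * ties
    # EER: threshold where false-positive rate ~ false-negative rate
    thresholds = np.sort(scores)
    best = min(thresholds, key=lambda t: abs((neg >= t).mean() - (pos < t).mean()))
    eer = 0.5 * ((neg >= best).mean() + (pos < best).mean())
    return auc, eer

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc, eer = roc_auc_and_eer(scores, labels)  # perfectly separated: AUC 1.0, EER 0.0
```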
In previous work [3], the computationally generated data was specified to have a ground truth IL class membership if the sample was generated by including ignitable liquid in the range of 0.01–1.0 fractional contribution to the total ion spectrum. Computationally generated samples assigned membership in the ground truth SUB class did not contain any ignitable liquid contribution. Class assignment (IL or SUB) was made for the large-scale burn samples by an “informed analyst” [3], who knew the identity and chromatographic characteristics of the ignitable liquid used in the burn. The analyst assigned the class based on the presence or absence of a recognizable pattern from the ignitable liquid used in the burn. While this approach may seem reliable, it is nonetheless different from the method used to assign the ground truth to the computationally generated samples. In the work presented here, class assignments for the large-scale burn samples are not considered as “ground truths” for the purpose of evaluating model performance. Instead, the known ground truth of the computationally generated data and an independent data set from 16 known ground truth samples are used to evaluate model performance. The LLR, or log₁₀(LR), of the large-scale burn data are calculated based on an optimal model derived from computationally generated data. This approach is more representative of casework, wherein the evidentiary value of a sample is sought without knowing the ground truth. This approach also affords the opportunity to assess the calculated evidentiary value for samples where the identity of the ignitable liquids and the sampling locations were known relative to the ignitable liquid pour.
2. Materials and Methods
The calculations in this work are based on a set of data that is computationally generated by mixing data from a database of ignitable liquids and data from a substrates database. The computationally generated data is intended to model fire debris data; however, unlike real fire debris data, the ground truth (presence or absence of ignitable liquid residue) is known for the computationally generated data. The relative amounts of ignitable liquids from each of the different ASTM E1618-14 defined classes, and the relative amount of substrate-only containing samples, are controlled in the computational mixing to generate data sets that represent different population distributions of ignitable liquid classes and substrates. The computational mixing process has been described in detail in previous publications and is summarized here for the benefit of the reader [2,3].
This work utilized 122 substrate pyrolysis samples from the Substrate Database [10], and 111 substrate pyrolysis samples that were not included in the previously computed fire debris models [3], bringing the total number of substrate samples to 233. The ignitable liquid samples used in the models included 445 unweathered and 243 weathered records from the Ignitable Liquids Reference Collection and Database (ILRC) [11], as previously reported [3]. The LR calculation is limited to Equation (1), where the covariance matrix for samples containing ignitable liquid is treated as different from (QDA) or the same as (LDA) the covariance matrix for samples that do not contain ignitable liquid. In previous work [3], the LR calculated by Equation (1) were not calibrated following cross validation. In the results reported here, the LR values are calibrated by isotonic regression, also known as the pooled adjacent violators method [12].
Samples designated “IL” were prepared by mixing the total ion spectrum (TIS) from a single ignitable liquid with the TIS of a random number (1 to 5) of substrates [13]. The TIS corresponds to the base-peak normalized electron ionization mass spectrum averaged across the chromatographic profile. The computational mixing has previously been described in detail [2,3,4,5]. A brief review of the computational mixing is given here.
2.1. Computational Fire Debris Data Preparation
Each of the j (j = 1–10,000) simulated fire debris TIS was prepared by mixing a random number i (i = 1–5) of TIS from substrate pyrolysis samples with the TIS of a single ignitable liquid. Each TIS is multiplied by a fractional contribution c_i, where the fractional contributions sum to one. The proportions of the summed substrate contributions and of the IL contribution were multiplied by a noise vector to add a maximum of 10% normally distributed noise to each component of the TIS. Each computationally generated TIS was normalized by dividing each nominal mass-to-charge ratio (m/z) intensity by the maximum value, as shown in Equation (5). This is the same process that was followed in previous work [2,3].
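The mixing procedure summarized above can be sketched as follows. The Dirichlet draw for the fractional contributions and the noise standard deviation are assumptions made for illustration, not the exact scheme of the cited work; the stand-in spectra are random vectors.

```python
import numpy as np

def simulate_fire_debris_tis(il_tis, substrate_tis, rng):
    """Sketch of the computational mixing: combine one ignitable-liquid TIS
    with 1-5 substrate TIS using fractional contributions summing to one,
    add up to ~10% normally distributed noise, then max-normalize."""
    n_sub = rng.integers(1, 6)  # random number of substrates, 1-5
    chosen = substrate_tis[rng.choice(len(substrate_tis), n_sub, replace=False)]
    components = np.vstack([il_tis, chosen])
    fractions = rng.dirichlet(np.ones(len(components)))  # c_i with sum(c_i) = 1 (assumed draw)
    mixed = fractions @ components
    mixed *= 1.0 + rng.normal(0.0, 0.10 / 3.0, size=mixed.shape)  # noise sigma is an assumption
    mixed = np.clip(mixed, 0.0, None)
    return mixed / mixed.max()  # Equation (5)-style normalization to the maximum m/z intensity

rng = np.random.default_rng(42)
il = rng.random(100)          # stand-in ignitable-liquid TIS (100 nominal m/z values)
subs = rng.random((20, 100))  # stand-in substrate pyrolysis TIS
tis = simulate_fire_debris_tis(il, subs, rng)
```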
Model data sets were prepared by controlling the proportion of each ignitable liquid class incorporated into each model (i.e., the fraction of the total number of TIS corresponding to each class). The standard method ASTM E1618-14 defines eight different classes of ignitable liquid [1]. These classes include the aromatic solvents (AR), gasoline (GAS), isoparaffinic solvents (ISO), naphthenic paraffinic solvents (NP), normal alkanes (NA), petroleum distillates (PD), oxygenates (OXY) and miscellaneous (MISC). Pyrolyzed substrates (SUB) are included as an additional class in this study. The model distribution data sets used for training and cross validation were prepared as described in the previous paragraph by a stratified random draw with replacement from each class of IL and SUB in accordance with the population distributions shown in Table 1. For example, a data set of 10,000 TIS corresponding to population A would contain 5000 samples comprised of a mixture of substrates (containing no IL) and 5000 samples, each containing a single IL from one of the IL classes mixed with up to five pyrolyzed substrate samples. Each IL class was represented in 630 TIS.
The pairwise similarities (s_ij) between distributions i and j are shown in Table 2 and calculated based on Equation (6), where d_ij is the Euclidean distance between distributions i and j, with the squared differences in fractional contributions summed over the k = 1–9 classes shown in Table 1.
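A distance-based similarity of this kind can be sketched as below. Since Equation (6) is not reproduced here, the specific mapping s = 1 − d/√2 is an assumption (√2 being the largest possible Euclidean distance between two distributions that each sum to one); the class fractions are hypothetical.

```python
import numpy as np

def distribution_similarity(f_i, f_j):
    """Similarity between two class-fraction distributions as a decreasing
    function of the Euclidean distance d_ij over the k = 1-9 class fractions.
    The mapping s = 1 - d/sqrt(2) is an illustrative assumption."""
    d = np.sqrt(np.sum((np.asarray(f_i) - np.asarray(f_j)) ** 2))
    return 1.0 - d / np.sqrt(2.0)

# Two hypothetical 9-class distributions (8 IL classes + SUB), each summing to 1
a = [0.5, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625, 0.0625]
b = [0.5, 0.125, 0.125, 0.0, 0.0, 0.0625, 0.0625, 0.0625, 0.0625]
s_ab = distribution_similarity(a, b)
```

Identical distributions give a similarity of exactly 1, and the similarity decreases as the class fractions diverge.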
2.2. Model Development and Cross Validation
The models used in this work are based on Equation (1), which requires calculation of the covariance matrices (S_1 and S_2) and the mean feature vectors (x̄_1 and x̄_2) for the IL (class 1) and SUB (class 2) samples from each data set based on the population distributions in Table 1. The feature vectors were comprised of the principal component scores calculated from the total ion spectra for each sample in a computational data set (sets A–F corresponding to the distributions in Table 1). Principal components analysis with mean centering of the data was used to remove collinearity among the ion intensities of different mass-to-charge (m/z) ratios in the total ion spectra. Dimension reduction was achieved by retaining the number of principal component scores required to account for 90% of the variance in the data. The values (S_1, S_2, x̄_1 and x̄_2) allow direct calculation of the likelihood ratio, using Equation (1), for a new sample with feature vector x, without optimization of any adjustable parameters. The feature vector x for a new sample is comprised of the scores obtained by projecting the total ion spectrum for the sample into the principal component space defined by the population.
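The dimension-reduction and projection steps can be sketched as follows, with PCA computed by singular value decomposition of the mean-centered data; function names and the toy data are illustrative.

```python
import numpy as np

def pca_90(X):
    """Mean-center X and return (mean, loadings), retaining enough principal
    components to account for 90% of the variance."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data: rows of Vt are the principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    k = int(np.searchsorted(np.cumsum(explained), 0.90)) + 1
    return mean, Vt[:k].T

def project(x, mean, loadings):
    """Project a new feature vector (e.g., a TIS) into the retained PC space."""
    return (x - mean) @ loadings

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 40)) @ rng.normal(size=(40, 40))  # correlated toy data
mean, loadings = pca_90(X)
scores = project(X[0], mean, loadings)  # scores for a "new" sample
```

The class means and covariance matrices for Equation (1) would then be computed from the score vectors of the training samples.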
Models developed for each distribution in Table 1 were cross validated by a 10-fold stratified approach. For each fold in the cross validation, the validation data was removed (10% of ground truth IL and 10% of ground truth SUB) and dimension reduction was performed on the remaining data (90% of the data set) by principal components analysis with mean centering. Retained principal components and associated scores accounted for 90% of the variance in the data, typically corresponding to roughly 30 factors. Two approaches were taken to calculate the covariance matrices required for Equation (1), resulting in two quantitatively different models. In one case, the covariance matrices (S_1 and S_2) were calculated separately for the ground truth IL and SUB samples from the training data. In the other case, the covariance matrices for the IL and SUB samples were assumed equal and the pooled covariance matrix was calculated. The former approach is equivalent to QDA and the latter is equivalent to LDA [3]. Equation (1) was used to calculate the likelihood ratios for the model data (90% of the data set) under both assumptions. The LR values were transformed to LLR, which were calibrated by isotonic regression using the pooled adjacent violators method as implemented in the R isotone package [12,14]. The calibrated LLR values were used to correct the LLR values predicted for the cross-validation data (10% of ground truth IL and 10% of ground truth SUB), as described in Section 2.3.
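The pooled adjacent violators step can be sketched in a few lines; this is a generic PAV implementation written for illustration, not the R isotone code used in the study.

```python
import numpy as np

def pav_calibrate(scores, labels):
    """Pool adjacent violators: fit an isotonic (non-decreasing) map from
    uncalibrated scores to empirical probabilities of the IL class."""
    order = np.argsort(scores)
    y = np.asarray(labels, float)[order]
    vals, wts = list(y), [1.0] * len(y)
    i = 0
    # Pool adjacent blocks until the fitted values are non-decreasing
    while i < len(vals) - 1:
        if vals[i] > vals[i + 1]:
            pooled = (vals[i] * wts[i] + vals[i + 1] * wts[i + 1]) / (wts[i] + wts[i + 1])
            wts[i] += wts[i + 1]
            vals[i] = pooled
            del vals[i + 1], wts[i + 1]
            i = max(i - 1, 0)
        else:
            i += 1
    # Expand pooled block values back to one fitted probability per sample
    fitted = np.repeat(vals, np.asarray(wts, int))
    out = np.empty_like(fitted)
    out[order] = fitted
    return out

scores = np.array([-2.0, -1.0, 0.5, 1.5, 3.0, 4.0])  # uncalibrated LLR-like scores
labels = np.array([0, 1, 0, 1, 1, 1])                # ground truth: 1 = IL, 0 = SUB
probs = pav_calibrate(scores, labels)
```

The fitted probabilities can then be mapped to calibrated LLR values, e.g., as log₁₀(p/(1 − p)) with clipping at the extremes to avoid infinities.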
2.3. Model Testing Across Data Distributions
Following model development and calibration, 10% stratified (IL and SUB) test data samples were drawn from each data set A–F. Each test data set was projected into the principal component space described in the previous paragraph and likelihood ratios were calculated from the resulting scores. The likelihood ratios calculated by Equation (1) made use of the mean vectors and covariance matrices generated during model development. Likelihood ratios were transformed to LLRs and calibrated based on the isotonic regression from cross-validation likelihood ratios, as discussed in the previous paragraph. The calibrated LLR values from each test data set and their associated ground truth class membership (IL or SUB) allowed for the generation of ROC curves and recording of the AUC for each curve. The sequence of model development, cross validation, calibration and testing was repeated 10 times for each distribution in Table 1. The modeling and testing process is diagrammed in Figure 1.
2.4. Model Testing Against Known Ground Truth-Simulated Casework Samples
The QDA model based on Distribution A, Table 1, was used to evaluate 16 samples with known ground-truth class membership (IL or SUB), which were developed in the laboratory to simulate casework-relevant samples. Solutions of ignitable liquids and substrate materials were prepared separately in CS₂ and then combined. Four ignitable liquids were evaporated 75% by volume. Ten microliters of each evaporated ignitable liquid was diluted with 500 µL of CS₂. Volatile pyrolysis products from eight substrate materials, burned individually for 2 min utilizing the modified destructive distillation method, were extracted onto carbon strips by heating each sample at 66 °C for 16 h. The carbon strips were each extracted into 500 µL of CS₂. Samples classified as containing no ignitable liquid (SUB) consisted of the eight burned substrate materials by themselves. The other eight fire debris samples, designated IL, contained an ignitable liquid with the pyrolysis products from one of the eight substrate materials. These samples were prepared by mixing a portion of the diluted ignitable liquids with the extract of the substrate material. Table 3 provides a description of the samples. Each sample was evaluated using the QDA model based on the optimal distribution.