1. Introduction
Over the past decades, many studies have promoted the use of vibrational spectroscopy as a diagnostic tool in clinical practice, as a complementary technique that supports the results of conventional histological and cytological analysis [1,2,3,4,5,6]. The two main vibrational techniques are Raman and Fourier Transform Infrared (FTIR) spectroscopy, both of which provide information about the different types of functional groups and their relative content inside the investigated cell or tissue samples [7,8]. In both cases the sample is excited by a radiation beam, and the spectral intensity of the radiation inelastically scattered (Raman) or absorbed (FTIR) by different biological macromolecules (nucleic acids, proteins, lipids, etc.) is detected. However, vibrational spectra measured from normal and pathological samples are often similar to each other, because the spectral features related to specific cellular components change only slightly as a result of chemical–physical stress or the onset of pathology. Visual inspection of the measured spectrum is therefore, in most cases, not enough to make a reliable diagnosis.
This problem can be addressed by collecting many spectra from the investigated samples and analysing them with multivariate statistical methods. Multivariate analysis of the measured spectra is a critical step for data interpretation and for providing a reliable diagnostic result [9]. Multivariate techniques have proved to be an efficient tool for extracting information from large datasets of spectral measurements, each consisting of many variables, namely the scattered (Raman) or absorbed (FTIR) radiation intensity at hundreds of wavenumber values [10]. These methods make it possible to visualize similarities and differences in the data and to build classification models that can be used to predict the class of unknown samples of the same type; consequently, they are very promising as diagnostic tools.
Multivariate analysis techniques can be divided into unsupervised and supervised methods. The former aim to detect similarities and differences inside a dataset comprising spectra of different classes when no information is available regarding the class to which each spectrum belongs. Principal Component Analysis (PCA) is the most popular unsupervised method [11]. In contrast, supervised methods require the classes to be differentiated to be labelled in advance. They are based on two successive steps: first, samples whose class is known are used to build a model whose parameters optimize the discrimination between the data from different classes; then, unknown samples are assigned to a class using the parameters optimized during the first step. Linear Discriminant Analysis (LDA) and Partial Least Squares Discriminant Analysis (PLS-DA) are effective supervised methods [11].
Principal Component Analysis (PCA) is one of the most powerful multivariate techniques for exploratory data analysis, i.e., it provides a preliminary way to find differences and similarities among data. In spectroscopic investigations, the data are the spectra measured from different samples. All the measured spectra can be represented as a dataset or matrix X, with n rows, corresponding to the measured samples, and m columns, each corresponding to the spectral signal at a specific wavenumber value. The first aim of PCA is to reduce the dimensionality of large datasets (those including all the values of the spectral variables in a wide wavenumber range for all the measured samples). The dimensionality reduction is performed by finding new variables that are linear functions of those of the original dataset, that successively maximize variance and that are uncorrelated with each other [12]. Briefly, PCA transforms the m original variables, consisting of the signal values at the m wavenumber values, into a new set of m variables, called principal components (PCs), each of which is a linear combination of the m original variables. Each original spectrum takes specific values in the set of PCs: such values are called scores. The first PC is chosen so that its scores carry the largest share of the variance, and each subsequent PC carries less variance. A score plot, reporting the score values of two different PCs for all the n samples, allows differences and similarities among the n samples to be visualized, based on the original spectral characteristics. The coefficients describing the influence of the original variables on the score values for a given PC are known as loadings: they give information about the wavenumber values at which the spectral signals contribute most to the variability inside the dataset, corresponding to changes in the molecular components contributing to the spectra. In fact, many works have reported the discrimination of vibrational spectra from biological samples of different types through PCA score plots, as well as the identification of differences in their biochemical content through the main spectral features of the loading plots [13,14,15,16].
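As a minimal sketch of the decomposition described above, PCA can be computed from the singular value decomposition of the mean-centred data matrix. The function name `pca`, the random stand-in matrix and its size are illustrative assumptions; this is not the software or the data used in the present work.

```python
import numpy as np

def pca(X, n_components):
    """PCA of an (n samples x m wavenumbers) matrix via SVD of the centred data."""
    Xc = X - X.mean(axis=0)                            # centre each spectral variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]    # coordinates of each sample on the PCs
    loadings = Vt[:n_components]                       # one row per PC: weights of the m variables
    explained = (s**2 / np.sum(s**2))[:n_components]   # fraction of total variance per PC
    return scores, loadings, explained

# Hypothetical stand-in for a spectral dataset (random numbers, not real spectra)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))
scores, loadings, explained = pca(X, n_components=3)
```

Plotting two columns of `scores` against each other gives a score plot, while a row of `loadings` plotted against the wavenumber axis gives the corresponding loading plot.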
LDA is a supervised classification method that can be used to classify objects (such as spectra measured from unknown samples) as belonging to classes that have been specified before the model is created [11]. In particular, LDA is based on a linear transformation of the m variables describing n samples belonging to different classes, so that samples from the same class are close together whereas samples from different classes are far apart. This goal is achieved by means of a classification algorithm (based on a Mahalanobis distance calculation between the samples for each class) that maximizes the distance between the class means while minimizing the variance within each class. A predicted class is thus assigned to each sample. Once the classification model has been built, it is used for allocating new, unknown samples to the most probable class. However, the LDA method cannot be applied when the number of spectral variables is larger than the number of samples (m > n) [17]. This issue can be solved by performing PCA on the spectral data prior to LDA and applying LDA to the PCA scores: this is how the PCA-LDA algorithm works.
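The two-step idea can be illustrated with a short sketch, assuming two classes with equal priors and a pooled within-class covariance (in which case the linear rule below is equivalent to assigning each sample to the class mean that is nearest in Mahalanobis distance). The function names, the synthetic data and the choice of three PCs are hypothetical and are not taken from the present study.

```python
import numpy as np

def fit_pca_lda(X, y, n_pcs):
    """PCA on the calibration spectra, then two-class Fisher LDA on the PC scores."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_pcs].T                               # loadings, shape (m, n_pcs)
    T = (X - mean) @ P                             # scores, shape (n, n_pcs)
    T0, T1 = T[y == 0], T[y == 1]
    mu0, mu1 = T0.mean(axis=0), T1.mean(axis=0)
    # Pooled within-class covariance of the scores: only n_pcs x n_pcs,
    # so it can be inverted even when the original data have m > n.
    Sw = ((T0 - mu0).T @ (T0 - mu0) + (T1 - mu1).T @ (T1 - mu1)) / (len(y) - 2)
    w = np.linalg.solve(Sw, mu1 - mu0)             # discriminant direction
    b = -0.5 * w @ (mu0 + mu1)                     # boundary midway between the class means
    return mean, P, w, b

def predict_pca_lda(model, X):
    mean, P, w, b = model
    return (((X - mean) @ P) @ w + b > 0).astype(int)

# Hypothetical data: 200 spectral variables, far more than the number of samples
rng = np.random.default_rng(0)

def make_spectra(n_per_class):
    X = 0.1 * rng.normal(size=(2 * n_per_class, 200))
    y = np.repeat([0, 1], n_per_class)
    X[y == 1, 50:60] += 1.0                        # class 1 has one stronger band
    return X, y

X_cal, y_cal = make_spectra(15)                    # calibration set
X_test, y_test = make_spectra(5)                   # independent test set
model = fit_pca_lda(X_cal, y_cal, n_pcs=3)
accuracy = np.mean(predict_pca_lda(model, X_test) == y_test)
```

Applying LDA directly to `X_cal` would require inverting a 200 × 200 covariance matrix estimated from 30 samples, which is singular; working on a few PC scores avoids this.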
N. Iturrioz-Rodríguez et al. showed that Raman spectroscopy can identify changes in the molecular composition of healthy astrocytes compared to glioblastoma patient-derived cells and that, combined with PCA-LDA, it can become a diagnostic tool with accuracy values between 80% and 100% [18]. PCA-LDA was also used to classify oral squamous cell carcinoma cells in saliva samples with an accuracy of 90% for the Raman dataset and 82% for the FTIR one [19]. The PCA-LDA model has also been shown to yield 100% classification accuracy in differentiating FTIR spectra from hepatitis C infected and healthy freeze-dried sera samples [20]. A PCA-LDA discrimination model also correctly classified FTIR spectra from gynaecological cancer samples into malignant and benign groups, with accuracies of 96% and 93% for the k-fold and “leave one out” validation schemes, respectively [21].
PLS-DA is a supervised classification technique that combines partial least squares (PLS) regression with LDA. First, a PLS model is built from the spectral data (X matrix) and a dependent variable describing the class of the spectra (Y matrix), so that Y = XB + E, where B is a matrix of regression coefficients and E is a matrix of residuals. The X matrix consists of n rows, each related to a sample, and m columns, each related to the signal intensity at one wavenumber value, whereas the Y matrix consists of n rows, each containing a categorical variable that specifies the type of sample (samples of different types are described by different discrete numbers that encode the class membership, such as −1 and +1). Such a PLS model, which relates the variations of the spectral data to the class of cells from which the spectra were measured, first transforms the original spectral variables into a small set of latent variables (LVs), called factors. These new variables are then used for regression with the dependent variable [11,22]. In PLS models, scores and loadings specify how the samples and variables are projected along the factors. In particular, PLS scores, similarly to PCA scores, are the sample coordinates along the model components: they are computed in such a way that they capture the part of the structure in X that is most predictive for Y. The PLS loadings specify how much each X-variable contributes to a specific model component, in the same way as the PCA loadings do. A two-dimensional scatter plot of the scores for two specified factors gives information about patterns in the samples: the closer the samples are in the score plot, the more similar they are with respect to the two components concerned, whereas samples that are distant in the score plot are different. The corresponding loading plot provides information about which variables are responsible for the differences between samples. In addition, the regression coefficients determine the weight of each variable when predicting a particular Y response, i.e., variables with a large regression coefficient play an important role in the regression model. A positive coefficient indicates a positive link with the response, and a negative coefficient indicates a negative link. The difference between loadings and regression coefficients is that the former are related to each LV, whereas the latter refer to a model with a specific number of LVs. After the PLS regression model has been built, a linear discriminant classifier is used for classifying unknown samples (spectra). When −1 and +1 are the encoded values of class membership, an object is considered a member of the class coded +1 if its predicted value is above 0; otherwise, it is considered a stranger to that class.
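The regression step and the zero threshold described above can be sketched as follows, assuming a single response column coded −1/+1 and the NIPALS form of PLS1. This is only one of several equivalent PLS algorithms, and it is not necessarily the one implemented in the software used here; the function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_pls1(X, y, n_lv):
    """PLS1 regression (NIPALS) of a -1/+1 coded class vector y on the spectra X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean                  # centred X and y, deflated at each step
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)                     # weight vector of this latent variable
        t = E @ w                                  # scores of the samples on this LV
        tt = t @ t
        p = E.T @ t / tt                           # X-loadings
        c = f @ t / tt                             # regression of y on the scores
        E = E - np.outer(t, p)                     # remove what this LV explains
        f = f - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)            # regression coefficients for n_lv LVs
    return x_mean, y_mean, B

def predict_class(model, X):
    x_mean, y_mean, B = model
    y_hat = (X - x_mean) @ B + y_mean
    return np.where(y_hat > 0, 1, -1)              # predicted value above 0 -> class +1

# Hypothetical data: two classes differing in the intensity of one band
rng = np.random.default_rng(0)

def make_spectra(n_per_class):
    X = 0.1 * rng.normal(size=(2 * n_per_class, 200))
    y = np.repeat([-1.0, 1.0], n_per_class)
    X[y == 1, 50:60] += 1.0
    return X, y

X_cal, y_cal = make_spectra(15)
X_test, y_test = make_spectra(5)
model = fit_pls1(X_cal, y_cal, n_lv=2)
accuracy = np.mean(predict_class(model, X_test) == y_test)
```

The vector `B` is the set of regression coefficients for a model with the chosen number of LVs, so plotting it against the wavenumber axis shows which spectral variables drive the prediction.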
Recently, PLS regression algorithms have been widely used in the biomedical field to construct predictive models based on Raman and FTIR spectral signals. R. Pinto Aguiar et al. used PLS-DA to classify Raman spectra from brain tissue into normal (cerebellum and meninges) and tumour (glioblastoma, medulloblastoma, schwannoma, meningioma) with an accuracy of 94.1% [23]. Raman spectroscopy was also used by D. Cullen et al. to investigate lymphocytes of patients with late radiation toxicity following radiotherapy for prostate cancer: the PLS-DA model developed to classify patients using known radiation toxicity scores achieved an accuracy of 93% [24]. X. Yang et al. reported PLS-DA results for first-derivative FTIR data in the nucleic acid spectral range, collected from serum samples of patients with lung cancer and healthy people: they achieved 87.10% accuracy in discriminating the two types of samples [25]. FTIR and PLS-DA were also used to develop a prediction model based on the spectra of blood serum samples collected from healthy people and patients affected by attention deficit hyperactivity disorder (ADHD): the model was able to distinguish ADHD patients from healthy individuals with an accuracy of 100% [26].
Overall, these two classification models can perform differently when applied to the same dataset for diagnostic purposes. Therefore, it is important to compare the predictions of both models in order to optimize the diagnostic phase. Hence, the aim of the present study is to evaluate and compare the performance parameters of the PCA-LDA and PLS-DA models for the class prediction of three different datasets of vibrational spectra: a simulated dataset and two experimental datasets of Raman and FTIR spectra, respectively. The comparison is performed by evaluating the accuracy, sensitivity and specificity obtained in the class prediction for a subset of each of the three datasets, used as a test set. The results show that both classification models were able to predict the class of the different spectra with high values of accuracy (93–100%), sensitivity (86–100%) and specificity (90–100%). So, if datasets of different types of spectra are available, the application of both classification models to the prediction of the class of unknown measured spectra is promising as a reliable complementary diagnostic tool in the clinical setting.
3. Results and Discussion
The normalized simulated spectra of the control-like and exposed-like types were independently averaged in order to obtain mean spectra, which are shown in Figure 1a,b, respectively. As expected, the difference between the control and exposed mean spectra, shown in Figure 1c, is characterized by large positive peaks centred at 785, 830, 1090 and 1580 cm⁻¹, corresponding to the spectral peaks whose intensity was decreased with respect to the basic spectrum in order to simulate damage in the exposed-like spectra. The other positive and negative peaks in Figure 1c are related to random intensity differences that affect the mean spectra; they are particularly relevant for the most intense features, such as those centred at 1450 cm⁻¹ and 1660 cm⁻¹ and those in the 1200–1400 cm⁻¹ spectral range.
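The origin of the four positive peaks in the difference spectrum can be reproduced with a toy simulation. Everything numerical below is an assumption made for illustration (Gaussian band shape and width, peak heights, a 30% intensity decrease, additive noise level, 30 spectra per class), and the normalization step is omitted because its details are not given in this section; only the four band positions come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
wn = np.arange(600.0, 1801.0)                      # wavenumber axis in cm^-1 (assumed range)

# Hypothetical basic spectrum: a few Gaussian bands with arbitrary heights
centres = np.array([785.0, 830.0, 1003.0, 1090.0, 1450.0, 1580.0, 1660.0])
heights = np.array([0.6, 0.4, 0.5, 0.7, 1.0, 0.5, 1.0])
damaged = np.isin(centres, [785.0, 830.0, 1090.0, 1580.0])

def spectrum(h, width=12.0):
    """Sum of Gaussian bands with heights h at the fixed centres."""
    bands = h[:, None] * np.exp(-0.5 * ((wn - centres[:, None]) / width) ** 2)
    return bands.sum(axis=0)

def simulate(h, n, noise=0.01):
    """n noisy copies of the spectrum defined by the heights h."""
    return spectrum(h) + rng.normal(scale=noise, size=(n, wn.size))

control = simulate(heights, 30)
exposed = simulate(np.where(damaged, 0.7 * heights, heights), 30)  # assumed 30% decrease
difference = control.mean(axis=0) - exposed.mean(axis=0)
```

In this sketch `difference` is clearly positive at the four modified bands and fluctuates around zero elsewhere, where only the random noise contributes.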
Similarly, the normalized mean Raman spectra of unexposed and proton-exposed MCF10A cells are plotted in Figure 2a,b, respectively. These spectra are characterized by Raman peaks and spectral features related to the main cellular components, such as nucleic acids (784, 1096, 1340, 1373, 1490 and 1578 cm⁻¹), proteins (1003, 1032, 1128, 1207, 1260, 1340, 1615 and 1662 cm⁻¹) and lipids (1128, 1300 and 1440 cm⁻¹) [29]. The positive peaks in the difference spectrum in Figure 2c suggest that the main effect of radiation exposure on the Raman spectra is a relative decrease of the nucleic acid components, as a consequence of larger exposure damage to DNA/RNA than to protein and lipid components [27]. A significant intensity decrease of the Raman peaks related to the phosphodiester bond (at 784 cm⁻¹) and to DNA base ring modes (at 1574 cm⁻¹) was also reported by Synytsya et al. in proton-irradiated calf thymus DNA [30]. Modification of the above Raman peaks related to nucleic acids was also reported by K. Sofinska et al. for cellular samples exposed to different types of ionizing radiation, such as protons and γ-rays [31].
Furthermore, the infrared absorption spectra of MCF7 and MDA cells, shown in Figure 3a,b, respectively, are characterized by spectral peaks and bands related to nucleic acids (1082, 1117 and 1227 cm⁻¹), proteins (1165, 1306, 1390, 1448, 1535 and 1637 cm⁻¹) and lipids (1390, 1448 and 1736 cm⁻¹). The difference spectrum in Figure 3c indicates that the two types of cells can be biochemically discriminated according to the larger relative nucleic acid content of MCF7 cells with respect to MDA ones, as suggested by the spectral peaks at 1082 and 1227 cm⁻¹, whereas the peaks at about 1535 and 1637 cm⁻¹ are due to a relative spectral shift of the amide II and amide I bands between the two types of cell [28]. Both results are in good agreement with those reported in the literature. In particular, both Talari et al. [32] and Abramczyk et al. [33] found a larger relative amount of nucleic acids in MCF7 cells with respect to MDA cells. As for the shifts of the amide I and II bands, they might be connected with changes in the secondary protein structures occurring during the process of carcinogenesis [28].
The difference plots indicate that, in all three cases, important intensity differences between the mean spectra of the two types of cells characterize the investigated range. To assess whether these spectral differences are sufficient to discriminate and classify the two types of cells, we first analysed the data by means of PCA. In particular, samples from the calibration sets were analysed by PCA, using a full cross-validation method. Although PCA cannot provide a classification, it is widely used for data interpretation and visualization, as well as for dimensionality reduction, by extracting information from high-dimensional data and projecting it into a lower-dimensional space. In particular, PCA score plots visualize the similarities and differences between samples, and PCA loading plots provide information about the spectral variables responsible for the differences.
Figure 4a shows the PC1/PC4 score plot for the simulated dataset, with the percentage of variance explained by each PC reported on the axes. The first 7 principal components were verified to carry around 99% of all the spectral variation found in the dataset. As is visible in the score plot, PC4 provides the main contribution to the discrimination of control-like spectra from exposed-like ones. In particular, control-like spectra have negative PC4 values and exposed-like spectra have positive PC4 values, with minor overlap. The results of the t-test performed on the distributions of PC4 score values for the two types of spectra show that they are significantly different, as can be deduced from the box plots on the right side of Figure 4a. The PC4 loadings in Figure 4b show four intense negative peaks (at 785, 830, 1090 and 1580 cm⁻¹) whose spectral positions correspond to those of the four positive peaks in the difference of mean spectra shown in Figure 1c. Therefore, PCA confirms that the two types of spectra can be mainly discriminated according to PC4 and that this discrimination is related to the simulated spectral peaks whose intensity was changed to differentiate between control-like and exposed-like spectra.
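The comparison of two score distributions can be sketched with a two-sample t statistic. The variant of the t-test used for the box plots is not stated in this section, so the unequal-variance (Welch) form below is an assumption, and the score values are invented for illustration only.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va, vb = a.var(ddof=1) / a.size, b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
    return t, df

# Hypothetical PC scores: mostly negative for one class, mostly positive for the other
control_scores = np.array([-1.2, -0.8, -1.0, -0.5, -0.9, 0.1])
exposed_scores = np.array([0.9, 1.1, 0.7, 1.3, -0.1, 1.0])
t_stat, dof = welch_t(control_scores, exposed_scores)
```

A large absolute value of `t_stat` relative to the t distribution with `dof` degrees of freedom indicates that the two score distributions differ significantly; if SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False)` returns the same statistic together with its p-value.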
Similar results are obtained for the Raman spectra of control and proton-exposed MCF10A cells. For the spectra of these cells, the first 7 principal components carried around 90% of all the spectral variation found in the whole dataset. The score plot in Figure 5a shows that PC4 discriminates control from exposed cells: control cells have mainly positive PC4 score values and exposed cells have mainly negative PC4 score values. Although the overlap of score values is larger than in Figure 4a, the distributions of score values are statistically different according to the t-test, as can be deduced from the box plots on the right side of Figure 5a. Confirmation that PC4 discriminates between the two types of cells is obtained from Figure 5b, where the PC4 loadings are shown. The similarity of the spectrum in Figure 5b to that in Figure 2c is evident. The positive peaks in Figure 5b correspond to Raman peaks due to nucleic acid cellular components, while the spectral positions of the negative peaks correspond to Raman signals related to cellular protein and lipid components, as discussed above.
Finally, PCA results for the dataset of MCF7 and MDA cells are shown in Figure 6. For the FTIR spectra of this dataset, the first 7 principal components carried around 97% of all the spectral variation found in the whole dataset. The score plot in Figure 6a highlights that PC1 discriminates the metastatic MDA cells from the malignant MCF7 ones well, with almost no overlapping score values. In particular, MCF7 and MDA cells have positive and negative PC1 score values, respectively, and the box plots at the top of Figure 6a demonstrate that the two distributions are statistically different, according to the t-test. Furthermore, the PC1 loading plot in Figure 6b is very similar to the difference plot in Figure 3c, indicating that MCF7 cells (positive scores) have a larger relative content of nucleic acid components (positive loading bands at about 1085 and 1230 cm⁻¹) than MDA cells, whereas the spectral features in the 1500–1700 cm⁻¹ range are related to the shift of the spectral position of the amide I and II bands between the two types of cell.
Overall, the presence of several peaks in the loading plots of the discriminating PCs and the agreement of their spectral positions with those of the difference plots confirmed the discrimination potential of vibrational spectra. Therefore, classification techniques can be applied to the test sets of the three datasets, in order to evaluate the discrimination performance.
Hence, following the PCA, linear discriminant analysis was performed first. The first seven PC scores of the calibration set of simulated spectra were used as input data for LDA to build a diagnostic model to be used for the classification of unknown spectra. The PCA-LDA model correctly classifies all 36 spectra, as visible in the classification plot shown in Figure 7a (filled circles). In fact, a discriminant score for the attribution of an object (spectrum) to each class is calculated, and the object is assigned to the class for which the discriminant score is the largest. Hence, in Figure 7a samples lying close to zero for a class are associated with that class. By contrast, the accuracy of the PCA-LDA model for the classification of Raman spectra from unexposed and proton-exposed MCF10A cells, shown in Figure 7b, is 87.5%. Indeed, this model, obtained by using the first seven PC scores of the calibration set as input for LDA, erroneously attributes some samples of the calibration set to a class different from the one they actually belong to, as shown in Figure 7b by the crossed-circle samples. In particular, 4 control spectra were attributed to the exposed class and 1 exposed spectrum was attributed to the control class. Lastly, 100% accuracy is obtained for the PCA-LDA classification model related to MCF7 and MDA cells, as visible in Figure 7c.
Despite the small number of spectra we used, the accuracy of the PCA-LDA model is similar to that obtained by T. Ning et al. for the discrimination of two different types of breast cancer tissue from healthy breast tissue by means of Raman spectroscopy: their accuracy, validated through full cross-validation, is 88.3% [34]. Furthermore, Y. Lin et al. reported discriminating breast cancer using SERS spectra of serum proteins from cancer patients and healthy volunteers with the PCA-LDA model, achieving an accuracy of 84% with a ten-fold cross-validation method [35]. Similar accuracy values with the PCA-LDA algorithm applied to Raman spectra have been obtained for the discrimination of normal parenchyma and follicular patterned thyroid nodules (78%) and for carcinoma versus adenoma follicular lesions (89%) [36]. Furthermore, Raman spectra from normal and tumour oral tissues were differentiated by the PCA-LDA model with an accuracy of 81.25% with full and k-fold cross-validation methods [37].
Instead of cross-validation, we prefer to estimate the sensitivity and specificity of the classification model using a set of data (test set) that is external to and independent of the data used to build the model. This procedure is adopted in view of its possible use in a clinical setting for diagnostic purposes. Indeed, in that case, spectra would be acquired from areas containing material (cells, tissues) that is difficult to diagnose, and the classification technique would be applied to these spectra, relying on a dataset of spectra previously acquired from pathological and healthy areas.
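Holding out an external test set can be sketched as a random split performed separately within each class, so that both classes are represented in the calibration and test sets. The function name, the class sizes and the test fraction are hypothetical and do not reproduce the actual partition of the three datasets.

```python
import numpy as np

def stratified_holdout(y, test_fraction, seed=0):
    """Boolean masks (calibration, test) with the same test fraction in every class."""
    rng = np.random.default_rng(seed)
    test_mask = np.zeros(y.size, dtype=bool)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))   # shuffled indices of this class
        n_test = int(round(test_fraction * idx.size))
        test_mask[idx[:n_test]] = True
    return ~test_mask, test_mask

# Hypothetical labels: 25 spectra per class, 28% of each class held out
y = np.array([0] * 25 + [1] * 25)
cal, test = stratified_holdout(y, test_fraction=0.28)
```

The model is then built only from the spectra selected by `cal`, and the spectra selected by `test` are used exclusively for estimating the performance parameters.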
Therefore, the prediction parameters of the developed PCA-LDA model were tested using samples of the test sets from the two classes. The results of the model prediction are summarized in Table 1. Almost all the tested samples were predicted as belonging to the proper class, except for one MCF10A exposed sample, which was attributed to the control class. Therefore, the model was able to correctly classify the test samples: the accuracy, sensitivity and specificity reached their maximum values for the simulated spectra and the FTIR spectra, whereas they were 93%, 86% and 100%, respectively, for the proton-exposed MCF10A cells. In order to obtain a visual picture of the classification performance, the test samples were projected onto the classification plots in Figure 7, where they are represented by hollow circles. It is clearly visible that all the points representing the test set samples are close to the points representing the calibration set samples, with only one exception in Figure 7b due to the MCF10A exposed sample that was misclassified, as discussed above.
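The three performance parameters follow directly from the counts of correctly and incorrectly classified test samples. In the sketch below the size of the test set (7 control and 7 exposed spectra) is an assumption: it is not stated in this section and was chosen only because, with a single misclassified exposed sample, it reproduces the percentages quoted above.

```python
def performance(y_true, y_pred, positive):
    """Accuracy, sensitivity and specificity for a two-class prediction."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)   # true positives
    tn = sum(t != positive and p != positive for t, p in pairs)   # true negatives
    fp = sum(t != positive and p == positive for t, p in pairs)   # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)   # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Hypothetical test set: one exposed sample predicted as control
y_true = ["exposed"] * 7 + ["control"] * 7
y_pred = ["control"] + ["exposed"] * 6 + ["control"] * 7
result = performance(y_true, y_pred, positive="exposed")
```

With these assumed counts the accuracy is 13/14, the sensitivity is 6/7 and the specificity is 7/7, which round to 93%, 86% and 100%.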
The procedure of optimizing the model using a calibration set with cross-validation and then evaluating the model performance using a test set was also carried out by H. Li et al., who measured Raman spectra from various types of breast cancer [38]. In particular, they found that the PCA-LDA model correctly classifies all samples in the test set. Furthermore, N. Iturrioz-Rodríguez et al. have recently used Raman spectroscopy and a PCA-LDA model for the classification of glioblastoma multiforme cells derived from brain tumour patients versus astrocytes derived from healthy patients, using a test set consisting of cells different from those of the calibration set [18]. They reported an average classification accuracy of 92.5%. Therefore, the results we obtained for the performance parameters of the PCA-LDA technique applied to different types of vibrational spectra are in good agreement with those reported by other authors for spectra obtained with the Raman technique.
Furthermore, a PLS model was built for the calibration set of simulated spectra by using 7 latent variables. Clear discrimination of the control and exposed spectra is visible in Figure 8a, which shows (filled circles) the score plot of Factor 1 and Factor 2 of the PLS model. In particular, the separation between the two types of spectra can be observed along both factors. This is also confirmed by the plot of the regression coefficients, shown in Figure 8b for two components of the regression model. It can be deduced that the most important variables in the PLS model were those corresponding to wavenumbers around 780, 830, 1090 and 1580 cm⁻¹, as expected, because the spectral peaks centred at these wavenumber values are mainly responsible for the difference between the mean control-like and exposed-like spectra reported in Figure 1c. The prediction ability of the PLS model, checked by using only the samples from the test set, is summarized in Table 1. All the unknown samples were correctly assigned to the proper class, thus yielding maximum values of the performance parameters. These results can also be visualized by projecting the test samples onto the Factor 2 vs. Factor 1 score plot, shown as hollow circles in Figure 8a.
Similarly, the PLS model built for the calibration samples of the MCF10A cells clearly discriminates control from exposed cells, mainly according to Factor 1, as shown by the filled circles in Figure 9a. The similarity of the plot of regression coefficients with two components in Figure 9b to the difference between mean spectra in Figure 2c suggests that two factors are also able to correctly discriminate control from exposed samples according to the intensity of the Raman peaks related to nucleic acid components. As for the prediction ability of the model, one exposed sample of the test set was misclassified, resulting in accuracy and sensitivity values of 93% and 86%, respectively, as reported in Table 1 and visible in the scatter plot of the projected samples, shown as hollow circles in Figure 9a. The misclassified sample of the test set has been labelled for clarity.
Furthermore, the PLS model developed from the calibration samples of the MCF7 and MDA cells correctly discriminates metastatic from malignant cells according to Factor 1, as shown by the filled circles in Figure 10a. The regression coefficients of the PLS model with two components are plotted in Figure 10b: the spectral shape of this plot is analogous to that of Figure 3c, which reports the difference between the MCF7 and MDA mean spectra. Therefore, it can be deduced that two LVs correctly discriminate metastatic cells from malignant ones. In this case, the prediction ability of the model was good but not perfect, because one MCF7 sample of the test set was misclassified, as reported in Table 1. The obtained sensitivity was 100%, whereas the accuracy and specificity were 95% and 90%, respectively. The misclassified sample is also visible in Figure 10a as a red hollow circle, which has been labelled for clarity. Comparing the results of the PLS-DA classification with those of the PCA-LDA classification, it is evident that the latter performs better in terms of accuracy and specificity for this dataset.
The above performance parameters are better than those obtained by W. Liu et al., who applied Raman spectroscopy to patient tissues, together with the PLS-DA model and full cross-validation, to diagnose colorectal cancer with a sensitivity of 77.7%, a specificity of 91.0% and an accuracy of 84.3% [39]. Surface-enhanced Raman spectroscopy combined with the Lasso-PLS-DA algorithm with full cross-validation was used by G. Chen et al. for the identification of different tumour states in nasopharyngeal cancer [40]: they obtained a diagnostic sensitivity of 68% and a specificity of 84.0% for separating T2-T4 stage from T1 stage cancer. Larger values of the classification parameters were achieved by X. Yang et al., who reported 87.10% accuracy, 80% sensitivity and 91.89% specificity by using the PLS-DA model with test set validation to discriminate first-derivative FTIR data in the nucleic acid spectral range, collected from serum samples of patients with lung cancer and healthy people [25]. Recently, high sensitivity and specificity values (more than 90%) were also reported for the discrimination of FTIR spectra measured for two different melanoma cell lines (primary IPC-298 and metastatic SK-MEL-30) by using the PLS-DA model with test set validation [41].
As is evident from the above discussion, the PCA-LDA and PLS-DA techniques are widely used for the analysis and classification of spectral measurements, together with other multivariate data classification techniques [37,39,42,43,44]. However, they have mainly been applied to single datasets including spectra of different types in order to obtain classifications (e.g., discrimination of spectra from healthy and diseased cells). In contrast, in the present study, both classification techniques were applied to three very different spectral datasets, in order to obtain a comparison of the predictions that is as independent as possible of the single dataset. This comparison is useful for choosing the optimal method. The comparison pointed out the good performance of both methods, with an advantage for PCA-LDA, which classified the FTIR spectra with better accuracy and specificity than PLS-DA.