1. Introduction
The COVID-19 pandemic has lasted almost three years since its outbreak in late 2019, affecting all aspects of our lives and activities. Owing to its danger and unpredictability, it could be the most severe threat in generations [1]. Great efforts have been made to understand and characterize SARS-CoV-2, including its viral structure [2,3,4], spread process [5,6,7,8], pathogenesis mechanism [9,10], diagnostic approach [8], variants [11,12,13], and vaccine immunity [13,14,15]. However, despite this scientific progress, significant challenges remain in detecting the virus. Accurate and rapid identification plays an important role in preventing and containing the epidemic, and early warning and identification of bio-threats is the priority of protection. With operability and timeliness in mind, detection methods including sequencing, real-time PCR [16,17,18], loop-mediated isothermal amplification (LAMP) [17,19,20,21], droplet digital PCR [22], and CRISPR-based diagnosis [23] have all been developed. Meanwhile, serological detection technologies have also played a significant part in diagnosis, especially the enzyme-linked immunosorbent assay (ELISA) and lateral flow immunochromatography (LFIA), which have often been compared with each other to obtain better sensitivity [24,25,26,27]. However, PCR-based methods are currently expensive and low-throughput, and sometimes produce false-negative results due to procedural or technical problems, while ELISA techniques often require high-quality antisera [28]. Most importantly, these methods are destructive to samples. Therefore, developing a rapid identification method for SARS-CoV-2 remains a challenge.
Spectroscopy is a practical approach for detecting different kinds of viruses [29]; it is vital in bioengineering, natural sciences, environmental monitoring, and medical research. In particular, spectroscopy combined with algorithmic analysis can be an effective way to determine the presence of target molecules. The usual steps for diagnosing COVID-19 by spectroscopy can be summarized as follows: (1) collection of specimens by nasopharyngeal or oral swab; (2) sample preparation and analysis by a spectroscopic technique such as Raman, infrared, or fluorescence spectroscopy; (3) application of an algorithm to the spectral results; (4) differentiation of the target molecules from the control samples after statistical analysis. A previous study proposed a method for detecting COVID-19 by Raman spectroscopy [30]. The authors analyzed and discriminated blood serum samples by coupling Raman spectroscopy with principal component analysis (PCA) and partial least squares (PLS), achieving 87% sensitivity and 100% specificity, which indicates a possible rapid detection method for COVID-19. In addition, Sanchez et al. proposed detecting the spike and nucleocapsid proteins of SARS-CoV-2 using surface-enhanced Raman spectroscopy as a replacement for RT-PCR, because a significant signature of the virus could be obtained [31]. Furthermore, Barauna and co-workers reported an onsite, rapid, reagent-free, and nondestructive method for detecting SARS-CoV-2 by attenuated total reflectance Fourier transform infrared (ATR-FTIR) spectroscopy and a genetic algorithm-linear discriminant analysis (GA-LDA) algorithm, which achieved a blind sensitivity of 95% and a specificity of 89% [32]. Beyond identifying and detecting SARS-CoV-2, spectroscopy such as three-dimensional excitation-emission matrix fluorescence can be used to classify different proteins [33]. Therefore, spectroscopy combined with machine learning shows promise for detecting and diagnosing viral infections such as COVID-19.
The most commonly used algorithms for discrimination are principal component analysis (PCA) and linear discriminant analysis (LDA). PCA is a dimensionality reduction technique used in diverse areas [34]. It simplifies a data set by linearly transforming it into a smaller number of important variables, known as principal components. The method takes many original variables that are correlated with one another, performs feature extraction, and recombines them into a new, smaller set of uncorrelated, comprehensive indicators (components) that replace the original ones. By using these new variables in place of the original data for subsequent processing, a high-dimensional problem can be transformed into a low-dimensional one for analysis [34]. This approach reduces the dimensionality of multivariate data systems and simplifies the statistical features of system variables. PCA has been applied to spectral analysis in materials science and biochemistry, for example to Raman spectra for nanoparticle characterization [35] and for Hepatitis C infection [36], indicating its potential in descriptive and screening studies. LDA is also a dimensionality reduction technique, but one that focuses on classification. When the number of classes is known, it can be used to determine the class to which an unknown object belongs, based on observed indicators of objects that have already been classified. Discriminant analysis first requires a set of classified objects, then selects variables that describe the observed objects comprehensively, and finally establishes one or more discriminant functions according to certain discriminant criteria. A case of undetermined category can then be assigned to a class by substituting it into the discriminant functions. Previous studies have used LDA in spectral analysis. Lv et al. proposed using LDA to classify freshwater fish species based on near-infrared reflectance spectroscopy [37]. Lin et al. reported a method for distinguishing osteonecrosis from normal tissue by combining near-infrared spectroscopy and LDA [38]. Both PCA and LDA can perform dimensionality reduction on data [39]. The difference is that LDA is a supervised dimensionality reduction method that uses class labels, whereas PCA is unsupervised [39]. In addition to dimensionality reduction, LDA can also be used for classification. PCA achieves dimensionality reduction by finding the linear combinations of the multidimensional data with the largest variance. This clearly simplifies multidimensional data, but it makes it difficult to distinguish effectively between similar data belonging to different species. LDA projects high-dimensional data into the best discriminant vector space to maximize the separability of samples in the new space, thereby extracting the classification information effectively [39,40]. Therefore, PCA and LDA complement each other, and their combination has the advantages of a simple procedure and high efficiency. Ren et al. developed a method to recognize asphalt fingerprints based on ATR-FTIR spectroscopy and PCA-LDA analysis [41]. In the clinical area, PCA-LDA could be employed to diagnose dental fluorosis with the help of micro-Raman spectroscopy [42]. PCA-LDA could also be regarded as a helpful analysis technique in forensic science to detect bloodstains based on ATR-FTIR [43].
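The PCA-LDA combination described above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the procedure used in this study: the spectra are synthetic, the function names (`pca_fit`, `lda_fit`, `lda_predict`, `loo_accuracy`) are hypothetical, equal class priors are assumed, and leave-one-out is only one possible form of cross-validation.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal axes (as rows)."""
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of Vt are directions of decreasing variance.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def lda_fit(Z, y):
    """Fit LDA with a pooled within-class covariance (supervised step)."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    centered = Z - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(Z) - len(classes))
    cov += 1e-6 * np.eye(Z.shape[1])  # small ridge for numerical stability
    return classes, means, np.linalg.inv(cov)

def lda_predict(Z, model):
    """Assign each row of Z to the class with the largest discriminant score."""
    classes, means, cov_inv = model
    # Linear discriminant function with equal priors (an assumption):
    # score_k(z) = m_k^T S^-1 z - 0.5 * m_k^T S^-1 m_k
    scores = Z @ cov_inv @ means.T - 0.5 * np.sum(means @ cov_inv * means, axis=1)
    return classes[np.argmax(scores, axis=1)]

def loo_accuracy(X, y, n_components):
    """Leave-one-out cross-validation; PCA and LDA are refit on each training fold."""
    hits = 0
    for i in range(len(X)):
        train = np.arange(len(X)) != i
        mean, axes = pca_fit(X[train], n_components)
        model = lda_fit((X[train] - mean) @ axes.T, y[train])
        pred = lda_predict((X[i:i + 1] - mean) @ axes.T, model)
        hits += int(pred[0] == y[i])
    return hits / len(X)

# Synthetic stand-in for spectra: 3 classes x 6 replicates x 200 channels.
rng = np.random.default_rng(0)
n_classes, n_reps, n_channels = 3, 6, 200
centers = rng.normal(0, 5, size=(n_classes, n_channels))
X = np.vstack([c + rng.normal(0, 1, size=(n_reps, n_channels)) for c in centers])
y = np.repeat(np.arange(n_classes), n_reps)

accuracy = loo_accuracy(X, y, n_components=2)
print(f"leave-one-out accuracy: {accuracy:.3f}")
```

PCA is fitted without the labels `y`, while LDA needs them, which mirrors the unsupervised versus supervised distinction drawn above. Refitting both steps inside each fold keeps the held-out spectrum from influencing the projection.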
In summary, the combination of spectroscopy and algorithms has been a powerful tool in scientific research, including biochemistry, environmental, and forensic studies. However, as for COVID-19, despite all the achievements, research on a systematic approach for the early identification of SARS-CoV-2 is still limited, because practical obstacles such as the infectivity and pathogenicity of the virus could not be overcome. Moreover, although SARS-CoV-2 was distinguished from non-SARS-CoV-2 samples with an accuracy of 97.4% using complementary DNA and machine learning algorithms [44], the significant role of spectroscopy was neglected. In our previous work [45], a virus-like model based on the nucleocapsid protein of SARS-CoV-2 was successfully synthesized to replace the original virus in some scientific research. Therefore, in this study, virus-like models based on the spike and nucleocapsid proteins were produced by phage display and named Model-S and Model-N, respectively, to serve as substitutes for the bio-threats of COVID-19. Three-dimensional excitation-emission matrix fluorescence and Raman spectroscopy were then applied to the models and other selected samples for spectral analysis. By subsequently combining the PCA and LDA algorithms, a systematic process to identify and discriminate the virus-like models of COVID-19 was developed, and the results from the two types of spectra could be compared. The idea and system may provide a method for actual SARS-CoV-2 detection and help to realize early monitoring of different kinds of bio-threats in the future.
3. Discussion
Both 3DFS and RS, combined with the PCA-LDA algorithm, can classify and differentiate the bio-threats from the other samples. Moreover, cross-validation showed the two methods to be reliable, with correct classification rates of 88.9% and 96.3%, respectively. The rate for RS is slightly higher than that for 3DFS. In 3DFS, however, the misrecognition occurs among N protein, S2 protein, E. coli TG1, and OVA, whereas the synthesized Model-N and Model-S can be separated directly. In the RS analysis, by contrast, the two bio-threats can be recognized from the other samples, but it is difficult to distinguish between them. Specifically, according to the cross-validation, a Model-N sample has a 33.3% probability of being misidentified as Model-S. This might cause problems when an individual model is studied. Nevertheless, when SARS-CoV-2 is considered from a holistic perspective, Model-N and Model-S are both derived from it, and these models are intended to imitate the actual virus as bio-threats and can be regarded as a whole. In addition, there were slight differences among the repeated measurements of an individual sample in both 3DFS and RS. However, the overall identification and classification were not affected, and the results from both types of spectra were useful.
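The arithmetic behind these percentages can be illustrated with a small confusion-matrix sketch. The layout below is hypothetical: it assumes nine sample classes with three replicate spectra each (27 spectra in total), a replicate count that is not stated in this section and is chosen only because it reproduces the reported figures. The class indices and the helper names `diagonal_confusion` and `accuracy` are likewise illustrative.

```python
# Assumption: 9 classes x 3 replicate spectra = 27 spectra (not stated in the text).
N_CLASSES, N_REPLICATES = 9, 3
MODEL_N, MODEL_S = 0, 1  # arbitrary row/column indices for the two models

def diagonal_confusion(n_classes, n_replicates):
    """Confusion matrix for a perfect classifier (rows: true, columns: predicted)."""
    return [[n_replicates if r == c else 0 for c in range(n_classes)]
            for r in range(n_classes)]

def accuracy(confusion):
    """Fraction of spectra on the diagonal, i.e. assigned to their true class."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# RS case: one Model-N spectrum is assigned to Model-S, everything else is correct.
rs = diagonal_confusion(N_CLASSES, N_REPLICATES)
rs[MODEL_N][MODEL_N] -= 1
rs[MODEL_N][MODEL_S] += 1

rs_accuracy = accuracy(rs)
model_n_to_s_rate = rs[MODEL_N][MODEL_S] / sum(rs[MODEL_N])

# 3DFS case: three misassigned spectra out of 27 under the same assumption.
total_spectra = N_CLASSES * N_REPLICATES
dfs_accuracy = (total_spectra - 3) / total_spectra

print(f"RS accuracy: {rs_accuracy:.1%}")              # 96.3%
print(f"Model-N -> Model-S: {model_n_to_s_rate:.1%}")  # 33.3%
print(f"3DFS accuracy: {dfs_accuracy:.1%}")            # 88.9%
```

Under this reading, a single confused Model-N spectrum accounts for both the 96.3% overall rate and the 33.3% per-class misidentification rate in RS.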
5. Conclusions
In summary, to provide a possible method for identifying the highly pathogenic virus SARS-CoV-2 beyond the traditional biological approaches, spectral analysis and algorithms were employed. First, virus-like models were synthesized by phage display technology based on the spike and nucleocapsid proteins of SARS-CoV-2. These nonpathogenic models were prepared to replace the actual virus in research, avoiding physical and environmental safety problems. Then, nine related samples, including the models, proteins, virus, and bacteria, were selected for discrimination. Three-dimensional fluorescence spectroscopy (3DFS) and Raman spectroscopy (RS) were applied to collect their spectral information. With the PCA-LDA algorithm, the nine samples were separated, and the bio-threats (Model-N and Model-S) were identified successfully, with correct classification rates of 88.9% (3DFS) and 96.3% (RS) after cross-validation. Both approaches were useful and reliable for identification. This study provides an idea for diagnosing SARS-CoV-2 from the perspective of spectral algorithm analysis. It also proposes a pattern for the research and recognition of highly pathogenic viruses by a non-intrusive, nondestructive method, which may help to realize monitoring and control of the pandemic at an early stage.