1. Introduction
Cancer is one of the main problems in healthcare, and a vast number of its forms and manifestations are widespread [1, 2]. Treatment outcomes are best when the disease is diagnosed at an early stage. However, most cancers produce few or no symptoms early on, so late diagnosis leads to a high mortality rate. New non-invasive techniques for cancer diagnostics are therefore needed.
Exhaled breath is being actively explored as a source of cancer biomarkers [3, 4, 5]. Interest in it is growing because sampling is simple, convenient, and non-invasive. Several groups have reported cancer diagnostic methods based on exhaled breath analysis with different analytical tools [6, 7, 8, 9]. Gas chromatography coupled with mass spectrometry (GC-MS) has taken a dominant position in the field of exhaled breath analysis since it is able to provide the most complete information regarding the sample composition [
10,
11,
12]. Additionally, other analytical methods, including ion mobility spectrometry (IMS) [
13], selected ion flow tube mass spectrometry (SIFT-MS) [
14,
15], proton-transfer-reaction mass spectrometry (PTR-MS) [
16,
17,
18], are widely applied for exhaled breath analysis. Electronic noses (e-noses) can be considered as a separate group of tools for exhaled breath analysis with the advantages of simplicity of construction, mobility of the device, and high speed of analysis [
19]. Various e-nose configurations are known to be good candidates as exhaled breath analysis instruments: an e-nose based on metal oxide semiconductor sensors [
20,
21], a chemoresistive e-nose [
22], Cyranose 320 [
23], aeonose [
24,
25], or combined devices consisting of several types of sensors [
26,
27]. Exhaled breath sampling techniques and analytical methods differ between studies, which can influence the results. Alveolar, end-tidal, or mixed exhaled breath can be analyzed. The concentration of endogenous volatile organic compounds (VOCs) is higher in samples of alveolar air [28]. However, sampling alveolar air requires sophisticated equipment, which restricts the mobility and speed of sampling. Sampling end-tidal exhaled air allows more alveolar air to be collected, but the ratio of alveolar to dead space air in a sample may differ from one person to another, which distorts the results. Mixed exhaled air is highly diluted by dead space air; therefore, the concentration of endogenous VOCs is lower. However, this approach is simple, quick, and does not require sophisticated equipment. Reliable results can be obtained from mixed exhaled air only when the ambient air and the sampling procedure are strictly controlled [29].
Diagnostic methods that use exhaled breath to detect cancer of various localizations have already been proposed [30, 31]. Most studies focus on a single disease, typically lung [21, 24, 26, 27] or breast [6, 23] cancer. Benzene, 2-propanol, styrene, and pentane have often been reported as metabolites linked with lung cancer development [32]. The breast cancer biomarkers reported in different studies differ significantly [23, 33]. However, heptanal was noted as a biomarker in several studies [34, 35]. Exhaled breath can also be useful for diagnosing cancer of other localizations. For example, the diagnostic model created in [36] identified cirrhosis as well as primary and secondary liver tumors; ethane, (E)-2-nonene, acetaldehyde, and acetone contributed the most to the diagnostic accuracy. Ovarian cancer was diagnosed with 89% accuracy using a diagnostic model based on decanal, nonanal, styrene, 2-butanone, and hexadecane, which were identified in exhaled breath using GC-MS [37]. Cyclohexanone, 2,2-dimethyldecane, dodecane, 4-ethyl-1-octyn-3-ol, ethylaniline, cyclooctylmethanol, trans-2-dodecen-1-ol, 3-hydroxy-2,4,4-trimethylpentyl 2-methylpropanoate, and 6-t-butyl-2,2,9,9-tetramethyl-3,5-decadien-7-yne were proposed as colorectal cancer biomarkers [38].
Another open question is whether independent evidence can confirm that the tumor affects VOC levels in exhaled breath. One way to obtain it is to compare the exhaled breath profiles of patients before and after surgery. This approach was demonstrated on 84 patients with lung cancer [39]. Concentrations of 2,5-dimethylfuran, cyclohexanone, propylcyclohexane, octanal, nonanal, decanal, and 2,2-dimethyldecane differed the most between the exhaled breath of these patients before and after surgery. An alternative approach is to study the VOC profile emitted by cancer cell lines. The authors of [40] compared the metabolite profiles of adenocarcinoma, squamous cell carcinoma, large cell carcinoma, and small cell carcinoma cell lines with that of one normal small airway epithelial cell line. Benzaldehyde, 2-ethylhexanol, and 2,4-decadien-1-ol were identified as potential lung cancer biomarkers. Comparing VOC profiles from different types of samples allows one to trace metabolic pathways and obtain additional evidence of the biomarkers’ origins. For example, correlations between the results for exhaled breath and fecal samples of patients with gastric cancer were found in [41].
Given that lung cancer has the highest mortality rate and that sophisticated diagnostic procedures are currently applied in clinical practice, the development of a non-invasive and accurate lung cancer diagnostic tool is the most urgent task [2, 8, 42]. The conventional approach to developing a diagnostic method is to compare healthy volunteers with patients who have the studied disease. However, the accuracy of such a diagnostic model is unknown when patients with other diseases are tested. Therefore, it is essential to evaluate potential biomarkers not only against healthy subjects but also for their selectivity with respect to other diseases. Some studies have considered diagnostic methods able to detect several cancer types simultaneously. For example, an electronic nose consisting of an array of cross-reactive nanosensors based on organically functionalized gold nanoparticles was presented in [30] for diagnosing lung, breast, colorectal, and prostate cancer. VOC profiles of patients with lung cancer, lung cancer and chronic obstructive pulmonary disease (COPD), COPD alone, and healthy subjects were compared in [42].
This paper focuses on the selectivity of GC-MS-based exhaled breath analysis in distinguishing lung cancer from cancer of other localizations. Esophageal, breast, colorectal, kidney, stomach, prostate, cervical, and skin cancers were considered.
2. Results
The study includes two groups of cancer patients: 85 patients with lung cancer and 85 patients with cancer of other organs, including 11 patients with esophageal cancer, 22 patients with breast cancer, 16 patients with colorectal cancer, 14 patients with kidney cancer, 7 patients with stomach cancer, 6 patients with prostate cancer, 5 patients with cervical cancer, and 4 patients with skin cancer. Exhaled breath samples from all patients were analyzed using GC-MS.
VOCs and VOC ratios that differed between the lung cancer group and the group with other cancer localizations were identified using the Mann–Whitney U test. Hexane (p = 0.013), acetonitrile (p = 0.036), 1-methylthiopropene (p = 0.010), 1-methylthiopropane (p = 0.006), and dimethyl sulfide (p = 0.021) differed significantly between patients with lung cancer and patients with cancer of other localizations. Several ratios also differed significantly between the two groups (Table 1).
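As a minimal sketch of the comparison described above, the following Python code computes a ratio of two peak areas per patient and applies a two-sided Mann–Whitney U test. It is not the software used in the study: the function names, the normal approximation with tie correction (without continuity correction), and the peak areas in the demo are assumptions made for illustration only.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected).

    Returns (U, p). The approximation is reasonable for groups of a few
    dozen samples; an exact test would be preferable for very small groups.
    """
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - n1 * n2 / 2) / math.sqrt(variance)
    return u, math.erfc(abs(z) / math.sqrt(2))


def ratios(numerators, denominators):
    """Element-wise ratio of two lists of peak areas."""
    return [a / b for a, b in zip(numerators, denominators)]


# Hypothetical peak areas (arbitrary units), NOT data from the study.
lung_a = [120.0, 95.0, 150.0, 110.0, 130.0]
lung_b = [400.0, 380.0, 410.0, 390.0, 405.0]
other_a = [80.0, 70.0, 90.0, 60.0, 85.0]
other_b = [395.0, 400.0, 385.0, 410.0, 390.0]

u_stat, p_value = mann_whitney_u(ratios(lung_a, lung_b), ratios(other_a, other_b))
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")
```

Running the same test over every VOC and every ratio yields a table of p-values analogous in structure to Table 1; the demo groups here are far smaller than the 85-patient groups in the study.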
The ratios were used as input values to build and validate a diagnostic model based on an artificial neural network (ANN). Accuracy was calculated for the training, validation, and test datasets. Two training algorithms, Broyden–Fletcher–Goldfarb–Shanno (BFGS) and nonlinear conjugate gradient, were compared, and the model was validated using three-fold cross-validation (Table 2). As seen from Table 2, the BFGS algorithm performed better on the test dataset.
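The validation scheme can be illustrated with a short Python sketch of three-fold cross-validation that reports the mean sensitivity and specificity over the held-out folds. This is only an outline under stated assumptions: a simple threshold rule on a single feature stands in for the ANN, the separate validation subset mentioned above is omitted, and all names and the synthetic feature values are hypothetical.

```python
import random


def stratified_folds(labels, k=3, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


def fit_threshold(xs, ys):
    """Placeholder model: midpoint between the two class means.

    Assumes class 1 tends to have the higher feature values.
    """
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


def cross_validate(features, labels, fit, predict, k=3, seed=0):
    """Train on k-1 folds, test on the remaining fold, average the metrics."""
    results = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        preds = [predict(model, features[i]) for i in test_idx]
        results.append(
            sensitivity_specificity([labels[i] for i in test_idx], preds))
    mean_sens = sum(r[0] for r in results) / k
    mean_spec = sum(r[1] for r in results) / k
    return mean_sens, mean_spec


# Synthetic single-feature data (NOT study data): 85 positives, 85 negatives.
rng = random.Random(42)
labels = [1] * 85 + [0] * 85
features = [rng.gauss(1.0 if y == 1 else 0.0, 1.0) for y in labels]
mean_sens, mean_spec = cross_validate(features, labels,
                                      fit_threshold, predict_threshold)
print(f"mean sensitivity = {mean_sens:.2f}, mean specificity = {mean_spec:.2f}")
```

Because `fit` and `predict` are passed in as functions, two training algorithms can be compared on identical folds by calling `cross_validate` twice with the same seed.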
The variability of exhaled breath samples of patients with lung, esophageal, breast, colorectal, and kidney cancer was estimated using Kruskal–Wallis H tests. Each pairwise comparison was conducted using Mann–Whitney U tests with subsequent adjustment of the p-values for the false discovery rate (FDR).
Some VOCs were found to be different in the studied groups (
Table 3).
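The FDR adjustment of the pairwise p-values can be sketched as follows. The text does not name the adjustment procedure, so the Benjamini–Hochberg step-up method used here is an assumption, as are the function name and the raw p-values, which are invented for illustration. With five groups there are ten pairwise comparisons per parameter.

```python
from itertools import combinations


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


groups = ["lung", "esophageal", "breast", "colorectal", "kidney"]
pairs = list(combinations(groups, 2))  # all pairwise comparisons

# Hypothetical raw Mann-Whitney p-values for one parameter (NOT study data),
# listed in the same order as `pairs`.
raw_p = [0.004, 0.20, 0.35, 0.012, 0.60, 0.03, 0.45, 0.75, 0.09, 0.50]
adjusted = benjamini_hochberg(raw_p)

for (a, b), p, q in zip(pairs, raw_p, adjusted):
    print(f"{a} vs {b}: raw p = {p:.3f}, adjusted p = {q:.3f}")
```

An adjusted value is never smaller than the raw one, so a pair that looks significant before adjustment may no longer be significant afterwards.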
Figure 1 shows the median and interquartile range of the parameters with the lowest p-values.
Discriminant analysis (DA) was applied to classify groups of patients with lung, esophageal, breast, colorectal, and kidney cancer. Ratios of VOCs, which were significantly different between the groups, were used as input values. The DA classification matrix is presented in
Table 4.
Figure 2 shows a scatter plot of the canonical values for exhaled breath samples grouped by cancer localization.
In addition, the gradient-boosted decision trees (GBDT) algorithm was used to separate groups of patients with lung, esophageal, breast, colorectal, and kidney cancer. To validate the model, three-fold cross-validation was used. The performance for training and test datasets was calculated (
Table 5).
3. Discussion
The development of a non-invasive cancer diagnostic method is an urgent challenge, which attracts the attention of many researchers worldwide [
4,
8,
10,
18,
21]. Despite the efforts of many research groups, a breath test for cancer diagnostics has not yet been implemented in clinical practice. This can be explained by the many pitfalls that are often overlooked during research. The conventional approach to biomarker identification compares a group of patients with the pathology to a group of healthy volunteers. However, this approach can lead to false-positive results because other disorders are not taken into account. The aim of this work was to compare groups of patients with cancer of various localizations. Esophageal, breast, colorectal, kidney, stomach, prostate, cervical, and skin cancers were considered. Not only peak areas but also their ratios were examined for differences between lung cancer and cancer of other localizations. This approach was demonstrated earlier [43].
Given the difficulties of lung cancer diagnostics, the most essential task was to separate the exhaled breath samples of lung cancer patients from those of patients with cancer of other localizations. For this purpose, the Mann–Whitney U test was applied. Acetonitrile, 1-methylthiopropene, 1-methylthiopropane, and dimethyl sulfide differed between patients with lung cancer and patients with cancer of other localizations. Acetonitrile [44], dimethyl sulfide [45], and 1-methylthiopropene [46] were reported earlier as lung cancer biomarkers. Dimethyl sulfide was also listed as a putative biomarker of esophageal cancer [17]. In our previous work, the 1-methylthiopropane/acetone ratio differed between lung cancer patients and healthy volunteers [43].
An ANN was used to create a model capable of separating patients with lung cancer from patients with other cancer localizations. ANNs are among the most powerful machine-learning algorithms and have been used in many studies to create diagnostic models [24, 47]. Our previous research showed that a diagnostic model created using an ANN was more accurate than random forest, support vector machine, and logistic regression models on the same dataset [43]. ANNs are flexible and can reveal complex patterns that may be inaccessible to traditional algorithms. Two algorithms were compared for training the ANN: Broyden–Fletcher–Goldfarb–Shanno (BFGS) and nonlinear conjugate gradient. The nonlinear conjugate gradient algorithm is attractive due to the simplicity of its iterations and its lower storage requirements [48]. BFGS is one of the most effective quasi-Newton methods [49]. BFGS outperformed the conjugate gradient algorithm: the average sensitivity and specificity on the test dataset were 67% and 69% for BFGS versus 56% and 57% for conjugate gradient. The accuracy achieved when lung cancer patients are compared with healthy individuals is significantly higher in most cases [50, 51, 52, 53]. The accuracy obtained in our research is inadequate for large-scale screening because of the high number of expected false positives. The study has several limitations. First, the various cancer localizations are unevenly represented in the group of patients with other cancer localizations. Second, the sample size is too small to obtain reliable results. Nevertheless, this study highlights the problem of differentiating various diseases through exhaled breath analysis: diagnostic models designed to identify lung cancer may classify patients with cancer of other localizations as lung cancer patients. Therefore, it is essential not only to compare samples from lung cancer patients and healthy volunteers but also to consider other pathologies that could be confused with the disease.
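The remark about expected false positives can be made concrete with a short worked calculation in Python. It is purely illustrative: the 67% sensitivity and 69% specificity were measured against patients with other cancers rather than a screening population, and the prevalence of 1% and the cohort of 10,000 people are hypothetical values chosen here, not figures from the study.

```python
def screening_outcome(sensitivity, specificity, prevalence, n_screened):
    """Expected true positives, false positives, and positive predictive value."""
    diseased = n_screened * prevalence
    non_diseased = n_screened - diseased
    true_pos = sensitivity * diseased
    false_pos = (1.0 - specificity) * non_diseased
    flagged = true_pos + false_pos
    ppv = true_pos / flagged if flagged else float("nan")
    return true_pos, false_pos, ppv


# Sensitivity and specificity from the text; prevalence and cohort size are
# hypothetical assumptions for illustration.
tp, fp, ppv = screening_outcome(sensitivity=0.67, specificity=0.69,
                                prevalence=0.01, n_screened=10_000)
print(f"true positives = {tp:.0f}, false positives = {fp:.0f}, PPV = {ppv:.1%}")
```

Under these assumptions, about 67 of the 100 diseased people would be flagged, together with roughly 3,069 of the 9,900 people without the disease, so only about 2% of positive results would be correct.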
Another task of this work was to evaluate whether patients with various cancer localizations, namely lung, esophageal, breast, colorectal, and kidney cancer, can be classified and to find the parameters specific to each group. For this purpose, the Kruskal–Wallis H test was used. As can be seen from Figure 1, no single parameter separates every cancer type into its own group. However, the level of dimethyl sulfide is elevated in lung and esophageal cancer in comparison with other cancer localizations. The majority of ratios containing sulfur compounds are higher in esophageal and colorectal cancers. Dimethyl sulfide and ratios containing this compound differed significantly between lung and esophageal cancer as well as between lung and kidney cancer. The levels of the VOCs and their ratios did not differ significantly among the remaining cancer localizations (Table 3).
DA was used to classify lung, esophageal, breast, colorectal, and kidney cancer because its results can be visualized with a scatter plot of canonical values. As shown in Figure 2, the exhaled breath samples of patients with cancer of various localizations cannot be separated. Most samples of esophageal, breast, colorectal, and kidney cancer are classified as lung cancer. ANN is one of the most effective machine-learning algorithms [38]; however, it works better when the groups contain an equal number of cases. For separating groups with different numbers of observations, one of the most effective machine-learning algorithms is GBDT [45], which was applied to classify the exhaled breath samples of patients with cancer of different localizations. The classification accuracy on the training data was relatively high for lung and esophageal cancer, but on the test data it was significantly worse for all cancer localizations. Among the studied cancer types, the model recognized lung and breast cancer best on the test dataset (Table 5). Lung, breast, colorectal, and prostate cancers were classified through exhaled breath analysis using an electronic nose based on cross-reactive nanosensors [30]. In that study, the groups of patients with lung, breast, and colon cancer were fully separated, but the groups of prostate cancer, lung cancer, and healthy individuals overlapped. Our study also shows better separation of lung and breast cancer, but the accuracy is significantly lower. The main limitation of this part of the study is the small sample size combined with a large number of compared groups, each of which contains few samples.
The exhaled breath VOC profiles of lung cancer patients, patients suffering from other lung diseases (e.g., chronic obstructive pulmonary disease (COPD), asthma, pneumonia, pulmonary embolism, benign lung tumors), and healthy controls were compared in study [42]. It was shown that lung cancer was discriminated from healthy controls better than from other lung diseases. The classification of 50 breast cancer patients, COPD patients, and healthy volunteers was achieved with 100% accuracy on test data using chemoresistive gas sensors and canonical analysis of principal coordinates [54].
The results obtained in this study further support the concern that a breath test may yield an incorrect diagnosis, since the samples of patients with cancer of various localizations are poorly separated. Separating cancers of various localizations is therefore essential for the development of a reliable and accurate cancer diagnostic tool.