A High-Precision Monitoring Method Based on SVM Regression for Multivariate Quantitative Analysis of PID Response to VOC Signals

Feng, Xiujuan; Liu, Zengyuan; Ren, Yongjun; Dong, Chengliang

doi:10.3390/chemosensors12050074

¹ School of Mines, China University of Mining and Technology, Xuzhou 221116, China

² Industrial Technology Innovation Center for Ecological Restoration of Industrial and Mining Sites in the Petroleum and Chemical Industry, Xuzhou 221116, China

³ Mechano Chemistry Research Institute, China University of Mining and Technology, Xuzhou 221116, China

⁴ School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China

* Authors to whom correspondence should be addressed.

Chemosensors 2024, 12(5), 74; https://doi.org/10.3390/chemosensors12050074

Submission received: 29 February 2024 / Revised: 26 April 2024 / Accepted: 28 April 2024 / Published: 3 May 2024

(This article belongs to the Special Issue Chemical Sensors for Volatile Organic Compound Detection, 2nd Edition)

Abstract

In the moist soil-water-air environment, photoionization detectors (PIDs) monitor volatile organic compounds (VOCs) with low accuracy. This study is based on the PID water-soil-gas VOC online monitor developed by our group. The monitor was used to measure online the concentrations of different VOC constituents at production enterprises of the petroleum and chemical industries in Shandong Province, and these readings were compared with laboratory-tested concentrations to build a correlation model. The correlation coefficient between the PID test concentration and the actual concentration was obtained by training on a large amount of collected data. Given the application of PIDs in VOC monitoring, establishing a high-precision PID calibration model is important for the precise monitoring of VOCs. In this paper, a multivariate quantitative analysis of the PID response to VOC signals, based on SVM regression, is conducted to study a high-precision VOC monitoring method. Using the PID response signals measured by the research group under different environmental VOC concentrations, the PID response to VOC signals was first modeled using the support vector machine principle to verify the effect of traditional SVM regression. To address redundancy in the raw data, the time-domain and frequency-domain characteristics of the PID signal were calculated, and principal component analysis was applied to the extracted features of the PID signal. To make the SVM regression more generalizable and robust, the kernel function parameters and the penalty factor of the SVM were optimized by a genetic algorithm. By comparing the accuracy of PID calibration models such as SVM regression, PID signal feature extraction SVM regression, and principal component analysis SVM regression, the superiority of the signal feature extraction PCA-GA-SVM method for monitoring VOCs with a photoionization detector is verified.

Keywords:

VOCs; PID signal feature extraction; principal component analysis (PCA); support vector regression (SVR)

1. Introduction

Volatile organic compounds (VOCs) encompass a diverse array of chemical compounds. As per the World Health Organization (WHO, 1989), total volatile organic compounds (TVOCs) represent a collective term for organic compounds characterized by a melting point below room temperature and a boiling point falling within the range of 50 °C to 250 °C. Within the environmental context, these compounds are denoted as a dynamic group possessing volatility and potential health hazards. VOCs are typically classified based on their chemical structures, which encompass alkanes, aromatics, esters, aldehydes, and other categories. Over 300 distinct types of VOCs have been identified, with notable examples including benzene, ethylbenzene, toluene, xylene, styrene, trichloroethylene, trichloroethane, diisocyanate (TDI), diisocyanotoluene, and various others [1,2]. If emitted into the environment in excess, they can cause symptoms of poisoning in humans [3]. Benzene compounds can also lead to dysfunction in the human nervous system [4]. Prolonged inhalation of benzene compounds can result in abnormal liver function, damage to the hematopoietic organs, and may even lead to symptoms of sepsis, causing abnormalities in human health and potentially triggering disorders such as aplastic anemia. In cases of large-scale vaporization of benzene, individuals may experience acute poisoning, which can lead to fatalities [5].

Five primary monitoring technologies for VOCs are currently employed: photoionization detector (PID), Fourier transform infrared spectrometer (FTIR), gas chromatography-mass spectrometry (GC-MS), flame ionization detector (FID), and metal oxide semiconductor sensor (MOS). Among these, PID, FTIR, and GC-MS are commonly favored in industrial applications. Each method presents distinct advantages and limitations. GC-MS offers extended monitoring durations for VOCs. FID, however, is prone to interference from oxygen, moisture, and nitrogen-, oxygen-, or halogen-containing compounds in the environment during VOC monitoring. Differential optical absorption spectroscopy (DOAS), while capable, is restricted in its applicability primarily to benzene, toluene, and related compounds. Compared with PID, MOS suffers from low selectivity and high operating temperatures in practical industrial settings. FTIR technology is particularly suited for environmental VOC monitoring, albeit at the cost of high equipment and maintenance expenses. PID stands out for its high monitoring accuracy, nondestructive nature, rapid response time, and extended operational lifespan [6]. It enables VOC monitoring at atmospheric pressure, typically achieving ppm-level concentration monitoring, with some high-accuracy PIDs capable of ppb-level concentration monitoring. PIDs exhibit high sensitivity and facilitate nondestructive monitoring through ionization of measured VOCs in the ionization chamber, enabling synergy with mass spectrometry and other VOC monitoring techniques to ascertain VOC components and concentrations swiftly. Furthermore, PIDs offer quick response times, with the studied PID registering VOC concentrations within 3 s of contact. Thus, PID emerges as a desirable option for VOC monitoring purposes [7,8,9,10].

The calibration methods for photoionization detectors (PIDs) used in VOC monitoring primarily determine the concentration of VOCs from voltage values. This approach is evident in the research of Arnaud Termonia [11], Chung-hwan Je [12], Gianfranco Manes [13], Kentaro Oka [14], Qian [15], Wang J [16], Li [17], Wang Li [18], and others. In a PID-based water-soil-gas VOC monitor, the collected voltage signal contains many interference signals caused by gas conditions such as humidity, temperature, and cleanliness. In particular, to monitor VOCs in soil-water-gas, the VOCs in the soil-water must be heated so that they volatilize, and the volatilized VOCs are accompanied by water vapor. In an environment with water vapor, the photocathode surface of the PID is not evenly distributed, so the sensor generates noise. The signals generated by the PID are therefore mixed with complex noise, which distorts the PID voltage values and leads to misjudgments of VOC concentrations.

Machine learning methods offer a solution by using multidimensional features of the signal instead of relying on a single voltage value for PID calibration. These multidimensional features help reduce the impact of noise on PID signals responding to VOC concentrations, thereby improving the robustness of the method [19].

Deep learning methods, like artificial neural networks, require the collection of a large number of VOC samples, which means preparing VOC gas at specific concentrations [17,18,19]. This demands a substantial amount of human resources, materials, and financial investment [15,16,17,18,19]. The support vector machine (SVM) method, based on small-sample statistical learning theory, addresses this issue by constructing an optimal hyperplane that maximizes the distance between the hyperplane and different sets of samples in the sample or feature space [20]. The objective is to maximize the generalization ability. SVM demonstrates superior generalization ability compared to deep learning methods such as artificial neural networks. In addition, the solution provided by SVM is the unique globally optimal solution. Therefore, this paper investigates the use of SVM regression (SVR) to monitor VOC concentrations in the soil-water-air environment.

This study is based on the PID water-soil-gas VOC online monitor developed by our group [21]. The monitor was used to measure online the concentrations of different VOC constituents at production enterprises of the petroleum and chemical industries in Shandong Province, and these readings were compared with laboratory-tested concentrations to build a correlation model. By training on a large amount of collected data, the correlation coefficient between the PID test concentration and the actual concentration was obtained.

2. Related Work

2.1. Monitoring Method for VOCs Based on PID

In 1997, Arnaud Termonia et al. [11] used gas chromatography-mass spectrometry in conjunction with PIDs to construct a VOCs monitoring system for monitoring VOCs in landfill sites. This method enables effective VOC monitoring but it has issues such as high cost and maintenance difficulties. In 2007, Chung-hwan Je et al. [12] established an online VOC monitoring system using a set of PID. They focused on the development and application of a multichannel monitoring system based on PID for measuring, processing, and analyzing the concentration levels of VOCs emitted from a walk-in fume hood in hazardous waste management facilities. This system reduced the noise of PID signals by summing up the data over a time interval. In 2016, Gianfranco Manes et al. [13] addressed the issues of nonlinear data, periodic calibration, and replacement distribution associated with PID in long-term monitoring in their VOC online monitoring system based on PID. The first VOC online monitoring system was installed in a petrochemical plant in Italy. Since its installation, the system has been continuously operational without human intervention. The successful operation of this system validates the feasibility of regional VOC monitoring using PID.

In 2015, Qian Kun et al. [15] designed a low-power ZigBee sensor network and a data reception control framework between nodes, based on a photoionization detector, for real-time acquisition and communication of VOC air pollutant levels, enabling automated indoor VOC monitoring. In 2019, Healy et al. [21] analyzed the principle and characteristics of VOC detection using a photoionization detector. They also assessed the advantages and disadvantages of various VOC detection methods and performed a cost analysis. Additionally, they provided a detailed description of the circuit design and software system design for online monitoring systems. In 2020, Wang Jin et al. [16] discussed the low accuracy and high cost of photoionization detectors produced in China, which make them difficult to deploy widely. They then proposed separating the sensor current detection module from the sensor radio frequency ultraviolet lamp driving module and designed a high-precision photoionization detector. In 2019, Li Hai et al. [17], based on the principles of photoionization technology, conducted theoretical analysis and simulations to determine various parameters of the ionization chamber in the photoionization detector (PID) according to practical conditions.

2.2. Quantitative Analysis Method for PID Signal

Currently, the calibration method for PID used in VOC monitoring primarily relies on voltage values to determine the current VOC concentration [11,12,13,14,15,16,17,18]. However, when PID is applied to monitor VOCs in real-world soil-water-air environments, the signals generated by the PID can be affected by complex noise. This complex noise can impact the numerical value of PID voltage and result in inaccuracies in VOC concentration estimation. Machine learning methods, on the other hand, can utilize multidimensional features of the PID signal instead of relying solely on voltage values for PID calibration. These multidimensional features help reduce the interference of noise when the PID signal responds to VOC concentration, thus enhancing the accuracy of VOC monitoring using PID.

In recent years, the artificial neural network method has been widely applied in various fields. However, artificial neural networks have drawbacks and limitations, such as slow convergence speed, slow generalization, and a tendency to get trapped in local optima. Moreover, in practical VOC engineering processes, artificial neural networks often require a large number of samples for training. In order to achieve precise VOC concentrations, it is necessary to prepare gas at a specific concentration. However, the preparation work is complex and requires a significant amount of manpower and financial resources [14,15,16,17,18,19].

The basic idea of SVM is to construct an optimal hyperplane that maximizes the distance between the hyperplane and the sample sets of different classes in the sample or feature space, aiming to achieve the goal of maximizing generalization ability [19,20]. Unlike traditional artificial neural network methods, SVM adopts a structural risk minimization criterion, minimizing the generalization error bound to achieve maximum generalization ability [22]. SVM has better generalization ability compared to artificial neural network methods, and its solution is the unique global optimum. Therefore, in this paper, we apply SVR based on PID signals to monitor VOC concentrations.

3. PID Selection and Problem Statement

3.1. Calibration of PID for Various VOCs

Different types of volatile organic compounds (VOCs) differ in the number of electrons generated and the extent of ionization when ionized under high-energy ultraviolet light. As a result, the photoionization detector (PID) may produce different signals for VOCs of different composition even at the same concentration level. To establish a measurement standard, the PID employs a correction factor (CF) to compute the concentration of the monitored gas in relation to the standard gas [23]. In this study, benzene was used as the calibration gas for the PID, with a predefined calibration factor of 0.53. The CF for a specific VOC component is defined by Equation (1), where C_b denotes the concentration of the standard benzene gas, R_b the reading of the benzene gas used for calibration, C_m the concentration of the particular VOC component, and R_m the reading of that VOC component.

CF = \frac{R_{b} C_{m}}{C_{b} R_{m}}

(1)

It can be observed that different VOC gases have varying sensitivities in the PID. Certain VOC gases with high calibration coefficients, such as isobutanol and cyclohexane, exhibit low sensitivity in the PID. Conversely, some VOC gases with lower calibration coefficients, such as styrene and chlorobenzene, demonstrate relatively higher sensitivity. This leads to varying detection accuracies for different gases. Hence, in practical VOC monitoring, the concentration displayed by the PID needs to be multiplied by the corresponding response coefficient (RF) to obtain the actual VOC concentration, as given by Equation (2).

C_c = ρ × RF

(2)

Here, C_c represents the actual concentration of the gas to be measured, and ρ represents the concentration displayed by the PID. Table 1 lists the response coefficients for some VOC gases [16].

For example, if the PID response coefficient for isobutylene is 8.87 and that for benzene is 4.435, then when isobutylene generates a response value of 4.435 V, benzene would produce a response value of 8.87 V.
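The two relations in Equations (1) and (2) can be sketched in a few lines of Python. This is only an illustration: the function names and all numeric values in the usage lines are hypothetical and are not taken from Table 1 or from the PID manual.

```python
def correction_factor(c_b, r_b, c_m, r_m):
    """Equation (1): CF = (R_b * C_m) / (C_b * R_m).

    c_b, r_b: concentration and PID reading of the standard (calibration) gas.
    c_m, r_m: concentration and PID reading of the VOC component of interest.
    """
    if c_b <= 0 or r_m <= 0:
        raise ValueError("c_b and r_m must be positive")
    return (r_b * c_m) / (c_b * r_m)


def corrected_concentration(displayed, rf):
    """Equation (2): C_c = rho * RF, with rho the concentration displayed by the PID."""
    return displayed * rf


# Hypothetical numbers: the standard gas at 100 ppm reads 100, while the
# target gas at 100 ppm reads only 50, so the target gas is less sensitive.
cf = correction_factor(c_b=100.0, r_b=100.0, c_m=100.0, r_m=50.0)
actual = corrected_concentration(displayed=50.0, rf=cf)
```

A gas to which the PID is less sensitive therefore receives a factor greater than 1, and its displayed reading is scaled up accordingly.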

3.2. PID Selection in the Work

The ultraviolet lamp in the PID is a crucial component that directly determines the performance of the photoionization detector. It has significant effects on important functional indicators of the PID, including the detection limits and accuracy of VOC monitoring. Furthermore, the ultraviolet lamp directly influences key performance indicators of the PID, such as power consumption, lifespan, and size [24,25]. To ensure proper transmission, the ultraviolet lamp requires a material with specific lattice constants as the window material to facilitate the transmission of vacuum ultraviolet photons necessary for photoionization detection. Inert gas is filled inside the ultraviolet lamp to extend its lifespan and accelerate the ignition speed, thereby enhancing the output light intensity.

Commercially available PID ultraviolet lamps mainly have photon energies of 11.7 eV, 10.6 eV, 10.2 eV, 9.8 eV, 9.6 eV, 8.4 eV, and 8.3 eV [26]. The 10.6 eV ultraviolet lamp demonstrates stable performance during operation. In this research, a 10.6 eV AC-powered ultraviolet lamp was selected for the PID (MH-Sensor PID Photoionization Gas Sensor 4R, MeiHui Science And Technology Co., Ltd, Shenyang, China). The sensor has an ultra-small 10.6 eV UV lamp (manufacturer: Shenyang Magnesium Technology, type: krypton lamp, window: magnesium fluoride, power: 0.5 W). Because the window material of this ultraviolet lamp is magnesium fluoride, the sensor can detect VOCs with ionization potentials lower than 10.6 eV, such as gasoline, benzene derivatives, ketones, esters, and aldehydes.

The PID used in this study has a response time of less than 3 s and exhibits varying response values for different VOCs. The detection range for a given VOC can be obtained from its response coefficient. According to the PID manual, the maximum voltage that the photoionization detector can generate is 2.9 V. If the detection range of the PID for isobutylene is 1–2000 ppm, the theoretical detection range for benzene should be 1–1000 ppm. VOCs with other compositions have different detection ranges, which can likewise be calculated using the response coefficient. The operational temperature range of the PID is −20 to 60 °C.

3.3. Existing Problems in VOC Monitoring with PID

When monitoring VOCs in soil-water-gas systems using PID, it is necessary to volatilize the VOCs from the soil-water environment for measurement. To achieve this, monitoring devices are often equipped with heating devices. However, since VOCs are present in the soil-water medium, volatilization of VOCs is accompanied by a significant amount of water. The presence of this water content increases the humidity in the PID monitoring environment, with humidity up to 92%. In an environment with increased humidity, the PID’s photocathode surface may exhibit barriers and uneven distribution, resulting in abundant low-frequency noise in the PID signal. This noise can affect the accuracy of the PID voltage readings, thereby leading to misinterpretation of VOC concentrations.

SVM exhibits superior generalization capability compared to traditional machine learning methods such as artificial neural networks. Moreover, SVM provides a unique global optimum solution. Hence, this research investigates the utilization of SVR for monitoring VOC concentration in soil-water-air environments.

4. Analysis of VOC Concentration Based on Traditional SVR

4.1. SVR

SVR is the regression version of SVM, used for handling regression problems. Given a training sample set {(x₁, y₁),…, (x_n, y_n)} with x_i, y_i ∈ R, consider the linear regression function [20,27]:

F(x) = w \cdot x + b

(3)

To ensure the flatness of the function F(x), w should be as small as possible, so the Euclidean norm of w is minimized. Assuming that all training data points (x_i, y_i) can be approximated by a linear function within an accuracy of ε, the problem of finding the minimum value of w can be formulated as a convex optimization problem.

\min \frac{1}{2} ‖ w ‖^{2}

(4)

The constraint condition is as follows.

\left\{ \begin{array}{l} y_{i} - w \cdot x_{i} - b \leq ε \\ w \cdot x_{i} + b - y_{i} \leq ε \end{array} \right.

(5)

To allow for fitting errors, the slack variables $ξ_{i} \geq 0$ and $ξ_{i}^{*} \geq 0$ are introduced.

Similar to maximizing the classification margin in the optimal classification hyperplane, the regression estimation problem is transformed into the following objective and constraints.

\min \frac{1}{2} ‖ w ‖^{2} + C \sum_{i = 1}^{n} (ξ_{i} + ξ_{i}^{*})

(6)

The constraint condition is as follows

\left\{ \begin{array}{l} y_{i} - w \cdot x_{i} - b \leq ε + ξ_{i} \\ w \cdot x_{i} + b - y_{i} \leq ε + ξ_{i}^{*} \\ ξ_{i} \geq 0 \\ ξ_{i}^{*} \geq 0 \end{array} \right. \quad i = 1, \dots, n

(7)

The constant C > 0 balances the flatness of the regression function F against the number of sample points whose deviation is greater than ε. Equations (6) and (7) are derived from the ε-insensitive loss function $| ξ |_{ε}$, which is expressed as Equation (8):

| ξ |_{ε} := \left\{ \begin{array}{ll} 0 & \text{if } | ξ | \leq ε \\ | ξ | - ε & \text{otherwise} \end{array} \right.

(8)
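As a small illustration of Equations (6) to (8), the sketch below evaluates the ε-insensitive loss and the resulting primal objective for a scalar linear model F(x) = wx + b. It relies on the fact that, for fixed w and b, the smallest slacks satisfying the constraints in (7) equal the ε-insensitive loss of each residual. The function names and the sample values are illustrative assumptions.

```python
def eps_insensitive(residual, eps):
    """Equation (8): zero inside the eps-tube, |residual| - eps outside it."""
    return max(0.0, abs(residual) - eps)


def svr_primal_objective(w, b, xs, ys, C, eps):
    """Objective (6) for a scalar linear model F(x) = w * x + b.

    For fixed w and b the minimal feasible slack xi_i + xi_i* of each sample
    equals its eps-insensitive loss, so the slack sum is computed from it.
    """
    slack_sum = sum(eps_insensitive(y - (w * x + b), eps) for x, y in zip(xs, ys))
    return 0.5 * w * w + C * slack_sum


# Illustrative data: only the third point lies outside the eps-tube.
xs = [0.0, 1.0, 2.0]
ys = [0.0, 1.0, 2.5]
objective = svr_primal_objective(w=1.0, b=0.0, xs=xs, ys=ys, C=4.0, eps=0.1)
```

Increasing C makes points outside the tube more expensive, which pushes the fit toward the data at the cost of a larger ‖w‖.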

When dealing with a limited number of samples, the solution to SVM is commonly approached using the duality theory, which transforms it into a quadratic programming problem. To accomplish this, the Lagrange equation is established.

\begin{array}{l} l (w, ξ, ξ^{*}) = \frac{1}{2} (w \cdot w) + C \sum_{i = 1}^{n} (ξ_{i} + ξ_{i}^{*}) - \sum_{i = 1}^{n} α_{i} (ε + ξ_{i} - y_{i} + \langle w, x_{i} \rangle + b) \\ \quad - \sum_{i = 1}^{n} α_{i}^{*} (ε + ξ_{i}^{*} + y_{i} - \langle w, x_{i} \rangle - b) - \sum_{i = 1}^{n} (η_{i} ξ_{i} + η_{i}^{*} ξ_{i}^{*}) \end{array}

(9)

The partial derivatives with respect to the parameters w, b, $ξ_{i}$, and $ξ_{i}^{*}$ should all be equal to zero. Substituting this condition into Equation (9) results in the dual optimization problem.

\min \frac{1}{2} \sum_{i, j = 1}^{n} (α_{i} - α_{i}^{*}) (α_{j} - α_{j}^{*}) \langle x_{i}, x_{j} \rangle + \sum_{i = 1}^{n} α_{i} (ε - y_{i}) + \sum_{i = 1}^{n} α_{i}^{*} (ε + y_{i})

(10)

\text{s.t.} \left\{ \begin{array}{l} \sum_{i = 1}^{n} (α_{i} - α_{i}^{*}) = 0 \\ α_{i}, α_{i}^{*} \in [0, C] \end{array} \right.

(11)

For nonlinear regression problems, the samples x are mapped to a high-dimensional feature space using a nonlinear function ϕ [28]. The regression problem is then transformed into minimizing Equation (12) under the constraints of Equation (11).

\min \frac{1}{2} \sum_{i, j = 1}^{n} (α_{i} - α_{i}^{*}) (α_{j} - α_{j}^{*}) \langle ϕ (x_{i}), ϕ (x_{j}) \rangle + \sum_{i = 1}^{n} α_{i} (ε - y_{i}) + \sum_{i = 1}^{n} α_{i}^{*} (ε + y_{i})

(12)

As a result, the following equation is derived: $w = \sum_{i = 1}^{n} (α_{i} - α_{i}^{*}) ϕ (x_{i})$.

Kernel function.

In SVR, the kernel function is applied to simplify nonlinear approximation [29]. If the kernel function $k (x, x^{'})$ satisfies $k (x, x^{'}) = \langle ϕ (x), ϕ (x^{'}) \rangle$, the following equation is obtained.

\min \frac{1}{2} \sum_{i, j = 1}^{n} (α_{i} - α_{i}^{*}) (α_{j} - α_{j}^{*}) k (x_{i}, x_{j}) + \sum_{i = 1}^{n} α_{i} (ε - y_{i}) + \sum_{i = 1}^{n} α_{i}^{*} (ε + y_{i})

(13)

The kernel function $k (x, x^{'})$ must satisfy the following condition.

\iint k (x, x^{'}) g (x) g (x^{'}) \, d x \, d x^{'} \geq 0, \quad g \in L_{2}

(14)

The selection and construction of kernel functions were discussed in Ref. [25]. In this work, the SVR was constructed using the following Gaussian radial basis function (RBF) kernel.

k (x, x^{'}) = \exp \left( - \frac{\| x - x^{'} \|^{2}}{2 δ^{2}} \right)

(15)
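The kernel in Equation (15) and the prediction that follows from the expansion of w over the training samples, F(x) = Σ (α_i − α_i^*) k(x_i, x) + b, can be sketched as follows. The function names are illustrative, and the support vector, dual coefficient, and bias in the usage lines are made-up values rather than a trained model.

```python
import math


def rbf_kernel(x, z, delta):
    """Equation (15): k(x, z) = exp(-||x - z||^2 / (2 * delta^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * delta ** 2))


def svr_predict(x, support_vectors, dual_coefs, b, delta):
    """F(x) = sum_i (alpha_i - alpha_i*) * k(x_i, x) + b.

    dual_coefs[i] holds the difference alpha_i - alpha_i* for support_vectors[i].
    """
    return b + sum(
        coef * rbf_kernel(sv, x, delta)
        for sv, coef in zip(support_vectors, dual_coefs)
    )


# Made-up model with a single support vector at the origin.
prediction = svr_predict([0.0], [[0.0]], [2.0], b=0.5, delta=0.8)
```

A small δ makes the kernel decay quickly with distance, so each training sample only influences predictions in its immediate neighborhood; a large δ gives a smoother regression function.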

4.2. Analysis of VOC Concentration Based on Traditional SVR

Based on the characteristics of SVR in constructing regression models for small-sample data, this work utilizes SVM to build a regression model for VOC concentration data based on PID response.

The data were obtained from the PID response to different concentrations of standard benzene gas in the laboratory environment. First, 84 sets of VOC concentration data were randomly shuffled; 67 sets were used as the training set and 17 sets as the test set. Each set had 6001 values, of which the first 6000 were the PID response values to the VOC concentration and the 6001st was the true VOC concentration.

To eliminate the adverse effects caused by anomalous samples, both the training and testing sets were normalized. It is essential to normalize the data for SVM, as it ensures feature scaling, preventing elliptical feature distributions that could hinder model training and lead to convergence issues or poor prediction accuracy.
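The text does not state which normalization was used. As one common possibility, the sketch below applies min-max scaling to [0, 1], with the minimum and maximum of each feature estimated on the training set only and then reused for the test set. The function names and sample values are assumptions for illustration.

```python
def fit_minmax(rows):
    """Return per-column minima and maxima of a list of equal-length rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_minmax(rows, lows, highs):
    """Scale each column to [0, 1] using the given training-set statistics.

    Constant columns (high == low) are mapped to 0.0 to avoid division by zero.
    """
    scaled = []
    for row in rows:
        scaled.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, lows, highs)
        ])
    return scaled


# Illustrative values: statistics come from the training rows only.
train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 20.0]]
lows, highs = fit_minmax(train)
train_scaled = apply_minmax(train, lows, highs)
test_scaled = apply_minmax(test, lows, highs)
```

Reusing the training statistics for the test set keeps information about the test samples out of the model fitting step.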

The SVR parameters were then set, with the Gaussian radial basis function chosen as the kernel function for SVR. Different values of the kernel function parameter and the penalty factor C were tried to observe their impact on the accuracy of SVR. Since no prior knowledge was available, the penalty factor C was initially set to 0.1 and the kernel function parameter was temporarily set to 1. Under these settings, the R² value of the SVR was 0.21. When C was set to 0.2, the R² value decreased to 0.13, and it fell further to 0.015 with C set to 0.3. The model accuracy was not satisfactory. To improve it, C was subsequently set to 1, resulting in an R² value of 0.30. With C set to 2, 3, and 5, the R² value remained at 0.30, while C = 4 gave a slightly higher value of 0.38. No clear mathematical relationship between C and model accuracy was identified. Thus, a penalty factor of 4 was provisionally used for SVR in this work.

The traditional SVR was ultimately employed to construct a model using a penalty factor of C = 4 and kernel function parameter of β = 0.8 for 84 different concentrations of VOCs. The test set results of this model are shown in Figure 1. The mean squared error (MSE) of this model was 69,458.6, and the coefficient of determination (R²) was 0.38. However, there was a significant discrepancy between the predicted VOC concentrations and the actual VOC concentrations, indicating a large prediction error.
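For reference, the two reported metrics can be computed as in the sketch below. It assumes the standard definition R² = 1 − SS_res/SS_tot; the text does not state which definition was used (some SVM toolboxes report the squared correlation coefficient instead). The sample values are illustrative only.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative concentrations in ppm, not data from the study.
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 300.0]
mse = mean_squared_error(y_true, y_pred)
r2 = r_squared(y_true, y_pred)
```

Because the MSE is expressed in squared concentration units, its magnitude depends on the concentration range of the test set, whereas R² is dimensionless.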

After repeated comparison and analysis, the main reasons for the high error of the model were identified as follows. Firstly, the optimal values of the penalty factor C and the kernel function parameters are uncertain and difficult to determine. Secondly, the original training data used in this model contain redundancies: relevant information is not well identified, while irrelevant information degrades the accuracy of the model. Hence, this work optimizes the model with respect to these two aspects.

5. PID Calibration Model

In this section, the genetic algorithm (GA) is employed to automatically select the values of the penalty factor C and the parameters of the kernel function in SVR. This approach utilizes the expansive search space and global search capability of the genetic algorithm. As a result, an optimized SVR model is built, and the concentration of VOCs is determined from the signal generated by the PID.

5.1. SVR Based on PCA of PID Signal Features

5.1.1. Principal Component Analysis

Principal component analysis (PCA) is an effective and widely applied dimensionality reduction algorithm. It decomposes the data into mutually orthogonal principal components, thereby effectively eliminating redundant and overlapping information in the original data. Generally, a few significant principal components can cover the majority of the information in the PID response signal produced by VOCs. The computational procedure of the PCA algorithm is as follows [20].

(1): Perform zero-mean normalization on the sample set $O = (O_{1}, O_{2}, \dots, O_{n})$ of dimensionality d:

$O_{i} = O_{i} - \frac{1}{n} \sum_{j = 1}^{n} O_{j}$

(16)

(2): Compute the covariance matrix $\Sigma$ of the vector O.
(3): Use singular value decomposition to obtain the eigenvalues and eigenvectors of the covariance matrix $\Sigma$.
(4): Take the eigenvectors corresponding to the top v eigenvalues to form a new matrix, where v should be smaller than n.
(5): Obtain a new low-dimensional sample set.
(6): Calculate the contribution rate of each principal component and the cumulative contribution rate.
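The six steps above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `pca`, the use of the sample covariance with n − 1 in the denominator, and the toy data are assumptions.

```python
import numpy as np


def pca(features, n_components):
    """PCA of an (n_samples, d) feature matrix following steps (1) to (6).

    Returns the projected samples, the contribution rate of every principal
    component, and the cumulative contribution rate.
    """
    X = np.asarray(features, dtype=float)
    Xc = X - X.mean(axis=0)                     # (1) zero-mean each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # (2) covariance matrix
    # (3) For a symmetric positive semidefinite matrix the singular values
    # are its eigenvalues (sorted in descending order) and the columns of U
    # are the corresponding eigenvectors.
    U, eigenvalues, _ = np.linalg.svd(cov)
    W = U[:, :n_components]                     # (4) top-v eigenvectors
    scores = Xc @ W                             # (5) low-dimensional samples
    contribution = eigenvalues / eigenvalues.sum()   # (6) contribution rates
    return scores, contribution, np.cumsum(contribution)


# Toy data: two perfectly correlated features, so one component suffices.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
scores, contribution, cumulative = pca(X, n_components=1)
```

In practice the number of retained components v is often chosen as the smallest value for which the cumulative contribution rate exceeds a chosen threshold.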

The time-domain and frequency-domain features of the PID response signals to VOCs were utilized as the original dataset for PCA to extract the principal components. The contributions of the mean, mean frequency, centroid frequency, root mean square frequency, frequency standard deviation, standard deviation, skewness, kurtosis, and maximum value are 0.4495, 0.2439, 0.1082, 0.0787, 0.0557, 0.0300, 0.0152, 0.0080, and 0.0065, respectively. The cumulative contribution rate of these nine features is 0.9958. Therefore, these nine features are used to replace the original 16 features.
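The named features can be computed from a sampled PID response roughly as in the sketch below. The text does not give the formulas, so the definitions here are assumptions borrowed from common signal-analysis practice: "mean frequency" is taken as the average spectral amplitude, and the centroid, root mean square, and standard deviation of frequency are amplitude-weighted statistics of the one-sided spectrum. Only the nine features named above are computed, and the function name and sampling rate are hypothetical.

```python
import numpy as np


def pid_signal_features(signal, fs):
    """Time- and frequency-domain features of a 1-D signal sampled at fs Hz."""
    x = np.asarray(signal, dtype=float)
    mean, std = x.mean(), x.std()
    centered = x - mean
    # Standardized third and fourth moments; zero for a constant signal.
    skewness = np.mean(centered ** 3) / std ** 3 if std > 0 else 0.0
    kurtosis = np.mean(centered ** 4) / std ** 4 if std > 0 else 0.0

    amplitude = np.abs(np.fft.rfft(x))            # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(x.size, d=1.0 / fs)   # matching frequency axis in Hz
    total = amplitude.sum()
    if total > 0:
        centroid = (freqs * amplitude).sum() / total
        rms_freq = np.sqrt((freqs ** 2 * amplitude).sum() / total)
        freq_std = np.sqrt(((freqs - centroid) ** 2 * amplitude).sum() / total)
    else:
        centroid = rms_freq = freq_std = 0.0

    return {
        "mean": mean, "std": std, "skewness": skewness,
        "kurtosis": kurtosis, "maximum": x.max(),
        "mean_frequency": amplitude.mean(),       # assumed definition
        "centroid_frequency": centroid,
        "rms_frequency": rms_freq,
        "frequency_std": freq_std,
    }


# Synthetic check signal: a 5 Hz sine sampled at 100 Hz for one second.
t = np.arange(100) / 100.0
features = pid_signal_features(np.sin(2 * np.pi * 5 * t), fs=100.0)
```

Stacking such feature vectors for all signal sets gives the matrix on which PCA is then performed.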

5.1.2. SVR after PCA of PID Signal Features

The PCA algorithm was applied to the 12 time-domain features and 4 frequency-domain features of the 84 sets of VOC concentration signals. The feature data obtained from PCA of 67 signal sets were used as the training set, and those of the remaining 17 signal sets as the testing set. Each data set consists of 10 values, where the first 9 values are the PCA-based feature parameters of the PID response signal to the VOC concentration and the 10th value is the true VOC concentration. The training and testing sets were then normalized, and the Gaussian radial basis kernel function was chosen as the kernel function for SVR. The parameter of the radial basis kernel function was set to 0.8 and the penalty factor C to 4. The results on the testing set are shown in Figure 2. The model has a mean squared error of 372 and an R² value of 0.996. Figure 2 shows that the regression performance of the model is improved, but some errors remain. Based on the analysis, these errors are attributed to suboptimal choices of the penalty factor C and the kernel function parameters. To address this issue, this work employs a GA to optimize them.

5.2. Proposed Method Based on PCA-GA-SVR

5.2.1. SVR after PCA of PID Signal Features

The genetic algorithm simulates the problem-solving process as a biological evolution, generating the next generation of solutions through operations such as reproduction, crossover, and mutation. It gradually eliminates solutions with low fitness values and increases solutions with high fitness values. After evolving for N generations, it is highly likely to obtain individuals with high fitness values, which represent the optimal results of the objective function [30,31]. The steps for selecting the optimal kernel function parameters and penalty factor using the genetic algorithm are as follows.

(1): The dataset consisting of 84 groups of PID responses to different VOC concentrations was split into an 80% training set and a 20% testing set.
(2): Normalize the input of the training and testing sets.
(3): Set the parameters of the genetic algorithm, such as the population size, iteration count, crossover probability, mutation probability, etc. Here, the chromosome dimension is set to 2, where the two numbers in the chromosome represent δ and C.
(4): Initialize the population by initializing each chromosome and calculating its objective function value.
(5): Begin iterative loop.
(6): Selection operator.
(7): Crossover and mutation operators (simulated binary crossover and polynomial mutation).
(8): Recalculate the objective function value for the updated chromosomes, where the objective function is the minimum mean squared error.
(9): Update the optimal objective of the global best chromosome.
(10): Proceed to the next iteration until the maximum iteration count is reached.
(11): Export the δ and C values of the global best chromosome, and plot the iteration curve.
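The steps above can be sketched as a small real-valued GA. This is a minimal sketch rather than the authors' implementation: `objective` is a toy stand-in for the SVR mean squared error (in practice it would train an SVR with the candidate δ and C and return the error), crossover pairs are taken consecutively rather than drawn at random, and the crossover probability, mutation probability, and distribution index `eta` are assumed values that the source does not report. The search bounds, population size, and iteration count follow the settings given in Section 5.2.2.

```python
import random

BOUNDS = [(0.001, 10.0), (0.001, 100.0)]  # search ranges for (delta, C)

def objective(chrom):
    """Toy stand-in for the SVR mean squared error; minimum at delta=1, C=10."""
    delta, c = chrom
    return (delta - 1.0) ** 2 + ((c - 10.0) / 10.0) ** 2

def clip(value, lo, hi):
    return max(lo, min(hi, value))

def roulette(pop, errors, rng):
    """Pick one chromosome with probability proportional to 1 / (error + eps)."""
    fitness = [1.0 / (e + 1e-12) for e in errors]
    threshold = rng.random() * sum(fitness)
    running = 0.0
    for chrom, fit in zip(pop, fitness):
        running += fit
        if running >= threshold:
            return list(chrom)
    return list(pop[-1])

def sbx(p1, p2, rng, eta=20.0):
    """Simulated binary crossover, gene by gene, clipped to BOUNDS."""
    c1, c2 = [], []
    for a, b, (lo, hi) in zip(p1, p2, BOUNDS):
        u = rng.random()
        if u <= 0.5:
            beta = (2.0 * u) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
        c1.append(clip(0.5 * ((1 + beta) * a + (1 - beta) * b), lo, hi))
        c2.append(clip(0.5 * ((1 - beta) * a + (1 + beta) * b), lo, hi))
    return c1, c2

def poly_mutate(chrom, rng, eta=20.0):
    """Polynomial mutation, gene by gene, clipped to BOUNDS."""
    out = []
    for x, (lo, hi) in zip(chrom, BOUNDS):
        u = rng.random()
        if u < 0.5:
            step = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
        else:
            step = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
        out.append(clip(x + step * (hi - lo), lo, hi))
    return out

# p_cross, p_mut, and eta are assumed values; the source does not report them.
def run_ga(pop_size=20, generations=50, p_cross=0.9, p_mut=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    errors = [objective(ch) for ch in pop]
    best_err, best = min(zip(errors, pop))
    best = list(best)
    history = [best_err]
    for _ in range(generations):
        # Selection: draw a new population of the same size.
        pop = [roulette(pop, errors, rng) for _ in range(pop_size)]
        # Crossover on consecutive pairs, then mutation on single chromosomes.
        for i in range(0, pop_size - 1, 2):
            if rng.random() < p_cross:
                pop[i], pop[i + 1] = sbx(pop[i], pop[i + 1], rng)
        pop = [poly_mutate(ch, rng) if rng.random() < p_mut else ch for ch in pop]
        errors = [objective(ch) for ch in pop]
        # Track the global best and reinsert it in place of the worst individual.
        gen_err, gen_best = min(zip(errors, pop))
        if gen_err < best_err:
            best_err, best = gen_err, list(gen_best)
        worst = errors.index(max(errors))
        pop[worst], errors[worst] = list(best), best_err
        history.append(best_err)
    return best, best_err, history

best, best_err, history = run_ga()
```

Here `history` holds the best objective value after each generation, which is the quantity an iteration curve would plot. Because the global best is reinserted every generation, the curve can never rise.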

5.2.2. SVR after PCA of PID Signal Features

When applying genetic algorithms to problem-solving, there are two chromosome encoding methods: binary encoding and floating-point encoding. Floating-point encoding suits problems with a large value range, while binary encoding suits problems with a smaller value range. Since this paper applies a genetic algorithm to optimize SVM regression on small-sample data, the binary encoding method is adopted. Based on a survey of parameter settings for support vector machine regression in domestic and international research, the minimum values of the penalty factor and the kernel function parameter were both set to 0.001, the maximum value of the penalty factor was set to 100, and the maximum value of the kernel function parameter was set to 10.

Genetic algorithm operations are performed on a population: an initial population is provided to the algorithm, and subsequent iterations operate on this population. The population size was set to 20. Experiments showed that the iteration curve converges within 50 iterations, so the iteration limit was set to 50. The population information was defined as a structure, and the population loop was initiated. Each chromosome is formed by generating, for each of its genes, a random number between that gene's minimum and maximum values. The first number of the chromosome represents the penalty factor, and the second number represents the kernel function parameter.

GA distinguishes individuals by the fitness function value of each chromosome: the larger the fitness value, the better the individual it represents. After initialization, the optimal objective and its corresponding chromosome are identified. Each iteration then applies the selection operator, which computes the fitness of the current objectives, calculates the fitness proportion of each chromosome, and generates nonzero random numbers for the population.

The cumulative fitness proportion is computed by iterating through the population; when it exceeds the random number, the most recently accumulated individual is chosen as the selected individual. Offspring are selected until they match the size of the original population. The optimization variable dimension and the population size are then obtained, and the crossover probability is set.

For crossover, a random number is compared with the crossover probability to decide whether to perform crossover. If crossover is performed, two different chromosomes are selected at random and extracted, and each dimension of the chromosome is processed in turn. The crossover operator is simulated binary crossover; the crossed individuals are then limited to the boundaries, with values exceeding the maximum replaced by the maximum value and values below the minimum replaced by the minimum value. The resulting individuals are copied back into the population.

For mutation, the population is looped through again. A random number is generated and compared with the mutation probability. If a mutation occurs, an individual is selected at random, and polynomial mutation is applied to each gene of its chromosome. The mutated individuals are then copied back into the population.

The objectives of the offspring produced by crossover and mutation are recalculated to obtain the best and worst objectives, and the current best objective is compared with the global best objective. Replacing the worst individual with the historical global best increases the probability that the population iterates towards better individuals. The average objective and the best objective of the current generation are recorded. After the final iteration, C and δ are exported.
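For reference, the selection, crossover, and mutation operators named above are commonly written as follows. This is a sketch in the usual textbook form, not a set of formulas taken from the paper: the fitness values f_i, the distribution indices η_c and η_m, and the uniform random number u are assumptions, and Δ denotes the mutation step to avoid confusion with the kernel parameter δ.

```latex
% Roulette-wheel selection: probability of picking chromosome i out of N
p_i = \frac{f_i}{\sum_{j=1}^{N} f_j}

% Simulated binary crossover for parent genes x^{(1)}, x^{(2)}, with u \sim U(0,1)
\beta =
\begin{cases}
(2u)^{1/(\eta_c+1)}, & u \le 0.5 \\
\left(\dfrac{1}{2(1-u)}\right)^{1/(\eta_c+1)}, & u > 0.5
\end{cases}
\qquad
\begin{aligned}
c^{(1)} &= \tfrac{1}{2}\left[(1+\beta)\,x^{(1)} + (1-\beta)\,x^{(2)}\right] \\
c^{(2)} &= \tfrac{1}{2}\left[(1-\beta)\,x^{(1)} + (1+\beta)\,x^{(2)}\right]
\end{aligned}

% Polynomial mutation of gene x with bounds [x_{\min}, x_{\max}], then clipping
\Delta =
\begin{cases}
(2u)^{1/(\eta_m+1)} - 1, & u < 0.5 \\
1 - \left(2(1-u)\right)^{1/(\eta_m+1)}, & u \ge 0.5
\end{cases}
\qquad
x' = \min\!\left(x_{\max},\, \max\!\left(x_{\min},\, x + \Delta\,(x_{\max}-x_{\min})\right)\right)
```

The two children of simulated binary crossover always average to the same value as their parents, so the operator spreads offspring around the parents without shifting their mean before clipping.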

Figure 3 shows the iteration curve of the GA optimization of the SVR parameters. The curve has converged by the 16th iteration, which means that the minimum MSE is obtained from the 16th iteration onward.

5.2.3. Results

In the GA-optimized SVM regression method based on principal component analysis of the PID signal time-domain and frequency-domain features, each of the 84 sets of PID response signals yields 9 principal component features that reflect the VOC concentration signal. These nine features are taken as the characteristics of the signal. Of these, 67 sets of principal component features are used as the training set, and 17 sets are used as the test set. Each set of data consists of 10 values: the first 9 are the principal component analysis parameters of the PID response signal to VOC concentration, and the 10th is the true VOC concentration.

The training set and test set are normalized. The genetic algorithm finds an optimal radial basis kernel function parameter of 0.0101 and an optimal penalty factor C of 7.8783. Based on these parameters, the support vector machine regression function is derived in the form F(x) = w × x + b. The results on the test set are shown in Figure 4. The mean squared error of this model is 0.000059, and the R² value is 0.9999. Compared with the R² of 99.8% obtained by Wang Jin and the R² values of 98.2%, 98.5%, 96.9%, etc. obtained by Li Hai, the proposed method achieves a higher R².
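The two reported metrics can be computed as in the sketch below. The function names and the toy numbers are illustrative, and R² is assumed here to be the coefficient of determination, 1 - SS_res / SS_tot; the source does not state which definition it uses.

```python
def mse(y_true, y_pred):
    """Mean squared error between true and predicted concentrations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values, not data from the study.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [11.0, 19.0, 31.0, 39.0]

print(mse(y_true, y_pred), r2(y_true, y_pred))
```

MSE depends on the scale of the target, so values computed on normalized and on raw concentrations are not directly comparable, whereas R² is unchanged by a linear rescaling applied to both the targets and the predictions.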

5.2.4. Effect of Sample Size on Regression Accuracy

To verify the effectiveness of the high-accuracy model for the 84 samples in this study, and to determine the minimum sample size required for constructing the PCA-GA-SVM model, experiments were conducted starting from the 67 sets of training data and 17 sets of testing data. The sample size was reduced step by step, removing four sets of training data and one set of testing data at each step, to find the minimum sample size for model establishment. Here, A represents the original data; B represents the data with a reduction of 4 sets of training data and 1 set of testing data; C represents the data with a reduction of 8 sets of training data and 2 sets of testing data; D represents the data with a reduction of 12 sets of training data and 3 sets of testing data; E represents the data with a reduction of 16 sets of training data and 4 sets of testing data; and F represents the data with a reduction of 20 sets of training data and 5 sets of testing data. The accuracy results of the PCA-GA-SVR are shown in Figure 5.
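The sample counts implied by conditions A to F can be tabulated with a short sketch. The function and variable names are illustrative; the starting sizes (67 training, 17 testing) and the step sizes (4 and 1) come from the text.

```python
TRAIN, TEST = 67, 17  # sizes of the original training and testing sets

def reduced_sizes(steps=6, train_step=4, test_step=1):
    """Return (label, train, test, total) for each reduction condition."""
    rows = []
    for k in range(steps):
        label = chr(ord("A") + k)
        train = TRAIN - train_step * k
        test = TEST - test_step * k
        rows.append((label, train, test, train + test))
    return rows

schedule = reduced_sizes()
for label, train, test, total in schedule:
    print(f"{label}: train={train} test={test} total={total}")
```

Each step removes five samples in total, so condition D corresponds to 55 training and 14 testing sets, 69 samples overall.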

When the number of training sets is reduced by 12 and the number of testing sets by 3, the R² of the model is still maintained above 0.99. However, if the numbers of training and testing sets are reduced further, the accuracy of the PCA-GA-SVM model decreases drastically. Therefore, when calculating VOC concentration with the PCA-GA-SVR from signals generated by PID, it is necessary to ensure that the sample size is greater than 69. In this study, the PID response to VOC signals consists of 84 groups, thus meeting the requirements for model establishment.

6. Conclusions

This paper addresses the issue of long computation time and low accuracy in SVR caused by redundant data information. It conducts PCA on time-frequency features to reduce the data dimensionality. Furthermore, it solves the difficulty in determining the optimal values for the penalty factor C and kernel function parameters in traditional SVR by utilizing genetic algorithms. This approach effectively improves the generalizability and robustness of the SVR. The mean squared error of signal time-frequency feature extraction PCA-GA-SVR is 0.000059, with an R² of 0.9999.

Moreover, this paper analyses the impact of the number of experimental samples on the regression accuracy of the signal time-frequency feature extraction PCA-GA-SVR. The model maintains a high level of accuracy when the number of samples exceeds 69 groups. It also confirms that the 84 sets of data in the study meet the sample requirements for the regression. This demonstrates the effectiveness of the PCA-GA-SVR method for VOC monitoring in a humid environment, and validates the effectiveness and robustness of the proposed method in the paper. In the future, this model and method will be used to analyze and calibrate more VOCs, forming a standard method for calibrating VOCs by the PID method.

Author Contributions

Conceptualization, X.F.; methodology, X.F.; software, Z.L., C.D. and Y.R.; validation, Z.L. and X.F.; formal analysis, X.F.; investigation, C.D., Z.L. and X.F.; resources, C.D. and X.F.; data curation, Z.L., C.D. and X.F.; writing—original draft preparation, Y.R.; writing—review and editing, Y.R. and X.F.; visualization, Z.L. and C.D.; supervision, X.F.; project administration, X.F. and C.D.; funding acquisition, X.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported financially by the Shandong Provincial Science and Technology Department through the Major Innovation Program of Shandong Province (Typical heavy industry soil pollution monitoring, early warning and remediation technology integration and equipment research and development, 2021CXGC011206).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding authors.

Acknowledgments

This paper used an AI-assisted translation tool to help translate the manuscript from Chinese to English.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Meyer, C. Overview of TVOC and Indoor Air Quality; Renesas Electronics Corporation: Tokyo, Japan, 2018.
2. Lagoudi, A.; Lois, E.; Fragioudakis, K.; Karavanas, A.; Loizidou, M. Design of an inventory system for the volatile organic compounds, emitted by various activities. Environ. Sci. Technol. 2001, 35, 1982–1988.
3. Li, X.Q.; Zhang, L.; Yang, Z.Q.; He, Z.Q.; Wang, P.; Yan, Y.F.; Ran, J.Y. Hydrophobic modified activated carbon using PDMS for the adsorption of VOCs in humid condition. Sep. Purif. Technol. 2020, 239, 116517.
4. Kuranchie, F.A.; Angnunavuri, P.N.; Attiogbe, F.; Nerquaye-Tetteh, E.N. Occupational exposure of benzene, toluene, ethylbenzene and xylene (BTEX) to pump attendants in Ghana: Implications for policy guidance. Cogent Environ. Sci. 2019, 5, 1603418.
5. Wang, Y.; Zhou, B.; Yang, M.R.; Xiao, G.; Xiao, H.; Dai, X.R. Bibliometrics and Knowledge Map Analysis of Research Progress on Biological Treatments for Volatile Organic Compounds. Sustainability 2023, 15, 9274.
6. Liu, H.; Meng, G.; Deng, Z.; Li, M.; Chang, J.; Dai, T.; Fang, X. Progress in Research on VOC Molecule Recognition by Semiconductor Sensors. Acta Phys.-Chim. Sin. 2022, 38, 2008018.
7. Wang, Q.; Zhou, G.; Zhong, Q.; Zhao, J.B.; Yang, K. Status and needs research for on-line monitoring of VOCs emissions from stationary sources. Huan Jing Ke Xue 2013, 34, 4764–4770.
8. Meng, F.; Zheng, H.; Chang, Y.; Zhao, Y.; Li, M.; Wang, C.; Sun, Y.; Liu, J. One-step synthesis of Au/SnO₂/RGO nanocomposites and their VOC sensing properties. IEEE Trans. Nanotechnol. 2018, 172, 212–219.
9. Meng, F.; Li, X.; Yuan, Z.; Lei, Y.; Qi, T.; Li, J. Ppb-Level Xylene Gas Sensors based on Co₃O₄ Nanoparticles coated Reduced Graphene Oxide(rGO) Nanosheets Operating at Low Temperature. IEEE Trans. Instrum. Meas. 2021, 70, 9511510.
10. Meng, F.; Shi, X.; Yuan, Z.; Ji, H.; Qin, W.; Shen, Y.; Xing, C. Detection of Four Alcohol Homologue Gases by ZnO Gas Sensor in Dynamic Interval Temperature Modulation Mode. Sens. Actuators B Chem. 2022, 350, 130867.
11. Termonia, A.; Termonia, M. Characterisation and on-site monitoring of odorous organic compounds in the environment of a landfill site. Int. J. Environ. Anal. Chem. 1999, 73, 43–57.
12. Je, C.-h.; Stone, R.; Oberg, S.G. Development and application of a multi-channel monitoring system for near real-time VOC measurement in a hazardous waste management facility. Sci. Total Environ. 2007, 382, 364–374.
13. Manes, G.; Collodi, G.; Gelpi, L.; Fusco, R.; Ricci, G.; Manes, A.; Passafiume, M. Realtime Gas Emission Monitoring at Hazardous Sites Using a Distributed Point-Source Sensing Infrastructure. Sensors 2016, 16, 121.
14. Oka, K.; Iizuka, A.; Inoue, Y.; Mizukoshi, A.; Noguchi, M.; Yamasaki, A.; Yanagisawa, Y. Development of a Combined Real Time Monitoring and Integration Analysis System for Volatile Organic Compounds (VOCs). Int. J. Environ. Res. Public Health 2010, 7, 4100–4110.
15. Peng, C.; Qian, K.; Wang, C. Design and Application of a VOC-Monitoring System Based on a ZigBee Wireless Sensor Network. IEEE Sens. J. 2015, 15, 2255–2268.
16. Wang, J.; Hao, X.W.; Dong, J.G.; Xiong, J.J.; Hong, Y.P. Design of high precision photoionization detector. Infrared Laser Eng. 2020, 49, 248–255.
17. Li, H. VOCs Detection Based on Photoionization Technology. Ph.D. Thesis, Chongqing University of Posts and Telecommunications, Chongqing, China, 2020.
18. Wang, L.; Cheng, Y.; Gopalan, S.; Luo, F.; Amreen, K.; Singh, R.K.; Goel, S.; Lin, Z.; Naidu, R. Review and Perspective: Gas Separation and Discrimination Technologies for Current Gas Sensors in Environmental Applications. ACS Sens. 2023, 8, 1373–1390.
19. Liu, Z.; Feng, X.; Dong, C.; Jiao, M. Study on Denoising Method of Photoionization Detector Based on Wavelet Packet Transform. Chemosensors 2023, 11, 146.
20. Roy, A.; Chakraborty, S. Support vector machine in structural reliability analysis: A review. Reliab. Eng. Syst. Saf. 2023, 233, 109126.
21. Healy, R.M.; Wang, J.M.; Karellas, N.S.; Todd, A.; Sofowote, U.; Su, Y.; Munoz, A. Assessment of a passive sampling method and two on-line gas chromatographs for the measurement of benzene, toluene, ethylbenzene and xylenes in ambient air at a highway site. Atmos. Pollut. Res. 2019, 10, 1123–1127.
22. Shi, J.; Teh, J. Load forecasting for regional integrated energy system based on complementary ensemble empirical mode decomposition and multi-model fusion. Appl. Energy 2024, 353, 122146.
23. Zhu, H.; Nidetz, R.; Zhou, M.; Lee, J.; Buggaveeti, S.; Kurabayashi, K.; Fan, X. Flow-through microfluidic photoionization detectors for rapid and highly sensitive vapor detection. Lab Chip 2015, 15, 3021–3029.
24. Bilek, J.; Marsolek, P.; Bilek, O.; Bucek, P. Field Test of Mini Photoionization Detector-Based Sensors-Monitoring of Volatile Organic Pollutants in Ambient Air. Environments 2022, 9, 49.
25. Liu, R.Y.; Hu, H. Design of Photoionization Sensor for VOC Gas Detection. Instrum. Tech. Sens. 2020, 7, 1–5.
26. Zhou, Q.; Zhang, S.; Zhang, X.; Ma, X.; Zhou, W. Development of a Novel Micro Photoionization Detector for Rapid Volatile Organic Compounds Measurement. Appl. Bionics Biomech. 2018, 2018, 5651315.
27. Das, S.; Khanwelkar, D.R.; Maiti, J. A semi-automated coding scheme for occupational injury data: An approach using Bayesian decision support system. Expert Syst. Appl. 2024, 237, 121610.
28. Rymarczyk, T.; Klosowski, G.; Hola, A.; Sikora, J.; Tchorzewski, P.; Skowron, L. Optimising the use of Machine learning algorithms in electrical tomography of building Walls: Pixel oriented ensemble approach. Measurement 2022, 188, 110581.
29. Ding, S.; Zhao, X.; Zhang, J.; Zhang, X.; Xue, Y. A review on multi-class TWSVM. Artif. Intell. Rev. 2019, 52, 775–801.
30. Wang, Y.; Xue, W. Sustainable development early warning and financing risk management of resource-based industrial clusters using optimization algorithms. J. Enterp. Inf. Manag. 2022, 35, 1374–1391.
31. Qiao, Y.; Luo, J.; Cui, T.; Liu, H.; Tang, H.; Zeng, Y.; Liu, C.; Li, Y.; Jian, J.; Wu, J.; et al. Soft Electronics for Health Monitoring Assisted by Machine Learning. Nano-Micro Lett. 2023, 15, 66.

Figure 1. Quantitative analysis effect of VOCs concentration based on traditional SVR.

Figure 2. VOC concentration regression by SVR after PCA of PID signal features.

Figure 3. Trend of prediction changes with the evolution generations (this means that the optimal values of C and δ are obtained).

Figure 4. Analysis results for VOC concentration based on PCA-GA-SVR.

Figure 5. Effect of sample size on regression accuracy: (A) the original data; (B) the data with a reduction of 4 sets of training data and 1 set of testing data; (C) the data with a reduction of 8 sets of training data and 2 sets of testing data; (D) the data with a reduction of 12 sets of training data and 3 sets of testing data; (E) the data with a reduction of 16 sets of training data and 4 sets of testing data; and (F) the data with a reduction of 20 sets of training data and 5 sets of testing data.

Table 1. Response coefficients of common VOCs.

Chemicals	Response Coefficient	Chemicals	Response Coefficient	Chemicals	Response Coefficient
benzene	1.00	acrolein	7.36	acetone	2.26
isobutanol	8.87	n-butyl	6.42	isobutene	1.887
cyclohexane	2.83	butyl acetate	4.53	butadiene	1.30
styrene	0.75	2-dimethylbenzene	1.02	propylene oxide	12.30
phenol	1.887	naphthalene	0.70	chlorobenzene	0.75


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
