*Article* **Development and Optimization of a Novel Soft Sensor Modeling Method for Fermentation Process of** *Pichia pastoris*

**Bo Wang, Jun Liu \*, Ameng Yu and Haibo Wang**

Key Laboratory of Agricultural Measurement and Control Technology and Equipment for Mechanical Industrial Facilities, School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, China; wangbo@ujs.edu.cn (B.W.); 2222107008@stmail.ujs.edu.cn (A.Y.); 2212107036@stmail.ujs.edu.cn (H.W.) **\*** Correspondence: liujun4503@126.com

**Abstract:** This paper introduces a novel soft sensor modeling method based on BDA-IPSO-LSSVM designed to address the issue of model failure caused by varying fermentation data distributions resulting from different operating conditions during the fermentation of different batches of *Pichia pastoris*. First, the problem of significant differences in data distribution among different batches of the fermentation process is addressed by adopting the balanced distribution adaptation (BDA) method from transfer learning. This method reduces the data distribution differences among batches of the fermentation process, while the fuzzy set concept is employed to improve the BDA method by transforming the classification problem into a regression prediction problem for the fermentation process. Second, the soft sensor model for the fermentation process is developed using the least squares support vector machine (LSSVM). The model parameters are optimized by an improved particle swarm optimization (IPSO) algorithm based on individual differences. Finally, the data obtained from the *Pichia pastoris* fermentation experiment are used for simulation, and the developed soft sensor model is applied to predict the cell concentration and product concentration during the fermentation process of *Pichia pastoris*. Simulation results demonstrate that the IPSO algorithm has good convergence performance and optimization performance compared with other algorithms. The improved BDA algorithm can make the soft sensor model adapt to different operating conditions, and the proposed soft sensor method outperforms existing methods, exhibiting higher prediction accuracy and the ability to accurately predict the fermentation process of *Pichia pastoris* under different operating conditions.

**Citation:** Wang, B.; Liu, J.; Yu, A.; Wang, H. Development and Optimization of a Novel Soft Sensor Modeling Method for Fermentation Process of *Pichia pastoris*. *Sensors* **2023**, *23*, 6014. https://doi.org/10.3390/ s23136014

Academic Editor: Simon Tomažič

Received: 17 May 2023 Revised: 25 June 2023 Accepted: 26 June 2023 Published: 29 June 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

**Keywords:** soft sensor; improved particle swarm algorithm; least squares support vector machine; transfer learning; *Pichia pastoris*

#### **1. Introduction**

The *Pichia pastoris* expression system is a eukaryotic expression system that has been developed over the past decade and is one of the most successful foreign protein expression systems [1]. Compared with other expression systems, *Pichia pastoris* has significant advantages in the processing, secretion, post-translational modification, and glycosylation of expressed products, and has been widely applied [2]. Over 1,000 proteins have been efficiently expressed using the *Pichia pastoris* expression system. High-density fermentation is an important strategy for improving foreign protein expression levels in *Pichia pastoris* [3]. To effectively increase the expression level of secreted foreign proteins in *Pichia pastoris*, the fermentation process needs to be dynamically regulated and optimized in real time by changing the fermentation environment and cultivation conditions to find the optimal environmental parameters for improving the secretion of foreign proteins [4]. However, *Pichia pastoris* fermentation is a complex, nonlinear, and uncertain process with multiple variables and time-varying properties [5,6]. Owing to limitations of process technology and cost, key biochemical parameters that directly reflect fermentation quality, such as cell concentration and product concentration, cannot be measured online and can only be estimated roughly through offline sampling and analysis [7]. This not only delays information acquisition but also affects operators' judgment and decision-making regarding the real-time reaction status, while limiting the implementation of optimization and control strategies. Therefore, there is an urgent need for a method that achieves optimal estimation and prediction of key biochemical parameters during the *Pichia pastoris* fermentation process.

The soft sensor method is an effective solution to the problem of difficult online measurement of key biochemical parameters in biological fermentation processes. Many scholars worldwide have conducted in-depth research on soft sensor technology and achieved a series of results. Shao et al. [8] proposed a semisupervised Gaussian regression for the ammonia synthesis process, which achieved accurate real-time prediction of ammonia production concentration with fewer labeled samples. However, the accuracy parameter in Bayesian regularization needs to be manually predefined, which greatly reduces the accuracy of the model. Yuan et al. [9] used a supervised long short-term memory network to achieve soft sensor modeling of the penicillin fermentation process, which fully utilized the quality variables in the long short-term memory network and realized nonlinear dynamic modeling of the penicillin fermentation process with a good prediction effect. The limitation of this method is that the amount of computation and training time of the model are greatly increased due to adding the quality variable to each LSTM cell. Zheng et al. [10] proposed a real-time semisupervised extreme learning machine, which integrated semisupervised learning and just-in-time learning strategies into the modeling framework to establish a local prediction model, and fully utilized a large amount of unlabeled data information to achieve fast and accurate measurement of Mooney viscosity in the rubber mixing process. Chang et al. [11] proposed a consistent contrastive network to realize the time awareness and robustness of the soft sensor model, which overcame the limitations of manifold regularization and fully utilized abnormal data and unlabeled data information. The effectiveness of the consistent contrastive network was verified in the soft sensor modeling of ammonia and sulfur removal industrial processes. Fan et al. 
[12] proposed a soft sensor regression model based on a long short-term memory (LSTM) recurrent neural network that uses the data obtained by the sensor, a relative error loss, and a normalized L1 loss over the sensor time steps to predict measured values, enabling inspection of the wafer manufacturing process and reducing the wafer recall rate. The experimental results show that the proposed soft sensor model can perform various types of inspection and prediction in complex manufacturing processes. However, the generalization ability of this model is poor, and it can only be used under a single operating condition. Zhang et al. [13] analyzed in depth the factors affecting the formation of glutamate and proposed a fuzzy-inference soft sensor model based on a support vector machine, using the particle swarm optimization algorithm to optimize its key parameters and thereby control glutamate concentration. The fuzzy inference mechanism and fuzzy basis functions are used to optimize the kernel function of the support vector machine, which improves the anti-interference ability, adaptability, and prediction accuracy of the model. However, the generalization ability of the model is insufficient, and its relatively complicated computation increases the time cost. Han et al. [14], inspired by adversarial networks, used an adversarial domain adaptation method to improve the performance of a deep transfer learning model and achieve accurate diagnosis of mechanical failures from small data volumes in industrial processes. However, this method requires sufficient target domain data, which are extremely limited in actual industrial processes, and this limitation affects the predictive performance of the model.
Although the soft sensor models established in the above literature achieve accurate online prediction of key biological parameters, they do not consider the problem of model failure and performance degradation caused by the mismatch between modeling data and real-time data under different operating conditions in biological fermentation processes.

Regarding the abovementioned issue, Ref. [15] utilized deep transfer learning strategies to reduce the differences in data distribution between the source and target domains, effectively addressing missing data and the deterioration of soft sensor model performance in complex industrial processes. However, this method is only suitable for transferring from one operating condition to another, which limits the fitting ability of the model. Ref. [16] proposed an online transfer learning technique based on slow feature analysis and variational Bayesian inference to improve the predictive performance of the target process, solving the problem of online measurement of water content in crude oil emulsions during steam-assisted gravity drainage and greatly improving production efficiency. Two weighting functions related to the transition and emission equations are introduced and dynamically updated to quantify the transferability from the source domain to the target domain at each moment. For the online detection of key variables in industrial processes, Ref. [17] proposed a soft sensor model based on variational mode decomposition, a stacked enhanced autoencoder, and transfer learning to achieve high-precision regression prediction, introducing a transfer learning algorithm based on the maximum mean discrepancy to solve the domain adaptation problem under different operating conditions. However, the hyperparameters of the model must be selected manually, which greatly increases its prediction error. Ref. [18] implements transfer learning by fine-tuning the weights of the network, freezing inner layers and updating outer layers, realizing the prediction of complex industrial processes with small amounts of data. However, the prediction accuracy of this method is poor, and multiple parameters need to be set manually.
Therefore, the application of transfer learning can effectively solve the dilemma of model failure and performance deterioration under different operating conditions. In summary, this paper proposes a multioperating condition transfer learning-based soft sensor modeling method for *Pichia pastoris* fermentation. First, to address the issue of significant data distribution differences between batches in the fermentation process, the BDA method in transfer learning is adopted to reduce the differences [19], and the fuzzy set concept is introduced to improve the BDA method, effectively converting the classification problem into a regression prediction problem of the fermentation process. Then, considering the nonlinearity and small sample characteristics of the *Pichia pastoris* fermentation process, the LSSVM is used as the basic model of the soft sensor process, and the adapted data are used to train the LSSVM model. The model is then optimized using an IPSO based on psychological mechanisms. Finally, the adapted target domain data are used to predict the *Pichia pastoris* cell concentration and product concentration through the constructed soft sensor model. The experimental results demonstrate that this soft sensor method is significantly superior to existing methods, has high prediction accuracy, and can accurately predict the *Pichia pastoris* fermentation process under different operating conditions. This paper makes significant contributions to current research on soft sensors. First, an IPSO algorithm is proposed to optimize the parameters of LSSVM, which greatly improves prediction accuracy. Compared with GWO and ABC optimization algorithms, the proposed IPSO algorithm has good dynamic performance and optimization performance. 
Second, this paper proposes to use the improved BDA algorithm in transfer learning to adapt the source domain and target domain to reduce the differences between different domains, so that the soft sensor model achieves good performance under multiworking conditions.

#### **2. Methods**

#### *2.1. Principle and Solution of Balanced Distribution Adaptation*

During the fermentation process of *Pichia pastoris*, the real-time data and modeling data distributions between different fermentation batches do not match due to varying operating conditions [20]. Soft sensor models established based on historical operational data may not be applicable to new batches, leading to model degradation and misalignment issues. Considering that transfer learning can learn useful information from other fermentation batches to assist in completing tasks for the target fermentation batch and does not require training and prediction data to conform to the requirement of independent and identically distributed data [21], it is an effective way to solve the problem of soft sensor modeling for the fermentation process with multiple operating conditions across different batches. Transfer learning has been widely used in medical images, industrial processes, and so on [22,23]. Transfer learning aims to improve the performance of target learners on target domains by transferring knowledge contained in different but related source domains [24]. Therefore, this article introduces a transfer learning strategy to construct a soft sensor model for the fermentation process of *Pichia pastoris*.

Data distribution adaptation is one of the most commonly used feature-based transfer learning methods [25]. The main idea behind this method is to use some transformations to bring the distance between data distributions of different domains closer together [26]. Currently, the most commonly used algorithm for this purpose is BDA, which reduces the joint probability distribution distance between the source and target domains to achieve transfer learning and improve the applicability and predictive accuracy of traditional soft sensor models. However, since the traditional BDA method is only suitable for solving classification problems [27], and soft sensor modeling for the fermentation process of *Pichia pastoris* is a regression problem, this article introduces the concept of fuzzy sets [28] into the BDA method to transform the classification problem into a regression problem and to achieve soft sensor modeling for the fermentation process of *Pichia pastoris*.

For a given labeled source domain dataset $Q_s = \{(\mathbf{x}_{s_i}, y_{s_i})\}_{i=1}^{n}$ and unlabeled target domain dataset $Q_t = \{(\mathbf{x}_{t_j}, y_{t_j})\}_{j=1}^{m}$ of the *Pichia pastoris* fermentation process, assume the feature spaces are equal, $\chi_s = \chi_t$, and the label spaces are equal, $Y_s = Y_t$, while the marginal distributions differ, $P_s(\mathbf{x}_s) \neq P_t(\mathbf{x}_t)$, and the conditional distributions differ, $P_s(y_s|\mathbf{x}_s) \neq P_t(y_t|\mathbf{x}_t)$. The goal of BDA is to complete transfer learning by minimizing the distance between the marginal and conditional distributions of the source and target domains, that is, by minimizing the following equation.

$$DIS(Q\_{s}, Q\_{t}) \approx (1 - \mu) DIS(P(\mathbf{x}\_{s}), P(\mathbf{x}\_{t})) + \mu DIS(P(y\_{s}|\mathbf{x}\_{s}), P(y\_{t}|\mathbf{x}\_{t})) \tag{1}$$

where *μ* is the balance factor to adjust the distance between the two distributions, *μ* ∈ [0, 1]. When *μ* → 1, it indicates that data sets are similar, and the conditional distribution has a greater proportion. When *μ* → 0, it indicates that the data sets are dissimilar, and the marginal distribution has a greater proportion.

Since there are no labels in the target domain *Qt*, it is impossible to calculate the conditional probability distribution. Therefore, further training of a preclassifier is necessary to obtain soft labels *ytj* .

Let $Q_s^c = \{\mathbf{x}_i \mid \mathbf{x}_i \in Q_s \wedge y_i = c\}$ be the sample set of class $c$ in the source domain, and $Q_t^c = \{\mathbf{x}_i \mid \mathbf{x}_i \in Q_t \wedge y_i = c\}$ be the sample set of class $c$ in the target domain.

Using the Maximum Mean Discrepancy (MMD) [29] to measure the distance between the two domains, Equation (1) can be expressed as:

$$DIS(Q_s, Q_t) \approx (1 - \mu) \left\| \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{s_i} - \frac{1}{m} \sum_{j=1}^{m} \mathbf{x}_{t_j} \right\|_{\mathbf{H}}^{2} + \mu \sum_{c=1}^{C} \left\| \frac{1}{n_c} \sum_{\mathbf{x}_{s_i} \in Q_s^c} \mathbf{x}_{s_i} - \frac{1}{m_c} \sum_{\mathbf{x}_{t_j} \in Q_t^c} \mathbf{x}_{t_j} \right\|_{\mathbf{H}}^{2} \tag{2}$$

where $\mathbf{H}$ is the Reproducing Kernel Hilbert Space [30], $c$ denotes a class label, $c \in \{1, 2, \cdots, C\}$, and $n$ and $m$ represent the number of samples in the source and target domains, respectively. $Q_s^c$ and $Q_t^c$ refer to the class-$c$ samples in the source and target domains, and $n_c$ and $m_c$ represent the number of samples in $Q_s^c$ and $Q_t^c$, respectively. The first term in Equation (2) computes the distance between the marginal probability distributions of the source and target domains, and the second term calculates the distance between the conditional probability distributions.
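As an illustration only, the two terms of Equation (2) can be computed directly in the input space (the paper evaluates them in the RKHS $\mathbf{H}$; the plain Euclidean norm and the function name `bda_distance` are assumptions here):

```python
import numpy as np

def bda_distance(Xs, ys, Xt, yt_soft, mu=0.5):
    """Illustrative balanced distribution distance of Eq. (2), evaluated
    in the input space rather than the RKHS H used in the paper."""
    # first term: squared distance between the domain means (marginal part)
    marginal = np.sum((Xs.mean(axis=0) - Xt.mean(axis=0)) ** 2)
    # second term: per-class squared distance between class means
    # (conditional part), using soft labels yt_soft for the target domain
    conditional = 0.0
    for c in np.unique(ys):
        Xs_c, Xt_c = Xs[ys == c], Xt[yt_soft == c]
        if len(Xs_c) and len(Xt_c):
            conditional += np.sum((Xs_c.mean(axis=0) - Xt_c.mean(axis=0)) ** 2)
    return (1 - mu) * marginal + mu * conditional
```

The balance factor `mu` weights the two terms exactly as in Equation (1): `mu=0` measures only the marginal gap, `mu=1` only the conditional gap.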

Considering that the soft sensor problem of *Pichia pastoris* fermentation process studied in this paper is a regression problem, while the BDA method is only applicable to classification problems, this paper introduces the fuzzy set method to improve the BDA method and make it applicable to regression problems.

In traditional classification problems, a sample can belong to only one class, while the fuzzy set method allows a sample to belong to multiple classes to varying degrees. Therefore, for the output $\{y_i^z\}_{i=1,\cdots,n}$ of the $z$-th source domain, the 5th, 50th, and 95th percentile values are found and denoted as $p_5^z$, $p_{50}^z$, and $p_{95}^z$. Three fuzzy sets, denoted $small^z$, $medium^z$, and $large^z$, are defined based on these percentiles, as shown in Figure 1.

**Figure 1.** Percentile-based fuzzy set division. (**a**) Fuzzy sets in the source domain; (**b**) fuzzy sets in the target domain.

Let the membership degree of $y_i^z$ in class $p$ in the source domain be denoted $\alpha_{ip}^z$, and the membership degree of $y_i^t$ in class $q$ in the target domain be denoted $\alpha_{iq}^t$; $\alpha_{ip}^z$ and $\alpha_{iq}^t$ are normalized as shown in Equations (3) and (4):

$$\overline{\alpha}_{ip}^{z} = \frac{\alpha_{ip}^{z}}{\sum_{i=1}^{n} \alpha_{ip}^{z}} \quad p = 1,2,3;\ i = 1,2,\cdots,n \tag{3}$$

$$\overline{\alpha}_{iq}^{t} = \frac{\alpha_{iq}^{t}}{\sum_{i=1}^{m} \alpha_{iq}^{t}} \quad q = 1,2,3;\ i = 1,2,\cdots,m \tag{4}$$
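A minimal sketch of this percentile-based fuzzification and the column normalization of Equation (3) might look as follows; the triangular shape of the membership functions is an assumption here, since the text specifies only the three percentile anchors of Figure 1:

```python
import numpy as np

def fuzzy_memberships(y):
    """Triangular fuzzy sets anchored at the 5th/50th/95th percentiles
    (the triangular shape is an assumption), normalized as in Eq. (3)."""
    p5, p50, p95 = np.percentile(y, [5, 50, 95])
    small = np.clip((p50 - y) / (p50 - p5), 0.0, 1.0)
    large = np.clip((y - p50) / (p95 - p50), 0.0, 1.0)
    medium = np.clip(1.0 - small - large, 0.0, 1.0)
    alpha = np.stack([small, medium, large], axis=1)  # alpha[i, p], p = 1, 2, 3
    # Eq. (3): each class column is normalized over all samples
    return alpha / alpha.sum(axis=0, keepdims=True)
```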

Based on Equations (3) and (4), Equation (2) can be represented as:

$$DIS(Q_s, Q_t) \approx (1 - \mu) \left\| \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_{s_i} - \frac{1}{m} \sum_{j=1}^{m} \mathbf{x}_{t_j} \right\|_{\mathbf{H}}^{2} + \mu \sum_{c=1}^{3} \left\| \sum_{\mathbf{x}_{s_i} \in Q_s^c} \overline{\alpha}_{ic}^{z}\, \mathbf{x}_{s_i} - \sum_{\mathbf{x}_{t_j} \in Q_t^c} \overline{\alpha}_{jc}^{t}\, \mathbf{x}_{t_j} \right\|_{\mathbf{H}}^{2} \tag{5}$$

Using matrix techniques, Equation (5) can be written in the following form:

$$\begin{aligned} DIS(Q\_{\mathcal{S}}, Q\_{\mathcal{I}}) & \approx A^T X (1 - \mu) M\_0 X^T A + A^T X \mu \sum\_{c=1}^3 M\_c X^T A\\ &= A^T X [(1 - \mu) M\_0 + \mu M\_R] X^T A \end{aligned} \tag{6}$$

where *MR* = *M*<sup>1</sup> + *M*<sup>2</sup> + *M*3, *M*1, *M*<sup>2</sup> and *M*<sup>3</sup> are matrices for maximum mean discrepancy, defined as follows:

$$(M_c)_{ij} = \begin{cases} \overline{\alpha}_{ic}^{z}\,\overline{\alpha}_{jc}^{z} & \mathbf{x}_i, \mathbf{x}_j \in Q_s^c \\ \overline{\alpha}_{ic}^{t}\,\overline{\alpha}_{jc}^{t} & \mathbf{x}_i, \mathbf{x}_j \in Q_t^c \\ -\overline{\alpha}_{ic}^{z}\,\overline{\alpha}_{jc}^{t} & \mathbf{x}_i \in Q_s^c,\ \mathbf{x}_j \in Q_t^c \\ -\overline{\alpha}_{ic}^{t}\,\overline{\alpha}_{jc}^{z} & \mathbf{x}_i \in Q_t^c,\ \mathbf{x}_j \in Q_s^c \\ 0 & \text{otherwise} \end{cases} \tag{7}$$

The calculation formula of *M*<sup>0</sup> is as follows:

$$(M\_0)\_{ij} = \begin{cases} \frac{1}{n^2} & \boldsymbol{\chi}\_i, \boldsymbol{\chi}\_j \in Q\_s \\ \frac{1}{m^2} & \boldsymbol{\chi}\_i, \boldsymbol{\chi}\_j \in Q\_t \\ \frac{-1}{mn} & \textit{other} \end{cases} \tag{8}$$

By adding a regularization term to Equation (6), the objective function can be represented as:

$$\begin{array}{c} \min \operatorname{tr} (A^T X [(1 - \mu) M\_0 + \mu M\_R] X^T A) + \lambda ||A||^2\_F \\ \text{s.t.} \ A^T X H X^T A = I, 0 \le \mu \le 1 \end{array} \tag{9}$$

where $\lambda$ is the regularization parameter, $A$ is the transformation matrix, $X$ is the input matrix composed of $\mathbf{x}_{s_i}$ and $\mathbf{x}_{t_j}$, $\|\cdot\|_F^2$ is the Hilbert–Schmidt (Frobenius) norm, $I$ is the identity matrix, $I \in \mathbb{R}^{(n+m)\times(n+m)}$, and $H = I - (1/n)E$ is the centering matrix, where $E$ is the matrix of all ones. By using the Lagrange multiplier method, the Lagrange function for Equation (9) is:

$$L = \operatorname{tr}(A^T X[(1-\mu)M\_0 + \mu M\_R]X^T A) + \lambda ||A||\_F^2 + \operatorname{tr}([I - A^T X H X^T A] \phi) \tag{10}$$

where the Lagrange multiplier is $\phi = (\phi_1, \phi_2, \cdots, \phi_d)$; setting the derivative $\partial L/\partial A = 0$, the optimization problem can be transformed into a generalized eigenvalue decomposition problem, which can be expressed as:

$$(X[(1 - \mu)M\_0 + \mu M\_R]X^T + \lambda I)A = X H X^T A \phi \tag{11}$$

The optimal transformation matrix *A* can be obtained by solving Equation (11).

By using the optimal transformation matrix *A*, the distributions of the source domain data and the target domain data can be adapted, thereby improving the applicability and prediction accuracy of the soft sensor model.
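The adaptation step of Equations (6)-(11) can be sketched as follows, assuming the MMD matrix $M = (1-\mu)M_0 + \mu M_R$ from Equations (7) and (8) and the centering matrix $H$ have already been assembled; retaining the eigenvectors of the `dim` smallest eigenvalues is an assumption, as the text does not state how many columns of $A$ are kept:

```python
import numpy as np

def bda_transform(X, M, H, lam=1.0, dim=2):
    """Sketch of solving the generalized eigenproblem of Eq. (11).

    X   : (d, n+m) matrix of stacked source and target samples
    M   : precomputed (1 - mu) * M0 + mu * MR
    H   : centering matrix
    lam : regularization parameter lambda of Eq. (9)
    """
    a = X @ M @ X.T + lam * np.eye(X.shape[0])
    b = X @ H @ X.T
    # generalized problem a*A = b*A*Phi, solved here via b^{-1} a
    # (assumes b is invertible; a pseudo-inverse can be used otherwise)
    w, V = np.linalg.eig(np.linalg.solve(b, a))
    idx = np.argsort(w.real)[:dim]   # keep the dim smallest eigenvalues
    A = V[:, idx].real               # transformation matrix
    Z = A.T @ X                      # adapted features, one column per sample
    return A, Z
```

The columns of $Z$ are the adapted source and target samples used to train and query the downstream soft sensor model.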

#### *2.2. Improving Particle Swarm—Least Squares Support Vector Machine Algorithm*

2.2.1. Least Squares Support Vector Machine

The key biochemical parameters reflecting the real-time status and quality of *Pichia pastoris* fermentation (such as cell concentration and product concentration) are currently obtained mainly through offline sampling and laboratory analysis. However, this approach is cumbersome and leads to long intervals between data collections within the same fermentation batch, so the number of actually collected data samples is limited. In addition, the fermentation process has strong nonlinear characteristics. In light of the favorable performance of LSSVM in small-sample, nonlinear, and high-dimensional regression tasks [31], as well as its fast solution speed and robust fitting ability, this paper applies LSSVM to soft sensor modeling of *Pichia pastoris* fermentation.

Suykens et al. [32] proposed the LSSVM as a variant of the support vector machine (SVM). This method greatly reduces the complexity of the algorithm by using the sum of squared errors as the loss function. Given a dataset $\{\mathbf{x}_i, y_i\}_{i=1}^{l}$ with input data $\mathbf{x}_i \in \mathbb{R}^n$ and output data $y_i \in \mathbb{R}$, the optimization objective of LSSVM can be represented as follows:

$$\min_{\omega, b, e} J(\omega, e) = \frac{1}{2} \omega^T \omega + \frac{\gamma}{2} \sum_{i=1}^{l} e_i^2 \tag{12}$$

Subject to:

$$y_i = \omega^T \varphi(\mathbf{x}_i) + b + e_i, \quad i = 1, 2, \dots, l \tag{13}$$

where $\omega$ represents the weight vector, $\varphi(\cdot)$ is a nonlinear function that maps the data to a high-dimensional space, $\gamma$ is the regularization parameter, $e_i$ is the error of the $i$-th sample, and $b$ is the constant bias. Using Lagrange duality, the optimization objective can be converted into a dual optimization problem, which can be expressed as follows:

$$\begin{aligned} L(\omega, b, e, \alpha) &= J(\omega, e) - \sum_{i=1}^{l} \alpha_i \left( \omega^T \varphi(\mathbf{x}_i) + b + e_i - y_i \right) \\ &= \frac{1}{2} \omega^T \omega + \frac{\gamma}{2} \sum_{i=1}^{l} e_i^2 - \sum_{i=1}^{l} \alpha_i \left[ \omega^T \varphi(\mathbf{x}_i) + b + e_i - y_i \right] \end{aligned} \tag{14}$$

where $\alpha_i$ is the Lagrange multiplier for the $i$-th constraint. According to the Karush–Kuhn–Tucker (KKT) conditions:

$$\begin{aligned} \frac{\partial L}{\partial \omega} &= 0 \rightarrow \omega = \sum_{i=1}^{l} \alpha_i \varphi(\mathbf{x}_i) \\ \frac{\partial L}{\partial b} &= 0 \rightarrow \sum_{i=1}^{l} \alpha_i = 0 \\ \frac{\partial L}{\partial e_i} &= 0 \rightarrow \alpha_i = \gamma e_i \\ \frac{\partial L}{\partial \alpha_i} &= 0 \rightarrow \omega^T \varphi(\mathbf{x}_i) + b + e_i - y_i = 0 \end{aligned} \tag{15}$$

After eliminating *ω*,*e*, Equation (16) is obtained as follows:

$$\begin{bmatrix} 0 & E^T \\ E & \Omega + \gamma^{-1} I_l \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix} \tag{16}$$

where $E = [1, \dots, 1]^T$, $I_l$ is an $l \times l$ identity matrix, $\alpha = [\alpha_1, \dots, \alpha_l]^T$, $y = [y_1, \dots, y_l]^T$, and $\Omega$ is the kernel matrix, $\Omega \in \mathbb{R}^{l \times l}$, with $\Omega_{ij} = \varphi(\mathbf{x}_i)^T \varphi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$ for a given RBF kernel function:

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( \frac{-\left\| \mathbf{x}_i - \mathbf{x}_j \right\|^2}{2\sigma^2} \right) \tag{17}$$

Thus, the objective function can be derived as follows:

$$f(\mathbf{x}) = \sum\_{i=1}^{l} \alpha\_i \mathbf{K}(\mathbf{x}, \mathbf{x}\_i) + b \tag{18}$$

where *α*, *b* are the solutions to Equation (16).

The predictive performance of LSSVM mainly depends on the regularization parameter *γ* and the kernel width *σ*, where *γ* balances the trade-off between fitting accuracy and model generalization ability, and *σ* affects the complexity of the distribution of the sample data in the mapped space. Traditional methods for parameter selection are often based on experience and trial, which may not guarantee regression accuracy and computational efficiency. To improve the LSSVM model, the IPSO algorithm is used to optimize the parameters (*σ*, *γ*) of LSSVM.
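Under the RBF kernel of Equation (17), training reduces to solving the linear system of Equation (16), and prediction to evaluating Equation (18). A minimal NumPy sketch (function names and default parameter values are illustrative, not from the paper):

```python
import numpy as np

def rbf(Xa, Xb, sigma):
    """RBF kernel of Eq. (17)."""
    d2 = ((Xa[:, None, :] - Xb[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_fit(X, y, gamma=10.0, sigma=1.0):
    """Solve the linear system of Eq. (16) for the bias b and multipliers alpha."""
    l = len(y)
    A = np.zeros((l + 1, l + 1))
    A[0, 1:] = 1.0                                   # E^T
    A[1:, 0] = 1.0                                   # E
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(l) / gamma  # Omega + I/gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]                           # alpha, b

def lssvm_predict(Xq, X, alpha, b, sigma=1.0):
    """Evaluate the regression function of Eq. (18)."""
    return rbf(Xq, X, sigma) @ alpha + b
```

In the proposed method, $\gamma$ and $\sigma$ would then be supplied by the IPSO search of Section 2.2.2 rather than fixed by hand.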

#### 2.2.2. Improved Particle Swarm Optimization

Particle swarm optimization (PSO) is an evolutionary algorithm used to solve global optimization problems [33]. However, PSO has certain limitations such as a lack of dynamic adaptability, which can result in local optima trapping and slow convergence speed [34]. Researchers have proposed improved PSO algorithms to address these limitations, including dynamic PSO and adaptive PSO [35]. These algorithms have shown promising results in enhancing PSO's performance.

Yang Ge et al. [36] introduced a psychological model into PSO and proposed Emotional PSO (EPSO) to enhance the search ability of particles and accelerate the convergence speed of PSO. However, the algorithm lacks dynamic adaptability and is prone to getting trapped in local optima [37]. Therefore, this paper proposes an improved PSO algorithm based on EPSO to optimize the regularization parameter and kernel width of the LSSVM. The improved PSO algorithm is expected to overcome the limitations of EPSO and enhance the performance of PSO in solving optimization problems.

Assuming that each particle possesses emotional states and perception abilities, the Weber–Fechner law [38] is utilized to enhance the performance of PSO. Specifically, particles use their emotional states to determine their next actions, and they perceive stimuli by evaluating the difference between their current position and the historical best position. When the stimulus exceeds the perception threshold, the particle's emotional state changes, resulting in a stronger response to the stimulus. Three emotional states are defined for the particles, namely happy, normal, and sad, which correspond to different particle reactions. The emotional state of each particle is updated in each iteration: if the particle's fitness is better than that of the previous iteration, its emotional state increases; otherwise, it decreases. The emotional state of the $i$-th particle in the $t$-th iteration is represented by $eX_i^t$, and the initial emotional state $eX_i^0$ is a random number:

$$eX\_i^0 = rand[-0.1, 0.1] \tag{19}$$

The increase $\Delta^{+}$ and decrease $\Delta^{-}$ of the emotional state are defined as:

$$\Delta^{+} = \frac{\left(f(X\_i^{t-1}) - f(X\_i^t)\right) \cdot \left(f(X\_i^t) - f(gb^t)\right)}{\left(f(g\omega^t) - f(gb^t)\right)^2} \tag{20}$$

$$\Delta^{-} = \frac{\left(f(X\_i^t) - f(X\_i^{t-1})\right) \cdot \left(f(g\omega^t) - f(X\_i^t)\right)}{\left(f(g\omega^t) - f(gb^t)\right)^2} \tag{21}$$

where $X_i^t$ represents the position of the $i$-th particle at the $t$-th iteration, $gb^t$ represents the global best position at the $t$-th iteration, and $g\omega^t$ represents the global worst position at the $t$-th iteration.
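Assuming a minimization problem (lower fitness is better), the emotional-state update of Equations (20) and (21) can be sketched as:

```python
def emotion_update(eX, f_prev, f_curr, f_gb, f_gw):
    """Emotional-state update of Eqs. (20) and (21), assuming a
    minimization problem (lower fitness values are better)."""
    denom = (f_gw - f_gb) ** 2
    if f_curr < f_prev:
        # fitness improved: raise the emotional state by Delta+ (Eq. 20)
        return eX + (f_prev - f_curr) * (f_curr - f_gb) / denom
    # fitness worsened: lower the emotional state by Delta- (Eq. 21)
    return eX - (f_curr - f_prev) * (f_gw - f_curr) / denom
```

Both deltas are normalized by the squared spread between the global worst and global best fitness, so the emotional response scales with how significant the change is relative to the swarm.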

For the particle swarm $pX = \{pX_i;\ i = 1, 2, \cdots, n\}$, the emotional states of the particles are sorted, and all particles are divided into three emotional-state groups according to the average value of their emotional states, as shown in Figure 2.

**Figure 2.** Particle swarm emotional state *eX* sorting.
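Under the assumption of a minimization problem, the emotional-state update of Eqs. (19)–(21) and the three-way split can be sketched as follows. The tercile split is our simplification of the paper's mean-based split of the sorted states, and all function names are illustrative:

```python
import numpy as np

def emotion_step(eX, f_prev, f_curr, f_gb, f_gw):
    """Eqs. (20)-(21) for a minimization problem: raise the emotional state of
    particles whose fitness improved, lower it otherwise."""
    denom = (f_gw - f_gb) ** 2 + 1e-12            # guard for a degenerate swarm
    d_plus = (f_prev - f_curr) * (f_curr - f_gb) / denom
    d_minus = (f_curr - f_prev) * (f_gw - f_curr) / denom
    improved = f_curr < f_prev
    return np.where(improved, eX + d_plus, eX - d_minus)

def classify(eX):
    """Split into sad / normal / happy. The paper splits the sorted states
    around their mean; terciles are used here as a simple stand-in."""
    lo, hi = np.quantile(eX, [1 / 3, 2 / 3])
    return np.where(eX < lo, "sad", np.where(eX > hi, "happy", "normal"))
```

Initial states would be drawn as `rng.uniform(-0.1, 0.1, n)` per Eq. (19).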

Subsequently, the particle's behavior is determined based on its emotional state. The Weber–Fechner law is employed to describe the particle's perception ability, which can be expressed as:

$$r\_g = -k \ln \frac{S(f(gBest) - f(pX\_i))}{S\_0} \tag{22}$$

$$r\_h = -k \ln \frac{S(f(pBest\_i) - f(pX\_i))}{S\_0} \tag{23}$$

where *r<sub>g</sub>* represents global perception, *r<sub>h</sub>* represents historical perception, *k* is a constant factor, *S*(·) is the stimulus function, *S*<sub>0</sub> is the stimulus threshold, *gBest* is the historical best position of the particle swarm, and *pBest<sub>i</sub>* is the historical best position of the *i*-th particle. According to [35], the velocity and position update formulas for a particle in the normal emotional state are:

$$V\_i^{t+1} = \omega \cdot V\_i^t + c\_1 \cdot r\_1 \cdot (pBest\_i^t - pX\_i^t) + c\_2 \cdot r\_2 \cdot (gBest^t - pX\_i^t) \tag{24}$$

$$pX\_i^t = pX\_i^{t-1} + V\_i^t \tag{25}$$

When the particle is in the happy state, it will be more energetic in the current position. The update formula of the particle speed and position in the happy state is:

$$V\_i^{t+1} = \omega \cdot V\_i^t + c\_1 \cdot r\_1 \cdot r\_g \cdot (pBest\_i^t - pX\_i^t) + c\_2 \cdot r\_2 \cdot r\_h \cdot (gBest^t - pX\_i^t) \tag{26}$$

$$pX\_i^t = pX\_i^{t-1} + V\_i^t \tag{27}$$
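The normal- and happy-state updates of Eqs. (24)–(27) can be sketched as follows; the parameter values, the random generator, and the function name are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_particle(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5,
                    happy=False, r_g=1.0, r_h=1.0):
    """One velocity/position step. For a happy particle the Weber-Fechner
    perception gains r_g and r_h of Eqs. (22)-(23) additionally scale the
    cognitive and social pulls, as written in Eq. (26)."""
    r1 = rng.random(np.shape(x))
    r2 = rng.random(np.shape(x))
    if happy:                                   # Eq. (26)
        v = w * v + c1 * r1 * r_g * (pbest - x) + c2 * r2 * r_h * (gbest - x)
    else:                                       # Eq. (24)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v                             # Eqs. (25)/(27)
```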

When a particle is in a sad emotional state, it primarily focuses on its historical best position and contracts towards it from its current position. However, due to the decreasing fitness over iterations, the particle may become trapped in a local optimum. Based on the psychological model, a particle in a sad emotional state is on the verge of collapse and requires assistance from particles with better emotional states in the swarm to improve its condition. To address this issue, this paper proposes a restart strategy for updating the velocity of sad particles. The restart strategy is as follows:

$$V\_i^t = c\_1 \cdot (u - l) \cdot rand[-1, 1] \tag{28}$$

where *c*<sub>1</sub> is a constant, *u* represents the upper limit of the particle search range, *l* represents the lower limit of the particle search range, and rand[−1, 1] denotes a random number within the range [−1, 1]. To maintain both convergence performance and diversity in the particle swarm, a combination of random initialization and the global best position is employed to update the position of sad particles, which can be expressed as follows:

$$pX\_i^t = \begin{cases} gb^{t-1}, & rand < 0.5 \\ l + (u-l) \cdot rand[-1, 1], & rand \ge 0.5 \end{cases} \tag{29}$$

where *rand* denotes a uniform random number in [0, 1] drawn for each particle.
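The restart strategy of Eqs. (28) and (29) can be sketched as follows; the coin-flip condition and the boundary clipping are our reading of the formulas, and the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def restart_sad(gb_prev, u, l, dim, c1=0.35):
    """Restart a sad particle: fresh random velocity (Eq. (28)), then a coin
    flip between jumping to the previous global best and re-seeding at a
    random point in the search box (Eq. (29))."""
    v = c1 * (u - l) * rng.uniform(-1.0, 1.0, dim)        # Eq. (28)
    if rng.random() < 0.5:
        x = np.array(gb_prev, dtype=float).copy()          # exploit best-so-far
    else:
        x = l + (u - l) * rng.uniform(-1.0, 1.0, dim)      # explore anew;
        x = np.clip(x, l, u)                               # keep inside [l, u]
    return x, v
```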

The IPSO algorithm workflow is illustrated in Figure 3.

According to the analysis presented above, the improved PSO algorithm that utilizes the psychological mechanism demonstrates desirable dynamic and convergence performance, as well as enhanced search ability of particles when compared to traditional PSO algorithms. As a result, this study employed the improved PSO algorithm based on the psychological mechanism to optimize the key parameters (*γ*, *σ*) of the LSSVM.

#### *2.3. Soft Sensor Modeling Based on BDA-IPSO-LSSVM*

In this study, a soft sensor modeling strategy based on BDA-IPSO-LSSVM is proposed to address the issue of soft sensor model failure caused by the mismatch between training data and actual operating condition data in the fermentation process of *Pichia pastoris*. The proposed strategy utilizes the idea of transfer learning and employs the improved BDA method to match the training data and operating condition data, thereby enhancing the generalization ability and prediction accuracy of the established soft sensor model. Additionally, an improved PSO algorithm is proposed to optimize the established LSSVM-based soft sensor model and overcome the tendency of the PSO algorithm to become trapped in local optima. The proposed soft sensor modeling strategy based on BDA-IPSO-LSSVM is illustrated graphically in Figure 4.

#### *2.4. Introduction of the Pichia pastoris Experimental Work*

The focus of this study is *Pichia pastoris*; the *Pichia pastoris* GS115 (Mut<sup>s</sup>, His<sup>+</sup>) strain was selected, and the RTY-C-100L fermenter was used as the fermentation equipment. The input variables for the soft sensor model were chosen based on an analysis of the fermentation process of *Pichia pastoris* using the absolute relation degree method. The stirring speed *v*, temperature *T*, airflow *q*, pH of the fermentation liquid, dissolved oxygen *Do*, and fermenter pressure *p* were selected as input variables, and the product concentration *P* and cell concentration *C* were selected as output variables. The fermentation process is shown in Figure 5.

**Figure 4.** The framework of the BDA-IPSO-LSSVM.

**Figure 5.** Structure of *Pichia pastoris* fermentation process.

The specific steps for modeling the *Pichia pastoris* fermentation soft sensor model based on the BDA method are as follows, where the source domain data and the target domain data are represented by $Q\_s = \{x\_{si}, y\_{si}\}\_{i=1}^{n}$ and $Q\_t = \{x\_{tl}, y\_{tl}\}\_{l=1}^{m}$, respectively:

Step 1: Carry out *Pichia pastoris* fermentation experiments, build the datasets, and normalize the dataset.


The specific steps of the *Pichia pastoris* fermentation experiment are as follows:


The prediction performance of the soft sensor models is evaluated using the root mean square error (*RMSE*), the coefficient of determination (*R*<sup>2</sup>), and the mean absolute error (*MAE*):

$$RMSE = \sqrt{\frac{1}{n} \sum\_{i=1}^{n} \left( y\_{pre}^{(i)} - y\_{real}^{(i)} \right)^2} \tag{30}$$

$$R^2 = 1 - \frac{\sum\_{i=1}^n \left(y\_{pre}^{(i)} - y\_{real}^{(i)}\right)^2}{\sum\_{i=1}^n \left(y\_{real}^{(i)} - \bar{y}\_{real}\right)^2} \tag{31}$$

$$MAE = \frac{1}{n} \sum\_{i=1}^{n} \left| y\_{pre}^{(i)} - y\_{real}^{(i)} \right| \tag{32}$$
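These three indexes can be computed directly; a minimal sketch:

```python
import numpy as np

def metrics(y_pre, y_real):
    """RMSE, R^2 and MAE of the three evaluation indexes."""
    y_pre = np.asarray(y_pre, dtype=float)
    y_real = np.asarray(y_real, dtype=float)
    err = y_pre - y_real
    rmse = float(np.sqrt(np.mean(err ** 2)))
    r2 = float(1.0 - np.sum(err ** 2) / np.sum((y_real - y_real.mean()) ** 2))
    mae = float(np.mean(np.abs(err)))
    return rmse, r2, mae

print(metrics([1, 1], [0, 2]))   # -> (1.0, 0.0, 1.0)
```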

#### **3. Results and Discussion**

#### *3.1. Results*

To demonstrate the effectiveness of the proposed method, simulations were conducted for the LSSVM, PSO-LSSVM, IPSO-LSSVM, and BDA-IPSO-LSSVM soft sensor models to achieve real-time prediction of the cell concentration *C* and product concentration *P* of *Pichia pastoris*. The simulation results for cell concentration *C* are shown in Figure 6. Figure 6a presents the result for the LSSVM, using the RBF kernel with the regularization parameter set to 125 and the kernel width set to 10. PSO and IPSO were then introduced to optimize the LSSVM model, with the regularization parameter bounded in [0, 300] and initialized to 100, the kernel width bounded in [0, 50], the particle swarm size set to 100, and *c*<sub>1</sub> set to 0.35; the IPSO algorithm terminates when the number of iterations reaches 200 or the fitness of the global best position falls below 110. The IPSO algorithm yielded an optimized LSSVM regularization parameter of 132.4 and a kernel width of 16.4, and the corresponding simulation results are shown in Figure 6b,c, respectively. The "actual value" in the figures is the value measured during the fermentation experiment of *Pichia pastoris*. To further illustrate the effectiveness of the proposed IPSO algorithm, Figure 7 shows the fitness curves of the PSO and IPSO algorithms; the fitness of the IPSO algorithm is significantly better than that of the PSO algorithm.
To further demonstrate the convergence performance and optimization capability of the IPSO algorithm, it was compared with two classical optimization algorithms, Grey Wolf Optimization (GWO) [39] and Artificial Bee Colony (ABC) [40], under identical input conditions. Figure 7 presents the fitness curves of the IPSO, PSO, GWO, and ABC optimization algorithms; the IPSO algorithm converges significantly faster than the other algorithms. Concerning the prediction of the cell concentration of *Pichia pastoris*, Figure 8 shows the prediction results of the IPSO-LSSVM, WOLF-LSSVM, and ABC-LSSVM models. The figure clearly shows that IPSO-LSSVM produces markedly smaller prediction errors than the other two algorithms, providing substantial evidence that the IPSO algorithm surpasses both GWO and ABC in parameter optimization effectiveness.

The improved BDA algorithm was used to further improve the prediction accuracy by adapting the source domain data and target domain data. In the BDA algorithm, the balancing factor *μ* approaches one as the conditional probability distribution becomes more important and approaches zero as the marginal probability distribution becomes more important. In this simulation, the balancing factor was set to 0.62 to achieve optimal prediction performance. The final simulation result for the BDA-IPSO-LSSVM soft sensor model is shown in Figure 6d.
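The balancing idea can be illustrated with a small sketch. The actual BDA solves a kernelized transformation problem; here only the balanced distribution distance itself is computed, with a linear-kernel MMD and simple quantile bins standing in for the paper's fuzzy-set treatment of the continuous targets (all names and simplifications are ours):

```python
import numpy as np

def mmd_linear(Xs, Xt):
    """Squared MMD with a linear kernel: ||mean(Xs) - mean(Xt)||^2."""
    d = Xs.mean(axis=0) - Xt.mean(axis=0)
    return float(d @ d)

def balanced_distance(Xs, ys, Xt, yt, mu=0.62, n_bins=3):
    """(1 - mu) * marginal MMD + mu * conditional MMD. The targets are
    continuous, so they are coarsened into quantile bins as a crude stand-in
    for the fuzzy-set treatment of the regression targets."""
    marginal = mmd_linear(Xs, Xt)
    edges = np.quantile(np.concatenate([ys, yt]), np.linspace(0, 1, n_bins + 1))
    conditional = 0.0
    for b in range(n_bins):
        ms = (ys >= edges[b]) & (ys <= edges[b + 1])
        mt = (yt >= edges[b]) & (yt <= edges[b + 1])
        if ms.any() and mt.any():                 # skip bins empty in a domain
            conditional += mmd_linear(Xs[ms], Xt[mt])
    return (1.0 - mu) * marginal + mu * conditional
```

With μ near one the per-bin (conditional) terms dominate; with μ near zero only the overall (marginal) shift between batches matters.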

By comparing Figure 6a,b, it can be observed that the PSO algorithm enhances the prediction accuracy of the LSSVM model. However, the prediction performance of the LSSVM model is still not ideal and cannot meet the needs of the fermentation industry: significant errors remain when the predicted values are compared with the actual values. A further comparison of Figure 6b,c indicates that the IPSO algorithm effectively improves the dynamic performance of the PSO algorithm and thereby reduces the prediction error of the model. The result in Figure 6d demonstrates that the improved BDA algorithm effectively reduces the distribution distance between the source domain and the target domain, so that the predicted values of the model meet actual fermentation production needs.

As can be seen in Figure 6e, the proposed BDA-IPSO-LSSVM model has good prediction accuracy.

**Figure 6.** Prediction results of different models for cell concentration: (**a**) LSSVM; (**b**) PSO-LSSVM; (**c**) IPSO-LSSVM; (**d**) BDA-IPSO-LSSVM; (**e**) Combination of the four models.

**Figure 7.** The fitness curve of the IPSO, PSO, WOLF and ABC optimization algorithms.

**Figure 8.** The prediction result of the IPSO-LSSVM, WOLF-LSSVM and ABC-LSSVM.

Figure 9 shows the residuals of the different soft sensor models in predicting the cell concentration of *Pichia pastoris*, and the prediction performance of the soft sensor models is summarized in Table 1. It can be seen from Figure 9 and Table 1 that the residuals of BDA-IPSO-LSSVM are small relative to the other models, and that it performs well on the *RMSE*, *R*<sup>2</sup>, and *MAE* performance metrics. GFLOPs reflects the complexity of the model; the GFLOPs values in Table 1 show that the proposed hybrid model does not increase model complexity.

**Figure 9.** The residuals of the different soft sensor models in predicting the cell concentration of *Pichia pastoris*.

To further illustrate the effectiveness of the proposed BDA-IPSO-LSSVM model, Figure 10 presents, analogously to Figure 6, the predicted values of the *Pichia pastoris* product concentration. Specifically, Figure 10a–d show the prediction results of the LSSVM, PSO-LSSVM, IPSO-LSSVM, and BDA-IPSO-LSSVM models for product concentration during the fermentation process of *Pichia pastoris*. As evidenced by Figure 10e, the BDA-IPSO-LSSVM model proposed in this paper exhibits strong predictive performance for the key parameters of the fermentation process. Figure 11 displays the residuals of the different soft sensor models, and Table 2 presents the performance metrics of the prediction results. By comparison, it can be concluded that the proposed BDA-IPSO-LSSVM soft sensor model exhibits superior prediction accuracy compared to the other models.


**Table 1.** Prediction performance indexes of different soft sensor models in predicting cell concentration.

**Figure 10.** Prediction results of different models for product concentration: (**a**) LSSVM; (**b**) PSO-LSSVM; (**c**) IPSO-LSSVM; (**d**) BDA-IPSO-LSSVM; (**e**) Combination of the four models.

**Figure 11.** The residuals of the different soft sensor models in predicting the product concentration of *Pichia pastoris*.

**Table 2.** Prediction performance indexes of different soft sensor models in predicting product concentration.


#### *3.2. Discussion*

This paper proposes a soft sensor model of *Pichia pastoris* based on LSSVM. As shown in Figures 6 and 10, the proposed soft sensor model has good predictive performance and enables real-time monitoring of the fermentation process of *Pichia pastoris*. This paper first proposes an IPSO algorithm to optimize the parameters of the LSSVM so that it achieves good prediction performance. Figures 7 and 8 show that, compared with the GWO and ABC optimization algorithms, the proposed IPSO algorithm has good dynamic and convergence performance: it better solves the optimization problem of the LSSVM parameters and realizes accurate prediction with the LSSVM model. Second, this paper uses the BDA method from transfer learning to match the source domain data and target domain data, realizing accurate prediction by the *Pichia pastoris* soft sensor model under different working conditions. As shown in Figures 9 and 11, compared with the model without transfer, the proposed hybrid model reduces the prediction error to a large extent and solves the problem of model failure under different working conditions. Tables 1 and 2 show that the hybrid model performs well on all evaluation indicators. Moreover, the simulation results show that the hybrid model can realize real-time and accurate prediction of the cell concentration and product concentration in the fermentation process of *Pichia pastoris*, which can greatly improve the production efficiency of *Pichia pastoris* fermentation products.

#### **4. Conclusions**

This study has proposed a novel soft sensor modeling strategy based on BDA-IPSO-LSSVM to address the issue of data distribution differences resulting from different operating conditions during the fermentation process of *Pichia pastoris*. The proposed strategy employs the BDA method with a fuzzy set-based improvement to reduce the distribution differences and improve the generalization ability of the traditional soft sensor model. Additionally, an improved PSO algorithm is proposed to optimize the established LSSVM-based soft sensor model, which addresses the issue of PSO algorithms becoming trapped in local optima and significantly improves the prediction accuracy of the soft sensor model. The experimental results demonstrate that the proposed BDA-IPSO-LSSVM soft sensor model performs strongly on the *RMSE*, *R*<sup>2</sup>, and *MAE* prediction performance indicators. The soft sensor model can effectively predict the key parameters of *Pichia pastoris* fermentation in real time, including cell concentration and product concentration. The proposed strategy offers a promising solution to the issue of soft sensor model failure caused by the mismatch between training data and actual operating condition data and has potential applications in the fermentation industry. Future studies may explore the generalization of the proposed strategy to other fermentation processes or even other fields. The main limitation at present is that only one source domain and one target domain can be adapted, and the IPSO is designed for a single objective. In the future, we aim to study transfer learning from the data of multiple historical operating conditions to further reduce the differences between data, and to generalize the IPSO to multi-objective optimization problems.

**Author Contributions:** Conceptualization, B.W. and J.L.; methodology, B.W.; software, J.L.; validation, J.L., A.Y. and H.W.; formal analysis, J.L.; investigation, J.L.; resources, J.L.; data curation, H.W.; writing—original draft preparation, H.W.; writing—review and editing, A.Y.; visualization, J.L.; supervision, B.W.; project administration, B.W.; funding acquisition, B.W. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Natural Science Foundation of China (No. 61705093) and the Wuxi Science and Technology Plan Project—Basic Research: No. K20221054.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The dataset in this article is unavailable because it involves privacy.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**


#### **Nomenclature**




#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

**Simon Tomažič \*,† and Igor Škrjanc †**

Faculty of Electrical Engineering, University of Ljubljana, 1000 Ljubljana, Slovenia; igor.skrjanc@fe.uni-lj.si

**\*** Correspondence: simon.tomazic@fe.uni-lj.si; Tel.: +386-1-4768-760

† These authors contributed equally to this work.

**Abstract:** This paper presents a comprehensive study on the development of models and soft sensors required for the implementation of the automated bioreactor feeding of Chinese hamster ovary (CHO) cells using Raman spectroscopy and chemometric methods. This study integrates various methods, such as partial least squares regression and variable importance in projection and competitive adaptive reweighted sampling, and highlights their effectiveness in overcoming challenges such as high dimensionality, multicollinearity and outlier detection in Raman spectra. This paper emphasizes the importance of data preprocessing and the relationship between independent and dependent variables in model construction. It also describes the development of a simulation environment whose core is a model of CHO cell kinetics. The latter allows the development of advanced control algorithms for nutrient dosing and the observation of the effects of different parameters on the growth and productivity of CHO cells. All developed models were validated and demonstrated to have a high robustness and predictive accuracy, which were reflected in a 40% reduction in the root mean square error compared to established methods. The results of this study provide valuable insights into the practical application of these methods in the field of monitoring and automated cell feeding and make an important contribution to the further development of process analytical technology in the bioprocess industry.

**Keywords:** spectroscopy; Raman; modelling; soft sensor; variable selection; outliers; simulator; kinetic model

#### **1. Introduction**

Chemometrics, which deals with the application of various mathematical and statistical methods, could be described by a broad definition in which the most important part is the application of a multivariate data analysis to data relevant to chemistry [1]. The multivariate statistical data analysis is a powerful tool for analysing and structuring data sets obtained from different measurement systems and for building empirical mathematical models that can predict, for example, the values of important properties that cannot be measured directly [2,3]. Multivariate calibration is often used in the industry for the rapid online determination of important process parameters and critical quality characteristics and enables non-destructive measurements, online monitoring and process control.

In analytical chemistry, molecular spectroscopic methods, including infrared, near-infrared and Raman spectroscopy, are widely used to determine the molecular structure of various substances [4–6]. These methods work by assessing the radiant energy that is either absorbed or scattered when excited by a high-intensity monochromatic beam that induces a transient energy state in the molecule. The process of Raman scattering occurs when the material under investigation is exposed to monochromatic light, causing a tiny percentage of the light to be inelastically scattered at wavelengths other than the incident light.

Raman spectroscopy is an optical method that enables the non-destructive investigation of molecular structures and chemical compositions. However, due to its low intensity, the study of Raman scattering requires the use of sophisticated instruments [7]. The data obtained from spectroscopy contain thousands of wavenumbers (variables) and measurements (observations), which requires multivariate analysis to determine the relationship between these variables [8,9]. Modern Raman instruments usually use a laser as the illumination source because of its high-intensity monochromatic properties. The wavelength of this laser can range from the UV (*λ* = 200 nm) to the near-infrared (*λ* = 1064 nm), but for pharmaceutical or biological applications, near-infrared wavelengths (*λ* = 785 or *λ* = 830 nm) are usually preferred to minimise fluorescence interference.

**Citation:** Tomažič, S.; Škrjanc, I. Halfway to Automated Feeding of Chinese Hamster Ovary Cells. *Sensors* **2023**, *23*, 6618. https://doi.org/10.3390/s23146618

Academic Editor: Yuan Yao

Received: 26 June 2023; Revised: 14 July 2023; Accepted: 21 July 2023; Published: 23 July 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

In bioprocess literature, spectroscopic sensors are sometimes referred to as soft sensors [10] because the spectroscopic data are modelled in software programmes that provide information analogous to that of hardware sensors. It is critical that data analysis models are used to extract the optimal amount of information from Raman spectra, an area that has received much attention in research [11]. The complexity and difficulties associated with interpreting results from Raman and IR spectroscopy can be mitigated by applying various data mining methods required for a more comprehensive understanding. These methods must be able to manage large multidimensional data sets while exploring the totality of spectral information [12].

Chemometric techniques, including the commonly used Partial Least Squares (PLS) [13,14] method, exploit the transformation capabilities of the principal component analysis (PCA). In this technique, the attributes of a data set are transformed into uncorrelated principal components, which allows a reduction in data dimensions with minimal loss of information. PCA-based techniques complemented by machine learning methods such as decision trees [15], Support Vector Machine (SVM) [16] and artificial neural networks (ANN) [17,18] allow for an even finer analysis. Additional preprocessing steps can be implemented, including normalisation and smoothing via *k*-th order Savitzky–Golay derivative [19], while model accuracy can be assessed by the standard error of calibration, factors used and coefficient of determination (*R*<sup>2</sup>).

The inherent complexity of spectral data derived from vibrational spectroscopic techniques, including IR, NIR and Raman, has sparked debates on the topic of variable selection in PLS regression models [20,21]. This complexity arises from the interference caused by the scattering of diffuse light, instrumental noise and overlapping absorption bands. Given this complexity, variable selection strategies focus either on single wavelengths (e.g., variable importance in projection [22]) or on informative spectral intervals (such as interval PLS [23]). These methods help to eliminate superfluous information, a concept introduced by Spiegelman et al. [24]. More recently, the technique of the Competitive Adaptive Reweighted Sampling (CARS) has proven its effectiveness in processing NIR and RAMAN spectra [25,26].

Certain Raman spectra obtained from the same sample may differ from the group due to factors such as instrumental artefacts and variations in the sample. These spectra are often referred to as unwanted spectra or outliers. Omitting these spectra is considered crucial before applying multivariate techniques to obtain the desired results.

Raman spectroscopy, known for its precise spectral features that correlate with the molecular structure of a sample, has demonstrated its strengths in a non-destructive analysis and its ability to work with aqueous systems. These properties make it particularly suitable for the study of cell cultures and tissues [27]. It is widely used for the study of polysaccharides, amino acids, alcohols and metabolites and has secured its position as an important process analytical technology (PAT) in the bioprocess industry [18,28,29]. The ability of inline Raman spectroscopy to monitor and adjust critical parameters in real time ensures consistent drug production.

Although mammalian cell cultures are widely used in the pharmaceutical industry to produce biological products such as antibodies and growth factors, the full potential of advances in process monitoring and control has not yet been realised [10,27]. Conventional methods, often based on offline sampling and manual calculations, are still widely used. In particular, mammalian cells are mainly used for the production of protein therapeutics, which account for 60–70% of biopharmaceuticals. These processes usually involve the delivery of glucose to CHO cells [30–33].

By using non-invasive real-time measurements PAT in conjunction with closed-loop feedback control, feeding strategies can be optimised to improve yield [29,34,35]. Raman spectroscopy plays an important role in this, as it enables in-situ measurements and process control in real time. In situ Raman measurements, first presented by [36], allow the simultaneous measurement of total cell density (TCD), viable cell density (VCD) and concentrations of glucose, glutamate, lactate and ammonia. This method has proven successful in monitoring mammalian cell cultures in bioreactors. Several successful examples can be found in recent literature [18,34,36]. Subsequent studies have extended this application from developmental scales of 3 to 15 L [27,34] to clinical production scales of 2000 L [37], demonstrating the scaling potential of this approach.

This manuscript represents a significant advance in the field of bioprocess technology by providing a comprehensive PLS model construction procedure for Raman spectroscopy that incorporates data preprocessing and outlier removal, thereby improving the understanding and control of bioprocess behaviour. In addition, the development of a simulator that incorporates CHO cell kinetics is an important contribution to the field. It paves the way for the development of a model predictive control system for the automated feeding of CHO cells, revolutionising the way we approach the automation and control of bioprocesses.

The paper is organized as follows. Sections 2 and 2.1 describe the process of data acquisition and introduce the process of spectra processing, which is the initial step of data analysis. Section 2.2 explains the development of the PLS models for soft sensor design and different methods for variable selection in spectroscopic multivariate calibration. This subsection also discusses the process of identifying and removing outlier spectra to improve the robustness and accuracy of the PLS model. Section 3 discusses the CHO cell kinetics model required to develop an advanced simulation environment. Section 4 presents the results of the model construction and simulator implementation. Sections 5 and 6 provide the discussion and concluding remarks.

#### **2. Materials and Methods**

#### *2.1. Spectra Processing*

The extensive research began with the systematic compilation of measurements and data obtained from the cultivation of CHO cells in a stainless steel bioreactor. The local pharmaceutical company, which was in charge of designing the experiment, played an important role. Our task, on the other hand, was to analyse the collected data, create the necessary models and establish a suitable simulation environment, which is described in this paper.

The cultivation of the CHO cells took place in a bioreactor with a volume of 10 L. To collect measurements (Raman spectra), the probe of a Kaiser RamanRXN2 spectrometer was inserted into the bioreactor. The RamanRXN2 spectrometer is a sophisticated analytical device that uses laser light with a wavelength of 532 nm. The resulting Raman spectrum is collected over a period of at least 30 min, a measure that improves the signal-to-noise ratio. It is important to note that Raman scattering, which is essentially inelastic photon scattering, is a rather small fraction compared to its elastic counterpart.

For data storage, a desktop computer with a Windows operating system was used, which was directly connected to the Raman spectrometer. Four different experiments were performed to grow the cells in the bioreactor, with each batch lasting about two weeks. The bioreactor contained CHO-S cell lines. This cell line is a sub-line of the original CHO-K1, with adaptations for suspension culture. CHO-S cells are commonly used in the industrial production of therapeutic proteins.

To maintain the optimal environment of the bioreactor, the pH and temperature were strictly controlled and nutrient dosing (glucose and glutamine) was conducted manually on a daily basis using reference measurements. A Roche Cedex Bio Analyzer, known for its reliability and precision, was used to record these reference measurements daily. This allowed for the accurate monitoring of parameters such as glucose and glutamine concentration, viable cell count and others.

The development of useful models depends on appropriate methods, but even more important is the selection of appropriate data. In our case, the raw data consist of the Raman spectra shown in Figure 1. For a first experiment, the choice between regression methods such as principal component regression, partial least squares or an artificial neural network may not be so important [27]. However, it is important that the selected independent variables (*x*-data) have a strong relationship with the dependent variables (*y*-data) to be modelled [38]. The choice of method then depends on the type and amount of data available.

**Figure 1.** Spectra obtained with Raman spectroscopy (from four different batches where only spectra, which are used for training and validation, are shown).

In cases where the x-data for objects represent time series or digitised data from a continuous spectrum (e.g., Raman spectra, see Figure 1), possible pre-processing strategies could include smoothing or a transition to a first or second derivative. Smoothing attempts to reduce random noise by eliminating sharp peaks in the spectrum, while differencing brings relevant data to light despite noise amplification. The first derivative achieves alignment of spectra with different absorbance values that are shifted in parallel by cancelling out an additive baseline. A second derivative removes a constant and linear baseline. Each object vector, referred to as **x***i*, undergoes separate processes of smoothing and differentiation.

For both differentiation and smoothing, the Savitzky–Golay method, widely used in chemistry, is applied. This technique, a local polynomial regression based on linear least squares, requires x-values that are exact and equally spaced. For each point *j* with value *xj*, a weighted sum of the neighbouring values is computed as a linear combination; the weights determine whether smoothing or a derivative calculation is performed. Factors such as the number of neighbours and the polynomial order determine the strength of the smoothing. Choosing the right polynomial order is crucial, as an incorrectly chosen higher-order polynomial could misinterpret significant Raman bands as mere background. In the Savitzky–Golay method, a vector component *xj* is transformed by

$$\mathbf{x}\_{j}^{\*} = \frac{1}{N} \sum\_{h=-k}^{k} c\_{h}\, \mathbf{x}\_{j+h} \tag{1}$$

where *x*<sub>*j*</sub><sup>∗</sup> is the new value (of a smoothed curve or a derivative), *N* is the normalisation constant, *k* is the number of neighbouring values on each side of *j* (determining the size of the moving window) and *ch* are the coefficients, which depend on the degree of the polynomial used and the objective (smoothing, first or second derivative). For example, if a second-order polynomial is fitted through a window of five points (*k* = 2), the coefficients *c*−2, *c*−1, *c*0, *c*1, *c*<sup>2</sup> are −3, 12, 17, 12, −3 for smoothing, −2, −1, 0, 1, 2 for the first derivative, and 2, −1, −2, −1, 2 for the second derivative [19]. Figure 2 shows the Raman spectra to which the Savitzky–Golay filtering was applied.

**Figure 2.** Raman spectra to which Savitzky–Golay filtering has been applied.
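The weighted sum of Eq. (1) can be sketched in a few lines. The sketch below is a Python illustration (the original processing was done in Matlab) and assumes unit spacing and the five-point quadratic coefficients quoted in the text; the normalisation constants (*N* = 35 for smoothing, *N* = 10 for the first derivative) are the standard Savitzky–Golay values for this window.

```python
import numpy as np

def savgol_apply(x, coeffs, N):
    """Apply the Savitzky-Golay weighted sum of Eq. (1) to a 1-D signal.
    coeffs are the 2k+1 convolution coefficients c_h, N the normalisation
    constant; the k edge points on each side are left unchanged for brevity."""
    x = np.asarray(x, dtype=float)
    k = len(coeffs) // 2
    out = x.copy()
    for j in range(k, len(x) - k):
        out[j] = np.dot(coeffs, x[j - k:j + k + 1]) / N
    return out

# Five-point window (k = 2), second-order polynomial, from the text:
smooth_c, smooth_N = np.array([-3, 12, 17, 12, -3]), 35
deriv1_c, deriv1_N = np.array([-2, -1, 0, 1, 2]), 10

# A quadratic SG smoother reproduces polynomials up to order 2 exactly:
t = np.arange(10, dtype=float)
line = 2.0 * t + 1.0
assert np.allclose(savgol_apply(line, smooth_c, smooth_N), line)
# The derivative filter returns the slope (2.0) at all interior points:
assert np.allclose(savgol_apply(line, deriv1_c, deriv1_N)[2:-2], 2.0)
```

The assertions illustrate the key property exploited here: the filter leaves polynomial structure (i.e., true Raman bands of moderate width) intact while averaging out noise.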

The process of pre-processing includes both filtering and normalisation, with the latter playing an important role. The reason for this is that even spectra recorded for the same material may demonstrate differences due to different recording times or unequal instrument conditions such as laser power and alignment. These variations can lead to different intensity values for spectra of the same material.

To compensate for these intensity differences, normalisation is applied. It ensures that the intensity of a given Raman band of a given material is as similar as possible across spectra taken under nominally identical experimental parameters, even when some conditions differ slightly. Various normalisation methods are explored in the literature, including min-max normalisation, vector normalisation and Standard Normal Variate (SNV) normalisation. Of these, SNV normalisation is the most commonly used [39,40]. SNV normalisation is based on Equation (2):

$$\hat{\mathbf{x}}\_{j}^{\*} = \frac{\mathbf{x}\_{j}^{\*} - \bar{\mathbf{x}}^{\*}}{\sigma} \text{ where } \sigma = \sqrt{\frac{1}{N} \sum\_{j=1}^{N} (\mathbf{x}\_{j}^{\*} - \bar{\mathbf{x}}^{\*})^{2}} \text{ and } j = 1, 2, \dots, N. \tag{2}$$

Figure 3 shows the Raman spectra for which SNV normalisation was performed in addition to Savitzky–Golay filtering.

**Figure 3.** Raman spectra to which Savitzky–Golay filtering and SNV normalization are applied.
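Following Eq. (2), SNV normalisation amounts to centring each spectrum and scaling it to unit standard deviation; a minimal Python sketch (the original processing was done in Matlab):

```python
import numpy as np

def snv(spectrum):
    """Standard Normal Variate normalisation (Eq. 2): centre the spectrum
    to zero mean and scale it to unit standard deviation (population
    standard deviation, i.e. the 1/N form used in Eq. (2))."""
    x = np.asarray(spectrum, dtype=float)
    return (x - x.mean()) / x.std()

# Two recordings of the same material that differ only by an offset and a
# gain (e.g., laser power drift) collapse onto the same normalised curve:
s = np.array([10.0, 12.0, 15.0, 11.0, 9.0])
assert np.allclose(snv(s), snv(3.0 * s + 7.0))
assert abs(snv(s).mean()) < 1e-12 and np.isclose(snv(s).std(), 1.0)
```

The invariance to offset and gain shown by the first assertion is exactly the property needed to compare spectra recorded at different times or under slightly different instrument conditions.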

#### *2.2. Model Construction*

The construction of predictive models for bioprocesses, particularly for the cultivation of CHO cells in bioreactors, has made significant progress through the application of chemometric methods to Raman spectroscopic data [38]. These models can predict several key variables such as the concentrations of glucose, glutamine, lactate and other biochemical parameters, as well as cell growth metrics such as total cell count (TCC) and viable cell count (VCC). Raman spectroscopy, a non-invasive, label-free technique, provides detailed chemical information about the bioprocess by recording the molecular vibrations of the components. The resulting Raman spectra serve as input data for the prediction model and provide a comprehensive, high-dimensional data set.

Model construction begins with a calibration phase in which known samples are analysed using Raman spectroscopy and appropriate laboratory tests. This process generates a set of reference data that includes Raman spectra and associated concentrations of glucose, glutamine, lactate and cell counts. Another way to collect reference measurements is to use a device such as Roche's Cedex Analyzer. Once the reference data are prepared, multivariate analysis techniques such as Partial Least Squares Regression (PLSR) are used to build the predictive model. These methods work by identifying correlation patterns within the Raman spectra and relating them to the biochemical parameters of interest.

For more complex data sets or non-linear relationships, machine learning techniques such as Random Forest or SVM can be used. Advanced deep learning techniques such as Convolutional Neural Networks (CNNs) are particularly effective for processing high-dimensional spectral data, as they can automatically extract meaningful features and improve prediction accuracy [18]. However, one must be aware that building a model in this way requires a large database, which is not always available.

This approach not only improves our understanding of the bioprocess, but also our control over it. The real-time predictive capability of the model leads to optimised and consistent bioproduction outcomes by enabling rapid, data-driven decision-making and process adjustments, thereby increasing bioprocess performance, reducing costs and improving product quality. The model is continuously refined as more data become available, improving its predictive power over time.

#### 2.2.1. Partial Least Squares

Partial Least Squares (PLS) is a statistical method that finds a linear regression model by projecting the predicted variables and the observable variables onto a new space. The method was first developed by Swedish statistician Herman Wold and has since been widely used in fields such as chemometrics, neuroimaging, bioinformatics and social sciences [41,42].

PLS simultaneously accounts for the covariance of both the independent variables (predictors) and the dependent variables (responses). This approach is advantageous when dealing with complex, multivariate data sets where the predictors are highly collinear or where there are more predictors than observations. The method can handle noisy and missing data, which makes it robust and flexible.

Partial Least Squares (PLS) regression is a multivariate technique that combines features of principal component analysis (PCA) and multiple linear regression. Although PCA is not explicitly used in the PLS method, the concept of extracting principal components or latent variables is central to both methods. In PCA, the goal is to find a small number of uncorrelated variables, called principal components, that explain most of the variation in the data. Each principal component is a linear combination of the original variables and is orthogonal to all other components. PLS works in a similar way, but instead of trying to explain as much of the variance in the predictor variables as possible, PLS tries to extract components that explain as much of the covariance between the predictor and response variables as possible. Essentially, PLS looks for directions in which the predictors not only explain a large part of their own variance (as in PCA), but are also highly correlated with the response. A concrete iterative algorithm for extracting these components, NIPALS, is described below.


In terms of its statistical properties, PLS is a form of regularised regression. Like other forms of regularisation, it can prevent overfitting by introducing some bias into the model, but it reduces the variance of the model and thus improves its predictive performance.

PLS has been extended to handle different types of data and different modelling scenarios. The most popular versions of PLS include PLS-DA (PLS Discriminant Analysis) [43] for classification problems and PLS-PM (PLS Path Modelling) [44] for structural equation modelling. These extensions have made PLS a versatile and powerful tool for multivariate analysis. When considering the use of PLS, it is important to understand its assumptions and limitations. Although PLS does not assume that predictors are independent or normally distributed, it does assume a linear relationship between predictors and responses. In addition, PLS may not work well with unrelated predictors because it attempts to use all predictors in the model, which can lead to overfitting. It is recommended to evaluate the performance of PLS against other multivariate methods such as principal component regression (PCR) or ridge regression to ensure that it is appropriate for a particular data set and research question.

The Nonlinear Iterative Partial Least Squares (NIPALS) algorithm is a common method for calculating PLS components. The goal is to find a set of components (also called latent vectors) that capture the covariance between the predictors and the responses. The algorithm of the simplified NIPALS method can be summarised in the following five points:

• Initialization:

$$\mathbf{X} \in \mathbb{R}^{n \times m}, \ \mathbf{Y} \in \mathbb{R}^{n \times p}, \tag{3}$$

where *X* is a predictor matrix and *Y* is a response matrix.

• Selection of an initial column vector. Typically, the first column of the *Y* matrix represents the vector *u*:

$$\mathbf{u} = \mathbf{Y}[:,1] \tag{4}$$

• Iteratively compute the weight vector *w* and the score vector *t* until convergence:

$$\mathbf{w} = \frac{\mathbf{X}^T \mathbf{u}}{\mathbf{u}^T \mathbf{u}} \tag{5}$$

Normalize the weights:

$$\mathbf{w} = \frac{\mathbf{w}}{||\mathbf{w}||}\tag{6}$$

Compute the score vector:

$$\mathbf{t} = \mathbf{X}\mathbf{w} \tag{7}$$

Compute the *Y*-loading *q* and reassign *u*:

$$\mathbf{q} = \frac{\mathbf{Y}^{T}\mathbf{t}}{\mathbf{t}^{T}\mathbf{t}}, \quad \mathbf{u} = \frac{\mathbf{Y}\mathbf{q}}{\mathbf{q}^{T}\mathbf{q}} \tag{8}$$

The iteration continues until the difference between the new and old score vectors falls below a certain threshold, indicating convergence.

• Deflate *X* and *Y*:

Calculate the outer product of *t* and *p* (the loading vector for *X*), then subtract it from *X*. Do the same for *Y* with *t* and *q* (the loading vector for *Y*):

$$\mathbf{p} = \frac{\mathbf{X}^T \mathbf{t}}{\mathbf{t}^T \mathbf{t}}, \quad \mathbf{q} = \frac{\mathbf{Y}^T \mathbf{t}}{\mathbf{t}^T \mathbf{t}} \tag{9}$$

$$\mathbf{X} = \mathbf{X} - \mathbf{t}\mathbf{p}^T, \quad \mathbf{Y} = \mathbf{Y} - \mathbf{t}\mathbf{q}^T \tag{10}$$

The iterations end when *X* (or *Y*) can no longer be deflated or when the number of extracted latent variables is enough to describe the data according to some criterion.

• Calculate the regression coefficients. Once all the latent vectors are extracted, the regression coefficients *B* can be calculated as:

$$\mathbf{B} = \mathbf{W} (\mathbf{P}^T \mathbf{W})^{-1} \mathbf{Q}^T,\tag{11}$$

where *W* is the matrix of weight vectors, *P* is the loading matrix of *X* and *Q* is the loading matrix of *Y*.
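The five NIPALS steps above can be sketched compactly as follows. This is a Python illustration, not the authors' Matlab implementation, and the reassignment of *u* (Eq. (8)) is computed via the *Y*-loading *q* so that *u* keeps the dimension of a score vector.

```python
import numpy as np

def nipals_pls(X, Y, n_components, tol=1e-12, max_iter=500):
    """Simplified NIPALS PLS regression (Eqs. 3-11).
    X: (n, m) predictors, Y: (n, p) responses; returns B with Y ~ X @ B."""
    X = np.array(X, dtype=float)      # working copies: deflated in place
    Y = np.array(Y, dtype=float)
    n, m = X.shape
    W = np.zeros((m, n_components))
    P = np.zeros((m, n_components))
    Q = np.zeros((Y.shape[1], n_components))
    for a in range(n_components):
        u = Y[:, 0].copy()                         # Eq. (4): initial u
        t_old = np.zeros(n)
        for _ in range(max_iter):
            w = X.T @ u / (u @ u)                  # Eq. (5)
            w /= np.linalg.norm(w)                 # Eq. (6)
            t = X @ w                              # Eq. (7)
            q = Y.T @ t / (t @ t)                  # Y-loading
            u = Y @ q / (q @ q)                    # Eq. (8): reassign u
            if np.linalg.norm(t - t_old) < tol:    # scores converged
                break
            t_old = t
        pa = X.T @ t / (t @ t)                     # Eq. (9): X-loading
        X -= np.outer(t, pa)                       # Eq. (10): deflation
        Y -= np.outer(t, q)
        W[:, a], P[:, a], Q[:, a] = w, pa, q
    return W @ np.linalg.inv(P.T @ W) @ Q.T        # Eq. (11)

# Sanity check on an exactly linear system: with a full set of components,
# the PLS fit reproduces the responses.
rng = np.random.default_rng(0)
Xd = rng.normal(size=(20, 3))
Bd = rng.normal(size=(3, 1))
Yd = Xd @ Bd
B = nipals_pls(Xd, Yd, n_components=3)
assert np.allclose(Xd @ B, Yd, atol=1e-6)
```

With as many components as the rank of *X*, the PLS solution coincides with ordinary least squares, which is what the final assertion checks; the practical value of PLS comes from stopping at far fewer components.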

The Root Mean Square Error of Cross-Validation (RMSECV), which is calculated during the creation of the PLS model, can be used as a criterion to find the right number of latent variables and prevent overfitting. For example, Figure 4 shows that for a PLS model of glucose concentration, the most appropriate number of latent variables is four, as the RMSECV does not decrease appreciably beyond that point.

**Figure 4.** Finding the most appropriate number of latent variables in a PLS model.
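For reference, the error criterion behind RMSECV (and RMSEP in Section 4) and a simple rule for reading a curve like the one in Figure 4 can be sketched as follows; the 2% tolerance in `pick_n_latent` is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error, the criterion behind RMSECV and RMSEP."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def pick_n_latent(rmsecv, rel_tol=0.02):
    """Smallest number of latent variables whose RMSECV lies within
    rel_tol of the global minimum, i.e. the point after which the error
    no longer drops appreciably when more components are added."""
    r = np.asarray(rmsecv, dtype=float)
    ok = np.flatnonzero(r <= r.min() * (1.0 + rel_tol))
    return int(ok[0]) + 1      # +1: index 0 corresponds to one latent variable

assert np.isclose(rmse([1, 2, 3], [1, 2, 4]), (1 / 3) ** 0.5)
# A hypothetical RMSECV curve that flattens after four components:
assert pick_n_latent([2.0, 1.4, 0.9, 0.49, 0.488, 0.487, 0.487]) == 4
```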

#### 2.2.2. Selection of Key Variables

To further improve the PLS models and reduce the possibility of overfitting, the Variable Importance in Projection (VIP) and Competitive Adaptive Reweighted Sampling Partial Least Squares (CARS-PLS) methods were used.

Variable Importance in Projection is a popular method for assessing the contribution of individual predictor variables to a PLS regression model.

The VIP score for a variable measures that variable's contribution to the model, taking into account both its contribution to explaining the dependent variables and its contribution to explaining the independent variables. A high VIP score indicates that the variable is highly significant in the model (Figure 5 shows an example of selecting key variables in a PLS model of glucose concentration). However, the VIP method also has some disadvantages.


On the other hand, Competitive Adaptive Reweighted Sampling Partial Least Squares is a more recent technique for variable selection in spectroscopic multivariate calibration that has gained considerable attention in chemometrics. CARS-PLS was developed to overcome two major challenges in the analysis of spectroscopic data: high dimensionality and multicollinearity. These problems can lead to overfitting of the model, poor generalisation ability and difficulties in interpretation. CARS-PLS combines a competitive, Monte Carlo-based elimination of variables with the selection of the variable subset that minimises the cross-validation error.


**Figure 5.** Key variables determined with the methods VIP and CARS for the PLS model of glucose concentration.
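VIP scores such as those in Figure 5 can be computed from the quantities a fitted PLS model already provides. The sketch below uses the standard VIP formula with the weights, scores and *Y*-loadings denoted *W*, *T*, *Q*; the helper name is ours, and the random inputs stand in for a real fitted model.

```python
import numpy as np

def vip_scores(W, T, Q):
    """Standard VIP formula: for predictor j,
    VIP_j = sqrt(m * sum_a SSY_a * (w_ja / ||w_a||)^2 / sum_a SSY_a),
    where SSY_a = q_a^2 * t_a^T t_a is the Y-variance explained by
    component a and m is the number of predictors."""
    W = np.asarray(W, dtype=float)
    m = W.shape[0]
    ssy = (np.sum(np.asarray(Q, dtype=float) ** 2, axis=0)
           * np.sum(np.asarray(T, dtype=float) ** 2, axis=0))
    wn2 = (W / np.linalg.norm(W, axis=0)) ** 2   # squared normalised weights
    return np.sqrt(m * (wn2 @ ssy) / ssy.sum())

# The mean of the squared VIP scores is 1 by construction, which is why
# VIP > 1 is the usual cut-off for an "important" variable:
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 3))    # 8 predictors, 3 latent variables
T = rng.normal(size=(20, 3))   # score vectors
Q = rng.normal(size=(1, 3))    # Y-loadings (single response)
v = vip_scores(W, T, Q)
assert np.isclose((v ** 2).mean(), 1.0)
```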

The CARS-PLS method has been used successfully in many areas where spectroscopic data are used, such as pharmaceutical analysis, food quality control and environmental monitoring. However, like all methods, it has its limitations and assumptions. It assumes that there is a linear relationship between predictors and responses, and it may not work well if this assumption is not met. In addition, the performance of CARS-PLS may depend on the initial weights of the variables and the number of Monte Carlo iterations. Therefore, it is often advisable to make several runs of CARS-PLS with different initial settings and determine the consensus of the results.

Compared to VIP, CARS typically selects a smaller, more parsimonious set of key variables and can yield models with lower prediction error.


#### 2.2.3. Removal of Outlier Spectra

The PLS model can be further improved by searching for spectra that represent outliers. For this purpose, a resampling method commonly used in statistics and machine learning was applied, also referred to as Monte Carlo cross-validation or repeated random sub-sampling validation: the data are repeatedly split at random into training and test subsets, a PLS model is fitted on each training subset, the prediction errors of the held-out spectra are recorded, and the mean error and its standard deviation are computed for every spectrum; spectra with consistently large errors are flagged as outliers.


The advantage of this method is that it can help to increase the robustness of the PLS models by removing outliers that would otherwise distort the model parameters. It is a relatively simple and intuitive approach that combines the robustness of resampling with the ability to identify and remove problematic data points. This method helps to further reduce the Root Mean Square Error of Prediction (RMSEP) and thus improve the overall performance of the model.

However, as with any method, it should be used judiciously. Removing outliers too aggressively can lead to overfitting, where the model becomes tuned to the "typical" data points and performs poorly on new, unseen data. This method is most useful when the dataset is large enough that removing some data points does not significantly reduce its overall size.
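The resampling scheme can be sketched as follows, assuming the caller supplies a fit/predict routine (the callback name, split fraction and number of splits are illustrative choices; here plain least squares stands in for PLS).

```python
import numpy as np

def mc_outlier_stats(X, y, fit_predict, n_splits=200, test_frac=0.3, seed=0):
    """Monte Carlo cross-validation residual statistics: for every random
    train/test split, fit a model and record the absolute prediction error
    of each held-out sample; return the per-sample mean error and its
    standard deviation (cf. Figure 6). Samples with a large mean error are
    outlier candidates. `fit_predict(Xtr, ytr, Xte)` is user-supplied."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = [[] for _ in range(n)]
    for _ in range(n_splits):
        test = rng.choice(n, size=int(test_frac * n), replace=False)
        train = np.setdiff1d(np.arange(n), test)
        pred = fit_predict(X[train], y[train], X[test])
        for i, e in zip(test, np.abs(pred - y[test])):
            errors[i].append(e)
    mean_err = np.array([np.mean(e) for e in errors])
    std_err = np.array([np.std(e) for e in errors])
    return mean_err, std_err

# Plain least squares as the stand-in model; sample 5 carries a planted error.
def lstsq_fit_predict(Xtr, ytr, Xte):
    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return Xte @ b

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5])
y[5] += 5.0                               # the "outlier spectrum"
mean_err, std_err = mc_outlier_stats(X, y, lstsq_fit_predict)
assert mean_err.argmax() == 5
```

Because the planted outlier is predicted poorly whenever it falls into the test subset, its mean error dominates the statistics, which is exactly the pattern visible for outlying spectra in Figure 6.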

**Figure 6.** The mean error and standard deviation for all spectra.

#### **3. Simulator Construction**

In order to develop a predictive control algorithm for automated nutrient feeding in a bioreactor, a simulation environment based on a dynamic model was implemented. The model describes the growth kinetics of a CHO cell culture in a fed-batch bioreactor. It is well known that the process parameters (temperature, pH, feeding, ammonia removal, etc.) have a significant impact on cell growth and especially on the quality of the monoclonal antibodies (mAbs) produced [45]. Therefore, the model is important not only for the development of control algorithms, but also for the observation and identification of the key factors (variables and parameters) that have the greatest influence on cell productivity. This is particularly important from the point of view of optimising protein production in a mammalian cell line.

#### *3.1. Modelling CHO Cell Culture Kinetics*

Chinese Hamster Ovary cells are the most commonly used mammalian hosts for the industrial production of therapeutic proteins, due to their capacity to perform human-like post-translational modifications. The growth kinetics of CHO cells can be studied using a mechanistic model [32]. A mechanistic model is a type of model used to describe biological processes based on underlying physiological mechanisms. Such models allow us to interpret, predict and simulate biological phenomena by using mathematical equations to represent the interactions and transformations that occur in a system. In the context of CHO cell growth kinetics, a mechanistic model would include at least the following components.

One of the most important mechanisms determining the growth kinetics of CHO cells is cell division. The rate at which cells grow and divide depends on various influencing factors such as the availability of nutrients, the accumulation of waste products and the passage of time. Mathematical models such as the Gompertz model or the logistic growth model are often used to represent these complicated dynamics of cell growth.

Another crucial determinant of cell growth is the assimilation and utilisation of nutrients such as glucose and glutamine. The rate at which these nutrients are consumed can have a significant impact on cell growth and is usually modelled using Monod or Michaelis–Menten kinetics, which provides essential insights into cell metabolism and growth patterns.

As cells grow and metabolise nutrients, they inevitably generate waste products such as lactate or ammonia. The accumulation of these waste products can have a suppressive effect on cell growth. To quantify this inhibitory effect, mathematical models are used to provide detailed insight into the relationship between the accumulation of waste products and cell proliferation.

The loss of cells through mechanisms such as apoptosis, nutrient deprivation or the toxic effect of accumulated by-products is an inevitable aspect of cell culture. Mathematical models are used to express the rate of cell death as a function of various parameters, providing valuable insights into the factors that influence cell viability over time.

Finally, the growth kinetics of CHO cells are significantly influenced by external environmental factors such as temperature, pH and osmolality. These factors must be carefully incorporated into the mechanistic model to ensure its relevance and accuracy. These environmental influences represent an additional layer of complexity and require a comprehensive understanding of their effects on cell growth and survival.

Each of these components is interconnected and forms a complex network of interactions that determine the growth kinetics of CHO cells. Together, they form a robust mechanistic model that allows the prediction, interpretation and simulation of the behaviour of CHO cells under different conditions. A mechanistic model of CHO cell growth kinetics would typically be a system of differential equations, where each equation represents a particular biological process (such as cell growth, nutrient consumption, production of waste products, etc.). These models can be quite complex and usually require a large amount of experimental data for their parameterisation.

However, despite their complexity, mechanistic models can provide valuable insights into the cell growth process and can be helpful in optimising cell culture conditions for maximum productivity. Many authors [45–48] who have worked on modelling the kinetics of CHO cell cultures have set up various dynamic models in the form of differential equations based on steady-state analysis. In most cases, these simple models only describe the variation of extracellular metabolite concentrations and the number of live/dead cells during the cell cycle. The models differ in the number of factors considered (number of variables and parameters), which are more or less relevant to describe what actually happens in a mammalian cell line (in a bioreactor). However, in order to have a practical and universally applicable simulator, a model was needed that took all the important variables into account. One such model was developed by M. Ivarsson [48] in her PhD thesis; it takes into account the four phases of the cell cycle, temperature, glutamine concentration, the number of dead cells, etc., in addition to the number of living cells and the concentrations of glucose, lactate and ammonia. For the development of a predictive controller for automated feeding, only a model prediction of the glucose concentration would be required at this stage. However, as the glucose concentration variations are also highly dependent on other variables, these should also be considered in the model. As mentioned above, the chosen dynamic model [48] describes four phases of the cell cycle, *G*<sub>0</sub>, *G*<sub>1</sub>, *S* and *G*<sub>2</sub>/*M*, and the number of cells per phase: *X*<sub>G0</sub>, *X*<sub>G1</sub>, *X*<sub>S</sub> and *X*<sub>G2/M</sub>.

*G*<sub>1</sub> phase:

$$\frac{d(X\_{\rm G1}V)}{dt} = 2k\_{\rm G2/M-\rm G1}X\_{\rm G2/M}V - k\_{\rm G1-S}X\_{\rm G1}V - k\_{\rm G1-G0}X\_{\rm G1}V - k\_dX\_{\rm G1}V - F\_{\rm OUT}X\_{\rm G1} \tag{12}$$

*S* phase:

$$\frac{d(X\_S V)}{dt} = k\_{G1-S} m\_{\text{stress}} X\_{G1} V - k\_{S-G2/M} X\_S V - k\_d X\_S V - F\_{\text{OUT}} X\_S \tag{13}$$

*G*2/*M* phase:

$$\frac{d(X\_{\rm G2/M}V)}{dt} = k\_{\rm S-G2/M}X\_{\rm S}V - k\_{\rm G2/M-G1}X\_{\rm G2/M}V - k\_{\rm d}X\_{\rm G2/M}V - F\_{\rm OUT}X\_{\rm G2/M} \tag{14}$$

*G*<sup>0</sup> phase:

$$\frac{d(X\_{\rm G0}V)}{dt} = k\_{\rm G1-S}(1 - m\_{\rm stress})X\_{\rm G1}V + k\_{\rm G1-G0}X\_{\rm G1}V - k\_{\rm d}X\_{\rm G0}V - F\_{\rm OUT}X\_{\rm G0} \tag{15}$$

The equations include transition factors *k*, where, e.g., *k*<sub>G1−S</sub> represents the transition from the *G*<sub>1</sub> phase to the *S* phase. The transition factors between subpopulations depend mainly on the growth rate, which in turn is determined by the times (*t*<sub>G1</sub>, *t*<sub>S</sub> and *t*<sub>G2/M</sub>) required for the completion of each cellular phase:

$$\mu = \frac{\ln(2)}{t\_{G1} + t\_S + t\_{G2/M}} \tag{16}$$

The transition from the *G*<sub>1</sub> to the *G*<sub>0</sub> phase is governed by the transition factor *k*<sub>G1−G0</sub>, which represents temperature stress; the transition to the *G*<sub>0</sub> phase can also be triggered by metabolic stress *m*<sub>stress</sub>. The number of viable cells is calculated as the sum of the cells from each phase, where *V* represents the current volume of material in the bioreactor:

$$\frac{d(X\_V V)}{dt} = \frac{d(X\_{G0} V)}{dt} + \frac{d(X\_{G1} V)}{dt} + \frac{d(X\_S V)}{dt} + \frac{d(X\_{G2/M} V)}{dt} \tag{17}$$

The volume varies depending on the nutrient dosage (*F*<sub>Glc</sub> and *F*<sub>Gln</sub>) and the potential sampling *F*<sub>OUT</sub>:

$$\frac{dV}{dt} = F\_{\rm Glc} + F\_{\rm Gln} - F\_{\rm OUT} \tag{18}$$

The glutamine concentration varies according to the consumption factor *Q*<sub>Gln</sub>, the degradation to ammonia *K*<sub>deg</sub> and the potential dose *F*<sub>Gln</sub>. Glutamine consumption depends on the cell growth factor, the specific yield *Y*<sub>Gln</sub> and the limiting function *f*<sub>upt</sub>:

$$\frac{d(Gl nV)}{dt} = -Q\_{\text{Gln}} X\_V V - K\_{\text{deg}} \text{Gln} V + F\_{\text{Gln}} \text{Gln}\_{\text{Feed}} - F\_{\text{OUT}} \text{Gln} \tag{19}$$

The ammonia concentration depends largely on changes in the glutamine concentration, since the ammonia concentration increases with glutamine consumption (factors *Y*<sub>Amn</sub> and *K*<sub>deg</sub>):

$$\frac{d(\mathrm{Amn}V)}{dt} = Q\_{\rm Gln}Y\_{\rm Amn}X\_V V + K\_{\rm deg}\mathrm{Gln}V - F\_{\rm OUT}\mathrm{Amn} \tag{20}$$

The glucose concentration varies according to the consumption factor *Q*<sub>Glc</sub>, the minimum consumption required to keep the cells alive *m*<sub>Glc</sub>, and the amount of glucose added *F*<sub>Glc</sub>. The consumption factor *Q*<sub>Glc</sub> is influenced by temperature and by lactate as an inhibitor:

$$\frac{d(\text{GlcV})}{dt} = -Q\_{\text{Glc}}X\_V(1 - f\_{\text{G0}})V - m\_{\text{Glc}}X\_Vf\_{\text{G0}}V + F\_{\text{Glc}}\text{Glc}\_{\text{Feed}} - F\_{\text{OUT}}\text{Glc} \tag{21}$$

The lactate concentration depends on the glucose consumption (*Q*<sub>Glc</sub> and *m*<sub>Glc</sub>):

$$\frac{d(\mathrm{Lac}V)}{dt} = \chi\_{\rm Lac}Q\_{\rm Glc}X\_V(1 - f\_{\rm G0})V - \chi\_{\rm Lac}m\_{\rm Glc}X\_V f\_{\rm G0}V - F\_{\rm OUT}\mathrm{Lac} \tag{22}$$

The change in monoclonal antibody concentration is determined by factors representing the productivity level (*q*<sub>G1/G0</sub>, *q*<sub>S</sub> and *q*<sub>G2/M</sub>) per cell phase:

$$\frac{d(mAbV)}{dt} = \mu[q\_{\frac{G1}{G0}}(X\_{G1} + X\_{G0}) + q\_S X\_S + q\_{\frac{G2}{M}} X\_{\frac{G2}{M}}] - F\_{\text{OUT}} mAb \tag{23}$$
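As a minimal numerical illustration of how such balance equations are integrated, the volume and glucose balances (Eqs. (18) and (21)) can be reduced and stepped forward with an explicit Euler scheme. All parameter values below are illustrative assumptions rather than the fitted constants of the cited model, and the remaining states (cell phases, lactate, glutamine) are frozen.

```python
def simulate_glucose(t_end=24.0, dt=0.01, glc0=4.0, v0=1.0,
                     q_glc=1e-11, x_v=2e9, f_glc=0.0, glc_feed=200.0):
    """Forward-Euler sketch of Eqs. (18) and (21) with f_G0 = 0 and
    F_Gln = F_OUT = 0. States: volume V [L] and the glucose mass GlcV [g];
    x_v is the (frozen) viable cell density [cells/L], q_glc the specific
    uptake [g/(cell*h)], f_glc the feed rate [L/h] and glc_feed the feed
    concentration [g/L]. All parameter values are illustrative."""
    v, glc_v = v0, glc0 * v0
    for _ in range(round(t_end / dt)):
        d_v = f_glc                                    # Eq. (18), reduced
        d_glc_v = -q_glc * x_v * v + f_glc * glc_feed  # Eq. (21), reduced
        v += d_v * dt
        glc_v += d_glc_v * dt
    return glc_v / v                                   # concentration [g/L]

# Without feeding, glucose falls linearly:
# 4 g/L - (1e-11 g/(cell*h) * 2e9 cells/L) * 24 h = 3.52 g/L
assert abs(simulate_glucose() - 3.52) < 1e-6
```

The same pattern, with all states coupled and the full right-hand sides of Eqs. (12)–(23), is what the Simulink-based simulator in Section 4 integrates.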

#### **4. Results**

In order to monitor the process in the bioreactor in detail during the entire batch, which usually takes about 14 days, seven PLS models were developed in the Matlab environment. These models, which act as soft sensors, allow the monitoring of the most important process variables in CHO cell cultivation: glucose concentration, viable cell concentration (VCC), total cell count (TCC), glutamine, glutamate, lactate and ammonium.

Data from four different batches were available to us for the development of PLS models. Raman spectra were collected every half hour and reference measurements (offline) were performed once or twice a day with Cedex Analyzer. Thus, the first step was to find the pairs of spectra and reference measurements that matched best in terms of acquisition time. The Raman measurement takes about half an hour to obtain a good signal-to-noise ratio and to remove fluorescence interference.

As described in Section 2.1, two key initial steps in the development of PLS models are the preprocessing of the Raman spectra with the Savitzky–Golay filter and the normalisation with the Standard Normal Variate method (see Figures 2 and 3). Savitzky–Golay low-pass filtering was performed over all independent variables (Raman shift, cm<sup>−1</sup>) of each spectrum, with a quadratic function chosen for smoothing and the window length set to 15 samples. In addition, the Standard Normal Variate normalisation is applied to the independent variables of all spectra, resulting in spectra with a mean of zero and a standard deviation of one.

As described in Section 2.2 and illustrated in Figure 4, careful consideration is also required in the selection of the parameter that determines the number of latent variables. For each PLS model, the optimal number of latent variables is determined based on cross-validation, aiming for the smallest RMSECV error. In general, it is preferable to keep the number of latent variables as small as possible.

Characteristic independent variables of the spectrum (i.e., energy shifts) at which a peak occurs can be extracted from the literature for the individual observed variables. Taking these characteristic energy shifts into account when calculating PLS models is therefore considered useful, as it gives additional weight to the corresponding independent variables of the spectrum and improves the model in this way. If these characteristic energy shifts are not known, various methods are available to identify the more important independent variables and weight them more strongly.

The Variable Importance in Projection method, described in Section 2.2.2, was tested first. However, the prediction results were not improved by this simple method, so alternative approaches to selecting key variables were investigated. Attempts to select "key" intervals or several independent variables of the spectrum together also did not lead to better results.

It turned out that the Competitive Adaptive Reweighted Sampling method, also discussed in Section 2.2.2, gave the best results for selecting key variables when building PLS models. As can be observed in Figure 5, CARS identifies fewer key variables than VIP. Nevertheless, the validation results of the PLS model (using glucose concentration as an example) were better when CARS was used, as evidenced by the smaller Root Mean Square Error (see Figure 7 and Table 1).

**Figure 7.** Validation of the PLS glucose model using VIP, CARS and outlier removal methods.

**Table 1.** Root Mean Square Error of glucose concentration prediction and the coefficient of determination (*R*2).


The reference values in Figure 7 represent offline measurements performed with the Cedex Analyzer. In some cases of glucose measurement, the VIP method even leads to worse results than using no variable selection at all, as shown in Table 1 (see RMSE).

Assuming that the Cedex offline measurements are reliable, the training set was examined for spectra representing outliers that could affect the parameters of the PLS model during the learning phase and consequently the prediction accuracy. Applying the Monte Carlo sampling method and calculating the mean error and standard deviation for each PLS model led to the identification of spectra within the dataset that represent outliers, as shown in Figure 6 and discussed in Section 2.2.3. This step allowed a further increase in the accuracy and robustness of the PLS models, as can be observed in Figure 7 and Table 1. In this case, the coefficient of determination for the glucose PLS model is *R*<sup>2</sup> = 0.97, meaning that the model has been further improved compared to using CARS alone (*R*<sup>2</sup> = 0.96). The accurate prediction of the glucose concentration can also be observed in Figure 8, which compares experimental and predicted values using CARS and outlier removal; ideally, all points should lie on a straight line.

**Figure 8.** Validation of the PLS glucose model: comparison of experimental and predicted values using CARS and outlier removal methods.

Table 2 shows the RMSE and coefficient of determination (*R*<sup>2</sup>) for the other constructed PLS models in addition to the glucose PLS model: VCC, TCC, glutamine, glutamate, lactate and ammonium. The results demonstrate that almost all of the PLS models developed provide an accurate prediction of the main process variables (*R*<sup>2</sup> > 0.8); only the PLS model for glutamine has a markedly worse prediction (*R*<sup>2</sup> = 0.33). The reason lies in the following fact. In Raman spectroscopy, glutamine and glutamate are related because they have a similar molecular structure and similar active Raman vibrational modes that produce similar spectral features. Glutamine and glutamate are structurally similar amino acids, both containing a carboxyl group (-COOH) and an amine group (-NH<sub>2</sub>). The main structural difference between them is that glutamate has an additional carboxyl group, while glutamine has an amide group (-CONH<sub>2</sub>) instead. It is important to note that while Raman spectroscopy is a powerful technique for identifying molecules, its resolution is often insufficient to distinguish between similar molecules in a mixture. In such cases, additional techniques, such as chromatographic separation or more sophisticated spectral analysis methods, are required.


**Table 2.** Root Mean Square Error and the coefficient of determination (*R*2) for all other constructed PLS models.

Table 3 shows the best RMSE results for PLS models according to the existing literature [11,37]. A comparison with the data in Tables 1 and 2 shows that our method for building PLS models excels at accurately predicting key variables from Raman spectra. This comparison essentially underlines the effectiveness of our approach. It is particularly noteworthy that our PLS models have an RMSE that is on average three times smaller than the RMSE published in recent research [11,37].

**Table 3.** The best RMSE results for PLS models found in the literature [11,37].


The learning process for the PLS models relied on a single offline measurement (Cedex) of each variable (e.g., glucose) per day. Therefore, only the Raman spectra that matched the offline measurements in time could be used for training. However, once the PLS models were built, all spectra collected every half hour could be used, giving an informative representation of the time course of each variable (see Figure 9). These data are then used in the optimisation to determine the parameters of the dynamic model of the CHO cell kinetics, as described in Section 3.1. Careful examination of the time series for the glucose and glutamine concentrations in Figure 9 reveals a sawtooth pattern caused by the daily manual dosing of nutrients. This pattern is not conducive to the optimal growth of the CHO cells.

The problem can be solved by implementing an automated feeding system that continuously doses the nutrients according to a predefined reference signal. However, such a system requires not only the previously developed soft sensors (PLS models), but also a simulation environment in which a control algorithm can be developed and different scenarios, such as different feeding regimes, the removal of inhibitors and the observation of important process variables, can be investigated. The heart of the simulator, represented by the Simulink scheme in Figure 10, is a dynamic model of CHO cell kinetics, which is explained in Section 3.1. Figure 10 also shows the controller and optimisation blocks, which will be described in more detail in forthcoming scientific publications.

Based on the known process parameters (temperature and pH) and the time series signals of the main process variables (VCC, glucose, glutamine, etc.), it is possible to optimise the parameters of the dynamic model of CHO cell kinetics (presented in Section 3.1). This optimisation aims to align the model results as closely as possible with the measurements of previous batches.

**Figure 9.** Signal reconstruction of key process variables via PLS models.

**Figure 10.** Implementation of a simulator within the Simulink environment based on the CHO cell kinetics model.

For the parameter optimisation, the particle swarm optimisation (PSO) method was used, which makes it possible to find the global minimum of the chosen criterion function while optimising a large number of parameters. In this case, the criterion function was the RMSE, with the final values presented in Table 4.
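A minimal PSO loop of the kind described above can be sketched as follows. This is a generic textbook-style implementation, not the authors' code; the toy criterion function simply stands in for the RMSE between the kinetic model's simulation and the batch measurements, and all hyperparameter values (inertia `w`, acceleration constants `c1`, `c2`) are illustrative defaults:

```python
import numpy as np

def pso(f, bounds, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise a criterion function f (e.g., an RMSE between simulated and
    measured batch trajectories) with a basic particle swarm optimiser."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds[:, 0], bounds[:, 1]
    x = rng.uniform(lo, hi, size=(n_particles, len(lo)))   # particle positions
    v = np.zeros_like(x)                                   # particle velocities
    pbest = x.copy()                                       # personal bests
    pbest_f = np.array([f(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()                     # global best
    for _ in range(n_iter):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)                         # keep inside bounds
        fx = np.array([f(p) for p in x])
        better = fx < pbest_f
        pbest[better], pbest_f[better] = x[better], fx[better]
        g = pbest[pbest_f.argmin()].copy()
    return g, float(pbest_f.min())

# Toy criterion: recover two "kinetic parameters" with known optimum (0.4, 1.2).
target = np.array([0.4, 1.2])
best, best_f = pso(lambda p: np.sqrt(np.mean((p - target) ** 2)),
                   np.array([[0.0, 2.0], [0.0, 2.0]]))
print(best, best_f)
```

In the real setting, `f` would run the CHO kinetics simulation for a candidate parameter vector and return the RMSE against the soft-sensor signals, which is far more expensive per evaluation but structurally identical.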

**Table 4.** Root Mean Square Error and the coefficient of determination (*R*<sup>2</sup>) in the case of predicting all important process variables using the CHO cell kinetics model.


A comparison of glucose concentration measurements from one of the batches with the glucose concentration predicted by the mechanistic model of CHO cell kinetics is shown in Figure 11. The agreement was excellent in this case, with an RMSE of 0.18 g/L and *R*<sup>2</sup> = 0.99.

**Figure 11.** Validation of the CHO cell kinetics model in the case of glucose concentration prediction for the entire batch run.

Furthermore, Figure 12 shows the close agreement between the measurements and the predicted values; ideally, all points would lie on a straight line. However, it is important to note that the available data were limited to only four batches. If a larger number of batches are included in the optimisation process, slight deviations between individual batches are to be expected for the process variables. In the future, it would be beneficial to group the data from the individual batches according to their mutual similarity and then determine the model parameters for each cluster.

The predictions for the other process variables, as shown in Table 4, prove satisfactory when the CHO cell kinetics model is used. Only in the case of glutamine concentration does a somewhat larger error occur, which has already been pointed out. The reason for this is that when the PLS model predicts the time series signal for glutamine with less accuracy, the variance of the "measurements" (derived from the soft sensor) increases. Consequently, the time series signal of glutamine is predicted with lower accuracy by the mechanistic model.

**Figure 12.** Validation of the CHO cell kinetics model: comparison of experimental and predicted glucose concentrations.

#### **5. Discussion**

In developing models that allow the use of soft sensors to monitor key process variables (VCC, TCC, glucose, glutamine, glutamate, lactate and ammonium) in the bioreactor, it was found that using the PLS method alone did not provide the required accuracy and robustness. In particular, with a limited data set (a few batches), the model can be overfitted, leading to a sharp drop in predictive performance on new batches compared to what validation on the limited data suggests.

Since in our work only about 100 spectra with reference measurements were available during the learning phase and Raman spectra contain more than 3000 components, the phase of selecting key variables became crucial for model construction. By using the CARS method, better handling of collinearity between variables was observed, as well as better performance on small data sets and higher robustness compared to the VIP method. As a result, the RMSE was reduced by up to 30%.

It was found that the VIP method further impaired the predictive ability of the models in certain cases, indicating an overfitting problem, as the number of key variables it selected was significantly larger than that required by the CARS method. The VIP method also had stability problems with small samples: minor variations in the data can lead to significant shifts in the scores, making it difficult to extrapolate the results to other data sets. When the VIP scores were calculated from the weighted sum of squares of the PLS loadings, high variability was observed on small data sets.

In Raman spectroscopy, it is important to understand that outlier spectra can occur, influenced by various factors. For example, if the sample in the bioreactor is not evenly mixed, this can lead to deviations in the spectra obtained. Raman spectroscopy derives its readings from the average properties of the area illuminated by the laser. Therefore, a lack of homogeneity in the sample can lead to inconsistent measurements.

Moreover, the components of the sample can play an important role. If components fluoresce under the laser light of the Raman spectrometer, the resulting fluorescence could overshadow the Raman signal and distort the spectra. Additionally, bubbles or particles in the bioreactor can cause scattering or absorption of the laser light, resulting in unpredictable spectra.

Given these potential sources of error, it is important to carefully identify and remove outlier spectra during the modelling phase, as described in Section 2.2.3. This step reduced the root mean square error (RMSE) by a further 10% (in addition to the 30% reduction achieved with the CARS method).
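One common way to flag such outlier spectra is Hotelling's T<sup>2</sup> statistic in a PCA score space; the study's exact procedure is the one in Section 2.2.3, so the sketch below should be read as a generic illustration with invented data (a baseline offset mimicking, e.g., a fluorescence burst or a bubble in the beam path):

```python
import numpy as np
from scipy.stats import f as f_dist
from sklearn.decomposition import PCA

def flag_outlier_spectra(X, n_components=10, alpha=0.05):
    """Flag outlier spectra whose Hotelling's T^2 in a PCA score space
    exceeds an F-distribution-based control limit."""
    n = X.shape[0]
    scores = PCA(n_components=n_components).fit_transform(X)
    t2 = np.sum((scores / scores.std(axis=0, ddof=1)) ** 2, axis=1)
    a = n_components
    limit = a * (n - 1) / (n - a) * f_dist.ppf(1 - alpha, a, n - a)
    return t2 > limit

# Demo: 100 synthetic "spectra" of 500 points; the first 3 get a large
# baseline offset, as a stand-in for fluorescence or bubble artefacts.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 500))
X[:3] += 20.0
flags = flag_outlier_spectra(X)
```

In practice the T<sup>2</sup> check is often paired with the Q (squared prediction error) statistic, so that outliers both inside and outside the PCA subspace are caught.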

The efficient growth and production of desired products by CHO cells requires specific, strictly controlled conditions in the bioreactor. These conditions include the regulation of pH and temperature, which affect cell metabolic rate, protein folding and expression levels. Equally important is the careful control of nutrient content, especially glucose and glutamine, according to a predetermined profile for the duration of the batch.

Another critical factor is the control of inhibitor concentrations. Metabolic by-products such as ammonia and lactate can potentially inhibit cell growth and protein production if they reach high concentrations. Since glucose is the primary source of energy, its concentration directly affects cell metabolism. Too little glucose can starve cells and inhibit growth, while too much glucose can cause osmotic stress or trigger overproduction of waste products such as lactate.

Given these complexities, the use of an automated bioreactor control system is essential for CHO cell cultivation. Such a system offers several advantages, including maintaining consistent conditions, real-time monitoring, reducing human error and improving efficiency and scalability. Given the significant costs associated with realistic bioreactor experiments, the development of a simulation environment is essential. This environment enables the creation of control algorithms and the evaluation of the effects of different parameters on cell growth and productivity.

The main reason for the lack of advanced automated control techniques in cell culture bioprocesses and bioreactor operations is that these techniques require robust and reliable measurement methods that are available on site. Concentrations of nutrients and metabolites, cell densities and viability are often not measured and are either uncontrolled or controlled only manually with long sampling times (12–24 h, as shown in Figure 9). As a result, possible process disturbances may only be detected after long delays, making it difficult to take corrective action and increasing the risk of batch losses.

For the development of an advanced simulation environment, the choice of a CHO kinetic model is also crucial. The chosen model should represent the complex kinetics of CHO cells in sufficient detail. Simpler models based on the Monod equation, for example, are often inadequate in this respect. More complex models, however, pose the challenge of determining numerous parameters that can only be accurately determined with a suitable optimisation method and sufficiently heterogeneous data. In our study, the parameters of a dynamic model of CHO cell kinetics were successfully determined using the PSO method.

To enable the development of a predictive control algorithm, the complex kinetics model will be simplified and linearised, and online adjustment of the (adaptive) model parameters will be facilitated. This adjustment is made possible by an optimisation method that uses the measurements of the current batch to facilitate the online parameterisation.

Future efforts include the development of a model predictive control algorithm based on the simplified model of CHO cell kinetics. Subsequently, the monitoring and control system will be integrated into a real bioreactor. Finally, a practical test of the implemented system will be carried out.

#### **6. Conclusions**

This study demonstrates the significant advances in fully automated feeding of CHO cells achieved through the development of advanced models, soft sensors and a novel simulation environment. The research required a thorough understanding of various chemometric methods and demonstrated their context-specific application in combination with Raman spectroscopy. It demonstrated the effectiveness of CARS-PLS and an outlier removal method in overcoming difficult challenges such as high dimensionality, multicollinearity and outlier detection.

The models created are versatile and scalable and can be applied to a wide range of products, media and cell lines based on CHO host cells. They can be conveniently scaled up for use in large pilot studies and extensive manufacturing processes. However, the success of these methods depends not only on the right choice of techniques, but also crucially on the quality of the input data. Therefore, the preprocessing of the data to remove interfering signals is of the utmost importance.

Raman spectra have no inherent value on their own, but when integrated with the appropriate models, they allow for the creation of a sophisticated measurement system. This system, which consists of soft sensors, is used for real-time monitoring and control of important process variables. The measurements reconstructed with these soft sensors play a crucial role in the design of the simulation environment, which significantly accelerates and reduces the cost of developing control algorithms and thus the automated nutrient dosing system. In essence, this study provides essential insights into the pragmatic application of Raman spectroscopy and innovative methods that form a solid foundation for further research and development in the field of automated cell feeding.

**Author Contributions:** Methodology, S.T. and I.Š.; Software, S.T.; Validation, S.T. and I.Š.; Formal analysis, S.T.; Investigation, S.T.; Data curation, S.T.; Writing—original draft, S.T. and I.Š.; Writing—review & editing, I.Š.; Supervision, I.Š.; Project administration, I.Š. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to trade secrets.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
