1. Introduction
Verified relations between process parameters and cell-specific characteristics are the key knowledge needed to develop robust and scalable bioprocesses [1]. This knowledge is often based on regression analysis of historical data, where all possible relations are inspected [2]. Besides classical process parameters such as pH, temperature, and dissolved oxygen, scalable biomass-specific reaction rates are included in this analysis both as predictor variables (such as the product formation rate) and as regressor variables (such as the growth rate µ or substrate uptake rates) [3,4,5,6]. Knowledge of the interdependencies among these reaction rates eases scale-up, process design, and control.
The calculation of biomass-specific rates is based on different data sources, including online signals and offline measurements and their temporal changes [7]. Different approaches can be followed to obtain the targeted rates, all of which include a series of evaluation procedures to preserve the information contained in the signals. Smoothing and spline fits can be a good way to obtain smooth time derivatives of noisy measurements [8], but their use becomes critical when only few measurements are available, with the risk of smoothing out important biological events. Due to the low number of measured samples, finite differences between two subsequent measurements are most commonly used in biotechnology. As shown in [7,9], the accuracy of the calculated rates depends strongly on the measurement frequency and the signal-to-noise ratio of the measurement methodology. This results in a trade-off between laborious high-frequency sampling combined with smoothing and spline fits and low-frequency sampling with finite differences between the few measurement points.
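To make the finite-difference approach concrete, the following sketch (our illustration, not taken from the cited works; the sampling times, concentrations, and variable names are assumed placeholders) approximates a biomass-specific substrate uptake rate from two subsequent offline samples of an unfed batch phase:

```python
import numpy as np

# Hypothetical offline samples of a batch phase: time [h],
# glucose [g/L], and biomass [g/L]
t   = np.array([0.0, 2.0, 4.0, 6.0])
glc = np.array([20.0, 14.5, 8.9, 3.1])
X   = np.array([1.0, 2.1, 4.3, 8.6])

# Finite difference between two subsequent measurements:
# volumetric uptake rate divided by the interval-average biomass
dt   = np.diff(t)
r_s  = -np.diff(glc) / dt            # volumetric uptake rate [g/(L*h)]
X_av = 0.5 * (X[:-1] + X[1:])        # interval-average biomass [g/L]
q_s  = r_s / X_av                    # biomass-specific rate [g/(g*h)]

print(q_s)                           # one rate per sampling interval
```

With only a handful of samples, each rate value rests on just two measurements, which is exactly why measurement error and sampling frequency dominate the achievable rate accuracy.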
To circumvent this trade-off, kinetic models can be used [10]. The underlying reaction dynamics are described by a mass balance model with different kinetic terms, e.g., first-order or Monod kinetics [11]. By fitting the model to the data, realistic rate trajectories can be deduced from the model dynamics [12] or by employing model-based state estimation techniques [13,14]. Although the approximation of reaction rates based on an underlying model combined with offline and online measurements leads to good results, an appropriate model and knowledge of the internal reaction dynamics are needed, which are often not available for biotechnological processes.
If no exact reaction kinetics are known, a constant first-order rate between two measurements can be assumed. In this case, the reaction rates can be determined by solving the mass balance, where the state change is described by a general material balance including input, conversion, and output terms. By minimizing the error between the balance equation and the included measurement points, the optimal constant rate can be determined for the analyzed time interval, which must contain at least two measurement points [15].
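A minimal sketch of this balance-based rate estimation (our illustration under simplifying assumptions: a batch interval without inflow or outflow terms, biomass interpolated linearly between samples, and purely illustrative numbers) fits a constant biomass-specific rate q so that the integrated balance dC/dt = q·X(t) best matches the measurements of the interval:

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

# Hypothetical measurements within the analyzed interval
t_meas = np.array([0.0, 2.0, 4.0])     # time [h]
c_meas = np.array([20.0, 14.5, 8.9])   # metabolite concentration [g/L]
x_meas = np.array([1.0, 2.1, 4.3])     # biomass concentration [g/L]

def simulate(q):
    """Integrate dC/dt = q * X(t) for a constant biomass-specific rate q."""
    rhs = lambda t, c: [q * np.interp(t, t_meas, x_meas)]
    sol = solve_ivp(rhs, (t_meas[0], t_meas[-1]), [c_meas[0]], t_eval=t_meas)
    return sol.y[0]

def sse(q):
    """Squared error between the balance equation and the measurements."""
    return np.sum((simulate(q) - c_meas) ** 2)

# Optimal constant rate over the interval (negative sign = consumption)
q_opt = minimize_scalar(sse, bounds=(-10.0, 10.0), method="bounded").x
print(q_opt)
```

The same structure extends to fed-batch intervals by adding the feed and dilution terms of the general material balance to the right-hand side.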
No matter which calculation approach is followed, the underlying measurements are prone to errors, which propagate throughout all rate evaluation and regression procedures [16]. To propagate these uncertainties through mathematical functions, a calculus-based approximation or a functional approach can be followed [17,18]. The calculus-based approximation propagates the uncertainty by mathematical error propagation laws, whereas the functional approach re-evaluates the function with the expected or observed ranges of the measurements. Although computationally efficient, the propagation rules for the calculus approach need to be derived specifically for every evaluation procedure, which is not straightforward for more sophisticated functions such as least squares regressions or differential equation solvers [17]. Due to its easier implementation and generic applicability, especially in numerical- and spreadsheet-based evaluation software, the functional approach is often preferred [18]. By re-evaluating the function with the largest measurement deviations, upper and lower confidence bounds on the results can be determined. Since measurement errors occur randomly in practice, this procedure potentially overestimates the propagated errors and yields a worst-case scenario rather than a realistic error estimate, which hinders the interpretability of the calculated outputs, as discussed in [9] for biotechnological processes.
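For reference, the calculus-based approximation builds on the standard first-order (Gaussian) error propagation law, written here for a result f of uncorrelated inputs x_i with standard uncertainties \sigma_{x_i} (a textbook relation, not a formula taken from the cited works):

\sigma_f^2 \approx \sum_i \left( \frac{\partial f}{\partial x_i} \right)^2 \sigma_{x_i}^2

For correlated inputs, additional covariance terms appear, which is one reason why deriving such rules quickly becomes cumbersome for more complex evaluation procedures.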
With today’s computational power, Monte Carlo sampling approaches are gaining more and more attention [19]. They consist of repeating the calculations while varying the inputs randomly within the stated limits of precision [20]. According to [21], error propagation based on Monte Carlo sampling is the most reliable approach to assign realistic errors to calculated results. For biotechnological processes, Monte Carlo methods have already been used successfully to determine rate calculation errors [13], confidence bounds of model parameters [22], and simulation outputs [23]. The determination of realistic uncertainties of the target variables is of central relevance for further correlation and regression analysis, where visual inspection, model identification, and process design can be significantly facilitated by the inclusion of measurement and calculation uncertainties. For specific reaction rates, uncertainties in the range of 20% have been reported to be suitable for conclusive interpretation [13] and process control [24].
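As an illustrative sketch of such a Monte Carlo propagation (assuming normally distributed measurement errors with known relative standard deviations; all numbers and variable names are placeholders), the finite-difference rate calculation from above can simply be repeated on randomly perturbed inputs and summarized by percentiles:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Measured values and assumed relative standard deviations (placeholders)
t   = np.array([0.0, 2.0, 4.0, 6.0])
glc = np.array([20.0, 14.5, 8.9, 3.1]); sd_glc = 0.03 * glc   # 3 % error
X   = np.array([1.0, 2.1, 4.3, 8.6]);   sd_X   = 0.05 * X     # 5 % error

def specific_rate(glc, X):
    """Finite-difference biomass-specific uptake rate per sampling interval."""
    r_s = -np.diff(glc) / np.diff(t)
    return r_s / (0.5 * (X[:-1] + X[1:]))

# Monte Carlo: re-evaluate the rate calculation on perturbed inputs
n_mc  = 10_000
draws = np.array([
    specific_rate(rng.normal(glc, sd_glc), rng.normal(X, sd_X))
    for _ in range(n_mc)
])

q_median   = np.percentile(draws, 50, axis=0)
q_lo, q_hi = np.percentile(draws, [2.5, 97.5], axis=0)   # 95 % interval
print(q_median, q_lo, q_hi)
```

The relative width of such intervals is what allows judging whether a given rate, e.g., the specific growth rate, stays within the roughly 20% uncertainty level mentioned above.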
Bivariate and multivariate regressions are standard analyses to identify and describe input-output dependencies. Although weighted least squares regression is able to include errors on the predictor variables, possible errors on the regressor variables are often not considered [25]. York (1966) [26] introduced an algorithm that enables linear regression for data with errors in both the regressor and predictor variables. In addition to finding the best-fitting parameters in the case of imperfect measurements, other important outputs of regression analysis are the parameter and prediction confidence intervals, which provide information on the reliability of the identified relations. According to [27], Monte Carlo sampling is also well suited to evaluate uncertainty in regression analysis, as shown in [21] for geochronology and in [22] for determining the parameter confidence intervals of a nonlinear biotechnological model.
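A minimal sketch of such a Monte Carlo treatment of regression uncertainty (our illustration, not York's algorithm itself: both variables are perturbed within their stated errors and an ordinary straight-line fit is repeated each time; all data are placeholders) could look as follows:

```python
import numpy as np

rng = np.random.default_rng(seed=2)

# Rate data with propagated standard errors in both variables (placeholders)
x, sx = np.array([0.10, 0.20, 0.30, 0.40, 0.50]), 0.01   # e.g., regressor rate
y, sy = np.array([0.05, 0.11, 0.14, 0.21, 0.24]), 0.01   # e.g., predicted rate

# Monte Carlo regression: perturb both variables and refit a straight line
n_mc  = 5_000
betas = np.array([
    np.polyfit(rng.normal(x, sx), rng.normal(y, sy), deg=1)
    for _ in range(n_mc)
])

slope_mean, intercept_mean = betas.mean(axis=0)
slope_ci = np.percentile(betas[:, 0], [2.5, 97.5])        # parameter CI

# Pointwise 95 % confidence band of the regression line
x_grid = np.linspace(x.min(), x.max(), 50)
lines  = betas[:, 0, None] * x_grid + betas[:, 1, None]
band   = np.percentile(lines, [2.5, 97.5], axis=0)
print(slope_mean, intercept_mean, slope_ci, band.shape)
```

The resulting parameter draws also yield the parameter covariance directly (e.g., np.cov(betas.T)), and the same resampling scheme carries over to errors-in-variables estimators such as York's method or orthogonal distance regression.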
For targeted process development, efficient process transfer, and the definition of operational spaces, it is important to deduce reliable and transferable information. Within this contribution, we therefore show a generic Monte Carlo error propagation approach to obtain realistic error estimates on both the regressor and predictor variables based on real measurement errors and how to use them to determine the uncertainty of subsequent regression analyses. Based on the known uncertainty of the target variables, the most suitable one can be selected for subsequent regression, and the expected impacts can be determined. This greatly facilitates drawing the right conclusions and setting realistic expectations, leading to quicker process development and a shorter time to market.
The paper is organized as follows: The determination of the scalable reaction rates and their errors by Monte Carlo sampling is described in Section 2. In the subsequent Section 3, the propagated errors are included in the regression analysis, resulting in trustworthy confidence bounds for the parameters and predictions. Based on these regression models, effective control limits for an E. coli and a CHO fed-batch process were established. After discussing the relevance of the obtained results (Section 4), the contribution concludes with a strong suggestion to include measurement errors wherever possible (Section 5), which is greatly facilitated by Monte Carlo sampling procedures.
5. Conclusions
Within this contribution, a universally applicable procedure was shown to propagate measurement errors through bioprocess evaluation with the aim of achieving valid correlations between target variables and reliable control limits for manipulable variables, as schematically displayed in Figure 8. The procedure consists of propagating the raw measurement errors through a series of data evaluation methodologies before the determined errors on both the regressor and predictor variables are included in a regression analysis. Based on the determined regression uncertainty, expected results and effective control limits can be predicted to meet the process needs.
Based on two industrially relevant organisms, E. coli and CHO cells, its applicability to biotechnological cultivations was shown. For the calculation of the cell-specific uptake and production rates, the propagation procedures revealed that, with typical sampling frequencies, the specific growth rate is determined with the lowest precision (relative error of approximately 20%), whereas the other specific rates could be determined with higher precision (errors below 10%). These precisions are important for further regression analysis as well as for monitoring and control considerations.
Through simple linear regression analysis, a correlation between the biomass-specific substrate uptake rate and the production rate could be established for E. coli, and the relation between the cell-specific glutamine uptake rate and lactate formation was identified for the CHO cell process. Using these propagated errors, realistic distributions of the regression parameters, their covariance, and the prediction confidence intervals could be determined. With error weighting on both the predictor and regressor variables, the confidence bounds could be narrowed significantly without the need for additional data points.
Within three use cases, the usefulness of the error propagation was assessed. For the two examined organisms, control limits could be successfully established to guarantee high production rates in an E. coli fed-batch and to avoid excessive lactate formation in a CHO cell fed-batch. In addition, probabilistic decisions were possible, as shown for the determination of the harvest time point. Based on this, we avoided imperfect measurements being wrongly interpreted, ensuring consistent decisions and the extraction of relevant information, which are important to continuously improve and guarantee the quality of biochemical processes. A sound inclusion of measurement uncertainty and its propagation through the process evaluation can additionally reduce the number of experimental iterations needed during process development and enable the assessment of the measurement accuracies required to reach the targeted regression and control accuracies.