A Deep Learning Approach to Optimize Recombinant Protein Production in Escherichia coli Fermentations
Round 1
Reviewer 1 Report
Review Report
MS title: Fermentation MS “A deep learning approach to optimize recombinant protein production in E. coli fermentations”
Comments:
1. This reviewer has a hard time understanding the ultimate objective of this study: whether it is to predict and optimize recombinant protein production, as stated in the title, or simply to predict the final cell density as OD600. From a perspective of a biotechnologist in fermentation, the key point is that one can neither estimate a target protein titer nor predict inclusion body concentration from final OD of a fermentation, because OD value has no linkage to target protein concentration. Unfortunately, this MS does not provide any model for target protein production profile.
2. Some of the CPPs listed in Table 2 have incorrect concepts and units. For example, pure oxygen and compressed air should have a flow rate unit such as L/min. Also, what is critical for induction is the feed rate of the fed-batch medium, which is missing; only the induction start times are shown in Fig. 2(k). And it was stated in the text they expressed different proteins, but the protein types were not listed. What is meant by Spump1 through 4 should have been clearly explained.
3. The authors included the fermentation batches with standard and canonical progress, excluding other batches and trimming out inconsistent data points. This could lead one to doubt the soundness or validity of the machine learning outcome. A table of the batches that were excluded in the fermentation selection is necessary to show why they were excluded.
4. [Equation (2)] What polynomial equation was used? What was the reason for setting α = β = 0.5? Is it from a theoretical background or experimental observations?
5. For a fermentation scientist to understand the deep learning model, more thorough explanation of Fig. 5 is needed. For example, what are the functions and outputs of each “layer” and “RNN/LSTM module”.
6. [Fig. 6] What is the difference between LSTM and RNN? In most batches, they overlapped each other with the measured values; however, in (b) and (c) they deviated much from the measured values. Any explanations for the deviations?
In sum, despite the interesting approach to apply deep learning algorithm to microbial fermentation OD profiles, I recommend “Rejection” of this MS. Only after a comprehensive revision reflecting the above-mentioned comments, it may be resubmitted to the Fermentation journal.
Author Response
- This reviewer has a hard time understanding the ultimate objective of this study: whether it is to predict and optimize recombinant protein production, as stated in the title, or simply to predict the final cell density as OD600.
We trained our ML model to predict real-time OD600 values from the historical series of the CPPs. This black-box model allows early termination of a process recognized to be drifting, by monitoring real-time CPP values. Furthermore, it is trained on CPPs only, i.e., we did not include as input any MPP, upon which (by definition) no control would be possible. Thus, the (black-box) relationship we developed between OD600nm and CPP values paves the way for an ML-driven control system by offering a reliable alternative to numerous trial-and-error experiments to identify the optimal fermentation conditions and related yields. In fact, the optimal CPP setpoints maximizing the OD600nm value are recommended in real time by the prediction algorithm on the basis of the relationship learned (though not in analytical form) from the training data.
From a perspective of a biotechnologist in fermentation, the key point is that one can neither estimate a target protein titer nor predict inclusion body concentration from final OD of a fermentation, because OD value has no linkage to target protein concentration.
As we used the same expression vector, bearing the same neurotrophin, which was expressed as a recombinant protein in the same medium, it was reasonable to hypothesize that an increase in biomass production will directly correspond to an increase in inclusion body production and therefore to an increase in recombinant protein production. So, having fixed the medium, strain and expression vector, we hypothesize, as a starting point, a correlation between culture OD600nm and inclusion body production.
This starting hypothesis agrees well with the correlation between those three parameters reported in Table I. We apologize for a typographical error in the inclusion bodies value of run 12, which was 12 g instead of the 20.2 reported in the table (20.2 is the IB % in the biomass). This has been corrected in the table, and a further column reporting the OD600nm/IB ratio has been added.
However, we agree with you that OD600nm is not predictive of recombinant protein production outside our specific case.
- Some of the CPPs listed in Table 2 have incorrect concepts and units. For example, pure oxygen and compressed air should have a flow rate unit such as L/min.
Please consider that, in the Lucullus interface running the fermenters, we had imposed a control loop on the dissolved oxygen value. When this value fell below 50%, the loop automatically increased first the stirrer speed, then the compressed air flow (from 0.5 L/min to 1.5 L/min), and then the pure oxygen flow (from 0 to 1 L/min). The system records these values cumulatively in an Excel file, which explains why the unit is liters and not liters per minute. The cumulative values recorded by Lucullus are then transformed into non-cumulative data for processing.
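The cumulative-to-non-cumulative conversion mentioned above can be sketched as follows. This is a minimal illustration only: the function name, the sample values, the 1-minute sampling interval, and the assumption that the counter starts at zero are all hypothetical and are not taken from the manuscript or from the Lucullus export.

```python
import numpy as np

def cumulative_to_rate(cumulative_l, interval_min=1.0):
    """Convert cumulative gas volume (L) into a per-interval flow rate (L/min)."""
    cumulative_l = np.asarray(cumulative_l, dtype=float)
    # Volume delivered in each interval; the first sample is measured
    # from an assumed starting total of zero.
    increments = np.diff(cumulative_l, prepend=0.0)
    return increments / interval_min

# Hypothetical log: cumulative litres of compressed air, one sample per minute.
air_total = [0.5, 1.0, 1.5, 2.5, 4.0]
air_rate = cumulative_to_rate(air_total)
```

With these made-up totals, the recovered rates stay within the 0.5 to 1.5 L/min range quoted for compressed air.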
Also, what is critical for induction is the feed rate of the fed-batch medium, which is missing; only the induction start times are shown in Fig. 2(k).
The feeding rates have been inserted in the manuscript; they range from 0.3 mL/min to 0.9 mL/min.
And it was stated in the text they expressed different proteins, but the protein types were not listed.
You are right. We are expressing neurotrophins; more precisely, it is the same protein in two different versions. The first one, used in the first 7 fermentations, corresponds to an insoluble version of the protein and therefore accumulates in inclusion bodies. The second one is a soluble version, and for this reason there is no production of inclusion bodies. The fed-batch is performed at 20°C overnight in order to favor protein refolding. This has been adjusted in the text.
What is meant by Spump1 through 4 should have been clearly explained.
Sorry for the lack of precision. Spump (pump) No. 1 is the base, No. 2 corresponds to acid, No. 3 to antifoam, and No. 4 to feeding. This has been specified in Table 2.
- The authors included the fermentation batches with standard and canonical progress, excluding other batches and trimming out inconsistent data points. This could lead one to doubt the soundness or validity of the machine learning outcome. A table of the batches that were excluded in the fermentation selection is necessary to show why they were excluded.
This is already described in the manuscript in the 2.2.1 section.
Being in the R&D department, we explored the design space, which did not always lead to improvements, and some fermentations were directly excluded from the machine learning studies.
The first 7 fermentations were used to roughly explore fermentation temperatures and medium composition. Run 8 can be considered an optimization starting point and was chronologically the first fermentation included in the selected panel.
Fermentation analyses were performed post-run, so at this stage of software development, fermentation success is not yet linked to the validity of the program, which today can only predict the final OD600nm.
Some batches were not selected either because technical problems occurred during the fermentation (for example, in batches 9 and 10, a cooling problem was encountered during the night just after the fermentation, affecting the biomass harvest the following day), or during medium preparation (as in run 13), or because the composition of the culture medium was changed and glucose was substituted with glycerol (runs 15, 18, 21). In general, we decided to exclude from the training set all those batches whose experimental conditions were not representative of the future industrial fermentations in the GMP plant. In fact, the recommendations provided by the model at hand will have to help Dompe’s R&D Biotech Process Development laboratory in maximizing production yield in a way that can be transferred to and implemented in the production plant. On this basis, the first 7 fermentations reported in Table I were selected for their consistency and for results that can be used as a starting point for further industrial process scale-up.
Thus, we focused on production admissible CPP ranges and restricted our Applicability Domain accordingly.
A table describing all fermentations is available in the supplementary material; its comments give the reasons why each batch was processed or not.
- [Equation (2)] What polynomial equation was used? What was the reason for setting α = β = 0.5? Is it from a theoretical background or experimental observations?
We used a 6th-degree polynomial, which adds a smoothing component to the non-differentiable (piecewise linear) curve obtained from local linear interpolation. In doing so, we separately fitted the linear and polynomial interpolation curves and then linearly combined the two using the weights α and β. By setting α = β, we assigned the same weight to the linear and polynomial interpolation curves in constructing the final fitted value, i.e., an arithmetic mean of the linearly and polynomially interpolated values. The choice of a 6th-degree polynomial and the choice of α = β were driven by experimental observations. In fact, although these choices do not guarantee overall smoothness of the final interpolating curve, as would be the case with α = 0, the combined choice effectively reduces the unphysical effect of discontinuities in the derivative of the final interpolating curve at the known experimental values, while at the same time preventing unrealistic swings between consecutive values, as would occur with a fitting polynomial of higher degree.
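A minimal sketch of such a blended interpolation is given below, under stated assumptions: α is taken to weight the piecewise-linear curve and β the polynomial one (consistent with α = 0 giving a fully smooth curve), the polynomial is fitted by least squares on rescaled time, and the function name and OD600 values are purely illustrative rather than taken from the manuscript or its code.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def blended_interpolation(t_known, y_known, t_query,
                          degree=6, alpha=0.5, beta=0.5):
    """Weighted sum of a piecewise-linear and a polynomial interpolant."""
    t_known = np.asarray(t_known, dtype=float)
    y_known = np.asarray(y_known, dtype=float)
    t_query = np.asarray(t_query, dtype=float)

    # Piecewise-linear interpolation between the measured points.
    linear = np.interp(t_query, t_known, y_known)

    # Least-squares polynomial fit; time is rescaled to [0, 1]
    # to keep the fit numerically well conditioned.
    t_min = t_known.min()
    span = t_known.max() - t_min
    coeffs = P.polyfit((t_known - t_min) / span, y_known, degree)
    poly = P.polyval((t_query - t_min) / span, coeffs)

    return alpha * linear + beta * poly

# Hypothetical off-line OD600 measurements (time in hours).
t_known = np.array([0, 2, 4, 6, 8, 10, 12, 14, 16], dtype=float)
od_known = np.array([0.1, 0.4, 1.5, 5.0, 12.0, 25.0, 40.0, 52.0, 58.0])
t_query = np.linspace(0.0, 16.0, 33)
od_fit = blended_interpolation(t_known, od_known, t_query)
```

Setting `alpha=1.0, beta=0.0` recovers the purely piecewise-linear curve, which passes exactly through the measured points; the default equal weights give the arithmetic mean of the two curves.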
- For a fermentation scientist to understand the deep learning model, more thorough explanation of Fig. 5 is needed. For example, what are the functions and outputs of each “layer” and “RNN/LSTM module”.
Thank you. As also suggested by Reviewer 2, Figure 5 and its caption have been improved in order to provide more details to readers who are not experts in deep learning. More details can also be inferred from the code provided. However, a comprehensive explanation of LSTM cells is beyond the scope of the current work; we suggest https://colah.github.io/posts/2015-08-Understanding-LSTMs/ for an accessible explanation.
- [Fig. 6] What is the difference between LSTM and RNN? In most batches, they overlapped each other with the measured values;
From a methodological perspective, LSTMs are more advanced algorithms that have often been shown to model time series better. They can be regarded as a more sophisticated version of RNNs. However, being more sophisticated, they also comprise more parameters to be optimized and are often prone to overfitting (poor generalization). For this reason, RNNs were also investigated in this work. From a quantitative perspective, Table 4 shows that the two can model the data equally well. Figure 6 shows the same in a qualitative manner.
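To make the remark about parameter counts concrete, the short sketch below compares the trainable parameters of one simple (Elman) RNN layer and one LSTM layer using the standard textbook formulas, assuming a single bias vector per gate. The hidden size of 32 is a hypothetical value chosen for illustration; only the 11 input CPPs come from the text.

```python
def rnn_params(n_inputs, n_hidden):
    """Trainable parameters of a simple RNN layer:
    input weights + recurrent weights + one bias vector."""
    return n_hidden * (n_inputs + n_hidden) + n_hidden

def lstm_params(n_inputs, n_hidden):
    """An LSTM layer holds four such weight sets
    (input, forget and output gates plus the cell candidate)."""
    return 4 * rnn_params(n_inputs, n_hidden)

n_cpps = 11      # number of input CPPs, as stated in the responses
n_hidden = 32    # hypothetical hidden size, not taken from the manuscript

print(rnn_params(n_cpps, n_hidden))   # 1408
print(lstm_params(n_cpps, n_hidden))  # 5632
```

For the same hidden size, the LSTM layer therefore carries four times as many parameters as the simple RNN layer, which is one reason it can be more prone to overfitting on a small number of batches.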
however, in (b) and (c) they deviated much from the measured values. Any explanations for the deviations?
Yes, you are perfectly right on this point: those two batches are different and deviate from the model. Your observation gives us the chance to provide a broader discussion of the results.
For batch 11, this can be explained by the fact that the maximum reachable stirrer speed was higher than in the other batches. In fact, in Figure 2b, where the stirrer speeds are reported, we can observe that the orange curve, corresponding to batch 11, has higher maxima than the others. In this fermentation the maximum stirrer speed was set to 1700 rpm, whereas in the others it was 1600 rpm. As the stirrer speed has a positive influence on the dissolved oxygen concentration, the higher mixing speed reduced the oxygen consumption by 23% compared with batch 24, for example. So two of the eleven CPPs were modified by the different fermenter setting.
For run 12, still in Figure 2 but this time in panel a, corresponding to the culture pH, we can observe a different trend for this batch. This was due to a bad connection of pump 1, which dispenses the base. In fact, the pump was running inefficiently and, even though it was turned on, no base was added to the medium. This can be seen in panel g of the same figure. When the problem was solved by the operator, the pH returned to the correct value and the curve followed the same trend as the others. So again, in this fermentation, two values (Spump1 and pH) were affected by a technical problem.
We have added the above analysis in section 6 (Conclusions) of the paper.
Reviewer 2 Report
This manuscript proposed two deep learning-based models, RNN model and LSTM model to predict the protein production in E. coli fermentations. The manuscript is clear, however, two issues need to be fixed:
1. Figure 2 plots the related features against sample size, but how does it help to build the deep learning models? The results in this figure are not analyzed and explained until section 5 (only panels 2g and 2e are discussed). On the other hand, the results of dissolved oxygen (Fig. 2b) and stirrer (Fig. 2d) are very hard to read.
2. Figure 5 visualizes the LSTM model, but the visualization is not clear. On the left, it seems that the hidden layers and the LSTM layer operate only on the first time-point of the input sequence. I suggest that the authors replace this visualization to better convey the model details.
Author Response
This manuscript proposed two deep learning-based models, RNN model and LSTM model to predict the protein production in E. coli fermentations. The manuscript is clear, however, two issues need to be fixed:
- Figure 2 plots the related features against sample size, but how does it help to build the deep learning models? The results in this figure are not analyzed and explained until section 5 (only panels 2g and 2e are discussed). On the other hand, the results of dissolved oxygen (Fig. 2b) and stirrer (Fig. 2d) are very hard to read.
Sample size in Fig. 2 refers to the timestamp of the time series; we have replaced the x-axis label to avoid confusion. Fig. 2b and Fig. 2d have been replaced by a subsampled version (by a factor of 10) for better readability. In any case, the full data are available at the link provided at the end of the paper.
For the sake of clarity, there is no direct (explicit) analytical relationship between the functional forms of the inputs and the network predictions, since we are building a "black box" model. We included the graphs describing the historical series of the input variables for the reader’s awareness, but the most important point here is to choose an architecture suitable for taking as input and processing the raw data coming from the equipment. For example, the input layer is made up of 11 neurons since we are handling 11 CPPs, while the remaining hyperparameters were optimized by conducting a grid search and a tailored model validation procedure.
- Figure 5 visualizes the LSTM model, but the visualization is not clear. On the left, it seems that the hidden layers and the LSTM layer operate only on the first time-point of the input sequence. I suggest that the authors replace this visualization to better convey the model details.
Dear reviewer, Figure 5 has been updated in the revised version in order to clarify your point: indeed the LSTM is applied to the whole time series, with a time window of 20 and a stride of 5.
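The windowing mentioned above (a time window of 20 with a stride of 5) can be illustrated with the sketch below. The function name, the batch length of 100 time steps, and the random data are assumptions for illustration only; the authors' own code may build the windows differently.

```python
import numpy as np

def sliding_windows(series, window=20, stride=5):
    """Cut a (time, features) series into overlapping windows.

    Returns an array of shape (n_windows, window, features).
    """
    series = np.asarray(series)
    if len(series) < window:
        raise ValueError("series is shorter than the window length")
    starts = range(0, len(series) - window + 1, stride)
    return np.stack([series[s:s + window] for s in starts])

# Hypothetical batch: 100 time steps of 11 CPPs (synthetic values).
batch = np.random.default_rng(0).normal(size=(100, 11))
windows = sliding_windows(batch)
print(windows.shape)  # (17, 20, 11)
```

Each window overlaps the previous one by 15 time steps, so the recurrent layer sees the whole series rather than only its first segment.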
Reviewer 3 Report
The authors present a generic deep learning model to optimize the fermentation yield in E. coli based on recurrent neural networks (RNN) and long short-term memory (LSTM) neural networks. The manuscript has detailed descriptions of the data pre-processing, algorithm and model performance evaluation, making it easy for non-ML experts to follow. The model is trained and validated with 11 pre-processed datasets using LOOCV, and the model prediction performance is evaluated with RMSE and REFY. The model performance supports the model as a promising tool to design strategies for E. coli fermentation optimization. I suggest that the paper be published after revising the following two issues.
· The model is only trained with 11 datasets. Can author explain more how the model will minimize the overfitting issue for the data?
· There are other institutions and companies working on fermentation optimization using AI/ML, such as Benchling. Authors should have a brief review of this area and highlight novelty or practicability of this deep learning model compared to other established methods.
1. line 127: "then centrifuged and washed twice with the same buffer" to "then centrifuged and the pellet was washed twice with the same buffer"
Author Response
The authors present a generic deep learning model to optimize the fermentation yield in E. coli based on recurrent neural networks (RNN) and long short-term memory (LSTM) neural networks. The manuscript has detailed descriptions of the data pre-processing, algorithm and model performance evaluation, making it easy for non-ML experts to follow. The model is trained and validated with 11 pre-processed datasets using LOOCV, and the model prediction performance is evaluated with RMSE and REFY. The model performance supports the model as a promising tool to design strategies for E. coli fermentation optimization. I suggest that the paper be published after revising the following two issues.
- The model is only trained with 11 datasets. Can author explain more how the model will minimize the overfitting issue for the data?
We trained a model with only a few layers to minimize potential overfitting. Moreover, we used cross-validation for hyperparameter tuning, a widespread state-of-the-art technique to prevent overfitting.
Some batches were not selected either because technical problems occurred during the fermentation (for example, in batches 9 and 10, a cooling problem was encountered during the night just after the fermentation, affecting the biomass harvest the following day), or during medium preparation (as in run 13), or because the composition of the culture medium was changed and glucose was substituted with glycerol (runs 15, 18, 21). In general, we decided to exclude from the training set all those batches whose experimental conditions were not representative of the future industrial fermentations in the GMP plant. In fact, the recommendations provided by the model at hand will have to help Dompe’s R&D Biotech Process Development laboratory in maximizing production yield in a way that can be transferred to and implemented in the production plant. On this basis, the first 7 fermentations reported in Table I were selected for their consistency and for results that can be used as a starting point for further scale-up to the industrial process.
A complete table with comments summarizing all fermentations is available in the supplementary material.
Thus, we focused on production admissible CPP ranges and restricted our Applicability Domain accordingly and reached an acceptable compromise between bias and variance errors.
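The leave-one-out cross-validation (LOOCV) over 11 batches referred to in this exchange can be sketched as follows. This is only an illustration of the splitting and scoring scheme: the final OD600 values are synthetic, and the "model" is a stand-in that predicts the mean of the training batches, not the RNN or LSTM of the manuscript.

```python
import numpy as np

def leave_one_out_splits(n_batches):
    """Yield (training indices, held-out index) for every batch in turn."""
    for held_out in range(n_batches):
        train = [i for i in range(n_batches) if i != held_out]
        yield train, held_out

# Synthetic final OD600 values for 11 batches (illustrative only).
rng = np.random.default_rng(0)
final_od = rng.uniform(40.0, 60.0, size=11)

errors = []
for train_idx, test_idx in leave_one_out_splits(len(final_od)):
    # Stand-in for fitting the network on the 10 training batches.
    prediction = final_od[train_idx].mean()
    errors.append(prediction - final_od[test_idx])

# Root mean squared error over the 11 held-out predictions.
loocv_rmse = float(np.sqrt(np.mean(np.square(errors))))
```

Because every batch is held out exactly once, each prediction is scored on data the model did not see during fitting, which is what makes the resulting RMSE a usable guard against overfitting on such a small dataset.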
- There are other institutions and companies working on fermentation optimization using AI/ML, such as Benchling. Authors should have a brief review of this area and highlight novelty or practicability of this deep learning model compared to other established methods.
In [1], the authors review how ML methods have been applied so far in bioprocess development, especially in strain engineering and selection, bioprocess optimization, scale-up, monitoring, and control of bioprocesses. For each topic, they highlight successful application cases and current challenges, and point out several domains that can benefit from further progress in the field of ML.
In [2], traditional knowledge-driven mathematical approaches such as Constraint-Based Modeling (CBM) and data-driven black-box approaches such as ML are reviewed (both independently and in synergy) as powerful methods for analyzing and optimizing fermentation parameters and predicting related yields.
A benchmark of Artificial Neural Network (ANN) and Support Vector Machine (SVM) models is provided in [3] to offer a series of effective optimization methods for the production of an antifungal lipopeptide biosurfactant. Among these, the General Regression Neural Network (GRNN) appears to be the most suitable ANN model for the design of the fed-batch fermentation conditions for the production of iturin A because of its high robustness and precision, while the SVM model appears to be a very suitable alternative.
An interesting example of the synergistic use of ML models is given in [4], where the authors combine descriptors derived from fermentation process conditions with information extracted from the amino acid sequence to construct an ML model based on XGBoost classifiers, Support Vector Machines (SVM) and Random Forests (RF) that predicts the final protein yields and the corresponding fermentation conditions for the expression of a target recombinant protein in the Escherichia coli periplasm.
Another example of the synergistic use of ML models for bioprocess optimization is provided in [5], where the authors use ANNs and Genetic Algorithms (GA) to model and optimize a fermentation medium for the production of the enzyme hydantoinase by Agrobacterium radiobacter, with the models trained on experimental data reported in the literature. In their approach, GAs are used to optimize the input space of the NN models to find the optimum settings for maximum enzyme and cell production, thereby integrating two ML techniques into a powerful tool for process modeling and optimization.
Finally, an example of how ML models are paving the way for ML-based process controllers is provided in [6], where an ML-based optimized decision-making system (OD-MS) algorithm was studied to find the enzymatic hydrolysis, saccharification and fermentation conditions that maximize the related yield.
—
[1] Machine learning in bioprocess development: From promise to practice (2022): https://pubmed.ncbi.nlm.nih.gov/36456404/
[2] Synergisms of machine learning and constraint-based modeling of metabolism for analysis and optimization of fermentation parameters (2021): https://onlinelibrary.wiley.com/doi/abs/10.1002/biot.202100212
[3] User-friendly optimization approach of fed-batch fermentation conditions for the production of iturin A using artificial neural networks and support vector machine (2015): https://www.sciencedirect.com/science/article/pii/S0717345815000640
[4] PERISCOPE-Opt: Machine learning-based prediction of optimal fermentation conditions and yields of recombinant periplasmic protein expressed in Escherichia coli (2022): https://pubmed.ncbi.nlm.nih.gov/35765650/
[5] Optimization of a fermentation medium using neural networks and genetic algorithms (2003): https://link.springer.com/article/10.1023/A:1026225526558
[6] Bioethanol production optimization through machine learning algorithm approach: biomass characteristics, saccharification, and fermentation conditions for enzymatic hydrolysis (2022): https://link.springer.com/article/10.1007/s13399-022-03163-z
The discussion of the references above has been added to the paper (Introduction).
Round 2
Reviewer 1 Report
The authors have faithfully reflected the revision suggestions, and the MS is now acceptable for publication.