MultiSec: Multi-Task Deep Learning Improves Secreted Protein Discovery in Human Body Fluids

He, Kai; Wang, Yan; Xie, Xuping; Shao, Dan

doi:10.3390/math10152562

¹ Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China

² School of Artificial Intelligence, Jilin University, Changchun 130012, China

³ College of Computer Science and Technology, Changchun University, Changchun 130022, China

* Author to whom correspondence should be addressed.

Mathematics 2022, 10(15), 2562; https://doi.org/10.3390/math10152562

Submission received: 15 June 2022 / Revised: 8 July 2022 / Accepted: 21 July 2022 / Published: 22 July 2022

(This article belongs to the Section Mathematical Biology)

Abstract

Prediction of secreted proteins in human body fluids is essential since secreted proteins hold promise as disease biomarkers. Various approaches have been proposed to predict whether a protein is secreted into a specific fluid by its sequence. However, there may be relationships between different human body fluids when proteins are secreted into these fluids. Current approaches ignore these relationships directly, and therefore their performances are limited. Here, we present MultiSec, an improved approach for secreted protein discovery to exploit relationships between fluids via multi-task learning. Specifically, a sampling-based balance strategy is proposed to solve imbalance problems in all fluids, an effective network is presented to extract features for all fluids, and multi-objective gradient descent is employed to prevent fluids from hurting each other. MultiSec was trained and tested in 17 human body fluids. The comparison benchmarks on the independent testing datasets demonstrate that our approach outperforms other available approaches in all compared fluids.

Keywords:

secreted protein discovery; multi-task learning; deep learning

MSC:

68T07; 92B20

1. Introduction

A large number of molecules are contained in human body fluids, and these molecules are promising as biomarkers for disease diagnosis and therapeutic monitoring [1,2,3]. Among these molecules, secreted proteins are one of the essential types of biomarkers. Discovering secreted proteins is therefore an important step toward secreted protein biomarker identification. In recent years, although many secreted proteins have been identified through experiments, it remains a challenge to identify new secreted proteins in some human body fluids [4,5]. To facilitate the detection of secreted proteins, several computational approaches have been proposed to predict whether a protein is secreted into a specific fluid [6,7,8,9,10]. These efforts can accelerate the detection of secreted proteins and avoid many unnecessary wet-lab experiments. Secreted protein discovery by computational methods has become a well-studied topic in bioinformatics.

Among these approaches, the most successful method uses a support vector machine (SVM) and protein features [6,7]. First, the features of each protein are computed from its sequence using computational tools and websites [11,12]. Second, a feature selection method is used to choose representative features from this set. Finally, the SVM classifier is used to differentiate secreted proteins from non-secreted proteins based on the selected features. Although this approach is fast and effective, its weak representative ability limits the performance of secreted protein discovery. Another effective approach uses deep learning and protein sequences [5]. Compared with the SVM-based approach, it can usually learn more complex features from protein sequences using a convolutional neural network (CNN), long short-term memory (LSTM), etc. [13]. These complex features enhance the representative ability of the approach and promote higher performance. However, deep learning typically requires a large amount of data. Because the number of secreted proteins in some human body fluids is limited, the performance in many fluids may suffer from overfitting. Furthermore, for some human body fluids, such as sweat, the number of secreted proteins is too small to learn representative features for prediction. Therefore, an effective approach is needed to obtain more accurate predictions and to enable computational detection in these human body fluids.

These available approaches ignore relationships between different human body fluids. Typically, a protein can be secreted into several human body fluids, which may be related. Therefore, predictions of computational methods in different human body fluids may also be related. When designing a computational approach to secreted protein discovery, relationships between different fluids need to be considered. Multi-task learning is a machine learning method that exploits relationships between tasks to improve the performances of all tasks [14,15]. Thus, we can use multi-task learning to take the relationships between human body fluids into account, with the prediction of whether a protein is secreted into a specific fluid regarded as one task. In addition, a shared network for different tasks helps to prevent overfitting [16,17,18]. However, several problems arise when employing multi-task learning in multi-fluid secreted protein discovery. First, many of these human body fluids have very few known secreted proteins. As a result, positive samples may be fewer than negative samples. Even when predicting secretion into a single fluid, the imbalanced dataset is a problem that needs to be solved [5]; in the multi-task setting, the datasets for all human body fluids need to be considered simultaneously. In addition, the performances in different human body fluids may conflict with each other [19]. The performance in some fluids might be hurt if the tasks are not coordinated well. To obtain decent performance in all human body fluids, all these problems must be solved.

In this paper, we propose MultiSec, a novel approach that takes advantage of progress in multi-task learning and improves the state-of-the-art performance in secreted protein discovery. MultiSec was designed to simultaneously predict the probability that a protein is secreted into each of 17 human body fluids based on its sequence. This approach is composed of three modules: a balanced sampling module that generates balanced samples for each human body fluid during training, a lightweight convolutional neural network architecture that extracts deep features for proteins, and a multi-task classification module that calculates the probabilities of a protein being secreted into each of the 17 human body fluids. We trained MultiSec on 17 human body fluids: plasma, saliva, urine, cerebrospinal fluid (CSF), seminal fluid, amniotic fluid, tear fluid, bronchoalveolar lavage fluid (BALF), milk, synovial fluid, nipple aspirate fluid (NAF), cervical–vaginal discharge (CVF), pleural effusion (PE), sputum, exhaled breath condensate (EBC), pancreatic juice (PJ), and sweat. MultiSec achieved a more accurate prediction in all human body fluids, with areas under the ROC curve of 0.89–0.98. Comparison benchmarks on the independent testing datasets demonstrate that our approach outperforms other state-of-the-art approaches in all the compared human body fluids.

2. Materials and Methods

2.1. Data Collection

To use the relationships between different human body fluids, we need to first construct a multi-task dataset for secreted protein discovery. Therefore, a dataset named SecretedP17 was constructed from the publicly available database Human Body Fluid Proteome (HBFP) [20]. The HBFP database has collected 11,827 experimentally validated secreted proteins in 17 human body fluids. These human body fluids include plasma, saliva, urine, cerebrospinal fluid, seminal fluid, amniotic fluid, tear fluid, bronchoalveolar lavage fluid, milk, synovial fluid, nipple aspirate fluid, cervical–vaginal discharge, pleural effusion, sputum, exhaled breath condensate, pancreatic juice, and sweat. From this database, secreted proteins in 17 human body fluids and corresponding sequences were retrieved. Based on these data, 17 sub-datasets corresponding to 17 fluids were constructed individually.

Taking the plasma sub-dataset as an example, the construction process is as follows. First, from the HBFP database, proteins that were verified to be secreted into plasma were collected as positive samples. Second, negative samples for plasma were generated based on these positive samples and protein family information by using the method of the previous study [6]. Specifically, negative samples were chosen from those proteins that belong to more than one family and have not been reported as plasma-secreted proteins. If none of the families of a protein contains any plasma-secreted protein, this protein was regarded as a negative sample. Thus, negative samples for plasma can be obtained based on protein family information. Third, redundant proteins with similar sequences were filtered out so that performance could be evaluated accurately. The PSI-CD-HIT program was used to calculate the sequence similarity, and proteins were considered redundant if the similarity exceeded 90% [21]. Finally, the plasma sub-dataset was divided into training, validation, and testing datasets in proportions of 60%, 20%, and 20%. The proportion of positive samples was kept the same in these datasets. The other sub-datasets were obtained by using a similar process for the other 16 human body fluids. The plasma sub-dataset and the other sub-datasets were merged into the SecretedP17 dataset. The SecretedP17 dataset details are shown in Table 1.
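The 60%/20%/20% division with a constant positive proportion can be sketched as a stratified split. This is only an illustration: the function name `stratified_split`, the seed, and the rounding rule are assumptions, not details taken from the paper.

```python
import random

def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/validation/test parts while keeping
    the share of positives (label 1) roughly equal in every part."""
    rng = random.Random(seed)
    parts = ([], [], [])
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_val = round(fractions[1] * len(idx))
        parts[0].extend(idx[:n_train])
        parts[1].extend(idx[n_train:n_train + n_val])
        parts[2].extend(idx[n_train + n_val:])  # remainder goes to the test part
    return parts

# Toy sub-dataset: 50 positives and 200 negatives (made-up sizes).
labels = [1] * 50 + [0] * 200
train, val, test = stratified_split(labels)
```

Splitting each class separately is what keeps the positive ratio stable across the three parts.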

Previous processing has collected sequences and secreted labels in 17 human body fluids for proteins. Here, a Position-Specific Score Matrix (PSSM) was calculated for each protein with a method similar to that of the previous study [5]. The PSSM of each protein was obtained by running the PSI-BLAST program with an E-value threshold of 0.001 and 3 iterations against the UniRef90 database (2020_01) [11,22]. The dimension of the PSSM is $L \times 20$, where the first dimension L corresponds to the sequence length of the protein, and the second dimension 20 corresponds to the 20 amino acids (aa) at each position. Then, the sigmoid function transformed the PSSM data into values between 0 and 1. After that, the PSSM data were processed into a fixed length (1000 aa) for efficient computation. If the length of a protein sequence exceeds 1000, the first 500 and last 500 positions are kept and joined. For the remaining proteins, zeros of dimension $(1000 - L) \times 20$ are padded at the end of their PSSM data. Because of the fixed sliding window strategy in Section 2.2, our model is robust to this processing of the input PSSM data. In addition, this process keeps the input consistent. Finally, for each protein in the SecretedP17 dataset, PSSM data of dimension $1000 \times 20$ were collected.
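The fixed-length processing described above can be sketched in plain Python as follows. The function name `preprocess_pssm` and the list-of-rows representation are assumptions made for illustration; the sigmoid, the 500 + 500 truncation, and the zero padding follow the text.

```python
import math

def preprocess_pssm(pssm, max_len=1000):
    """pssm: list of L rows with 20 raw scores each.
    Returns max_len rows: sigmoid-scaled scores, truncated or zero-padded."""
    scaled = [[1.0 / (1.0 + math.exp(-score)) for score in row] for row in pssm]
    if len(scaled) > max_len:
        half = max_len // 2
        # Long proteins: keep the first 500 and the last 500 positions.
        return scaled[:half] + scaled[-half:]
    width = len(scaled[0]) if scaled else 20
    # Short proteins: append (max_len - L) rows of zeros at the end.
    return scaled + [[0.0] * width for _ in range(max_len - len(scaled))]
```

Note that the padding rows are exact zeros, whereas a raw score of 0 maps to 0.5 after the sigmoid, so padded positions remain distinguishable from real ones.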

2.2. Multi-Fluid Secreted Protein Discovery

Here, multi-fluid secreted protein discovery is regarded as a special case of multi-task classification, where the goal is to predict whether a protein is secreted into each of the human body fluids based on its sequence, and the dataset of each task may be very imbalanced. Figure 1 summarizes the architecture of MultiSec in this paper for secreted protein discovery, which consists of three modules: balanced sampling, feature extraction, and multi-task classification. In the balanced sampling module, 17 groups of proteins corresponding to 17 human body fluids were generated separately. The proportion of positive and negative samples in each group is kept the same. After that, through the feature extraction module, the deep features of these 17 groups of proteins are individually extracted from the PSSM data. In the multi-task classification module, the probabilities for each group of proteins to be secreted into the corresponding human body fluid are computed first. With these probabilities and true labels, losses to these 17 fluids can be calculated. The multi-fluid loss was calculated based on the losses of 17 human body fluids. By optimizing this loss, MultiSec can be trained on these human body fluids simultaneously.

2.2.1. Balanced Sampling

In multi-fluid secreted protein discovery, many fluids have more negative than positive samples. Neural networks usually prefer to predict the class with the larger number of samples, which makes it hard to discover secreted proteins accurately. The imbalance problem of the dataset reduces the performance of secreted protein discovery and needs to be solved. Several methods (under-sampling, over-sampling, bagging, etc.) have been proposed to overcome the imbalanced dataset problem in machine learning [23]. Although the approach of the previous study can solve the imbalanced dataset problem, it needs to train several networks [5]. To reduce the computation by training only one network while solving this problem, we propose the balanced sampling module. By sampling independently within a single network training, the imbalanced dataset problems in different human body fluids are addressed while utilizing the samples of all classes as much as possible.

In our opinion, this problem is caused by the large difference in the number of positive and negative samples during training. The class with more samples accounts for a larger proportion of the loss and therefore obtains a larger gradient to update the network when the loss is minimized. Eventually, the network tends to predict this class and ignore the others. Therefore, we believe this problem can be avoided by keeping the number of samples from each class the same.

Here, the balanced sampling module is proposed to generate same-size data for human body fluids in multi-fluid secreted protein discovery. First, each dataset is divided into positive and negative sets based on its corresponding label. By dividing all these 17 datasets, 34 sets (17 positive sets and 17 negative sets) are obtained. Then, independent random sampling is applied in each of these sets. As a result, 34 groups of data with the same number of samples are obtained from these sets. Afterward, groups with the same body fluid (positive and negative groups for a specific fluid) are merged. Finally, 17 groups of balanced data are generated for these 17 human body fluids.

By using this module, 17 groups of data are generated randomly at each training iteration. These groups of data correspond to 17 human body fluids and will be used for the calculation of other modules. When training with these data groups, the network would learn a good balance between secreted proteins and non-secreted proteins.
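The sampling procedure above can be sketched as follows. The data layout, the name `balanced_batches`, and the choice to draw with replacement are assumptions for illustration; the group size of 32 (16 positives plus 16 negatives) assumes that the 32 samples per group reported in Section 3.1 are split evenly.

```python
import random

def balanced_batches(datasets, per_class=16, rng=None):
    """datasets: {fluid: [(protein_id, label), ...]}, label 1 = secreted, 0 = not.
    Returns {fluid: 2 * per_class samples, half positive and half negative}."""
    rng = rng or random.Random()
    batches = {}
    for fluid, samples in datasets.items():
        positives = [s for s in samples if s[1] == 1]
        negatives = [s for s in samples if s[1] == 0]
        # Independent draws with replacement (an assumption), so that a very
        # small class such as sweat positives can still fill its half.
        group = rng.choices(positives, k=per_class) + rng.choices(negatives, k=per_class)
        rng.shuffle(group)
        batches[fluid] = group
    return batches
```

Calling this function once per training iteration yields a fresh balanced group for every fluid.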

2.2.2. Feature Extraction

The feature extraction module is designed to extract deep features from protein sequences for all human body fluids. In multi-fluid secreted protein discovery, this module is shared by 17 human body fluids. As a result, its computation increases 17-fold. If the network architecture is too complex, the training process will be very slow. Therefore, the feature extraction module for multi-fluid secreted protein discovery requires a small, fast, and effective architecture.

A simple and effective architecture is adopted here, consisting of four parallel convolution-pooling operations and a fully connected layer. The input of the feature extraction module is the PSSM data with dimension $1000 \times 20$, which were collected previously. From the PSSM data, each convolution-pooling operation extracts fixed-length features. Then, a fully connected layer is used to extract deeper features from the fixed-length features.

The convolution-pooling operation consists of a convolutional layer that extracts information from the input sequence and a pooling layer that selects the important information. A convolutional layer contains a group of filters, and each filter can be regarded as a single motif detector. A motif detector scans the input sequence to detect the presence of the corresponding motif. Specifically, a score for this motif is calculated at each position based on the local sequence and this detector. By calculating scores at all positions of the input sequence, the information is extracted by the motif detectors [13,24]. The calculation of a motif detector usually consists of a weighted sum and a non-linear activation function. The motif information corresponds to the weights and bias parameters of the filters, which can be learned in training. For computational efficiency, ReLU is used as the activation function. The computation of the convolutional layer is defined as follows:

C_{i, j} = max (0, \sum_{d = - (w - 1) / 2}^{(w - 1) / 2} \sum_{c = 1}^{20} X_{i + d, c} W_{d + (w - 1) / 2, c}^{j} + b_{j}),

(1)

where $X$ is the PSSM data of the protein, $W$ and $b$ are the weight and bias of the convolution layer, respectively, $C$ is the protein information extracted by the convolution layer, w is the filter size, and $\max(0, x)$ is the ReLU activation function.

There is much useless information in the extracted information. The pooling layer is a dimensionality reduction method that keeps a specific part of it. Average pooling extracts global information by averaging the input information. Max pooling extracts the important information by selecting the maximum value from the input information. The pooling size controls the size of the local area. Here, the pooling size is the length of the feature sequence, which is known as global max pooling [25]. The whole sequence is thus used as the pooling region to extract the most important information for each motif detector. Global max pooling is defined as follows:

v_{j} = \max_{i} C_{i, j},

(2)

where $v_{j}$ denotes the information selected for the j-th motif detector. The convolution-pooling operation thus extracts a fixed-size feature vector v, whose size is the number of filters in the convolutional layer.
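Equations (1) and (2) can be traced with a small pure-Python sketch. The function name `conv_global_max` is hypothetical, and treating positions outside the sequence as zeros is an assumption, since the text does not state how sequence boundaries are handled.

```python
def conv_global_max(x, filters, biases):
    """x: L rows of channel values (20 per row for a PSSM).
    filters: list of w x channels weight matrices with odd w; biases: one per filter.
    Returns one value per filter: the largest ReLU response over all positions."""
    length = len(x)
    features = []
    for weights, bias in zip(filters, biases):
        half = (len(weights) - 1) // 2
        best = 0.0  # ReLU outputs are never negative, so 0 is a safe floor
        for i in range(length):
            score = bias
            for d in range(-half, half + 1):
                pos = i + d
                if 0 <= pos < length:  # out-of-range positions count as zeros (assumption)
                    score += sum(a * b for a, b in zip(x[pos], weights[d + half]))
            best = max(best, max(0.0, score))  # Equation (1) then Equation (2)
        features.append(best)
    return features
```

In practice a deep learning framework would perform this with a one-dimensional convolution layer followed by a max over the sequence axis; the loops here only make the indices of Equation (1) explicit.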

A single convolution-pooling operation can only extract features for motifs of one fixed length. Four convolution-pooling operations with different filter sizes are therefore adopted to extract features for motifs of different lengths. These operations are applied to the input sequence in parallel, and the four extracted feature vectors are merged. The merged feature vector is computed as follows:

u = [v^{1}, v^{2}, \dots, v^{N}],

(3)

where $v^{i}$ represents the feature vector extracted by the i-th convolution-pooling operation, and N = 4 here.

A fully connected (FC) layer transforms the feature vector into more complex features. A fully connected layer consists of many neurons, each of which is connected to all input features. In each neuron, the output value is calculated by a weighted summation of the input features followed by a nonlinear transformation [13]. As in the convolutional layer, the ReLU activation function is used to speed up the computation. In the FC layer, the computation of the i-th neuron $h_{i}$ is defined as follows:

h_{i} = max (0, u \cdot α_{i} + d_{i}),

(4)

where $α_{i}$ and $d_{i}$ are the weight and bias of the i-th neuron, respectively. The values of all neurons in the fully connected layer constitute the final feature vector of the protein.
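As a minimal illustration of Equations (3) and (4), the sketch below concatenates the pooled vectors and applies a ReLU layer. The function name `merge_and_project` and the toy numbers in the usage note are made up for the example.

```python
def merge_and_project(vectors, weights, biases):
    """vectors: the N pooled feature vectors v^1..v^N.
    weights: one weight vector alpha_i per neuron; biases: one d_i per neuron."""
    # Equation (3): concatenate the pooled vectors into u.
    u = [value for vec in vectors for value in vec]
    # Equation (4): h_i = max(0, u . alpha_i + d_i) for every neuron i.
    return [max(0.0, sum(a * b for a, b in zip(u, w)) + d)
            for w, d in zip(weights, biases)]
```

For instance, `merge_and_project([[1.0, 2.0], [3.0]], [[1, 0, 0], [0, -1, 0]], [0.5, 0.0])` gives `[1.5, 0.0]`: the second neuron's pre-activation is negative and is clipped by the ReLU.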

The feature extraction module is shared among all body fluids. Through this module, the feature vector of each protein can be extracted according to its corresponding PSSM data. Therefore, the 17 sets of proteins generated by balanced sampling are converted into 17 sets of corresponding feature vectors.

2.2.3. Multi-Task Classification

The multi-task classification module calculates the probabilities of proteins being secreted into 17 human body fluids based on the features extracted by the previous module. This module contains 17 output layers, and each output layer is similar to the fully connected layer. Each output layer contains two neurons representing secreted protein and non-secreted protein, and the output value is calculated by a linear transformation of the input features. The output of the i-th neuron in the k-th output layer is calculated as follows:

o_{i}^{k} = h \cdot β_{i}^{k} + q_{i}^{k},

(5)

where $β_{i}^{k}$ and $q_{i}^{k}$ represent the weight and bias of the i-th neuron in the k-th output layer, respectively. After that, the softmax function transforms the values of the output layer into the secreted probabilities. The k-th probability $p^{k}$ corresponding to the k-th human body fluid is calculated as follows:

p^{k} = \frac{\exp(o_{2}^{k})}{\exp(o_{1}^{k}) + \exp(o_{2}^{k})} .

(6)

Furthermore, when $p^{k} > 0.5$, this protein is predicted to be secreted into the k-th human body fluid. From these predicted probabilities, the overall probability for a protein to be secreted into all 17 human body fluids simultaneously can be computed as follows:

\hat{p} = \prod_{k = 1}^{17} p^{k} .

(7)
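Equations (6) and (7) can be checked numerically with a short sketch. The function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step added here, not something stated in the text.

```python
import math

def secreted_probability(o1, o2):
    """Equation (6): two-class softmax; returns the probability of neuron o2."""
    m = max(o1, o2)  # shift by the maximum so that exp() cannot overflow
    e1, e2 = math.exp(o1 - m), math.exp(o2 - m)
    return e2 / (e1 + e2)

def joint_probability(probs):
    """Equation (7): product of the per-fluid probabilities."""
    return math.prod(probs)
```

Because Equation (7) is a plain product, a single low per-fluid probability pulls the overall value down sharply.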

Cross-entropy loss is used as the loss function for secreted protein discovery in each human body fluid. This single-fluid loss is calculated based on predicted probabilities and true labels in the same human body fluid. By calculating for each of these fluids, 17 single-fluid losses for human body fluids can be obtained. The k-th single-fluid loss is calculated as follows:

L^{k} = - \frac{1}{N_{k}} \sum_{i = 1}^{N_{k}} (y_{i}^{k} log p_{i}^{k} + (1 - y_{i}^{k}) log (1 - p_{i}^{k})),

(8)

where $y_{i}^{k}$ and $p_{i}^{k}$ represent the label and predicted probability for the i-th protein to be secreted into the k-th human body fluid, respectively, and $N_{k}$ represents the number of proteins.
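A minimal sketch of the single-fluid loss in Equation (8) follows. The name `single_fluid_loss` and the clipping constant `eps` are illustrative assumptions; clipping only guards against taking the logarithm of zero.

```python
import math

def single_fluid_loss(labels, probs, eps=1e-12):
    """Equation (8): mean binary cross-entropy over the N_k proteins of one fluid."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); eps is an illustrative choice
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)
```

An uninformative prediction of 0.5 for every protein gives a loss of log 2, which is a convenient reference point when reading training curves.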

To discover secreted protein in 17 human body fluids, the multi-fluid loss needs to be computed based on these single-fluid losses. Here, the multi-fluid loss is computed by the weighted summation, which is defined as follows:

L = \sum_{k = 1}^{17} λ^{k} L^{k},

(9)

where $λ^{k}$ represents the weight coefficient of the k-th single-fluid loss. These weight coefficients still need to be determined. Usually, the coefficients are set by hand, but fixed coefficients may cause the network to suffer from task conflict and negative transfer [16,17,19]. Because of this, performances in some fluids may be reduced.

To ensure all the fluids can obtain a good performance, a multi-objective gradient descent (MGDA) algorithm is employed [19,26]. The MGDA algorithm dynamically calculates the weight coefficients based on the gradient vectors of all human body fluids. First, 17 gradient vectors are calculated individually, each of which is the derivative of the corresponding human body fluid loss with respect to the parameters of the feature extraction module. The k-th gradient vector $g^{k}$ is calculated as follows:

g^{k} = \frac{\partial L^{k}}{\partial θ},

(10)

where $θ$ contains all the weights and biases in the feature extraction module. After that, the weight coefficients are solved by finding the minimum-norm point in the convex hull of these gradient vectors, which is formulated as follows:

\min_{λ^{1}, λ^{2}, \dots, λ^{17}} \left\| \sum_{k = 1}^{17} λ^{k} g^{k} \right\|_{2}^{2} \quad \text{s.t.} \quad \sum_{k = 1}^{17} λ^{k} = 1, \quad λ^{k} \geq 0 .

(11)

This formula is a convex quadratic problem with linear constraints, which can be easily solved by available optimization packages. Solving it yields the weight coefficients for the fluid losses. The multi-fluid loss for secreted protein discovery can then be calculated by substituting these coefficients into Equation (9). By optimizing the multi-fluid loss, secreted protein predictions for 17 human body fluids can be trained simultaneously.

During training, the gradient vector used to update the parameters of the feature extraction module corresponds to the solution of the minimum-norm point [19]. The value of this norm controls the step size of the update. When this value is not 0, the MGDA algorithm can always find a proper direction in which the updated feature extraction module achieves better performance in all these fluids. When this value is very close to 0, the feature extraction module receives only a slight gradient, and the performances of all fluids will not decrease. At the same time, the output layers for different fluids can still be optimized because each of them relies only on its corresponding fluid. Therefore, optimizing the multi-fluid loss with the MGDA algorithm can prevent these fluids from being hurt by other fluids.
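To make the minimum-norm problem in Equation (11) concrete, the sketch below solves it with a simple Frank-Wolfe iteration in plain Python. This is a swapped-in solver chosen so that the example has no dependencies; the paper itself reports using the QP program in CVXOPT. The function name `min_norm_weights`, the iteration cap, and the tolerance are assumptions.

```python
def min_norm_weights(grads, max_iter=200, tol=1e-12):
    """Frank-Wolfe sketch for Equation (11): weights on the simplex that
    minimise the squared norm of the weighted sum of gradient vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(grads)
    lam = [1.0 / n] * n  # start from uniform weights
    for _ in range(max_iter):
        # Current combined gradient v = sum_k lam_k * g_k.
        v = [sum(l * g[j] for l, g in zip(lam, grads)) for j in range(len(grads[0]))]
        scores = [dot(g, v) for g in grads]
        t = min(range(n), key=scores.__getitem__)
        if dot(v, v) - scores[t] <= tol:  # Frank-Wolfe gap: no vertex improves the objective
            break
        diff = [a - b for a, b in zip(v, grads[t])]
        # Exact line search between v and g_t, clipped to [0, 1].
        gamma = min(1.0, max(0.0, dot(diff, v) / dot(diff, diff)))
        lam = [(1.0 - gamma) * l for l in lam]
        lam[t] += gamma
    return lam
```

With gradients `[1, 0]` and `[0, 2]`, the weights converge to about `[0.8, 0.2]`: the task with the smaller gradient receives the larger weight, so no single fluid dominates the shared update.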

2.2.4. Evaluation

The performance of secreted protein discovery is evaluated by sensitivity (SN), specificity (SP), accuracy (ACC), F1 score (F1), Matthews correlation coefficient (MCC), and area under the ROC curve (AUC). These metrics are defined as follows:

SN = \frac{T P}{T P + F N},

(12)

SP = \frac{T N}{T N + F P},

(13)

ACC = \frac{T P + T N}{T P + T N + F P + F N},

(14)

F 1 = \frac{2 T P}{2 T P + F P + F N},

(15)

MCC = \frac{T P \times T N - F N \times F P}{\sqrt{(T P + F N) \times (T P + F P) \times (T N + F P) \times (T N + F N)}},

(16)

where $TP$, $TN$, $FP$, and $FN$ represent the number of protein samples corresponding to true positives, true negatives, false positives, and false negatives, respectively.
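Equations (12) to (16) translate directly into code. The sketch below uses a hypothetical function name and assumes that no denominator is zero; AUC is omitted because it requires the ranked probabilities rather than the four counts.

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Equations (12)-(16) from confusion-matrix counts (assumes non-zero denominators)."""
    mcc_den = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
        "MCC": (tp * tn - fn * fp) / mcc_den,
    }
```

On a heavily imbalanced fluid, ACC can stay high even when SN is poor, which is why F1 and MCC are reported alongside it.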

3. Results

3.1. Performance of MultiSec in 17 Human Body Fluids

The implementation of MultiSec is based on the Python packages PyTorch, CVXOPT, and Scikit-Learn [27,28]. MultiSec was trained on the 17 sub-datasets of SecretedP17 simultaneously. First, 17 groups of data were generated by the balanced sampling module, each with 32 samples. Second, four parallel convolution-pooling operations were employed. The filter sizes are {1, 3, 5, 9}, and the number of filters is 128 for all of them. With these operations, four feature vectors were extracted. After merging these vectors, one feature vector with a size of 512 was obtained. The FC layer in the feature extraction module used 64 neurons, so 17 groups of features with a size of 64 were extracted. Third, the multi-task classification module calculated the predicted value for each human body fluid based on these features. Then the classification loss corresponding to each body fluid was obtained. The weight coefficients of all body fluids were optimized by the QP program in the CVXOPT package. After that, the multi-task loss was calculated by weighted summation. The multi-task classification loss for secreted protein prediction was optimized by the Adam optimizer with a learning rate of $1 \times 10^{-4}$. MultiSec was trained for 20,000 iterations, and the iteration with the highest F1 score was selected for each body fluid using the corresponding validation dataset.

After training, MultiSec was evaluated on the testing datasets of 17 human body fluids, including plasma, saliva, urine, cerebrospinal fluid, seminal fluid, amniotic fluid, tear fluid, bronchoalveolar lavage fluid, milk, synovial fluid, nipple aspirate fluid, cervical–vaginal discharge, pleural effusion, sputum, exhaled breath condensate, pancreatic juice, and sweat. Table 2 reports the benchmarks of MultiSec on testing datasets of 17 human body fluids. MultiSec achieved performances of 81.70–98.62%, 55.38–88.37%, 85.99–99.56%, 61.08–87.06%, 54.43–76.26%, and 87.99–98.07% on ACC, SN, SP, F1, MCC, and AUC metrics, respectively. This demonstrates that MultiSec obtained impressive performances in all 17 fluids simultaneously.

3.2. Comparison with Other Methods in 14 Human Body Fluids

We compared the performances of MultiSec with other available methods, including the SVM-based approach, the DT-based approach, and DeepSec. Furthermore, we also utilized random sampling to replace the balanced sampling module in the training of MultiSec as a comparison. This model was denoted as MultiSecRS (MultiSec with random sampling). The hyper-parameters of these methods were selected by the MCC metric on the validation datasets, and the performances on the testing datasets were reported as benchmarks of these methods.

The SVM-based and DT-based methods are built on protein features [6,7]. First, 1610 features (protein properties, such as sequence length, weight, amino acid composition, etc.) were collected from protein sequences by using computational tools and websites. After that, the t-test and false discovery rate (FDR) were used to select the 50 most important features for each fluid [5]. Finally, SVM classifiers were used to predict whether a protein is secreted into a specific fluid based on these 50 features. The performances on the testing datasets were reported as the SVM-based approach benchmarks. Through a similar process, the benchmarks of the DT-based approach on all body fluids were obtained.

Unlike the feature-based methods, DeepSec does not need feature collection and selection but performs end-to-end training on protein PSSM data [5]. DeepSec was trained on each sub-dataset individually. A bagging-based strategy was adopted to solve the imbalance problem. First, the class with more samples was divided into several parts, and each part was combined with the other class to form several datasets. Second, several DeepSec networks were separately trained on these datasets. Finally, predictions were made by averaging the probabilities obtained by these networks. In DeepSec, many networks must be trained to discover secreted proteins in a single fluid. Because of this, DeepSec is costly in computational time and resources. Therefore, we only trained DeepSec in 14 human body fluids where the number of networks is no more than 10, including plasma, saliva, urine, cerebrospinal fluid, seminal fluid, amniotic fluid, tear fluid, bronchoalveolar lavage fluid, milk, synovial fluid, nipple aspirate fluid, pleural effusion, sputum, and sweat. The benchmarks of DeepSec were calculated on the independent testing datasets of these 14 human body fluids.
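The bagging-based strategy described for DeepSec can be pictured roughly as below. How DeepSec actually partitions the larger class is not specified here, so the interleaved chunking, the chunk count, and both function names are assumptions for illustration only.

```python
def bagging_datasets(majority, minority):
    """Split the larger class into chunks of about the minority size and
    pair each chunk with the full minority class (one dataset per network)."""
    n_chunks = max(1, round(len(majority) / len(minority)))
    chunks = [majority[i::n_chunks] for i in range(n_chunks)]
    return [chunk + minority for chunk in chunks]

def ensemble_probability(probabilities):
    """Average the probabilities predicted by the separately trained networks."""
    return sum(probabilities) / len(probabilities)
```

The number of datasets, and therefore of networks to train, grows with the imbalance ratio, which is the cost that the single-network balanced sampling of MultiSec is meant to avoid.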

To intuitively compare MultiSec with other approaches, the benchmarks of these methods were averaged over 14 human body fluids. Table 3 presents the average benchmarks of these approaches. In this table, either MultiSecRS or MultiSec achieves the highest score on every metric. Specifically, MultiSec outperforms MultiSecRS on the SN metric by 17.65%. The improvement in SN shows that the balanced sampling module helps MultiSec detect more secreted proteins. Furthermore, MultiSec outperforms DeepSec by 7.34%, 4.85%, 8.62%, 12.12%, 15.28%, and 5.36% on the ACC, SN, SP, F1, MCC, and AUC metrics, respectively. These improvements demonstrate that our approach successfully exploited relationships between human body fluids and predicted secreted proteins more accurately on average.

Four radar charts show the comparison benchmarks, including ACC, F1, MCC, and AUC, in 14 human body fluids (Figure 2). MultiSec achieves the highest values on all axes of all four radar charts except the sputum axis of ACC. Although the ACC of the DT-based approach is higher in sputum, MultiSec outperforms the other approaches on the remaining metrics. On an imbalanced dataset, these metrics reflect real performance better than accuracy does. Because of this, we consider that MultiSec still outperforms other approaches in sputum fluid. In addition, the ACC of MultiSec is only slightly higher than that of the DT-based approach in sweat fluid. It is possible that the DT-based approach predicted more negative samples, so that MultiSec barely improved ACC in sputum and sweat fluids. Comparing the results in Figure 2, we conclude that MultiSec predicts more accurate probabilities for secreted protein discovery than other approaches in all these 14 human body fluids.

We also compare the computational cost of MultiSec and DeepSec in Table 4. The table shows that MultiSec covers more body fluids with less training time and fewer parameters than DeepSec. In training time in particular, MultiSec is about 30 times faster than DeepSec.
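The ratios behind this comparison follow directly from Table 4. The snippet below is only a worked calculation on the table values; the dictionary keys are hypothetical names chosen for readability.

```python
# Values copied from Table 4; parameter counts are in thousands (K).
deepsec = {"fluids": 14, "networks": 56, "params_k_per_network": 68.37, "train_hours": 33.16}
multisec = {"fluids": 17, "networks": 1, "params_k_per_network": 40.32, "train_hours": 1.11}

def total_params_k(row):
    """Total parameter count in thousands across all networks of a method."""
    return row["networks"] * row["params_k_per_network"]

speedup = deepsec["train_hours"] / multisec["train_hours"]
param_ratio = total_params_k(deepsec) / total_params_k(multisec)
```

The training-time ratio is just under 30, which matches the "about 30 times" stated above, and the total parameter count of the DeepSec ensemble is larger than that of MultiSec by roughly two orders of magnitude.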

3.3. Potentially Secreted Proteins

MultiSec was also applied to identify potentially secreted proteins (PSPs) in the 17 human body fluids, that is, proteins that have not been verified by experiment but are predicted as positive by our approach. We retrained MultiSec using the training and validation datasets, and the testing datasets were used to choose the appropriate iteration for each fluid.

For each fluid, proteins that are neither known secreted proteins nor negative samples were selected as candidates. MultiSec was then used to calculate the secretion probability of each candidate. When the probability of a protein in a fluid is greater than 0.5, the protein is predicted to be secreted into that fluid and is called a PSP. Across the 17 human body fluids, our approach identified 17 groups of PSPs. Table 5 shows the PSP information in each fluid. The details of PSPs and their probabilities are reported in Supplementary Table S1.
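The selection rule above amounts to a mask over a protein-by-fluid matrix. The sketch below assumes a label matrix in which unlabeled candidates are encoded as -1; that encoding, the function name `find_psps`, and the toy numbers are hypothetical and are not taken from the released code.

```python
import numpy as np

UNLABELED = -1  # hypothetical encoding: neither a known positive (1) nor a negative (0)

def find_psps(labels, probs, threshold=0.5):
    """Boolean mask of PSPs: candidate proteins whose predicted probability
    in a fluid is strictly greater than the threshold."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype=float)
    candidates = labels == UNLABELED
    return candidates & (probs > threshold)

# Toy example: 3 proteins (rows) x 2 fluids (columns).
labels = np.array([[1, -1],
                   [-1, -1],
                   [0, -1]])
probs = np.array([[0.9, 0.7],
                  [0.6, 0.4],
                  [0.8, 0.5]])

psp_mask = find_psps(labels, probs)
psp_counts = psp_mask.sum(axis=0)                    # analogous to "Number of PSP"
candidate_counts = (labels == UNLABELED).sum(axis=0)  # analogous to "Number of CP"
psp_ratio = psp_counts / candidate_counts             # analogous to "Ratio of PSP"
```

Proteins already labeled in a fluid are never reported as PSPs there, whatever their score, and a probability of exactly 0.5 does not pass the threshold.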

Among these PSPs, 103 were found in all 17 groups, which means our approach predicted them to be secreted into all 17 human body fluids. We refer to these 103 PSPs as potentially universal secreted proteins (PUSPs). Furthermore, the probability of a protein being secreted into all 17 human body fluids was calculated as the product of its 17 per-fluid secretion probabilities. By this measure, 20 PUSPs have an overall probability greater than 90%. These 20 PUSPs are reported in Table 6, and the details of all 103 PUSPs are listed in Supplementary Table S2. We believe that these 20 PUSPs are more worth investigating than the others because they are more likely to be secreted into all 17 human body fluids simultaneously.
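The overall probability used to rank PUSPs is a plain product over fluids. The sketch below illustrates it on made-up scores; the function name `overall_probability` is hypothetical, and the 0.9 cut-off is the threshold stated in the caption of Table 6.

```python
import numpy as np

def overall_probability(fluid_probs):
    """Product of the per-fluid secretion probabilities of each protein."""
    return np.prod(np.asarray(fluid_probs, dtype=float), axis=-1)

# Made-up scores for three proteins across 17 fluids.
probs = np.array([
    [0.999] * 17,          # uniformly very confident
    [0.95] * 17,           # confident in every fluid, but less so
    [0.99] * 16 + [0.4],   # fails the 0.5 threshold in one fluid
])

p_hat = overall_probability(probs)
is_pusp = (probs > 0.5).all(axis=1)       # predicted positive in all 17 fluids
high_confidence = is_pusp & (p_hat > 0.9)
```

Because 17 factors are multiplied, the cut-off is demanding: the geometric mean of the per-fluid probabilities has to exceed roughly 0.994 for the product to stay above 0.9. A protein scoring 0.95 in every fluid is a PUSP but falls well short of the high-confidence group.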

4. Discussion

The comparison benchmarks presented in the previous section demonstrate that exploiting the relationships between different human body fluids can improve the discovery of secreted proteins. Furthermore, Table A5 shows that our method outperforms DeepSec by 6.08–23.75% on the MCC metric. These comparisons indicate that MultiSec outperforms other state-of-the-art methods in secreted protein discovery.

To discover secreted proteins in these 14 human body fluids, DeepSec needs to train 56 networks (1, 3, 1, 2, 2, 3, 5, 2, 3, 6, 6, 6, 6, and 10 for plasma, saliva, urine, cerebrospinal fluid, seminal fluid, amniotic fluid, tear fluid, bronchoalveolar lavage fluid, milk, synovial fluid, nipple aspirate fluid, pleural effusion, sputum, and sweat, respectively). DeepSec therefore consumes considerable computational resources. In contrast, our approach can discover secreted proteins in all 17 human body fluids with a single network. Furthermore, our approach also performs better than DeepSec in all 14 human body fluids. From the comparison with DeepSec, we conclude that MultiSec improves performance in all the compared human body fluids and significantly reduces the number of networks and the training time.
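The per-fluid network counts listed above can be reproduced from the negative-to-positive ratios in Table 1. The rule used below, rounding the ratio to the nearest integer with a minimum of one network, is an assumption: it is not stated in the text, but it matches every count given above. The function name `n_networks` is hypothetical.

```python
# Negative/positive ratios copied from Table 1 (Notation column as key).
ratios = {
    "Plasma": 0.74, "Saliva": 3.19, "Urine": 0.68, "CSF": 1.54,
    "Seminal": 1.84, "Amniotic": 3.03, "Tear": 5.21, "BALF": 2.28,
    "Milk": 3.16, "Synovial": 6.31, "NAF": 5.98, "CVF": 13.75,
    "PE": 6.32, "Sputum": 5.61, "EBC": 45.71, "PJ": 20.06, "Sweat": 10.03,
}

def n_networks(ratio):
    """Assumed rule: one network per balanced subset, i.e. the rounded ratio."""
    return max(1, round(ratio))

counts = {fluid: n_networks(r) for fluid, r in ratios.items()}

# DeepSec was only trained where no more than 10 networks are needed.
trained = {fluid: c for fluid, c in counts.items() if c <= 10}
excluded = sorted(set(counts) - set(trained))
total_networks = sum(trained.values())
```

Under this assumption the three excluded fluids are exactly cervical-vaginal discharge, exhaled breath condensate, and pancreatic juice, which would need 14, 46, and 20 networks, and the 14 remaining fluids add up to the 56 networks reported.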

5. Conclusions

In summary, we present MultiSec, a novel approach that predicts whether a protein is secreted into each of 17 human body fluids based on its sequence. The approach was designed to exploit relationships between different human body fluids via multi-task learning. The benchmarks show that MultiSec outperforms other state-of-the-art approaches in all compared human body fluids. Furthermore, whereas DeepSec requires many networks, our approach needs only one to discover secreted proteins in 17 human body fluids. Our improvements also suggest that relationships between different human body fluids exist and help to discover secreted proteins.

MultiSec was then used to identify potentially secreted proteins. With this approach, between 1244 and 6742 potentially secreted proteins were discovered in each of the 17 human body fluids. Furthermore, 103 proteins are predicted to be secreted into all these fluids simultaneously. Among these proteins, 20 have a relatively high overall probability of being secreted into all 17 human body fluids simultaneously. We believe these identified proteins are worth further study with biological experiments.

In the future, we will consider fusing more features, such as protein features and secondary structures, into our approach. In addition, because the number of known secreted proteins is limited, computational approaches to secreted protein discovery are prone to overfitting. A more effective network architecture is therefore also worth exploring. Furthermore, other protein prediction tasks, such as signal peptide identification and protein localization, may also be related to secreted protein discovery [24,29,30]. We will also look for more such tasks to improve secreted protein discovery.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math10152562/s1, Table S1: Potentially secreted proteins in 17 human body fluids; Table S2: Potentially universal secreted proteins in 17 human body fluids.

Author Contributions

Conceptualization, K.H.; methodology, K.H. and Y.W.; validation, X.X. and D.S.; formal analysis, D.S.; investigation, K.H. and X.X.; data curation, K.H. and D.S.; writing—original draft preparation, K.H.; writing—review and editing, K.H. and Y.W.; visualization, K.H., D.S. and X.X.; supervision, Y.W.; project administration, Y.W.; funding acquisition, Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 62072212) and the Development Project of Jilin Province of China (Nos. 20200401083GX, 2020C003, 20200403172SF).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data and code that support the reported results can be found at https://sites.google.com/view/multisec (accessed on 10 June 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Comparison Details with Other Methods in 17 Human Body Fluids

Because of space limitations, the main text only includes the comparative average benchmarks and the radar plots of MultiSec and the other approaches. Here, we present all the comparison benchmarks of these methods in 17 human body fluids. Tables A1–A6 show the comparison benchmarks for the ACC, SN, SP, F1, MCC, and AUC metrics, respectively. These tables also provide the comparison benchmarks for the remaining fluids: cervical-vaginal discharge, exhaled breath condensate, and pancreatic juice. In these three fluids, MultiSec achieves higher scores than the DT-based and SVM-based approaches on most metrics. In particular, on the MCC metric shown in Table A5, MultiSec outperforms the SVM-based approach by 32.59%, 43.42%, and 36.64% in cervical-vaginal discharge, exhaled breath condensate, and pancreatic juice, respectively. This also confirms that our method is more accurate than the other methods.

Table A1. ACC benchmarks of MultiSec and compared methods on the independent testing datasets of 17 human body fluids.

Fluid Name	DT	SVM	DeepSec	MultiSec
Plasma/Serum	0.7224	0.6992	0.8300	0.8682
Saliva	0.7885	0.7681	0.8339	0.9257
Urine	0.7251	0.7319	0.8218	0.8730
Cerebrospinal fluid	0.7375	0.7370	0.8026	0.8296
Seminal fluid	0.7297	0.7441	0.7929	0.8270
Amniotic fluid	0.8224	0.8034	0.8487	0.9254
Tear fluid	0.8225	0.7744	0.8321	0.9174
Bronchoalveolar lavage fluid	0.7361	0.7314	0.8363	0.8801
Milk fluid	0.7617	0.6731	0.8171	0.8979
Synovial fluid	0.8537	0.7456	0.7887	0.9049
Nipple aspirate fluid	0.8645	0.8173	0.8317	0.9209
Cervical-vaginal discharge	0.9304	0.7905	–	0.9385
Pleural effusion	0.8370	0.6949	0.7795	0.8950
Sputum	0.9014	0.8930	0.8296	0.8863
Exhaled breath condensate	0.9744	0.9264	–	0.9878
Pancreatic juice	0.9489	0.9346	–	0.9577
Sweat	0.8943	0.7165	0.7769	0.8986

The best results are in bold.

Table A2. SN benchmarks of MultiSec and compared methods on the independent testing datasets of 17 human body fluids.

Fluid Name	DT	SVM	DeepSec	MultiSec
Plasma/Serum	0.7933	0.6126	0.8553	0.8614
Saliva	0.3413	0.7520	0.6964	0.8373
Urine	0.8192	0.7346	0.8329	0.8630
Cerebrospinal fluid	0.4914	0.6422	0.7696	0.7806
Seminal fluid	0.6166	0.7898	0.7771	0.7924
Amniotic fluid	0.4870	0.8017	0.8348	0.8817
Tear fluid	0.2989	0.7582	0.7690	0.8614
Bronchoalveolar lavage fluid	0.4182	0.7423	0.7299	0.8256
Milk fluid	0.2004	0.6853	0.7026	0.7823
Synovial fluid	0.2033	0.7311	0.7541	0.8230
Nipple aspirate fluid	0.3384	0.7530	0.8018	0.8811
Cervical-vaginal discharge	0.2000	0.7829	–	0.8629
Pleural effusion	0.2230	0.7491	0.7422	0.8014
Sputum	0.4425	0.5015	0.7935	0.7375
Exhaled breath condensate	0.0769	0.4769	–	0.5538
Pancreatic juice	0.2403	0.3566	–	0.8837
Sweat	0.1552	0.7672	0.7716	0.7802

The best results are in bold.

Table A3. SP benchmarks of MultiSec and compared methods on the independent testing datasets of 17 human body fluids.

Fluid Name	DT	SVM	DeepSec	MultiSec
Plasma/Serum	0.6272	0.8157	0.7951	0.8599
Saliva	0.9285	0.7732	0.8782	0.9528
Urine	0.5872	0.7279	0.8036	0.9002
Cerebrospinal fluid	0.8973	0.7986	0.7699	0.8917
Seminal fluid	0.7911	0.7192	0.7877	0.8686
Amniotic fluid	0.9330	0.8040	0.8510	0.9444
Tear fluid	0.9229	0.7775	0.8426	0.9390
Bronchoalveolar lavage fluid	0.8755	0.7267	0.8823	0.9073
Milk fluid	0.9393	0.6692	0.8574	0.9093
Synovial fluid	0.9569	0.7479	0.7911	0.9101
Nipple aspirate fluid	0.9526	0.8281	0.8372	0.9352
Cervical-vaginal discharge	0.9834	0.7910	–	0.9544
Pleural effusion	0.9340	0.6863	0.7936	0.9009
Sputum	0.9832	0.9627	0.8360	0.8833
Exhaled breath condensate	0.9940	0.9362	–	0.9956
Pancreatic juice	0.9842	0.9633	–	0.9606
Sweat	0.9678	0.7114	0.7804	0.9104

The best results are in bold.

Table A4. F1 benchmarks of MultiSec and compared methods on the independent testing datasets of 17 human body fluids.

Fluid Name	DT	SVM	DeepSec	MultiSec
Plasma/Serum	0.7663	0.7002	0.8523	0.8835
Saliva	0.4349	0.6074	0.6686	0.8456
Urine	0.7798	0.7650	0.8471	0.8911
Cerebrospinal fluid	0.5958	0.6579	0.7201	0.7792
Seminal fluid	0.6162	0.6847	0.6972	0.7503
Amniotic fluid	0.5761	0.6691	0.7327	0.8550
Tear fluid	0.3514	0.5196	0.5958	0.7670
Bronchoalveolar lavage fluid	0.4914	0.6275	0.7319	0.8087
Milk fluid	0.2879	0.5020	0.6466	0.7759
Synovial fluid	0.2756	0.4403	0.4952	0.6910
Nipple aspirate fluid	0.4173	0.5417	0.5774	0.7577
Cervical-vaginal discharge	0.2800	0.3358	–	0.6638
Pleural effusion	0.2718	0.4011	0.4751	0.6697
Sputum	0.5758	0.5862	0.5848	0.6614
Exhaled breath condensate	0.1136	0.2168	–	0.6542
Pancreatic juice	0.3085	0.3407	–	0.6686
Sweat	0.2099	0.3287	0.3863	0.5724

The best results are in bold.

Table A5. MCC benchmarks of MultiSec and compared methods on the independent testing datasets of 17 human body fluids.

Fluid Name	DT	SVM	DeepSec	MultiSec
Plasma/Serum	0.4271	0.4278	0.6522	0.7323
Saliva	0.3356	0.4686	0.5592	0.7968
Urine	0.4196	0.4562	0.6345	0.7396
Cerebrospinal fluid	0.4353	0.4448	0.5802	0.6410
Seminal fluid	0.4076	0.4878	0.5406	0.6182
Amniotic fluid	0.4814	0.5498	0.6390	0.8058
Tear fluid	0.2576	0.4261	0.5173	0.7218
Bronchoalveolar lavage fluid	0.3297	0.4379	0.6141	0.7220
Milk fluid	0.2043	0.3074	0.5265	0.7119
Synovial fluid	0.2232	0.3536	0.4210	0.6411
Nipple aspirate fluid	0.3578	0.4671	0.5135	0.7189
Cervical-vaginal discharge	0.2745	0.3338	–	0.6597
Pleural effusion	0.1907	0.3090	0.3949	0.6175
Sputum	0.5584	0.5369	0.5147	0.5981
Exhaled breath condensate	0.1183	0.2302	–	0.6644
Pancreatic juice	0.2972	0.3067	–	0.6731
Sweat	0.1734	0.2916	0.3560	0.5380

The best results are in bold.

Table A6. AUC benchmarks of MultiSec and compared methods on the independent testing datasets of 17 human body fluids.

Fluid Name	DT	SVM	DeepSec	MultiSec
Plasma/Serum	0.7823	0.7969	0.9085	0.9383
Saliva	0.7392	0.8339	0.8775	0.9546
Urine	0.7811	0.8159	0.9008	0.9462
Cerebrospinal fluid	0.7721	0.8042	0.8642	0.8867
Seminal fluid	0.7716	0.8238	0.8576	0.8976
Amniotic fluid	0.8061	0.8765	0.9250	0.9654
Tear fluid	0.7305	0.8504	0.8929	0.9493
Bronchoalveolar lavage fluid	0.7460	0.8108	0.8968	0.9400
Milk fluid	0.7111	0.7490	0.8591	0.9152
Synovial fluid	0.7053	0.8241	0.8452	0.9219
Nipple aspirate fluid	0.7354	0.8605	0.8812	0.9644
Cervical-vaginal discharge	0.7079	0.8681	–	0.9715
Pleural effusion	0.6184	0.7883	0.8404	0.9068
Sputum	0.8004	0.8273	0.8911	0.9177
Exhaled breath condensate	0.6255	0.7871	–	0.9276
Pancreatic juice	0.7335	0.8809	–	0.9810
Sweat	0.7397	0.8191	0.8470	0.9340

The best results are in bold.

References

1. Lathrop, J.T.; Anderson, N.L.; Anderson, N.G.; Hammond, D.J. Therapeutic potential of the plasma proteome. Curr. Opin. Mol. Ther. 2003, 5, 250–257.
2. Anderson, N.L. The Clinical Plasma Proteome: A Survey of Clinical Assays for Proteins in Plasma and Serum. Clin. Chem. 2010, 56, 177–185.
3. Shen, F.; Zhang, Y.; Yao, Y.; Hua, W.; Zhang, H.S.; Wu, J.S.; Zhong, P.; Zhou, L.F. Proteomic analysis of cerebrospinal fluid: Toward the identification of biomarkers for gliomas. Neurosurg. Rev. 2014, 37, 367–380.
4. Huang, L.; Shao, D.; Wang, Y.; Cui, X.; Li, Y.; Chen, Q.; Cui, J. Human body-fluid proteome: Quantitative profiling and computational prediction. Brief. Bioinform. 2021, 22, 315–333.
5. Shao, D.; Huang, L.; Wang, Y.; He, K.; Cui, X.; Wang, Y.; Ma, Q.; Cui, J. DeepSec: A deep learning framework for secreted protein discovery in human body fluids. Bioinformatics 2021, 38, 228–235.
6. Cui, J.; Liu, Q.; Puett, D.; Xu, Y. Computational prediction of human proteins that can be secreted into the bloodstream. Bioinformatics 2008, 24, 2370–2375.
7. Wang, Y.; Du, W.; Liang, Y.; Chen, X.; Zhang, C.; Pang, W.; Xu, Y. PUEPro: A Computational Pipeline for Prediction of Urine Excretory Proteins. In Proceedings of the 12th Advanced Data Mining and Applications, Gold Coast, QLD, Australia, 12–15 December 2016; Volume 10086 LNAI, pp. 714–725.
8. Wang, J.; Liang, Y.; Wang, Y.; Cui, J.; Liu, M.; Du, W.; Xu, Y. Computational Prediction of Human Salivary Proteins from Blood Circulation and Application to Diagnostic Biomarker Identification. PLoS ONE 2013, 8, e80211.
9. Sun, Y.; Du, W.; Zhou, C.; Zhou, Y.; Cao, Z.; Tian, Y.; Wang, Y. A Computational Method for Prediction of Saliva-Secretory Proteins and Its Application to Identification of Head and Neck Cancer Biomarkers for Salivary Diagnosis. IEEE Trans. Nanobiosci. 2015, 14, 167–174.
10. Hu, L.L.; Huang, T.; Cai, Y.D.; Chou, K.C. Prediction of Body Fluids where Proteins are Secreted into Based on Protein Interaction Network. PLoS ONE 2011, 6, e22989.
11. Apweiler, R. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38, D142–D148.
12. Rao, H.B.; Zhu, F.; Yang, G.B.; Li, Z.R.; Chen, Y.Z. Update of PROFEAT: A web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2011, 39, W385–W390.
13. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
14. Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75.
15. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. IEEE Trans. Knowl. Data Eng. 2021, 1.
16. Cipolla, R.; Gal, Y.; Kendall, A. Multi-task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake Organization, Salt Lake, UT, USA, 18–22 June 2018; pp. 7482–7491.
17. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; Volume 2, pp. 1240–1251.
18. Lin, X.; Zhen, H.L.; Li, Z.; Zhang, Q.; Kwong, S. Pareto Multi-Task Learning. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
19. Sener, O. Multi-Task Learning as Multi-Objective Optimization. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 525–536.
20. Shao, D.; Huang, L.; Wang, Y.; Cui, X.; Li, Y.; Wang, Y.; Ma, Q.; Du, W.; Cui, J. HBFP: A new repository for human body fluid proteome. Database 2021, 2021, 1–14.
21. Huang, Y.; Niu, B.; Gao, Y.; Fu, L.; Li, W. CD-HIT Suite: A web server for clustering and comparing biological sequences. Bioinformatics 2010, 26, 680–682.
22. Altschul, S. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 1997, 25, 3389–3402.
23. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor. Newsl. 2004, 6, 20–29.
24. Savojardo, C.; Martelli, P.L.; Fariselli, P.; Casadio, R. DeepSig: Deep learning improves signal peptide detection in proteins. Bioinformatics 2018, 34, 1690–1696.
25. Kim, Y. Convolutional Neural Networks for Sentence Classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 25–29 October 2014; pp. 1746–1751.
26. Désidéri, J.A. Multiple-gradient descent algorithm (MGDA) for multiobjective optimization. C. R. Math. 2012, 350, 313–318.
27. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32.
28. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Müller, A.; Nothman, J.; Louppe, G.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
29. Standley, T.; Zamir, A.; Chen, D.; Guibas, L.; Malik, J.; Savarese, S. Which tasks should be learned together in multi-task learning? In Proceedings of the 37th International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 9120–9132.
30. Almagro Armenteros, J.J.; Sønderby, C.K.; Sønderby, S.K.; Nielsen, H.; Winther, O. DeepLoc: Prediction of protein subcellular localization using deep learning. Bioinformatics 2017, 33, 3387–3395.

Figure 1. The architecture of MultiSec to discover secreted proteins by sequence: (a) the balanced sampling generates balanced data for each fluid; (b) feature extraction uses four convolution-pooling operations and a fully-connected layer; (c) the multi-task classification contains 17 output layers to calculate the probabilities for 17 human body fluids.

Figure 2. Radar charts of comparative benchmarks on the independent testing datasets corresponding to 14 human body fluids. Higher is better: (a) radar chart of accuracy in 14 human body fluids; (b) radar chart of F1 score in 14 human body fluids; (c) radar chart of Matthews correlation coefficient in 14 human body fluids; (d) radar chart of area under the ROC curve in 14 human body fluids.

Table 1. The numbers of proteins and the ratio of positive and negative samples in each human body fluid of the SecretedP17 dataset.

Fluid Name	Notation	Positive	Negative	All	Ratio
Plasma/Serum	Plasma	6530	4856	11,386	0.74
Saliva	Saliva	2521	8048	10,569	3.19
Urine	Urine	6972	4760	11,732	0.68
Cerebrospinal fluid	CSF	4082	6281	10,363	1.54
Seminal fluid	Seminal	3929	7230	11,159	1.84
Amniotic fluid	Amniotic	2876	8725	11,601	3.03
Tear fluid	Tear	1843	9597	11,440	5.21
Bronchoalveolar lavage fluid	BALF	3241	7392	10,633	2.28
Milk fluid	Milk	2324	7333	9657	3.16
Synovial fluid	Synovial	1525	9624	11,149	6.31
Nipple aspirate fluid	NAF	1640	9800	11,440	5.98
Cervical-vaginal discharge	CVF	877	12,062	12,939	13.75
Pleural effusion	PE	1437	9087	10,524	6.32
Sputum	Sputum	1696	9515	11,211	5.61
Exhaled breath condensate	EBC	326	14,903	15,229	45.71
Pancreatic juice	PJ	646	12,957	13,603	20.06
Sweat	Sweat	1162	11,660	12,822	10.03

Table 2. Benchmarks of MultiSec on the independent testing datasets of 17 human body fluids.

Fluid Name	ACC	SN	SP	F1	MCC	AUC
Plasma/Serum	0.8682	0.8614	0.8599	0.8835	0.7323	0.9383
Saliva	0.9257	0.8373	0.9528	0.8456	0.7968	0.9546
Urine	0.8730	0.8630	0.9002	0.8911	0.7396	0.9462
Cerebrospinal fluid	0.8296	0.7806	0.8917	0.7792	0.6410	0.8867
Seminal fluid	0.8270	0.7924	0.8686	0.7503	0.6182	0.8976
Amniotic fluid	0.9254	0.8817	0.9444	0.8550	0.8058	0.9654
Tear fluid	0.9174	0.8614	0.9390	0.7670	0.7218	0.9493
Bronchoalveolar lavage fluid	0.8801	0.8256	0.9073	0.8087	0.7220	0.9400
Milk fluid	0.8979	0.7823	0.9093	0.7759	0.7119	0.9152
Synovial fluid	0.9049	0.8230	0.9101	0.6910	0.6411	0.9219
Nipple aspirate fluid	0.9209	0.8811	0.9352	0.7577	0.7189	0.9644
Cervical-vaginal discharge	0.9385	0.8629	0.9544	0.6638	0.6597	0.9715
Pleural effusion	0.8950	0.8014	0.9009	0.6697	0.6175	0.9068
Sputum	0.8863	0.7375	0.8833	0.6614	0.5981	0.9177
Exhaled breath condensate	0.9878	0.5538	0.9956	0.6542	0.6644	0.9276
Pancreatic juice	0.9577	0.8837	0.9606	0.6686	0.6731	0.9810
Sweat	0.8986	0.7802	0.9104	0.5724	0.5380	0.9340

Table 3. Comparative average benchmarks of MultiSec and other approaches on the independent testing datasets of 14 human body fluids.

Method	ACC	SN	SP	F1	MCC	AUC
DT	0.7998	0.4163	0.8783	0.4750	0.3430	0.7457
SVM	0.7521	0.7158	0.7677	0.5737	0.4260	0.8201
DeepSec	0.8159	0.7736	0.8219	0.6436	0.5331	0.8777
MultiSecRS	0.9120	0.6456	0.9564	0.7337	0.6828	0.9254
MultiSec	0.8893	0.8221	0.9081	0.7649	0.6859	0.9313

The best results are in bold. MultiSecRS denotes MultiSec with random sampling.

Table 4. Comparison of MultiSec and DeepSec on computational resources.

Method	Number of Body Fluids	Number of Networks	Number of Parameters (K)	Training Time (h)
DeepSec	14	56	68.37 × 56	33.16
MultiSec	17	1	40.32	1.11

Table 5. The numbers and proportions of potentially secreted proteins in 17 human body fluids.

Fluid Name	Number of PSP	Number of CP	Ratio of PSP
Plasma/Serum	5154	8691	0.59
Saliva	5083	9553	0.53
Urine	4590	8280	0.55
Cerebrospinal fluid	5356	9714	0.55
Seminal fluid	6742	9049	0.75
Amniotic fluid	4189	8607	0.49
Tear fluid	3774	8777	0.43
Bronchoalveolar lavage fluid	6039	9538	0.63
Milk fluid	5403	10,568	0.51
Synovial fluid	4106	9085	0.45
Nipple aspirate fluid	4224	8822	0.48
Cervical-vaginal discharge	1244	7339	0.17
Pleural effusion	4767	9744	0.49
Sputum	5120	9027	0.57
Exhaled breath condensate	1271	5092	0.25
Pancreatic juice	2301	6691	0.34
Sweat	3384	7451	0.45

PSP denotes potentially secreted proteins. CP denotes candidate proteins.

Table 6. Potentially universal secreted proteins in 17 human body fluids with an overall probability greater than 90%.

Id	Accession	Overall Probability ( $\hat{p}$ )
1	Q9H2R5	0.9909
2	Q96P15	0.9908
3	Q9Y5K2	0.9844
4	Q86WD7	0.9790
5	Q4G0T1	0.9673
6	Q96PF1	0.9622
7	P49863	0.9575
8	P12544	0.9568
9	P0DOX4	0.9537
10	P06315	0.9530
11	A0A0C4DH39	0.9496
12	A0A0G2JMI3	0.9463
13	P51124	0.9453
14	Q8IXH8	0.9392
15	P23946	0.9330
16	O95932	0.9280
17	Q7Z410	0.9224
18	Q9H114	0.9131
19	Q0Z7S8	0.9068
20	A0A0B4J1Z2	0.9021

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
