1. Introduction
Protamines are a group of basic proteins, rich in arginine, found in the nucleus of sperm cells in many animals, including fish and mammals. Their primary function is to compact sperm DNA during spermatogenesis, which helps protect and stabilize the DNA and facilitates its transport during fertilization [
1,
2,
3]. The study of protamines offers valuable insights into the regulation of gene expression, chromosome stability, and the mechanisms underlying male infertility [
2,
4,
5]. Furthermore, the application of protamines extends beyond reproductive biology. Their unique properties are utilized in biotechnology and medicine, particularly in formulating drug delivery complexes and as potential therapeutic agents for various diseases [
1,
6,
7]. Despite the importance of these proteins, their identification by conventional methods remains a challenge due to several factors. Protamines are predominantly studied in certain types of organisms, such as fish (e.g., salmon) and some mammals. However, not all organisms produce protamines, and in those that do, these proteins tend to be quite species-specific [
1,
8,
9]. Additionally, protamines exhibit high variability and complexity between different species and even among individuals of the same species [
1,
9,
10]. This variability complicates the annotation and confirmation of protamine sequences in omics databases.
Addressing these challenges, machine learning (ML) and deep learning (DL) offer transformative potential for enhancing protamine research. These technologies can automate the detection and classification of protein sequences, even from noisy or incomplete data, by learning from patterns identified in datasets [
11,
12,
13]. The use of ML-based approaches to develop predictive models for proteins and peptides in the field of reproduction has been relatively unexplored in recent years. However, several models following this approach have been developed, indicating significant potential impact in this field of study [
14,
15,
16,
17].
Considering all the above, ML algorithms could be trained with known protamine sequences to predict the presence of these proteins in new omics samples with high performance. Applying machine learning and deep learning in protamine research represents a promising avenue to overcome current challenges and expand our understanding of these essential proteins. By leveraging the power of these technologies, researchers can gain deeper insights into the biological roles of protamines, their evolutionary diversity, and their potential applications in biotechnology and medicine. Consequently, this study evaluates the proposal of a protamine predictive model based on machine learning and deep learning approaches.
3. Discussion
The arginine clusters form the DNA-binding domains, allowing DNA–protamine complexes to condense and stabilize the spermatid genome. Protamines replace histones during spermatid maturation, protecting DNA from degradation [
6]. Protamines are proteins of great importance for several reasons. They can protect DNA from degradation during sperm formation due to electrostatic interactions between DNA and protamine, which is positively charged. This is crucial for maintaining genetic stability in reproductive cells [
18,
19]. They are widely used in medicine as adjuvants in insulin formulations, extending their duration of action by forming complexes with insulin through electrostatic interactions [
20]. Moreover, protamine nanoparticles have been shown to have outstanding immunomodulatory properties, making them promising components in new vaccine technologies, especially in RNA delivery systems for vaccines against infectious diseases and in cancer treatment [
21,
22,
23]. However, due to the unique characteristics of protamines, extracting and analyzing them is more complex compared with other chromatin-associated proteins, for which there are numerous detailed protocols [
24]. This complexity could be one of the main reasons why few protamines have been sequenced using mass spectrometry techniques to determine their primary sequence.
The SMOTE technique has been reported in several works as an effective method for dealing with balanced data derived from the computation of molecular descriptors from primary sequences of proteins and peptides [
25,
26,
27,
28,
29,
30,
31]. On the other hand, the use of GANs for data augmentation is a more recent approach, which has proven to be very robust in generating high-quality synthetic data [
32,
33,
34,
35,
36]. In this study, both methods enabled the development of models with high quality according to performance metrics, which were greater than 0.9 in all cases. However, the models obtained with the GAN-based approach showed an improvement, although not a significant one, in the evaluated performance metrics. These results demonstrate and align with previous studies on the robustness of applying both data augmentation techniques to tabular data derived from the computation of molecular descriptors [
31,
32,
37,
38].
The integration of GANs into PROTA, despite similar performance metrics with SMOTE, was based on several key considerations. GANs showed subtle performance improvements during cross-validation, with XGBOOST achieving marginally higher accuracy (0.997) compared with the best performance of SMOTE (0.996 with LIGHTGBM). While not statistically significant in our current dataset, this suggests potential for enhanced performance with larger, more diverse datasets. The ability of GANs to capture and generate complex, non-linear patterns aligns well with the intricate nature of protamine sequences, potentially capturing a wider range of variations compared with the interpolation approach of SMOTE [
39,
40,
41]. Additionally, GANs offer superior scalability and adaptability as deep learning models, making PROTA more robust to potential increases in dataset size and complexity. This positions PROTA at the forefront of bioinformatics advancements, facilitating future improvements in protein sequence analysis.
ML-based methods offer several significant advantages over traditional sequence alignment methods based on the principle of homology. Firstly, ML algorithms can handle large volumes of data and detect complex patterns that are not evident with traditional alignment techniques. Traditional methods heavily rely on the similarity of known sequences, which can limit their effectiveness in identifying less conserved protein sequences or detecting new variants [
42,
43,
44]. Consequently, ML-based methods are particularly relevant in the context of proteins such as protamines, which exhibit high variability and complexity across species. As an example of the application of PROTA, in this study, we conducted a large-scale analysis of unannotated sequences in the UniProt database for the first time [
45]. The number of sequences analyzed was 3512, which are currently available without annotation in this database due to the deficiencies of traditional sequence alignment methods and the intrinsic characteristics of these proteins regarding their high variability and low homology mentioned earlier. From this total number of amino acid sequences, we identified 591 protamines using our best model incorporated in PROTA, which are available at
https://github.com/jfbldevs/BioChemIntelli_datasets accessed on 2 August 2024, with binary labels of one for protamines and zero for non-protamines.
The protamines identified by PROTA exhibit a marked predominance of arginine residues, with a frequency approximately three-fold higher than any other amino acid (
Figure 3). This arginine-rich composition is consistent with the canonical structure of protamines and their function in DNA condensation [
1,
9,
46,
47]. The high arginine content facilitates the electrostatic neutralization of DNA phosphate groups, enabling chromatin hypercondensation in spermatozoa [
48,
49,
50].
The consistent arginine enrichment across the identified sequences validates the specificity of PROTA in protamine detection. This characteristic amino acid profile serves as a robust molecular signature, distinguishing protamines from other nuclear proteins. The conservation of this feature across diverse taxa in the sequences identified by PROTA suggests the efficacy of the tool in capturing evolutionary variants of protamines. The pronounced arginine bias in sequences identified by PROTA corroborates the accuracy and specificity of the tool in protamine detection. This finding underscores the utility of PROTA as a computational resource for protamine identification and analysis in complex sequence datasets.
Furthermore, the identification of protamines and protamine-like proteins by PROTA has implications beyond chromatin condensation. These proteins play a crucial role in environmental toxicology and reproductive health, as semen bioaccumulates pollutants that can alter sperm quality. Pollutant-induced conformational changes in protamines can affect DNA binding and chromatin structure across various species [
51]. The ability of PROTA to accurately identify both protamines and protamine-like proteins positions it as a valuable tool for investigating these environmental interactions, potentially contributing to the development of biomarkers for pollution and reproductive disorders.
While PROTA demonstrates robust performance in protamine prediction, there are several avenues for the future enhancement and expansion of its capabilities. First, although our dataset is comprehensive, continuously updating it with newly discovered protamines from diverse species could further improve the tool accuracy and broaden its applicability across different taxonomic groups. Second, while our data augmentation techniques (SMOTE and GANs) have proven effective, exploring their impact on the biological relevance of synthetic sequences presents an interesting area for future research. This could potentially lead to even more sophisticated data augmentation strategies tailored specifically for proteomic data. These considerations not only highlight the current strengths of PROTA but also underscore its potential for growth and refinement, paving the way for future advancements in protamine research and computational biology.
PROTA represents a significant step forward in protamine research, offering a powerful tool for researchers in reproductive biology, evolution, and biotechnology. By leveraging the strengths of machine learning and deep learning, we have created a robust method for protamine prediction that overcomes many of the limitations of traditional approaches. This work not only enhances our ability to identify and study these crucial proteins but also opens new avenues for research in reproductive biology and beyond.